OpenGame: Open Agentic Coding for Games
Paper: arXiv:2604.18394
Code: leigest519/OpenGame
Code reference: main@c54307ef (2026-04-22)
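1. Motivation
Existing code LLMs and code agents can write local functions, boilerplate, or single-file demos, but end-to-end web game generation involves the update loop, physics, event handling, the asset pipeline, scene wiring, configuration, and cross-file state all at once. The paper reports that frontier models repeatedly show three kinds of failure on complete game tasks: loss of global state that makes the logic incoherent, an engine-specific gap caused by misusing the abstractions of engines such as Phaser, and cross-file inconsistency among asset keys, scenes, config, and initialization order.
The paper's goal is not "translate a prompt into a snippet of game code" but "automatically produce a buildable, runnable, interactive 2D web game project from a natural-language design specification." The problem is worth studying because games are closer to real complex applications than static programming puzzles: even when the code compiles, the visuals, interaction, win/lose states, and user-specified mechanics can still fail, so the task calls for agentic software engineering oriented toward execution and playability.
Figure 1 reading: the figure shows the task format OpenGame targets. The user supplies only a natural-language game idea, and the system must automatically generate playable 2D games of different types, connecting code, visual assets, audio assets, and the full game lifecycle. It highlights how this work differs from ordinary code generation benchmarks: the output is not a single function but a complete interactive project.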
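2. Idea
Core insight: the main failures in end-to-end game generation are not isolated syntax errors but structural drift and cross-file inconsistency over a long-horizon task. OpenGame therefore turns "reusable structural priors" and "accumulable debugging knowledge" into explicit agent skills instead of relying only on a larger general-purpose model.
The key innovations come in three layers. First, Game Skill consists of Template Skill and Debug Skill: the former accumulates reusable project skeletons, the latter accumulates verified repair protocols. Second, GameCoder-27B acquires game-engine and game-logic priors through CPT, SFT, and execution-grounded RL. Third, OpenGame-Bench uses headless browser execution plus VLM judging to evaluate Build Health, Visual Usability, and Intent Alignment.
Compared with general agentic coding frameworks such as qwen-code and Cursor, the fundamental difference is that OpenGame applies game-structure constraints up front: it first selects a template family with Physics-First Classification and then overrides hooks through the Template Method Pattern, rather than letting the model freely generate every file from an empty project.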
3. Method
3.1 Overall framework
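Figure 2 reading: OpenGame has three coupled modules. On the left is the three-stage training pipeline of GameCoder-27B: CPT builds Phaser / JavaScript / TypeScript game priors, SFT aligns the model with complex game design instructions, and execution-grounded RL reinforces executable logic. In the middle is the six-stage autonomous agent workflow, which runs from classification, scaffold, GDD, asset synthesis, and code implementation to verification. On the right is agent evolution: Template Skill extends the template library, and Debug Skill extends the living debug protocol.
Intuitively, OpenGame breaks long-horizon generation into structured interfaces: classification fixes the physics and perspective, the template fixes the lifecycle skeleton, the GDD fixes which mechanics are implementable, the asset registry fixes the resource keys, and the debug protocol fixes the repair path. The part the agent must generate freely is thereby compressed into a limited set of hooks and configuration fields, which makes cross-file consistency easier to check and repair.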
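To make the hook idea concrete, here is a minimal Python sketch of the Template Method pattern. BaseLevelScene and the collision hook echo names from the released TypeScript templates, but this Python rendering, the lifecycle step names, and the LavaLevel subclass are illustrative assumptions rather than the repository's code.

```python
class BaseLevelScene:
    """Template: owns the lifecycle order; subclasses only fill in hooks."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def create(self) -> None:
        # Fixed skeleton: the call order is not up to the subclass.
        self.log.append("load_assets")
        self.log.append("create_player")
        self.setup_custom_collisions()  # extension point
        self.log.append("start_update_loop")

    def setup_custom_collisions(self) -> None:
        """Hook with a safe default: no extra collisions."""


class LavaLevel(BaseLevelScene):
    """Hypothetical generated level: overrides one hook and nothing else."""

    def setup_custom_collisions(self) -> None:
        self.log.append("collide(player, lava) -> lose_life")


scene = LavaLevel()
scene.create()
```

Because the subclass never touches create(), it cannot reorder initialization, which is the kind of lifecycle error the hook-driven design is meant to rule out.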
3.2 GameCoder-27B training pipeline
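GameCoder-27B is based on Qwen3.5-27B. The paper describes three training stages but does not release launch configuration such as training scripts, GPUs, learning rate, batch size, or number of training steps:
- CPT: builds a pretraining corpus from Phaser and JavaScript/TypeScript game repositories on GitHub, official documentation, and community tutorials, so that the model learns the game loop, physics, asset usage, and state management.
- SFT: uses gpt-codex5.1 to generate complex multi-step game design prompts and minimax2.5 to produce target solutions, so that the model learns to go from an abstract idea to a concrete code structure.
- RL: does not generate a whole game directly. It generates single-file gameplay logic / functional modules at the component level and derives the reward from unit-test execution success and the aggregate test pass rate.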
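Conceptually, with the weights $\lambda_{\text{exec}}$ and $\lambda_{\text{test}}$ introduced here only as notation for this note, the reward for a generated module $y$ can be written as:
$$R(y) = \lambda_{\text{exec}} \cdot \mathbb{1}[\,y \text{ executes successfully}\,] + \lambda_{\text{test}} \cdot \text{PassRate}(y)$$
The paper does not give exact values for these weights, so this note does not invent values for $\lambda$, batch size, or optimizer settings.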
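A minimal sketch of such a component-level reward, assuming it is a weighted sum of execution success and aggregate unit-test pass rate. The UnitTestReport type, the function name, and the 0.5/0.5 default weights are hypothetical placeholders; the paper does not publish the weights.

```python
from dataclasses import dataclass


@dataclass
class UnitTestReport:
    """Outcome of running the unit tests for one generated module."""
    executed: bool  # did the module load and run without crashing?
    passed: int     # number of unit tests that passed
    total: int      # number of unit tests that were run


def component_reward(report: UnitTestReport,
                     w_exec: float = 0.5,
                     w_test: float = 0.5) -> float:
    """Weighted sum of execution success and test pass rate.

    The default weights are placeholders, not values from the paper.
    """
    exec_ok = 1.0 if report.executed else 0.0
    if report.executed and report.total > 0:
        pass_rate = report.passed / report.total
    else:
        pass_rate = 0.0  # a module that never ran passes no tests
    return w_exec * exec_ok + w_test * pass_rate
```

With the placeholder weights, a module that runs and passes 3 of 4 tests scores 0.875, and a module that fails to execute scores 0.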
3.3 Autonomous agent workflow
The OpenGame agent workflow has six stages: initialization/classification, scaffolding, design generation, asset synthesis, code implementation, and verification. The key mechanisms are:
- Physics-First Classification: classify-game-type maps a request to platformer, top_down, grid_logic, tower_defense, or ui_heavy according to physical constraints and perspective rather than genre name.
- Scaffolding + GDD: the agent copies the shared core, the modules/{archetype} codebase, and the matching docs; generate-gdd reads the archetype-specific API constraints and produces a technical GDD with a file-level roadmap.
- Multimodal Asset Synthesis: generate-game-assets generates backgrounds, character animations, items, and audio from the GDD asset registry; generate-tilemap converts an ASCII layout into Phaser Tilemap JSON; the agent then reads asset-pack.json to pin down the texture keys.
- Three-Layer Reading + Template Method Pattern: before implementing, the agent reads the API summary, the target source file, and the implementation guide, in that order. Code is not written as a complete scene from scratch; it overrides extension points such as setupCustomCollisions() in _Template*.ts / base class hooks.
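The ASCII-to-tilemap step in the list above can be illustrated with a short sketch. It assumes a rectangular layout and a character-to-tile-index legend, and it emits a dictionary loosely modeled on the Tiled-style JSON layout that Phaser can load. The function name, the legend, the layer name, and the exact set of fields are assumptions; the schema that generate-tilemap actually writes is not reproduced here.

```python
def ascii_to_tilemap(layout: str, legend: dict[str, int],
                     tile_size: int = 32) -> dict:
    """Convert a rectangular ASCII layout into a single-layer tilemap dict."""
    rows = [line.strip() for line in layout.strip().splitlines()]
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same width")
    # Row-major flattening; an unknown character raises KeyError.
    data = [legend[ch] for row in rows for ch in row]
    return {
        "width": width,
        "height": len(rows),
        "tilewidth": tile_size,
        "tileheight": tile_size,
        "orientation": "orthogonal",
        "layers": [{
            "name": "ground",  # hypothetical layer name
            "type": "tilelayer",
            "width": width,
            "height": len(rows),
            "data": data,
        }],
    }


# Hypothetical legend: '.' is empty, 'P' marks the spawn (no tile), '#' is solid.
tilemap = ascii_to_tilemap("....\n.P..\n####", {".": 0, "P": 0, "#": 1})
```

Keeping the layout as text means the agent can write and inspect level geometry in the GDD before any JSON exists.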
3.4 Game Skill: Template Skill + Debug Skill
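Problem setting: given a user specification, a meta template, a template library, and a debug protocol, the system must generate a buildable, runnable project.
Template Skill starts from a game-agnostic skeleton. After each completed task it extracts stable, general, and safe code fragments and merges them into the template library. The paper observes that the library eventually settles into five template families: gravity-based side view, top-down continuous motion, discrete grid logic, path-and-wave dynamics, and UI-driven gameplay.
Debug Skill maintains a living protocol. After every build/test/runtime failure it records a triple (error signature, root cause, verified fix); high-frequency patterns become reusable rules, and novel failures extend the protocol. Debugging knowledge therefore accumulates in the protocol instead of being re-derived from the prompt each time.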
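One way to picture the living protocol is as a small store keyed by error signature. The sketch below is an assumption-laden illustration: the class names, the hit counter, and the promote_after threshold are hypothetical and are not taken from the paper or from debug-skill's source.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FixRecord:
    root_cause: str
    verified_fix: str
    hits: int = 0  # how many times this fix has been confirmed


@dataclass
class DebugProtocol:
    promote_after: int = 3  # hypothetical threshold for "high-frequency"
    records: dict[str, FixRecord] = field(default_factory=dict)

    def lookup(self, signature: str) -> Optional[FixRecord]:
        """Return the known fix for an error signature, if any."""
        return self.records.get(signature)

    def record(self, signature: str, root_cause: str, verified_fix: str) -> None:
        """Store a novel failure, or count another confirmation of a known one."""
        entry = self.records.get(signature)
        if entry is None:
            self.records[signature] = FixRecord(root_cause, verified_fix, hits=1)
        else:
            entry.hits += 1

    def reusable_rules(self) -> list[str]:
        """Signatures confirmed often enough to be treated as standing rules."""
        return [sig for sig, rec in self.records.items()
                if rec.hits >= self.promote_after]
```

A repair step would call lookup() first and fall back to fresh diagnosis only when it returns None, then call record() once the fix is verified.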
3.5 Pseudocode grounded in released code
    def physics_first_classify(game_description: str, llm) -> dict:
        """Grounded in packages/core/src/tools/game-type-classifier.ts."""
        system_prompt = build_physics_first_prompt()
        raw = llm.complete(
            system=system_prompt,
            user={"game_description": game_description},
            temperature=0.3,
            timeout_ms=15000,
        )
        parsed = parse_json_or_fallback(raw)
        archetype = parsed.get("archetype", "platformer")
        # Unknown labels fall back to a heuristic scan, then to "platformer".
        if archetype not in {"platformer", "top_down", "grid_logic", "tower_defense", "ui_heavy"}:
            archetype = heuristic_extract_archetype(raw) or "platformer"
        return {
            "archetype": archetype,
            "perspective": parsed.get("perspective"),
            "movement_type": "grid" if archetype == "grid_logic" else "continuous",
            "reasoning": parsed.get("reasoning", "physics-first fallback"),
        }


    def scaffold_and_generate_gdd(workspace, user_requirement: str, archetype: str, llm) -> str:
        """Grounded in packages/core/src/tools/generate-gdd.ts and agent-test/prompts/custom.md."""
        # Scaffold: shared core first, then the archetype module and its docs.
        copy_tree("agent-test/templates/core", workspace / "src")
        copy_tree(f"agent-test/templates/modules/{archetype}/src", workspace / "src")
        copy_tree(f"agent-test/docs/modules/{archetype}", workspace / f"docs/modules/{archetype}")
        # The GDD prompt is conditioned on archetype-specific API constraints.
        core_rules = read_text(workspace / "docs/gdd/core.md")
        design_rules = read_text(workspace / f"docs/modules/{archetype}/design_rules.md")
        template_api = read_text(workspace / f"docs/modules/{archetype}/template_api.md")
        prompt = build_gdd_prompt(user_requirement, archetype, core_rules, design_rules, template_api)
        gdd = llm.complete(prompt=prompt, temperature=0.5, timeout_ms=60000)
        write_text(workspace / "GAME_DESIGN.md", gdd)
        return gdd


    def implement_with_template_method(workspace, roadmap: list[str]) -> None:
        """Grounded in agent-test/prompts/custom.md and BaseLevelScene.ts."""
        config = read_json(workspace / "src/gameConfig.json")
        config.update(extract_config_fields(roadmap))
        write_json(workspace / "src/gameConfig.json", config)
        # Three-layer reading: API summary, then target source, then guide.
        api_summary = read_text(workspace / "docs/modules/platformer/template_api.md")
        for target_file in select_target_files(roadmap):
            source = read_text(workspace / target_file)
            guide = read_text(workspace_guide_for(target_file))
            patch = generate_patch(api_summary, source, guide, roadmap)
            # Template Method constraint: only existing hooks may be overridden.
            assert patch.only_overrides_existing_hooks()
            apply_patch(workspace / target_file, patch)


    def debug_skill_loop(project_dir, protocol, max_iterations: int = 5) -> bool:
        """Grounded in agent-test/debug-skill/src/debug-loop.ts, validator.ts, repairer.ts."""
        for _ in range(max_iterations):
            # Pre-execution consistency checks run before any build.
            violations = validate_project(project_dir, protocol)
            if violations:
                repair_with_protocol(project_dir, violations[0], protocol)
                continue
            build = run_stage(project_dir, "build")
            if not build.success:
                diagnosis = diagnose_errors(build.errors, protocol)
                repair = apply_repair(diagnosis, build.errors[0], project_dir)
                record_outcome(protocol, build.errors[0], diagnosis, repair)
                continue
            test = run_stage(project_dir, "test")
            if not test.success:
                diagnosis = diagnose_errors(test.errors, protocol)
                repair = apply_repair(diagnosis, test.errors[0], project_dir)
                record_outcome(protocol, test.errors[0], diagnosis, repair)
                continue
            return True  # validation, build, and test all passed
        return False  # iteration budget exhausted


    def evolve_template_skill(completed_project, library):
        """Grounded in agent-test/template-skill/src/evolve.ts and library-manager.ts."""
        snapshot = collect_project_snapshot(completed_project)
        classification = classify_project_by_physics(snapshot, library)
        reusable_patterns = extract_patterns(snapshot, classification)
        abstracted_templates = abstract_into_reusable_templates(reusable_patterns)
        library = merge_into_library(library, abstracted_templates, reusable_patterns)
        save_library_manifest_and_family_files(library)
        return library

Code reference: main@c54307ef (2026-04-22). The pseudocode and the mapping are based on this commit.
| Paper Concept | Source File | Key Class/Function |
|---|---|---|
| Physics-First Classification | packages/core/src/tools/game-type-classifier.ts | GameTypeClassifierTool, GameArchetype |
| GDD generation with archetype docs | packages/core/src/tools/generate-gdd.ts | GenerateGDDTool, GenerateGDDInvocation |
| Multimodal asset generation | packages/core/src/tools/generate-assets.ts | GenerateAssetsTool, asset-pack.json output |
| ASCII-to-Phaser tilemap | packages/core/src/tools/generate-tilemap.ts | GenerateTilemapTool |
| Six-phase workflow prompt | agent-test/prompts/custom.md | ordered calls to classify-game-type, scaffold, generate-gdd, asset tools, verification |
| Template Method hooks | agent-test/templates/modules/platformer/src/scenes/BaseLevelScene.ts | BaseLevelScene, setupCustomCollisions() hook, abstract scene methods |
| Template Skill evolution | agent-test/template-skill/src/evolve.ts | evolveFromProject(), mergeIntoLibrary() |
| Debug Skill loop | agent-test/debug-skill/src/debug-loop.ts | build/test/diagnose/repair/record loop |
| Pre-execution consistency checks | agent-test/debug-skill/src/validator.ts | asset key, gameConfig.json, scene registration validators |
| Repair protocol application | agent-test/debug-skill/src/repairer.ts | applyRepair(), known fix vs LLM-generated fix |
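As an illustration of the kind of pre-execution consistency check listed in the table, the sketch below flags texture keys used in scene source that are missing from the registered asset keys. It assumes keys appear as the third argument of Phaser's this.add.image(x, y, key) or this.add.sprite(x, y, key) calls; the function name and the regex are hypothetical, and the actual checks in validator.ts are not reproduced here.

```python
import re

# Matches .add.image(x, y, 'key') and .add.sprite(x, y, "key").
_KEY_USE = re.compile(
    r"\.add\.(?:image|sprite)\(\s*[^,]+,\s*[^,]+,\s*['\"]([^'\"]+)['\"]"
)


def find_missing_asset_keys(source: str, registered: set[str]) -> list[str]:
    """Return texture keys used in `source` that are not registered."""
    used = set(_KEY_USE.findall(source))
    return sorted(used - registered)


scene_source = """
this.add.sprite(100, 200, 'hero');
this.add.image(0, 0, "bg_forest");
"""
missing = find_missing_asset_keys(scene_source, {"hero"})
```

A check like this can run before the build stage, so a mismatched key is repaired without paying for a full build and browser run.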
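Differences between the paper and the released code: the paper reports the CPT/SFT/RL training pipeline of GameCoder-27B and the dynamic evaluation pipeline of OpenGame-Bench, but main@c54307ef contains only README/docs-level descriptions of them. The GameCoder training scripts, training hyperparameters, OpenGame-Bench evaluator source, and the experiment configuration needed to reproduce the paper's tables are not released. The released code mainly covers the OpenGame agent runtime, Game Skill, the template/debug protocol, and the tool implementations. The training and benchmark numbers in these notes therefore come from the paper, not from a repo launch config.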
4. Experimental Setup
- Benchmark: OpenGame-Bench, 150 tasks from 150 unique natural-language prompts, covering platformers, top-down shooters, puzzle games, arcade classics, and strategy. Each prompt is the only input, with no reference implementation or starter code. Tasks come from curated public game-jam repositories and AI-assisted design briefs, and each was manually confirmed to be implementable in a 2D web framework.
- Evaluation protocol: every generated project must build, run behind a local HTTP server without a fatal runtime error, and produce at least one non-empty screenshot during automated play. Each task is evaluated with 3 random seeds, and mean scores are reported.
- Metrics: Build Health (BH) measures compilation/runtime/rendering stability. Visual Usability (VU) combines pixel-level heuristics (frame entropy, motion detection) with a VLM judge. Intent Alignment (IA) is the weighted pass rate that a VLM judge assigns over the prompt's requirements. All three are scaled to a common range; the result tables report them on a 0 to 100 scale.
- Baselines: direct LLMs include Qwen-3.5-Max, MiniMax m2.5, GLM-4.5, Kimi K2.5, DeepSeek V3.2, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro. Agentic frameworks include qwen-code (multiple backends) and Cursor (Kimi K2.5 / Claude Sonnet 4.6). Every baseline prompt explicitly requires Phaser 3, to keep the output from degenerating into single-file vanilla JS.
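The IA definition and the three-seed averaging above can be sketched in a few lines. The requirement weights, the (weight, passed) representation, and the function names are assumptions for illustration; the benchmark's evaluator source is not released.

```python
from statistics import mean


def intent_alignment(requirements: list[tuple[float, bool]]) -> float:
    """Weighted pass rate on a 0-100 scale; each item is (weight, passed)."""
    total = sum(weight for weight, _ in requirements)
    if total == 0:
        return 0.0
    passed = sum(weight for weight, ok in requirements if ok)
    return 100.0 * passed / total


def task_score(per_seed_scores: list[float]) -> float:
    """Mean of one metric over the random seeds used for a task."""
    return mean(per_seed_scores)


# Hypothetical judge output: a core mechanic (weight 2) passes,
# one secondary requirement fails, another passes.
ia = intent_alignment([(2.0, True), (1.0, False), (1.0, True)])
```

Weighting lets a failed core mechanic cost more than a failed cosmetic detail, which a plain pass count would not capture.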
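Traceability of the training configuration: the paper states that the GameCoder-27B backbone is Qwen3.5-27B and that training consists of CPT/SFT/RL, but it does not report GPU type/count, training steps, learning rate, or batch size. The released repo at main@c54307ef has no corresponding training launch script either, so the README or base configuration cannot be treated as the source of the experimental training configuration.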
5. Experimental Results
5.1 Main results on OpenGame-Bench
| Category | System / Model | BH | VU | IA |
|---|---|---|---|---|
| Direct LLM | Qwen-3.5-Max | 51.8 | 35.5 | 38.9 |
| Direct LLM | MiniMax m2.5 | 39.7 | 39.3 | 31.8 |
| Direct LLM | GLM-4.5 | 46.5 | 45.0 | 31.2 |
| Direct LLM | Kimi K2.5 | 45.6 | 46.8 | 44.6 |
| Direct LLM | DeepSeek V3.2 | 57.0 | 38.9 | 33.5 |
| Direct LLM | Claude Sonnet 4.6 | 58.5 | 50.8 | 50.3 |
| Direct LLM | GPT-5.1 | 57.4 | 52.9 | 49.4 |
| Direct LLM | Gemini 3.1 Pro | 53.6 | 60.2 | 42.1 |
| Agentic | qwen-code + Qwen-3.5-Max | 57.7 | 41.3 | 40.2 |
| Agentic | qwen-code + MiniMax m2.5 | 48.1 | 39.1 | 34.6 |
| Agentic | qwen-code + Kimi K2.5 | 59.6 | 52.1 | 49.9 |
| Agentic | qwen-code + Claude Sonnet 4.6 | 63.2 | 54.3 | 57.8 |
| Agentic | Cursor + Kimi K2.5 | 57.1 | 55.2 | 54.2 |
| Agentic | Cursor + Claude Sonnet 4.6 | 66.8 | 61.4 | 58.9 |
| Ours | OpenGame + Qwen-3.5-27B | 62.8 | 53.8 | 49.8 |
| Ours | OpenGame + GameCoder-27B | 63.9 | 57.0 | 54.1 |
| Ours | OpenGame + Claude Sonnet 4.6 | 72.4 | 67.2 | 65.1 |
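OpenGame + Claude Sonnet 4.6 reaches BH=72.4, VU=67.2, IA=65.1, improving on the strongest baseline, Cursor + Claude Sonnet 4.6, by +5.6, +5.8, and +6.2 respectively. OpenGame + GameCoder-27B reaches 63.9/57.0/54.1, beating every direct LLM baseline on BH and IA (but not on VU, where Gemini 3.1 Pro scores 60.2). Against qwen-code + Claude Sonnet 4.6 it is higher by +0.7 BH and +2.7 VU but lower by 3.7 IA.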
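The deltas quoted above follow directly from the table rows. A short check, with the scores copied from the main results table (the dictionary keys and the delta helper are just illustrative names):

```python
# (BH, VU, IA) copied from the main results table.
scores = {
    "OpenGame + Claude Sonnet 4.6": (72.4, 67.2, 65.1),
    "Cursor + Claude Sonnet 4.6": (66.8, 61.4, 58.9),
    "OpenGame + GameCoder-27B": (63.9, 57.0, 54.1),
    "qwen-code + Claude Sonnet 4.6": (63.2, 54.3, 57.8),
}


def delta(system: str, baseline: str) -> tuple[float, ...]:
    """Per-metric difference, rounded to the table's one decimal place."""
    return tuple(round(a - b, 1)
                 for a, b in zip(scores[system], scores[baseline]))


best_gap = delta("OpenGame + Claude Sonnet 4.6", "Cursor + Claude Sonnet 4.6")
small_model_gap = delta("OpenGame + GameCoder-27B", "qwen-code + Claude Sonnet 4.6")
```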
5.2 Ablation: GameCoder-27B training pipeline
| Stage | Training Components | BH | VU | IA |
|---|---|---|---|---|
| Base Model | Qwen-3.5-27B in OpenGame | 62.8 | 53.8 | 49.8 |
| Stage 1 | + CPT | 63.2 | 54.7 | 50.6 |
| Stage 2 | + CPT + SFT | 63.5 | 55.7 | 52.5 |
| Stage 3 | + CPT + SFT + RL | 63.9 | 57.0 | 54.1 |
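SFT gives the largest IA increment (+1.9, versus +0.8 for CPT and +1.6 for RL), which suggests that high-quality synthetic QA is key to aligning with creative game specifications. RL further improves VU and IA, but the headline gain still comes mainly from the agent framework.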
5.3 Ablation: agent workflow mechanisms
| Agent Configuration | BH | VU | IA |
|---|---|---|---|
| OpenGame Full Workflow | 72.4 | 67.2 | 65.1 |
| w/o Hook-Driven Implementation | 62.3 | 57.6 | 53.5 |
| w/o Three-Layer Reading | 67.8 | 61.9 | 56.5 |
| w/o Physics-First Classification | 70.2 | 64.6 | 61.6 |
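Hook-Driven Implementation matters most: removing it lowers BH by 10.1 and IA by 11.6 and frequently leads to lifecycle management errors. Removing Three-Layer Reading lowers IA by 8.6, which shows that progressive salience control is still needed in multi-file synthesis.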
5.4 Ablation: Agent Evolution / Game Skills
| Template Architecture | Debugging Strategy | BH | VU | IA |
|---|---|---|---|---|
| Static Skeleton | Static Rule Checklist | 60.5 | 54.8 | 51.2 |
| Static Skeleton | Full Living Protocol | 65.4 | 59.2 | 56.3 |
| Partial Evolved Library (2 Families) | Static Rule Checklist | 63.1 | 57.3 | 53.8 |
| Full Evolved Library (5 Families) | Static Rule Checklist | 66.3 | 60.7 | 57.9 |
| Full Evolved Library (5 Families) | Post-Execution Fixes Only | 69.5 | 63.8 | 61.4 |
| Full Evolved Library (5 Families) | Full Living Protocol | 72.4 | 67.2 | 65.1 |
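The full evolved library combined with the living protocol delivers gains from both scaffolding and debugging; relying on the static skeleton alone caps generation quality at BH=60.5 and IA=51.2.
Figure 3 reading: the horizontal axis is the maximum number of automated debugging iterations, and the vertical axis is BH/VU/IA. At the low end of the axis BH is only 58.4, which shows how fragile single-shot generation of a complex multi-file Phaser project is. The gain is steepest over the first increases in the iteration budget and then gradually plateaus, ending at the full system's 72.4/67.2/65.1.
Figure 4 reading: OpenGame's advantage is uneven across game types. IA is 76.8 for Platformers, 71.4 for Top-Down Shooters, 66.5 for Arcade classics, 58.2 for Strategy, and 52.6 for Puzzle/UI. Games with strong physical and spatial constraints fit the template families better; abstract strategy or UI logic is more prone to silent state desynchronization.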
5.5 Limitations and conclusions
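The authors do not list limitations separately, but the results expose two boundaries. First, even the full OpenGame system leaves about 34.9% of the weighted mechanical requirements partly or entirely unmet (the complement of IA=65.1). Second, IA for Strategy and Puzzle/UI is clearly lower than for Platformers and Top-Down Shooters, which indicates that mechanics with weak visual coupling, such as abstract logic state, inventory, and match-three rules, are harder for the current templates and debug protocol to cover reliably.
Overall conclusion: reliable game generation needs more than a stronger code model. It also needs persistent structural priors, reusable debug knowledge, and an evaluation protocol grounded in real execution. OpenGame combines these mechanisms into an agentic coding system for games and outperforms direct LLMs and general agentic coding baselines on OpenGame-Bench.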