OpenGame: Open Agentic Coding for Games

Paper: arXiv:2604.18394
Code: leigest519/OpenGame
Code reference: main @ c54307ef (2026-04-22)

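1. Motivation

Existing Code LLMs and code agents can write local functions, boilerplate, or single-file demos, but end-to-end web game generation simultaneously involves the update loop, physics, event handling, the asset pipeline, scene wiring, configuration, and cross-file state. The paper reports that frontier models repeatedly fail on complete game tasks in three ways: loss of global state that makes the logic incoherent, misuse of engine abstractions such as Phaser's (an engine-specific gap), and cross-file inconsistency among asset keys, scenes, config, and initialization order.

The problem the paper targets is not "translate a prompt into a piece of game code" but "automatically produce a buildable, runnable, interactive 2D web game project from a natural-language design specification". It is worth studying because games are closer to real complex applications than static programming problems: even when the code compiles, the visuals, interaction, win/lose states, and user-specified mechanics can still fail, so the task calls for agentic software engineering oriented toward execution and playability.

Figure 1 reading: the figure shows the shape of OpenGame's target task. The user supplies only a natural-language game idea, and the system must automatically generate playable 2D games of different types, connecting code, visual assets, audio assets, and the complete game lifecycle. It highlights how this work differs from ordinary code generation benchmarks: the output is a complete interactive project, not a single function.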

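2. Idea

Core insight: the main failures in end-to-end game generation are not isolated syntax errors but structural drift and cross-file inconsistency over a long-horizon task. OpenGame therefore turns "reusable structural priors" and "accumulable debugging knowledge" into explicit agent skills instead of relying only on a larger general-purpose model.

The key innovations come in three layers. First, Game Skill consists of Template Skill and Debug Skill; the former accumulates reusable project skeletons and the latter accumulates verified repair protocols. Second, GameCoder-27B acquires game-engine and game-logic priors through CPT, SFT, and execution-grounded RL. Third, OpenGame-Bench evaluates Build Health, Visual Usability, and Intent Alignment using headless browser execution plus VLM judging.

Compared with general agentic coding frameworks such as qwen-code and Cursor, OpenGame's fundamental difference is that it front-loads the structural constraints of a game: it first uses Physics-First Classification to choose a template family, then overrides hooks via the Template Method Pattern, instead of letting the model freely generate every file from a blank project.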

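3. Method

3.1 Overall framework

Figure 2 reading: OpenGame has three coupled modules. On the left is the three-stage training pipeline of GameCoder-27B: CPT builds Phaser / JavaScript / TypeScript game priors, SFT aligns the model with complex game design instructions, and execution-grounded RL reinforces executable logic. In the middle is the six-phase autonomous agent workflow, running from classification, scaffold, GDD, asset synthesis, and code implementation to verification. On the right is agent evolution: Template Skill extends the template library, and Debug Skill extends the living debug protocol.

Intuitively, OpenGame decomposes "long-horizon generation" into structured interfaces: classification fixes the physics and perspective, the template fixes the lifecycle skeleton, the GDD fixes which mechanics are implementable, the asset registry fixes the resource keys, and the debug protocol fixes the repair path. The part the agent must generate freely is thereby compressed into a limited set of hooks and configuration fields, which makes cross-file consistency easier to check and repair.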
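The "limited set of hooks" idea can be illustrated with a minimal Template Method sketch. The class below is a hypothetical Python stand-in, not the repository's TypeScript `BaseLevelScene`; the hook name `create_player` and the logged step names are assumptions made only for this example, while `setup_custom_collisions` mirrors the `setupCustomCollisions()` hook named in these notes.

```python
from abc import ABC, abstractmethod


class BaseLevelScene(ABC):
    """Hypothetical Python stand-in for a template base scene (not the repo's class)."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def create(self) -> None:
        # The lifecycle order is owned by the template; subclasses do not override it.
        self.log.append("load_assets")
        self.create_player()
        self.log.append("setup_base_collisions")
        self.setup_custom_collisions()
        self.log.append("start_update_loop")

    @abstractmethod
    def create_player(self) -> None:
        """Required hook: each concrete level must define how the player is created."""

    def setup_custom_collisions(self) -> None:
        """Optional hook: the default adds no extra colliders."""


class CoinLevel(BaseLevelScene):
    """Generated code only fills in hooks; it never rewrites create()."""

    def create_player(self) -> None:
        self.log.append("create_player:knight")

    def setup_custom_collisions(self) -> None:
        self.log.append("overlap:player-coin")


scene = CoinLevel()
scene.create()
print(scene.log)
```

Because the base class fixes the call order, a generated subclass cannot accidentally register collisions before the player exists, which is the kind of initialization-order error the notes attribute to free-form generation.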

3.2 GameCoder-27B training pipeline

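GameCoder-27B is built on Qwen3.5-27B. The paper describes a three-stage training procedure but does not release the training scripts or launch config such as GPU, LR, batch size, or number of training steps:

CPT: builds a pretraining corpus from Phaser and JavaScript/TypeScript game repositories on GitHub, official documentation, and community tutorials, to learn the game loop, physics, asset usage, and state management.
SFT: uses gpt-codex5.1 to generate complex multi-step game design prompts, then uses minimax2.5 to produce target solutions, so that the model learns to go from an abstract idea to a concrete code structure.
RL: does not generate whole games directly; instead it generates single-file gameplay logic / functional modules at the component level, and derives the reward from unit tests via execution success and the aggregate test pass rate.

Conceptually, the reward can be written as:

$$r = f\left(\mathbb{1}[\text{execution succeeds}],\ \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\text{test}_{i}\ \text{passes}]\right)$$

where $N$ is the number of unit tests for the component. The paper does not give the exact weights inside $f$, so these notes do not invent a weighting coefficient $\alpha$, a batch size, or optimizer settings.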
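To make the two reward inputs concrete, here is a minimal sketch that assumes $f$ is a convex combination of the execution indicator and the test pass rate. This form of $f$, the function name `component_reward`, and the default `alpha=0.5` are illustrative assumptions, not values from the paper.

```python
def component_reward(executed: bool, test_results: list[bool], alpha: float = 0.5) -> float:
    """Illustrative reward: alpha * 1[execution succeeds] + (1 - alpha) * pass rate.

    ASSUMPTION: the paper does not specify f or any weight; alpha is hypothetical.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    exec_term = 1.0 if executed else 0.0
    # A module that failed to execute cannot pass any unit test.
    if executed and test_results:
        pass_rate = sum(test_results) / len(test_results)
    else:
        pass_rate = 0.0
    return alpha * exec_term + (1.0 - alpha) * pass_rate


# A module that runs and passes 3 of 4 tests, versus one that crashes at load time.
print(component_reward(True, [True, True, False, True]))
print(component_reward(False, [True, True, True, True]))
```

Under this assumed form, a module that executes but fails every test still earns `alpha`, so the reward separates "does not run" from "runs but is wrong".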

3.3 Autonomous agent workflow

OpenGame's agent workflow has six phases: initialization/classification, scaffolding, design generation, asset synthesis, code implementation, and verification. The key mechanisms are:

Physics-First Classification: classify-game-type maps a request to platformer, top_down, grid_logic, tower_defense, or ui_heavy according to physical constraints and perspective, not the genre name.
Scaffolding + GDD: copies the shared core, the modules/{archetype} codebase, and the matching docs; generate-gdd reads the archetype-specific API constraints and produces a technical GDD plus a file-level roadmap.
Multimodal Asset Synthesis: generate-game-assets generates backgrounds, character animations, items, and audio from the GDD asset registry; generate-tilemap converts an ASCII layout into Phaser Tilemap JSON; asset-pack.json is then read to pin down the texture keys.
Three-Layer Reading + Template Method Pattern: before implementing, the agent reads the API summary, the target source file, and the implementation guide, in that order; the code does not write a full scene from scratch but overrides extension points such as setupCustomCollisions() in _Template*.ts / base class hooks.
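The ASCII-to-tilemap step in the list above can be sketched as follows. This assumes a Tiled-style JSON layout of the kind Phaser's tilemap loader consumes; the actual schema emitted by `generate-tilemap` is not shown in these notes, and the function name, legend characters, layer name, and `tiles.png` tileset image are all hypothetical.

```python
import json


def ascii_to_tilemap(rows: list[str], legend: dict[str, int], tile_size: int = 32) -> dict:
    """Convert an ASCII layout into a minimal Tiled-style map dictionary."""
    if not rows:
        raise ValueError("layout must contain at least one row")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same length")
    # Tile layers are stored row-major; 0 means "no tile", tile ids start at firstgid.
    data = [legend[ch] for row in rows for ch in row]
    tile_count = max(legend.values())
    return {
        "type": "map",
        "orientation": "orthogonal",
        "renderorder": "right-down",
        "infinite": False,
        "width": width,
        "height": len(rows),
        "tilewidth": tile_size,
        "tileheight": tile_size,
        "layers": [{
            "name": "ground", "type": "tilelayer",
            "x": 0, "y": 0, "width": width, "height": len(rows),
            "opacity": 1, "visible": True, "data": data,
        }],
        # ASSUMPTION: a single-row tileset image with one column per tile id.
        "tilesets": [{
            "firstgid": 1, "name": "tiles", "image": "tiles.png",
            "tilewidth": tile_size, "tileheight": tile_size,
            "imagewidth": tile_count * tile_size, "imageheight": tile_size,
            "columns": tile_count, "tilecount": tile_count,
            "margin": 0, "spacing": 0,
        }],
    }


layout = ["....", ".==.", "####"]
tilemap = ascii_to_tilemap(layout, {".": 0, "#": 1, "=": 2})
print(json.dumps(tilemap["layers"][0]["data"]))
```

An unknown character in the layout raises `KeyError`, which surfaces a malformed level before the game is built instead of at runtime.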

3.4 Game Skill: Template Skill + Debug Skill

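Problem setting: given a user specification $x$, a meta template $M_{0}$, a template library $L$, and a debug protocol $P$, the system must generate a buildable, runnable project $y$.

$$y = \operatorname{RepairUntilRunnable}\big(\operatorname{Generate}(x, \operatorname{Select}(x, L)),\ P\big)$$

Template Skill starts from a game-agnostic skeleton $M_{0}$; after each completed task it extracts stable, general, and safe code fragments and merges them into $L$. The paper observes that $L$ eventually settles into five template families: gravity-based side view, top-down continuous motion, discrete grid logic, path-and-wave dynamics, and UI-driven gameplay.

Debug Skill maintains a living protocol $P$. After each build/test/runtime failure it records a triple (error signature, root cause, verified fix); high-frequency patterns become reusable rules, and novel failures extend the protocol. Debugging knowledge thus accumulates in protocol form instead of being re-derived from the prompt each time.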
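One way to picture the living protocol is as a keyed store of (error signature, root cause, verified fix) triples with a hit counter. The sketch below is an assumption about the data structure, not the repository's implementation; the class names, the `promote_after` threshold, and the example signatures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProtocolEntry:
    root_cause: str
    verified_fix: str
    hits: int = 1


@dataclass
class DebugProtocol:
    # ASSUMPTION: a pattern counts as "high-frequency" after this many occurrences.
    promote_after: int = 3
    entries: dict[str, ProtocolEntry] = field(default_factory=dict)

    def record(self, signature: str, root_cause: str, verified_fix: str) -> None:
        """Store a new failure pattern, or bump the counter of a known one."""
        entry = self.entries.get(signature)
        if entry is None:
            self.entries[signature] = ProtocolEntry(root_cause, verified_fix)
        else:
            entry.hits += 1
            entry.verified_fix = verified_fix  # keep the most recently verified fix

    def lookup(self, signature: str) -> Optional[ProtocolEntry]:
        """Return the known entry for a signature, or None for a novel failure."""
        return self.entries.get(signature)

    def reusable_rules(self) -> list[str]:
        """Signatures seen often enough to be treated as reusable rules."""
        return sorted(s for s, e in self.entries.items() if e.hits >= self.promote_after)


protocol = DebugProtocol()
for _ in range(3):
    protocol.record("missing-texture-key", "key not in asset registry", "use registered key")
protocol.record("scene-not-registered", "scene missing from config", "add scene to config")
print(protocol.reusable_rules())
```

A lookup hit lets the repair step apply a known fix directly, while a miss signals a novel failure that must be diagnosed and then recorded.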

3.5 Pseudocode grounded in released code

def physics_first_classify(game_description: str, llm) -> dict:
    """Grounded in packages/core/src/tools/game-type-classifier.ts."""
    system_prompt = build_physics_first_prompt()
    raw = llm.complete(
        system=system_prompt,
        user={"game_description": game_description},
        temperature=0.3,
        timeout_ms=15000,
    )
    parsed = parse_json_or_fallback(raw)
    archetype = parsed.get("archetype", "platformer")
    if archetype not in {"platformer", "top_down", "grid_logic", "tower_defense", "ui_heavy"}:
        archetype = heuristic_extract_archetype(raw) or "platformer"
    return {
        "archetype": archetype,
        "perspective": parsed.get("perspective"),
        "movement_type": "grid" if archetype == "grid_logic" else "continuous",
        "reasoning": parsed.get("reasoning", "physics-first fallback"),
    }

def scaffold_and_generate_gdd(workspace, user_requirement: str, archetype: str, llm) -> str:
    """Grounded in packages/core/src/tools/generate-gdd.ts and agent-test/prompts/custom.md."""
    copy_tree("agent-test/templates/core", workspace / "src")
    copy_tree(f"agent-test/templates/modules/{archetype}/src", workspace / "src")
    copy_tree(f"agent-test/docs/modules/{archetype}", workspace / f"docs/modules/{archetype}")
    core_rules = read_text(workspace / "docs/gdd/core.md")
    design_rules = read_text(workspace / f"docs/modules/{archetype}/design_rules.md")
    template_api = read_text(workspace / f"docs/modules/{archetype}/template_api.md")
    prompt = build_gdd_prompt(user_requirement, archetype, core_rules, design_rules, template_api)
    gdd = llm.complete(prompt=prompt, temperature=0.5, timeout_ms=60000)
    write_text(workspace / "GAME_DESIGN.md", gdd)
    return gdd

def implement_with_template_method(workspace, roadmap: list[str], archetype: str) -> None:
    """Grounded in agent-test/prompts/custom.md and BaseLevelScene.ts."""
    # Push roadmap-derived settings into the shared config before touching code.
    config = read_json(workspace / "src/gameConfig.json")
    config.update(extract_config_fields(roadmap))
    write_json(workspace / "src/gameConfig.json", config)
    # Three-layer reading: API summary, then target source file, then implementation guide.
    api_summary = read_text(workspace / f"docs/modules/{archetype}/template_api.md")
    for target_file in select_target_files(roadmap):
        source = read_text(workspace / target_file)
        guide = read_text(workspace_guide_for(target_file))
        patch = generate_patch(api_summary, source, guide, roadmap)
        # Template Method constraint: a patch may only fill in hooks the template declares.
        assert patch.only_overrides_existing_hooks()
        apply_patch(workspace / target_file, patch)

def debug_skill_loop(project_dir, protocol, max_iterations: int = 5) -> bool:
    """Grounded in agent-test/debug-skill/src/debug-loop.ts, validator.ts, repairer.ts."""
    for _ in range(max_iterations):
        # Pre-execution consistency checks run first; fix one violation per iteration.
        violations = validate_project(project_dir, protocol)
        if violations:
            repair_with_protocol(project_dir, violations[0], protocol)
            continue
        # Build stage: diagnose, repair the first error, and record the outcome.
        build = run_stage(project_dir, "build")
        if not build.success:
            diagnosis = diagnose_errors(build.errors, protocol)
            repair = apply_repair(diagnosis, build.errors[0], project_dir)
            record_outcome(protocol, build.errors[0], diagnosis, repair)
            continue
        # Test stage: same diagnose/repair/record cycle as the build stage.
        test = run_stage(project_dir, "test")
        if not test.success:
            diagnosis = diagnose_errors(test.errors, protocol)
            repair = apply_repair(diagnosis, test.errors[0], project_dir)
            record_outcome(protocol, test.errors[0], diagnosis, repair)
            continue
        # Validation, build, and test all passed within the iteration budget.
        return True
    return False

def evolve_template_skill(completed_project, library):
    """Grounded in agent-test/template-skill/src/evolve.ts and library-manager.ts."""
    snapshot = collect_project_snapshot(completed_project)
    classification = classify_project_by_physics(snapshot, library)
    reusable_patterns = extract_patterns(snapshot, classification)
    abstracted_templates = abstract_into_reusable_templates(reusable_patterns)
    library = merge_into_library(library, abstracted_templates, reusable_patterns)
    save_library_manifest_and_family_files(library)
    return library

Code reference: main @ c54307ef (2026-04-22) — pseudocode and mapping based on this commit

Paper Concept	Source File	Key Class/Function
Physics-First Classification	`packages/core/src/tools/game-type-classifier.ts`	`GameTypeClassifierTool`, `GameArchetype`
GDD generation with archetype docs	`packages/core/src/tools/generate-gdd.ts`	`GenerateGDDTool`, `GenerateGDDInvocation`
Multimodal asset generation	`packages/core/src/tools/generate-assets.ts`	`GenerateAssetsTool`, `asset-pack.json` output
ASCII-to-Phaser tilemap	`packages/core/src/tools/generate-tilemap.ts`	`GenerateTilemapTool`
Six-phase workflow prompt	`agent-test/prompts/custom.md`	ordered calls to `classify-game-type`, scaffold, `generate-gdd`, asset tools, verification
Template Method hooks	`agent-test/templates/modules/platformer/src/scenes/BaseLevelScene.ts`	`BaseLevelScene`, `setupCustomCollisions()` hook, abstract scene methods
Template Skill evolution	`agent-test/template-skill/src/evolve.ts`	`evolveFromProject()`, `mergeIntoLibrary()`
Debug Skill loop	`agent-test/debug-skill/src/debug-loop.ts`	build/test/diagnose/repair/record loop
Pre-execution consistency checks	`agent-test/debug-skill/src/validator.ts`	asset key, `gameConfig.json`, scene registration validators
Repair protocol application	`agent-test/debug-skill/src/repairer.ts`	`applyRepair()`, known fix vs LLM-generated fix
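As an illustration of the pre-execution consistency checks listed in the mapping, the sketch below flags texture keys that scene code references but the asset registry does not contain. It is not the repository's `validator.ts`: the function name and the regex are assumptions, and because the structure of `asset-pack.json` is not shown in these notes, the registered keys are passed in as a plain set. The only Phaser detail relied on is that `add.sprite(x, y, key)` and `add.image(x, y, key)` take the texture key as the third argument.

```python
import re

# Matches calls such as this.add.sprite(100, 200, 'player') and captures the key.
_TEXTURE_CALL = re.compile(
    r"""\.(?:sprite|image)\(\s*[^,]+,\s*[^,]+,\s*['"]([^'"]+)['"]"""
)


def find_missing_asset_keys(source: str, registered_keys: set[str]) -> list[str]:
    """Return texture keys used in `source` that are absent from the registry."""
    used = set(_TEXTURE_CALL.findall(source))
    return sorted(used - registered_keys)


scene_source = """
this.add.image(0, 0, 'bg');
this.player = this.physics.add.sprite(100, 200, "player");
this.add.sprite(5, 5, 'coin');
"""
print(find_missing_asset_keys(scene_source, {"bg", "player"}))
```

A static check like this catches a misspelled or never-generated key before the build, which is cheaper than discovering a missing texture during automated play.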

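Differences between the paper's formulation and the released code: the paper reports the CPT/SFT/RL training pipeline of GameCoder-27B and the dynamic evaluation pipeline of OpenGame-Bench, but main@c54307ef contains only README/docs-level descriptions. It does not release the GameCoder training scripts, the training hyperparameters, the OpenGame-Bench evaluator source, or the experiment configs needed to reproduce the paper's tables. The released code mainly covers the OpenGame agent runtime, Game Skill, the template/debug protocol, and the tool implementations. The training and benchmark numbers in these notes therefore come from the paper, not from any repo launch config.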

4. Experimental Setup

Benchmark: OpenGame-Bench, 150 tasks drawn from 150 unique natural-language prompts, covering platformers, top-down shooters, puzzle games, arcade classics, and strategy. Each prompt is the only input, with no reference implementation or starter code. Tasks come from curated public game-jam repositories and AI-assisted design briefs, and were manually confirmed to be implementable in a 2D web framework.
Evaluation protocol: each generated project must build, run behind a local HTTP server without fatal runtime errors, and produce at least one non-empty screenshot during automated play. Each task is evaluated with 3 random seeds, and mean scores are reported.
Metrics: Build Health (BH) measures compilation/runtime/rendering stability; Visual Usability (VU) combines pixel-level heuristics (frame entropy, motion detection) with a VLM judge; Intent Alignment (IA) is the weighted pass rate of prompt requirements as judged by a VLM. All three are scaled to $[0, 100]$.
Baselines: direct LLMs include Qwen-3.5-Max, MiniMax m2.5, GLM-4.5, Kimi K2.5, DeepSeek V3.2, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro; agentic frameworks include qwen-code (multiple backends) and Cursor (Kimi K2.5 / Claude Sonnet 4.6). Every baseline prompt explicitly requires Phaser 3, to avoid degenerating into single-file vanilla JS.

Traceability of the training configuration: the paper states that the GameCoder-27B backbone is Qwen3.5-27B and that the training stages are CPT/SFT/RL, but it does not report GPU type/count, training steps, learning rate, or batch size. The released repo at main@c54307ef has no corresponding training launch script either, so the README or base configs cannot be treated as the source of the experimental training configuration.
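The IA definition (a weighted pass rate scaled to $[0, 100]$) and the 3-seed averaging can be sketched as below. The evaluator source is not released, so the function names, the requirement weights, and the per-seed verdicts here are hypothetical inputs chosen only to show the arithmetic.

```python
from statistics import mean


def intent_alignment(requirements: list[tuple[float, bool]]) -> float:
    """Weighted pass rate in [0, 100] over (weight, passed) pairs from a judge."""
    total = sum(weight for weight, _ in requirements)
    if total <= 0:
        raise ValueError("total requirement weight must be positive")
    passed = sum(weight for weight, ok in requirements if ok)
    return 100.0 * passed / total


def mean_over_seeds(per_seed_scores: list[float]) -> float:
    """Average a metric over the random seeds used for one task."""
    return mean(per_seed_scores)


# Hypothetical verdicts for one task under three seeds (weights are assumptions).
seed_runs = [
    [(2.0, True), (1.0, False), (1.0, True)],
    [(2.0, True), (1.0, False), (1.0, False)],
    [(2.0, True), (1.0, True), (1.0, True)],
]
per_seed = [intent_alignment(run) for run in seed_runs]
print(per_seed, mean_over_seeds(per_seed))
```

Weighting lets a core mechanic count for more than a cosmetic requirement, so a game that misses its main mechanic scores lower than one that only misses a minor detail.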

5. Experimental Results

5.1 Main results on OpenGame-Bench

Category	System / Model	BH	VU	IA
Direct LLM	Qwen-3.5-Max	51.8	35.5	38.9
Direct LLM	MiniMax m2.5	39.7	39.3	31.8
Direct LLM	GLM-4.5	46.5	45.0	31.2
Direct LLM	Kimi K2.5	45.6	46.8	44.6
Direct LLM	DeepSeek V3.2	57.0	38.9	33.5
Direct LLM	Claude Sonnet 4.6	58.5	50.8	50.3
Direct LLM	GPT-5.1	57.4	52.9	49.4
Direct LLM	Gemini 3.1 Pro	53.6	60.2	42.1
Agentic	qwen-code + Qwen-3.5-Max	57.7	41.3	40.2
Agentic	qwen-code + MiniMax m2.5	48.1	39.1	34.6
Agentic	qwen-code + Kimi K2.5	59.6	52.1	49.9
Agentic	qwen-code + Claude Sonnet 4.6	63.2	54.3	57.8
Agentic	Cursor + Kimi K2.5	57.1	55.2	54.2
Agentic	Cursor + Claude Sonnet 4.6	66.8	61.4	58.9
Ours	OpenGame + Qwen-3.5-27B	62.8	53.8	49.8
Ours	OpenGame + GameCoder-27B	63.9	57.0	54.1
Ours	OpenGame + Claude Sonnet 4.6	72.4	67.2	65.1

OpenGame + Claude Sonnet 4.6 reaches BH=72.4, VU=67.2, IA=65.1, improving on the strongest baseline, Cursor + Claude Sonnet 4.6, by +5.6, +5.8, and +6.2 respectively. OpenGame + GameCoder-27B reaches 63.9/57.0/54.1: it exceeds every direct LLM baseline on BH and IA, and beats qwen-code + Claude Sonnet 4.6 by +0.7 BH and +2.7 VU, but trails it by 3.7 on IA.
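The deltas quoted above can be re-derived from the main results table. The scores below are copied from that table; only the helper name `delta` is introduced for this example.

```python
# (BH, VU, IA) scores copied from the main results table.
SCORES = {
    "Cursor + Claude Sonnet 4.6": (66.8, 61.4, 58.9),
    "qwen-code + Claude Sonnet 4.6": (63.2, 54.3, 57.8),
    "OpenGame + GameCoder-27B": (63.9, 57.0, 54.1),
    "OpenGame + Claude Sonnet 4.6": (72.4, 67.2, 65.1),
}


def delta(system: str, baseline: str) -> tuple[float, ...]:
    """Per-metric difference (system minus baseline), rounded to one decimal."""
    return tuple(round(a - b, 1) for a, b in zip(SCORES[system], SCORES[baseline]))


print(delta("OpenGame + Claude Sonnet 4.6", "Cursor + Claude Sonnet 4.6"))
print(delta("OpenGame + GameCoder-27B", "qwen-code + Claude Sonnet 4.6"))
```

The second comparison shows the trade-off in the text: the 27B model inside OpenGame wins on BH and VU against a general framework running a frontier model, but not on IA.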

5.2 Ablation: GameCoder-27B training pipeline

Stage	Training Components	BH	VU	IA
Base Model	Qwen-3.5-27B in OpenGame	62.8	53.8	49.8
Stage 1	+ CPT	63.2	54.7	50.6
Stage 2	+ CPT + SFT	63.5	55.7	52.5
Stage 3	+ CPT + SFT + RL	63.9	57.0	54.1

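SFT contributes the largest IA increment (+1.9), which suggests that high-quality synthetic QA is key to aligning with creative game specifications. RL further improves VU and IA, but the headline gain still comes mainly from the agent framework.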

5.3 Ablation: agent workflow mechanisms

Agent Configuration	BH	VU	IA
OpenGame Full Workflow	72.4	67.2	65.1
w/o Hook-Driven Implementation	62.3	57.6	53.5
w/o Three-Layer Reading	67.8	61.9	56.5
w/o Physics-First Classification	70.2	64.6	61.6

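Hook-Driven Implementation matters most: removing it drops BH by 10.1 and IA by 11.6, and often leads to lifecycle management errors. Removing Three-Layer Reading drops IA by 8.6, which indicates that progressive salience control is still necessary for multi-file synthesis.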

5.4 Ablation: Agent Evolution / Game Skills

Template Architecture	Debugging Strategy	BH	VU	IA
Static Skeleton $M_{0}$	Static Rule Checklist	60.5	54.8	51.2
Static Skeleton $M_{0}$	Full Living Protocol $P$	65.4	59.2	56.3
Partial Evolved Library (2 Families)	Static Rule Checklist	63.1	57.3	53.8
Full Evolved Library (5 Families)	Static Rule Checklist	66.3	60.7	57.9
Full Evolved Library (5 Families)	Post-Execution Fixes Only	69.5	63.8	61.4
Full Evolved Library (5 Families)	Full Living Protocol $P$	72.4	67.2	65.1

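The full evolved library combined with the living protocol yields gains from both scaffolding and debugging; relying only on the static skeleton (with a static rule checklist) caps generation quality at BH=60.5 and IA=51.2.

Figure 3 reading: the x-axis is the maximum number of automated debugging iterations $T$, and the y-axis is BH/VU/IA. At $T = 0$, BH is only 58.4, showing that single-pass generation of a complex multi-file Phaser project is fragile. The gains are steepest from $T = 0$ to $T = 3$ and then plateau gradually toward $T = 5$, which corresponds to the full system's 72.4/67.2/65.1.

Figure 4 reading: OpenGame's advantage is uneven across game types. IA is 76.8 for Platformers, 71.4 for Top-Down Shooters, 66.5 for Arcade classics, 58.2 for Strategy, and 52.6 for Puzzle/UI. Games with strong physical and spatial constraints are a better fit for template families; abstract strategy or UI logic is more prone to silent state desynchronization.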

5.5 Limitations and conclusions

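The authors do not list limitations separately, but the results expose two boundaries. First, even with full OpenGame, about 34.9% of the weighted mechanical requirements (100 minus the IA of 65.1) remain partly or fully unmet. Second, IA for Strategy and Puzzle/UI is clearly lower than for Platformers / Top-Down Shooters, which suggests that mechanics weakly coupled to visuals, such as abstract logic state, inventory, and match-three rules, are harder for the current templates and debug protocol to cover reliably.

Overall conclusion: reliable game generation requires more than a stronger code model; it also needs persistent structural priors, reusable debug knowledge, and an evaluation protocol grounded in real execution. OpenGame combines these mechanisms into an agentic coding system for games, and on OpenGame-Bench it outperforms both direct LLMs and general agentic coding baselines.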
