OpenGame: Open Agentic Coding for Games

Paper: arXiv:2604.18394
Code: leigest519/OpenGame
Code reference: main @ c54307ef (2026-04-22)

1. Motivation

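Existing code LLMs and code agents can write local functions, boilerplate, or single-file demos, but end-to-end web game generation simultaneously involves the update loop, physics, event handling, the asset pipeline, scene wiring, configuration, and cross-file state. The paper identifies three failure modes that frontier models repeatedly exhibit on complete game tasks: loss of global state, which makes the logic incoherent; misuse of engine abstractions such as Phaser's, which creates an engine-specific gap; and cross-file inconsistency among asset keys, scenes, config, and initialization order.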

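The problem this paper addresses is not "translate a prompt into a piece of game code" but "automatically produce a buildable, runnable, interactive 2D web game project from a natural-language design specification." It is worth studying because games are closer to real complex applications than static programming problems: even when the code compiles, the visuals, the interaction, the win/lose state, and user-specified mechanics can still fail, so the task calls for agentic software engineering oriented toward execution and playability.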

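Figure 1 walkthrough: the figure shows the task OpenGame targets. The user supplies only a natural-language game idea, and the system must automatically generate playable 2D games of different types, connecting code, visual assets, audio assets, and the full game lifecycle. It highlights how this work differs from ordinary code generation benchmarks: the output is a complete interactive project, not a single function.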

2. Idea

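Core insight: the main failures in end-to-end game generation are not isolated syntax errors but structural drift and cross-file inconsistency over a long task. OpenGame therefore turns "reusable structural priors" and "accumulable debugging knowledge" into explicit agent skills rather than relying only on a larger general-purpose model.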

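The key innovations span three layers. First, Game Skill consists of Template Skill and Debug Skill: the former captures reusable project skeletons, the latter captures verified repair protocols. Second, GameCoder-27B acquires game-engine and game-logic priors through continual pre-training (CPT), supervised fine-tuning (SFT), and execution-grounded reinforcement learning (RL). Third, OpenGame-Bench evaluates Build Health, Visual Usability, and Intent Alignment using headless browser execution plus VLM judging.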

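Compared with general-purpose agentic coding frameworks such as qwen-code and Cursor, OpenGame's fundamental difference is that it front-loads game structure constraints: it first selects a template family via Physics-First Classification and then overrides hooks through the Template Method Pattern, instead of letting the model freely generate every file from an empty project.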

3. Method

3.1 Overall framework

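Figure 2 walkthrough: OpenGame has three coupled modules. On the left is the three-stage training pipeline for GameCoder-27B: CPT builds Phaser / JavaScript / TypeScript game priors, SFT aligns the model with complex game design instructions, and execution-grounded RL reinforces executable logic. In the middle is the six-phase autonomous agent workflow, which runs from classification through scaffolding, game design document (GDD) generation, asset synthesis, and code implementation to verification. On the right is agent evolution: Template Skill grows the template library and Debug Skill grows the living debug protocol.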

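Intuitively, OpenGame breaks long-horizon generation into structured interfaces: classification fixes the physics and perspective, the template fixes the lifecycle skeleton, the GDD fixes which mechanics are implementable, the asset registry fixes the resource keys, and the debug protocol fixes the repair path. What the agent must generate freely is thereby compressed into a limited set of hooks and configuration fields, which makes cross-file consistency easier to check and repair.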
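To illustrate why a fixed asset registry makes cross-file consistency checkable, here is a minimal Python sketch. It assumes a Phaser-style asset pack (sections holding a `files` list whose entries carry a `key`) and uses a deliberately naive regex to find texture keys passed to `sprite(...)` or `image(...)` calls. The function names, the regex, and the sample data are hypothetical and are not taken from the OpenGame repository.

```python
import re

def registered_keys(asset_pack: dict) -> set[str]:
    """Collect every key declared in a Phaser-style asset pack (assumed layout)."""
    keys = set()
    for section in asset_pack.values():
        if not isinstance(section, dict):
            continue  # skip non-section entries such as metadata
        for entry in section.get("files", []):
            if "key" in entry:
                keys.add(entry["key"])
    return keys

# Naive assumption: the texture key is the third argument of
# add.sprite(x, y, 'key') or add.image(x, y, 'key').
KEY_USE = re.compile(r"\.(?:sprite|image)\(\s*[^,]+,\s*[^,]+,\s*['\"]([^'\"]+)['\"]")

def referenced_keys(source: str) -> set[str]:
    """Return the texture keys a source file appears to use."""
    return set(KEY_USE.findall(source))

def missing_keys(asset_pack: dict, sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each file name to the keys it uses that the pack does not declare."""
    declared = registered_keys(asset_pack)
    report = {}
    for name, text in sources.items():
        missing = referenced_keys(text) - declared
        if missing:
            report[name] = missing
    return report

# Hypothetical sample data.
pack = {"main": {"files": [{"type": "image", "key": "player", "url": "assets/player.png"}]}}
sources = {
    "Level1.ts": "this.add.sprite(100, 200, 'player'); this.add.image(0, 0, 'bg_forest');",
}
print(missing_keys(pack, sources))  # {'Level1.ts': {'bg_forest'}}
```

Because the set of valid keys is fixed before code is written, a mismatch like `bg_forest` becomes a simple set difference rather than a runtime surprise.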

3.2 GameCoder-27B training pipeline

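GameCoder-27B is based on Qwen3.5-27B. The paper describes three training stages but does not release the launch config: training scripts, GPUs, learning rate, batch size, and number of training steps are all unpublished.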

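  • CPT: builds a pre-training corpus from Phaser and JavaScript/TypeScript game repositories on GitHub, official documentation, and community tutorials, so that the model learns game loops, physics, asset usage, and state management.
  • SFT: uses gpt-codex5.1 to generate complex multi-step game design prompts and minimax2.5 to produce the target solutions, teaching the model to go from an abstract idea to a concrete code structure.
  • RL: rather than generating a whole game, the model generates single-file gameplay logic or functional modules at the component level, and the reward comes from unit tests: execution success plus aggregate test pass rate.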

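Conceptually, the reward can be written as a weighted combination of the two signals (the notation here is illustrative):

$$R = \lambda_{\text{exec}} \cdot \mathbb{1}[\text{execution succeeds}] + \lambda_{\text{test}} \cdot \text{PassRate}$$

The paper does not give exact values for the weights, so these notes do not invent weight values, batch size, or optimizer settings.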
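The component-level reward (execution success plus aggregate test pass rate) can be sketched as follows. This assumes a plain weighted sum; the weights `w_exec` and `w_tests`, their default values, and the `UnitTestReport` structure are hypothetical placeholders, since the paper publishes none of them.

```python
from dataclasses import dataclass

@dataclass
class UnitTestReport:
    """Outcome of running the unit tests for one generated module (hypothetical structure)."""
    executed: bool  # did the module load and run without crashing?
    passed: int     # number of unit tests that passed
    total: int      # number of unit tests that were run

def component_reward(report: UnitTestReport, w_exec: float = 0.5, w_tests: float = 0.5) -> float:
    """Weighted sum of execution success and aggregate test pass rate.

    The default weights are illustrative assumptions, not values from the paper.
    """
    if not report.executed:
        return 0.0  # nothing ran, so no test can count
    pass_rate = report.passed / report.total if report.total else 0.0
    return w_exec * 1.0 + w_tests * pass_rate

print(component_reward(UnitTestReport(executed=True, passed=3, total=4)))   # 0.875
print(component_reward(UnitTestReport(executed=False, passed=0, total=4)))  # 0.0
```

Scoring one module against its unit tests keeps the reward cheap to compute and attributable to a single file, which is the stated reason for working at component level rather than on whole games.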

3.3 Autonomous agent workflow

The OpenGame agent workflow has six phases: initialization/classification, scaffolding, design generation, asset synthesis, code implementation, and verification. The key mechanisms are:

  • Physics-First Classification: classify-game-type maps the request to one of platformer, top_down, grid_logic, tower_defense, or ui_heavy based on physical constraints and perspective rather than genre name.
  • Scaffolding + GDD: copies the shared core, the modules/{archetype} codebase, and the matching docs; generate-gdd reads archetype-specific API constraints and produces a technical GDD and a file-level roadmap.
  • Multimodal Asset Synthesis: generate-game-assets generates backgrounds, character animations, items, and audio from the GDD asset registry; generate-tilemap converts an ASCII layout into Phaser Tilemap JSON; the agent then reads asset-pack.json to pin the texture keys.
  • Three-Layer Reading + Template Method Pattern: before implementing, the agent reads the API summary, the target source file, and the implementation guide, in that order; the code is not a full scene written from scratch but overrides extension points such as setupCustomCollisions() in _Template*.ts / base class hooks.
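The hook-override idea in the last bullet can be sketched in Python. This is a stand-in only: the real templates are TypeScript, and apart from the hook name (a snake_case rendering of `setupCustomCollisions()`), every class name and lifecycle step below is a hypothetical illustration.

```python
class BaseLevelScene:
    """Python stand-in for a template base scene; the real template is TypeScript."""

    def create(self) -> list[str]:
        # The lifecycle order is fixed by the template and is never rewritten.
        log = ["load tilemap", "spawn player"]
        log.extend(self.setup_custom_collisions())  # extension point
        log.append("start HUD")
        return log

    def setup_custom_collisions(self) -> list[str]:
        # Default hook: no game-specific collisions.
        return []


class LavaLevelScene(BaseLevelScene):
    """Game-specific scene: only the hook is overridden."""

    def setup_custom_collisions(self) -> list[str]:
        return ["player vs lava -> lose life"]


print(LavaLevelScene().create())
# ['load tilemap', 'spawn player', 'player vs lava -> lose life', 'start HUD']
```

The generated code never touches `create()`, so initialization order cannot drift no matter what the subclass does inside the hook.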

3.4 Game Skill: Template Skill + Debug Skill

Problem setting: given a user specification, a meta template, a template library, and a debug protocol, the system must generate a buildable, runnable project.

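Template Skill starts from a game-agnostic skeleton; after each completed task it extracts stable, general, safe code fragments and merges them into the template library. The paper observes that the library eventually settles into five template families: gravity-based side view, top-down continuous motion, discrete grid logic, path-and-wave dynamics, and UI-driven gameplay.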

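Debug Skill maintains a living protocol. After every build, test, or runtime failure it records a triple (error signature, root cause, verified fix); high-frequency patterns become reusable rules, and novel failures extend the protocol. Debugging knowledge thus accumulates as a protocol instead of being re-derived from the prompt each time.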
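A toy version of such a living protocol is sketched below. It assumes a dictionary keyed by error signature and a simple count-based promotion threshold; the class, the threshold, and the sample signatures are hypothetical and do not reflect the data structures in the released debug-skill code.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class DebugProtocol:
    """Toy living protocol: verified fixes keyed by error signature (hypothetical design)."""
    promote_after: int = 2                          # assumed promotion threshold
    entries: dict = field(default_factory=dict)     # signature -> (root_cause, verified_fix)
    hits: Counter = field(default_factory=Counter)  # signature -> times recorded
    rules: set = field(default_factory=set)         # signatures promoted to reusable rules

    def record(self, signature: str, root_cause: str, verified_fix: str) -> None:
        """Store a verified fix and promote frequent signatures to reusable rules."""
        self.entries[signature] = (root_cause, verified_fix)
        self.hits[signature] += 1
        if self.hits[signature] >= self.promote_after:
            self.rules.add(signature)

    def lookup(self, signature: str):
        """Return the known (root_cause, verified_fix) pair, or None for a novel failure."""
        return self.entries.get(signature)


protocol = DebugProtocol()
sig = "TextureMissing:player_idle"
protocol.record(sig, "key absent from asset pack", "register key in asset-pack.json")
print(protocol.lookup("TS2339"))  # None: a novel failure needs fresh diagnosis
protocol.record(sig, "key absent from asset pack", "register key in asset-pack.json")
print(sig in protocol.rules)      # True: recorded twice, now a reusable rule
```

A lookup miss is the signal to fall back to fresh diagnosis, and a successful diagnosis feeds back in through `record`, which is how the protocol grows over time.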

3.5 Pseudocode grounded in released code

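# Pseudocode: helper functions that are called but not defined here stand in
# for logic in the files named in each docstring.
def physics_first_classify(game_description: str, llm) -> dict: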
    """Grounded in packages/core/src/tools/game-type-classifier.ts."""
    system_prompt = build_physics_first_prompt()
    raw = llm.complete(
        system=system_prompt,
        user={"game_description": game_description},
        temperature=0.3,
        timeout_ms=15000,
    )
    parsed = parse_json_or_fallback(raw)
    archetype = parsed.get("archetype", "platformer")
    if archetype not in {"platformer", "top_down", "grid_logic", "tower_defense", "ui_heavy"}:
        archetype = heuristic_extract_archetype(raw) or "platformer"
    return {
        "archetype": archetype,
        "perspective": parsed.get("perspective"),
        "movement_type": "grid" if archetype == "grid_logic" else "continuous",
        "reasoning": parsed.get("reasoning", "physics-first fallback"),
    }


def scaffold_and_generate_gdd(workspace, user_requirement: str, archetype: str, llm) -> str:
    """Grounded in packages/core/src/tools/generate-gdd.ts and agent-test/prompts/custom.md."""
    copy_tree("agent-test/templates/core", workspace / "src")
    copy_tree(f"agent-test/templates/modules/{archetype}/src", workspace / "src")
    copy_tree(f"agent-test/docs/modules/{archetype}", workspace / f"docs/modules/{archetype}")
    core_rules = read_text(workspace / "docs/gdd/core.md")
    design_rules = read_text(workspace / f"docs/modules/{archetype}/design_rules.md")
    template_api = read_text(workspace / f"docs/modules/{archetype}/template_api.md")
    prompt = build_gdd_prompt(user_requirement, archetype, core_rules, design_rules, template_api)
    gdd = llm.complete(prompt=prompt, temperature=0.5, timeout_ms=60000)
    write_text(workspace / "GAME_DESIGN.md", gdd)
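    return gdd


def implement_with_template_method(workspace, archetype: str, roadmap: list[str]) -> None:
    """Grounded in agent-test/prompts/custom.md and BaseLevelScene.ts."""
    # Update config fields from the roadmap before touching any code.
    config = read_json(workspace / "src/gameConfig.json")
    config.update(extract_config_fields(roadmap))
    write_json(workspace / "src/gameConfig.json", config)
    # Three-layer reading: API summary, then target source, then implementation guide.
    # Read the summary for the selected archetype rather than hard-coding "platformer".
    api_summary = read_text(workspace / f"docs/modules/{archetype}/template_api.md")
    for target_file in select_target_files(roadmap):
        source = read_text(workspace / target_file)
        guide = read_text(workspace_guide_for(target_file))
        patch = generate_patch(api_summary, source, guide, roadmap)
        # Template Method constraint: a patch may only override existing hooks.
        if not patch.only_overrides_existing_hooks():
            raise ValueError(f"{target_file}: patch touches code outside template hooks")
        apply_patch(workspace / target_file, patch)


def debug_skill_loop(project_dir, protocol, max_iterations: int = 5) -> bool: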
    """Grounded in agent-test/debug-skill/src/debug-loop.ts, validator.ts, repairer.ts."""
    for iteration in range(max_iterations):
        violations = validate_project(project_dir, protocol)
        if violations:
            repair_with_protocol(project_dir, violations[0], protocol)
            continue
        build = run_stage(project_dir, "build")
        if not build.success:
            diagnosis = diagnose_errors(build.errors, protocol)
            repair = apply_repair(diagnosis, build.errors[0], project_dir)
            record_outcome(protocol, build.errors[0], diagnosis, repair)
            continue
        test = run_stage(project_dir, "test")
        if not test.success:
            diagnosis = diagnose_errors(test.errors, protocol)
            repair = apply_repair(diagnosis, test.errors[0], project_dir)
            record_outcome(protocol, test.errors[0], diagnosis, repair)
            continue
        return True
    return False


def evolve_template_skill(completed_project, library):
    """Grounded in agent-test/template-skill/src/evolve.ts and library-manager.ts."""
    snapshot = collect_project_snapshot(completed_project)
    classification = classify_project_by_physics(snapshot, library)
    reusable_patterns = extract_patterns(snapshot, classification)
    abstracted_templates = abstract_into_reusable_templates(reusable_patterns)
    library = merge_into_library(library, abstracted_templates, reusable_patterns)
    save_library_manifest_and_family_files(library)
    return library

Code reference: main @ c54307ef (2026-04-22). The pseudocode and the mapping are based on this commit.

| Paper Concept | Source File | Key Class/Function |
| --- | --- | --- |
| Physics-First Classification | packages/core/src/tools/game-type-classifier.ts | GameTypeClassifierTool, GameArchetype |
| GDD generation with archetype docs | packages/core/src/tools/generate-gdd.ts | GenerateGDDTool, GenerateGDDInvocation |
| Multimodal asset generation | packages/core/src/tools/generate-assets.ts | GenerateAssetsTool, asset-pack.json output |
| ASCII-to-Phaser tilemap | packages/core/src/tools/generate-tilemap.ts | GenerateTilemapTool |
| Six-phase workflow prompt | agent-test/prompts/custom.md | ordered calls to classify-game-type, scaffold, generate-gdd, asset tools, verification |
| Template Method hooks | agent-test/templates/modules/platformer/src/scenes/BaseLevelScene.ts | BaseLevelScene, setupCustomCollisions() hook, abstract scene methods |
| Template Skill evolution | agent-test/template-skill/src/evolve.ts | evolveFromProject(), mergeIntoLibrary() |
| Debug Skill loop | agent-test/debug-skill/src/debug-loop.ts | build/test/diagnose/repair/record loop |
| Pre-execution consistency checks | agent-test/debug-skill/src/validator.ts | asset key, gameConfig.json, scene registration validators |
| Repair protocol application | agent-test/debug-skill/src/repairer.ts | applyRepair(), known fix vs LLM-generated fix |

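Differences between the paper and the released code: the paper reports the CPT/SFT/RL training pipeline for GameCoder-27B and the dynamic evaluation pipeline of OpenGame-Bench, but main@c54307ef contains only README/docs-level descriptions of them. The GameCoder training scripts, the training hyperparameters, the OpenGame-Bench evaluator source, and the experiment configs needed to reproduce the paper's tables are not released. The released code mainly covers the OpenGame agent runtime, Game Skill, the template/debug protocol, and the tool implementations. The training and benchmark numbers in these notes therefore come from the paper, not from a repo launch config.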

4. Experimental Setup

  • Benchmark: OpenGame-Bench, 150 tasks built from 150 unique natural-language prompts, covering platformers, top-down shooters, puzzle games, arcade classics, and strategy. The prompt is the only input for each task; there is no reference implementation or starter code. Tasks are drawn from curated public game-jam repositories and AI-assisted design briefs, and each was manually confirmed to be implementable in a 2D web framework.
  • Evaluation protocol: every generated project must build, run behind a local HTTP server without fatal runtime errors, and produce at least one non-empty screenshot during automated play. Each task is evaluated with 3 random seeds and mean scores are reported.
  • Metrics: Build Health (BH) measures compilation, runtime, and rendering stability; Visual Usability (VU) combines pixel-level heuristics (frame entropy, motion detection) with a VLM judge; Intent Alignment (IA) is the VLM judge's weighted pass rate over the prompt's requirements. All three are scaled to a common range; the scores reported in Section 5 fall between 0 and 100.
  • Baselines: direct LLMs are Qwen-3.5-Max, MiniMax m2.5, GLM-4.5, Kimi K2.5, DeepSeek V3.2, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro; agentic frameworks are qwen-code (multiple backends) and Cursor (Kimi K2.5 / Claude Sonnet 4.6). Every baseline prompt explicitly requires Phaser 3, to avoid degenerating into single-file vanilla JS.
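The metric definitions above can be made concrete with a small sketch. The evaluator source is not released, so everything here is an assumption: `frame_entropy` shows one common pixel-level heuristic (Shannon entropy of grayscale intensities), `intent_alignment` shows a weighted pass rate over hypothetical (weight, passed) checks, and `mean_over_seeds` averages per-seed scores.

```python
import math
from collections import Counter
from statistics import mean

def frame_entropy(pixels: list[int]) -> float:
    """Shannon entropy in bits of a grayscale frame given as a flat list of intensities."""
    counts = Counter(pixels)
    n = len(pixels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def intent_alignment(checks: list[tuple[float, bool]]) -> float:
    """Weighted pass rate over (weight, passed) requirement checks, scaled to 0-100."""
    total = sum(w for w, _ in checks)
    passed = sum(w for w, ok in checks if ok)
    return 100.0 * passed / total if total else 0.0

def mean_over_seeds(scores: list[float]) -> float:
    """Average one metric over the per-seed runs of a task."""
    return mean(scores)

blank = [0] * 16             # single-color frame: no information
mixed = [0] * 8 + [255] * 8  # half black, half white
print(frame_entropy(blank), frame_entropy(mixed))                  # 0.0 1.0
print(intent_alignment([(2.0, True), (1.0, False), (1.0, True)]))  # 75.0
print(mean_over_seeds([60.0, 75.0, 75.0]))                         # 70.0
```

A near-zero entropy frame is a cheap signal for a blank or crashed canvas, which is why a heuristic of this kind pairs naturally with the more expensive VLM judgment.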

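Traceability of the training configuration: the paper states that the GameCoder-27B backbone is Qwen3.5-27B and that training proceeds through CPT, SFT, and RL, but it reports no GPU type or count, training steps, learning rate, or batch size. The released repo at main@c54307ef has no corresponding training launch script either, so the README or base configs cannot be treated as the source of the experimental training configuration.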

5. Experimental Results

5.1 Main results on OpenGame-Bench

| Category | System / Model | BH | VU | IA |
| --- | --- | --- | --- | --- |
| Direct LLM | Qwen-3.5-Max | 51.8 | 35.5 | 38.9 |
| Direct LLM | MiniMax m2.5 | 39.7 | 39.3 | 31.8 |
| Direct LLM | GLM-4.5 | 46.5 | 45.0 | 31.2 |
| Direct LLM | Kimi K2.5 | 45.6 | 46.8 | 44.6 |
| Direct LLM | DeepSeek V3.2 | 57.0 | 38.9 | 33.5 |
| Direct LLM | Claude Sonnet 4.6 | 58.5 | 50.8 | 50.3 |
| Direct LLM | GPT-5.1 | 57.4 | 52.9 | 49.4 |
| Direct LLM | Gemini 3.1 Pro | 53.6 | 60.2 | 42.1 |
| Agentic | qwen-code + Qwen-3.5-Max | 57.7 | 41.3 | 40.2 |
| Agentic | qwen-code + MiniMax m2.5 | 48.1 | 39.1 | 34.6 |
| Agentic | qwen-code + Kimi K2.5 | 59.6 | 52.1 | 49.9 |
| Agentic | qwen-code + Claude Sonnet 4.6 | 63.2 | 54.3 | 57.8 |
| Agentic | Cursor + Kimi K2.5 | 57.1 | 55.2 | 54.2 |
| Agentic | Cursor + Claude Sonnet 4.6 | 66.8 | 61.4 | 58.9 |
| Ours | OpenGame + Qwen-3.5-27B | 62.8 | 53.8 | 49.8 |
| Ours | OpenGame + GameCoder-27B | 63.9 | 57.0 | 54.1 |
| Ours | OpenGame + Claude Sonnet 4.6 | 72.4 | 67.2 | 65.1 |

OpenGame + Claude Sonnet 4.6 reaches BH=72.4, VU=67.2, IA=65.1, improving on the strongest baseline, Cursor + Claude Sonnet 4.6, by +5.6, +5.8, and +6.2 respectively. OpenGame + GameCoder-27B reaches 63.9/57.0/54.1, beating every direct LLM baseline on BH and IA, and exceeds qwen-code + Claude Sonnet 4.6 by +0.7 BH and +2.7 VU while trailing it by 3.7 IA.
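The deltas quoted here can be re-derived from the table with a few lines of Python. The numbers are copied from the main results table; the helper name `delta` is only for this illustration.

```python
# (BH, VU, IA) rows copied from the main results table.
baselines = {
    "Qwen-3.5-Max": (51.8, 35.5, 38.9),
    "MiniMax m2.5": (39.7, 39.3, 31.8),
    "GLM-4.5": (46.5, 45.0, 31.2),
    "Kimi K2.5": (45.6, 46.8, 44.6),
    "DeepSeek V3.2": (57.0, 38.9, 33.5),
    "Claude Sonnet 4.6": (58.5, 50.8, 50.3),
    "GPT-5.1": (57.4, 52.9, 49.4),
    "Gemini 3.1 Pro": (53.6, 60.2, 42.1),
    "qwen-code + Qwen-3.5-Max": (57.7, 41.3, 40.2),
    "qwen-code + MiniMax m2.5": (48.1, 39.1, 34.6),
    "qwen-code + Kimi K2.5": (59.6, 52.1, 49.9),
    "qwen-code + Claude Sonnet 4.6": (63.2, 54.3, 57.8),
    "Cursor + Kimi K2.5": (57.1, 55.2, 54.2),
    "Cursor + Claude Sonnet 4.6": (66.8, 61.4, 58.9),
}
ours = {
    "OpenGame + GameCoder-27B": (63.9, 57.0, 54.1),
    "OpenGame + Claude Sonnet 4.6": (72.4, 67.2, 65.1),
}

def delta(a, b):
    """Per-metric difference a - b, rounded to one decimal to hide float noise."""
    return tuple(round(x - y, 1) for x, y in zip(a, b))

# Best baseline score per metric; one system holds all three maxima.
best = tuple(max(s[i] for s in baselines.values()) for i in range(3))
strongest = [name for name, s in baselines.items() if s == best]
print(strongest)  # ['Cursor + Claude Sonnet 4.6']

print(delta(ours["OpenGame + Claude Sonnet 4.6"], baselines["Cursor + Claude Sonnet 4.6"]))
# (5.6, 5.8, 6.2)
print(delta(ours["OpenGame + GameCoder-27B"], baselines["qwen-code + Claude Sonnet 4.6"]))
# (0.7, 2.7, -3.7)
```

The check also confirms that "strongest baseline" is unambiguous: Cursor + Claude Sonnet 4.6 leads every baseline on all three metrics, so no trade-off between metrics is hidden in the comparison.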

5.2 Ablation: GameCoder-27B training pipeline

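| Stage | Training Components | BH | VU | IA |
| --- | --- | --- | --- | --- |
| Base Model | Qwen-3.5-27B in OpenGame | 62.8 | 53.8 | 49.8 |
| Stage 1 | + CPT | 63.2 | 54.7 | 50.6 |
| Stage 2 | + CPT + SFT | 63.5 | 55.7 | 52.5 |
| Stage 3 | + CPT + SFT + RL | 63.9 | 57.0 | 54.1 |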

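SFT gives the largest IA gain (+1.9), which suggests that high-quality synthetic QA matters for aligning with creative game specifications. RL further improves VU and IA, but the headline gain still comes mainly from the agent framework: the full training pipeline adds only +1.1 BH, +3.2 VU, and +4.3 IA over the base model running inside OpenGame.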
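The per-stage contributions can be read off by differencing consecutive rows of the ablation table. The sketch below does only that arithmetic; the variable and function names are illustrative.

```python
# (BH, VU, IA) rows copied from the training-pipeline ablation table.
stages = [
    ("Base", (62.8, 53.8, 49.8)),
    ("+ CPT", (63.2, 54.7, 50.6)),
    ("+ CPT + SFT", (63.5, 55.7, 52.5)),
    ("+ CPT + SFT + RL", (63.9, 57.0, 54.1)),
]

def increments(rows):
    """Gain of each stage over the previous one, rounded to one decimal."""
    out = {}
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        out[name] = tuple(round(c - p, 1) for c, p in zip(cur, prev))
    return out

gains = increments(stages)
for name, g in gains.items():
    print(name, g)
# + CPT (0.4, 0.9, 0.8)
# + CPT + SFT (0.3, 1.0, 1.9)
# + CPT + SFT + RL (0.4, 1.3, 1.6)

# Stage with the largest Intent Alignment gain (index 2 is IA).
print(max(gains, key=lambda name: gains[name][2]))  # + CPT + SFT
```

BH moves by at most 0.4 per stage, so nearly all of the training benefit shows up in VU and IA rather than in build stability.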

5.3 Ablation: agent workflow mechanisms

| Agent Configuration | BH | VU | IA |
| --- | --- | --- | --- |
| OpenGame Full Workflow | 72.4 | 67.2 | 65.1 |
| w/o Hook-Driven Implementation | 62.3 | 57.6 | 53.5 |
| w/o Three-Layer Reading | 67.8 | 61.9 | 56.5 |
| w/o Physics-First Classification | 70.2 | 64.6 | 61.6 |

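Hook-Driven Implementation matters most: removing it drops BH by 10.1 and IA by 11.6, and often leads to lifecycle-management errors. Removing Three-Layer Reading drops IA by 8.6, which indicates that progressive salience control is still necessary for multi-file synthesis.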

5.4 Ablation: Agent Evolution / Game Skills

| Template Architecture | Debugging Strategy | BH | VU | IA |
| --- | --- | --- | --- | --- |
| Static Skeleton | Static Rule Checklist | 60.5 | 54.8 | 51.2 |
| Static Skeleton | Full Living Protocol | 65.4 | 59.2 | 56.3 |
| Partial Evolved Library (2 Families) | Static Rule Checklist | 63.1 | 57.3 | 53.8 |
| Full Evolved Library (5 Families) | Static Rule Checklist | 66.3 | 60.7 | 57.9 |
| Full Evolved Library (5 Families) | Post-Execution Fixes Only | 69.5 | 63.8 | 61.4 |
| Full Evolved Library (5 Families) | Full Living Protocol | 72.4 | 67.2 | 65.1 |

The full evolved library and the living protocol together deliver gains in both scaffolding and debugging; relying on the static skeleton alone caps generation quality at BH=60.5 and IA=51.2.
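As a reading aid, the four corner rows of the table can be treated as a 2x2 design (static vs evolved templates, static vs living debugging) to see how the two skills combine. This decomposition is an analysis added in these notes, not one reported in the paper, and it ignores seed variance.

```python
# (BH, VU, IA) rows copied from the Agent Evolution ablation table.
base = (60.5, 54.8, 51.2)      # static skeleton + static rule checklist
debug = (65.4, 59.2, 56.3)     # static skeleton + full living protocol
template = (66.3, 60.7, 57.9)  # full evolved library + static rule checklist
both = (72.4, 67.2, 65.1)      # full evolved library + full living protocol

def gain(a, b):
    """Per-metric difference a - b, rounded to one decimal."""
    return tuple(round(x - y, 1) for x, y in zip(a, b))

template_gain = gain(template, base)  # (5.8, 5.9, 6.7)
debug_gain = gain(debug, base)        # (4.9, 4.4, 5.1)
joint_gain = gain(both, base)         # (11.9, 12.4, 13.9)

# How much the joint gain exceeds the sum of the two individual gains.
extra = tuple(
    round(j - t - d, 1) for j, t, d in zip(joint_gain, template_gain, debug_gain)
)
print(extra)  # (1.2, 2.1, 2.1)
```

On these numbers the combined gain slightly exceeds the sum of the individual gains on every metric, which is consistent with the two skills being complementary rather than redundant.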

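Figure 3 walkthrough: the x-axis is the maximum number of automated debugging iterations and the y-axis shows BH/VU/IA. At the smallest budget BH is only 58.4, which shows how fragile single-shot generation of a complex multi-file Phaser project is. Gains are steepest at small iteration budgets, after which the curves gradually plateau and end at the full system's 72.4/67.2/65.1.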

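Figure 4 walkthrough: OpenGame's advantage is uneven across game types. IA is 76.8 for Platformers, 71.4 for Top-Down Shooters, 66.5 for Arcade classics, 58.2 for Strategy, and 52.6 for Puzzle/UI. Games with strong physical and spatial constraints fit the template families better; abstract strategy or UI logic is more prone to silent state desynchronization.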

5.5 Limitations and conclusions

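The authors do not list limitations separately, but the results expose two boundaries. First, even full OpenGame leaves about 34.9% of weighted mechanical requirements partly or wholly unmet (100 - 65.1 IA). Second, IA for Strategy and Puzzle/UI is clearly lower than for Platformers and Top-Down Shooters, which indicates that mechanics weakly coupled to visuals, such as abstract logic state, inventory, and match-three rules, are harder for the current templates and debug protocol to cover reliably.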

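Overall conclusion: reliable game generation takes more than a stronger code model. It also needs persistent structural priors, reusable debug knowledge, and an evaluation protocol grounded in real execution. OpenGame combines these mechanisms into an agentic coding system for games and outperforms both direct LLMs and general-purpose agentic coding baselines on OpenGame-Bench.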