Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Paper: arXiv:2604.04746v3 Code: 代码搜索未找到开源实现 Code reference: N/A（no public GitHub repo found; pseudocode and mapping are reconstructed from paper text/source only）

1. Motivation（研究动机）

当前 text-to-image / unified multimodal generation 的核心瓶颈不是“能不能生成漂亮图”，而是复杂 prompt 下的 visual logic：空间关系、对象数量、属性绑定、世界知识约束很容易在 single-pass 生成中被一次性错误提交。例如论文 teaser 中的 bear/spoon 关系，模型可能生成 plausible 但 relation 错误的图。Text-only CoT 能分解语言推理，但它看不到中间图像，因此无法根据已经画出的状态判断“对象还没画”还是“画错了”。

本文要解决的是：让 unified multimodal model 在生成图像时显式地产生一条可监督、可解释的 interleaved trajectory，而不是直接输出最终图。这个问题值得做，因为一旦中间状态可见，模型就能在生成过程中执行局部计划、草图、检查和修正，最终把原本黑盒的一次性生成变成可控的逐步构图过程。

Figure 1 解读：左侧 single-pass generation 直接把完整 prompt 压到一次生成里，关系错误后没有可见修正机会；右侧 process-driven interleaved reasoning 把图像构成拆成 Plan → Sketch → Inspect → Refine 的循环，让模型在每一步同时维护文本计划和视觉状态。

2. Idea（核心思想）

核心思想是把 image generation 重新定义为 text tokens 与 vision tokens 共同演化的轨迹：每一步先用文本说明下一笔要画什么，再生成当前视觉草图，然后用文本检查草图是否违反 prompt，最后必要时再生成修正后的视觉状态。与传统 single-shot diffusion/T2I 或 text-only CoT 的区别在于，这里的 reasoning 被中间图像 grounding：文本会决定下一步视觉变化，而视觉中间态又反过来约束下一轮文本推理。

本文的关键创新不是新增一个外部 critic，而是构造 dense step-wise supervision，让 BAGEL-7B 端到端学会发出 interleaved sequence。中间态的歧义通过两类约束处理：visual intermediate 要保持 spatial/semantic consistency；textual intermediate 要保留已有视觉知识，同时识别并修正 prompt-violating elements。

3. Method（方法）

3.1 Overall framework

给定 prompt $T$ 和 unified multimodal model $P_{θ}$ ，生成过程被写成：

T \to s_{plan}^{(1)} \to v_{sketch}^{(1)} \to s_{inspect}^{(1)} \to v_{refine}^{(1)} \to \dots \to I

每个循环包含四个阶段：

Plan：根据原始 prompt 与累计上下文生成 step-level instruction <ins>...</ins> 和全局 scene description <des>...</des>。
Sketch：条件于该 instruction，生成更新后的 draft image，即当前视觉中间态。
Inspect：同时检查文本 instruction/global description 与原始 prompt 的一致性，以及 draft image 与 step-level instruction 的一致性。
Refine：若发现冲突，输出 <refine>...</refine> 并生成 corrected visual update。

视觉输出由 <|vision_start|> 与 <|vision_end|> 标记，模型在同一 autoregressive sequence 中切换文本与视觉生成模式。直觉上，这相当于把复杂 prompt 的全局约束分摊到多个局部更新；每一步只需要解决当前局部 region / object / relation，同时保留已生成状态，从而减少一次性求解所有关系的组合压力。

Figure 2 解读：pipeline 展示统一模型如何在单条序列中交替生成 pink text tokens 与 green vision tokens。文本侧不仅生成计划，还承担 inspect/refine 的语义诊断；视觉侧则把计划落实为 draft/refined state，使下一轮推理有真实图像上下文。

3.2 Intermediate reasoning data construction

Figure 3 解读：数据管线从完整 prompt 出发，先通过 scene graph subsampling 得到无矛盾的 incremental prompts，再用 GPT/filtering 与 self-sampling 构造文本冲突和图文对齐两类 critique 监督。它的目的不是只给最终图监督，而是覆盖 plan、sketch、inspect、refine 四个阶段。

三类训练子集如下：

Subset	Scale	作用
Multi-turn Generation	32,012 samples；avg prompt length 152.8；avg 3.51 images/sample；max 5 images/sample	用 scene graph 的 object/attribute/relation 子图采样生成局部递增 prompt，并用 Flux-Kontext 合成/过滤中间图，让模型学会 step-by-step plan + sketch。
Instruction-Intermediate Conflict	15,201 samples；6,905 positive / 8,296 negative	从 multi-turn fine-tuned model self-sample 中间轨迹，用 GPT 判断 textual intermediate 与原始 prompt 是否冲突，并生成 textual analysis / corrective instruction。
Image-Instruction Alignment	15,000 samples；5,000 positive / 10,000 negative	基于 Gen-Ref 扩展，判断当前 draft image 是否与 step instruction 对齐；负例包含 error analysis 与 refinement instruction。

3.3 Training objective

文本 token 用 next-token Cross-Entropy，只在 textual segments $s^{(i)}$ 上计算，并额外监督 <|vision_start|> / <|vision_end|> 以学习模态切换：

L_{CE}^{text} = - t \in [1, i] \sum lo g P_{θ} (s_{t} ∣ y_{< t}, T)

视觉 token 继承 BAGEL 的 Rectified Flow 生成范式：

z_{t}^{(i)} = t \cdot z_{0}^{(i)} + (1 - t) \cdot z_{1}^{(i)}, t \in [0, 1]

L_{MSE}^{image} = E [P_{θ} (z_{t}^{(i)} ∣ y_{< t}, T) - (z_{0}^{(i)} - z_{1}^{(i)})^{2}]

总目标为：

L_{total} = λ_{CE} \cdot L_{CE}^{text} + L_{MSE}^{image}

3.4 Pseudocode（基于论文重构；无公开代码）

import torch
import torch.nn.functional as F
 
 
def generate_process_driven_image(model, prompt, max_steps=5):
    """Plan -> Sketch -> Inspect -> Refine interleaved inference."""
    context = [prompt]
    image_state = None
    for step in range(max_steps):
        plan_text = model.generate_text(
            context,
            required_tags=("<ins>", "</ins>", "<des>", "</des>"),
        )
        context.append(plan_text)
 
        draft = model.generate_image(
            context,
            start_token="<|vision_start|>",
            end_token="<|vision_end|>",
        )
        context.append(draft)
        image_state = draft
 
        inspect_text = model.generate_text(
            context,
            required_tags=("<refine>", "</refine>"),
            allow_empty_refine=True,
        )
        context.append(inspect_text)
 
        if "<refine>" in inspect_text and not is_empty_refine(inspect_text):
            image_state = model.generate_image(
                context,
                start_token="<|vision_start|>",
                end_token="<|vision_end|>",
            )
            context.append(image_state)
 
        if model.emitted_final_vision_end_without_next_start(context):
            break
    return image_state, context

 
def build_multi_turn_generation_subset(prompt, scene_graph, flux_kontext, gpt_filter):
    """Scene-graph subsampling creates contradiction-free incremental prompts."""
    trajectory = []
    for subgraph in ordered_subgraph_expansion(scene_graph, max_steps=5):
        step_instruction = render_step_instruction(subgraph)
        if should_augment_instruction(step_instruction):
            step_instruction = rewrite_with_gpt(
                step_instruction,
                operations=("add", "refine", "remove", "swap"),
            )
        image = flux_kontext.generate(prompt=step_instruction)
        if gpt_filter.is_consistent(image=image, instruction=step_instruction):
            trajectory.append((step_instruction, image))
    return trajectory

 
def build_instruction_intermediate_conflict_subset(model, raw_prompt, gpt_judge):
    """Text-side critique supervision from the model's own sampled traces."""
    trace = model.sample_intermediate_reasoning_trace(raw_prompt)
    examples = []
    for text_state in trace.textual_intermediates:
        verdict = gpt_judge.check_prompt_consistency(text_state, raw_prompt)
        if verdict.is_conflict:
            critique = gpt_judge.write_error_analysis(text_state, raw_prompt)
            correction = gpt_judge.write_corrective_instruction(text_state, raw_prompt)
            examples.append((text_state, critique, correction, 0))
        else:
            explanation = gpt_judge.explain_consistency(text_state, raw_prompt)
            examples.append((text_state, explanation, None, 1))
    return examples

 
def build_image_instruction_alignment_subset(gen_ref_samples, gpt_judge):
    """Vision-side critique supervision for draft-image vs step-instruction mismatch."""
    examples = []
    for image, instruction in gen_ref_samples:
        aligned = gpt_judge.check_image_instruction_alignment(image, instruction)
        if aligned:
            explanation = gpt_judge.explain_alignment(image, instruction)
            examples.append((image, instruction, explanation, None, 1))
        else:
            error = gpt_judge.describe_visual_error(image, instruction)
            refine_instruction = gpt_judge.write_refinement_instruction(image, instruction)
            examples.append((image, instruction, error, refine_instruction, 0))
    return examples

 
def train_step(model, batch, lambda_ce):
    """Joint text CE + image Rectified-Flow MSE for interleaved sequences."""
    outputs = model(batch.context_tokens, batch.image_latents, return_flow=True)
 
    text_loss = F.cross_entropy(
        outputs.text_logits[batch.text_mask],
        batch.text_targets[batch.text_mask],
    )
    switch_loss = F.cross_entropy(
        outputs.text_logits[batch.vision_boundary_mask],
        batch.boundary_targets[batch.vision_boundary_mask],
    )
    predicted_flow = outputs.flow_prediction[batch.image_mask]
    target_flow = batch.z0[batch.image_mask] - batch.z1[batch.image_mask]
    image_loss = F.mse_loss(predicted_flow, target_flow)
 
    loss = lambda_ce * (text_loss + switch_loss) + image_loss
    loss.backward()
    return {"loss": loss, "text_loss": text_loss, "image_loss": image_loss}

3.5 Code-to-paper mapping（无公开实现）

代码搜索未找到开源实现，因此没有可验证的 source file / class / training config anchor；下表只记录论文结构到应有实现模块的映射，不能视为真实代码引用。

Paper concept	Paper source / figure	Expected implementation module if released	当前代码状态
Plan → Sketch → Inspect → Refine interleaved inference	Sec. 3 Framework；Fig. 2	generation loop / tokenizer-modal switch / stopping rule	未找到公开实现
Scene-graph subsampling multi-turn data	Sec. 3 Intermediate Reasoning Collection；Fig. 3	data construction for incremental prompts	未找到公开实现
Instruction-intermediate conflict reasoning	Sec. 3；Table dataset stats	self-sampling + GPT judge text critique builder	未找到公开实现
Image-instruction alignment reasoning	Sec. 3；Gen-Ref extension	visual draft vs instruction evaluator / refine data builder	未找到公开实现
Joint CE + Rectified Flow training	Sec. 3 Training Objectives	BAGEL-7B finetuning loss / multimodal batch collator	未找到公开实现

代码搜索记录（2026-05-16）：检查 arXiv PDF/HTML、Hugging Face paper page 中的链接；用 WebSearch 查询 exact title + GitHub、method name + GitHub、author/org + GitHub；用 gh search repos/code 查询 exact title 和 “Process-Driven Image Generation”；均未发现官方或可确认的公开实现。

4. Experimental Setup（实验设置）

Base model：BAGEL-7B unified multimodal understanding/generation model；所有参数 end-to-end finetune。
Training data：process-based interleaved reasoning dataset，包括约 32K multi-turn generation、15,201 instruction-intermediate conflict、15,000 image-instruction alignment；实验段落另以 30K/15K/15K 概述，精确统计见 Table dataset stats。
Training config：1 node × 8 NVIDIA H100 GPUs；10,000 steps；packed sequence length 33,000 tokens；learning rate $2 \times 1 0^{- 5}$ ；cosine decay。
Evaluation：GenEval（object-centric compositional T2I：single object、two objects、counting、colors、position、color attributes）；WISE（world knowledge reasoning：culture、time、space、biology、physics、chemistry）。Baselines 包含 generation-only models（PixArt- $Σ$ 、SDv2.1、DALL-E 2/3、SDXL、SD3-Medium、FLUX.1-dev 等）、unified multimodal models（Janus、Janus-Pro-7B、Show-o/Show-o2、BAGEL 等）和 process baselines（BAGEL+GPT Planner/Inspector、PARM TTS/RL+TTS）。

5. Experimental Results（实验结果）

5.1 Main benchmark results

GenEval（Single / Two / Counting / Colors / Position / Color Attr. / Overall）

Model	Single	Two	Counting	Colors	Position	Color Attr.	Overall
FLUX.1-dev (12B)	0.98	0.93	0.75	0.93	0.68	0.65	0.82
Janus-Pro-7B	0.99	0.89	0.59	0.90	0.79	0.66	0.80
BAGEL-7B*	0.99	0.95	0.76	0.87	0.51	0.56	0.77
Ours (BAGEL-7B + Process-Driven)	0.99	0.95	0.75	0.87	0.72	0.69	0.83

关键点：相对 BAGEL-7B*，overall 从 0.77 提升到 0.83；Position 从 0.51 到 0.72，Color Attr. 从 0.56 到 0.69。它的 overall 也略高于 12B FLUX.1-dev 的 0.82，但不是每个子项都最高。

WISE（Culture / Time / Space / Biology / Physics / Chemistry / Overall）

Model	Culture	Time	Space	Biology	Physics	Chemistry	Overall
FLUX.1-dev	0.48	0.58	0.62	0.42	0.51	0.35	0.50
BAGEL	0.76	0.69	0.75	0.64	0.75	0.58	0.70
Ours (BAGEL + Process-driven)	0.74	0.82	0.73	0.70	0.76	0.78	0.76

关键点：overall 从 0.70 到 0.76；Time 从 0.69 到 0.82，Chemistry 从 0.58 到 0.78，说明逐步的 textual-visual reasoning 对世界知识与结构语义更有帮助。

5.2 Process baseline comparison

Reasoning Strategy	Training Dataset	Inference Cost	GenEval
BAGEL + GPT (Planner)	-	50	0.60
BAGEL + GPT (Inspector)	-	50	0.80
PARM (TTS)	400K	1000	0.67
PARM (RL + TTS)	688K	1000	0.77
Ours (SFT)	62K	131	0.83

Ours 用 62K samples 达到 0.83，比 PARM (RL + TTS) 的 688K / 1000 cost / 0.77 更省数据和推理成本。论文解释为 semantic partitioning 更可解释：监督的是具体对象/关系/属性的中间视觉状态，而不是 PARM 式 blurry latent/noise-level intermediate。

5.3 Ablations

Ablation	Color	Position	Color Attr.
w/o aug.	0.81	0.58	0.50
w/o aug. + Self-critique	0.84	0.61	0.53
w/ aug.	0.82	0.67	0.62
w/ aug. + Self-critique	0.87	0.72	0.69

Critique construction	Color	Position	Color Attr.
baseline	0.82	0.67	0.62
+ scene graph	0.83	0.70	0.67
+ self-sampling	0.87	0.72	0.69

Intermediate supervision	Counting	Colors	Position	Color Attr.
baseline	0.61	0.84	0.66	0.62
+ Instruction-intermediate conflict	0.62	0.85	0.71	0.65
+ Image-instruction alignment	0.73	0.86	0.69	0.65
w/ both (ours)	0.75	0.87	0.72	0.69

Ablation 说明三件事：多样编辑操作（refine/remove/swap）尤其提升 relation/attribute；self-sampling critique 比 scene-graph symbolic correction 更贴近模型自身错误分布；text-side conflict 和 image-side alignment 是互补的，前者更推 Position，后者明显推 Counting。

Figure 4 解读：可视化轨迹展示每一步如何从 plan/sketch 到 inspect/refine。第二、三行分别对应两类错误：step-level instruction 与 overall prompt 冲突，以及 draft image 与 instruction 不一致；前者需要改 instruction，后者主要需要改视觉结果。

Figure 5 解读：补充可视化显示最终图在 GenEval/WISE prompt 上具有较强细节和审美质量。它主要支撑 qualitative fidelity，但真正证明方法有效的仍是上面的 GenEval/WISE 与 ablation 数字。

5.4 Limitations / open questions

作者没有单列 Limitations；结论中明确的未来方向是扩展到 video、3D space 和 real-time human-in-the-loop control。基于论文方法本身，当前可复现风险包括：数据构造依赖 GPT judge / Flux-Kontext，公开代码与 checkpoint 未找到；同时实验主要集中在 T2I compositional/world-knowledge benchmarks，还没有证明同样的 interleaved process 能直接迁移到长视频、3D 或交互式编辑。

Paper Notes

探索

Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

1. Motivation（研究动机）

2. Idea（核心思想）

3. Method（方法）

3.1 Overall framework

3.2 Intermediate reasoning data construction

3.3 Training objective

3.4 Pseudocode（基于论文重构；无公开代码）

3.5 Code-to-paper mapping（无公开实现）

4. Experimental Setup（实验设置）

5. Experimental Results（实验结果）

5.1 Main benchmark results

5.2 Process baseline comparison

5.3 Ablations

5.4 Limitations / open questions

目录