Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Paper: arXiv:2604.04746v3
Code: no open-source implementation found by code search
Code reference: N/A (no public GitHub repo found; pseudocode and mapping are reconstructed from paper text/source only)

1. Motivation

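The core bottleneck in current text-to-image and unified multimodal generation is not whether a model can produce an attractive image, but visual logic under complex prompts: spatial relations, object counts, attribute binding, and world-knowledge constraints are all easily committed wrongly, once and for all, in single-pass generation. In the bear/spoon example from the paper's teaser, for instance, a model may produce a plausible image whose relation is wrong. Text-only CoT can decompose the linguistic reasoning, but it cannot see intermediate images, so it cannot tell from what has already been drawn whether an object is still missing or has been drawn incorrectly.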

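The paper's goal is to have a unified multimodal model explicitly produce a supervisable, interpretable interleaved trajectory while generating an image, rather than output the final image directly. The problem is worth solving because once intermediate states are visible, the model can plan locally, sketch, inspect, and correct during generation, turning a black-box one-shot process into controllable step-by-step composition.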

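Figure 1 (interpretation): on the left, single-pass generation compresses the full prompt into one generation pass, leaving no visible opportunity to correct a wrong relation; on the right, process-driven interleaved reasoning decomposes image composition into a Plan → Sketch → Inspect → Refine loop, so the model maintains both a textual plan and a visual state at every step.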

2. Idea

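The core idea is to redefine image generation as a trajectory in which text tokens and vision tokens evolve together: at each step the model first states in text what to draw next, then generates the current visual sketch, then checks in text whether the sketch violates the prompt, and finally, if needed, generates a corrected visual state. Unlike conventional single-shot diffusion/T2I or text-only CoT, the reasoning here is grounded in intermediate images: text determines the next visual change, and the visual intermediate state in turn constrains the next round of textual reasoning.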

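The key contribution is not an added external critic but dense step-wise supervision, which lets BAGEL-7B learn end to end to emit the interleaved sequence. Ambiguity in intermediate states is handled by two kinds of constraints: visual intermediates must stay spatially and semantically consistent, while textual intermediates must retain what is already known from the visual state and identify and correct prompt-violating elements.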

3. Method

3.1 Overall framework

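Given a prompt and a unified multimodal model, generation is written as a single interleaved trajectory of textual and visual intermediate states, produced by repeating a four-stage loop.

Each iteration of the loop contains four stages: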

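  • Plan: from the original prompt and the accumulated context, generate a step-level instruction <ins>...</ins> and a global scene description <des>...</des>.
  • Sketch: conditioned on that instruction, generate the updated draft image, which is the current visual intermediate state.
  • Inspect: check both that the textual instruction and global description are consistent with the original prompt, and that the draft image is consistent with the step-level instruction.
  • Refine: if a conflict is found, output <refine>...</refine> and generate a corrected visual update.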

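Visual outputs are delimited by <|vision_start|> and <|vision_end|>, and the model switches between text and vision generation within a single autoregressive sequence. Intuitively, this spreads the global constraints of a complex prompt across multiple local updates: each step only has to resolve the current local region, object, or relation while preserving what has already been generated, which reduces the combinatorial pressure of satisfying every relation at once.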
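To make the tag layout concrete, here is a minimal sketch of how a decoded interleaved sequence could be split into typed segments. The names `Segment` and `split_interleaved` and the `IMG_0` / `IMG_1` placeholders are assumptions for illustration only; in the actual model the span between `<|vision_start|>` and `<|vision_end|>` holds vision tokens rather than text.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    kind: str     # "ins", "des", "refine", or "vision"
    content: str  # tag body; an opaque placeholder for vision spans


# One alternative per segment type; DOTALL lets a tag body span several lines.
_SEGMENT_RE = re.compile(
    r"<(?P<tag>ins|des|refine)>(?P<text>.*?)</(?P=tag)>"
    r"|<\|vision_start\|>(?P<vision>.*?)<\|vision_end\|>",
    re.DOTALL,
)


def split_interleaved(sequence: str) -> List[Segment]:
    """Split a decoded sequence into text-tag and vision segments, in order."""
    segments = []
    for match in _SEGMENT_RE.finditer(sequence):
        if match.group("tag"):
            segments.append(Segment(match.group("tag"), match.group("text").strip()))
        else:
            segments.append(Segment("vision", match.group("vision").strip()))
    return segments


demo = (
    "<ins>Draw a brown bear.</ins><des>A bear in a forest.</des>"
    "<|vision_start|>IMG_0<|vision_end|>"
    "<refine>The bear holds no spoon; add one.</refine>"
    "<|vision_start|>IMG_1<|vision_end|>"
)
segments = split_interleaved(demo)
print([s.kind for s in segments])
# ['ins', 'des', 'vision', 'refine', 'vision']
```

The resulting order mirrors one Plan → Sketch → Inspect → Refine iteration: a plan (`ins` plus `des`), a draft, a critique, and a corrected draft.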

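Figure 2 (interpretation): the pipeline shows how the unified model alternates between pink text tokens and green vision tokens within one sequence. The text side not only produces plans but also carries the semantic diagnosis for inspect/refine; the vision side realizes each plan as a draft or refined state, giving the next round of reasoning real image context.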

3.2 Intermediate reasoning data construction

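Figure 3 (interpretation): the data pipeline starts from the full prompt, first applies scene graph subsampling to obtain contradiction-free incremental prompts, and then uses GPT/filtering and self-sampling to build two kinds of critique supervision: textual conflict and image-text alignment. The aim is to supervise all four stages (plan, sketch, inspect, refine), not just the final image.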

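The three training subsets are:

| Subset | Scale | Role |
| --- | --- | --- |
| Multi-turn Generation | 32,012 samples; avg prompt length 152.8; avg 3.51 images/sample; max 5 images/sample | Samples object/attribute/relation subgraphs of the scene graph to produce locally incremental prompts, and synthesizes/filters intermediate images with Flux-Kontext, so the model learns step-by-step plan + sketch. |
| Instruction-Intermediate Conflict | 15,201 samples; 6,905 positive / 8,296 negative | Self-samples intermediate trajectories from the multi-turn fine-tuned model, uses GPT to judge whether each textual intermediate conflicts with the original prompt, and generates a textual analysis / corrective instruction. |
| Image-Instruction Alignment | 15,000 samples; 5,000 positive / 10,000 negative | Extends Gen-Ref to judge whether the current draft image aligns with the step instruction; negatives include an error analysis and a refinement instruction. |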
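The scene-graph subsampling behind the Multi-turn Generation subset can be sketched as below. This is only an illustration under stated assumptions: the `SceneGraph` representation, the one-object-per-step ordering, and the `render_prompt` format are hypothetical and not taken from the paper. The point it shows is that every step prompt is a subgraph of the full graph, so it cannot contradict the final prompt, and that the number of steps is capped at 5 to match the reported maximum of 5 images per sample.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneGraph:
    # object name -> list of attributes
    objects: Dict[str, List[str]] = field(default_factory=dict)
    # (subject, predicate, object) triples
    relations: List[Tuple[str, str, str]] = field(default_factory=list)


def incremental_subgraphs(graph: SceneGraph, max_steps: int = 5) -> List[SceneGraph]:
    """Return nested subgraphs; each step reveals one more object.

    If the graph has more objects than max_steps, the last step reveals
    all remaining objects so the final subgraph equals the full graph.
    """
    names = list(graph.objects)
    steps = min(len(names), max_steps)
    subgraphs = []
    for i in range(1, steps + 1):
        visible = names[:i] if i < steps else names
        subgraphs.append(
            SceneGraph(
                objects={n: list(graph.objects[n]) for n in visible},
                # A relation is kept only once both endpoints are visible.
                relations=[r for r in graph.relations if r[0] in visible and r[2] in visible],
            )
        )
    return subgraphs


def render_prompt(graph: SceneGraph) -> str:
    """Flatten a (sub)graph into a plain-text step prompt."""
    parts = [" ".join(attrs + [name]) for name, attrs in graph.objects.items()]
    parts += [f"{s} {p} {o}" for s, p, o in graph.relations]
    return "; ".join(parts)


full = SceneGraph(
    objects={"bear": ["brown"], "spoon": ["silver"], "table": []},
    relations=[("bear", "holding", "spoon"), ("spoon", "above", "table")],
)
prompts = [render_prompt(g) for g in incremental_subgraphs(full)]
for p in prompts:
    print(p)
# brown bear
# brown bear; silver spoon; bear holding spoon
# brown bear; silver spoon; table; bear holding spoon; spoon above table
```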

3.3 Training objective

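Text tokens are trained with next-token cross-entropy (CE), computed only on textual segments; the <|vision_start|> / <|vision_end|> markers are additionally supervised so that the model learns when to switch modality.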
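As a minimal sketch of how the CE supervision could be restricted to textual positions, the function below builds two masks over a token sequence: one for ordinary text tokens and one for the two vision boundary markers. The function name `build_loss_masks` and the string-level token representation are assumptions for illustration. The sketch also ignores the one-position shift of next-token prediction and treats every text token as supervised, which may differ from how the prompt itself is handled in training.

```python
from typing import List, Tuple

VISION_START = "<|vision_start|>"
VISION_END = "<|vision_end|>"


def build_loss_masks(tokens: List[str]) -> Tuple[List[bool], List[bool]]:
    """Return (text_mask, boundary_mask), one flag per token.

    text_mask     -> positions scored by the text cross-entropy
    boundary_mask -> the vision start/end markers, scored separately
    Tokens strictly inside a vision span get False in both masks,
    because they are trained by the image loss instead.
    """
    text_mask, boundary_mask = [], []
    inside_vision = False
    for tok in tokens:
        if tok == VISION_START:
            text_mask.append(False)
            boundary_mask.append(True)
            inside_vision = True
        elif tok == VISION_END:
            text_mask.append(False)
            boundary_mask.append(True)
            inside_vision = False
        elif inside_vision:
            text_mask.append(False)
            boundary_mask.append(False)
        else:
            text_mask.append(True)
            boundary_mask.append(False)
    return text_mask, boundary_mask


tokens = ["<ins>", "draw", "</ins>", VISION_START, "v0", "v1", VISION_END, "<refine>", "</refine>"]
text_mask, boundary_mask = build_loss_masks(tokens)
print(text_mask)      # [True, True, True, False, False, False, False, True, True]
print(boundary_mask)  # [False, False, False, True, False, False, True, False, False]
```

Keeping the two masks separate matches the `text_mask` / `vision_boundary_mask` split used in the pseudocode of Section 3.4.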

Visual tokens inherit BAGEL's Rectified Flow generation paradigm and are trained with an MSE loss on the predicted flow.

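The total objective is a weighted sum of the two terms, with a weight λ_CE applied to the cross-entropy loss.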
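The equations of this subsection did not survive in these notes, so the block below is a hedged reconstruction from the prose and from the pseudocode in Section 3.4, not a copy of the paper's formulas. The symbols are assumptions: x_i is the token at position i, 𝒯 is the set of supervised textual positions (including the two vision boundary markers), z_0 and z_1 are the two endpoints of the flow (one the clean image latent, the other Gaussian noise; the notes do not pin down which is which), and v_θ is the predicted flow. Normalization of the CE term may differ from the paper.

```latex
\begin{align}
\mathcal{L}_{\mathrm{CE}} &= -\frac{1}{|\mathcal{T}|} \sum_{i \in \mathcal{T}} \log p_\theta\left(x_i \mid x_{<i}\right) \\
z_t &= t\, z_0 + (1 - t)\, z_1, \qquad t \in [0, 1] \\
\mathcal{L}_{\mathrm{MSE}} &= \mathbb{E}_{t,\, z_0,\, z_1} \left\| v_\theta\left(z_t, t \mid x_{<\mathrm{img}}\right) - (z_0 - z_1) \right\|_2^2 \\
\mathcal{L} &= \lambda_{\mathrm{CE}}\, \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{MSE}}
\end{align}
```

With the linear interpolation in the second line, the time derivative of z_t is z_0 - z_1, which is why that difference appears as the regression target, matching `target_flow` in the pseudocode.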

3.4 Pseudocode (reconstructed from the paper; no public code)

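import re

import torch
import torch.nn.functional as F


def is_empty_refine(text):
    """Hypothetical helper (not from the paper): True if no <refine> body has content."""
    bodies = re.findall(r"<refine>(.*?)</refine>", text, flags=re.DOTALL)
    return not any(body.strip() for body in bodies)

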
def generate_process_driven_image(model, prompt, max_steps=5):
    """Plan -> Sketch -> Inspect -> Refine interleaved inference."""
    context = [prompt]
    image_state = None
    for step in range(max_steps):
        plan_text = model.generate_text(
            context,
            required_tags=("<ins>", "</ins>", "<des>", "</des>"),
        )
        context.append(plan_text)
 
        draft = model.generate_image(
            context,
            start_token="<|vision_start|>",
            end_token="<|vision_end|>",
        )
        context.append(draft)
        image_state = draft
 
        inspect_text = model.generate_text(
            context,
            required_tags=("<refine>", "</refine>"),
            allow_empty_refine=True,
        )
        context.append(inspect_text)
 
        if "<refine>" in inspect_text and not is_empty_refine(inspect_text):
            image_state = model.generate_image(
                context,
                start_token="<|vision_start|>",
                end_token="<|vision_end|>",
            )
            context.append(image_state)
 
        if model.emitted_final_vision_end_without_next_start(context):
            break
    return image_state, context
 
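def build_multi_turn_generation_subset(prompt, scene_graph, flux_kontext, gpt_filter):
    """Scene-graph subsampling creates contradiction-free incremental prompts.

    `prompt` is the full prompt the scene graph was parsed from; it is kept for
    bookkeeping only. The helpers and the `flux_kontext` / `gpt_filter`
    interfaces are placeholders, since no reference implementation is public.
    """
    trajectory = []
    previous_image = None  # assumption: each edit is conditioned on the last accepted image
    for subgraph in ordered_subgraph_expansion(scene_graph, max_steps=5):
        step_instruction = render_step_instruction(subgraph)
        if should_augment_instruction(step_instruction):
            # Diversify the edit type beyond pure additions.
            step_instruction = rewrite_with_gpt(
                step_instruction,
                operations=("add", "refine", "remove", "swap"),
            )
        image = flux_kontext.generate(prompt=step_instruction, image=previous_image)
        # Keep only steps whose image matches the instruction.
        if gpt_filter.is_consistent(image=image, instruction=step_instruction):
            trajectory.append((step_instruction, image))
            previous_image = image
    return trajectory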
 
def build_instruction_intermediate_conflict_subset(model, raw_prompt, gpt_judge):
    """Text-side critique supervision from the model's own sampled traces."""
    trace = model.sample_intermediate_reasoning_trace(raw_prompt)
    examples = []
    for text_state in trace.textual_intermediates:
        verdict = gpt_judge.check_prompt_consistency(text_state, raw_prompt)
        if verdict.is_conflict:
            critique = gpt_judge.write_error_analysis(text_state, raw_prompt)
            correction = gpt_judge.write_corrective_instruction(text_state, raw_prompt)
            examples.append((text_state, critique, correction, 0))
        else:
            explanation = gpt_judge.explain_consistency(text_state, raw_prompt)
            examples.append((text_state, explanation, None, 1))
    return examples
 
def build_image_instruction_alignment_subset(gen_ref_samples, gpt_judge):
    """Vision-side critique supervision for draft-image vs step-instruction mismatch."""
    examples = []
    for image, instruction in gen_ref_samples:
        aligned = gpt_judge.check_image_instruction_alignment(image, instruction)
        if aligned:
            explanation = gpt_judge.explain_alignment(image, instruction)
            examples.append((image, instruction, explanation, None, 1))
        else:
            error = gpt_judge.describe_visual_error(image, instruction)
            refine_instruction = gpt_judge.write_refinement_instruction(image, instruction)
            examples.append((image, instruction, error, refine_instruction, 0))
    return examples
 
def train_step(model, batch, lambda_ce):
    """Joint text CE + image Rectified-Flow MSE for interleaved sequences."""
    outputs = model(batch.context_tokens, batch.image_latents, return_flow=True)

    # Next-token CE on textual positions only.
    text_loss = F.cross_entropy(
        outputs.text_logits[batch.text_mask],
        batch.text_targets[batch.text_mask],
    )
    # Separate CE on <|vision_start|> / <|vision_end|> so the model learns modality switching.
    switch_loss = F.cross_entropy(
        outputs.text_logits[batch.vision_boundary_mask],
        batch.boundary_targets[batch.vision_boundary_mask],
    )
    # Rectified-Flow regression: the target is the difference of the two flow endpoints.
    predicted_flow = outputs.flow_prediction[batch.image_mask]
    target_flow = batch.z0[batch.image_mask] - batch.z1[batch.image_mask]
    image_loss = F.mse_loss(predicted_flow, target_flow)

    loss = lambda_ce * (text_loss + switch_loss) + image_loss
    loss.backward()
    # Detach logged values so callers do not keep the autograd graph alive.
    return {
        "loss": loss.detach(),
        "text_loss": text_loss.detach(),
        "switch_loss": switch_loss.detach(),
        "image_loss": image_loss.detach(),
    }

3.5 Code-to-paper mapping (no public implementation)

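The code search found no open-source implementation, so there is no verifiable source file, class, or training-config anchor. The table below only maps the paper's structure to the modules an implementation would be expected to contain; it must not be read as a reference to real code.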

| Paper concept | Paper source / figure | Expected implementation module if released | Current code status |
| --- | --- | --- | --- |
| Plan → Sketch → Inspect → Refine interleaved inference | Sec. 3 Framework; Fig. 2 | generation loop / tokenizer-modal switch / stopping rule | No public implementation found |
| Scene-graph subsampling multi-turn data | Sec. 3 Intermediate Reasoning Collection; Fig. 3 | data construction for incremental prompts | No public implementation found |
| Instruction-intermediate conflict reasoning | Sec. 3; Table dataset stats | self-sampling + GPT judge text critique builder | No public implementation found |
| Image-instruction alignment reasoning | Sec. 3; Gen-Ref extension | visual draft vs instruction evaluator / refine data builder | No public implementation found |
| Joint CE + Rectified Flow training | Sec. 3 Training Objectives | BAGEL-7B finetuning loss / multimodal batch collator | No public implementation found |

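Code search log (2026-05-16): checked the links on the arXiv PDF/HTML and Hugging Face paper pages; ran WebSearch queries for exact title + GitHub, method name + GitHub, and author/org + GitHub; ran gh search repos/code for the exact title and "Process-Driven Image Generation". None of these turned up an official or otherwise confirmable public implementation.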

4. Experimental Setup

  • Base model: BAGEL-7B, a unified multimodal understanding/generation model; all parameters are finetuned end to end.
  • Training data: the process-based interleaved reasoning dataset, comprising about 32K multi-turn generation samples, 15,201 instruction-intermediate conflict samples, and 15,000 image-instruction alignment samples. The paper's experiments paragraph summarizes these as 30K/15K/15K; the exact counts are in the dataset statistics table.
  • Training config: 1 node × 8 NVIDIA H100 GPUs; 10,000 steps; packed sequence length of 33,000 tokens; cosine learning-rate decay.
  • Evaluation: GenEval (object-centric compositional T2I: single object, two objects, counting, colors, position, color attributes) and WISE (world-knowledge reasoning: culture, time, space, biology, physics, chemistry). Baselines include generation-only models (PixArt-α, SDv2.1, DALL-E 2/3, SDXL, SD3-Medium, FLUX.1-dev, and others), unified multimodal models (Janus, Janus-Pro-7B, Show-o/Show-o2, BAGEL, and others), and process baselines (BAGEL+GPT Planner/Inspector, PARM TTS/RL+TTS).

5. Experimental Results

5.1 Main benchmark results

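GenEval (Single / Two / Counting / Colors / Position / Color Attr. / Overall)

| Model | Single | Two | Counting | Colors | Position | Color Attr. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FLUX.1-dev (12B) | 0.98 | 0.93 | 0.75 | 0.93 | 0.68 | 0.65 | 0.82 |
| Janus-Pro-7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| BAGEL-7B* | 0.99 | 0.95 | 0.76 | 0.87 | 0.51 | 0.56 | 0.77 |
| Ours (BAGEL-7B + Process-Driven) | 0.99 | 0.95 | 0.75 | 0.87 | 0.72 | 0.69 | 0.83 |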

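Key points: relative to BAGEL-7B*, overall rises from 0.77 to 0.83, Position from 0.51 to 0.72, and Color Attr. from 0.56 to 0.69, while Counting is essentially unchanged (0.76 to 0.75). The overall score is also slightly above the 0.82 of the 12B FLUX.1-dev, although the method is not the best on every sub-task.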
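The GenEval numbers above can be sanity-checked with a few lines of arithmetic. The sketch below assumes that Overall is the unweighted mean of the six sub-scores, which is how GenEval is usually reported; the dictionary names and helper functions are illustrative. Under that assumption, the recomputed means match every reported Overall value after rounding to two decimals.

```python
CATEGORIES = ["Single", "Two", "Counting", "Colors", "Position", "Color Attr."]

# Sub-scores copied from the GenEval table, in CATEGORIES order.
GENEVAL = {
    "FLUX.1-dev (12B)": [0.98, 0.93, 0.75, 0.93, 0.68, 0.65],
    "Janus-Pro-7B": [0.99, 0.89, 0.59, 0.90, 0.79, 0.66],
    "BAGEL-7B*": [0.99, 0.95, 0.76, 0.87, 0.51, 0.56],
    "Ours": [0.99, 0.95, 0.75, 0.87, 0.72, 0.69],
}
REPORTED_OVERALL = {
    "FLUX.1-dev (12B)": 0.82,
    "Janus-Pro-7B": 0.80,
    "BAGEL-7B*": 0.77,
    "Ours": 0.83,
}


def overall(scores):
    """Unweighted mean of the sub-scores, rounded like the table."""
    return round(sum(scores) / len(scores), 2)


def deltas(base, new):
    """Per-category change from base to new, rounded to two decimals."""
    return {c: round(n - b, 2) for c, b, n in zip(CATEGORIES, base, new)}


for name, scores in GENEVAL.items():
    print(name, overall(scores), REPORTED_OVERALL[name])

gain = deltas(GENEVAL["BAGEL-7B*"], GENEVAL["Ours"])
print(gain["Position"], gain["Color Attr."], gain["Counting"])
# 0.21 0.13 -0.01
```

The per-category deltas make the pattern explicit: almost all of the overall gain comes from Position and Color Attr., the two categories where the base model was weakest.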

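WISE (Culture / Time / Space / Biology / Physics / Chemistry / Overall)

| Model | Culture | Time | Space | Biology | Physics | Chemistry | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FLUX.1-dev | 0.48 | 0.58 | 0.62 | 0.42 | 0.51 | 0.35 | 0.50 |
| BAGEL | 0.76 | 0.69 | 0.75 | 0.64 | 0.75 | 0.58 | 0.70 |
| Ours (BAGEL + Process-driven) | 0.74 | 0.82 | 0.73 | 0.70 | 0.76 | 0.78 | 0.76 |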

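Key points: overall rises from 0.70 to 0.76, Time from 0.69 to 0.82, and Chemistry from 0.58 to 0.78, suggesting that step-by-step textual-visual reasoning helps most with world knowledge and structured semantics. Culture (0.76 to 0.74) and Space (0.75 to 0.73) dip slightly.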

5.2 Process baseline comparison

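| Reasoning Strategy | Training Dataset | Inference Cost | GenEval |
| --- | --- | --- | --- |
| BAGEL + GPT (Planner) | - | 50 | 0.60 |
| BAGEL + GPT (Inspector) | - | 50 | 0.80 |
| PARM (TTS) | 400K | 1000 | 0.67 |
| PARM (RL + TTS) | 688K | 1000 | 0.77 |
| Ours (SFT) | 62K | 131 | 0.83 |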

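With 62K samples, Ours reaches 0.83, using less data and less inference cost than PARM (RL + TTS) at 688K samples, cost 1000, and 0.77. The paper attributes this to semantic partitioning being more interpretable: the supervision targets intermediate visual states that correspond to concrete objects, relations, and attributes, rather than PARM-style blurry latent/noise-level intermediates.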
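To put the efficiency claim in numbers, the sketch below computes how many times more training data and inference cost each PARM variant uses relative to Ours, using only the figures from the table. The unit of the Inference Cost column is not stated in these notes, so the ratios are only meaningful within that column; the dictionary layout and the function name `ratios_vs_ours` are illustrative.

```python
# Figures copied from the process baseline table (training samples, inference cost, GenEval).
PROCESS_BASELINES = {
    "PARM (TTS)": {"train_samples": 400_000, "inference_cost": 1000, "geneval": 0.67},
    "PARM (RL + TTS)": {"train_samples": 688_000, "inference_cost": 1000, "geneval": 0.77},
    "Ours (SFT)": {"train_samples": 62_000, "inference_cost": 131, "geneval": 0.83},
}


def ratios_vs_ours(name, reference="Ours (SFT)"):
    """How many times more data / inference cost `name` uses than the reference."""
    row, ref = PROCESS_BASELINES[name], PROCESS_BASELINES[reference]
    return {
        "data_x": round(row["train_samples"] / ref["train_samples"], 1),
        "cost_x": round(row["inference_cost"] / ref["inference_cost"], 1),
        "geneval_gap": round(ref["geneval"] - row["geneval"], 2),
    }


print(ratios_vs_ours("PARM (RL + TTS)"))
# {'data_x': 11.1, 'cost_x': 7.6, 'geneval_gap': 0.06}
print(ratios_vs_ours("PARM (TTS)"))
# {'data_x': 6.5, 'cost_x': 7.6, 'geneval_gap': 0.16}
```

So the strongest PARM variant uses roughly 11 times the training data and over 7 times the inference cost while scoring 0.06 lower on GenEval.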

5.3 Ablations

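| Ablation | Color | Position | Color Attr. |
| --- | --- | --- | --- |
| w/o aug. | 0.81 | 0.58 | 0.50 |
| w/o aug. + Self-critique | 0.84 | 0.61 | 0.53 |
| w/ aug. | 0.82 | 0.67 | 0.62 |
| w/ aug. + Self-critique | 0.87 | 0.72 | 0.69 |

| Critique construction | Color | Position | Color Attr. |
| --- | --- | --- | --- |
| baseline | 0.82 | 0.67 | 0.62 |
| + scene graph | 0.83 | 0.70 | 0.67 |
| + self-sampling | 0.87 | 0.72 | 0.69 |

| Intermediate supervision | Counting | Colors | Position | Color Attr. |
| --- | --- | --- | --- | --- |
| baseline | 0.61 | 0.84 | 0.66 | 0.62 |
| + Instruction-intermediate conflict | 0.62 | 0.85 | 0.71 | 0.65 |
| + Image-instruction alignment | 0.73 | 0.86 | 0.69 | 0.65 |
| w/ both (ours) | 0.75 | 0.87 | 0.72 | 0.69 |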

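The ablations show three things: diverse editing operations (refine/remove/swap) help relations and attributes in particular; self-sampled critiques match the model's own error distribution better than scene-graph symbolic correction; and text-side conflict supervision and image-side alignment supervision are complementary, with the former pushing Position more (0.66 to 0.71) and the latter clearly pushing Counting (0.61 to 0.73).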

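Figure 4 (interpretation): the visualized trajectories show how each step moves from plan/sketch to inspect/refine. The second and third rows correspond to the two error types: a step-level instruction that conflicts with the overall prompt, and a draft image that is inconsistent with its instruction. The former requires changing the instruction; the latter mainly requires changing the visual result.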

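Figure 5 (interpretation): the supplementary visualizations show that the final images on GenEval/WISE prompts have strong detail and aesthetic quality. This mainly supports qualitative fidelity; the evidence that the method works still rests on the GenEval/WISE and ablation numbers above.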

5.4 Limitations / open questions

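The authors do not include a separate Limitations section; the future directions named in the conclusion are extension to video, 3D space, and real-time human-in-the-loop control. Judging from the method itself, the current reproducibility risks are that data construction depends on a GPT judge and Flux-Kontext, and that no public code or checkpoint was found. The experiments also focus on T2I compositional and world-knowledge benchmarks, and do not yet show that the same interleaved process transfers directly to long video, 3D, or interactive editing.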