DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Paper: arXiv:2602.12205
Code: DeepGenTeam/DeepGen
Code reference: main @ 7261969 (2026-03-02)
RL code reference: deepgenteam/deepgen_rl main @ c4d5dc2 (2026-02-21)

1. Motivation

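Existing unified multimodal models for image generation and editing tend to rely on very large parameter counts: the unified models the paper names typically exceed 10B parameters, and methods at the level of LongCat-Image / HunyuanImage 3.0 use 1.2B–5B training samples. DeepGen targets more than plain text-to-image: within a 5B parameter budget it aims to cover general generation, reasoning generation, text rendering, general editing, and reasoning editing at once, while lowering training and deployment cost.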

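The core bottleneck is that the final VLM layers carry mostly high-level semantics and tend to lose the fine-grained visual information that DiT generation needs; taking hidden states from a single layer additionally exposes the conditioning signal to layer-specific bias. On the RL side, alignment improves preference scores and text rendering, but with only a KL constraint the model's capabilities drift over long training runs, especially on complex instructions and reasoning generation.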

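This is worth doing because it replaces the "scale up to gain capability" route with "structural VLM-DiT coupling + staged data + multi-reward RL": if the approach holds, a 5B-class open-weight model can deliver generation, editing, and reasoning-oriented visual capability close to, or even above, that of 14B–80B models.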

2. Idea

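DeepGen's core insight is to stop treating the VLM as a final-layer semantic encoder only. Instead, multi-layer VLM representations serve as a set of low-, mid-, and high-level conditioning channels that are injected deeply into the DiT through Stacked Channel Bridging (SCB), while learnable think tokens let the VLM form an implicit reasoning state before it hands its output to the DiT.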

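Concretely, it uses Qwen2.5-VL-3B for understanding and reasoning and the SD3.5-Medium 2B DiT for generation, and closes the gap for a lightweight model with three training stages: Stage 1 aligns only SCB and the think tokens, Stage 2 unfreezes the DiT and runs joint SFT with LoRA on the VLM, and Stage 3 performs preference alignment with MR-GRPO, combining multiple rewards with an auxiliary SFT loss.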

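Compared with unified generation/editing models such as OmniGen, BAGEL, and LongCat-Image, DeepGen differs by more than a backbone swap: it explicitly preserves multi-layer VLM channel information and places reference-image VAE latents and target noise tokens in the same DiT self-attention sequence. In the RL stage it also group-normalizes each reward component independently, so that the scale of a single reward does not swamp the other signals.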

3. Method

3.1 Overall framework

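Figure 3: DeepGen is a two-branch VLM-DiT architecture. A reference image goes through the ViT / Qwen2.5-VL to produce semantic tokens and, in parallel, through the VAE to produce DiT latents; the target image enters the DiT as noise tokens. SCB fuses multi-layer features between the VLM and the DiT, and the frozen/trainable markers on the right of the figure correspond to the three stages: Pre-Training, SFT, and RL.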

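Figure 4: the training data covers text-to-image, image editing, reasoning generation/editing, text rendering, and application-oriented scenarios. The paper stresses that all three stages are trained with only about 50M samples rather than billions.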

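Intuitively, the DiT needs conditioning that is "generatable", whereas the final VLM layer behaves more like an "answerable" semantic summary. SCB concatenates multi-layer VLM hidden states along the channel dimension, which amounts to handing detail, layout, semantics, and reasoning state to the DiT together. The think tokens provide a fixed set of learnable queries that absorb context through VLM self-attention and finally become the DiT's prompt embeddings.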

3.2 Stacked Channel Bridging (SCB)

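In the paper's formulation, given VLM hidden states $H^{\ell}$, six layers spread across low, middle, and high depths are selected, $H^{\ell_1}, \dots, H^{\ell_6}$, concatenated along the channel dimension, and passed to a connector $C$:

$$H_{\mathrm{SCB}} = C\big([H^{\ell_1}; H^{\ell_2}; \dots; H^{\ell_6}]\big)$$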

In the released code, the SCB path is implemented in qwen2_5_vl_sd3_hf_dynamic_fusion.py: with output_hidden_states=True, it picks multi-layer hidden states via selected_layers = list(range(num_layers - 1, 0, -6)) and concatenates them along the channel dimension with torch.cat(selected_hiddens, dim=-1); projector_1 therefore projects from llm.config.hidden_size * 6 to the connector hidden size.

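Difference between the paper and the released code: the paper describes "six uniformly distributed VLM layers"; the code selects layers by a stride rule (num_layers - 1, num_layers - 7, ...) and does not hard-code six layers on that line. Because the input dimension of projector_1 is fixed at hidden_size*6, however, the rule must yield exactly six sets of hidden states in practice.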

import torch


def scb_forward(llm, connector, projector_1, projector_2, projector_3,
                input_embeds, attention_mask, position_ids, meta_queries):
    # Pseudocode for the released path: qwen2_5_vl_sd3_hf_dynamic_fusion.py
    # meta_queries is kept in the signature but is not used in this sketch.
    output = llm(
        inputs_embeds=input_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids,
        output_hidden_states=True,
        return_dict=True,
    )
    # Tuple of per-layer hidden states; index 0 is the embedding output.
    hidden_states = output.hidden_states
    num_layers = len(hidden_states) - 1
    # Stride-6 selection starting near the top; expected to yield six layers.
    selected_layers = list(range(num_layers - 1, 0, -6))
    selected_hiddens = [hidden_states[i] for i in selected_layers]
    # Channel concatenation: last dim becomes hidden_size * len(selected_layers).
    merged_hidden = torch.cat(selected_hiddens, dim=-1)

    x = projector_1(merged_hidden)  # hidden_size * 6 -> connector hidden size
    x = connector(x)
    pooled_prompt_embeds = projector_2(x.mean(dim=1))  # mean over the sequence dim
    prompt_embeds = projector_3(x)
    return pooled_prompt_embeds, prompt_embeds
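
The stride rule discussed above can be checked in isolation. The following is a minimal sketch, not the released code: `select_scb_layers` is a hypothetical helper, and the 36-layer depth and hidden size of 2048 are assumptions about a 3B-class decoder rather than values read from the released config.

```python
def select_scb_layers(num_hidden_states, stride=6):
    """Mirror the selection rule range(num_layers - 1, 0, -stride).

    num_hidden_states is len(output.hidden_states); it includes the embedding
    output at index 0, so the decoder has one fewer layer than entries.
    """
    num_layers = num_hidden_states - 1
    return list(range(num_layers - 1, 0, -stride))


# Assumption: 36 decoder layers (37 hidden-state entries), hidden size 2048.
layers_36 = select_scb_layers(37)
merged_dim_36 = 2048 * len(layers_36)

# A hypothetical 28-layer decoder yields only five layers under the same rule,
# which would not match a projector built for hidden_size * 6.
layers_28 = select_scb_layers(29)

print(layers_36, merged_dim_36)
print(layers_28)
```

Under these assumptions the rule picks indices 35, 29, 23, 17, 11, 5, so the top-most entry (index 36) is not among the selected layers, and the six-layer requirement holds only for depths where the stride happens to produce six indices.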

3.3 Unified generation/editing diffusion loss

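In text-to-image training, the target image is first encoded by the VAE into a latent $x$; a flow-matching timestep / sigma is then sampled at random, and the noisy latent is constructed as:

$$z_t = (1 - \sigma_t)\,x + \sigma_t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$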

The target of diff_loss() in the code is the velocity / flow target $\epsilon - x$, and the loss is a weighted MSE:

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t}\big[\,w(t)\,\| v_\theta(z_t, t, c) - (\epsilon - x) \|_2^2\,\big]$$
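
As a sanity check on the two formulas above, here is a small pure-Python sketch of the interpolation and the velocity target. It is illustrative only: the helper names are hypothetical, latents are flat lists instead of tensors, and the weight w(t) is fixed at 1.

```python
import random


def make_noisy_latent(x, eps, sigma):
    """z_t = (1 - sigma) * x + sigma * eps, elementwise over flat lists."""
    return [(1.0 - sigma) * xi + sigma * ei for xi, ei in zip(x, eps)]


def velocity_target(x, eps):
    """Flow-matching regression target v = eps - x."""
    return [ei - xi for xi, ei in zip(x, eps)]


def weighted_mse(pred, target, weight=1.0):
    """w(t) times the mean squared error between prediction and target."""
    return weight * sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # stand-in for a clean VAE latent
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]    # Gaussian noise
sigma = 0.3

z_t = make_noisy_latent(x, eps, sigma)
v = velocity_target(x, eps)

# With the exact velocity, moving from z_t back along v by sigma recovers x,
# because z_t - sigma * (eps - x) = x.
x_rec = [zi - sigma * vi for zi, vi in zip(z_t, v)]
loss_at_optimum = weighted_mse(v, v)
```

The recovery identity is why $\epsilon - x$ is the natural target: it is the derivative of $z_t$ with respect to $\sigma_t$ along the straight path between data and noise.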

Image editing shares the DiT loss with text-to-image, but it encodes the reference images into two kinds of conditioning: VLM semantic image embeddings, which enter the Qwen2.5-VL prompt, and VAE latents, which are passed to the transformer as cond_hidden_states.
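
To give a feel for what placing reference latents and target noise tokens in one sequence costs, the sketch below counts image tokens only. The 8× VAE downsampling and 2×2 patchification are assumptions typical of SD3-family models and have not been checked against the released config; both function names are hypothetical.

```python
def dit_token_count(height, width, vae_factor=8, patch_size=2):
    """Number of DiT tokens for one image under the assumed VAE/patch sizes."""
    cell = vae_factor * patch_size
    if height % cell or width % cell:
        raise ValueError("image size must be divisible by vae_factor * patch_size")
    return (height // cell) * (width // cell)


def joint_image_sequence_length(target_hw, ref_hws):
    """Target noise tokens plus reference-latent tokens in one sequence."""
    target_tokens = dit_token_count(*target_hw)
    ref_tokens = sum(dit_token_count(h, w) for h, w in ref_hws)
    return target_tokens + ref_tokens


t2i_len = joint_image_sequence_length((512, 512), [])
edit_len = joint_image_sequence_length((512, 512), [(512, 512)])
print(t2i_len, edit_len)
```

Under these assumptions each 512×512 reference image adds as many tokens as the target itself, so self-attention cost grows quickly with the number of references. Prompt tokens are not counted here.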

def deepgen_sft_loss(model, batch):
    # released path: qwen2_5_vl_sd3_hf_dynamic_fusion.py
    if batch["type"] == "text2image":
        target_latents = [model.pixels_to_latents(img[None])[0]
                          for img in batch["pixel_values"]]
        text_inputs = model.prepare_text2image_prompts(batch["texts"])
        query = model.meta_queries[None].expand(len(target_latents), model.num_queries, -1)
        inputs = model.prepare_forward_input(query_embeds=query, **text_inputs)
        pooled, seq = scb_forward_from_inputs(model, inputs)
        return model.diff_loss(target_latents, pooled, seq)
 
    if batch["type"] == "image2image":
        ref_latents = [[model.pixels_to_latents(ref[None])[0] for ref in refs]
                       for refs in batch["pixel_values_src"]]
        image_embeds, image_grid = model.get_semantic_features_dynamic(flatten_refs(batch))
        target_latents = [model.pixels_to_latents(img[None])[0]
                          for img in batch["pixel_values"]]
        text_inputs = model.prepare_image2image_prompts(
            batch["texts"], num_refs=[len(r) for r in batch["pixel_values_src"]],
            ref_lens=[len(x) for x in image_embeds],
        )
        query = model.meta_queries[None].expand(len(target_latents), model.num_queries, -1)
        inputs = model.prepare_forward_input(
            query_embeds=query,
            image_embeds=torch.cat(image_embeds),
            image_grid_thw=image_grid,
            **text_inputs,
        )
        pooled, seq = scb_forward_from_inputs(model, inputs)
        return model.diff_loss(target_latents, pooled, seq, cond_intput=ref_latents)

3.4 MR-GRPO with auxiliary SFT loss

For each prompt, MR-GRPO samples $G = 8$ images and scores them with separate rewards such as preference, CLIP similarity, and OCR. For the $k$-th reward, scores are first normalized within the group:

$$A_i^k = \frac{R_k(x_i^0, h) - \mathrm{mean}\big(\{R_k(x_j^0, h)\}_{j=1}^{G}\big)}{\mathrm{std}\big(\{R_k(x_j^0, h)\}_{j=1}^{G}\big)}$$

The per-reward advantages are then combined with weights and normalized batch-wise to obtain $\hat{A}_i$. The GRPO objective uses a per-step importance ratio:

$$r_t^i(\theta) = \frac{p_\theta(x_{t-\Delta t}^i \mid x_t^i, h)}{p_{\theta_{\mathrm{old}}}(x_{t-\Delta t}^i \mid x_t^i, h)}$$

The total loss combines this with a velocity-space KL term and an auxiliary SFT loss:

$$\mathcal{L}_{\mathrm{total}} = (1 - \lambda)\,\mathcal{L}_{\mathrm{GRPO}} + \lambda\,\mathcal{L}_{\mathrm{SFT}}, \qquad D_{\mathrm{KL}} = \| \hat{v}_\theta(x_t, t) - \hat{v}_{\mathrm{ref}}(x_t, t) \|_2^2$$

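In the released DeepGen-RL code, atrain_adv_type="gdpo" group-normalizes each reward function separately. compute_loss() first runs rollouts, computes rewards, and computes advantages, then accumulates gradients over micro-batches and timesteps, and finally performs one extra backward pass per step for the SFT-Aux loss.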

def mr_grpo_train_step(trainer, model, prompts):
    images, old_logp, prev_latents, next_latents, timesteps = [], [], [], [], []
    for _ in range(trainer.rollout_accumulation_steps):
        traj = trainer._generate_images_with_trajectory(model, prompts, cfg_prompts=[""] * len(prompts))
        images += traj.images
        old_logp.append(traj.log_probs)
        prev_latents.append(traj.prev_latents)
        next_latents.append(traj.pred_latents)
        timesteps.append(traj.timesteps)
 
    rewards, rewards_per_func = trainer._compute_rewards(prompts, images)
    adv = trainer._compute_advantages(rewards, rewards_per_func, prompts)
    adv = torch.clamp(adv, -trainer.adv_clip_max, trainer.adv_clip_max)
 
    for step_idx in sample_training_timesteps(timesteps, fraction=trainer.timestep_fraction):
        for mb in micro_batches(prev_latents, next_latents, timesteps, old_logp, adv):
            v_pred = trainer._compute_diffusion_loss_single_batch(model, mb.x_t, mb.t, mb.cond)
            _, logp, mean_policy, std_policy = sde_step_with_logprob(
                trainer.scheduler, v_pred, mb.t, mb.x_t, mb.x_next,
                eta=trainer.sde_eta, sampler_type=trainer.atrain_sde_sampler,
            )
            ratio = torch.exp(logp - mb.old_logp)
            unclipped = -mb.adv * ratio
            clipped = -mb.adv * torch.clamp(ratio, 1 - trainer.clip_range, 1 + trainer.clip_range)
            policy_loss = torch.maximum(unclipped, clipped).mean()
            kl_loss = velocity_kl(v_pred, trainer.reference_prediction(mb))
            trainer.accelerator.backward(policy_loss + trainer.beta * kl_loss)
 
    if trainer.sftaux_coef > 0 and trainer.global_step % trainer.sftaux_every_n_steps == 0:
        sft_loss = trainer._compute_sftaux_loss(model, trainer._next_sftaux_batch())
        trainer.accelerator.backward(trainer.sftaux_coef * sft_loss)
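
The reward-wise normalization described in this section can be illustrated on toy numbers. This is a sketch under assumptions, not the trainer's `_compute_advantages`: the reward names, values, and the pairing of names with the 0.7/0.3 weights are hypothetical, population standard deviation is assumed, and the "batch" is a single group.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-6):
    """Subtract the group mean and divide by the group standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]


def multi_reward_advantages(rewards_per_func, weights):
    """Normalize each reward within the group, then take the weighted sum."""
    normalized = {k: group_normalize(v) for k, v in rewards_per_func.items()}
    group_size = len(next(iter(rewards_per_func.values())))
    return [sum(weights[k] * normalized[k][i] for k in weights)
            for i in range(group_size)]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# One prompt, G = 8 samples, two hypothetical rewards on very different scales.
rewards = {
    "preference": [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6],
    "ocr_score": [50, 10, 60, 20, 70, 30, 80, 40],
}
weights = {"preference": 0.7, "ocr_score": 0.3}

adv = multi_reward_advantages(rewards, weights)
adv_hat = group_normalize(adv)  # stand-in for the batch-wise normalization

# Baseline for comparison: weight the raw rewards first, normalize afterwards.
raw_sum = [weights["preference"] * p + weights["ocr_score"] * o
           for p, o in zip(rewards["preference"], rewards["ocr_score"])]
naive_adv = group_normalize(raw_sum)

best_reward_wise = argmax(adv_hat)
best_naive = argmax(naive_adv)
```

In this toy group the naive variant prefers the sample with the largest raw OCR value even though its weight is only 0.3, while reward-wise normalization prefers the sample favored by the higher-weighted reward. This is the scale effect the per-reward normalization is meant to remove.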

3.5 Code-to-paper mapping

Code reference: main @ 7261969 (2026-03-02); the pseudocode and mapping are based on this commit. RL-specific rows additionally use deepgen_rl main @ c4d5dc2 (2026-02-21).

Paper Concept	Source File	Key Class/Function
SCB model config, 128 think tokens, Qwen2.5-VL + SD3.5-Medium	`configs/models/deepgen_scb.py`	`model.num_queries=128`, `connector.num_hidden_layers=6`, `freeze_lmm=True`
SCB multi-layer channel concatenation	`src/models/sd3_kontext/qwen2_5_vl_sd3_hf_dynamic_fusion.py`	`Qwen2p5VLStableDiffusion3HF`, `selected_layers`, `torch.cat(..., dim=-1)`
Connector transformer	`src/models/connector/modeling_connector.py`	`ConnectorEncoder`, `ConnectorEncoderLayer`, `ConnectorAttention`
Unified T2I / I2I SFT loss	`src/models/sd3_kontext/qwen2_5_vl_sd3_hf_dynamic_fusion.py`	`text2image_loss`, `image2image_loss`, `diff_loss`
Pre-training config	`configs/pretrain/deepgen_joint_pretrain_scb.py`	`max_iters=200000`, `lr=1e-4`, `freeze_transformer=True`, `lora_modules=None`
Joint SFT config	`configs/finetune/deepgen_joint_sft_scb.py`	`max_iters=400000`, `lr=5e-5`, `freeze_transformer=False`, `lora_rank=64`
RL entrypoint	`deepgen_rl/grpo_deepgen.py`	default `rollout_n=8`, `learning_rate=2e-6`, `beta=5e-7`, `clip_range=1e-4`, `sftaux_coef=0.0001`
MR-GRPO trainer	`deepgen_rl/trainer/grpo_deepgen_trainer.py`	`_compute_advantages`, `_compute_rewards`, `compute_loss`, `sde_step_with_logprob`
Reward mixture config	`assets/rl_datasets/deepgen/deepgen_train.yaml`	text rendering rewards `0.2/0.1/0.7`, general T2I rewards `0.7/0.3`

4. Experimental Setup

Datasets. Pre-Training uses 35M general-generation samples (text-to-image-2M, LAION-Aesthetic-6M, Megalith-10M, RedCaps-5M, CC-12M) and 6.6M general-editing samples (NHR-Edit, GPT-Image-Edit, ShareGPT-4o-Image-Edit, OpenGPT4o-Image-Edit, Nano-banana-consist, Pico-Banana, X2I2, Uniworld-Edit, in-house editing). SFT continues with 11M general-generation and 6.6M general-editing samples and adds UniReason-T2I (150K), UniReason-Edit (100K), and 560K text rendering / poster / Chinese poem samples. RL prompts fall into two groups: text rendering with a sampling weight of 3.0× and general T2I with a sampling weight of 1.0×; the auxiliary SFT data likewise up-weights text rendering.
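
For the RL prompt mixture, the 3.0× and 1.0× weights translate into sampling shares as sketched below. This assumes the weights act at the category level; if they instead scale per-sample probabilities, the shares would also depend on the size of each prompt set. The dictionary keys are hypothetical labels.

```python
def sampling_probabilities(weights):
    """Convert relative sampling weights into probabilities that sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


rl_prompt_weights = {"text_rendering": 3.0, "general_t2i": 1.0}
probs = sampling_probabilities(rl_prompt_weights)
print(probs)
```

Under that assumption, three out of four RL prompts would be text-rendering prompts.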

Baselines. General generation/editing is compared against Nano Banana, GPT-Image-1, Seedream 4.0, FLUX.1 Kontext Pro, Janus-Pro, FLUX.1 Dev, OmniGen2, BAGEL, Hunyuan-Image 3.0, Qwen-Image, LongCat-Image, Z-Image-Turbo, and others; reasoning generation/editing additionally includes STAR, UniWorld-V1, Qwen-Image edit, and Lumina-DiMOO; text rendering is compared against FLUX.1 dev, Z-Image-Turbo, GLM-Image, and others.

Metrics. GenEval measures compositional semantic alignment; DPG-Bench measures long-prompt instruction following; UniGenBench reports an overall general-generation score and a text sub-score; WISE / T2I-CoREBench measure reasoning generation grounded in world knowledge and in a philosophical taxonomy of reasoning; ImgEdit / GEdit-EN measure editing consistency and output quality; RISE / UniREditBench measure reasoning editing; CVTG-2K measures text rendering with Word Accuracy, NED, and CLIPScore.

Training config. Paper Tables 9/10 give the paper-level training numbers: Stage-I Pre-Training runs 200000 iterations at LR 1e-4, warmup ratio 0.01, and global batch size 512, training only the SCB connector; Stage-II SFT runs 400000 iterations at LR 5e-5, warmup ratio 0.01, and global batch size 768, training the SCB connector, the DiT, and VLM LoRA (rank 64, alpha 128, dropout 0.05).

The following can be verified directly in the released configs: configs/pretrain/deepgen_joint_pretrain_scb.py sets max_iters=200000, lr=1e-4, warmup_ratio=0.01, accumulative_counts=4; configs/finetune/deepgen_joint_sft_scb.py sets max_iters=400000, lr=5e-5, warmup_ratio=0.01, accumulative_counts=3; the corresponding dataloader files configs/datasets/deepgen_512_fix_pixels/{joint_pretrain.py,joint_sft_zh.py} expose per-dataloader batch_sizes=[4,4] / batch_size=4. The 512/768 values are therefore global batch sizes taken from the paper; the released code does not include a GPU count or distributed-launcher configuration that would reproduce them.

RL uses resolution 512×512, 50 denoising steps, group size G=8, timestep fraction 0.6, LR 2e-6, 1500 total steps, KL coefficient 5e-7, clip range 1e-4, SFT auxiliary coefficient 1e-4, global batch size 256, DeepSpeed ZeRO-2, and BF16.
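
The gap between the paper's global batch sizes and the released per-dataloader settings can be made concrete with a little arithmetic. The sketch below assumes global batch = per-device batch × gradient accumulation × number of devices, and treats accumulative_counts as the gradient-accumulation factor. The resulting device counts are implied values under those assumptions, not numbers stated in the paper or the repo.

```python
def implied_world_size(global_batch, per_device_batch, grad_accum):
    """Device count needed to reach global_batch under the assumed formula."""
    per_device_total = per_device_batch * grad_accum
    if global_batch % per_device_total:
        raise ValueError("global batch is not a multiple of per-device throughput")
    return global_batch // per_device_total


# Assumption: each device contributes 4 samples per forward/backward pass.
pretrain_gpus = implied_world_size(global_batch=512, per_device_batch=4, grad_accum=4)
sft_gpus = implied_world_size(global_batch=768, per_device_batch=4, grad_accum=3)

# Alternative reading: both pretrain dataloaders (batch_sizes=[4, 4]) feed every
# step, giving 8 samples per device per pass.
pretrain_gpus_two_loaders = implied_world_size(512, 8, 4)

print(pretrain_gpus, sft_gpus, pretrain_gpus_two_loaders)
```

The two readings differ by a factor of two for pre-training, which is exactly the kind of ambiguity that a launcher config would resolve.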

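Difference between the paper and the released code: paper Table 10 states 1,500 total RL training steps; deepgen_rl/scripts/train.sh currently passes only --num_train_epochs 10 explicitly and no --max_steps 1500, so reproducing the experiment requires either confirming the dataset/epoch conversion or setting the maximum step count manually.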
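
The epoch-to-step conversion mentioned above can be sketched as follows. How many prompts one optimizer step consumes is not stated in the material reviewed here, so `prompts_per_step` is a free assumption; the two values tried below correspond to reading the global batch size of 256 as images (256 / G = 32 prompts) or as prompts (256). Both function names are hypothetical.

```python
import math


def optimizer_steps(num_prompts, epochs, prompts_per_step):
    """Total steps when each epoch makes one pass over the prompt set."""
    return math.ceil(num_prompts / prompts_per_step) * epochs


def prompts_for_target(target_steps, epochs, prompts_per_step):
    """Prompt-set size that would give exactly target_steps over the epochs."""
    if target_steps % epochs:
        raise ValueError("target_steps must be divisible by epochs")
    return (target_steps // epochs) * prompts_per_step


# 1500 steps over 10 epochs means 150 steps per epoch.
needed_if_batch_counts_images = prompts_for_target(1500, 10, 32)
needed_if_batch_counts_prompts = prompts_for_target(1500, 10, 256)

print(needed_if_batch_counts_images, needed_if_batch_counts_prompts)
```

If the actual RL prompt set is far from either size, 10 epochs will not land on 1,500 steps, and passing --max_steps explicitly is the safer route.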

5. Experimental Results

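Figure 2: the bubble chart plots parameter count against generation/editing metrics. DeepGen's position illustrates the paper's claim: it is not the largest model, yet it approaches or exceeds larger models on several generation/editing benchmarks.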

5.1 Main benchmark numbers

Model	Params	GenEval↑	DPGBench↑	UniGenBench↑	ImgEdit↑	GEdit-EN↑
DeepGen 1.0 (SFT)	3B + 2B	0.86	87.05	74.18	4.09	7.12
DeepGen 1.0 (RL)	3B + 2B	0.87	87.90	75.74	4.14	7.17

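Reasoning generation: WISE overall goes from 0.72 (SFT) to 0.73 (RL); T2I-CoREBench overall goes from 45.7 to 46.5. Reasoning editing: RISE overall is 13.3 for SFT and 10.8 for RL; UniREditBench overall is 77.5 for SFT and 75.7 for RL, so both reasoning-editing scores are lower after RL. Text rendering: on CVTG-2K, DeepGen RL reaches Word Accuracy 0.7533, NED 0.8936, and CLIPScore 0.8278, versus 0.6605 / 0.8426 / 0.8227 for SFT; the gain is mainly in the accuracy of legible text.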

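Figure 5: over the 1,500 RL training steps, UniGenBench overall rises from about 0.747 to about 0.756, and the text sub-score from about 0.25 to about 0.34. This supports the paper's conclusion that MR-GRPO improves general quality and text rendering together.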

5.2 Ablations

Setting	GenEval↑	DPGBench↑	GEdit-EN↑	WISE↑	RISE↑
DeepGen 1.0 Settings	0.86	87.05	7.12	0.72	13.3
w/o SCB	0.86	85.55	6.75	0.70	12.6
w/o Think Tokens	0.87	86.35	7.02	0.68	11.7
w/o Activate VLM	0.85	86.74	6.93	0.71	12.9

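SCB's main gains show up on long instructions, editing, and reasoning: removing SCB drops DPGBench from 87.05 to 85.55, GEdit-EN from 7.12 to 6.75, and RISE from 13.3 to 12.6. Removing think tokens hurts WISE/RISE more (0.72 → 0.68 and 13.3 → 11.7), which suggests that the learnable queries help compress reasoning-oriented conditioning.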

Figure 6a: the overall curve shows that the auxiliary SFT loss is key to RL stability; without it, performance starts to degrade after about 300 steps and falls below the starting point.

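Figure 6b: on the text-generation sub-score, every RL variant improves text rendering, but without the SFT loss the improvement is slower and less stable. This suggests that KL acts as a constraint on the process, while the SFT loss anchors the resulting distribution.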

RL ablation numbers from Table 7:

Setting	GenEval↑	DPGBench↑	GEdit-EN↑	UniGenBench Text↑	UniGenBench Overall↑
Full RL	0.87	87.75	7.05	35.06	75.69
w/o Auxiliary SFT Loss	0.87	87.40	6.99	33.33	74.33
w/o Velocity KL	0.87	87.32	7.02	32.47	75.07
w/o Reward-wise Norm	0.86	87.73	7.02	32.18	75.27

5.3 Limitations / caveats

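The paper does not list its limitations in one place. Risks visible on the implementation side: the pretrain/SFT configs in the main DeepGen repo use placeholder model, data, and checkpoint paths; the DeepGen-RL script also requires a local reward service and checkpoint paths, and scripts/train.sh does not pin the 1500 max steps from paper Table 10. The conclusions are therefore credible as reported, but a strict reproduction still needs the authors' actual data paths, checkpoints, and service configuration.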

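Overall conclusion: DeepGen shows that a 5B VLM-DiT, given multi-layer VLM-to-DiT alignment, staged multi-task data, and multi-reward RL, can reach overall capability on generation, editing, reasoning, and text rendering that approaches or exceeds that of much larger models.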
