Emerging Properties in Unified Multimodal Pretraining (BAGEL)

Authors: Chaorui Deng*, Deyao Zhu*, Kunchang Li*, Chenhui Gou*, Feng Li*, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan Affiliations: ByteDance Seed, Shenzhen Institutes of Advanced Technology, Monash University, Hong Kong University of Science and Technology, UC Santa Cruz arXiv: 2505.14683 Project Page: bagel-ai.org GitHub: ByteDance-Seed/Bagel

1. Motivation (研究动机)

统一多模态理解与生成模型近年来取得了显著进展，但学术界开源模型与 GPT-4o、Gemini 2.0 等闭源系统之间仍存在较大差距。现有的统一模型主要面临以下问题：

数据瓶颈：大多数统一模型仍以 image-text paired data 为主进行训练，缺乏大规模的多模态交错数据（interleaved multimodal data），限制了复杂推理能力的习得
架构瓶颈：External Diffuser 方案通过少量 latent tokens 连接 LLM 与扩散模型，引入了信息瓶颈，不利于长上下文多模态推理
能力瓶颈：现有模型在理解和生成上的基础能力之外，缺乏自由形式图像编辑、世界导航、3D 操作等需要复杂组合推理的”涌现”能力

BAGEL 的核心论点是：通过在大规模、多样化的交错多模态数据上进行 bottleneck-free 的统一预训练，模型能够涌现出超越传统基准的复杂多模态推理能力。

2. Idea (核心思想)

BAGEL 提出了三个关键设计理念：

Integrated Transformer 架构：选择无瓶颈的集成式 Transformer，而非 External Diffuser，让理解与生成模块通过共享 self-attention 进行无损交互。采用 Mixture-of-Transformers (MoT) 架构，为理解和生成分别分配独立的参数子空间（QKV 投影 + FFN），同时共享 attention 计算
大规模交错多模态数据：建立了包含文本、图像、视频、网页数据的万亿级 token 训练语料，特别构建了 video-interleaved 和 web-interleaved 数据集，并引入 reasoning-augmented data 支持思维链推理
涌现能力 (Emerging Properties)：随着预训练规模扩大，模型展现出阶段性涌现：基础理解/生成能力先收敛，随后编辑与自由形式操作能力浮现，最终复杂多模态推理（Intelligent Editing）涌现

3. Method (方法)

3.1 整体架构

Figure 2 解读：BAGEL 采用 MoT 架构，包含两个 Transformer Expert——Understanding Expert 和 Generation Expert。两者在每层共享 Multi-modal Self Attention，但拥有独立的 QKV 投影和 FFN。输入端使用三种编码器：Text Tokenizer 处理文本，Understanding Encoder (SigLIP2-so400m/14) 处理视觉理解输入，Generation Encoder (FLUX VAE) 处理视觉生成的 latent 表示。文本输出通过 Next Token Prediction，图像输出通过 Velocity Prediction (Rectified Flow)。

模型规格：

总参数量：14B（7B active parameters，MoT 架构下理解和生成 Expert 各约 7B）
LLM backbone：Qwen2.5 LLM
Understanding encoder：SigLIP2-so400m/14，固定 384 分辨率，最大输入 $980 \times 980$ ，集成 NaViT 支持原生宽高比
Generation encoder：FLUX VAE，downsample ratio = 8，latent channel = 16， $2 \times 2$ patch embedding
归一化：RMSNorm
激活函数：SwiGLU
位置编码：RoPE + GQA + QK-Norm

3.2 Mixture-of-Transformers (MoT) 设计

Figure 3 解读：在 1.5B Qwen2.5 LLM 上对比 Dense、MoE、MoT 三种架构的训练 loss 曲线。左图为 CE Loss（理解任务），右图为 MSE Loss（生成任务）。MoT 在生成任务上收敛最快且最终 loss 最低，理解任务上也保持最优性能。这验证了将理解和生成参数解耦的设计优势。

三种架构设计对比：

架构	描述	特点
Dense Transformer	所有 token 共享全部参数	理解和生成目标在同一参数空间竞争
MoE variant	仅复制 FFN 层作为 generation expert	部分参数解耦
MoT variant	复制 Qwen2.5 LLM 全部可训练参数创建 generation expert	完全解耦，硬路由

MoT 的关键实现：

Generation Expert 专门处理 VAE tokens
Understanding Expert 处理 text tokens 和 ViT tokens
两者通过共享 self-attention 操作在同一 token 序列上交互
使用 hard routing：新建的 generation expert 只处理 VAE tokens，原始参数（understanding expert）处理文本和 ViT tokens

# MoT Forward Pass 伪代码
class Qwen2MoTDecoderLayer:
    def forward_train(self, hidden_states, packed_und_indexes, packed_gen_indexes):
        # 1. 分别对理解和生成 token 做 LayerNorm
        hidden_states[packed_und_indexes] = self.input_layernorm(
            hidden_states[packed_und_indexes]
        )
        hidden_states[packed_gen_indexes] = self.input_layernorm_moe_gen(
            hidden_states[packed_gen_indexes]
        )
 
        # 2. 共享 Attention（两个 Expert 分别计算 QKV，但在同一序列上做 attention）
        #    Understanding Expert: q_proj, k_proj, v_proj
        #    Generation Expert: q_proj_moe_gen, k_proj_moe_gen, v_proj_moe_gen
        q_und = self.q_proj(hidden_states[packed_und_indexes])
        q_gen = self.q_proj_moe_gen(hidden_states[packed_gen_indexes])
        # 合并后做 shared multi-modal self attention
        attn_output = flash_attn_varlen_func(merged_q, merged_k, merged_v)
 
        # 3. 独立 FFN 路由
        hidden_states[packed_und_indexes] += self.mlp(
            attn_output[packed_und_indexes]
        )
        hidden_states[packed_gen_indexes] += self.mlp_moe_gen(
            attn_output[packed_gen_indexes]
        )
        return hidden_states

3.3 Generalized Causal Attention

Figure 15 解读：展示了 BAGEL 的广义因果注意力机制。对于交错多模态输入，token 被分成多个连续 split（text、ViT、VAE）。每个 split 内的 text token 使用 causal attention，vision token 使用 bidirectional attention。后续 split 可以 attend 到前面所有 split 的 token。Noised VAE token 不被后续 token attend。

三类视觉 token：

Noised VAE tokens：加了扩散噪声的 VAE latents，仅用于计算 MSE loss（Rectified Flow 训练）
Clean VAE tokens：原始无噪声 latents，作为后续图像/文本生成的条件
ViT tokens：SigLIP2 编码的理解特征，统一输入格式并提升交错生成质量

Classifier-Free Guidance 的 dropout 策略：

Text tokens：以概率 0.1 随机丢弃
ViT tokens：以概率 0.5 随机丢弃
Clean VAE tokens：以概率 0.1 随机丢弃

# Generalized Causal Attention 伪代码
def create_attention_mask(splits, attn_modes):
    """
    splits: list of (modality, token_indices) per split
    attn_modes: 'causal' for text, 'bidirectional' for vision
    """
    mask = torch.zeros(seq_len, seq_len)
    for i, split_i in enumerate(splits):
        for j, split_j in enumerate(splits):
            if j > i:
                continue  # 后面的 split 不被前面的 attend
            if i == j:
                # 同一 split 内部
                if split_i.modality == 'text':
                    mask[split_i.idx, split_j.idx] = causal_mask()
                else:
                    mask[split_i.idx, split_j.idx] = 1  # bidirectional
            else:
                # 跨 split：可以 attend 到前面 split 的所有 token
                if split_j.modality != 'noised_vae':
                    mask[split_i.idx, split_j.idx] = 1
    return mask
 
# 使用 PyTorch FlexAttention 实现，比 naive scaled-dot-product 快约 2 倍
block_mask = create_block_mask(
    sparse_mask, B=1, H=num_heads,
    Q_LEN=seqlen, KV_LEN=seqlen,
    BLOCK_SIZE=128, _compile=True
)

3.4 Rectified Flow 生成

BAGEL 使用 Rectified Flow 进行视觉 token 的生成。给定干净的 latent $x_{1}$ 和噪声 $x_{0} \sim N (0, I)$ ，构建线性插值路径：

x_{t} = (1 - t) x_{0} + t x_{1}

模型预测速度场 $v_{t}$ ，训练目标为最小化 MSE：

L_{MSE} = ∥ v_{θ} (x_{t}, t) - (x_{1} - x_{0}) ∥^{2}

推理时使用 Euler 方法求解 ODE：

x_{t + Δ t} = x_{t} + v_{θ} (x_{t}, t) \cdot Δ t

Timestep embedding 直接加到 VAE tokens 的初始 hidden states 上（而非使用 AdaLN）。Diffusion timestep shift 从 pre-training 阶段的 1.0 提高到后续阶段的 4.0，以适应更高分辨率。

# Rectified Flow 推理伪代码
def generate_image(model, context, num_timesteps=50, cfg_text_scale=4.0, cfg_img_scale=2.0):
    x_t = torch.randn(h * w, latent_dim)  # 初始噪声
    timesteps = torch.linspace(1.0, 0.0, num_timesteps + 1)
    # 应用 timestep shift
    timesteps = timesteps / (timesteps + shift * (1 - timesteps))
 
    for i in range(num_timesteps):
        t = timesteps[i]
        dt = timesteps[i] - timesteps[i + 1]
 
        # 预测三个速度场 (unconditional, text-guided, image-guided)
        v_uncond = model.forward_flow(x_t, t, context=None)
        v_text = model.forward_flow(x_t, t, context=text_context)
        v_img = model.forward_flow(x_t, t, context=full_context)
 
        # Classifier-Free Guidance 组合
        v_guided = v_uncond + cfg_text_scale * (v_text - v_uncond) + cfg_img_scale * (v_img - v_text)
 
        # Renormalization 保持 norm 一致性
        v_guided = v_guided * (v_img.norm() / v_guided.norm())
 
        # Euler step
        x_t = x_t - v_guided * dt
 
    image = vae.decode(x_t)
    return image

3.5 数据构建

Figure 4a 解读：Video Interleaved Data 的构建流程。先对原始视频进行预处理和质量过滤，然后用 large VLM 生成帧间变化描述，再蒸馏到轻量级 small VLM（基于 Qwen2.5-VL-7B 微调），最终生成 4500 万条 temporally grounded 的交错序列。

Figure 4b 解读：Web Interleaved Data 的构建流程。基于 OmniCorpus 数据集，先用 LLM + fastText 两阶段进行 topic selection，再经过质量过滤，最后为每张图像添加 caption 作为 conceptual scaffold，生成 2000 万条结构化网页文档。

训练数据统计（Table 1）：

数据源	数据量 (M)	Token 量 (T)
Text Data	400	0.4
Image-Text-Pair Understanding	500	0.5
Image-Text-Pair Generation	1600	2.6
Interleaved Understanding	100	0.5
Interleaved Generation: Video	45	0.7
Interleaved Generation: Web	20	0.4

3.6 Reasoning-Augmented Data

受 O1 和 DeepSeek-R1 启发，构建了 50 万条 reasoning-augmented 样本，覆盖四类任务：

Text-to-Image Generation：手工编写简短/模糊的 T2I query，用 Qwen2.5-72B ICL 生成 query-guidance pair 和详细 prompt，再由 FLUX.1-dev 生成目标图像。训练三元组：(query, reasoning trace, image)
Free-form Image Manipulation：用 VLM 生成源图/目标图的 caption，结合 DeepSeek-R1 风格的推理指令生成 reasoning trace
Conceptual Edits：从 web interleaved 数据中采样图像对，用三阶段 VLM 流程构建高质量 QA 样本
Abstract Edits：处理需要高层概念推理的图像编辑任务

3.7 多阶段训练

参数	Alignment	PT	CT	SFT
Learning rate	$1 \times 1 0^{- 3}$	$1.0 \times 1 0^{- 4}$	$1.0 \times 1 0^{- 4}$	$2.5 \times 1 0^{- 5}$
LR scheduler	Cosine	Constant	Constant	Constant
Training steps	5K	200K	100K	15K
Training tokens	4.9B	2.5T	2.6T	72.7B
Gen resolution (min, max)	-	(256, 512)	(512, 1024)	(512, 1024)
Und resolution (min, max)	(378, 378)	(224, 980)	(378, 980)	(378, 980)
Loss weight (CE : MSE)	-	0.25 : 1	0.25 : 1	0.25 : 1
Diffusion timestep shift	-	1.0	4.0	4.0

四个训练阶段：

Alignment：对齐 SigLIP2 ViT 与 Qwen2.5 LLM，仅训练 MLP connector，固定分辨率 $378 \times 378$
Pre-training (PT)：添加 QK-Norm，除 VAE 外全部参数可训练，2.5T tokens，原生分辨率策略
Continued Training (CT)：提高视觉输入分辨率，增大 interleaved data 采样比例，2.6T tokens
Supervised Fine-tuning (SFT)：高质量子集微调，总计 72.7B tokens

3.8 代码到论文映射表

Paper Concept	Source File	Key Class/Function
BAGEL 主模型	`modeling/bagel/bagel.py`	`Bagel`, `BagelConfig`
MoT Decoder Layer	`modeling/bagel/qwen2_navit.py`	`Qwen2MoTDecoderLayer`, `PackedAttentionMoT`
MoE Decoder Layer	`modeling/bagel/qwen2_navit.py`	`Qwen2MoEDecoderLayer`
SigLIP2 Encoder	`modeling/bagel/siglip_navit.py`	SigLIP NaViT implementation
Qwen2 LLM backbone	`modeling/qwen2/`	Qwen2 model components
VAE Encoder/Decoder	`modeling/autoencoder.py`	AutoEncoder
Generalized Causal Attention	`modeling/bagel/bagel.py`	`create_sparse_mask`, `create_block_mask`
Rectified Flow Generation	`modeling/bagel/bagel.py`	`generate_image`, `_forward_flow`
CFG Guidance	`modeling/bagel/bagel.py`	`_forward_flow` (cfg_text_v_t, cfg_img_v_t)
Interleaved Inference	`inferencer.py`	`InterleaveInferencer`
Training Pipeline	`train/pretrain_unified_navit.py`	Main training script
Data Processing	`data/`	`t2i_dataset.py`, `vlm_dataset.py`

4. Experimental Setup (实验设置)

评估基准

多模态理解（6 个 benchmark）：

MME (MME-P + MME-S)、MMBench (1.0-EN)、MM-Vet、MMMU、MathVista、MMVP

文生图生成：

GenEval：object-centric 评估（Single Obj., Two Obj., Counting, Colors, Position, Color Attribution）
WISE：复杂语义理解和世界知识评估

图像编辑：

GEdit-Bench：基于真实用户请求的编辑评估，GPT-4.1 自动评分（G_SC, G_PQ, G_O）
IntelligentBench：350 个 free-form 图像操作样本，GPT-4o 评分（0-100 scale）

其他评估：

RISEBench、KRIS-Bench（reasoning-based editing）

基线模型

理解专用：InternVL2/2.5、Qwen2.5-VL、Qwen2-VL、LLaVA-OV、DeepSeek-VL2
统一模型：Janus-Pro、Show-o、MetaQuery-XL、VILA-U、Chameleon、LlamaFusion、MUSE-VL
生成专用：FLUX.1-dev、SD3-Medium、SDXL、DALL-E 3、PixArt-Alpha
闭源模型：GPT-4o、Gemini 2.0

5. Experimental Results (实验结果)

5.1 多模态理解 (Table 4)

BAGEL (7B MoT) 在统一模型中大幅领先：

Model	MME-P	MME-S	MMBench	MMMU	MM-Vet	MathVista	MMVP
Janus-Pro 7B	1567	1685	79.2	41.0	50.0	66.6	-
MetaQuery-XL 7B	-	-	83.5	58.6	66.6	-	-
BAGEL 7B MoT	1687	2388	85.0	55.3	67.2	73.1	69.3

关键发现：

BAGEL 在 MMMU 上超过 Janus-Pro 14.3 分，在 MM-Vet 上超过 17.1 分
与理解专用模型 Qwen2.5-VL 和 InternVL2.5 相比，BAGEL 在大多数 benchmark 上也有优势
MoT 架构有效缓解了理解和生成任务之间的冲突

5.2 文生图生成 (Table 5 & 6)

GenEval Overall Score：

Model	Overall
FLUX.1-dev	0.82
MetaQuery-XL	0.80
Janus-Pro-7B	0.80
BAGEL	0.82
BAGEL (w/ rewriter)	0.88

WISE Benchmark (世界知识评估)：

Model	Overall
GPT-4o	0.80
MetaQuery-XL	0.55
BAGEL	0.52
BAGEL w/ Self-CoT	0.70

关键发现：

BAGEL 在 GenEval 上达到 88%（使用 LLM rewriter），超过所有开源模型和 FLUX.1-dev
在 WISE 上，BAGEL with CoT 达到 0.70，超过所有开源模型，仅次于 GPT-4o

5.3 图像编辑 (Table 7 & 8)

GEdit-Bench (EN)：

Model	G_SC	G_PQ	G_O
GPT-4o	7.85	7.62	7.53
Step1X-Edit	7.09	6.76	6.70
BAGEL	7.36	6.83	6.52

IntelligentBench：

Model	Score
GPT-4o	78.9
Gemini 2.0	57.6
Step1X-Edit	14.9
BAGEL	44.9
BAGEL w/ Self-CoT	55.3

关键发现：

BAGEL 在 GEdit-Bench 上与 Step1X-Edit 竞争力相当，也超过 Gemini 2.0
在 IntelligentBench 上，BAGEL (44.9) 大幅超过 Step1X-Edit (14.9)，加入 CoT 后达到 55.3，接近 Gemini 2.0 (57.6)

5.4 涌现能力分析

Figure 7a 解读：理解任务的性能曲线。约 0.18T tokens 时达到 85% 峰值性能，说明基础理解能力收敛较早。

Figure 7b 解读：GenEval 生成分数随训练 token 数的变化。约 0.68T tokens 时达到 85% 峰值，使用 LLM rewriter 进一步提升。基础生成能力也较早饱和。

Figure 7c 解读：GEdit 图像编辑分数。需要约 2.64T tokens 才达到 85% 峰值，收敛明显更慢。采用 VAE+ViT 特征组合比仅用 VAE 效果更好。

Figure 7d 解读：IntelligentBench 智能编辑分数。需要约 3.61T tokens 才达到 85% 峰值，展现出典型的涌现行为——在 3T tokens 前性能较低且增长缓慢，之后急剧提升（从 15 跃升至 45）。去除 ViT tokens 导致 16% 的性能下降，说明视觉语义理解对复杂推理至关重要。

涌现规律总结：

不同能力在不同训练阶段涌现：理解/生成 → 基础编辑 → 智能编辑
收敛所需 token 数：理解 (0.18T) < 生成 (0.68T) < 编辑 (2.64T) < 智能编辑 (3.61T)
ViT 特征对 IntelligentBench 影响显著（-16%），但对 GEdit-Bench 影响小

5.5 Thinking 增强 (Generation/Editing with Thinking)

Figure 13a 解读：展示了 CoT thinking 对 T2I 生成的帮助。对于复杂或模糊的 prompt（如”A car made of small cars”），直接生成效果不佳，但加入 thinking 后，模型先进行推理规划再生成，输出质量显著提升。

Figure 13b 解读：展示了 CoT thinking 对 Intelligent Editing 的帮助。模型先用文字推理理解输入图像、用户需求和目标状态，再进行图像生成，在需要世界知识和多步推理的编辑任务上效果显著提升。

Paper Notes

探索

Emerging Properties in Unified Multimodal Pretraining (BAGEL)

Emerging Properties in Unified Multimodal Pretraining (BAGEL)

1. Motivation (研究动机)

2. Idea (核心思想)

3. Method (方法)

3.1 整体架构

3.2 Mixture-of-Transformers (MoT) 设计

3.3 Generalized Causal Attention

3.4 Rectified Flow 生成

3.5 数据构建

3.6 Reasoning-Augmented Data

3.7 多阶段训练

3.8 代码到论文映射表

4. Experimental Setup (实验设置)

评估基准

基线模型

5. Experimental Results (实验结果)

5.1 多模态理解 (Table 4)

5.2 文生图生成 (Table 5 & 6)

5.3 图像编辑 (Table 7 & 8)

5.4 涌现能力分析

5.5 Thinking 增强 (Generation/Editing with Thinking)

目录