Qwen-Image-VAE-2.0 Technical Report

Paper: arXiv:2605.13565
Code: alibaba/OmniDoc-TokenBench. The public repo contains benchmark/evaluation scripts; no Qwen-Image-VAE-2.0 training or model implementation was found.
Code reference: main @ f2815a7 (2026-05-14)

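1. Motivation

Latent Diffusion Models typically use a VAE to compress images into a latent space, and a DiT or other diffusion model then operates on the latents. The spatial compression commonly used in industry is $f8$, but as native high-resolution synthesis keeps scaling up, the number of latent tokens stays large and compute cost becomes the bottleneck. Raising compression directly to $f16$ or $f32$ in turn degrades reconstruction fidelity: text, thin lines, page layout, and high-frequency edges in particular get blurred.

The report aims to build a family of high-compression VAEs that keep both reconstruction fidelity and diffusability at $f16$ and $f32$. It cares not only about whether pixel reconstructions look good but also about whether the latent space is easy for a downstream diffusion/DiT model to learn; it therefore introduces semantic alignment and builds a dedicated text-rich benchmark to measure OCR-level legibility.

The problem matters because, if a VAE can keep text legible and latent semantics stable at a higher compression ratio, the token count, memory footprint, and inference cost of the downstream image generation model all drop without sacrificing typography and document-like visual content.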
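To make the cost argument concrete, the number of latent positions a downstream model must process can be counted for each compression factor. This is a minimal sketch: the 2048×2048 input size is an illustrative assumption, `latent_tokens` is a hypothetical helper, and it assumes one token per latent position (no extra patchification inside the diffusion model).

```python
def latent_tokens(height: int, width: int, f: int) -> int:
    """Count latent positions for an image compressed by spatial factor f."""
    if height % f or width % f:
        raise ValueError("height and width must be divisible by f")
    return (height // f) * (width // f)


# Assumed example resolution: 2048 x 2048.
for f in (8, 16, 32):
    print(f"f{f}: {latent_tokens(2048, 2048, f)} latent positions")
# f8: 65536 latent positions
# f16: 16384 latent positions
# f32: 4096 latent positions
```

Under these assumptions, each doubling of the compression factor divides the number of latent positions by four.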

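2. Idea

The core insight is that the bottleneck of a high-compression VAE is not simply an underpowered decoder: spatial information is lost when it is downsampled too early, and the semantic structure of the latent is not well suited to diffusion. Qwen-Image-VAE-2.0 uses Global Skip Connections (GSC) to fold pixel-level high-frequency signal directly into the deep latent, and uses semantic alignment to DINOv2 middle-layer features to make the latent more generation-friendly.

The key innovations are: expanded latent channels that compensate for the information lost to spatial compression; GSC, which provides a direct path for detail recovery; an attention-free, asymmetric backbone that lowers encoder/decoder cost at high resolution; dropping the KL and GAN losses and keeping only reconstruction, LPIPS, and semantic alignment; and synthetic document rendering plus OmniDoc-TokenBench to target text reconstruction specifically.

Compared with traditional KL-regularized or adversarial VAEs, the difference is that the VAE is turned from a "compressor close to a Gaussian prior" into a "high-fidelity latent tokenizer suited to diffusion". Unlike evaluations that only look at ImageNet/FFHQ, it uses OCR-based NED to measure directly whether text-rich reconstructions are still readable.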

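3. Method

3.1 Overall framework: GSC + expanded channels + attention-free backbone

Reading Figure 1: The figure compares No Skip Connection (NSC), Local Skip Connection (LSC), and Global Skip Connection (GSC). GSC uses Space-to-Channel (S2C) to fold the local pixel information of the input image into the channel dimension, bypassing the information bottleneck of early downsampling; the loss/PSNR curves on the right show that, when f16c64 is trained from scratch, GSC converges faster and reconstructs better. The design is then used across the whole Qwen-Image-VAE-2.0 family.

Intuitively, the danger of $f16$/$f32$ compression is that one latent cell has to carry the structure and text detail of a larger image patch. If all information must pass through successive downsampling, thin lines and character strokes are wiped out first; GSC acts as a high-frequency bypass that delivers pixel-level detail to the deep layers early, so the decoder does not have to recover text edges out of nothing from an over-compressed representation.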
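The Space-to-Channel step can be illustrated on a tiny example. This is a sketch under stated assumptions, not the paper's implementation: `space_to_channel` is a hypothetical pure-Python helper, and the channel ordering follows the common pixel-unshuffle layout, which the paper does not specify. The point it shows is that folding is lossless: every pixel survives, only moved from the spatial grid into channels.

```python
def space_to_channel(image, r):
    """Fold each r x r spatial block into channels (pixel-unshuffle layout).

    image is a nested list indexed as image[c][y][x]. The result has
    C * r * r channels and spatial size (H / r, W / r).
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    if height % r or width % r:
        raise ValueError("height and width must be divisible by r")
    out = []
    for c in range(channels):
        for i in range(r):          # row offset inside each block
            for j in range(r):      # column offset inside each block
                out.append([[image[c][y * r + i][x * r + j]
                             for x in range(width // r)]
                            for y in range(height // r)])
    return out


# One-channel 4x4 image holding the pixel values 0..15.
image = [[[y * 4 + x for x in range(4)] for y in range(4)]]
folded = space_to_channel(image, 2)
print(len(folded), len(folded[0]), len(folded[0][0]))  # 4 2 2
print(folded[0])                                       # [[0, 2], [8, 10]]
```

With the same folding applied to an RGB image at $f16$, each latent position would receive 3 × 16 × 16 = 768 raw pixel values; how the model projects these into the latent is not described in the notes above.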

Table 1 lists the model family configurations:

Model	f	C	d_enc	d_dec	n_layer	Residual	Params (Enc / Dec)
Qwen-Image-VAE-2.0-f16c64	16	64	96	144	5	GSC	76M / 248M
Qwen-Image-VAE-2.0-f16c128	16	128	96	144	5	GSC	76M / 248M
Qwen-Image-VAE-2.0-f32c128	32	128	96	144	6	GSC	77M / 250M
Qwen-Image-VAE-2.0-f32c192	32	192	96	144	6	GSC	78M / 250M
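The role of the expanded latent channels can be seen by computing how many latent values are kept per input value, $C / (3 f^2)$ for an RGB image. The sketch below is illustrative: `latent_ratio` is a hypothetical helper, the four configurations are the ones in Table 1, and the f8c16 row is an added reference point (the layout of the FLUX.1-dev VAE), not part of Table 1.

```python
from fractions import Fraction


def latent_ratio(f: int, c: int, image_channels: int = 3) -> Fraction:
    """Latent values stored per input value: C / (f * f * image_channels)."""
    return Fraction(c, f * f * image_channels)


configs = {
    "f8c16 (reference, not in Table 1)": (8, 16),
    "f16c64": (16, 64),
    "f16c128": (16, 128),
    "f32c128": (32, 128),
    "f32c192": (32, 192),
}
for name, (f, c) in configs.items():
    print(f"{name}: {latent_ratio(f, c)}")
# f8c16 (reference, not in Table 1): 1/12
# f16c64: 1/12
# f16c128: 1/6
# f32c128: 1/24
# f32c192: 1/16
```

By this count, f16c64 stores as many values per pixel as an f8c16 latent while using a quarter of the spatial positions, which is one way to read "expanded channels compensate for spatial compression".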

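3.2 Data: billion-scale general images + text-rich real/synthetic documents

The training data is first scaled to billions of images covering many categories, resolutions, and aspect ratios, with clarity/blur filters removing images with blurry edges or compression artifacts. For text-rich scenes, the paper additionally uses an OCR filter to select samples with high character density and curates a document corpus: academic papers, presentation slides, posters, complex web pages, and so on.

The synthetic data does not just render black text on a white background; it renders English/Chinese text onto backgrounds randomly sampled from general-domain images. Character sizes range from 5 to 20 pixels to suit the different compression settings, which forces the model to learn to preserve fine-grained strokes even at $f32$.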

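3.3 Loss: drop KL/GAN, keep reconstruction + LPIPS + semantic alignment

The total loss is:

$$L_{total} = L_{recon} + \lambda_{lpips} L_{lpips} + \lambda_{align} L_{align}.$$

Here $L_{recon}$ is the pixel-level L1 reconstruction loss and $L_{lpips}$ is the perceptual loss. The paper explicitly drops the KL loss because KL limits latent capacity and conflicts with the goal of semantic alignment; it also drops the GAN loss because, under a large training budget, the sharpness gain from GAN is unnecessary while it adds optimization difficulty and training instability.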

Semantic alignment uses middle-layer features of DINOv2-L. Given a target semantic feature map $f \in \mathbb{R}^{h \times w \times c}$, the latent is first projected into the same space as $z'$, and two margin losses are then applied:

$$L_{mcos}(z', f) = \frac{1}{N} \sum_{p \in P} \mathrm{ReLU}\left(1 - \cos(z'_p, f_p) - m_{cos}\right),$$

$$L_{mdms}(z', f) = \frac{1}{N^2} \sum_{p \in P} \sum_{q \in P} \mathrm{ReLU}\left(\cos(z'_p, z'_q) - \cos(f_p, f_q) - m_{dist}\right),$$

$$L_{align}(z, f) = L_{mcos}(z', f) + L_{mdms}(z', f).$$

Here $P$ is the set of spatial positions and $N = |P|$. $L_{mcos}$ aligns the direction of each local latent with the DINOv2 feature; $L_{mdms}$ preserves the relative spatial layout. Early in training, stricter margins are used to strengthen diffusability; later the margins are gradually relaxed so that the model rebalances between semantic consistency and pixel-level fidelity.
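The effect of the margins can be checked on toy vectors. The sketch below evaluates the two margin losses exactly as written above, in plain Python; the helper names (`cos`, `l_mcos`, `l_mdms`) and the two-position, two-dimensional features are made up for illustration and are not from the paper.

```python
import math


def cos(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def l_mcos(z, f, m_cos):
    """Margin cosine loss: mean over positions of ReLU(1 - cos(z_p, f_p) - m_cos)."""
    return sum(max(1 - cos(zp, fp) - m_cos, 0.0) for zp, fp in zip(z, f)) / len(z)


def l_mdms(z, f, m_dist):
    """Margin pairwise loss: mean over (p, q) of ReLU(cos(z_p, z_q) - cos(f_p, f_q) - m_dist)."""
    n = len(z)
    total = sum(max(cos(z[p], z[q]) - cos(f[p], f[q]) - m_dist, 0.0)
                for p in range(n) for q in range(n))
    return total / (n * n)


# Toy example: two spatial positions with 2-D features (values are invented).
f = [[1.0, 0.0], [0.0, 1.0]]        # target semantic features
z = [[1.0, 0.0], [1.0, 1.0]]        # projected latent z'

print(l_mcos(z, f, m_cos=0.0), l_mdms(z, f, m_dist=0.0))   # both positive
print(l_mcos(z, f, m_cos=0.5), l_mdms(z, f, m_dist=0.8))   # both zero
```

In this form a margin of zero is the strictest setting, and a larger margin tolerates more deviation before any penalty applies, which matches the strict-to-relaxed schedule described above.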

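3.4 OmniDoc-TokenBench: OCR-based text reconstruction benchmark

Reading Figure 2: OmniDoc-TokenBench consists of about 3K text-rich images across nine categories (book, slides, color textbook, exam paper, academic paper, magazine, financial report, newspaper, and note) and includes both English and Chinese. Rather than relying on elaborate word-box annotation, it runs full-page OCR on the original and reconstructed images and compares the OCR output text.

The core metric is Normalized Edit Distance (NED), reported as a similarity: if the two OCR texts are identical, NED is 1, and the more character-level errors there are, the lower NED gets. The paper points out that traditional PSNR/SSIM are insensitive to single-character errors: for example, when “orange” becomes “orango”, the PSNR loss is below 0.5 dB, but NED drops by 16.7%.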
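The 16.7% figure can be reproduced by hand. The sketch below uses a plain dynamic-programming edit distance rather than the `Levenshtein` package used by the released evaluation script; `levenshtein` and `ned` are illustrative helper names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def ned(text_gt: str, text_rec: str) -> float:
    """NED as a similarity: 1 - distance / max(len(gt), len(rec), 1)."""
    max_len = max(len(text_gt), len(text_rec), 1)
    return 1.0 - levenshtein(text_gt, text_rec) / max_len


score = ned("orange", "orango")
print(round(score, 4))            # 0.8333
print(round(1.0 - score, 3))      # 0.167
```

One wrong character out of six gives a distance of 1 and a NED of 5/6, a drop of about 16.7% from a perfect score.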

3.5 Pseudocode

from pathlib import Path

import Levenshtein
import numpy as np
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL
from paddleocr import PaddleOCR
from PIL import Image
from torchvision.transforms.functional import to_tensor


def gsc_encode(x, encoder_stages, space_to_channel, latent_proj):
    # Conceptual pseudocode from Figure 1; model source code is not released.
    # space_to_channel must fold by the full factor f so that high_freq has the
    # same spatial size as the output of the last encoder stage.
    high_freq = space_to_channel(x)          # fold spatial pixels into channels
    h = x
    for stage in encoder_stages:
        h = stage(h)                         # progressive downsampling
    h = torch.cat([h, high_freq], dim=1)     # global residual path
    z = latent_proj(h)
    return z


def pairwise_cos(feat):
    # (B, C, H, W) -> (B, N, N): cosine similarity between all spatial positions.
    flat = F.normalize(feat.flatten(2), dim=1)
    return flat.transpose(1, 2) @ flat


def vae_training_loss(x, vae, lpips_fn, dinov2, align_proj,
                      lambda_lpips, lambda_align, margin_cos, margin_dist):
    # Conceptual pseudocode from Equations (1)-(4). vae, dinov2.middle_layer_features
    # and align_proj are placeholder interfaces, not released APIs.
    z = vae.encode(x)
    x_hat = vae.decode(z)
    l_recon = F.l1_loss(x_hat, x)
    l_lpips = lpips_fn(x_hat, x).mean()

    f = dinov2.middle_layer_features(x).detach()     # assumed shape (B, C, H, W)
    z_prime = align_proj(z, output_hw=f.shape[-2:])  # assumed to match f's shape
    l_mcos = F.relu(1 - F.cosine_similarity(z_prime, f, dim=1) - margin_cos).mean()
    l_mdms = F.relu(pairwise_cos(z_prime) - pairwise_cos(f) - margin_dist).mean()
    l_align = l_mcos + l_mdms
    return l_recon + lambda_lpips * l_lpips + lambda_align * l_align


@torch.no_grad()
def public_repo_reconstruct_folder(gt_dir, recon_dir):
    # Based on alibaba/OmniDoc-TokenBench example_recon.py.
    vae = AutoencoderKL.from_pretrained(
        "black-forest-labs/FLUX.1-dev", subfolder="vae", torch_dtype=torch.float32
    ).cuda().eval()
    Path(recon_dir).mkdir(parents=True, exist_ok=True)   # make sure the output folder exists
    for img_path in sorted(Path(gt_dir).glob("*.png")):
        img = Image.open(img_path).convert("RGB")
        x = to_tensor(img).unsqueeze(0).cuda() * 2 - 1   # [0, 1] -> [-1, 1]
        latent = vae.encode(x).latent_dist.mode()        # deterministic latent
        recon = vae.decode(latent).sample.squeeze(0).clamp(-1, 1)
        recon = ((recon + 1) / 2 * 255).round().to(torch.uint8)
        Image.fromarray(recon.permute(1, 2, 0).cpu().numpy()).save(Path(recon_dir) / img_path.name)


def public_repo_compute_ned(gt_dir, recon_dir, files):
    # Based on alibaba/OmniDoc-TokenBench eval_metrics.py.
    ocr = PaddleOCR(use_doc_orientation_classify=False,
                    use_doc_unwarping=False,
                    use_textline_orientation=False)
    scores = []
    details = []
    for name in files:
        # Full-page OCR on both images; recognized lines are joined without normalization.
        text_gt = "".join(ocr.predict(np.asarray(Image.open(Path(gt_dir) / name)))[0].get("rec_texts", []))
        text_rec = "".join(ocr.predict(np.asarray(Image.open(Path(recon_dir) / name)))[0].get("rec_texts", []))
        max_len = max(len(text_gt), len(text_rec), 1)    # avoid division by zero
        ned = 1.0 - Levenshtein.distance(text_gt, text_rec) / max_len
        scores.append(ned)
        details.append({"file": name, "gt_text": text_gt, "recon_text": text_rec, "ned": round(ned, 4)})
    return float(np.mean(scores)), details

Differences between the paper's formulas and the released code: the paper describes the GSC architecture, semantic alignment training, and multi-stage training strategy of Qwen-Image-VAE-2.0; alibaba/OmniDoc-TokenBench @ f2815a7 releases benchmark/evaluation code and a reconstruction example but no VAE model or training implementation. The training-loss and GSC pseudocode is therefore paper-level, while the NED and reconstruction pseudocode is derived from the released code.

Code reference: main @ f2815a7 (2026-05-14); the pseudocode and mapping are based on this commit.

Paper Concept	Source File	Key Class/Function
OmniDoc OCR/NED metric	`eval_metrics.py`	`compute_ned`
Pixel reconstruction metrics	`eval_metrics.py`	`compute_pixel_metrics`, `compute_fid`
Folder reconstruction demo	`example_recon.py`	`AutoencoderKL.encode/decode` loop
Benchmark usage docs	`README.md`	CLI examples for `eval_metrics.py`
VAE training/model architecture	Not released	No open-source implementation found via code search

4. Experimental Setup

Datasets

Training corpus: billions of images covering diverse categories, resolutions, and aspect ratios; the exact sample count is not specified.
Text-rich real data: OCR-filtered high-character-density samples plus a curated document corpus, including academic papers, presentation slides, posters, and complex web pages.
Synthetic text data: rendered English/Chinese documents with random backgrounds; character sizes of 5–20 pixels.
OmniDoc-TokenBench: about 3K text-rich document images in nine document categories, English + Chinese.

Baselines

Reconstruction/generation baselines include VTP-Large, RAE-DINOv2-B, RAE-SigLIP2-B, FLUX.1-dev, HunyuanVideo, Qwen-Image, Wan2.1, Cosmos-0.1-CI8x8, Cosmos-0.1-CI16x16, HunyuanVideo-1.5, HunyuanImage-3.0, Wan2.2, Stepvideo-T2V, DC-AE-sana, LTX-2, HunyuanImage-2.1, LTX-Video, and FLUX.2-dev.

Evaluation metrics

PSNR / SSIM: pixel-level reconstruction quality; higher is better.
LPIPS: perceptual similarity; lower is better.
FID / gFID: distribution quality of reconstructed/generated images; lower is better.
IS: ImageNet generation quality in the downstream SiT experiments; higher is better.
NED: OCR text similarity, $1 - \frac{\mathrm{Levenshtein}(text_{gt}, text_{rec})}{\max(|text_{gt}|, |text_{rec}|, 1)}$; higher is better.
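For comparison with NED, PSNR averages squared error over all pixels, so a small localized error barely moves it. The sketch below is a toy illustration with invented data: `psnr` is a hypothetical helper over flat pixel lists, and the "page" is 10,000 white pixels of which 20 are reconstructed as black.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


# Invented example: 10,000 white pixels, 20 of them reconstructed as black.
page = [255] * 10_000
recon = [0] * 20 + [255] * 9_980
print(f"{psnr(page, recon):.2f} dB")
```

Twenty fully wrong pixels are enough to break a small glyph, yet the score depends only on their share of the page, which is why the benchmark adds an OCR-based metric.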

Training config

Architecture config: f16c64, f16c128, f32c128, f32c192; GSC residual; encoder params 76M/76M/77M/78M; decoder params 248M/248M/250M/250M.
Training strategy: curriculum from low resolution to 2K; progressive text-rich data infusion; semantic alignment margins from strict to relaxed.
Loss: L1 reconstruction + LPIPS + semantic alignment; no KL loss; no GAN loss.
Hardware / optimizer / LR / steps: not detailed in the paper; the public repo provides no training config.

5. Experimental Results

Public reconstruction and generation benchmarks (Table 2)

Model	Setting	IS ↑	gFID ↓	ImageNet PSNR / SSIM	FFHQ PSNR / SSIM
Qwen-Image-VAE-2.0-f16c64	f16c64	102.76	9.52	32.72 / 0.9086	39.14 / 0.9541
Qwen-Image-VAE-2.0-f16c128	f16c128	92.42	10.29	35.90 / 0.9519	43.10 / 0.9795
Qwen-Image-VAE-2.0-f32c128	f32c128	81.23	15.05	29.69 / 0.8423	35.91 / 0.9177
Qwen-Image-VAE-2.0-f32c192	f32c192	72.31	18.33	31.13 / 0.8785	37.52 / 0.9381

Key takeaways: f16c128 reaches 35.90 PSNR / 0.9519 SSIM on ImageNet and 43.10 PSNR / 0.9795 SSIM on FFHQ; f32c192, despite $f32$ compression, still achieves 31.13 PSNR / 0.8785 SSIM on ImageNet, which the paper describes as comparable to some $f8$ VAEs.

OmniDoc-TokenBench text-rich reconstruction (Table 3)

Model	Setting	SSIM ↑	PSNR ↑	LPIPS ↓	FID ↓	NED ↑
Qwen-Image-VAE-2.0-f16c64	f16c64	0.9279	26.00	0.0382	1.94	0.9244
Qwen-Image-VAE-2.0-f16c128	f16c128	0.9706	30.45	0.0167	0.79	0.9617
Qwen-Image-VAE-2.0-f32c128	f32c128	0.8442	22.13	0.0642	3.36	0.7065
Qwen-Image-VAE-2.0-f32c192	f32c192	0.8908	23.84	0.0497	1.98	0.8555

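Key findings: f16c128 reaches NED = 0.9617, exceeding all evaluated $f8$ VAEs, including FLUX.1-dev at 0.9546; f32c192 reaches NED = 0.8555, also exceeding several $f16$ baselines. The paper further notes that pixel metrics and text fidelity are not fully correlated: at f16, Stepvideo-T2V's NED of 0.8838 is higher than HunyuanImage-3.0's 0.7753 even though their SSIM gap is small; at f32, LTX-Video's NED of 0.5651 is higher than HunyuanImage-2.1's 0.4895 even though its FID is worse.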

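Reading Figure 3a–3b: The composite keeps the side-by-side comparison of the f16 and f32 panels from the source. In the f16 panel, most baselines show character blurring, stroke merging, or locally unreadable text, whereas the Qwen-Image-VAE-2.0 f16 models keep characters sharper. The f32 panel is more extreme: baselines often destroy document structure and text, while Qwen-Image-VAE-2.0-f32c192 still keeps a relatively readable page structure under high compression.

Reading Figure 4: The SiT-on-ImageNet samples are used to validate latent diffusability. A VAE is not only a reconstructor: if its latent is unfriendly to the downstream DiT, generation training can be slow or unstable even when reconstruction metrics look good. The samples in the figure indicate that the Qwen-Image-VAE-2.0 latent can support downstream generation.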

Limitations

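The paper has no dedicated limitations section. The authors explicitly note that NED uses raw OCR outputs without text normalization, so minor spacing artifacts may inflate edit distance; in addition, the training hardware, optimizer/LR/steps, VAE training code, and model implementation are not public, and the public repo only supports benchmark evaluation and a reconstruction example.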
