SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Paper: arXiv:2605.12500 Code: OpenSenseNova/SenseNova-U1 Code reference: main @ e2f1b47a (2026-05-15)

1. Motivation (研究动机)

当前 VLM 与图像生成模型通常把“理解”和“生成”拆成两套系统：理解侧依赖 vision encoder，把图像压成语义 token；生成侧依赖 VAE / diffusion latent，把图像当作另一个 latent 空间里的连续信号。这种拆分带来三个具体问题：表示空间不一致，理解结果很难原生反哺生成；级联系统需要额外适配器和调度逻辑；预训练 encoder / VAE 的固定表示会限制像素级细节、文字渲染、空间推理和 interleaved image-text generation 的共同扩展。

本文要解决的是 native unified multimodal modeling：在同一个 backbone 内同时处理文字、干净图像 token、噪声图像 token，并让模型直接在 pixel patch 空间做 flow matching，而不是在“理解 encoder + 生成 VAE”之间转换。目标不是单独提升 T2I 或 VQA，而是证明同一套 NEO-unify / MoT 架构能同时覆盖 image understanding、text understanding、X2I generation、image editing、interleaved generation、VLA 和 world modeling。

这个问题值得研究，因为一旦理解与生成共用原生表示，模型可以把空间、文字、知识、视觉细节和动作/世界预测放到同一套推理链路里：例如先理解多图上下文，再生成下一张图；或者从机器人视频中预测世界状态。论文的核心价值是把“多模态系统集成”推进到“单一模型内的能力涌现”。

2. Idea (核心思想)

核心 insight：不要先把图像压到 pretrained VE/VAE 的语义 latent，再让 LLM 学会对齐；而是用近乎无损的轻量 patch interface 把 pixel patch 和 text token 放进同一个 native Transformer，让 clean image / text token 与 noisy image token 在共享注意力结构里交互。

关键创新可以概括为三点：1) 32×32 patch 的 near-lossless visual interface，encoder 只用两层卷积，decoder 直接预测 pixel patch；2) Native Mixture-of-Transformers (MoT)，understanding stream 和 generation stream 参数解耦但在统一序列和 self-attention 里协作；3) resolution-adaptive noise scale 和 pixel-space flow matching，使不同分辨率下的信噪比更稳定。

与 BAGEL / Ovis-U1 / UniFlow 等 unified 或 tokenizer-based 方法相比，SenseNova-U1 的根本差异是更彻底地绕开 VAE / separate visual tokenizer，把 generation 也还原为同一个 native token 序列上的 pixel-space denoising。它不是把 LLM 外接一个图像生成器，而是把生成路径作为同一 backbone 的另一个 stream。

3. Method (方法)

3.1 Overall framework

Figure 1 解读：这张 overview 展示 SenseNova-U1 的整体定位：模型输入可以是 text、image 或 interleaved context，输出可以是 text、image、editing result、multi-step interleaved content，甚至 VLA/world-modeling 形式的视觉预测。它强调的不是单任务 SOTA，而是同一 native backbone 支撑多类理解与生成任务。

Figure 2 解读：架构由三块组成。左侧 near-lossless visual interface 把图像或噪声通过两层 convolution 编成 visual tokens，每个 token 对应 $32 \times 32$ pixel patch；中间 NEO-unify / MoT backbone 在同一序列里处理 text、clean image 和 noise image token；右侧理解分支用 linear head 预测文本，生成分支用 MLP-like / flow matching head 直接预测 pixel patch，省掉 VAE decoder。

直觉上，这个设计把“图像理解”和“图像生成”变成同一序列建模问题的两个视角。clean image tokens 提供感知和上下文，noise image tokens 负责逐步变成目标像素；二者在 attention 里接触，但通过 token-type routing 进入不同 stream 的 projections / FFN，避免理解能力被噪声生成目标破坏。

3.2 Key components

Near-lossless visual interface. Patch encoding layer 使用两层 convolution + GELU + 2D sinusoidal position encoding，stride 分别为 16 和 2，因此总压缩率为 32，每个 visual token 对应一个 $32 \times 32$ patch。文本仍使用底层 LLM tokenizer。理解头是 vocabulary projection；生成头直接回归 pixel patch，而不是 latent VAE code。

Dynamic noise scale. 由于 generation stream 支持不同分辨率，固定 unit Gaussian 会导致同一 timestep 下高分辨率样本的有效 SNR 不一致。论文定义 token 数 $N (H, W) = (H \cdot W) /3 2^{2}$ ，并用

σ_{R} (H, W) = σ_{0} \frac{N ( H , W )}{N _{0}}

缩放终端噪声；训练表给出的实际设置为 $σ_{0} = 1, N_{0} = 64$ ，当 $N \in [64, 4096]$ 时 $σ \in [1, 8]$ 。同时把 $\overset{σ}{ˉ} = σ_{R} / σ_{m a x}$ 经 NSEmb 加到 timestep embedding：

s_{t} = τ_{t} + NSEmb (\overset{σ}{ˉ} (H, W)) .

Native MoT and attention mask. 文本 token 仍 causal；同一 image block 内的 clean image / noise image tokens 可以双向 attention；noise tokens 能看见之前上下文和 clean inputs，而 clean tokens 不能看见 noise tokens。understanding 与 generation stream 的 projections、normalization、FFN 独立，由 token type 动态路由。

Figure 3 解读：hybrid attention kernel 是 serving 侧的关键：纯文本行保持 causal fast path，image token 所在 block 才扩展 key range 到完整 image span。这样不牺牲 LLM prefill 的常规效率，又允许图像 token 在空间维度内双向通信。

Pixel-space flow matching. 给定 clean image $x$ 和 $ϵ \sim N (0, I)$ ，论文使用 rectified-flow interpolant：

z_{t} = t x + (1 - t) σ_{R} ϵ, t \in [0, 1] .

模型先预测 clean signal $\hat{x}_{θ}$ ，再转换为 velocity：

v_{θ} (z_{t}, t) = \frac{x ^ _{θ} ( z _{t} , t , s _{t} ) - z _{t}}{1 - t},

并最小化

L_{G e n} = E [∥ v_{θ} (z_{t}, t) - v^{⋆} ∥_{2}^{2}], v^{⋆} = \frac{x - z _{t}}{1 - t} .

总目标是

L_{t o t a l} = λ_{1} L_{U n d} + λ_{2} L_{G e n},

其中理解损失为 next-token CE。Stage 3/4 的实际 loss weight 是 CE:MSE = $0.1 : 1$ 。

Classifier-Free Guidance. 训练时以 10% 概率 drop text condition，并额外以 10% 概率同时 drop text 和 image condition。推理时用 text guidance $γ$ 与 image-context guidance $γ_{im g}$ 分解条件影响；经验上 $γ = 4, γ_{im g} = 1$ 。

Figure 4 解读：理解数据的 mid-training pipeline 先做 distribution-balanced sampling，覆盖长尾视觉/语义属性；再做 prompt augmentation，把问答扩展到语义表达、格式约束、角色场景和任务复杂度；最后用多维 quality filtering 检查 correctness、hallucination 和 instruction following。

Figure 5 解读：四个 sunburst 分别展示 understanding、text-to-image、editing、interleaved corpus 的层级构成。训练目标不是单一图像生成分布，而是让模型同时见到自然图像、设计/infographic、人像、合成数据、editing 操作和 interleaved narrative。

Figure 6 解读：generation corpus 统一经过 low-level filtering、deduplication、VLM captioning 和 quality filtering。对 editing 数据，论文还会把指令拆成 dynamic objectives 与 static consistency constraints，以保证“该变的变、不该变的不变”。

Figure 7 解读：serving 虽然对外是统一模型，但实现上把 understanding/text streaming/control flow 交给 LightLLM，把 iterative image denoising 交给 LightX2V。两者通过 pinned shared memory 交换 generation state，使 text-heavy 和 image-heavy traffic 可以分开扩缩容。

3.3 Pseudocode based on released code

Code reference: main @ e2f1b47a (2026-05-15) — pseudocode and mapping based on this commit

Released repo 主要是 inference / evaluation / serving 代码，不包含论文 Stage 1–6 的训练 launch scripts。因此下面 pseudocode 对应 released inference path；训练超参来自论文 LaTeX sec/3_model.tex 的 training table，而不是代码默认值。

import math
import torch
 
def sensenova_patchify(images: torch.Tensor, patch_size: int = 32) -> torch.Tensor:
    # Mirrors NEOChatModel.patchify: [B, 3, H, W] -> [B, L, patch_size**2 * 3]
    b, c, h_img, w_img = images.shape
    h, w = h_img // patch_size, w_img // patch_size
    x = images.reshape(b, 3, h, patch_size, w, patch_size)
    x = torch.einsum("bchpwq->bhwpqc", x)
    return x.reshape(b, h * w, patch_size * patch_size * 3)
 
 
def sensenova_unpatchify(tokens: torch.Tensor, patch_size: int, h_img: int, w_img: int) -> torch.Tensor:
    h, w = h_img // patch_size, w_img // patch_size
    x = tokens.reshape(tokens.shape[0], h, w, patch_size, patch_size, 3)
    x = torch.einsum("bhwpqc->bchpwq", x)
    return x.reshape(tokens.shape[0], 3, h_img, w_img)

def t2i_flow_loop(model, prompt_embeds, image_size, num_steps, cfg_scale, seed):
    # Condensed from NEOChatModel.t2i_generate / _t2i_predict_v.
    token_h = image_size[1] // model.patch_size
    token_w = image_size[0] // model.patch_size
    grid_h, grid_w = image_size[1] // model.patch_size, image_size[0] // model.patch_size
 
    noise_scale = model.noise_scale
    if model.noise_scale_mode in {"resolution", "dynamic", "dynamic_sqrt"}:
        scale = math.sqrt((grid_h * grid_w) / model.noise_scale_base_image_seq_len)
        noise_scale = min(scale * model.noise_scale, model.noise_scale_max_value)
 
    g = torch.Generator(prompt_embeds.device).manual_seed(seed)
    image = noise_scale * torch.randn(1, 3, image_size[1], image_size[0], generator=g)
    z = sensenova_patchify(image, model.patch_size)
    timesteps = torch.linspace(0.0, 1.0, num_steps + 1, device=z.device)
    timesteps = model._apply_time_schedule(timesteps, token_h * token_w, model.timestep_shift)
 
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        v_pos = model._t2i_predict_v(prompt_embeds, t=t, z=z, image_size=image_size)
        if cfg_scale > 1:
            v_neg = model._t2i_predict_v(model.empty_prompt_embeds, t=t, z=z, image_size=image_size)
            v = v_neg + cfg_scale * (v_pos - v_neg)
        else:
            v = v_pos
        z = z + (t_next - t) * v
    return sensenova_unpatchify(z, model.patch_size, image_size[1], image_size[0])

def hybrid_attention_range(row_is_image: bool, causal_end: int, image_span_end: int) -> int:
    # Condensed from docs/inference_infra.md and the hybrid attention figure.
    # Pure-text blocks keep causal range; blocks containing image tokens extend to full image span.
    if not row_is_image:
        return causal_end
    return max(causal_end, image_span_end)

论文公式与 released code 实现差异：论文完整描述了 RL post-training、Flow-GRPO reward、DMD2 distillation 和 Stage 1–6 training recipe；released repo at e2f1b47a 暂未提供这些训练脚本，只提供 model / inference / evaluation / serving 实现。代码中的 t2i_generate 还暴露了 dynamic_sqrt noise mode、cfg_zero_star / global / channel CFG normalization 等工程选项，论文正文没有逐一展开。

3.4 Code-to-paper mapping

Code reference: main @ e2f1b47a (2026-05-15) — pseudocode and mapping based on this commit

Paper Concept	Source File	Key Class/Function
NEO-unify chat/generation backbone	`src/sensenova_u1/models/neo_unify/modeling_neo_chat.py`	`NEOChatModel`, `t2i_generate`, `it2i_generate`, `interleave_gen`
Pixel patch interface	`src/sensenova_u1/models/neo_unify/modeling_neo_chat.py`	`patchify`, `unpatchify`
Flow matching head / pixel decoder modules	`src/sensenova_u1/models/neo_unify/modeling_fm_modules.py`	`FlowMatchingHead`, `PatchDecoder_*`, `ConvDecoder`
Vision embeddings / 2D RoPE	`src/sensenova_u1/models/neo_unify/modeling_neo_vit.py`	`NEOVisionEmbeddings`, `apply_2d_rotary_pos_emb`
MoT / MoE config knobs	`src/sensenova_u1/models/neo_unify/configuration_neo_chat.py`	`NEOMoELLMConfig`, `NEOChatConfig`
T2I / editing / interleaved inference	`examples/t2i/inference.py`, `examples/editing/inference.py`, `examples/interleave/inference.py`	`SenseNovaU1T2I.generate`, `SenseNovaU1Editing.edit`, `SenseNovaU1Interleave.generate`
Disaggregated serving design	`docs/inference_infra.md`	LightLLM + LightX2V deployment notes

4. Experimental Setup (实验设置)

Datasets and scale. Understanding pre-training corpus contains image-text pairs 32%, captions 17%, infographic understanding 14%, pure text 37%。Mid-training corpus 来自 SenseNova V6.5：General 39.2%, Agent and Spatial 22.3%, Knowledge Reasoning 19.3%, Pure Text 19.2%。SFT corpus 覆盖 spatial intelligence ~15%, general multimodal understanding ~13%, reasoning ~12%, NLP ~11%, OCR/document ~11%, agentic function calling ~10%, long-context conversation ~8%, code ~6%, multi-turn ~4%, compositional understanding ~4%。Generation corpus 覆盖 T2I、editing、interleaved：T2I 中 Nature ~40.5%, People ~26.7%, Design ~20.7%；editing 中 natural scenes ~52.3%, human subjects ~14.7%；interleaved 中 Lifestyle ~44%, Infographics ~29%, Video ~19%, Reasoning ~8%。

Baselines. Understanding 对比 Qwen3VL / Qwen3.5 / Gemma4 / LongCat-Next 等；image generation 对比 Qwen-Image、Ovis-U1、Z-Image-Turbo、FLUX.1-dev、closed-source Nano-Banana / GPT-Image 等；editing 对比 Qwen-Image-Edit 系列、Ovis-U1；interleaved generation 对比 GPT-4o+DALL-E3、Gemini+Flux、Ovis-U1。

Metrics. VQA/OCR/math benchmarks 使用 benchmark accuracy；DPG-Bench 报 Global / Entity / Attribute / Relation / Other / Overall；GenEval 报 object/counting/color/position/binding overall；GEdit-Bench 与 ImgEdit 报人评/模型评 overall；TIIF、LongText、CVTG、IGenBench 关注 text rendering / chart / infographic correctness；RISEBench、WISE 关注 reasoning image generation/editing。

Training config. 这些数字来自论文源文件 sec/3_model.tex 的 training recipe table，而 released repo 没有训练 launch config。Stage 1 understanding warmup: 120K steps, peak LR $2 \times 1 0^{- 5}$ , seq length 32768, 0.75T training tokens。Stage 2 generation pre-training: Phase I 120K steps, LR $2 \times 1 0^{- 4}$ , gen resolution $25 6^{2} \to 51 2^{2}$ , seq length 8192, 0.25T tokens；Phase II 60K steps, LR $1 \times 1 0^{- 4}$ , gen resolution $51 2^{2} \to 204 8^{2}$ , seq length 16384, 0.25T tokens；Phase III 120K steps, cosine $1 \times 1 0^{- 4} \to 2 \times 1 0^{- 5}$ , seq length 16384, 0.88T tokens。Stage 3 unified mid-training: 84K steps, LR $2 \times 1 0^{- 5}$ , seq length 32768, 1.19T tokens, data ratio understanding:generation:editing:interleave = 0.33:0.37:0.24:0.06。Stage 4 unified SFT: 9K steps, cosine $2 \times 1 0^{- 5} \to 0$ , seq length 32768, 0.13T tokens。Optimizer 为 AdamW $(β_{1} = 0.9, β_{2} = 0.95, ϵ = 1 0^{- 8})$ ，weight decay 0，gradient clip 1.0。

Post-training for T2I includes text-rendering RL: 600 epochs, LR $1 \times 1 0^{- 5}$ , KL $β = 0.01$ , per epoch $N = 48$ prompts × $K = 16$ images, 10-step flow matching, CFG 4.0, noise level 0.7, dynamic-resolution warmup 200 epochs。Unified general RL: 8B variant 1600 epochs, A3B variant 200 epochs, $λ_{s t y} = 0.5$ 。DMD2 distillation reduces NFE from 100 to 8; generator LR $2 \times 1 0^{- 6}$ , fake flow LR $4 \times 1 0^{- 7}$ , AdamW betas $(0.0, 0.999)$ , weight decay 0.01, timestep shift 3.0, CFG 4.0。

5. Experimental Results (实验结果)

Understanding and text. SenseNova-U1-8B-Think 在 MMMU 74.78、MMMU-Pro 67.69、MathVista-mini 84.20、MathVision 75.82；30BA3B-Think 分别为 80.55、72.83、85.30、79.63。空间智能上，8B-Think 在 VSI-Bench 62.66、ViewSpatial 56.19、MindCube-Tiny 62.01、3DSR-Bench 64.88；30BA3B-Think 在 MindCube-Tiny 达到 70.86。纯文本方面，8B-Think 在 IFEval 91.13、Tau2 71.70、Claw eval 58.90；30BA3B-Think 在 MMLU-Pro 84.04、IFEval 92.39、Tau2 75.39。

Generation / editing / interleaved main numbers. DPG-Bench Overall: SenseNova-U1 8BA3B 88.14, 8B 87.78, Qwen-Image 88.32, Ovis-U1 83.72。GenEval Overall: 8BA3B 0.91, 8B 0.91, Qwen-Image 0.87。IGenBench open-source Overall Q-ACC/I-ACC: 8B 0.51/0.04, 8BA3B 0.42/0.02, Qwen-Image 0.36/0.01。BizGenEval hard/easy Average: 8B 39.7/61.7, 8BA3B 28.2/51.9, Qwen-Image-2512 6.3/41.0。TIIF testmini short Overall: 8B 89.74, 8BA3B 89.25, Qwen-Image 86.14；long Overall: 8B 89.17, 8BA3B 87.36, Qwen-Image 86.83。Image editing ImgEdit Overall: Qwen-Image-Edit-2511 4.51, SenseNova-U1 8BA3B 3.91, 8B 3.90。GEdit-Bench Overall: Qwen-Image-Edit-2511 7.88, SenseNova-U1 8B 7.47, 8BA3B 7.32。OpenING Overall: SenseNova-U1-SFT w/ CoT 8BA3B 9.16, 8B 9.07, GPT-4o+DALL-E3 8.20, Gemini+Flux 7.23。RealUnify Avg: 8B 52.4, 8BA3B 50.5, Ovis-U1 35.4。

Ablations and visual analyses. Reconstruction with frozen understanding branch: Neo-unify 2B at downsampling ratio 32 and resolution 512 reaches PSNR 31.56 / SSIM 0.85, matching FLUX.1-dev VAE PSNR 31.56 but with lower SSIM. RL curves show text/rendering/reasoning-oriented post-training generally improves DPG-Bench、WISE、GEdit-Bench、RISEBench scores, while paper notes A3B still has room for further improvement after only 200 epochs。VLA/world-modeling qualitative figures show U1 can infer robot manipulation behavior and predict arm-view future states, but these are preliminary evidence rather than full robotics benchmarks。

Figure 8a–8d 解读：四个训练曲线分别对应 DPG-Bench、WISE、GEdit-Bench、RISEBench。它们展示 post-training/RL 对通用生成、reasoning generation、editing 和 reasoning editing 的收益；曲线不是单一 reward 变好，而是多个任务分布下的综合改善。

Figure 9 解读：reconstruction case 验证 frozen understanding branch 下，2B NEO-unify 仍能重建 out-domain images。它支撑 near-lossless patch interface 的论点：即使压缩率到 32，pixel-space path 仍保留足够视觉细节。

Figure 10 解读：ImgEdit case 展示模型在 frozen understanding branch 下对 editing prompt 的执行能力，重点不是单张图质量，而是验证理解条件能有效进入 generation stream。

Figure 11 解读：co-training ablation 展示 understanding 与 generation 共同训练时的冲突和收益。它对应本文最关键的风险：统一模型可能互相干扰，因此需要 MoT stream decoupling、loss ratio 和 staged recipe 来稳定联合优化。

Figure 12 解读：VLA visualization 说明模型能把机器人视频中的状态变化、动作意图和语言描述关联起来。这不是本文主评测，但支持作者关于 native multimodal model 可自然扩展到 action reasoning 的论点。

Figure 13 解读：world-modeling 结果展示从 robotic arm view 预测后续视觉状态。它说明 pixel-space generation stream 不只适用于静态图像，也能作为世界动态预测的候选接口。

Limitations. 论文明确提到的一个工程限制是 generation branch 最后 FFN 与 MLP head 独立建模 $32 \times 32$ pixel patches，可能导致 grid artifacts；作者冻结 generation branch 最后三层和 MLP head 以缓解，并提出未来用 PixelShuffle + two convolution layers 替换 MLP head。另一个实际限制是 released code 暂未覆盖完整训练/RL/DMD2 pipeline，复现实验训练 recipe 还依赖论文描述而非公开 launch scripts。

Conclusion. 结果证明 SenseNova-U1 能在不依赖 VAE/VE 分裂架构的情况下，把理解、生成、editing、interleaved generation 和初步 VLA/world-modeling 放入同一 native MoT 框架；其优势最明显在 text-rich / infographic / interleaved / reasoning-heavy 生成场景，而传统 image editing overall 仍落后于最新 Qwen-Image-Edit 系列。

Paper Notes

探索

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

1. Motivation (研究动机)

2. Idea (核心思想)

3. Method (方法)

3.1 Overall framework

3.2 Key components

3.3 Pseudocode based on released code

3.4 Code-to-paper mapping

4. Experimental Setup (实验设置)

5. Experimental Results (实验结果)

目录