SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Paper: arXiv:2605.15178v1 Code: NVlabs/Sana Code reference: main @ 1bfb9352 (2026-04-14)

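1. Motivation

SANA-WM targets camera-controlled world modeling: given a first frame, a text prompt, and a continuous 6-DoF camera trajectory, generate a 1-minute 720p video that follows the camera motion and preserves scene identity. Existing open world models can already produce minute-scale rollouts, but they typically depend on larger models, more data, longer training, and multi-GPU inference. Distilling a short-video generator is cheaper, but it provides no supervision for long-horizon scene preservation or trajectory adherence.

The core question of the paper: can a high-fidelity, camera-controllable, minute-scale 720p world model be trained natively while data, training, and inference costs all stay acceptable? SANA-WM's answer is to redesign long-context modeling, camera conditioning, data annotation, and visual refinement under efficiency constraints.

Figure 1 interpretation: starting from one initial image and a camera action trajectory, SANA-WM generates a minute-scale 720p world rollout. The paper stresses 64-GPU training and single-GPU inference, which sets it apart from multi-GPU closed-source industrial baselines.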

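2. Idea

The key to SANA-WM is not simply scaling up the DiT but combining long-horizon state compression with periodic exact recall: most blocks use frame-wise Gated DeltaNet (GDN) to carry context recurrently, while a few blocks insert softmax attention to anchor long-range spatial consistency. In parallel, dual-branch camera control injects the 6-DoF trajectory at both the latent frame rate and the raw frame rate. Finally, a second-stage long-video refiner repairs the degradation in detail, structure, and temporal consistency left by stage 1.

The overall pipeline: first re-annotate public videos into a long-video training set with metric-scale poses, then train the stage-1 world model with the LTX2 VAE plus the Hybrid Linear DiT. At inference, the cheap stage-1 model is used to search over trajectories and produce drafts, and long-video refinement is applied only to the results worth keeping.

Figure 2 interpretation: on the model side, the system consists of the LTX2 tokenizer, the hybrid GDN/softmax backbone, dual-branch camera control, and the refiner. On the system side, it offers three single-GPU inference modes: bidirectional, chunk-causal autoregressive, and few-step distilled autoregressive.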

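3. Method

3.1 Progressive Training Strategy

Training proceeds in four steps instead of training end to end directly on 1-minute 720p sequences. Note that the stage numbering in Section 4.2 differs: it counts the frame-wise GDN and hybrid-attention steps as Stages 1 and 2 and lists VAE adaptation separately.

Efficient VAE Adaptation: replace the baseline VAE with LTX2-VAE, re-initialize the patchify and final projection layers, and adapt to the new latent distribution for about 50K steps. The LTX2 representation is about 2× smaller than ST-DC-AE and about 8× smaller than Wan2.1-VAE.
Hybrid Architecture Adaptation: adapt the SANA-Video backbone to the hybrid GDN-softmax architecture on 5s short clips first, so that stability problems surface at low cost.
Minute-Scale Extension + Action Conditioning: extend to 1-minute sequences and introduce Dual-Branch Camera Control for metric 6-DoF trajectory conditioning.
SFT: run a final supervised fine-tuning pass for minute-scale generation on about 50K high-quality clips.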

def progressive_sana_wm_training(model, ltx2_vae, datasets, train):
    """Paper-level sketch of the training schedule; not runnable against NVlabs/Sana.

    `train` is a caller-supplied training loop. The `model` methods and
    `datasets` attributes are hypothetical names for the steps in the text.
    """
    # Stage 0: adapt VAE and model I/O to LTX2 latent space (~50K steps).
    # The table in Section 4.2 does not list a learning rate for this step.
    model.replace_vae(ltx2_vae)
    model.reinit_patchify_and_output_projection()
    train(model, datasets.sana_video_sft_5s, steps=50_000, lr=5e-5)

    # Stage 1-2: stabilize frame-wise GDN and hybrid GDN/softmax on short clips.
    model.enable_frame_wise_gdn()
    train(model, datasets.sana_video_sft_5s, steps=30_000, lr=5e-5)
    model.interleave_softmax_blocks(indices=[3, 7, 11, 15, 19])
    train(model, datasets.sana_video_sft_5s, steps=30_000, lr=5e-5)

    # Stage 3-4: scale to 1-minute videos with camera control, then SFT.
    # cp_size=2 splits each long sequence across two GPUs (context parallelism).
    model.enable_dual_branch_camera_control()
    train(model, datasets.sana_wm_pose_60s, steps=31_000, lr=1e-5, cp_size=2)
    train(model, datasets.high_quality_60s_50k, steps=10_000, lr=1e-5, cp_size=2)
    return model

3.2 Hybrid Linear Attention for Long Context

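Vanilla cumulative linear attention accumulates the key-value outer products of all past frames in a single $D \times D$ state. Memory is constant, but there is no decay and no saliency-based selection, so stale features keep piling up over a minute-long sequence. SANA-WM switches to frame-wise GDN: spatial tokens are aggregated within each latent frame, and the state is then updated with a decay gate and the delta rule. Softmax blocks are inserted periodically for exact long-range recall.

Intuitively, GDN "rolls the world state forward cheaply", while the softmax blocks "occasionally look back at key tokens to recalibrate spatial consistency". Implementation details given in the paper: 20 transformer blocks with head dim $D = 112$, of which 15 are frame-wise GDN blocks; the softmax blocks sit at indices $\{3, 7, 11, 15, 19\}$.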

import torch
import torch.nn.functional as F


def frame_wise_gdn_step(state, q_t, k_t, v_t, decay_t, beta_t):
    """Paper-level reconstruction of the frame-wise GDN update (single head).

    q_t/k_t/v_t: [spatial_tokens, dim]; state: [dim, dim], mapping keys to values.
    decay_t/beta_t: pre-sigmoid gate logits for the current latent frame.
    """
    # Aggregate spatial tokens before the recurrent state update.
    k_bar = F.normalize(k_t, dim=-1).mean(dim=0)
    v_bar = v_t.mean(dim=0)
    decay = torch.sigmoid(decay_t).mean()
    beta = torch.sigmoid(beta_t).mean()

    # Delta rule: write only the error between the target value and the current prediction.
    prediction = state @ k_bar
    delta = torch.outer(v_bar - prediction, k_bar)
    state = decay * state + beta * delta
    out = q_t @ state.T  # read every spatial query against the updated state
    return out, state


def hybrid_long_context_block(tokens, blocks, softmax_ids=(3, 7, 11, 15, 19)):
    dim = tokens.shape[-1]  # Tensor.dim is a method, so read the width from the shape
    for block_id, block in enumerate(blocks):
        if block_id in softmax_ids:
            tokens = block.softmax_attention(tokens)  # exact long-range recall
        else:
            # Assumption: each GDN block scans with its own state, not one shared across blocks.
            state = torch.zeros(dim, dim, device=tokens.device, dtype=tokens.dtype)
            tokens, _ = block.frame_wise_gdn(tokens, state)  # efficient recurrent scan
    return tokens
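
To make the stale-feature argument concrete, here is a toy pure-Python sketch that writes two different values under the same unit-norm key and then reads the key back. It is not code from the paper or from NVlabs/Sana: the helper names and the 2-dimensional state are illustrative assumptions, and decay and beta are both fixed to 1 to isolate the effect of the delta rule.

```python
# Toy comparison: cumulative linear attention vs. a delta-rule state update.
# All names are illustrative; the state maps keys to values (S k ~ v).

def outer(v, k):
    """Outer product v k^T as a nested list."""
    return [[vi * kj for kj in k] for vi in v]

def matvec(S, k):
    """Matrix-vector product S k."""
    return [sum(sij * kj for sij, kj in zip(row, k)) for row in S]

def add(A, B, scale=1.0):
    """Elementwise A + scale * B."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def cumulative_update(S, k, v):
    # Plain linear attention: keep summing key-value outer products.
    return add(S, outer(v, k))

def delta_update(S, k, v, decay=1.0, beta=1.0):
    # Delta rule: write only the error between v and what S already predicts for k.
    pred = matvec(S, k)
    err = [vi - pi for vi, pi in zip(v, pred)]
    decayed = [[decay * s for s in row] for row in S]
    return add(decayed, outer(err, k), scale=beta)

zero = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]                          # the same key is written twice
old_value, new_value = [1.0, 2.0], [5.0, 6.0]

S_cum = cumulative_update(cumulative_update(zero, key, old_value), key, new_value)
S_delta = delta_update(delta_update(zero, key, old_value), key, new_value)

read_cum = matvec(S_cum, key)      # [6.0, 8.0]: old and new values are summed
read_delta = matvec(S_delta, key)  # [5.0, 6.0]: the new value replaces the old one
```

The cumulative state returns a blend of both writes, which is the "stale feature" behavior described above, while the delta rule overwrites the old association. A decay below 1 would additionally fade entries that are never rewritten.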

3.3 Dual-Branch Camera Control

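SANA-WM uses dual-rate geometric conditioning:

Coarse branch: Ray-local UCPE: compute world-space rays for each latent frame/token, build a ray-local basis, and apply a ray-local transform to the geometry channels of the camera-branch attention heads. This branch captures the global 6-DoF pose.
Fine branch: raw-frame Plücker mixing: because one latent token summarizes 8 raw frames, latent-rate conditioning loses the fine camera motion inside the stride. Plücker ray embeddings are therefore computed per raw frame and pixel and mixed back into the latent representation to recover high-frequency motion.

The key intuition: the coarse branch keeps the overall trajectory direction and global geometry correct, while the fine branch restores the camera changes inside the VAE temporal stride, avoiding rollouts where "the text is matched but the camera motion does not track the trajectory".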

import torch


def dual_branch_camera_condition(latent_tokens, raw_frames, camera_poses, intrinsics):
    """Paper-level sketch; every helper called here is a hypothetical placeholder."""
    # Coarse latent-rate UCPE.
    latent_rays = unproject_latent_cells(camera_poses.latent_rate(), intrinsics.latent_rate())
    ray_local_basis = build_ray_local_basis(latent_rays)  # x/y/z basis per token
    coarse_tokens = apply_ucpe_to_attention_heads(latent_tokens, ray_local_basis)

    # Fine raw-frame Plücker mixing: embedding is (direction, origin x direction).
    raw_rays = unproject_pixels(camera_poses.raw_rate(), intrinsics.raw_rate())
    # Pass dim explicitly; relying on torch.cross's default dim is deprecated.
    moment = torch.cross(raw_rays.origin, raw_rays.direction, dim=-1)
    plucker = torch.cat([raw_rays.direction, moment], dim=-1)
    fine_features = mix_raw_frame_plucker(plucker, raw_frames)

    return coarse_tokens + project_to_latent_rate(fine_features)
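
As a small worked example of the Plücker embedding used by the fine branch, the sketch below computes the 6-vector (direction, origin × direction) for a single ray in plain Python. The function names are illustrative assumptions, and normalizing the direction is a common convention rather than something this note confirms for SANA-WM.

```python
# Plücker embedding of a camera ray: (unit direction, origin x direction).
# Illustrative sketch only; names are not from the paper or NVlabs/Sana.
import math

def cross(a, b):
    """3-D cross product a x b."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def plucker(origin, direction):
    """Return the 6-D Plücker vector for the ray origin + t * direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return unit + cross(origin, unit)

# Two descriptions of the same 3-D line: the second origin is slid along the ray.
ray_a = plucker([0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
ray_b = plucker([3.0, 0.0, 1.0], [2.0, 0.0, 0.0])
# A camera translated off the line gives a different moment term.
ray_c = plucker([0.0, 0.0, 2.0], [1.0, 0.0, 0.0])
```

`ray_a` and `ray_b` come out identical because the embedding depends on the line itself, not on which point along it is used as the origin, while `ray_c` differs in its moment term. This is why per-pixel Plücker features can encode both camera rotation (through the direction) and camera translation (through the moment).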

3.4 Two-Stage Long-Video Refiner

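The stage-1 model is responsible for quickly generating long-video latents; the refiner uses paired latents $(x_{\ell}, x_{h})$ to learn how to repair a stage-1/degraded latent into a high-fidelity target. The paper uses truncated flow matching: the flow starts from a large but truncated noise level $\sigma_{\text{start}} = 0.9$, which encourages the model to refine rather than reconstruct from scratch. The refiner is conditioned on text, camera, and a reference image; the reference image is concatenated into the sequence but excluded from the loss, so that the stage-1 appearance is preserved.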

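def long_video_refiner(refiner, stage1_latents, target_latents, text, camera, reference):
    """One refiner training step (paper-level sketch; helper functions are hypothetical)."""
    # Truncated flow: start from the stage-1 latent noised to sigma_start.
    noisy_source = add_flow_noise(stage1_latents, sigma_start=0.9)
    # The reference image joins the sequence as conditioning only.
    sequence = concat_reference(noisy_source, reference)
    pred = refiner(sequence, text=text, camera=camera)
    # Reference tokens are dropped before the loss, so they carry no supervision.
    loss = flow_matching_loss(pred.without_reference(), target_latents)
    return loss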
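
The `add_flow_noise` helper in the sketch above is left undefined. One plausible reading, assuming the common rectified-flow convention of linearly interpolating between data and Gaussian noise (this note does not confirm that the paper uses exactly this form), is the following standard-library sketch that operates on a flat list of floats.

```python
import random

def add_flow_noise(latent, sigma_start=0.9, rng=None):
    """Blend a latent with Gaussian noise: (1 - sigma) * x + sigma * eps.

    Assumption: linear rectified-flow interpolation; `latent` is a flat list of floats.
    """
    if not 0.0 <= sigma_start <= 1.0:
        raise ValueError("sigma_start must lie in [0, 1]")
    rng = rng or random.Random(0)  # fixed seed keeps the example reproducible
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    return [(1.0 - sigma_start) * x + sigma_start * n for x, n in zip(latent, noise)]

stage1_latent = [0.5, -1.0, 2.0]
clean = add_flow_noise(stage1_latent, sigma_start=0.0)  # no noise: latent is unchanged
noisy = add_flow_noise(stage1_latent, sigma_start=0.9)  # 10% stage-1 signal, 90% noise
```

Under this convention, $\sigma_{\text{start}} = 0.9$ keeps only a 10% weight on the stage-1 latent, so most of the appearance information has to come from the conditioning (text, camera, reference image) rather than from the noised input.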

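Figure 3 interpretation: the refiner repairs local regions of the long rollout between 10s and 50s, improving object structure, sharpness, and temporal consistency. It does not change the overall trajectory; it raises the visual quality of what stage 1 has already generated.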

3.5 Data Construction Pipeline

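The paper builds a training corpus of about 213K clips: starting from public videos and static 3D sources, it re-estimates metric-scale camera poses and applies 3DGS rendered-trajectory augmentation to DL3DV. VIPE's depth backend is replaced/augmented with Pi3X and MoGe-2, which provide long-sequence-consistent depth and per-frame metric scale, respectively. Captions deliberately omit camera actions such as "pan left" or "move forward", so that the text cannot leak trajectory supervision.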
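
The caption rule can be illustrated with a simple filter. This is only a sketch under stated assumptions: the note does not describe how the authors enforce the rule, and the phrase list below is hypothetical (the note names only "pan left" and "move forward" as examples).

```python
import re

# Hypothetical phrase list; extend it to match the captioning vocabulary in use.
CAMERA_MOTION = re.compile(
    r"\b(?:pan|pans|panning|tilt|tilts|tilting|move|moves|moving)\s+"
    r"(?:left|right|up|down|forward|backward)\b",
    re.IGNORECASE,
)

def leaks_camera_motion(caption: str) -> bool:
    """Return True if the caption contains an explicit camera-move phrase."""
    return CAMERA_MOTION.search(caption) is not None

captions = [
    "A cobblestone street lined with cafes at dusk",
    "The camera pans left across a crowded plaza",
    "Move forward through the stone gate",
]
clean_captions = [c for c in captions if not leaks_camera_motion(c)]
```

A keyword filter like this will also flag scene content such as "a cyclist moves forward", so in practice it would serve as a first-pass check rather than a final decision.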

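Figure 4 interpretation: the value of the data pipeline lies in turning public videos into an action-conditioned corpus with metric-scale 6-DoF poses, so SANA-WM does not depend on proprietary simulators or closed-source action labels.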

3.6 Code-to-Paper Mapping

Code reference: main @ 1bfb9352 (2026-04-14)

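The GitHub link on the project page points to NVlabs/Sana. I searched main@1bfb9352 for strings such as SANA-WM, GDN, DeltaNet, Plucker/Plücker, and camera trajectory. The public repository currently covers mainly the SANA/SANA-Video/LongSana base models and the LTX2-720p config; it does not expose the SANA-WM-specific frame-wise GDN, dual-branch camera control, or benchmark/refiner training code described in the paper. The table below therefore maps only published code to adjacent components that can be verified, and the GDN/UCPE/Plücker pseudocode in this note is reconstructed from the paper's formulas.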

Paper Concept	Source File	Key Class/Function
SANA-Video/DiT video backbone	`diffusion/model/nets/sana_multi_scale_video.py`	`SanaMSVideo`, `SanaVideoMSBlock`, `forward_frame_aware`
Linear attention building block	`diffusion/model/nets/fastlinear/modules/lite_mla.py`	`LiteMLA.attn_matmul`, `LiteMLA.forward`
720p LTX2 VAE config	`configs/sana_video_config/Sana_2000M_720px_ltx2vae_AdamW_fsdp.yaml`	`SanaMSVideo_2000M_P1_D20`, `linear_head_dim: 112`, `vae_type: LTX2VAE_diffusers`
LongSana sequence pipeline	`diffusion/longsana/pipeline/sana_training_pipeline.py`	`SanaTrainingPipeline.setup_sequence`
LongSana trainer loop	`diffusion/longsana/trainer/longsana_trainer.py`	`LongSANATrainer.fwdbwd_one_step_streaming`
Flow-prediction training loss	`diffusion/longsana/utils/loss.py`	`FlowPredLoss`
SANA-WM-specific GDN + dual camera control	not found in public `NVlabs/Sana@main@1bfb9352`	paper-only reconstruction in this note

4. Experimental Setup

4.1 Training Data

Source	Type	Duration	Clips	Pose Source
SpatialVID-HQ	Real	10s	158,369	VIPE + Pi3X/MoGe-2
DL3DV	Real	10s	5,691	GT pose + Pi3X
DL3DV GS Refined	Synthetic	60s	14,881	GT pose + Pi3X
OmniWorld	Synthetic	60s	1,720	VIPE + GT depth
Sekai Game	Synthetic	60s	3,560	GT pose + Pi3X
Sekai Walking-HQ	Real	60s	9,767	VIPE + Pi3X/MoGe-2
MiraData	Real	60s	18,987	VIPE + Pi3X/MoGe-2
Total			212,975
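
The totals in the table can be checked with a few lines of Python. The clip counts and durations are copied from the table above; the duration share is derived arithmetic and is not a figure reported in the paper.

```python
# (clip duration in seconds, number of clips), copied from the table above.
clips = {
    "SpatialVID-HQ": (10, 158_369),
    "DL3DV": (10, 5_691),
    "DL3DV GS Refined": (60, 14_881),
    "OmniWorld": (60, 1_720),
    "Sekai Game": (60, 3_560),
    "Sekai Walking-HQ": (60, 9_767),
    "MiraData": (60, 18_987),
}

total_clips = sum(n for _, n in clips.values())
minute_clips = sum(n for dur, n in clips.values() if dur == 60)

total_seconds = sum(dur * n for dur, n in clips.values())
minute_seconds = sum(dur * n for dur, n in clips.values() if dur == 60)

clip_share = minute_clips / total_clips          # fraction of clips that are 60s
duration_share = minute_seconds / total_seconds  # fraction of footage that is 60s
total_hours = total_seconds / 3600
```

The sum reproduces the 212,975 total. The 60s sources account for 48,915 clips, which is roughly 23% of the clips but roughly 64% of the total footage (about 1,271 hours overall), so the minute-long data carries more weight than the clip counts alone suggest.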

4.2 Training Config

Item	Stage 1	Stage 2	Stage 3	Stage 4
Purpose	Frame-wise GDN	Hybrid Attention	Minute-Scale Video + CamCtrl	SFT
Data	SANA-Video SFT	SANA-Video SFT	SANA-WM data	~50K high-quality clips
Clip duration	5s	5s	1 min	1 min
Batch/GPU	1	1	0.5	0.5
CP size	—	—	2	2
Effective global batch	64	64	32	32
Learning rate	5e-5	5e-5	1e-5	1e-5
Training steps	30K	30K	31K	10K
Compute budget	~2.75 days	~2 days	~8 days	~2.5 days

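Additional notes: LTX2 VAE adaptation takes about 50K steps, roughly 3.5 days on 64 H100; the main DiT/backbone stages add up to roughly 15 days on 64 H100. All stages use AdamW, BF16 mixed precision, and gradient clipping at 0.5; Stages 3-4 precompute VAE latents to remove the cost of online encoding.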
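
The budget figures can be cross-checked with simple arithmetic. The sketch below uses only numbers from the table and the notes above; the per-stage durations are approximate in the source, so the GPU-day totals are rough estimates, and the assumption that every stage runs on all 64 GPUs is taken from the "64 H100" wording.

```python
NUM_GPUS = 64  # assumed constant across stages, per the "64 H100" notes

# Approximate wall-clock days per stage, from the training config table.
stage_days = {
    "stage1_frame_wise_gdn": 2.75,
    "stage2_hybrid_attention": 2.0,
    "stage3_minute_scale_camctrl": 8.0,
    "stage4_sft": 2.5,
}
vae_adaptation_days = 3.5

backbone_days = sum(stage_days.values())
backbone_gpu_days = backbone_days * NUM_GPUS
total_gpu_days = (backbone_days + vae_adaptation_days) * NUM_GPUS

def effective_global_batch(batch_per_gpu, num_gpus=NUM_GPUS):
    """Global batch size; 0.5 per GPU means one sample shared by a CP group of 2."""
    return batch_per_gpu * num_gpus

short_clip_batch = effective_global_batch(1)    # Stages 1-2
minute_clip_batch = effective_global_batch(0.5)  # Stages 3-4 with CP size 2
```

The stage durations sum to 15.25 days, consistent with the "roughly 15 days" figure, or about 976 GPU-days for the backbone and about 1,200 GPU-days including VAE adaptation. The batch helper reproduces the 64 and 32 entries in the "Effective global batch" row.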

4.3 Evaluation Benchmark and Metrics

The benchmark starts from 80 first-frame conditioning images at 1280×720, covering four scene categories (game-style, indoor, outdoor-city, outdoor-nature) with 20 images each. Every scene is paired with two revisit trajectories, which form the Simple and Hard splits.

Metrics include: Pose Acc. (rotation error R, translation error T, and camera-motion consistency CMC; lower is better), the 8 VBench dimensions (SC/BC/TF/MS/AQ/IQ/DD/OC plus Overall; higher is better), efficiency (peak memory in GB and videos per hour on 8 H100), revisit memory (same-pose PSNR/SSIM/LPIPS), and temporal IQ drop.
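
For reference, the PSNR part of the revisit-memory metric follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below applies it to two flat pixel lists; how the benchmark selects and aligns the same-pose frame pairs is not described in this note, so that step is left out.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(generated):
        raise ValueError("inputs must be non-empty and have the same length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)

first_visit = [100, 100, 100, 100]
revisit_close = [110, 90, 110, 90]   # small drift when the camera returns
revisit_far = [200, 0, 200, 0]       # scene identity largely lost

psnr_close = psnr(first_visit, revisit_close)
psnr_far = psnr(first_visit, revisit_far)
```

A higher value for the revisit frame means the model reproduced the earlier view more faithfully when the camera returned to the same pose. SSIM and LPIPS complement this with structural and perceptual similarity.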

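Figure 5 interpretation: the benchmark first frames span diverse geometry, lighting, and visual styles, to test whether the model preserves identity and follows the trajectory across different scenes.

Figure 6 interpretation: the trajectory templates include revisit, loop closure, pitch-heavy, and vertical motion. These trajectories are designed to expose drift in a world model's long-horizon spatial memory and camera control.

Figure 7 interpretation: the Hard split does more than turn within the BEV plane; it also includes height changes and pitch. The height profile below the plots shows that looking only at the top-down trajectory underestimates the difficulty.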

5. Experimental Results

5.1 Main Quantitative Results

Split	Method	Params	Res	GPUs	R↓	T↓	CMC↓	VBench Overall↑	Mem (GB)↓	Tput (videos/h)↑
Simple	Infinite-World	1.3B	480p	1	16.55	1.98	2.08	79.18	53.5	5.9
Simple	LingBot-World	14B+14B	480p	8	10.47	2.01	2.05	81.82	454.1	0.6
Simple	HY-WorldPlay	8B	480p	8	17.89	2.36	2.45	68.82	215.5	1.1
Simple	Matrix-Game 3.0	5B	720p	8	12.96	1.83	1.92	78.53	106.2	3.1
Simple	SANA-WM	2.6B	720p	1	7.59	1.59	1.63	79.29	51.1	24.1
Simple	SANA-WM + refiner	2.6B+17B	720p	1	4.50	1.39	1.41	80.62	74.7	22.0
Hard	LingBot-World	14B+14B	480p	8	18.99	1.65	1.81	81.89	454.1	0.6
Hard	Matrix-Game 3.0	5B	720p	8	18.79	1.67	1.82	78.79	106.2	3.1
Hard	SANA-WM	2.6B	720p	1	10.02	1.66	1.72	79.60	51.1	24.1
Hard	SANA-WM + refiner	2.6B+17B	720p	1	8.34	1.39	1.44	81.89	74.7	22.0

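Key takeaway: with 2.6B parameters and single-GPU generation, SANA-WM clearly lowers pose error on both the Simple and Hard splits. The refiner variant raises the Hard-split Overall score to 81.89, on par with LingBot-World, while throughput rises from 0.6 to 22.0 videos/hour, roughly 36× (22.0 / 0.6 ≈ 36.7).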
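
The headline ratios can be recomputed from the Hard-split rows of the table. The numbers below are copied from the table; the relative rotation-error reduction is derived arithmetic and is not a figure quoted from the paper.

```python
# Hard-split rows from the table above: rotation error, VBench Overall, throughput.
hard = {
    "LingBot-World": {"R": 18.99, "overall": 81.89, "tput": 0.6},
    "Matrix-Game 3.0": {"R": 18.79, "overall": 78.79, "tput": 3.1},
    "SANA-WM": {"R": 10.02, "overall": 79.60, "tput": 24.1},
    "SANA-WM + refiner": {"R": 8.34, "overall": 81.89, "tput": 22.0},
}
baselines = ["LingBot-World", "Matrix-Game 3.0"]

ours = hard["SANA-WM + refiner"]
speedup = ours["tput"] / hard["LingBot-World"]["tput"]

# Compare against the best (lowest) baseline rotation error.
best_baseline_r = min(hard[name]["R"] for name in baselines)
rotation_error_reduction = 1.0 - ours["R"] / best_baseline_r

overall_gap = ours["overall"] - hard["LingBot-World"]["overall"]
```

This gives a throughput ratio of about 36.7, a rotation error roughly 56% below the best Hard-split baseline (18.79 for Matrix-Game 3.0), and an Overall gap of zero against LingBot-World.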

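Figure 8 interpretation: in the Hard-trajectory qualitative comparison, the green border marks SANA-WM; the action overlay shows that it keeps relatively strong control and scene continuity on complex trajectories.

Figure 9 interpretation: the appendix qualitative comparison extends to more Hard videos. The main things to look for are whether the scene collapses under long-horizon viewpoint changes and whether the video follows the overlaid action.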

5.2 Ablation and Efficiency

Model	Attention	Tokenizer	Quality↑	I2V↑	Total↑	Mem GiB↓	Lat ms↓	Tput steps/s↑
Sana-Video	cumulative linear	Wan 2.1 / 480p	0.7683	0.9073	0.8378	8.90	1266.6	0.79
+ LTX2 VAE	cumulative linear	LTX2 / 720p	0.7697	0.9082	0.8390	5.40	371.7	2.69
+ Hybrid attn.	GDN + softmax	LTX2 / 720p	0.7834	0.9226	0.8530	5.68	433.2	2.31

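The ablation shows that the LTX2 VAE mainly brings memory and latency gains, while hybrid GDN+softmax attention clearly improves Quality/I2V/Total at a small extra compute cost. It is the key component behind "efficiency first without sacrificing quality".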
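
To quantify each ablation step, the sketch below computes row-to-row changes from the table above. The values are copied from the table, and the function name is an illustrative choice.

```python
rows = [
    # (model, Total score, peak memory in GiB, latency in ms)
    ("Sana-Video", 0.8378, 8.90, 1266.6),
    ("+ LTX2 VAE", 0.8390, 5.40, 371.7),
    ("+ Hybrid attn.", 0.8530, 5.68, 433.2),
]

def step_deltas(prev, cur):
    """Change introduced by one ablation step relative to the previous row."""
    return {
        "step": cur[0],
        "total_gain": cur[1] - prev[1],
        "mem_change_gib": cur[2] - prev[2],
        "latency_ratio": cur[3] / prev[3],
    }

deltas = [step_deltas(a, b) for a, b in zip(rows, rows[1:])]
for d in deltas:
    print(f"{d['step']}: total {d['total_gain']:+.4f}, "
          f"mem {d['mem_change_gib']:+.2f} GiB, latency x{d['latency_ratio']:.2f}")
```

Switching to the LTX2 VAE cuts latency to about 0.29× (roughly 3.4× faster) and saves 3.5 GiB while moving from 480p to 720p, with almost no change in Total. Adding hybrid attention costs about 17% more latency and 0.28 GiB, and accounts for most of the Total gain (+0.0140).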

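Figure 10 interpretation: latency and memory of the VAE/DiT stages as duration grows. The recurrent variants stay compact at 60s, while the all-softmax variant runs out of memory at 60s, which directly supports the need for Hybrid Linear Attention.

Figure 11 interpretation: the training-stability and scale/condition ablations show that extending directly to long sequences is unstable; the high-compression tokenizer, GDN/softmax, and the camera condition have to be introduced step by step.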

5.3 Refiner and 3D-Aware Qualitative Results

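The refiner ablation shows that on the Simple split the long-video refiner reaches an Overall of 80.62, compared with 71.37 for the original LTX-2.3 refiner. R/T/CMC drop from 8.65/2.32/2.35 to 4.50/1.39/1.41, $\mathrm{IQ}_{50\text{-}60}$ rises from 35.70 to 72.21, and $\Delta\mathrm{IQ}$ falls from 3.73 to 1.17. The refiner therefore improves not only per-frame sharpness but also temporal stability over the final 10 seconds.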

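Figure 12 interpretation: Pi3X is used to run 3D reconstruction on the generated videos, as an indirect check on whether the rollout preserves reconstructable geometry. If a video had continuous texture but drifting geometry, the reconstruction would be visibly unstable.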

5.4 Limitations

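The authors state explicitly that SANA-WM is still scale-limited and has no explicit 3D scene memory; it can still drift on dynamic scenes, rare viewpoints, or longer rollouts. In practical use, data provenance, model scope, and the evaluation setting should be documented. Future directions include scaling up the model and data, adding robot-action or point-tracking controls, strengthening persistent scene memory, and developing a more robust real-time/streaming refiner.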
