World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Paper: arXiv:2604.24764
Code: microsoft/World-R1
Code reference: main @ cf54603d (2026-05-01)

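1. Motivation

Current video foundation models can generate high-fidelity short clips, but most still perform image-space generation: they mainly learn pixel and texture correlations rather than the stable 3D geometry of the real world. When a prompt asks for large camera motion, orbiting an object, moving through a corridor, or a long-distance driving scene, the typical failures are object deformation, disappearing objects, warped walls, background drift, and failed point-cloud reconstruction. These errors show that the model does not simulate the scene as a single 3D world observable from multiple viewpoints.

Existing 3D-aware and camera-control methods usually inject the 3D prior as an architectural module or an inference-time constraint, for example an extra camera encoder, a control module, or a 3D-conditioned image-to-video pipeline. These methods improve camera control, but at the price of higher inference cost, architectural changes, and a narrower range of applicability, and they tend to sacrifice the visual quality and motion diversity of the original video model. The specific problem World-R1 targets is: without changing the base T2V model architecture and without relying on large-scale 3D-supervised video data, make generated videos satisfy 3D consistency and camera-trajectory constraints.

The problem is worth studying because it moves text-to-video post-training from "visual preference alignment" toward "physical/geometric alignment": if RL rewards can elicit latent 3D awareness in a video generator, the generator can serve as a foundation for autonomous driving simulation, robotics, and immersive world generation, rather than only producing 2D clips that look coherent on the surface.

Figure 1 (interpretation): the teaser shows the target behavior of World-R1. Given a prompt that contains textual camera instructions such as camera push-in, move-right, or turn-right, the generated video can be reconstructed into a more stable 3D world visualization. The point is not prettier individual frames but object permanence and geometric structure that hold for the same scene under camera motion.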

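2. Idea

World-R1's core insight is: rather than bolting a 3D module onto the generative model, turn "does the generated video look like a reconstructable 3D world" into an optimizable RL reward and let the existing T2V model internalize 3D constraints itself. It uses a pre-trained 3D foundation model and a VLM as an analysis-by-synthesis critic: a 3DGS is first reconstructed from the generated video, and rewards are then assigned from re-rendering quality, meta-view quality, and camera-trajectory consistency.

The key innovations can be summarized in three points. First, camera-aware latent initialization converts the camera motion in the prompt into trajectory-guided noise wrapping, implicitly injecting a camera prior through the initial latent. Second, a composite reward that combines the 3D-aware reward with the general generation reward (Section 3.3) is optimized with online video sampling in Flow-GRPO-Fast. Third, periodic decoupled training inserts dynamic-only phases between geometry-alignment phases, which keeps a strong 3D reward from pushing videos toward the reward-hacking solution of "static, easy to reconstruct, but without dynamics".

The fundamental difference from explicit camera-control methods such as ReCamMaster, CameraCtrl, and GCD is that those methods mainly add an external camera-control module to the architecture or the inference path. World-R1 leaves the inference architecture of base models such as Wan 2.1 or CogVideoX unchanged and instead changes the model parameters through RL rewards during post-training, so that geometric consistency becomes a generation tendency of the model itself.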

3. Method

3.1 Overall framework

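Figure 2 (interpretation): the left side is camera conditioning: camera motion tokens are detected in the text prompt, an extrinsics trajectory is generated, and the trajectory is projected to optical flow that warps the initial latent noise. The middle is the base video foundation model; the paper mainly uses Wan 2.1 1.3B/14B. The right side is the reward stack: Depth Anything 3 lifts the generated video to 3DGS, the three 3D-aware rewards (meta-view, reconstruction, trajectory) are computed, and they are combined with the general aesthetic reward into the total reward. The training algorithm is Flow-GRPO-Fast.

The overall training procedure is an online RL loop. Given a prompt, a camera trajectory and camera-aware latents are first derived from it; the model rolls out a group of candidate videos; the reward server runs 3D reconstruction, VLM judging, and HPS scoring on the videos; Flow-GRPO estimates advantages from the relative rewards of the group samples under the same prompt and applies a clipped policy update to the flow-matching denoising policy.

Intuition: World-R1 turns "3D consistency", a latent variable that is hard to supervise directly, into the verifiable feedback "can the generated video support a stable 3D reconstruction". A hallucination that looks plausible in a single frame may not show up in the canonical view, but once the reconstructed point cloud or 3DGS is observed from a meta-view, floaters, billboards, and broken structures appear. The trajectory term in turn keeps the model from generating a static video just because it is easy to reconstruct, and the general reward keeps the geometric constraints from dragging down image quality.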

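3.2 Camera Conditioning: from text camera instructions to latent noise wrapping

The paper avoids training an extra camera encoder and instead uses Go-with-the-Flow style discrete noise transport. Camera words in the prompt are first mapped to a sequence of camera extrinsics.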

If the prompt contains several camera movements, their trajectories are concatenated in order. The relative motion between adjacent poses is then projected to a 2D flow field.
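The pose-to-flow projection can be illustrated with a minimal sketch. It assumes a pinhole camera and a single constant scene depth, both chosen only for illustration; the function name and its parameters are hypothetical, and the repository's camera_motion_to_flow may follow different conventions.

```python
def flow_from_relative_pose(R, t, fx, fy, cx, cy, depth, u, v):
    """Pixel displacement induced by a relative camera pose (hypothetical helper).

    R (3x3 nested list) and t (length-3 list) map a point from the first camera
    frame to the second: X2 = R @ X1 + t. `depth` is an assumed constant depth.
    """
    # Back-project the pixel to a 3D point at the assumed depth.
    x1 = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Express the point in the second camera frame.
    x2 = [sum(R[i][j] * x1[j] for j in range(3)) + t[i] for i in range(3)]
    # Re-project and subtract the original pixel position.
    u2 = fx * x2[0] / x2[2] + cx
    v2 = fy * x2[1] / x2[2] + cy
    return u2 - u, v2 - v


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# The scene shifts 0.1 units along +x in the camera frame, at depth 1.0.
du, dv = flow_from_relative_pose(
    identity, [0.1, 0.0, 0.0], 100.0, 100.0, 64.0, 64.0, 1.0, 64.0, 64.0
)
```

Evaluating the function on every latent-grid position gives a dense flow field for one pair of adjacent poses. A real implementation would also need to handle points that end up behind the camera, which this sketch does not.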

A continuous flow produces overlaps and holes on the discrete latent grid, so the implementation uses a density tracker to perform variance-preserving transport.
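A simplified sketch of the idea, not the repository's NoiseWarper: each noise sample is moved along its rounded flow vector, a counter tracks how many samples land in each cell, overlapping cells are rescaled by the square root of the count so the variance stays at 1, and empty cells receive fresh Gaussian noise. The function name and the nearest-cell rounding are assumptions made for illustration.

```python
import math
import random


def transport_noise(noise, flow, rng):
    """Move unit-variance noise along a flow field on a 2D grid.

    noise: H x W nested list of floats drawn from N(0, 1)
    flow:  H x W nested list of (dx, dy) displacements in grid cells
    rng:   random.Random used to fill cells that receive no sample
    """
    h, w = len(noise), len(noise[0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]  # density tracker: arrivals per cell
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            tx, ty = x + round(dx), y + round(dy)
            if 0 <= tx < w and 0 <= ty < h:
                total[ty][tx] += noise[y][x]
                count[ty][tx] += 1
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            n = count[y][x]
            if n == 0:
                # Hole: no source sample arrived, so draw fresh noise.
                out[y][x] = rng.gauss(0.0, 1.0)
            else:
                # A sum of n independent N(0, 1) samples has variance n.
                out[y][x] = total[y][x] / math.sqrt(n)
    return out


noise = [[1.0, 2.0], [3.0, 4.0]]
still = transport_noise(noise, [[(0, 0), (0, 0)], [(0, 0), (0, 0)]], random.Random(0))
merged = transport_noise(noise, [[(0, 0), (-1, 0)], [(0, 0), (0, 0)]], random.Random(0))
```

With zero flow the noise is returned unchanged. In the second call the top-right sample moves onto the top-left cell, which therefore holds (1 + 2) / sqrt(2), while the vacated cell is refilled with new noise.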

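In the code, TrajectoryGenerator produces trajectories such as push_in, pull_out, move_left/right, pan_left/right, and orbit_left/right; prepare_latents_with_camera generates the corresponding warped latent for each batch item and mixes it with the base Gaussian latent through wrap_strength. In the released configs, wrap_strength is 0.35 for Small and 0.4 for Large. The default frame count is 81, at the default video resolution set in the config.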

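3.3 Reward Design: 3D-aware reward + general generation reward

The total reward is the sum of the 3D-aware reward (reward_3d in the code) and the general generation reward (reward_general).

Both the paper and the code add the two terms directly by default; the appendix states the weighting. The 3D-aware reward is split into three terms: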

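  • Meta-view score (score_meta_view): a novel meta-view is rendered from the reconstructed 3DGS, and Qwen3-VL judges whether the structure is stable and whether floaters, distortion, or texture stretching appear. The raw 0 to 9 score is rescaled to a normalized range (the code clamps it to [0, 1]).
  • Reconstruction score (score_reconstruction): the 3DGS is re-rendered back into the video viewpoints along the estimated camera trajectory, and the paper defines the score from the discrepancy between the re-rendered and the generated frames. The released server defaults to REWARD_3D_USE_LPIPS=1, i.e. the reconstruction score is computed with LPIPS.
  • Trajectory score (score_trajectory_alignment): the prompt-derived target trajectory is compared with the trajectory estimated by Depth Anything 3, and translation path, extent, and rotation geodesic error are combined into a score.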
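The rotation geodesic error used in the trajectory term has a standard closed form: the angle of the relative rotation between two orientations. The sketch below shows only that quantity, averaged over a trajectory; the function names are hypothetical, and how the released _compute_camera_motion_score turns this error (together with the translation terms) into a score is not reproduced here.

```python
import math


def rotation_geodesic_deg(R_a, R_b):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_a^T R_b) equals the sum of element-wise products.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against round-off just outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def mean_rotation_error_deg(target_rotations, estimated_rotations):
    """Average per-frame geodesic error between two equally long rotation sequences."""
    pairs = list(zip(target_rotations, estimated_rotations))
    return sum(rotation_geodesic_deg(a, b) for a, b in pairs) / len(pairs)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
error = mean_rotation_error_deg([identity, identity], [identity, quarter_turn_z])
```

Here one frame matches exactly and the other is off by a quarter turn about the z axis, so the mean error is 45 degrees.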

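In the paper's formula, the general generation reward is the average HPSv3-style aesthetic preference score over the leading frames of the video.

The released code differs from the paper's formula in two ways. First, flow_grpo/rewards.py does not average frame by frame over the leading frames; it draws one random frame from each rollout video and sends it to the general reward server. Second, reward_server/general_reward.py actually calls hpsv2.score(..., hps_version="v2.1"), whereas the paper's main text says HPSv3. When reading, keep the paper objective and the released implementation apart.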
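The difference between the two definitions can be made concrete with a small sketch. `scorer` is a hypothetical stand-in for any frame-level preference model (it is not the hpsv2 API), and both function names are invented for illustration.

```python
import random


def paper_style_reward(frames, prompt, scorer, k):
    """Mean preference score over the first k frames, as the paper's formula describes."""
    head = frames[:k]
    return sum(scorer(frame, prompt) for frame in head) / len(head)


def released_style_reward(frames, prompt, scorer, rng):
    """Score of a single randomly chosen frame, as described for flow_grpo/rewards.py."""
    return scorer(rng.choice(frames), prompt)


def toy_scorer(frame, prompt):
    # Toy stand-in: each "frame" is just a number that is its own score.
    return float(frame)


frames = [1.0, 2.0, 3.0, 4.0]
averaged = paper_style_reward(frames, "a river at dusk", toy_scorer, k=2)
sampled = released_style_reward(frames, "a river at dusk", toy_scorer, random.Random(0))
```

The averaged variant is deterministic given the video, while the single-frame variant is a cheaper but noisier estimate whose value depends on which frame is drawn.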

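Figure 3 (interpretation): this appendix figure explains why the meta-view is useful. The canonical generated frames may look acceptable, but when the 3DGS is observed from an offset viewpoint, low-quality videos reveal fragmented point clouds, floaters, and collapsed structures. World-R1 uses this viewpoint to compute the meta-view score, which specifically penalizes pseudo-consistency that "looks plausible in 2D but does not hold in 3D".

3.4 Periodic Decoupled Training: keeping geometric rewards from suppressing dynamics

A strict 3D reward tends to induce static, rigid, easily reconstructed videos. World-R1 builds about 500 high-entropy dynamic prompts (fire, flowing water, crowds, fluids, and so on) and trains periodically: the main phase uses the full composite reward; after every 100 training steps it enters a dynamic fine-tuning phase that temporarily disables the 3D-aware reward and optimizes only the general reward on the dynamic subset. The released config sets dynamic_training.main_steps=100 and dynamic_training.dynamic_steps=50, and flow_grpo/rewards.py skips reward_3d when the metadata is marked is_dynamic=True.

Figure 4 (interpretation): the figure shows that after periodic decoupled training World-R1 can still generate scenes with non-rigid dynamics instead of only static, reconstructable rigid scenes. This is the paper's central trade-off: geometric consistency has to improve, but not at the expense of fluid, biological, or crowd motion.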

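3.5 Flow-GRPO-Fast: treating the flow matching sampler as a policy

Flow-GRPO views the denoising trajectory as an MDP. To give the flow model exploration during rollout, the deterministic ODE is first converted into a reverse-time SDE, which is then discretized into a stochastic per-step update.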
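For reference, the SDE and its discretization can be written as follows. This is a sketch in the notation of the Flow-GRPO formulation as recalled here, with $v_\theta$ the learned velocity field and $\sigma_t$ the noise schedule (Flow-GRPO parameterizes it as $\sigma_t = a\sqrt{t/(1-t)}$); the symbols are assumptions and may differ from the World-R1 paper's own notation.

```latex
\begin{aligned}
\mathrm{d}x_t &= \Big[\, v_\theta(x_t, t) + \frac{\sigma_t^2}{2t}\big(x_t + (1 - t)\, v_\theta(x_t, t)\big) \Big]\,\mathrm{d}t + \sigma_t\,\mathrm{d}w, \\
x_{t+\Delta t} &= x_t + \Big[\, v_\theta(x_t, t) + \frac{\sigma_t^2}{2t}\big(x_t + (1 - t)\, v_\theta(x_t, t)\big) \Big]\,\Delta t + \sigma_t \sqrt{\Delta t}\;\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I).
\end{aligned}
```

Each discrete step is therefore a Gaussian transition with a mean determined by the model, which is what makes a per-step log-probability, and hence a policy-gradient ratio, well defined.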

A group of trajectories is sampled under the same condition, and each trajectory's advantage is its reward standardized within that group.
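A minimal sketch of group-relative standardization, assuming rewards are grouped by identical prompt strings and that the population standard deviation is used; the function name and the `eps` value are illustrative choices rather than the released code's.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_advantages(prompts, rewards, eps=1e-4):
    """Standardize each reward against the other samples that share its prompt."""
    groups = defaultdict(list)
    for index, prompt in enumerate(prompts):
        groups[prompt].append(index)

    advantages = [0.0] * len(rewards)
    for indices in groups.values():
        values = [rewards[i] for i in indices]
        mu, sigma = mean(values), pstdev(values)
        for i in indices:
            # eps keeps the division finite when all rewards in a group are equal.
            advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return advantages


adv = group_advantages(["a", "a", "b", "b"], [1.0, 3.0, 10.0, 10.0])
```

In this toy batch the two samples for prompt "a" receive advantages of roughly -1 and +1, while the two identical rewards for prompt "b" both map to 0: the absolute reward level of a prompt does not matter, only how a sample compares with its siblings.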

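The optimization objective combines a PPO-style clipped surrogate with a KL penalty against a reference policy.

The released config sets sample.num_steps=50, sample.num_image_per_prompt=2, and sample.num_batches_per_epoch=24; the group size in the code follows from the sampler/reward grouping together with the paper's setting. Training uses LoRA, bf16, EMA, and the configured learning rate, with train.beta=0.004 and clip_range=1e-3.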
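In the usual GRPO form the objective reads as below. This is a sketch with assumed notation: $G$ is the group size, $T$ the number of denoising steps, $\hat{A}^i_t$ the group-standardized advantage, $\varepsilon$ the clip range, and $\beta$ the KL weight.

```latex
\begin{aligned}
\mathcal{J}(\theta) &= \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{T}\sum_{t=0}^{T-1}
\Big( \min\big( r^i_t(\theta)\,\hat{A}^i_t,\; \operatorname{clip}\big(r^i_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}^i_t \big)
- \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) \Big) \right], \\
r^i_t(\theta) &= \frac{p_\theta\big(x^i_{t-1} \mid x^i_t, c\big)}{p_{\theta_{\mathrm{old}}}\big(x^i_{t-1} \mid x^i_t, c\big)}.
\end{aligned}
```

Because each SDE step is Gaussian with the same variance under both policies, the per-step KL term is proportional to the squared difference between the two predicted means, which matches the mean-squared form used for kl_loss in the pseudocode of Section 3.6.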

3.6 Pseudocode based on released code

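Camera-aware latent preparation (corresponding to camera_trajectory_utils.py and scripts/train_world_r1.py):

import torch


def prepare_camera_aware_latents(pipeline, prompts, batch_size, cfg, device):
    # Returns per-sample initial latents (warped along the detected camera
    # trajectory where one exists) plus optional stepwise-delta callbacks.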
    num_channels = pipeline.transformer.config.in_channels
    vae_t = pipeline.vae_scale_factor_temporal
    trajectories, detected, expanded_prompts, profiles = get_camera_trajectories_for_batch(
        prompts,
        batch_size=batch_size,
        frames_per_trajectory=81,
        force_camera_movement=cfg.sample.force_camera_movement,
    )
    base_latents = torch.randn(
        batch_size,
        num_channels,
        (cfg.frames - 1) // vae_t + 1,
        cfg.height // 8,
        cfg.width // 8,
        device=device,
        dtype=torch.float32,
    )
    output, callbacks = [], []
    for i, trajectory in enumerate(trajectories):
        if trajectory is None:
            output.append(base_latents[i : i + 1])
            callbacks.append(None)
            continue
        wrapped = generate_camera_warped_latents(
            trajectory=trajectory,
            batch_size=1,
            num_channels_latents=num_channels,
            height=cfg.height,
            width=cfg.width,
            num_frames=cfg.frames,
            temporal_compression=vae_t,
            noise_degradation=cfg.sample.noise_degradation,
            flow_scale=cfg.sample.noise_wrap_flow_scale,
            device=device,
        )
        if cfg.sample.wrap_injection_mode == "stepwise_delta":
            delta_low = lowpass_latent_delta(
                wrapped.float() - base_latents[i : i + 1].float(),
                cfg.sample.delta_lowpass_kernel,
            )
            callbacks.append(
                build_stepwise_delta_callback(
                    delta_low=delta_low,
                    wrap_strength=float(cfg.sample.wrap_strength),
                    guidance_steps=cfg.sample.stepwise_guidance_steps,
                )
            )
            output.append(base_latents[i : i + 1])
        else:
            output.append(
                apply_wrap_strength_to_latents(
                    base_latents=base_latents[i : i + 1],
                    wrapped_latents=wrapped,
                    wrap_strength=float(cfg.sample.wrap_strength),
                    injection_mode=cfg.sample.wrap_injection_mode,
                    delta_lowpass_kernel=cfg.sample.delta_lowpass_kernel,
                )
            )
            callbacks.append(None)
    return torch.cat(output, dim=0), callbacks, trajectories, detected

3D-aware reward server (corresponding to flow_grpo/rewards.py, reward_server/reward_3d.py, and reward_3d_backend.py):

import torch
import random
 
 
@torch.no_grad()
def compute_world_r1_reward(video_frames, prompt, target_trajectory, backend, meta_scorer, general_worker):
    # Reward3DBackend.process_video_frames calls backend.model.inference(...)
    # and internally runs _generate_gs_video, _generate_meta_view, and
    # _compute_camera_motion_score.
    gs_video, meta_view, s_traj, traj_vis = backend.process_video_frames(
        frames=video_frames,
        camera_trajectory=target_trajectory,
    )
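    # lpips(...) and video_to_tensor(...) stand in for the server's LPIPS model
    # and its frame-to-tensor conversion; a lower LPIPS gives a higher score.
    s_recon = (1.0 - lpips(video_to_tensor(video_frames), gs_video).mean()).clamp(0.0, 1.0)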
    s_meta = torch.as_tensor(meta_scorer([prompt], [meta_view.unsqueeze(0)])[0]).clamp(0.0, 1.0)
    s_traj = torch.as_tensor(s_traj).clamp(0.0, 1.0)
    r_3d = s_recon + s_meta + s_traj
 
    # flow_grpo/rewards.py randomly selects one video frame before calling
    # reward_server/general_reward.py, which scores it with hpsv2 v2.1.
    sampled_frame = random.choice(video_frames)
    r_gen = torch.as_tensor(general_worker.compute_score([sampled_frame], [prompt])[0])
    return {
        "reward_3d": r_3d,
        "reward_general": r_gen,
        "reward_total": r_3d + r_gen,
        "score_reconstruction": s_recon,
        "score_meta_view": s_meta,
        "score_trajectory_alignment": s_traj,
    }

Periodic decoupled training / reward routing (corresponding to TextPromptDataset, multi_score, and the main training loop):

import torch


def choose_training_batch(global_step, main_loader, dynamic_loader, cfg):
    cycle = cfg.dynamic_training.main_steps + cfg.dynamic_training.dynamic_steps
    use_dynamic = cfg.dynamic_training.enabled and (global_step % cycle >= cfg.dynamic_training.main_steps)
    if use_dynamic:
        prompts, metadata = next(dynamic_loader)
        for m in metadata:
            m["is_dynamic"] = True
    else:
        prompts, metadata = next(main_loader)
        for m in metadata:
            m["is_dynamic"] = False
    return prompts, metadata
 
 
def combine_rewards(videos, prompts, metadata, reward_fns, weights):
    total = torch.zeros(len(prompts), dtype=torch.float32)
    details = {}
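    # Batches are homogeneous (all main or all dynamic), so a single dynamic
    # item means the whole batch skips the 3D reward.
    skip_3d = any(m.get("is_dynamic", False) for m in metadata)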
    for name, fn in reward_fns.items():
        if skip_3d and name == "reward_3d":
            continue
        scores, score_details = fn(videos, prompts, metadata)
        total = total + weights[name] * torch.as_tensor(scores)
        details[name] = scores
        details.update(score_details)
    details["reward_total"] = total.tolist()
    return details

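Flow-GRPO update (corresponding to wan_pipeline_with_logprob.py and scripts/train_world_r1.py):

import torch


def flow_grpo_update(transformer, samples, optimizer, cfg):
    # Assumed shapes: reward_total is (B,), kl and log_probs are (B, T).
    # Unsqueezing lets the KL penalty apply per denoising step and makes the
    # advantages indexable by t below.
    rewards = samples["reward_total"].unsqueeze(-1) - cfg.sample.kl_reward * samples["kl"]
    # Simplification: statistics are taken over the whole batch here, whereas
    # Section 3.5 standardizes rewards within each same-prompt group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
    advantages = advantages.clamp(-cfg.train.adv_clip_max, cfg.train.adv_clip_max)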
 
    for t in range(cfg.sample.num_steps):
        prev_mean, log_prob = compute_log_prob(transformer, samples, t)
        ratio = torch.exp(log_prob - samples["log_probs"][:, t])
        unclipped = -advantages[:, t] * ratio
        clipped = -advantages[:, t] * ratio.clamp(1.0 - cfg.train.clip_range, 1.0 + cfg.train.clip_range)
        policy_loss = torch.maximum(unclipped, clipped).mean()
 
        if cfg.train.beta > 0:
            with transformer.disable_adapter():
                ref_mean, _ = compute_log_prob(transformer, samples, t)
            kl_loss = ((prev_mean - ref_mean) ** 2).mean()
            loss = policy_loss + cfg.train.beta * kl_loss
        else:
            loss = policy_loss
 
        loss.backward()
        torch.nn.utils.clip_grad_norm_(transformer.parameters(), cfg.train.max_grad_norm)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

Code reference: main @ cf54603d (2026-05-01). The pseudocode above and the mapping below are based on this commit.

| Paper Concept | Source File | Key Class/Function |
| --- | --- | --- |
| Flow-GRPO / SDE sampler with log-prob | flow_grpo/diffusers_patch/wan_pipeline_with_logprob.py | sde_step_with_logprob, wan_pipeline_with_logprob |
| Camera motion detection and trajectory generation | flow_grpo/diffusers_patch/camera_trajectory_utils.py | TrajectoryGenerator, detect_camera_movements, get_camera_trajectories_for_batch |
| Camera-to-flow projection and discrete noise transport | flow_grpo/diffusers_patch/camera_trajectory_utils.py | camera_motion_to_flow, NoiseWarper, generate_camera_warped_latents, prepare_latents_with_camera |
| Rollout latent injection into training | scripts/train_world_r1.py | prepare_rollout_latents_and_callback, rollout_with_logprob |
| Prompt dataset and dynamic subset | scripts/train_world_r1.py | TextPromptDataset, choose main/dynamic dataloader by global_step |
| Composite reward client | flow_grpo/rewards.py | remote_reward_3d, remote_reward_general, multi_score |
| 3D reconstruction reward backend | reward_server/reward_3d_backend.py | Reward3DBackend.process_video_frames, _generate_gs_video, _generate_meta_view, _compute_camera_motion_score |
| Multi-GPU 3D reward service | reward_server/reward_3d.py, scripts/serve_reward_3d.py | MultiGPUReward3DManager, reward_3d_worker_process, create_app |
| General aesthetic reward service | reward_server/general_reward.py, scripts/serve_general_reward.py | MultiGPUGeneralRewardManager, GeneralRewardInstance.compute_score |
| Experiment configs | config/world_r1.py, config/base.py | world_r1_small, world_r1_large, dynamic and RL hyperparameters |

4. Experimental Setup

Datasets and scale: the paper uses a Pure Text Dataset synthesized with Gemini, about 3,000 prompts covering Natural Landscapes, Urban & Architecture, Micro World, Fantasy / surreal scenes, dynamic scenes, and others; about 500 of them form the dynamic subset used for periodic decoupled training. In the released code snapshot, dataset/final/ contains 2,468 train prompts, 42 test prompts, and 500 non-empty dynamic prompts (the file has no trailing newline, so wc -l reports 499); dataset/enhanced/ contains 2,651 train, 300 test, and 515 dynamic prompts. The user study uses 30 complex prompts and 25 participants; the metric-validation study uses 20 participants and 30 randomized video pairs.

Baselines: the main experiments compare against CogVideoX-1.5-5B, Wan2.1-T2V-1.3B, Wan2.1-T2V-14B, Wan2.2-T2V-5B, and Wan2.2-T2V-14B. The camera-control / 3D-aware comparison includes GCD, Trajectory-Attention, DAS, ReCamMaster, TrajectoryCrafter, CamCloneMaster, ViewCrafter, Voyager, FlashWorld, and VerseCrafter; the main text also discusses explicit camera-control methods of the CameraCtrl family.

Evaluation metrics: 3D consistency is measured by reconstructing a 3DGS, re-rendering it, and comparing the result with the original video: higher PSNR, higher SSIM, and lower LPIPS are better. The appendix also reports MVCS, a multi-view consistency measure that does not depend on the reconstruction pipeline. General video quality uses VBench sub-scores: Aesthetic Quality, Imaging Quality, Motion Smoothness, Subject Consistency, and Background Consistency. Camera-control accuracy uses RotErr, TransErr, and CamMC, all lower-is-better. The user study reports World-R1's win rate against Wan 2.1.
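As a reminder of what the PSNR numbers in the tables measure, here is a minimal sketch of the standard definition on flattened pixel values in [0, 1]. The function name is illustrative, and the paper's evaluation pipeline (frame sampling, color space, averaging over videos) is not reproduced.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally long pixel sequences."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01, i.e. 20 dB.
value = psnr([0.0, 0.0, 0.0], [0.1, 0.1, 0.1])
```

Because the scale is logarithmic, every 10 dB of PSNR corresponds to a tenfold reduction in mean squared error between the re-rendered and the generated frames.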

Training config: World-R1-Small is initialized from Wan2.1-T2V-1.3B and trained on 48 NVIDIA H200 GPUs; World-R1-Large is initialized from Wan2.1-T2V-14B and trained on 96 NVIDIA H200 GPUs. Training uses 81-frame clips at the configured resolution and Flow-GRPO-Fast with 48 parallel groups and the paper's group size. The released config uses 50 denoising steps, guidance scale 5.0, bf16, LoRA, EMA, the configured learning rate, train.beta=0.004, clip_range=1e-3, and 50 dynamic-only steps after every 100 main steps.

5. Experimental Results

5.1 Main quantitative results

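3D consistency (Table 2): World-R1 clearly exceeds the base video models on reconstruction-based 3D consistency.

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- |
| CogVideoX-1.5-5B | 24.44 | 0.783 | 0.242 |
| Wan2.2-T2V-14B | 23.47 | 0.779 | 0.253 |
| Wan2.2-T2V-5B | 22.36 | 0.716 | 0.303 |
| Wan2.1-T2V-14B | 19.76 | 0.629 | 0.405 |
| Wan2.1-T2V-1.3B | 17.40 | 0.550 | 0.467 |
| World-R1-Small | 27.63 | 0.858 | 0.201 |
| World-R1-Large | 27.67 | 0.865 | 0.162 |

Relative to Wan2.1-T2V-1.3B, World-R1-Small improves PSNR by 10.23 dB (17.40 to 27.63); relative to Wan2.1-T2V-14B, World-R1-Large improves PSNR by 7.91 dB (19.76 to 27.67).

VBench general quality (Table 1): World-R1-Small does not sacrifice general video quality. It exceeds its Wan2.1-T2V-1.3B backbone on four of the five sub-scores and is slightly lower only on Background Consistency (96.67 vs. 97.29).

| Method | Aesthetic ↑ | Imaging ↑ | Motion Smooth. ↑ | Subject Cons. ↑ | Background Cons. ↑ |
| --- | --- | --- | --- | --- | --- |
| CogVideoX-1.5-5B | 62.07 | 65.34 | 98.15 | 96.56 | 96.81 |
| Wan2.1-T2V-1.3B | 62.43 | 66.51 | 97.44 | 96.34 | 97.29 |
| GCD | 38.21 | 41.56 | 98.37 | 88.94 | 92.00 |
| Trajectory-Attention | 38.50 | 51.00 | 98.21 | 90.60 | 92.83 |
| DAS | 39.86 | 51.55 | 99.14 | 90.34 | 92.03 |
| ReCamMaster | 42.70 | 53.97 | 99.28 | 92.05 | 93.83 |
| World-R1-Small | 65.74 | 67.53 | 98.55 | 97.58 | 96.67 |

Figure 5 (interpretation): this qualitative comparison places generated frames next to the corresponding 3D reconstructions. Under complex camera motion the baselines show disappearing objects, bent walls, and sparse or noisy point clouds; World-R1's reconstructions are denser and more structured, which indicates an improvement in cross-view consistency rather than only in single-frame texture.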

5.2 Human study and additional metrics

In the user study, World-R1's win rates over Wan 2.1 are: Geometric Consistency 92%, Camera Control Accuracy 76%, Overall Preference 86%. In the metric-validation study, the automatic 3D-consistency metric agrees with the human majority preference 91.17% of the time.

The camera-control results in the appendix show RotErr / TransErr / CamMC of 1.21 / 1.30 / 2.95 for World-R1-Large, better than ReCamMaster's 1.53 / 3.12 / 4.17 and CamCloneMaster's 1.36 / 2.02 / 3.05. On MVCS, Wan2.1-1.3B scores 0.974 and World-R1-Small 0.989; Wan2.1-14B scores 0.963 and World-R1-Large 0.993.

In the long-video evaluation with 121 frames, Wan2.1-T2V-14B reaches PSNR / SSIM / LPIPS of 18.32 / 0.558 / 0.534 and World-R1-Large reaches 26.32 / 0.828 / 0.257, which suggests that the geometric alignment learned on short videos partly transfers to longer horizons.

The scene-complexity breakdown shows that long-horizon and non-rigid scenes are the hardest, but World-R1-Small clearly outperforms Wan2.1-1.3B in every category:

| Scene Type | N | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | MVCS ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Static Scene | 30.11% | Wan2.1-1.3B | 20.14 | 0.632 | 0.389 | 0.981 |
| Static Scene | 30.11% | World-R1-Small | 30.52 | 0.912 | 0.142 | 0.994 |
| Single-obj Dynamic | 29.03% | Wan2.1-1.3B | 17.86 | 0.563 | 0.452 | 0.976 |
| Single-obj Dynamic | 29.03% | World-R1-Small | 28.17 | 0.869 | 0.189 | 0.991 |
| Multi-obj Dynamic | 21.51% | Wan2.1-1.3B | 15.23 | 0.487 | 0.528 | 0.968 |
| Multi-obj Dynamic | 21.51% | World-R1-Small | 25.41 | 0.812 | 0.248 | 0.985 |
| Non-rigid Motion | 19.35% | Wan2.1-1.3B | 14.58 | 0.462 | 0.548 | 0.965 |
| Non-rigid Motion | 19.35% | World-R1-Small | 24.73 | 0.793 | 0.267 | 0.982 |
| Long-horizon Dynamics | 12.89% | Wan2.1-1.3B | 12.53 | 0.382 | 0.683 | 0.951 |
| Long-horizon Dynamics | 12.89% | World-R1-Small | 23.59 | 0.781 | 0.299 | 0.974 |

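The consolidated comparison with 3D-conditioned and camera-control methods supports the same conclusion: World-R1-Small leads on the 3D consistency metrics without sacrificing VBench visual quality the way the camera-control baselines do.

| Type | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | MVCS ↑ | Aesthetic ↑ | BG Cons. ↑ | Subject Cons. ↑ | Motion Smooth. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-Cond. | ViewCrafter | 23.15 | 0.724 | 0.291 | 0.979 | 55.52 | 92.09 | 94.25 | 97.86 |
| 3D-Cond. | Voyager | 21.38 | 0.678 | 0.334 | 0.975 | 49.80 | 92.31 | 91.55 | 99.39 |
| 3D-Cond. | FlashWorld | 22.46 | 0.702 | 0.312 | 0.977 | 53.72 | 91.88 | 94.44 | 98.81 |
| 3D-Cond. | VerseCrafter | 23.82 | 0.748 | 0.268 | 0.981 | 54.78 | 94.88 | 95.55 | 97.62 |
| Cam. Ctrl. | GCD | 18.26 | 0.582 | 0.438 | 0.966 | 38.21 | 92.00 | 88.94 | 98.37 |
| Cam. Ctrl. | Traj.-Attn. | 18.87 | 0.598 | 0.421 | 0.969 | 38.50 | 92.83 | 90.60 | 98.21 |
| Cam. Ctrl. | DAS | 19.42 | 0.618 | 0.398 | 0.971 | 39.86 | 92.03 | 90.34 | 99.14 |
| Cam. Ctrl. | ReCamMaster | 20.58 | 0.653 | 0.368 | 0.975 | 42.70 | 93.83 | 92.05 | 99.28 |
| Foundation | Wan2.1-1.3B | 17.40 | 0.550 | 0.467 | 0.974 | 62.43 | 97.29 | 96.34 | 97.44 |
| Ours | World-R1-Small | 27.63 | 0.858 | 0.201 | 0.989 | 65.74 | 96.67 | 97.58 | 98.55 |

Figure 6 (interpretation): the figure shows that videos generated by World-R1 can be recovered into a fairly dense, clean 3D scene representation, which indicates that the frames carry consistent multi-view geometric information.

Figure 7 (interpretation): the figure shows the 3D reconstruction failures caused by videos from Wan 2.1-style baselines: sparse point clouds, heavy noise, and structures that do not close. It illustrates why looking at generated frames alone is not enough and why one has to check whether the video supports a stable 3D reconstruction.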

5.3 Ablation findings

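Dataset scaling improves World-R1-Small monotonically: 1K prompts give PSNR 25.82, SSIM 0.812, LPIPS 0.258, and VBench AVG 83.23; 2K prompts give 26.54 / 0.839 / 0.223 / 84.76; 3K prompts give 27.63 / 0.858 / 0.201 / 85.21.

The reward component ablation shows that all three 3D reward terms contribute. The full pipeline reaches PSNR 27.63, SSIM 0.858, LPIPS 0.201, and VBench AVG 85.21; removing each of the three 3D reward terms in turn gives 26.91 / 0.841 / 0.218 / 83.67, 25.14 / 0.798 / 0.271 / 84.35, and 26.27 / 0.829 / 0.237 / 84.53.

The training/conditioning ablation shows the trade-off more clearly. Without noise wrapping, PSNR drops to 24.46 and VBench AVG to 76.39, which indicates that the camera-prior latent initialization is critical for convergence and trajectory alignment. Without periodic decoupled training, PSNR actually rises to 27.89 with SSIM 0.898 and LPIPS 0.192, but VBench AVG falls to 82.64: the model becomes more rigid and easier to reconstruct while general and dynamic quality degrade. Without the 3D-aware reward, PSNR is only 18.93 with SSIM 0.502 and LPIPS 0.496, so the geometric regularization essentially disappears. Without the general reward, VBench AVG drops to 83.44.

Figure 8 (interpretation): the two sets of curves track the general generation reward and the 3D-aware reward. They show that the reward terms, noise wrapping, and periodic decoupled training are not interchangeable modules: the 3D reward handles geometry, the general reward handles aesthetics and image quality, noise wrapping supplies the camera-motion prior, and the dynamic-only phase suppresses over-rigidity.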

5.4 Limitations and conclusion

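The authors state two kinds of limitations. First, video RL training remains expensive, because online RL repeatedly rolls out videos and evaluates rewards, and 3D reconstruction and VLM scoring are particularly costly. Second, World-R1 is still bounded by the capability of the base video foundation model: dense multi-object composition, fine-grained non-rigid motion, hand dynamics, and very long-horizon scene evolution may still inherit base-model artifacts.

The overall conclusion is that World-R1 demonstrates the feasibility of aligning 3D constraints through RL post-training. It requires no change to the T2V architecture and no large-scale 3D-supervised video data, yet it substantially improves 3D consistency, camera control, and user preference while maintaining, and on most sub-scores improving, VBench general quality.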