World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
Paper: arXiv:2604.24764
Code: microsoft/World-R1
Code reference:main@cf54603d(2026-05-01)
1. Motivation (研究动机)
现有 video foundation model 已经能生成高保真短视频,但它们多数仍是 image-space generation:模型主要学到的是像素/纹理相关性,而不是真实世界中稳定的 3D geometry。因此当 prompt 要求大幅相机运动、环绕物体、穿过走廊、长距离 driving scene 时,常见问题是物体形变、消失、墙面扭曲、背景漂移、点云重建失败。这类错误说明模型没有把同一个场景当作一个可被多视角观察的 3D world 来模拟。
已有 3D-aware / camera-control 方法通常把 3D prior 作为架构模块或 inference-time 约束注入,例如额外 camera encoder、control module、3D-conditioned image-to-video pipeline。这些方法可以改善相机控制,但代价是推理成本高、改动架构、适配范围受限,并且容易牺牲原始视频模型的 visual quality 和 motion diversity。World-R1 要解决的具体问题是:不改 base T2V model 架构、不依赖大规模 3D supervised video 数据,也能让生成视频满足 3D consistency 与相机轨迹约束。
这个问题值得研究,因为它把 text-to-video post-training 从“视觉偏好对齐”推进到“物理/几何对齐”:如果 video generator 能被 RL 奖励引导出 latent 3D awareness,就可以作为 autonomous driving simulation、robotics、immersive world generation 的基础,而不是只能产出表面上连贯的 2D clip。

Figure 1 解读:这张 teaser 展示 World-R1 的目标形态:输入包含 camera push-in / move-right / turn-right 等文本指令的 prompt,模型生成视频后可以被重建为更稳定的 3D world visualization。重点不是单帧更漂亮,而是同一场景在相机运动下保持 object permanence 和几何结构。
2. Idea (核心思想)
World-R1 的核心 insight 是:不要在生成模型里硬塞一个 3D 模块,而是把“生成视频是否像一个可重建的 3D 世界”变成可优化的 RL reward,让已有 T2V 模型自己 internalize 3D constraints。它利用 pre-trained 3D foundation model 和 VLM 作为 analysis-by-synthesis critic:先从生成视频重建 3DGS,再用重渲染质量、meta-view 质量、相机轨迹一致性给奖励。
关键创新可以概括为三点:第一,用 camera-aware latent initialization 把 prompt 中的相机运动转成 trajectory-guided noise wrapping,从初始 latent 里隐式注入相机先验;第二,用 与 组成复合奖励,在 Flow-GRPO-Fast 中在线采样视频并优化;第三,用 periodic decoupled training 在几何对齐阶段之间插入 dynamic-only 阶段,避免强 3D reward 把视频推成“静态、易重建但没有动态”的 reward-hacking 解。
它和 ReCamMaster / CameraCtrl / GCD 等显式 camera-control 方法的根本差异在于:那些方法主要在架构或推理路径中加入外部相机控制模块;World-R1 不改变 Wan 2.1 / CogVideoX 这类 base model 的推理架构,而是在 post-training 阶段通过 RL reward 改变模型参数,使几何一致性变成模型自身的生成倾向。
3. Method (方法)
3.1 Overall framework
Figure 2 解读:左侧是 camera conditioning:从文本 prompt 中检测 camera motion token,生成外参轨迹 ,再把轨迹投影成 optical flow 并 warp 初始 latent noise。中间是 base video foundation model,论文主要使用 Wan 2.1 1.3B/14B。右侧是 reward stack:生成视频被 Depth Anything 3 lift 到 3DGS,再计算 meta-view、reconstruction、trajectory 三个 3D-aware reward,并与 general aesthetic reward 合成总奖励;训练算法是 Flow-GRPO-Fast。
整体训练流程是一个 online RL loop:给定 prompt,先根据 prompt 生成 camera trajectory 和 camera-aware latents;模型 rollout 出一组候选视频;reward server 对视频做 3D reconstruction / VLM judging / HPS scoring;Flow-GRPO 用同 prompt 下 group samples 的相对 reward 估计 advantage,并对 flow-matching denoising policy 做 clipped policy update。
直觉:World-R1 的设计把“3D 一致性”从一个难以直接监督的隐变量,转换成“生成视频能否支撑稳定的 3D reconstruction”这个可验证反馈。单帧看起来合理的 hallucination,在 canonical view 里可能不暴露,但一旦从 meta-view 观察重建出的点云/3DGS,就会出现 floaters、billboard、断裂结构;trajectory term 又防止模型为了易重建而偷懒生成静态视频;general reward 则防止几何约束把画质拖垮。
3.2 Camera Conditioning:从文本相机指令到 latent noise wrapping
论文避免训练额外 camera encoder,而是使用 Go-with-the-Flow 风格的 discrete noise transport。prompt 中的相机词先被映射为外参序列:
如果 prompt 包含多个 camera movement,轨迹会按顺序 concatenate。然后把相邻 pose 的相对运动投影到 2D flow:
连续 flow 会在离散 latent grid 上产生重叠和空洞,因此实现里用 density tracker 做 variance-preserving transport:
代码中 TrajectoryGenerator 生成 push_in、pull_out、move_left/right、pan_left/right、orbit_left/right 等轨迹;prepare_latents_with_camera 为每个 batch item 生成对应 warped latent,并通过 wrap_strength 与 base Gaussian latent 混合。发布配置中 Small / Large 的 wrap_strength 分别是 0.35 / 0.4,默认视频分辨率是 ,帧数是 81。
3.3 Reward Design:3D-aware reward + general generation reward
总奖励是:
其中论文和代码默认直接相加,appendix 说明 ,,。3D-aware reward 被分成三项:
- :从重建的 3DGS 渲染 novel meta-view,用 Qwen3-VL 判断结构是否稳定、是否存在 floaters / distortion / texture stretching。原始 0—9 分缩放到 。
- :把 3DGS 从估计相机轨迹重渲染回视频视角,论文定义为 ;发布 server 默认
REWARD_3D_USE_LPIPS=1,即用 LPIPS 计算 reconstruction score。 - :比较 prompt-derived target trajectory 与 Depth Anything 3 估计出的 trajectory ,结合 translation path、extent、rotation geodesic error 得到 score。
General generation reward 在论文公式中定义为前 帧的 HPSv3 风格 aesthetic preference score 平均:
发布代码和论文公式有两个实现差异:flow_grpo/rewards.py 不是对前 帧逐帧平均,而是从每个 rollout video 随机抽一帧送到 general reward server;reward_server/general_reward.py 实际调用 hpsv2.score(..., hps_version="v2.1"),而论文正文写 HPSv3。阅读时应区分 paper objective 与 released implementation。
Figure 3 解读:这张 appendix figure 解释为什么 meta-view 有用。canonical generated frames 可能看起来还行,但从 3DGS 的偏移视角观察时,低质量视频会暴露点云破碎、漂浮物、结构塌陷;World-R1 用这个视角给 ,专门惩罚“2D 看起来合理但 3D 不成立”的伪一致性。
3.4 Periodic Decoupled Training:防止几何奖励压制动态
严格 3D reward 容易诱导模型生成静态、刚性、易重建的视频。World-R1 构造约 500 条 high-entropy dynamic prompts(fire、flowing water、crowds、fluid 等),并采用周期训练:主阶段使用完整 ;每 100 个 training steps 进入 dynamic fine-tuning phase,临时关闭 ,只在 dynamic subset 上用 优化。发布配置中 dynamic_training.main_steps=100、dynamic_training.dynamic_steps=50,flow_grpo/rewards.py 也会在 metadata 标记 is_dynamic=True 时跳过 reward_3d。

Figure 4 解读:该图展示 periodic decoupled training 后,World-R1 仍能生成包含非刚性动态的场景,而不是只会产出静态、可重建的 rigid scene。这对应论文的核心 trade-off:几何一致性必须提升,但不能以牺牲 fluid / biological / crowd motion 为代价。
3.5 Flow-GRPO-Fast:把 flow matching sampler 当作 policy
Flow-GRPO 把 denoising trajectory 看成 MDP。为让 flow model 在 rollout 中具有探索性,先把 deterministic ODE 转成 reverse-time SDE:
离散更新为:
同一个 condition 下采样 条轨迹,用 group reward 标准化 advantage:
优化目标包含 PPO-style clipped surrogate 与 reference policy KL:
发布配置里 sample.num_steps=50、sample.num_image_per_prompt=2、sample.num_batches_per_epoch=24,代码里的 group size 来自 sampler/reward grouping 与论文设置 ;训练使用 LoRA、bf16、EMA、learning rate 、train.beta=0.004、clip_range=1e-3。
3.6 Pseudocode based on released code
Camera-aware latent preparation(对应 camera_trajectory_utils.py 与 scripts/train_world_r1.py):
import torch
import torch.nn.functional as F
def prepare_camera_aware_latents(pipeline, prompts, batch_size, cfg, device):
num_channels = pipeline.transformer.config.in_channels
vae_t = pipeline.vae_scale_factor_temporal
trajectories, detected, expanded_prompts, profiles = get_camera_trajectories_for_batch(
prompts,
batch_size=batch_size,
frames_per_trajectory=81,
force_camera_movement=cfg.sample.force_camera_movement,
)
base_latents = torch.randn(
batch_size,
num_channels,
(cfg.frames - 1) // vae_t + 1,
cfg.height // 8,
cfg.width // 8,
device=device,
dtype=torch.float32,
)
output, callbacks = [], []
for i, trajectory in enumerate(trajectories):
if trajectory is None:
output.append(base_latents[i : i + 1])
callbacks.append(None)
continue
wrapped = generate_camera_warped_latents(
trajectory=trajectory,
batch_size=1,
num_channels_latents=num_channels,
height=cfg.height,
width=cfg.width,
num_frames=cfg.frames,
temporal_compression=vae_t,
noise_degradation=cfg.sample.noise_degradation,
flow_scale=cfg.sample.noise_wrap_flow_scale,
device=device,
)
if cfg.sample.wrap_injection_mode == "stepwise_delta":
delta_low = lowpass_latent_delta(
wrapped.float() - base_latents[i : i + 1].float(),
cfg.sample.delta_lowpass_kernel,
)
callbacks.append(
build_stepwise_delta_callback(
delta_low=delta_low,
wrap_strength=float(cfg.sample.wrap_strength),
guidance_steps=cfg.sample.stepwise_guidance_steps,
)
)
output.append(base_latents[i : i + 1])
else:
output.append(
apply_wrap_strength_to_latents(
base_latents=base_latents[i : i + 1],
wrapped_latents=wrapped,
wrap_strength=float(cfg.sample.wrap_strength),
injection_mode=cfg.sample.wrap_injection_mode,
delta_lowpass_kernel=cfg.sample.delta_lowpass_kernel,
)
)
callbacks.append(None)
return torch.cat(output, dim=0), callbacks, trajectories, detected3D-aware reward server(对应 flow_grpo/rewards.py、reward_server/reward_3d.py、reward_3d_backend.py):
import torch
import random
@torch.no_grad()
def compute_world_r1_reward(video_frames, prompt, target_trajectory, backend, meta_scorer, general_worker):
# Reward3DBackend.process_video_frames calls backend.model.inference(...)
# and internally runs _generate_gs_video, _generate_meta_view, and
# _compute_camera_motion_score.
gs_video, meta_view, s_traj, traj_vis = backend.process_video_frames(
frames=video_frames,
camera_trajectory=target_trajectory,
)
s_recon = (1.0 - lpips(video_to_tensor(video_frames), gs_video).mean()).clamp(0.0, 1.0)
s_meta = torch.as_tensor(meta_scorer([prompt], [meta_view.unsqueeze(0)])[0]).clamp(0.0, 1.0)
s_traj = torch.as_tensor(s_traj).clamp(0.0, 1.0)
r_3d = s_recon + s_meta + s_traj
# flow_grpo/rewards.py randomly selects one video frame before calling
# reward_server/general_reward.py, which scores it with hpsv2 v2.1.
sampled_frame = random.choice(video_frames)
r_gen = torch.as_tensor(general_worker.compute_score([sampled_frame], [prompt])[0])
return {
"reward_3d": r_3d,
"reward_general": r_gen,
"reward_total": r_3d + r_gen,
"score_reconstruction": s_recon,
"score_meta_view": s_meta,
"score_trajectory_alignment": s_traj,
}Periodic decoupled training / reward routing(对应 TextPromptDataset、multi_score、main training loop):
def choose_training_batch(global_step, main_loader, dynamic_loader, cfg):
cycle = cfg.dynamic_training.main_steps + cfg.dynamic_training.dynamic_steps
use_dynamic = cfg.dynamic_training.enabled and (global_step % cycle >= cfg.dynamic_training.main_steps)
if use_dynamic:
prompts, metadata = next(dynamic_loader)
for m in metadata:
m["is_dynamic"] = True
else:
prompts, metadata = next(main_loader)
for m in metadata:
m["is_dynamic"] = False
return prompts, metadata
def combine_rewards(videos, prompts, metadata, reward_fns, weights):
total = torch.zeros(len(prompts), dtype=torch.float32)
details = {}
skip_3d = any(m.get("is_dynamic", False) for m in metadata)
for name, fn in reward_fns.items():
if skip_3d and name == "reward_3d":
continue
scores, score_details = fn(videos, prompts, metadata)
total = total + weights[name] * torch.as_tensor(scores)
details[name] = scores
details.update(score_details)
details["reward_total"] = total.tolist()
return detailsFlow-GRPO update(对应 wan_pipeline_with_logprob.py 与 scripts/train_world_r1.py):
def flow_grpo_update(transformer, samples, optimizer, cfg):
rewards = samples["reward_total"] - cfg.sample.kl_reward * samples["kl"]
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
advantages = advantages.clamp(-cfg.train.adv_clip_max, cfg.train.adv_clip_max)
for t in range(cfg.sample.num_steps):
prev_mean, log_prob = compute_log_prob(transformer, samples, t)
ratio = torch.exp(log_prob - samples["log_probs"][:, t])
unclipped = -advantages[:, t] * ratio
clipped = -advantages[:, t] * ratio.clamp(1.0 - cfg.train.clip_range, 1.0 + cfg.train.clip_range)
policy_loss = torch.maximum(unclipped, clipped).mean()
if cfg.train.beta > 0:
with transformer.disable_adapter():
ref_mean, _ = compute_log_prob(transformer, samples, t)
kl_loss = ((prev_mean - ref_mean) ** 2).mean()
loss = policy_loss + cfg.train.beta * kl_loss
else:
loss = policy_loss
loss.backward()
torch.nn.utils.clip_grad_norm_(transformer.parameters(), cfg.train.max_grad_norm)
optimizer.step()
optimizer.zero_grad(set_to_none=True)Code reference:
main@cf54603d(2026-05-01) — pseudocode and mapping based on this commit
| Paper Concept | Source File | Key Class/Function |
|---|---|---|
| Flow-GRPO / SDE sampler with log-prob | flow_grpo/diffusers_patch/wan_pipeline_with_logprob.py | sde_step_with_logprob, wan_pipeline_with_logprob |
| Camera motion detection and trajectory generation | flow_grpo/diffusers_patch/camera_trajectory_utils.py | TrajectoryGenerator, detect_camera_movements, get_camera_trajectories_for_batch |
| Camera-to-flow projection and discrete noise transport | flow_grpo/diffusers_patch/camera_trajectory_utils.py | camera_motion_to_flow, NoiseWarper, generate_camera_warped_latents, prepare_latents_with_camera |
| Rollout latent injection into training | scripts/train_world_r1.py | prepare_rollout_latents_and_callback, rollout_with_logprob |
| Prompt dataset and dynamic subset | scripts/train_world_r1.py | TextPromptDataset, choose main/dynamic dataloader by global_step |
| Composite reward client | flow_grpo/rewards.py | remote_reward_3d, remote_reward_general, multi_score |
| 3D reconstruction reward backend | reward_server/reward_3d_backend.py | Reward3DBackend.process_video_frames, _generate_gs_video, _generate_meta_view, _compute_camera_motion_score |
| Multi-GPU 3D reward service | reward_server/reward_3d.py, scripts/serve_reward_3d.py | MultiGPUReward3DManager, reward_3d_worker_process, create_app |
| General aesthetic reward service | reward_server/general_reward.py, scripts/serve_general_reward.py | MultiGPUGeneralRewardManager, GeneralRewardInstance.compute_score |
| Experiment configs | config/world_r1.py, config/base.py | world_r1_small, world_r1_large, dynamic and RL hyperparameters |
4. Experimental Setup (实验设置)
Datasets and scale:论文使用 Gemini 合成的 Pure Text Dataset,约 3,000 条 prompt,覆盖 Natural Landscapes、Urban & Architecture、Micro World、Fantasy / surreal scenes、dynamic scenes 等;其中 dynamic subset 约 500 条,用于 periodic decoupled training。发布代码快照中 dataset/final/ 有 2,468 条 train、42 条 test、500 个非空 dynamic prompt(文件无 trailing newline,因此 wc -l 显示 499);dataset/enhanced/ 有 2,651 条 train、300 条 test、515 条 dynamic。用户研究使用 30 个 complex prompts、25 名参与者;metric-validation study 使用 20 名参与者和 30 个 randomized video pairs。
Baselines:主实验比较 CogVideoX-1.5-5B、Wan2.1-T2V-1.3B、Wan2.1-T2V-14B、Wan2.2-T2V-5B、Wan2.2-T2V-14B。camera-control / 3D-aware 对比包括 GCD、Trajectory-Attention、DAS、ReCamMaster、TrajectoryCrafter、CamCloneMaster、ViewCrafter、Voyager、FlashWorld、VerseCrafter;正文还讨论 CameraCtrl 类显式 camera-control 方法。
Evaluation metrics:3D consistency 用 3DGS reconstruction 后 re-render 与原视频比较:PSNR 越高越好、SSIM 越高越好、LPIPS 越低越好。附录还报告 MVCS,用于不依赖重建 pipeline 的 multi-view consistency。General video quality 用 VBench 子项,包括 Aesthetic Quality、Imaging Quality、Motion Smoothness、Subject Consistency、Background Consistency。Camera-control accuracy 用 RotErr、TransErr、CamMC,均越低越好。用户研究报告 World-R1 相对 Wan 2.1 的 win rate。
Training config:World-R1-Small 从 Wan2.1-T2V-1.3B 初始化,用 48 张 NVIDIA H200;World-R1-Large 从 Wan2.1-T2V-14B 初始化,用 96 张 NVIDIA H200。训练分辨率 ,81 frames,Flow-GRPO-Fast,48 parallel groups,group size 。发布配置使用 50 denoising steps、guidance scale 5.0、bf16、LoRA、EMA、learning rate 、train.beta=0.004、clip_range=1e-3、每 100 main steps 后 50 dynamic-only steps。
5. Experimental Results (实验结果)
5.1 Main quantitative results
3D consistency(Table 2):World-R1 在 reconstruction-based 3D consistency 上显著超过 base video models。
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| CogVideoX-1.5-5B | 24.44 | 0.783 | 0.242 |
| Wan2.2-T2V-14B | 23.47 | 0.779 | 0.253 |
| Wan2.2-T2V-5B | 22.36 | 0.716 | 0.303 |
| Wan2.1-T2V-14B | 19.76 | 0.629 | 0.405 |
| Wan2.1-T2V-1.3B | 17.40 | 0.550 | 0.467 |
| World-R1-Small | 27.63 | 0.858 | 0.201 |
| World-R1-Large | 27.67 | 0.865 | 0.162 |
相对 Wan2.1-T2V-1.3B,World-R1-Small 的 PSNR 提升 10.23 dB;相对 Wan2.1-T2V-14B,World-R1-Large 的 PSNR 提升 7.91 dB。
VBench general quality(Table 1):World-R1-Small 不仅没有牺牲 general video quality,还超过 Wan2.1-T2V-1.3B backbone。
| Method | Aesthetic ↑ | Imaging ↑ | Motion Smooth. ↑ | Subject Cons. ↑ | Background Cons. ↑ |
|---|---|---|---|---|---|
| CogVideoX-1.5-5B | 62.07 | 65.34 | 98.15 | 96.56 | 96.81 |
| Wan2.1-T2V-1.3B | 62.43 | 66.51 | 97.44 | 96.34 | 97.29 |
| GCD | 38.21 | 41.56 | 98.37 | 88.94 | 92.00 |
| Trajectory-Attention | 38.50 | 51.00 | 98.21 | 90.60 | 92.83 |
| DAS | 39.86 | 51.55 | 99.14 | 90.34 | 92.03 |
| ReCamMaster | 42.70 | 53.97 | 99.28 | 92.05 | 93.83 |
| World-R1-Small | 65.74 | 67.53 | 98.55 | 97.58 | 96.67 |

Figure 5 解读:该 qualitative comparison 把生成帧和对应 3D reconstruction 放在一起看。baseline 在复杂相机运动下会出现物体消失、墙体弯曲、点云稀疏/噪声;World-R1 的重建更密、更结构化,说明它改善的是跨视角一致性,而不仅是单帧纹理。
5.2 Human study and additional metrics
用户研究中,World-R1 相对 Wan 2.1 的 win rate 是:Geometric Consistency 92%,Camera Control Accuracy 76%,Overall Preference 86%。Metric-validation study 中,自动 3D-consistency metric 与人类多数偏好的 agreement 是 91.17%。
Camera-control appendix 结果显示 World-R1-Large 的 RotErr / TransErr / CamMC 为 1.21 / 1.30 / 2.95,优于 ReCamMaster 的 1.53 / 3.12 / 4.17,也优于 CamCloneMaster 的 1.36 / 2.02 / 3.05。MVCS 上,Wan2.1-1.3B 为 0.974,World-R1-Small 为 0.989;Wan2.1-14B 为 0.963,World-R1-Large 为 0.993。
Long-video 121-frame evaluation 中,Wan2.1-T2V-14B 的 PSNR / SSIM / LPIPS 是 18.32 / 0.558 / 0.534,World-R1-Large 是 26.32 / 0.828 / 0.257,说明短视频训练得到的几何对齐能部分迁移到更长 horizon。
Scene-complexity breakdown 显示最难的是 long-horizon / non-rigid 场景,但 World-R1-Small 在所有类别上都显著优于 Wan2.1-1.3B:
| Scene Type | N | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | MVCS ↑ |
|---|---|---|---|---|---|---|
| Static Scene | 30.11% | Wan2.1-1.3B | 20.14 | 0.632 | 0.389 | 0.981 |
| Static Scene | 30.11% | World-R1-Small | 30.52 | 0.912 | 0.142 | 0.994 |
| Single-obj Dynamic | 29.03% | Wan2.1-1.3B | 17.86 | 0.563 | 0.452 | 0.976 |
| Single-obj Dynamic | 29.03% | World-R1-Small | 28.17 | 0.869 | 0.189 | 0.991 |
| Multi-obj Dynamic | 21.51% | Wan2.1-1.3B | 15.23 | 0.487 | 0.528 | 0.968 |
| Multi-obj Dynamic | 21.51% | World-R1-Small | 25.41 | 0.812 | 0.248 | 0.985 |
| Non-rigid Motion | 19.35% | Wan2.1-1.3B | 14.58 | 0.462 | 0.548 | 0.965 |
| Non-rigid Motion | 19.35% | World-R1-Small | 24.73 | 0.793 | 0.267 | 0.982 |
| Long-horizon Dynamics | 12.89% | Wan2.1-1.3B | 12.53 | 0.382 | 0.683 | 0.951 |
| Long-horizon Dynamics | 12.89% | World-R1-Small | 23.59 | 0.781 | 0.299 | 0.974 |
和 3D-conditioned / camera-control methods 的 consolidated comparison 也支持同一结论:World-R1-Small 在 3D consistency 指标上领先,同时没有像 camera-control baselines 那样牺牲 VBench visual quality。
| Type | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | MVCS ↑ | Aesthetic ↑ | BG Cons. ↑ | Subject Cons. ↑ | Motion Smooth. ↑ |
|---|---|---|---|---|---|---|---|---|---|
| 3D-Cond. | ViewCrafter | 23.15 | 0.724 | 0.291 | 0.979 | 55.52 | 92.09 | 94.25 | 97.86 |
| 3D-Cond. | Voyager | 21.38 | 0.678 | 0.334 | 0.975 | 49.80 | 92.31 | 91.55 | 99.39 |
| 3D-Cond. | FlashWorld | 22.46 | 0.702 | 0.312 | 0.977 | 53.72 | 91.88 | 94.44 | 98.81 |
| 3D-Cond. | VerseCrafter | 23.82 | 0.748 | 0.268 | 0.981 | 54.78 | 94.88 | 95.55 | 97.62 |
| Cam. Ctrl. | GCD | 18.26 | 0.582 | 0.438 | 0.966 | 38.21 | 92.00 | 88.94 | 98.37 |
| Cam. Ctrl. | Traj.-Attn. | 18.87 | 0.598 | 0.421 | 0.969 | 38.50 | 92.83 | 90.60 | 98.21 |
| Cam. Ctrl. | DAS | 19.42 | 0.618 | 0.398 | 0.971 | 39.86 | 92.03 | 90.34 | 99.14 |
| Cam. Ctrl. | ReCamMaster | 20.58 | 0.653 | 0.368 | 0.975 | 42.70 | 93.83 | 92.05 | 99.28 |
| Foundation | Wan2.1-1.3B | 17.40 | 0.550 | 0.467 | 0.974 | 62.43 | 97.29 | 96.34 | 97.44 |
| Ours | World-R1-Small | 27.63 | 0.858 | 0.201 | 0.989 | 65.74 | 96.67 | 97.58 | 98.55 |
Figure 6 解读:该图展示 World-R1 生成视频可以被恢复成较密集、干净的 3D scene representation,说明视频帧之间携带了一致的多视角几何信息。
Figure 7 解读:该图展示 Wan 2.1 类 baseline 视频导致的 3D reconstruction failure:点云稀疏、噪声多、结构无法闭合。它直观解释了为什么只看 generated frames 不够,必须看视频是否能支撑稳定的 3D reconstruction。
5.3 Ablation findings
Dataset scaling 在 World-R1-Small 上呈单调提升:1K prompts 得到 PSNR 25.82、SSIM 0.812、LPIPS 0.258、VBench AVG 83.23;2K 为 26.54 / 0.839 / 0.223 / 84.76;3K 为 27.63 / 0.858 / 0.201 / 85.21。
Reward component ablation 显示三个 3D reward 都有用:Full pipeline 是 PSNR 27.63、SSIM 0.858、LPIPS 0.201、VBench AVG 85.21;去掉 后为 26.91 / 0.841 / 0.218 / 83.67;去掉 后为 25.14 / 0.798 / 0.271 / 84.35;去掉 后为 26.27 / 0.829 / 0.237 / 84.53。
Training/conditioning ablation 更能说明 trade-off:去掉 noise wrapping 后 PSNR 降到 24.46、VBench AVG 降到 76.39,说明相机先验 latent initialization 对收敛和 trajectory alignment 很关键;去掉 periodic decoupled training 后 PSNR 反而到 27.89、SSIM 0.898、LPIPS 0.192,但 VBench AVG 降到 82.64,说明模型更 rigid、更易重建,但 general/dynamic quality 变差;去掉 3D-aware reward 后 PSNR 只有 18.93、SSIM 0.502、LPIPS 0.496,几何正则基本失效;去掉 general reward 后 VBench AVG 下降到 83.44。
Figure 8 解读:两组曲线分别跟踪 general generation reward 和 3D-aware reward。它显示 、、noise wrapping、periodic decoupled training 不是互相替代的模块:3D reward 负责几何,general reward 负责美学/画质,noise wrapping 提供相机运动先验,dynamic-only phase 抑制过刚性。
5.4 Limitations and conclusion
作者明确指出两类限制:第一,video RL 训练成本仍然高,因为 online RL 需要反复 rollout 视频并做 reward evaluation,尤其 3D reconstruction / VLM scoring 昂贵;第二,World-R1 仍受 base video foundation model 能力上限限制,dense multi-object composition、fine-grained non-rigid motion、hand dynamics、very long-horizon scene evolution 仍可能继承 base model artifacts。
总体结论是:World-R1 证明了用 RL post-training 对齐 3D constraints 是可行的。它不需要改 T2V 架构,也不依赖大规模 3D supervised videos,却能显著改善 3D consistency、camera control 和用户偏好,同时保持甚至提升 VBench general quality。