VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
Paper: arXiv:2603.26599 / PDF / HF Paper
Code: no open-source implementation found by code search
Code reference: N/A; a search of GitHub, CatalyzeX, HF, and the project page found no public algorithm implementation as of 2026-05-16.
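1. Motivation
Large-scale video diffusion and rectified-flow models can already generate videos of high visual quality, but they still fail on world consistency: within a single scene, frames can show geometry drift, camera jitter, broken structures, or abrupt scene changes. For embodied AI, physics-aware simulation, and robotics data generation, a video must not only look good but also be self-consistent in its camera motion and 3D/4D scene structure.
Existing approaches leave two main gaps:
- Architecture changes or added conditioning modules: introducing point cloud, depth, or camera conditioning, for example, improves local 3D consistency, but adds architectural complexity and may weaken the generalization of internet-scale pretrained video models.
- RGB-space geometry rewards / DPO: methods such as Epipolar-DPO and VideoGPA must repeatedly VAE-decode video latents into RGB and then run a geometry model. This is expensive, exposes the reward model to the distribution shift of generated RGB, and relies on static-geometry assumptions that often break on real dynamic scenes.
The core question of the paper: without modifying the base video generator and without repeatedly decoding to RGB, can a 4D geometry reward usable for RL post-training be built directly on video latents?
Figure 1 (interpretation): the teaser compares the baseline with the VGGRPO-aligned model. The top row shows the inferred 4D scene representation / reconstructed geometry, and the bottom row shows representative keyframes. The baseline's geometry and camera trajectory drift more readily, while VGGRPO uses a latent-space geometry reward to keep structure and camera motion more stable in dynamic scenes.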
2. Idea
Core insight: connect the video diffusion latent directly to the intermediate feature space of a 4D geometry foundation model. A Latent Geometry Model (LGM) then outputs camera, depth, point map, and scene flow in latent space, and these geometric predictions are turned into a GRPO reward.
This addresses three things at once:
- reward computation no longer needs repeated VAE decoding, which lowers the time and memory cost of group-based online RL;
- the reward model's input changes from RGB frames to diffusion latents, which mitigates the distribution gap between generated RGB and a geometry model trained on real images;
- with a geometry model such as Any4D that supports dynamic 4D reconstruction, the reward covers dynamic scenes rather than only static multi-view geometry.
In one sentence: VGGRPO is a latent geometry-guided video post-training framework that runs Group Relative Policy Optimization with two latent rewards, camera motion smoothness and geometry reprojection consistency, to align a pretrained video generator toward 4D world-consistent generation.
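3. Method
3.1 Overall framework
Figure 2 (interpretation): the method has two parts. On the left, the LGM replaces the geometry model's RGB input pathway with the latent from the video VAE encoder, using a lightweight 3D convolutional connector to align it with an intermediate layer of the geometry foundation model. On the right, VGGRPO samples a group of videos along the latent denoising trajectory, estimates 4D geometry directly with the LGM, and uses camera motion smoothness and reprojection consistency as the GRPO reward.
Intuitively, the LGM gives the diffusion latent a "geometry readout head": it does not need to decode to RGB first and does not require changes to the video generator's backbone. As long as camera, depth, point map, and scene flow can be read reliably from the denoised latent, geometric errors can be turned into a reward. GRPO then uses the relative quality of multiple samples for the same prompt to update a LoRA policy, so no separate critic has to be trained.
3.2 Latent Geometry Model (LGM)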
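Let the video VAE encoder be $\mathcal{E}$, which encodes a video $x$ into a latent $z = \mathcal{E}(x)$. The original geometry model $\Phi$ maps an RGB sequence of $T$ frames to per-frame geometry (notation here follows the pseudocode in Section 3.6):

$$\Phi(x) = \{(C_i, D_i, P_i, F_i)\}_{i=1}^{T}$$

VGGRPO replaces the first $\hat{\ell}$ layers of $\Phi$ with a connector $S_\psi$ and trains it by feature stitching, where $\Phi_{\le \hat{\ell}}(x)$ denotes the features of $\Phi$ at layer $\hat{\ell}$:

$$\min_{\psi}\ \mathbb{E}_{x}\left[\left\| S_\psi(\mathcal{E}(x)) - \Phi_{\le \hat{\ell}}(x) \right\|_2^2\right]$$

After training, the LGM outputs the 4D geometry used for rewards directly from the latent, with $\Phi_{> \hat{\ell}}$ denoting the remaining layers of $\Phi$:

$$\{(C_i, D_i, P_i, F_i)\}_{i=1}^{T} = \Phi_{> \hat{\ell}}\big(S_\psi(z)\big)$$

where $C_i$ are the camera parameters, $D_i$ is the depth, $P_i$ is the world-frame point map, and $F_i$ is the scene flow of frame $i$. Scene flow lets dynamic regions be filtered out or handled separately, which makes this reward better suited to dynamic scenes than a static-only epipolar reward.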
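To make the role of scene flow concrete, here is a minimal sketch of a static/dynamic split based on flow magnitude. The function name `static_mask` and the threshold `tau` are hypothetical; these notes do not state the filtering rule the paper actually uses.

```python
import math


def static_mask(scene_flow, tau=0.05):
    """Mark points as static when their 3D flow magnitude is below tau.

    scene_flow: list of (dx, dy, dz) per-point displacements between frames.
    tau: hypothetical threshold, in the same units as the point map.
    """
    return [math.sqrt(dx * dx + dy * dy + dz * dz) < tau for dx, dy, dz in scene_flow]


flow = [(0.0, 0.0, 0.0), (0.01, 0.02, 0.0), (0.3, 0.0, 0.1)]
mask = static_mask(flow)
print(mask)  # [True, True, False]
```

Only the points marked `True` would be aggregated into the static scene point cloud used by the reprojection reward.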
3.3 Camera Motion Smoothness Reward
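The LGM predicts camera poses $\{(R_i, \mathbf{c}_i)\}_{i=1}^{T}$ from the denoised video latent, where $R_i$ is the rotation and $\mathbf{c}_i$ the camera center of frame $i$ (notation follows the pseudocode in Section 3.6). From the camera centers, form the velocity $\mathbf{v}_i = \mathbf{c}_{i+1} - \mathbf{c}_i$ and the acceleration $\mathbf{a}_i = \mathbf{v}_{i+1} - \mathbf{v}_i$, and define the translational jitter error:

$$e_{\text{trans}} = \frac{1}{T-2} \sum_{i=1}^{T-2} \frac{\|\mathbf{a}_i\|}{\|\mathbf{v}_{i+1}\| + \|\mathbf{v}_i\| + \varepsilon}$$

Rotational smoothness is analogous, with the angular velocity $\boldsymbol{\omega}_i = \log\!\big(R_i^{\top} R_{i+1}\big)^{\vee}$ and the angular acceleration $\boldsymbol{\alpha}_i = \boldsymbol{\omega}_{i+1} - \boldsymbol{\omega}_i$:

$$e_{\text{rot}} = \frac{1}{T-2} \sum_{i=1}^{T-2} \frac{\|\boldsymbol{\alpha}_i\|}{\|\boldsymbol{\omega}_{i+1}\| + \|\boldsymbol{\omega}_i\| + \varepsilon}$$

The final motion reward is the average of the two smoothness scores:

$$r_{\text{motion}} = \frac{1}{2}\left(\frac{1}{1 + e_{\text{trans}}} + \frac{1}{1 + e_{\text{rot}}}\right)$$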
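As a small worked example of the translational part, the sketch below computes the normalized-acceleration error for two toy camera paths in plain Python. The helper names (`translation_jitter`, `smoothness_score`) and the toy trajectories are illustrative assumptions, and the formula mirrors the Section 3.6 pseudocode rather than verified author code.

```python
import math


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def _diff(seq):
    # Finite differences between consecutive 3D vectors.
    return [tuple(b - a for a, b in zip(p, q)) for p, q in zip(seq[:-1], seq[1:])]


def translation_jitter(centers, eps=1e-8):
    """Mean normalized acceleration of camera centers (needs at least 3 frames)."""
    v = _diff(centers)
    a = _diff(v)
    terms = [_norm(a[i]) / (_norm(v[i + 1]) + _norm(v[i]) + eps) for i in range(len(a))]
    return sum(terms) / len(terms)


def smoothness_score(error):
    # Maps an error in [0, inf) to a score in (0, 1].
    return 1.0 / (1.0 + error)


steady = [(float(t), 0.0, 0.0) for t in range(5)]               # constant velocity
shaky = [(float(t), 0.5 * (-1) ** t, 0.0) for t in range(5)]    # zig-zag sideways
print(smoothness_score(translation_jitter(steady)))  # 1.0
print(smoothness_score(translation_jitter(shaky)))
```

A constant-velocity path has zero acceleration and therefore the maximum score of 1.0, while the zig-zag path scores noticeably lower. Dividing by the velocity magnitudes makes the error insensitive to the overall speed of the camera.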
3.4 Geometry Reprojection Consistency Reward
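The LGM predicts point maps $P_i$, depths $D_i$, camera parameters $C_i$, and scene flow $F_i$. The method first builds a scene point cloud from $\{P_i\}$: static scenes aggregate all frames, while dynamic scenes use $F_i$ to filter out dynamic regions and aggregate only stable static points. The point cloud is then projected into each view $i$ to obtain a rendered depth $\tilde{D}_i$, which is compared with the predicted depth $D_i$:

$$e_i = \frac{1}{|\Omega_i|} \sum_{p \in \Omega_i} \left| \tilde{D}_i(p) - D_i(p) \right|$$

where $\Omega_i$ is the set of validly projected pixels in view $i$. To focus on local failure cases, the reward is the negative mean over the 3 worst views, with $\mathcal{W}_3$ the indices of the three largest $e_i$:

$$r_{\text{geo}} = -\frac{1}{3} \sum_{i \in \mathcal{W}_3} e_i$$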
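The effect of the worst-3 reduction is easiest to see on numbers. The sketch below compares it with a plain mean over views; the per-view errors are made-up values for illustration and `geometry_reward` is a hypothetical helper name.

```python
def geometry_reward(per_view_errors, k=3):
    """Negative mean of the k largest per-view depth reprojection errors."""
    worst = sorted(per_view_errors, reverse=True)[:k]
    return -sum(worst) / len(worst)


# Three views are fine, three views have a local geometry failure.
errors = [0.02, 0.03, 0.40, 0.02, 0.35, 0.30]
print(geometry_reward(errors))       # about -0.35
print(-sum(errors) / len(errors))    # plain mean over views, about -0.187
```

Averaging over all views dilutes a failure that only affects a few frames, whereas the worst-3 reduction keeps the penalty tied to the bad views.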
3.5 Latent-space GRPO Objective
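For each prompt, $G$ denoising trajectories are sampled. Standard GRPO standardizes the advantage of sample $k$ using the group rewards:

$$A^{k} = \frac{r^{k} - \operatorname{mean}\big(\{r^{j}\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r^{j}\}_{j=1}^{G}\big)}$$

VGGRPO standardizes the motion reward and the geometry reward separately, then averages them:

$$A^{k} = \frac{1}{2}\left(A^{k}_{\text{motion}} + A^{k}_{\text{geo}}\right), \qquad A^{k}_{m} = \frac{r^{k}_{m} - \operatorname{mean}\big(\{r^{j}_{m}\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r^{j}_{m}\}_{j=1}^{G}\big)}, \quad m \in \{\text{motion}, \text{geo}\}$$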
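The per-reward normalization can be sketched in a few lines. This is an illustration under assumptions: the population standard deviation and the `eps` term are choices made here, not values confirmed by the paper, and the reward numbers are invented.

```python
import statistics


def normalize(rewards, eps=1e-8):
    # Standardize rewards within one prompt group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def combined_advantage(r_motion, r_geo):
    # Normalize each reward separately, then average the two advantages.
    a_motion, a_geo = normalize(r_motion), normalize(r_geo)
    return [0.5 * (m + g) for m, g in zip(a_motion, a_geo)]


r_motion = [0.90, 0.80, 0.70, 0.60]      # scores in (0, 1]
r_geo = [-0.10, -0.40, -0.20, -0.30]     # negative depth errors
adv = combined_advantage(r_motion, r_geo)
print([round(a, 3) for a in adv])
```

Because each reward is standardized on its own, the two terms contribute equally even though they live on different scales; summing the raw rewards would let whichever has the larger spread dominate.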
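The policy ratio at denoising step $t$ for trajectory $k$, given prompt $y$, is:

$$\rho^{k}_{t}(\theta) = \frac{\pi_{\theta}\big(z^{k}_{t-1} \mid z^{k}_{t}, y\big)}{\pi_{\theta_{\text{old}}}\big(z^{k}_{t-1} \mid z^{k}_{t}, y\big)}$$

The clipped VGGRPO objective, with $G$ the group size, $K$ the number of denoising steps, $A^{k}$ the combined advantage of trajectory $k$, $\epsilon$ the clipping range, and $\beta$ the KL weight toward the reference policy $\pi_{\text{ref}}$ (form consistent with the pseudocode in Section 3.6):

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{k=1}^{G}\frac{1}{K}\sum_{t=1}^{K}\Big(\min\big(\rho^{k}_{t}(\theta)\,A^{k},\ \operatorname{clip}\big(\rho^{k}_{t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A^{k}\big) - \beta\, D_{\mathrm{KL}}\big(\pi_{\theta}\,\|\,\pi_{\text{ref}}\big)\Big)\right]$$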
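The clipping behaviour of the surrogate can be checked on single values. The sketch below evaluates one (ratio, advantage) term; `clipped_term` is a hypothetical helper, and the default `clip_eps=1e-3` simply mirrors the default used in the Section 3.6 pseudocode.

```python
def clipped_term(ratio, advantage, clip_eps=1e-3):
    """One PPO/GRPO-style clipped surrogate term."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_term(1.01, 2.0))    # A > 0, ratio too high: capped at (1 + eps) * A
print(clipped_term(0.99, -2.0))   # A < 0, ratio too low: capped at (1 - eps) * A
print(clipped_term(0.99, 2.0))    # A > 0, ratio low: unclipped, ratio * A
```

Taking the minimum means the objective never rewards moving the ratio further outside the trust region in the direction the advantage favours, while still penalizing moves in the wrong direction.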
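3.6 Paper pseudocode (unofficial)
No open-source implementation was found. The pseudocode below is reconstructed from the paper's equations, appendix listing, and method prose, and does not represent the authors' source code.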
A. LGM feature stitching training

```python
import torch
import torch.nn.functional as F
from torch import nn


class LatentGeometryConnector(nn.Module):
    """3D-conv connector S_psi mapping VAE latents to geometry-model features."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.proj = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(5, 5, 5), stride=(1, 2, 2), padding=(2, 2, 2)
        )

    def forward(self, video_latents):
        return self.proj(video_latents)


def train_lgm_connector(vae_encoder, geometry_model, connector, videos, optimizer):
    # No gradients flow through the VAE encoder or the target features.
    with torch.no_grad():
        z = vae_encoder(videos)  # z = E(x)
        # `forward_to_layer` is a placeholder for reading the layer-ell_hat features.
        target_feat = geometry_model.forward_to_layer(videos, layer="ell_hat")
    pred_feat = connector(z)  # S_psi(E(x))
    loss = F.mse_loss(pred_feat, target_feat)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(connector.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.detach()
```

B. Latent geometry rewards
```python
import torch


def camera_motion_reward(camera_poses):
    # `camera_poses` is assumed to expose centers and rotations for one video.
    centers = camera_poses.camera_centers_world()  # [T, 3]
    rotations = camera_poses.rotations()           # [T, 3, 3]

    # Translation: finite-difference velocity and acceleration of the camera center.
    v = centers[1:] - centers[:-1]
    a = v[1:] - v[:-1]
    e_trans = (a.norm(dim=-1) / (v[1:].norm(dim=-1) + v[:-1].norm(dim=-1) + 1e-8)).mean()

    # Rotation: relative rotation R_i^T R_{i+1} mapped to axis-angle.
    # `so3_log` is a helper that is not defined in this listing.
    omega = so3_log(torch.matmul(rotations[:-1].transpose(-1, -2), rotations[1:]))
    alpha = omega[1:] - omega[:-1]
    e_rot = (alpha.norm(dim=-1) / (omega[1:].norm(dim=-1) + omega[:-1].norm(dim=-1) + 1e-8)).mean()
    return 0.5 * (1.0 / (1.0 + e_trans) + 1.0 / (1.0 + e_rot))


def geometry_reprojection_reward(pointmaps, depths, cameras, scene_flow, topk=3):
    # `aggregate_static_scene_points` and `render_depth` are placeholder helpers.
    static_points = aggregate_static_scene_points(pointmaps, scene_flow)
    errors = []
    for i, camera in enumerate(cameras):
        rendered_depth, valid = render_depth(static_points, camera)
        err = (rendered_depth[valid] - depths[i][valid]).abs().mean()
        errors.append(err)
    # Clamp k so clips with fewer than `topk` views do not raise.
    k = min(topk, len(errors))
    worst = torch.stack(errors).topk(k=k, largest=True).values
    return -worst.mean()
```

C. VGGRPO policy update
```python
import torch


def vggrpo_update(policy, old_policy, ref_policy, lgm, prompts, optimizer, group_size=64,
                  clip_eps=1e-3, beta=0.004):
    # Sample a group of denoising trajectories per prompt and take the final latents.
    trajectories = policy.sample_latent_trajectories(prompts, group_size=group_size)
    z0 = trajectories.final_latents()

    # Rewards are read from latents by the LGM, with no VAE decoding.
    # Both reward helpers are assumed to be batched here: one value per sample, shape [N].
    geom = lgm(z0)  # cameras, depths, pointmaps, scene_flow
    r_motion = camera_motion_reward(geom.cameras)
    r_geo = geometry_reprojection_reward(geom.pointmaps, geom.depths, geom.cameras, geom.scene_flow)

    # Normalize each reward within its prompt group, then average the advantages.
    adv = 0.5 * (normalize_by_prompt_group(r_motion) + normalize_by_prompt_group(r_geo))

    # Per-step importance ratios, shape [N, num_steps].
    log_ratios = policy.step_logprobs(trajectories) - old_policy.step_logprobs(trajectories)
    ratios = log_ratios.exp()
    clipped = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps)
    pg = torch.minimum(ratios * adv[:, None], clipped * adv[:, None]).mean()
    kl = closed_form_step_kl(policy, ref_policy, trajectories).mean()

    loss = -(pg - beta * kl)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.lora_parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return {"loss": loss.detach(), "r_motion": r_motion.mean(), "r_geo": r_geo.mean(), "kl": kl.detach()}
```

D. Test-time latent reward guidance (appendix)
```python
import torch


def reward_guided_sampling(model, lgm, latents, prompt_embeds, timesteps, dts,
                           reward_guidance_scale, reward_weights, guidance_interval):
    for i, t in enumerate(timesteps):
        # Re-attach gradients each step so the reward can be differentiated w.r.t. the latents.
        latents = latents.detach().requires_grad_(True)
        v_pred = model(latents, t, prompt_embeds)
        if i in guidance_interval:
            geom = lgm(latents)
            reward_smooth = camera_motion_reward(geom.cameras)
            reward_geo = geometry_reprojection_reward(geom.pointmaps, geom.depths, geom.cameras, geom.scene_flow)
            reward = reward_weights["smooth"] * reward_smooth + reward_weights["geo"] * reward_geo
            grad = torch.autograd.grad(reward, latents)[0]
            # Subtracting from the velocity moves the latents along +grad in the update below.
            # The factor t / (1 - t) is undefined at t = 1, so guidance_interval must skip that step.
            v_pred = v_pred - reward_guidance_scale * t / (1.0 - t) * grad
        latents = latents - dts[i] * v_pred
    return latents.detach()
```

3.7 Code-to-paper mapping
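No open-source implementation was found, so the table below is an audit-style mapping from paper components to expected implementation artifacts, not a verification against the authors' source. If code is released in the future, replace this table with the actual files/classes/functions and set a real github_ref=<branch>@<short_sha> (date).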
| Paper concept | Paper location | Expected implementation artifact | Current verification status |
|---|---|---|---|
| Latent Geometry Model stitching | Eq. stitching / Fig. 2(a) | VAE encoder wrapper, 3D Conv connector, geometry FM middle-layer feature extraction, MSE feature loss | No public code; only paper equations + source TeX/PDF verified |
| LGM outputs | Eq. latent_reward_outputs | lgm.forward(latents) -> cameras, depths, pointmaps, scene_flow | No public code |
| Camera motion smoothness reward | Eq. trans_smooth, rot_smooth, motion_reward | camera-center velocity/acceleration and SO(3) angular acceleration reward | No public code |
| Geometry reprojection consistency reward | Eq. depth_reproj_per_view, geometry_reward | pointmap aggregation, scene-flow static filtering, depth rasterization/reprojection, worst-3 view reduction | No public code |
| Latent-space GRPO | Eq. combined_adv, latent_grpo_obj | group sampling, reward normalization, ratio clipping, closed-form KL, LoRA update | No public code |
| Test-time reward guidance | Appendix listing | differentiable reward gradient through LGM modifies velocity field | No public code |
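The rotation part of the camera motion reward in the Section 3.6 pseudocode calls an `so3_log` helper that is not defined there. As a reference point, here is a minimal single-matrix sketch of the SO(3) log map in plain Python. It is an assumption about what such a helper could look like, not the authors' implementation; a real version would be batched over tensors and would treat rotation angles near $\pi$ separately.

```python
import math


def so3_log(R):
    """Axis-angle vector of a 3x3 rotation matrix given as nested lists.

    Valid for rotation angles in [0, pi); angles at pi need special handling.
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    cos_theta = max(-1.0, min(1.0, 0.5 * (trace - 1.0)))
    theta = math.acos(cos_theta)
    # Vee of the skew-symmetric part R - R^T, equal to 2 * sin(theta) * axis.
    w = (R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1])
    if theta < 1e-8:
        # Small-angle limit: theta / (2 * sin(theta)) tends to 1/2.
        return tuple(0.5 * x for x in w)
    scale = theta / (2.0 * math.sin(theta))
    return tuple(scale * x for x in w)


def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


print(so3_log(rot_z(0.3)))  # approximately (0.0, 0.0, 0.3)
```

Applied to the relative rotation between consecutive frames, the returned vector plays the role of the per-frame angular velocity, and its finite difference gives the angular acceleration.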
4. Setup
4.1 Training configuration
Verifiable sources: arXiv PDF / TeX source. No author-released launch script or config was found, so the training numbers cannot be cross-checked against source code.
| Component | Setting |
|---|---|
| Geometry FM | Any4D (supports dynamic 4D reconstruction); the ablation also compares VGGT |
| LGM training data | base diffusion model generated videos + DL3DV + RealEstate10K + MiraData real videos |
| LGM optimizer | AdamW, learning rate , no weight decay |
| LGM schedule | 20 epochs, cosine decay, first 100 optimization steps linear warmup |
| LGM gradient clipping | max norm 1.0 |
| Connector | 3D conv, kernel $5 \times 5 \times 5$, stride $(1, 2, 2)$, padding $(2, 2, 2)$ (values as in the Section 3.6 pseudocode) |
| LGM LoRA | rank , scaling |
| VGGRPO backbones | Wan2.1-1B, Wan2.2-5B |
| VGGRPO LoRA | rank , scaling |
| VGGRPO group size | $G = 64$ (default in the Section 3.6 pseudocode) |
| VGGRPO optimizer | AdamW, learning rate , weight decay |
| VGGRPO clipping / KL | $\epsilon = 10^{-3}$, $\beta = 0.004$ (defaults in the Section 3.6 pseudocode) |
| VGGRPO gradient clipping | max norm 1.0 |
| Training compute | approximately 1536 GPU hours |
| Denoising reduction | training sample schedule example vs. inference |
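To see what the connector does to tensor shapes, the sketch below applies the standard convolution output-size formula with the kernel, stride, and padding used in the Section 3.6 pseudocode. The latent size `(21, 60, 104)` is a hypothetical example, not a value taken from the paper.

```python
def conv3d_output_shape(size, kernel=(5, 5, 5), stride=(1, 2, 2), padding=(2, 2, 2)):
    """Output (T, H, W) of a 3D convolution: floor((n + 2p - k) / s) + 1 per axis."""
    return tuple((n + 2 * p - k) // s + 1 for n, k, s, p in zip(size, kernel, stride, padding))


# Hypothetical latent of 21 frames at 60 x 104 spatial resolution.
print(conv3d_output_shape((21, 60, 104)))  # (21, 30, 52)
```

With these settings the temporal length is preserved and each spatial dimension is halved, which is how the connector can bring a VAE latent grid onto the token grid of an intermediate geometry-model layer.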
4.2 Baselines and metrics
Baselines: Base Model, Supervised Fine-Tuning (SFT), Epipolar-DPO, and VideoGPA. Evaluation covers a static split, a dynamic split, and general VBench captions.
Main metrics:
- Static: VideoReward Visual Quality (VQ↑), Motion Quality (MQ↑), and Sampson epipolar error (Epi.↓).
- Dynamic: VideoReward VQ↑ and MQ↑.
- VBench: Subject Consistency, Background Consistency, Aesthetic Quality, Imaging Quality, Motion Smoothness, and Dynamic Degree.
4.3 Open-source code search and github_ref
Conclusion: no open-source implementation was found.
Search log (2026-05-16):
- The Hugging Face paper page links only to the arXiv PDF and the project page; there is no model/dataset/space code link.
- The code/data/media section of the arXiv page lists no author implementation.
- The project page has only Paper and LinkedIn post links; no GitHub implementation is given.
- The CatalyzeX page shows paper metadata and the project page, but no directly reachable implementation URL.
- GitHub API / web search queries: VGGRPO, "Visual Geometry GRPO", "Towards World-Consistent Video Generation", "2603.26599", "Latent Geometry Model" "VGGRPO". Results were paper lists, blogs, and the project-page source repo (e.g. ZhaochongAn/ZhaochongAn.github.io); no algorithm source repo was found.
5. Results
5.1 Main results
| Base backbone | Method | Static VQ↑ | Static MQ↑ | Static Epi.↓ | Dynamic VQ↑ | Dynamic MQ↑ | Sub. Cons.↑ | Bg. Cons.↑ | Aes. Qual.↑ | Img. Qual.↑ | Mot. Smooth.↑ | Dyn. Deg.↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wan2.1-1B | Base | - | - | 0.133 | - | - | 0.7941 | 0.8930 | 0.5233 | 0.6178 | 0.9552 | 0.9231 |
| Wan2.1-1B | SFT | 45.26 | 46.84 | 0.137 | 40.00 | 39.00 | 0.8032 | 0.8896 | 0.5472 | 0.6256 | 0.9646 | 0.8795 |
| Wan2.1-1B | Epipolar-DPO | 54.21 | 55.79 | 0.098 | 45.50 | 43.00 | 0.8125 | 0.8916 | 0.5578 | 0.6461 | 0.9671 | 0.8816 |
| Wan2.1-1B | VideoGPA | 53.68 | 56.32 | 0.105 | 42.50 | 41.00 | 0.8068 | 0.8931 | 0.5562 | 0.6507 | 0.9650 | 0.8734 |
| Wan2.1-1B | VGGRPO | 59.47 | 66.84 | 0.102 | 57.00 | 63.00 | 0.8255 | 0.8974 | 0.5623 | 0.6585 | 0.9753 | 0.9048 |
| Wan2.2-5B | Base | - | - | 0.142 | - | - | 0.8151 | 0.8958 | 0.4837 | 0.6402 | 0.9467 | 0.8692 |
| Wan2.2-5B | SFT | 46.32 | 52.63 | 0.129 | 33.00 | 51.00 | 0.8323 | 0.8925 | 0.4886 | 0.6159 | 0.9548 | 0.9026 |
| Wan2.2-5B | Epipolar-DPO | 52.11 | 58.95 | 0.101 | 38.00 | 54.50 | 0.8407 | 0.9054 | 0.4945 | 0.6275 | 0.9482 | 0.7603 |
| Wan2.2-5B | VideoGPA | 54.74 | 60.53 | 0.098 | 40.00 | 54.00 | 0.8511 | 0.9048 | 0.4920 | 0.6131 | 0.9518 | 0.7645 |
| Wan2.2-5B | VGGRPO | 62.63 | 68.42 | 0.093 | 56.50 | 66.00 | 0.8672 | 0.9056 | 0.5094 | 0.6843 | 0.9619 | 0.8421 |
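Main takeaway: VGGRPO clearly improves Static and Dynamic VQ/MQ on both backbones, and on Wan2.2-5B it reaches the lowest Static Epi. (0.093). On Wan2.1-1B its Static Epi. (0.102) is slightly behind Epipolar-DPO (0.098) but still ahead of the Base and SFT models. Compared with Epipolar-DPO and VideoGPA, which rely more heavily on static-scene assumptions, VGGRPO's advantage is larger on the dynamic split: on Wan2.2-5B, Dynamic MQ rises from 54.00 (VideoGPA) to 66.00.
Figure 3 (interpretation): the qualitative comparison shows first/middle/last frames for static and dynamic prompts. The baseline and the DPO baselines show temporal flicker, geometric drift, or camera instability; with VGGRPO the scene structure is more continuous and the camera trajectory smoother.
5.2 Ablation: geometry FM and reward terms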
| Study | Variant | VQ↑ | MQ↑ | Epi.↓ |
|---|---|---|---|---|
| Geometry FM | VGGT | 54.96 | 60.61 | 0.090 |
| Geometry FM | Any4D | 59.57 | 67.21 | 0.093 |
| Reward terms | $r_{\text{motion}}$ only | 55.60 | 63.40 | 0.104 |
| Reward terms | $r_{\text{motion}} + r_{\text{geo}}$ | 59.57 | 67.21 | 0.093 |
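Any4D's dynamic 4D reconstruction capability yields higher VQ/MQ, while VGGT is slightly better on static epipolar error (0.090 vs. 0.093). The reward ablation shows that the motion reward alone stabilizes the camera, but adding reprojection consistency improves both geometric and perceptual quality.
Figure 4 (interpretation): when only the motion reward $r_{\text{motion}}$ is optimized, the camera trajectory becomes smoother but the scene structure still has local geometric artifacts. Adding the geometry reward $r_{\text{geo}}$ makes the reconstructed geometry more consistent, which indicates that the two rewards are complementary.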
5.3 Generalization and efficiency
VBench generalization (standard VBench captions):
| Model | Sub. Cons.↑ | Bg. Cons.↑ | Aes. Qual.↑ | Img. Qual.↑ | Mot. Smooth.↑ | Dyn. Deg.↑ |
|---|---|---|---|---|---|---|
| Baseline | 0.9542 | 0.9528 | 0.5966 | 0.6733 | 0.9841 | 0.4237 |
| VGGRPO | 0.9644 | 0.9583 | 0.5991 | 0.6861 | 0.9895 | 0.3962 |
Efficiency (reward computation, batch size 4):
| Reward | Time↓ | Peak Mem↓ |
|---|---|---|
| RGB-based | 54.73 s | 76.80 GB |
| Latent reward (VGGRPO) | 41.33 s | 68.57 GB |
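Efficiency takeaway: the latent reward is 13.40 s faster than the RGB-based reward, a 24.5% time reduction, and peak GPU memory drops from 76.80 GB to 68.57 GB. Dynamic Degree is not the highest (0.3962 vs. 0.4237 for the baseline on VBench). The authors attribute this to the motion reward reducing camera jitter, which lowers the RAFT optical-flow magnitude; this does not mean the videos are less dynamic, since Motion Quality and Motion Smoothness improve at the same time.
Figure 5 (interpretation): the supplementary analysis tests the robustness of geometry estimation under latent perturbation. The RGB-based geometry model degrades quickly even when the decoded frames change very little visually; the LGM, trained directly on generated latents, is more stable under latent noise and distribution shift.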
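The efficiency figures can be re-derived from the table in a couple of lines. This is only an arithmetic check on the reported numbers; `relative_reduction` is a helper introduced here for illustration.

```python
def relative_reduction(before, after):
    return (before - after) / before


time_rgb, time_latent = 54.73, 41.33   # seconds, batch size 4
mem_rgb, mem_latent = 76.80, 68.57     # GB, peak GPU memory

print(f"time saved: {time_rgb - time_latent:.2f} s ({relative_reduction(time_rgb, time_latent):.1%})")
# time saved: 13.40 s (24.5%)
print(f"memory saved: {mem_rgb - mem_latent:.2f} GB ({relative_reduction(mem_rgb, mem_latent):.1%})")
# memory saved: 8.23 GB (10.7%)
```

The memory saving is proportionally smaller than the time saving, roughly 11% against 24%.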
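5.4 Limitations and caveats
- The currently public material includes no source code, so implementation details such as the training script, the actual reward weights, and the sampling schedule cannot be verified. The config numbers in these notes come from the paper TeX/PDF, not from a launch config.
- The LGM is bounded by the capability of the geometry foundation model. Any4D is better suited to dynamic scenes than VGGT, but it can still be affected by errors in scene flow and static filtering.
- The reward mainly constrains camera smoothness and geometric reprojection. It does not directly guarantee object identity, semantic consistency, or physical validity; these may need additional rewards or a world-state representation.
- The 1536 GPU hours show that online video RL remains expensive. The latent reward lowers the cost of reward computation but does not remove the overall training cost of group sampling.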