VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Paper: arXiv:2603.26599 / PDF / HF Paper. Code: a code search found no open-source implementation. Code reference: N/A; a GitHub/CatalyzeX/HF/project-page search found no public algorithm implementation as of 2026-05-16.

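1. Motivation

Large-scale video diffusion / rectified-flow models can already generate videos of high visual quality, but they still fail on "world consistency": across frames of the same scene they show geometry drift, camera jitter, broken structures, or abrupt scene changes. For embodied AI, physics-aware simulation, and robotics data generation, a video must not only look good but also be self-consistent in camera motion and 3D/4D scene structure.

Existing approaches leave two main gaps:

Architecture changes / added conditioning modules: introducing point cloud / depth / camera conditioning, for example, improves local 3D consistency, but it adds structural complexity and may weaken the generalization of internet-scale pretrained video models.
RGB-space geometry reward / DPO: methods such as Epipolar-DPO and VideoGPA must repeatedly VAE-decode video latents to RGB and then run a geometry model. This is expensive, exposes the reward model to the distribution shift of generated RGB, and relies on static-geometry assumptions, many of which cannot handle real dynamic scenes.

The core question of the paper is: without changing the base video generator and without repeatedly decoding to RGB, can a 4D geometry reward usable for RL post-training be built directly on video latents?

Reading Figure 1: the teaser compares the baseline with the VGGRPO-aligned model. The top row shows the inferred 4D scene representation / reconstructed geometry, and the bottom row shows representative keyframes. The baseline's geometry and camera trajectory drift more easily, while VGGRPO uses a latent-space geometry reward to make structure and camera motion more stable in dynamic scenes.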

2. Idea

Core insight: connect the video diffusion latent directly to the intermediate feature space of a 4D geometry foundation model, use a Latent Geometry Model (LGM) to output camera, depth, point map, and scene flow in latent space, and then turn these geometry predictions into a GRPO reward.

This addresses three things at once:

reward computation no longer requires repeated VAE decoding, which lowers the time and memory cost of group-based online RL;
the reward model's input changes from RGB frames to diffusion latents, which mitigates the distribution gap between generated RGB and geometry models trained on real images;
with a geometry model that supports dynamic 4D reconstruction, such as Any4D, the reward covers dynamic scenes instead of being valid only for static multi-view geometry.

In one sentence: VGGRPO is a latent geometry-guided video post-training framework that runs Group Relative Policy Optimization with two latent rewards, camera motion smoothness and geometry reprojection consistency, to align a pretrained video generator toward 4D world-consistent generation.

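3. Method

3.1 Overall framework

Reading Figure 2: the method has two parts. On the left, the LGM replaces the geometry model's RGB input pathway with latents from the video VAE encoder and uses a lightweight 3D convolutional connector to align them to an intermediate layer of the geometry foundation model. On the right, VGGRPO samples a group of videos along the latent denoising trajectory, estimates 4D geometry directly with the LGM, and uses camera motion smoothness and reprojection consistency as GRPO rewards.

Intuitively, the LGM acts as a "geometry readout head" attached to the diffusion latent: it does not need to reconstruct RGB first, nor does it require modifying the video generator's backbone. As long as camera / depth / point map / scene flow can be read out reliably from the denoised latent, geometric errors can be turned into rewards. GRPO then updates the LoRA policy from the relative quality of multiple samples under the same prompt, so no separate critic has to be trained.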

3.2 Latent Geometry Model (LGM)

Let $E$ be the video VAE encoder, which encodes a video $x = \{I_{i}\}_{i=1}^{N}$ into a latent $z = E(x)$. The original geometry model $\Phi$ outputs per-frame geometry from the RGB sequence:

$$\{O_{i}\}_{i=1}^{N} = \Phi\big(\{I_{i}\}_{i=1}^{N}\big), \qquad O_{i} = \{C_{i}, D_{i}, P_{i}\} .$$

VGGRPO replaces the first $\hat{\ell}$ layers of $\Phi$ with a connector $S_{\psi}$ and trains it by feature stitching:

$$\hat{\ell}, \psi = \arg\min_{\ell \in \{1, \dots, L\},\, \psi} \frac{1}{M} \sum_{m=1}^{M} \left\| S_{\psi}(E(x^{m})) - \Phi_{1:\ell}(x^{m}) \right\|_{2}^{2} .$$
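The joint arg-min over the stitching layer and the connector weights can be illustrated on toy data. The sketch below is not the authors' procedure: it assumes a linear connector fitted by least squares in place of the 3D-conv connector, and `select_stitch_layer` and the synthetic features are hypothetical.

```python
import numpy as np

def select_stitch_layer(latents, layer_feats):
    """Pick the layer whose features a linear connector reproduces with the lowest error.

    latents: [M, d_z] array standing in for E(x^m).
    layer_feats: dict mapping layer index -> [M, d_l] array standing in for Phi_{1:l}(x^m).
    Returns (best_layer, best_error).
    """
    best_layer, best_err = None, float("inf")
    for layer, target in layer_feats.items():
        # Inner minimization over psi: least-squares fit of a linear connector.
        W, *_ = np.linalg.lstsq(latents, target, rcond=None)
        err = float(np.mean(np.sum((latents @ W - target) ** 2, axis=1)))
        # Outer minimization over the layer index.
        if err < best_err:
            best_layer, best_err = layer, err
    return best_layer, best_err

rng = np.random.default_rng(0)
z = rng.normal(size=(64, 8))
feats = {
    1: rng.normal(size=(64, 16)),       # unrelated to z: hard to predict
    2: z @ rng.normal(size=(8, 16)),    # linear in z: easy to predict
}
print(select_stitch_layer(z, feats))
```

In this toy setup the second layer is selected because its features are an exact linear function of the latents; with a real geometry model the connector would be trained by gradient descent for each candidate layer.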

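After training, the LGM outputs the 4D geometry quantities used for the reward directly from the latent:

$$\{C_{i}, D_{i}, P_{i}, F_{i}\}_{i=1}^{N} = \hat{\Phi}_{\psi}(z) .$$

Here $C_{i}$ denotes the camera parameters, $D_{i}$ the depth, $P_{i}$ the world-frame point map, and $F_{i}$ the scene flow. Scene flow lets dynamic regions be filtered out or handled separately, which makes this reward better suited to dynamic scenes than a static-only epipolar reward.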

3.3 Camera Motion Smoothness Reward

The LGM predicts camera poses $C_{i}$ from the denoised video latent $z_{0}$. From the camera centers $c_{i}$, define the velocity $v_{i} = c_{i+1} - c_{i}$ and the acceleration $a_{i} = v_{i} - v_{i-1}$, and then the translational jitter error:

$$e_{\text{trans}}(z_{0}) = \frac{1}{T - 2} \sum_{i=2}^{T-1} \frac{\| a_{i} \|_{2}}{\| v_{i} \|_{2} + \| v_{i-1} \|_{2}} .$$
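As a worked example of the translational jitter error, the NumPy sketch below evaluates it on a constant-velocity path and on a zig-zag path. The function name, the `eps` stabilizer, and the toy trajectories are illustrative assumptions, not part of the paper.

```python
import numpy as np

def translation_jitter(centers, eps=1e-8):
    """centers: [T, 3] camera centers. Returns e_trans."""
    centers = np.asarray(centers, dtype=float)
    v = centers[1:] - centers[:-1]      # v_i = c_{i+1} - c_i, shape [T-1, 3]
    a = v[1:] - v[:-1]                  # a_i = v_i - v_{i-1}, shape [T-2, 3]
    denom = np.linalg.norm(v[1:], axis=1) + np.linalg.norm(v[:-1], axis=1) + eps
    return float(np.mean(np.linalg.norm(a, axis=1) / denom))

# Straight line at constant speed versus the same path with a sideways zig-zag.
smooth = np.stack([np.linspace(0.0, 1.0, 6), np.zeros(6), np.zeros(6)], axis=1)
jittery = smooth.copy()
jittery[1::2, 1] = 0.2
print(translation_jitter(smooth), translation_jitter(jittery))
```

Because $\| v_{i} - v_{i-1} \| \le \| v_{i} \| + \| v_{i-1} \|$, every term lies in $[0, 1]$, so the error is scale-invariant: a fast but steady camera is not penalized more than a slow one.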

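Rotational smoothness is defined analogously: $\omega_{i} = \log_{SO(3)}(R_{i}^{\top} R_{i+1})$ is the angular velocity and $\alpha_{i} = \omega_{i} - \omega_{i-1}$ is the angular acceleration:

$$e_{\text{rot}}(z_{0}) = \frac{1}{T - 2} \sum_{i=2}^{T-1} \frac{\| \alpha_{i} \|_{2}}{\| \omega_{i} \|_{2} + \| \omega_{i-1} \|_{2}} .$$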
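The rotational term needs an $SO(3)$ logarithm. The sketch below uses the standard axis-angle formula, which is numerically reliable only for relative rotations well below $\pi$; `so3_log`, `rotation_jitter`, and `rot_z` are illustrative helpers, not functions from the paper.

```python
import numpy as np

def so3_log(R):
    """Axis-angle vector of a rotation matrix R (assumes rotation angle well below pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:
        return 0.5 * w                  # first-order approximation near the identity
    return theta / (2.0 * np.sin(theta)) * w

def rotation_jitter(rotations, eps=1e-8):
    """rotations: sequence of T rotation matrices. Returns e_rot."""
    omega = np.stack([so3_log(Ra.T @ Rb) for Ra, Rb in zip(rotations[:-1], rotations[1:])])
    alpha = omega[1:] - omega[:-1]      # angular acceleration
    denom = np.linalg.norm(omega[1:], axis=1) + np.linalg.norm(omega[:-1], axis=1) + eps
    return float(np.mean(np.linalg.norm(alpha, axis=1) / denom))

def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

steady = [rot_z(0.1 * k) for k in range(6)]                   # constant yaw rate
shaky = [rot_z(a) for a in (0.0, 0.1, 0.0, 0.1, 0.0, 0.1)]    # yaw flips back and forth
print(rotation_jitter(steady), rotation_jitter(shaky))
```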

The final motion reward is the average of the two smoothness scores:

$$r_{\text{motion}}(z_{0}) = \frac{1}{2} \left( \frac{1}{1 + e_{\text{trans}}(z_{0})} + \frac{1}{1 + e_{\text{rot}}(z_{0})} \right) .$$
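A small arithmetic sketch of how the two errors map to the reward; the function name is an assumption. Since each error is a mean of ratios bounded by 1 (triangle inequality), the reward stays within $[0.5, 1]$ whenever the errors are well defined.

```python
def motion_reward(e_trans, e_rot):
    """Average of the two smoothness scores 1 / (1 + e)."""
    return 0.5 * (1.0 / (1.0 + e_trans) + 1.0 / (1.0 + e_rot))

print(motion_reward(0.0, 0.0))   # perfectly smooth camera
print(motion_reward(1.0, 1.0))   # maximally jittery camera
print(motion_reward(0.2, 0.6))   # smooth translation, shakier rotation
```

The narrow range is one reason the per-reward standardization in Section 3.5 matters: without it, a reward with a small spread would contribute little to the combined advantage.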

3.4 Geometry Reprojection Consistency Reward

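The LGM predicts point maps $P_{i}$, depths $D_{i}$, camera parameters $C_{i}$, and scene flow $F_{i}$. The method first builds a scene point cloud from $\{P_{i}\}$: for static scenes it aggregates all frames, and for dynamic scenes it uses $F_{i}$ to filter out dynamic regions and aggregates only stable static points. The point cloud is then projected into each view $i$ to obtain a rendered depth $\hat{D}_{i}$, which is compared with the predicted depth $D_{i}$:

$$e_{\text{geo}}^{(i)}(z_{0}) = \frac{1}{|\Omega_{i}|} \sum_{p \in \Omega_{i}} \left| \hat{D}_{i}(p) - D_{i}(p) \right| ,$$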
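The per-view error can be sketched as a masked mean over valid pixels. This assumes the per-pixel error is an absolute difference, as in the `.abs()` call of the pseudocode in Section 3.6; the function name and the empty-mask convention are assumptions.

```python
import numpy as np

def per_view_depth_error(rendered_depth, predicted_depth, valid_mask):
    """Mean absolute depth difference over the valid projected pixels of one view."""
    valid = np.asarray(valid_mask, dtype=bool)
    if not valid.any():
        return 0.0  # assumption: a view without valid pixels contributes no error
    rendered = np.asarray(rendered_depth, dtype=float)
    predicted = np.asarray(predicted_depth, dtype=float)
    return float(np.abs(rendered[valid] - predicted[valid]).mean())

rendered = [[1.0, 2.0], [3.0, 4.0]]
predicted = [[1.5, 2.0], [3.0, 0.0]]
mask = [[True, True], [True, False]]   # the bottom-right pixel has no projected point
print(per_view_depth_error(rendered, predicted, mask))
```

Only the three valid pixels enter the mean, so the large mismatch at the masked pixel is ignored.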

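where $\Omega_{i}$ is the set of validly projected pixels in view $i$. To focus on localized bad cases, the reward is the negative mean over the three worst views, that is, the three views with the largest error:

$$r_{\text{geo}}(z_{0}) = - \frac{1}{3} \sum_{i \in \text{top-3}} e_{\text{geo}}^{(i)}(z_{0}) .$$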
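A minimal sketch of the worst-3 reduction, contrasted with a plain mean over views; the function name and the handling of videos with fewer than `k` views are assumptions.

```python
def geometry_reward(per_view_errors, k=3):
    """Negative mean of the k largest per-view errors."""
    if not per_view_errors:
        raise ValueError("at least one per-view error is required")
    worst = sorted(per_view_errors, reverse=True)[:k]
    return -sum(worst) / len(worst)

errors = [0.1, 0.5, 0.2, 0.9, 0.3]
print(geometry_reward(errors))            # driven by the views with errors 0.9, 0.5, 0.3
print(-sum(errors) / len(errors))         # plain mean, diluted by the good views
```

Averaging only the worst views keeps a few badly inconsistent frames from being hidden by many consistent ones.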

3.5 Latent-space GRPO Objective

For each prompt, $K$ denoising trajectories are sampled. Standard GRPO standardizes the advantage with the group's rewards:

$$A^{k} = \frac{r(x_{0}^{k}, p) - \mu_{r}}{\sigma_{r}} .$$

VGGRPO standardizes the motion reward and the geometry reward separately and then averages them:

$$A^{k} = \frac{1}{2} \left( \frac{r_{\text{motion}}(z_{0}^{k}) - \mu_{\text{motion}}}{\sigma_{\text{motion}}} + \frac{r_{\text{geo}}(z_{0}^{k}) - \mu_{\text{geo}}}{\sigma_{\text{geo}}} \right) .$$
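The per-reward standardization can be sketched for a single prompt group as follows. The `eps` term and the toy reward values are assumptions added for illustration.

```python
import numpy as np

def group_advantages(r_motion, r_geo, eps=1e-8):
    """Standardize each reward within the group, then average the two z-scores."""
    def standardize(r):
        r = np.asarray(r, dtype=float)
        return (r - r.mean()) / (r.std() + eps)
    return 0.5 * (standardize(r_motion) + standardize(r_geo))

r_motion = np.array([0.9, 0.8, 0.7, 0.6])      # one group of K = 4 samples
r_geo = np.array([-0.5, -0.1, -0.3, -0.7])
adv = group_advantages(r_motion, r_geo)
print(adv)
# Rescaling one reward leaves the advantages essentially unchanged.
print(np.allclose(adv, group_advantages(r_motion, 100.0 * r_geo), atol=1e-6))
```

Standardizing each term separately means neither reward dominates just because its raw values span a wider range.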

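The policy ratio at each denoising step is:

$$\rho_{t}^{k}(\theta) = \frac{\pi_{\theta}(x_{t-1}^{k} \mid x_{t}^{k}, p)}{\pi_{\theta_{\text{old}}}(x_{t-1}^{k} \mid x_{t}^{k}, p)}, \qquad \text{clip}_{\varepsilon}(\rho) = \text{clip}(\rho, 1 - \varepsilon, 1 + \varepsilon) .$$

The clipped VGGRPO objective is:

$$L_{\text{VGGRPO}}(\theta) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{T} \sum_{t=0}^{T-1} \left[ \min\!\left( \rho_{t}^{k}(\theta) A^{k},\ \text{clip}_{\varepsilon}(\rho_{t}^{k}(\theta)) A^{k} \right) - \beta D_{\text{KL}}(\pi_{\theta} \,\|\, \pi_{\text{ref}}) \right] .$$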
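To make the effect of clipping concrete, the NumPy sketch below evaluates only the surrogate term of the objective (the KL penalty is omitted) with the paper's $\varepsilon = 10^{-3}$. The function name and array layout are assumptions.

```python
import numpy as np

def clipped_surrogate(ratios, advantages, clip_eps=1e-3):
    """ratios: [K, T] per-step policy ratios; advantages: [K]. Mean clipped surrogate."""
    ratios = np.asarray(ratios, dtype=float)
    adv = np.asarray(advantages, dtype=float)[:, None]   # broadcast over denoising steps
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.minimum(ratios * adv, clipped * adv).mean())

print(clipped_surrogate([[1.01]], [1.0]))    # positive advantage: gain is capped at 1 + eps
print(clipped_surrogate([[1.01]], [-1.0]))   # negative advantage: the penalty is not capped
print(clipped_surrogate([[0.99]], [1.0]))    # ratio below the window: unclipped value is used
```

Taking the minimum makes the bound pessimistic: the policy gets no extra credit for moving the ratio outside $[1 - \varepsilon, 1 + \varepsilon]$ in the favorable direction.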

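3.6 Paper pseudocode (unofficial implementation)

A code search found no open-source implementation. The pseudocode below is reconstructed from the paper's equations, appendix listing, and method prose, and does not represent the authors' source code.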

A. LGM feature stitching training

import torch
import torch.nn.functional as F
from torch import nn


class LatentGeometryConnector(nn.Module):
    """Lightweight 3D-conv connector S_psi from video latents to geometry-model features."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.proj = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(5, 5, 5), stride=(1, 2, 2), padding=(2, 2, 2)
        )

    def forward(self, video_latents):
        # video_latents: [B, C, T, H, W]; T is preserved, H and W are halved.
        return self.proj(video_latents)


def train_lgm_connector(vae_encoder, geometry_model, connector, videos, optimizer):
    # vae_encoder and geometry_model.forward_to_layer are placeholder interfaces;
    # "ell_hat" stands for the selected stitching layer.
    with torch.no_grad():
        z = vae_encoder(videos)                       # z = E(x)
        target_feat = geometry_model.forward_to_layer(videos, layer="ell_hat")  # Phi_{1:ell}(x)

    pred_feat = connector(z)                          # S_psi(E(x))
    loss = F.mse_loss(pred_feat, target_feat)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(connector.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.detach()                              # detached so callers do not keep the graph

B. latent geometry rewards

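import torch


def camera_motion_reward(camera_poses):
    # The camera_poses accessors and so3_log are placeholder helpers (not defined here).
    centers = camera_poses.camera_centers_world()      # [T, 3]
    rotations = camera_poses.rotations()               # [T, 3, 3]

    v = centers[1:] - centers[:-1]                     # v_i = c_{i+1} - c_i
    a = v[1:] - v[:-1]                                 # a_i = v_i - v_{i-1}
    e_trans = (a.norm(dim=-1) / (v[1:].norm(dim=-1) + v[:-1].norm(dim=-1) + 1e-8)).mean()

    # omega_i = log(R_i^T R_{i+1}), shape [T-1, 3]
    omega = so3_log(torch.matmul(rotations[:-1].transpose(-1, -2), rotations[1:]))
    alpha = omega[1:] - omega[:-1]
    e_rot = (alpha.norm(dim=-1) / (omega[1:].norm(dim=-1) + omega[:-1].norm(dim=-1) + 1e-8)).mean()

    return 0.5 * (1.0 / (1.0 + e_trans) + 1.0 / (1.0 + e_rot))


def geometry_reprojection_reward(pointmaps, depths, cameras, scene_flow, topk=3):
    # aggregate_static_scene_points and render_depth are placeholder helpers (not defined here).
    static_points = aggregate_static_scene_points(pointmaps, scene_flow)
    errors = []
    for i, camera in enumerate(cameras):
        rendered_depth, valid = render_depth(static_points, camera)
        if not valid.any():
            continue                                   # skip views with no valid projection (avoids NaN)
        err = (rendered_depth[valid] - depths[i][valid]).abs().mean()
        errors.append(err)
    if not errors:
        raise ValueError("no view has valid projected pixels")
    # Negative mean over the worst views; k is capped for videos with fewer than topk valid views.
    worst = torch.stack(errors).topk(k=min(topk, len(errors)), largest=True).values
    return -worst.mean()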

C. VGGRPO policy update

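import torch


def vggrpo_update(policy, old_policy, ref_policy, lgm, prompts, optimizer, group_size=64,
                  clip_eps=1e-3, beta=0.004):
    # policy, old_policy, ref_policy, lgm, normalize_by_prompt_group, and closed_form_step_kl
    # are placeholder interfaces; none of them comes from released code.
    with torch.no_grad():
        trajectories = policy.sample_latent_trajectories(prompts, group_size=group_size)
        z0 = trajectories.final_latents()               # [B, ...], B = num_prompts * group_size

        # Rewards are constants for the policy gradient, so they are computed without autograd.
        # The reward functions above score one video, so each sample is evaluated separately.
        geoms = [lgm(z.unsqueeze(0)) for z in z0]       # cameras, depths, pointmaps, scene_flow
        r_motion = torch.stack([camera_motion_reward(g.cameras) for g in geoms])
        r_geo = torch.stack([
            geometry_reprojection_reward(g.pointmaps, g.depths, g.cameras, g.scene_flow)
            for g in geoms
        ])
        adv = 0.5 * (normalize_by_prompt_group(r_motion) + normalize_by_prompt_group(r_geo))  # [B]
        old_logp = old_policy.step_logprobs(trajectories)                                     # [B, T]

    ratios = (policy.step_logprobs(trajectories) - old_logp).exp()
    clipped = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps)
    pg = torch.minimum(ratios * adv[:, None], clipped * adv[:, None]).mean()
    kl = closed_form_step_kl(policy, ref_policy, trajectories).mean()

    loss = -(pg - beta * kl)                            # maximize surrogate, penalize KL
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.lora_parameters(), max_norm=1.0)
    optimizer.step()
    return {"loss": loss.detach(), "r_motion": r_motion.mean(), "r_geo": r_geo.mean(), "kl": kl.detach()}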

D. Test-time latent reward guidance (appendix)

import torch


def reward_guided_sampling(model, lgm, latents, prompt_embeds, timesteps, dts,
                           reward_guidance_scale, reward_weights, guidance_interval):
    # model, lgm, and the reward functions are the placeholder interfaces used above.
    for i, t in enumerate(timesteps):
        latents = latents.detach()
        with torch.no_grad():
            # The reward gradient below does not pass through the generator.
            v_pred = model(latents, t, prompt_embeds)

        if i in guidance_interval:
            latents = latents.requires_grad_(True)
            geom = lgm(latents)
            reward_smooth = camera_motion_reward(geom.cameras)
            reward_geo = geometry_reprojection_reward(geom.pointmaps, geom.depths, geom.cameras, geom.scene_flow)
            reward = reward_weights["smooth"] * reward_smooth + reward_weights["geo"] * reward_geo
            grad = torch.autograd.grad(reward, latents)[0]
            # t / (1 - t) diverges at t = 1, so guidance_interval should exclude that step.
            v_pred = v_pred - reward_guidance_scale * t / (1.0 - t) * grad

        latents = (latents - dts[i] * v_pred).detach()  # Euler step; guidance moves latents up the reward
    return latents
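The guidance term in the listing above is scaled by $t / (1 - t)$, which grows without bound as $t \to 1$. The sketch below tabulates that weight for a few timesteps; the `eps` clamp is an assumption added here for numerical safety and is not stated in the paper.

```python
def guidance_weight(t, scale, eps=1e-6):
    """Multiplier applied to the reward gradient at flow time t in [0, 1]."""
    return scale * t / max(1.0 - t, eps)   # eps clamp is an added safeguard

for t in (0.9, 0.5, 0.1):
    print(t, guidance_weight(t, scale=1.0))
```

Under this schedule the reward gradient is weighted most heavily at high-noise steps and fades as sampling approaches the clean latent, which is why the choice of `guidance_interval` matters.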

3.7 Code-to-paper mapping

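Because a code search found no open-source implementation, the table below is an audit-style mapping from paper component to expected implementation artifact, not a verification against the authors' source code. If code is released in the future, this table should be replaced with the actual files/classes/functions, and a real github_ref=<branch>@<short_sha> (date) should be set.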

Paper concept	Paper location	Expected implementation artifact	Current verification status
Latent Geometry Model stitching	Eq. stitching / Fig. 2(a)	VAE encoder wrapper, 3D Conv connector, geometry FM middle-layer feature extraction, MSE feature loss	No public code; only paper equations + source TeX/PDF verified
LGM outputs $\{C, D, P, F\}$	Eq. latent_reward_outputs	`lgm.forward(latents) -> cameras, depths, pointmaps, scene_flow`	No public code
Camera motion smoothness reward	Eq. trans_smooth, rot_smooth, motion_reward	camera-center velocity/acceleration and SO(3) angular acceleration reward	No public code
Geometry reprojection consistency reward	Eq. depth_reproj_per_view, geometry_reward	pointmap aggregation, scene-flow static filtering, depth rasterization/reprojection, worst-3 view reduction	No public code
Latent-space GRPO	Eq. combined_adv, latent_grpo_obj	group sampling, reward normalization, ratio clipping, closed-form KL, LoRA update	No public code
Test-time reward guidance	Appendix listing	differentiable reward gradient through LGM modifies velocity field	No public code

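4. Setup

4.1 Training configuration

Verifiable sources: arXiv PDF / TeX source. No launch script or config released by the authors was found, so the training numbers cannot be cross-checked against source code.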

Component	Setting
Geometry FM	Any4D (supports dynamic 4D reconstruction); the ablation also compares VGGT
LGM training data	base diffusion model generated videos + DL3DV + RealEstate10K + MiraData real videos
LGM optimizer	AdamW, learning rate $2 \times 10^{-4}$, no weight decay
LGM schedule	20 epochs, cosine decay, linear warmup over the first 100 optimization steps
LGM gradient clipping	max norm 1.0
Connector	3D conv, kernel $5 \times 5 \times 5$, stride $1 \times 2 \times 2$, padding $2 \times 2 \times 2$
LGM LoRA	rank $r = 64$, scaling $\alpha = 32$
VGGRPO backbones	Wan2.1-1B, Wan2.2-5B
VGGRPO LoRA	rank $r = 32$, scaling $\alpha = 64$
VGGRPO group size	$G = 64$
VGGRPO optimizer	AdamW, learning rate $1 \times 10^{-4}$, weight decay $1 \times 10^{-4}$
VGGRPO clipping / KL	$\varepsilon = 1 \times 10^{-3}$, $\beta = 0.004$
VGGRPO gradient clipping	max norm 1.0
Training compute	approximately 1536 GPU hours
Denoising reduction	training sampling schedule example: $T_{\text{train}} = 10$ vs. inference $T_{\text{infer}} = 40$
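To see what the connector setting in the table does to tensor shapes, the sketch below applies the standard convolution output-size formula (dilation 1) to each axis. The latent size `(21, 60, 104)` is a made-up example, not a value from the paper.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def connector_output_shape(t, h, w):
    """Shape after a 5x5x5 conv with stride (1, 2, 2) and padding (2, 2, 2)."""
    return (
        conv_out_size(t, kernel=5, stride=1, padding=2),
        conv_out_size(h, kernel=5, stride=2, padding=2),
        conv_out_size(w, kernel=5, stride=2, padding=2),
    )

print(connector_output_shape(21, 60, 104))
```

The temporal length is preserved while both spatial dimensions are halved, which is how the connector can bring a latent grid down to the resolution of the geometry model's intermediate features.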

4.2 Baselines and metrics

Baselines: Base Model, Supervised Fine-Tuning (SFT), Epipolar-DPO, VideoGPA. Evaluation covers a static split, a dynamic split, and general VBench captions.

Main metrics:

Static: VideoReward Visual Quality (VQ↑), Motion Quality (MQ↑), Sampson epipolar error (Epi.↓).
Dynamic: VideoReward VQ↑, MQ↑.
VBench: Subject Consistency, Background Consistency, Aesthetic Quality, Imaging Quality, Motion Smoothness, Dynamic Degree.
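For reference, the Sampson epipolar error has a standard closed form for a fundamental matrix $F$ and homogeneous correspondences $x_{1} \leftrightarrow x_{2}$. The sketch below implements that textbook formula; it is not the paper's evaluation pipeline, and the toy fundamental matrix (pure sideways translation with identity intrinsics) is an assumption.

```python
import numpy as np

def sampson_error(F, x1, x2):
    """Sampson error per correspondence; x1, x2 are [N, 3] homogeneous points."""
    Fx1 = x1 @ F.T                       # row n is F @ x1[n]
    Ftx2 = x2 @ F                        # row n is F.T @ x2[n]
    num = np.sum(x2 * Fx1, axis=1) ** 2  # (x2^T F x1)^2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / den

# Cross-product matrix of t = (1, 0, 0): matching points must share the same image row.
F = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
x1 = np.array([[0.2, 0.3, 1.0], [0.2, 0.3, 1.0]])
x2 = np.array([[0.5, 0.3, 1.0], [0.5, 0.4, 1.0]])   # second match is off by 0.1 vertically
print(sampson_error(F, x1, x2))
```

A geometrically consistent match gives zero error, and the error grows with the squared distance from the epipolar constraint, which is why lower Epi. indicates better static multi-view consistency.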

4.3 Open-source code search and github_ref

Conclusion: a code search found no open-source implementation.

Search log (2026-05-16):

The Hugging Face paper page links only to the arXiv PDF and the project page; there is no model/dataset/space code link.
The code/data/media section of the arXiv page lists no author implementation.
The project page has only Paper and LinkedIn post links; no GitHub implementation is given.
The CatalyzeX page shows paper metadata and the project page, but gives no directly openable implementation URL.
GitHub API / web search queries: VGGRPO, "Visual Geometry GRPO", "Towards World-Consistent Video Generation", "2603.26599", "Latent Geometry Model" "VGGRPO". The results were paper lists, blogs, and the project-page source repo (e.g. ZhaochongAn/ZhaochongAn.github.io); no algorithm source repo was found.

5. Results

5.1 Main results

Base backbone	Method	Static VQ↑	Static MQ↑	Static Epi.↓	Dynamic VQ↑	Dynamic MQ↑	Sub. Cons.↑	Bg. Cons.↑	Aes. Qual.↑	Img. Qual.↑	Mot. Smooth.↑	Dyn. Deg.↑
Wan2.1-1B	Base	-	-	0.133	-	-	0.7941	0.8930	0.5233	0.6178	0.9552	0.9231
Wan2.1-1B	SFT	45.26	46.84	0.137	40.00	39.00	0.8032	0.8896	0.5472	0.6256	0.9646	0.8795
Wan2.1-1B	Epipolar-DPO	54.21	55.79	0.098	45.50	43.00	0.8125	0.8916	0.5578	0.6461	0.9671	0.8816
Wan2.1-1B	VideoGPA	53.68	56.32	0.105	42.50	41.00	0.8068	0.8931	0.5562	0.6507	0.9650	0.8734
Wan2.1-1B	VGGRPO	59.47	66.84	0.102	57.00	63.00	0.8255	0.8974	0.5623	0.6585	0.9753	0.9048
Wan2.2-5B	Base	-	-	0.142	-	-	0.8151	0.8958	0.4837	0.6402	0.9467	0.8692
Wan2.2-5B	SFT	46.32	52.63	0.129	33.00	51.00	0.8323	0.8925	0.4886	0.6159	0.9548	0.9026
Wan2.2-5B	Epipolar-DPO	52.11	58.95	0.101	38.00	54.50	0.8407	0.9054	0.4945	0.6275	0.9482	0.7603
Wan2.2-5B	VideoGPA	54.74	60.53	0.098	40.00	54.00	0.8511	0.9048	0.4920	0.6131	0.9518	0.7645
Wan2.2-5B	VGGRPO	62.63	68.42	0.093	56.50	66.00	0.8672	0.9056	0.5094	0.6843	0.9619	0.8421

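Main conclusion: on both backbones VGGRPO achieves the highest Static and Dynamic VQ/MQ among the compared methods. On Wan2.2-5B it also obtains the lowest Static Epi. (0.093); on Wan2.1-1B its Epi. of 0.102 is slightly above Epipolar-DPO's 0.098. Compared with Epipolar-DPO / VideoGPA, which rely on stronger static assumptions, VGGRPO's advantage is more pronounced on the dynamic split: on Wan2.2-5B, Dynamic MQ rises from 54.00 (VideoGPA) to 66.00.

Reading Figure 3: the qualitative comparison shows the first/middle/last frames for static and dynamic prompts. The baseline and the DPO baselines exhibit temporal flicker, geometric drift, or camera instability, while VGGRPO produces more continuous scene structure and a smoother camera trajectory.

5.2 Ablation: geometry FM and reward terms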

Study	Variant	VQ↑	MQ↑	Epi.↓
Geometry FM	VGGT	54.96	60.61	0.090
Geometry FM	Any4D	59.57	67.21	0.093
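Reward terms	$r_{\text{motion}}$ only	55.60	63.40	0.104
Reward terms	$r_{\text{motion}} + r_{\text{geo}}$	59.57	67.21	0.093

Any4D's dynamic 4D reconstruction capability yields higher VQ/MQ (59.57 / 67.21 vs. 54.96 / 60.61), while VGGT is slightly better on static epipolar error (0.090 vs. 0.093). The reward ablation shows that the motion reward alone stabilizes the camera, but adding reprojection consistency improves both geometric and perceptual quality.

Reading Figure 4: when only $r_{\text{motion}}$ is optimized, the camera trajectory becomes smoother but the scene structure still shows local geometric artifacts. Adding $r_{\text{geo}}$ makes the reconstructed geometry more consistent, which indicates that the two rewards are complementary.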

5.3 Generalization and efficiency

VBench generalization (standard VBench captions):

Model	Sub. Cons.↑	Bg. Cons.↑	Aes. Qual.↑	Img. Qual.↑	Mot. Smooth.↑	Dyn. Deg.↑
Baseline	0.9542	0.9528	0.5966	0.6733	0.9841	0.4237
VGGRPO	0.9644	0.9583	0.5991	0.6861	0.9895	0.3962

Efficiency (reward computation, batch size 4):

Reward	Time↓	Peak Mem↓
RGB-based	54.73 s	76.80 GB
Latent reward (VGGRPO)	41.33 s	68.57 GB

Efficiency conclusion: the latent reward is 13.40 s faster than the RGB-based reward, a 24.5% time reduction, and peak GPU memory drops from 76.80 GB to 68.57 GB. Dynamic Degree is not the highest; the authors attribute this to the motion reward reducing camera jitter, which lowers the RAFT optical-flow magnitude. This does not mean the videos are less dynamic, since Motion Quality and Motion Smoothness both improve.
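The relative savings follow directly from the efficiency table; the short check below recomputes them. The helper name is an assumption.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before

time_saving = relative_reduction(54.73, 41.33)   # seconds per reward computation
mem_saving = relative_reduction(76.80, 68.57)    # peak GPU memory in GB
print(f"time: {time_saving:.1%}, peak memory: {mem_saving:.1%}")
# prints "time: 24.5%, peak memory: 10.7%"
```

The memory saving (about 10.7%) is smaller than the time saving, consistent with the reward model itself, rather than VAE decoding, accounting for much of the peak memory.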

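Reading Figure 5: the supplementary analysis uses latent perturbations to test the robustness of geometry estimation. The RGB-based geometry model degrades quickly even when the decoded frames change very little visually, whereas the LGM, trained directly on generated latents, is more stable under latent noise / distribution shift.

5.4 Limitations and caveats

The currently public material includes no source code, so implementation details such as the training script, the actual reward weights, and the sampling schedule cannot be verified. The config numbers in these notes come from the paper TeX/PDF, not from a launch config.
The LGM is bounded by the capability of the geometry foundation model. Any4D is better suited to dynamic scenes than VGGT, but it can still be affected by scene-flow / static-filtering errors.
The rewards mainly constrain camera smoothness and geometric reprojection. They do not directly guarantee object identity, semantic consistency, or physical validity, which may require additional rewards or a world-state representation.
The 1536 GPU hours show that online video RL is still expensive. The latent reward lowers the cost of reward computation but does not remove the overall training cost of group sampling.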
