VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Paper: arXiv:2603.26599 (PDF, Hugging Face paper page)
Code: no open-source implementation found. A search of GitHub, CatalyzeX, Hugging Face, and the project page turned up no public algorithm implementation as of 2026-05-16.

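1. Motivation

Large-scale video diffusion and rectified-flow models now generate visually high-quality video, yet they still fail at world consistency: within a single scene, frames can show geometry drift, camera jitter, broken structures, or abrupt scene changes. For embodied AI, physics-aware simulation, and robotics data generation, a video must not only look good but also stay self-consistent in camera motion and in 3D/4D scene structure.

Existing approaches leave two gaps:

  • Architecture changes / added conditioning modules: introducing point cloud, depth, or camera conditioning improves local 3D consistency, but it adds architectural complexity and may weaken the generalization of internet-scale pretrained video models.
  • RGB-space geometry reward / DPO: methods such as Epipolar-DPO and VideoGPA must repeatedly VAE-decode video latents to RGB and then run a geometry model. This is expensive, it exposes the reward model to the distribution shift of generated RGB, and many of the static-geometry assumptions involved cannot handle real dynamic scenes.

The paper's core question: without modifying the base video generator and without repeatedly decoding to RGB, can a 4D geometry reward suitable for RL post-training be built directly on video latents?

Reading Figure 1: the teaser compares the baseline with the VGGRPO-aligned model. The top row shows the inferred 4D scene representation / reconstructed geometry, the bottom row shows representative keyframes. The baseline's geometry and camera trajectory drift more easily, while VGGRPO's latent-space geometry reward keeps structure and camera motion more stable in dynamic scenes.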

2. Idea

Core insight: connect the video diffusion latent directly to the intermediate feature space of a 4D geometry foundation model. A Latent Geometry Model (LGM) then predicts camera, depth, point map, and scene flow in latent space, and these geometric predictions are turned into GRPO rewards.

This addresses three things at once:

  • Reward computation no longer needs repeated VAE decoding, which lowers the time and memory cost of group-based online RL.
  • The reward model's input changes from RGB frames to diffusion latents, which reduces the distribution gap between generated RGB and a geometry model trained on real images.
  • With a geometry model that supports dynamic 4D reconstruction, such as Any4D, the reward covers dynamic scenes rather than only static multi-view geometry.

In one sentence: VGGRPO is a latent geometry-guided video post-training framework that runs Group Relative Policy Optimization with two latent rewards, camera motion smoothness and geometry reprojection consistency, to align a pretrained video generator toward 4D world-consistent generation.

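3. Method

3.1 Overall framework

Reading Figure 2: the method has two parts. On the left, the LGM replaces the geometry model's RGB input pathway with latents from the video VAE encoder and uses a lightweight 3D convolutional connector to align them with an intermediate layer of the geometry foundation model. On the right, VGGRPO samples a group of videos along the latent denoising trajectory, estimates 4D geometry directly with the LGM, and uses camera motion smoothness and reprojection consistency as GRPO rewards.

Intuitively, the LGM gives the diffusion latent a "geometry readout head": it needs no decoding to RGB and no change to the video generator's backbone. As long as camera, depth, point map, and scene flow can be read reliably from the denoised latent, geometric errors can be converted into a reward. GRPO then updates the LoRA policy from the relative quality of several samples drawn for the same prompt, so no separate critic has to be trained.

3.2 Latent Geometry Model (LGM)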

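Let $E$ denote the video VAE encoder, which maps a video $x$ to a latent $z = E(x)$. The original geometry model $\Phi$ predicts per-frame geometry from the RGB sequence:

$$\Phi(x) = \{(C_i, D_i, P_i, F_i)\}_{i=1}^{T}$$

VGGRPO replaces the first $\hat{\ell}$ layers of $\Phi$ with a connector $S_\psi$ and trains it by feature stitching, that is, by matching the connector output to the layer-$\hat{\ell}$ features of $\Phi$ (written here in the MSE form used by the pseudocode in Section 3.6):

$$\mathcal{L}_{\text{stitch}} = \big\| S_\psi(E(x)) - \Phi_{1:\hat{\ell}}(x) \big\|_2^2$$

After training, the LGM predicts the 4D geometric quantities used by the rewards directly from the latent:

$$\Phi_{\text{lat}}(z) = \Phi_{\hat{\ell}+1:L}\big(S_\psi(z)\big) = \{(C_i, D_i, P_i, F_i)\}_{i=1}^{T}$$

Here $C_i$ denotes the camera parameters, $D_i$ the depth, $P_i$ the world-frame point map, and $F_i$ the scene flow of frame $i$. Scene flow lets dynamic regions be filtered out or handled separately, which makes the reward better suited to dynamic scenes than a static-only epipolar reward.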

3.3 Camera Motion Smoothness Reward

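From the denoised video latent, the LGM predicts camera poses $\{(R_i, c_i)\}_{i=1}^{T}$, with rotation $R_i$ and camera center $c_i$. The camera centers give a velocity $v_i = c_{i+1} - c_i$ and an acceleration $a_i = v_{i+1} - v_i$, from which the translational jitter error is defined (in the form used by the Section 3.6 pseudocode):

$$e_{\text{trans}} = \frac{1}{T-2} \sum_{i=1}^{T-2} \frac{\|a_i\|}{\|v_{i+1}\| + \|v_i\| + \epsilon}$$

Rotational smoothness is analogous, with angular velocity $\omega_i = \log(R_i^{\top} R_{i+1})^{\vee}$ and angular acceleration $\alpha_i = \omega_{i+1} - \omega_i$:

$$e_{\text{rot}} = \frac{1}{T-2} \sum_{i=1}^{T-2} \frac{\|\alpha_i\|}{\|\omega_{i+1}\| + \|\omega_i\| + \epsilon}$$

The final motion reward is the average of the two smoothness scores:

$$r_{\text{motion}} = \frac{1}{2}\left(\frac{1}{1 + e_{\text{trans}}} + \frac{1}{1 + e_{\text{rot}}}\right)$$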
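To make the translational part of the motion reward concrete, here is a minimal pure-Python sketch. It assumes the normalized-acceleration form used by the Section 3.6 pseudocode; the function names and the two toy camera tracks are illustrative and do not come from the paper.

```python
import math


def _norm(u):
    """Euclidean norm of a 3-vector given as a list."""
    return math.sqrt(sum(x * x for x in u))


def translation_jitter(centers, eps=1e-8):
    """Mean normalized acceleration of a camera-center track (needs >= 3 points)."""
    # Finite-difference velocities between consecutive camera centers.
    v = [[b[k] - a[k] for k in range(3)] for a, b in zip(centers, centers[1:])]
    terms = []
    for v0, v1 in zip(v, v[1:]):
        a = [v1[k] - v0[k] for k in range(3)]  # finite-difference acceleration
        terms.append(_norm(a) / (_norm(v1) + _norm(v0) + eps))
    return sum(terms) / len(terms)


def smoothness_score(error):
    """Map a non-negative error to a score in (0, 1]."""
    return 1.0 / (1.0 + error)


# Toy tracks: a steady dolly along x, and the same dolly with side-to-side shake.
smooth_track = [[0.1 * i, 0.0, 0.0] for i in range(6)]
shaky_track = [[0.1 * i, 0.05 * (-1) ** i, 0.0] for i in range(6)]

smooth_score = smoothness_score(translation_jitter(smooth_track))
shaky_score = smoothness_score(translation_jitter(shaky_track))
print(smooth_score, shaky_score)
```

The steady track scores close to 1 and the shaky track scores clearly lower. Dividing by the neighbouring speeds makes the error independent of the overall scene scale, which matters because predicted camera translations are generally only defined up to scale.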

3.4 Geometry Reprojection Consistency Reward

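The LGM predicts point maps $P_i$, depths $D_i$, camera parameters $C_i$, and scene flow $F_i$. The method first builds a scene point cloud from $\{P_i\}$: for static scenes it aggregates all frames, while for dynamic scenes it uses $F_i$ to filter out dynamic regions and aggregates only stable static points. The point cloud is then projected into each view $i$ to obtain a rendered depth $\hat{D}_i$, which is compared with the predicted depth:

$$e^{\text{reproj}}_i = \frac{1}{|\Omega_i|} \sum_{p \in \Omega_i} \big| \hat{D}_i(p) - D_i(p) \big|$$

where $\Omega_i$ is the set of validly projected pixels in view $i$. To focus on local failure cases, the reward is the negative mean over the 3 worst views:

$$r_{\text{geo}} = -\frac{1}{3} \sum_{i \in \text{top-3}} e^{\text{reproj}}_i$$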
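The worst-view reduction can be illustrated with a small pure-Python sketch. The per-view error values below are made up for illustration, and `worst_k_reward` is a hypothetical helper name.

```python
def worst_k_reward(per_view_errors, k=3):
    """Negative mean of the k largest per-view errors (all views if fewer than k)."""
    worst = sorted(per_view_errors, reverse=True)[:k]
    return -sum(worst) / len(worst)


# Six views, one of which has a badly inconsistent depth reprojection.
errors = [0.02, 0.03, 0.02, 0.40, 0.03, 0.02]

mean_reward = -sum(errors) / len(errors)   # plain average over all views
worst_reward = worst_k_reward(errors, k=3)  # average over the 3 worst views
print(mean_reward, worst_reward)
```

Averaging over all views dilutes the single bad view, while the worst-3 reduction keeps it dominant, so a video with one locally broken frame is penalized more strongly.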

3.5 Latent-space GRPO Objective

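For each prompt, $G$ denoising trajectories are sampled. Standard GRPO standardizes the group rewards $\{r^j\}_{j=1}^{G}$ into an advantage:

$$A^k = \frac{r^k - \operatorname{mean}(\{r^j\}_{j=1}^{G})}{\operatorname{std}(\{r^j\}_{j=1}^{G})}$$

VGGRPO standardizes the motion reward and the geometry reward separately and then averages them:

$$A^k = \frac{1}{2}\left(A^k_{\text{motion}} + A^k_{\text{geo}}\right)$$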
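The separate normalization can be sketched in a few lines of Python. This is an illustration under stated assumptions: the population standard deviation and the `eps` stabilizer are choices made here, and the reward values are invented.

```python
import statistics


def group_normalize(rewards, eps=1e-8):
    """Standardize rewards within one prompt group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def combined_advantage(r_motion, r_geo):
    """Average of the separately standardized motion and geometry advantages."""
    a_motion = group_normalize(r_motion)
    a_geo = group_normalize(r_geo)
    return [0.5 * (m + g) for m, g in zip(a_motion, a_geo)]


# One prompt group of 4 samples. The two rewards live on different scales:
# motion scores lie in (0, 1], geometry rewards are negative errors.
r_motion = [0.90, 0.80, 0.70, 0.60]
r_geo = [-0.10, -0.40, -0.20, -0.30]
adv = combined_advantage(r_motion, r_geo)
print(adv)
```

Standardizing each reward before averaging puts both terms on a unit scale, so neither dominates the advantage simply because of its numeric range.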

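Let $q$ be the prompt, $z_s^k$ the latent of trajectory $k$ at denoising step $s$, and $A^k$ the group-relative advantage of that trajectory. The policy ratio at each denoising step is:

$$\rho_s^k(\theta) = \frac{p_\theta(z_{s-1}^k \mid z_s^k, q)}{p_{\theta_{\text{old}}}(z_{s-1}^k \mid z_s^k, q)}$$

The clipped VGGRPO objective, with clipping range $\epsilon$, KL weight $\beta$, and $S$ denoising steps (written to match the pseudocode in Section 3.6):

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{k=1}^{G}\frac{1}{S}\sum_{s=1}^{S} \min\Big(\rho_s^k(\theta)\,A^k,\ \operatorname{clip}\big(\rho_s^k(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A^k\Big)\right] - \beta\, D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$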
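The behaviour of the clipped term for a single denoising step can be checked with a scalar sketch. The default `clip_eps=1e-3` mirrors the default in the Section 3.6 pseudocode; the ratio and advantage values are illustrative.

```python
def clipped_surrogate(ratio, advantage, clip_eps=1e-3):
    """PPO/GRPO-style clipped surrogate for one step and one sample."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum makes the objective pessimistic: gains from moving
    # the ratio outside the clip range are ignored, losses are kept.
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = clipped_surrogate(1.01, +1.0)   # good sample, ratio too high
loss_kept = clipped_surrogate(1.01, -1.0)     # bad sample, ratio too high
gain_inside = clipped_surrogate(0.99, +1.0)   # good sample, ratio too low
loss_capped = clipped_surrogate(0.99, -1.0)   # bad sample, ratio too low
print(gain_capped, loss_kept, gain_inside, loss_capped)
```

With a clip range this tight, the policy can move only a very small amount per update on each step, which is a conservative choice for online RL on a large pretrained generator.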

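3.6 Paper pseudocode (unofficial implementation)

No open-source implementation was found. The pseudocode below is reconstructed from the paper's equations, appendix listing, and method prose; it does not represent the authors' source code.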

A. LGM feature stitching training

import torch
import torch.nn.functional as F
from torch import nn


class LatentGeometryConnector(nn.Module):
    """Lightweight 3D-conv connector S_psi: VAE latents -> geometry-model features."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Temporal stride 1 keeps the latent frame count; spatial stride 2
        # roughly halves H and W.
        self.proj = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(5, 5, 5), stride=(1, 2, 2), padding=(2, 2, 2)
        )

    def forward(self, video_latents):
        # video_latents: [B, C, T, H, W]
        return self.proj(video_latents)


def train_lgm_connector(vae_encoder, geometry_model, connector, videos, optimizer):
    """One feature-stitching step. forward_to_layer is a hypothetical hook."""
    with torch.no_grad():
        z = vae_encoder(videos)                       # z = E(x)
        # Target: frozen geometry-model features at the stitching layer ell_hat.
        target_feat = geometry_model.forward_to_layer(videos, layer="ell_hat")

    pred_feat = connector(z)                          # S_psi(E(x))
    loss = F.mse_loss(pred_feat, target_feat)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(connector.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.detach()                              # detached for logging
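As a quick sanity check on the connector hyperparameters in listing A, the standard convolution output-size formula (dilation 1) shows how the latent grid changes. The latent size 21 x 60 x 104 below is only an illustrative assumption, not a value from the paper.

```python
def conv_out_size(n, kernel, stride, padding):
    """Output length of a convolution along one axis (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def connector_output_shape(t, h, w, kernel=(5, 5, 5), stride=(1, 2, 2), padding=(2, 2, 2)):
    """(T, H, W) after the connector's Conv3d with the listing's hyperparameters."""
    return tuple(
        conv_out_size(n, k, s, p)
        for n, k, s, p in zip((t, h, w), kernel, stride, padding)
    )


shape = connector_output_shape(21, 60, 104)
print(shape)  # (21, 30, 52)
```

The temporal axis is preserved and each spatial axis is halved (rounded up for odd sizes), so the connector output has to line up with the token grid the geometry model expects at the stitching layer.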

B. latent geometry rewards

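import torch


def camera_motion_reward(camera_poses):
    """Smoothness reward for ONE video with at least 3 frames.

    camera_centers_world(), rotations() and so3_log() are hypothetical helpers.
    """
    centers = camera_poses.camera_centers_world()      # [T, 3]
    rotations = camera_poses.rotations()               # [T, 3, 3]

    # Translation: acceleration normalized by neighbouring speeds.
    v = centers[1:] - centers[:-1]
    a = v[1:] - v[:-1]
    e_trans = (a.norm(dim=-1) / (v[1:].norm(dim=-1) + v[:-1].norm(dim=-1) + 1e-8)).mean()

    # Rotation: relative rotation R_i^T R_{i+1} mapped to an axis-angle vector.
    omega = so3_log(torch.matmul(rotations[:-1].transpose(-1, -2), rotations[1:]))
    alpha = omega[1:] - omega[:-1]
    e_rot = (alpha.norm(dim=-1) / (omega[1:].norm(dim=-1) + omega[:-1].norm(dim=-1) + 1e-8)).mean()

    return 0.5 * (1.0 / (1.0 + e_trans) + 1.0 / (1.0 + e_rot))


def geometry_reprojection_reward(pointmaps, depths, cameras, scene_flow, topk=3):
    """Negative mean depth-reprojection error over the worst `topk` views.

    aggregate_static_scene_points() and render_depth() are hypothetical helpers.
    """
    static_points = aggregate_static_scene_points(pointmaps, scene_flow)
    errors = []
    for i, camera in enumerate(cameras):
        rendered_depth, valid = render_depth(static_points, camera)
        err = (rendered_depth[valid] - depths[i][valid]).abs().mean()
        errors.append(err)
    errors = torch.stack(errors)
    # Guard against videos with fewer than `topk` views.
    worst = errors.topk(k=min(topk, errors.numel()), largest=True).values
    return -worst.mean()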
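Listing B calls `so3_log` without defining it. One possible definition is the standard SO(3) logarithm map, sketched below in pure Python for a single 3x3 matrix given as nested lists. This is an assumption about what the helper does; a batched tensor version would follow the same formula.

```python
import math


def so3_log(R):
    """Axis-angle vector of a rotation matrix, valid for angles in [0, pi)."""
    trace = R[0][0] + R[1][1] + R[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    theta = math.acos(cos_theta)
    # Vee of the skew-symmetric part R - R^T, which equals 2 sin(theta) * axis.
    w = [R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1]]
    if theta < 1e-8:
        return [0.5 * x for x in w]  # first-order approximation near identity
    scale = theta / (2.0 * math.sin(theta))
    return [scale * x for x in w]


def rot_z(angle):
    """Rotation about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def relative_rotation(Ra, Rb):
    """Ra^T Rb: the rotation taking frame a to frame b."""
    return [[sum(Ra[k][i] * Rb[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


# Two consecutive camera orientations that differ by 0.15 rad of yaw.
omega = so3_log(relative_rotation(rot_z(0.10), rot_z(0.25)))
print(omega)
```

The resulting vector points along the rotation axis with a length equal to the rotation angle, which is the per-frame angular velocity that the rotational smoothness error differences. The formula becomes numerically unstable as the angle approaches pi, which consecutive video frames rarely reach.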

C. VGGRPO policy update

import torch


def vggrpo_update(policy, old_policy, ref_policy, lgm, prompts, optimizer, group_size=64,
                  clip_eps=1e-3, beta=0.004):
    """One VGGRPO step. All policy/lgm methods used here are hypothetical interfaces."""
    with torch.no_grad():  # sampling, rewards and old log-probs carry no gradient
        trajectories = policy.sample_latent_trajectories(prompts, group_size=group_size)
        z0 = trajectories.final_latents()

        geom = lgm(z0)  # cameras, depths, pointmaps, scene_flow
        # The reward functions score one video at a time, so evaluate per sample
        # (per_sample() is a hypothetical iterator over the batch).
        samples = list(geom.per_sample())
        r_motion = torch.stack([camera_motion_reward(g.cameras) for g in samples])
        r_geo = torch.stack([
            geometry_reprojection_reward(g.pointmaps, g.depths, g.cameras, g.scene_flow)
            for g in samples
        ])
        adv = 0.5 * (normalize_by_prompt_group(r_motion) + normalize_by_prompt_group(r_geo))
        old_logp = old_policy.step_logprobs(trajectories)

    ratios = (policy.step_logprobs(trajectories) - old_logp).exp()   # [N, steps]
    clipped = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps)
    pg = torch.minimum(ratios * adv[:, None], clipped * adv[:, None]).mean()
    kl = closed_form_step_kl(policy, ref_policy, trajectories).mean()

    loss = -(pg - beta * kl)  # maximize the surrogate, penalize drift from ref_policy
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.lora_parameters(), max_norm=1.0)
    optimizer.step()
    return {"loss": loss.detach(), "r_motion": r_motion.mean(), "r_geo": r_geo.mean(),
            "kl": kl.detach()}

D. test-time latent reward guidance (appendix)

import torch


def reward_guided_sampling(model, lgm, latents, prompt_embeds, timesteps, dts,
                           reward_guidance_scale, reward_weights, guidance_interval):
    """Sampling with reward-gradient guidance; model/lgm call signatures are assumed."""
    for i, t in enumerate(timesteps):
        with torch.no_grad():  # the reward gradient never flows through the generator
            v_pred = model(latents, t, prompt_embeds)

        if i in guidance_interval:
            # Differentiate the reward w.r.t. the latents through the LGM only.
            latents = latents.detach().requires_grad_(True)
            geom = lgm(latents)
            reward_smooth = camera_motion_reward(geom.cameras)
            reward_geo = geometry_reprojection_reward(geom.pointmaps, geom.depths, geom.cameras, geom.scene_flow)
            reward = reward_weights["smooth"] * reward_smooth + reward_weights["geo"] * reward_geo
            grad = torch.autograd.grad(reward, latents)[0]
            # t / (1 - t) is singular at t = 1, so guidance_interval must skip that step.
            v_pred = v_pred - reward_guidance_scale * t / (1.0 - t) * grad
            latents = latents.detach()

        latents = latents - dts[i] * v_pred
    return latents.detach()
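The sign convention in listing D can be checked on a one-dimensional toy problem. Everything here is a stand-in: the "generator" predicts zero velocity, the reward is a quadratic with a known gradient, and the timesteps are chosen to avoid t = 1. The sketch only shows that subtracting the scaled reward gradient from the velocity moves the sample toward higher reward.

```python
def guided_sample(x, timesteps, dts, scale, target):
    """Euler sampling x <- x - dt * v with a reward-gradient correction on v."""
    for t, dt in zip(timesteps, dts):
        v_pred = 0.0                    # toy generator: zero velocity field
        grad = -2.0 * (x - target)      # d/dx of reward(x) = -(x - target)^2
        v_pred = v_pred - scale * t / (1.0 - t) * grad
        x = x - dt * v_pred
    return x


timesteps = [0.9, 0.7, 0.5, 0.3, 0.1]
dts = [0.2] * 5

unguided = guided_sample(0.0, timesteps, dts, scale=0.0, target=1.0)
guided = guided_sample(0.0, timesteps, dts, scale=0.1, target=1.0)
print(unguided, guided)
```

Because the update is `x - dt * v`, lowering the velocity along the reward gradient pushes the sample in the direction of increasing reward. The `t / (1 - t)` weight makes the correction strongest at high noise levels and lets it fade as sampling approaches the clean latent.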

3.7 Code-to-paper mapping

No open-source implementation was found, so the table below is an audit-style mapping from each paper component to the implementation artifact one would expect, not a verification against the authors' source. If code is released later, replace this table with the actual files, classes, and functions, and set a real github_ref=<branch>@<short_sha> (date).

| Paper concept | Paper location | Expected implementation artifact | Current verification status |
|---|---|---|---|
| Latent Geometry Model stitching | Eq. stitching / Fig. 2(a) | VAE encoder wrapper, 3D Conv connector, geometry FM middle-layer feature extraction, MSE feature loss | No public code; only paper equations + source TeX/PDF verified |
| LGM outputs | Eq. latent_reward_outputs | lgm.forward(latents) -> cameras, depths, pointmaps, scene_flow | No public code |
| Camera motion smoothness reward | Eq. trans_smooth, rot_smooth, motion_reward | camera-center velocity/acceleration and SO(3) angular acceleration reward | No public code |
| Geometry reprojection consistency reward | Eq. depth_reproj_per_view, geometry_reward | pointmap aggregation, scene-flow static filtering, depth rasterization/reprojection, worst-3 view reduction | No public code |
| Latent-space GRPO | Eq. combined_adv, latent_grpo_obj | group sampling, reward normalization, ratio clipping, closed-form KL, LoRA update | No public code |
| Test-time reward guidance | Appendix listing | differentiable reward gradient through LGM modifies velocity field | No public code |

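4. Setup

4.1 Training configuration

Verifiable sources: arXiv PDF / TeX source. No author-released launch script or config was found, so the training numbers cannot be cross-checked against source code.

| Component | Setting |
|---|---|
| Geometry FM | Any4D (supports dynamic 4D reconstruction); the ablation also compares VGGT |
| LGM training data | base diffusion model generated videos + DL3DV + RealEstate10K + MiraData real videos |
| LGM optimizer | AdamW, learning rate (value missing), no weight decay |
| LGM schedule | 20 epochs, cosine decay, linear warmup over the first 100 optimization steps |
| LGM gradient clipping | max norm 1.0 |
| Connector | 3D conv, kernel (5, 5, 5), stride (1, 2, 2), padding (2, 2, 2), as in the Section 3.6 pseudocode |
| LGM LoRA | rank (value missing), scaling (value missing) |
| VGGRPO backbones | Wan2.1-1B, Wan2.2-5B |
| VGGRPO LoRA | rank (value missing), scaling (value missing) |
| VGGRPO group size | 64 (default in the Section 3.6 pseudocode) |
| VGGRPO optimizer | AdamW, learning rate (value missing), weight decay (value missing) |
| VGGRPO clipping / KL | clip range 1e-3, KL weight 0.004 (defaults in the Section 3.6 pseudocode) |
| VGGRPO gradient clipping | max norm 1.0 |
| Training compute | approximately 1536 GPU hours |
| Denoising reduction | training sample schedule (example value missing) vs. inference schedule (value missing) |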

4.2 Baselines and metrics

Baselines: Base Model, Supervised Fine-Tuning (SFT), Epipolar-DPO, VideoGPA. The evaluation covers a static split, a dynamic split, and general VBench captions.

Main metrics:

  • Static: VideoReward Visual Quality (VQ↑), Motion Quality (MQ↑), Sampson epipolar error (Epi.↓).
  • Dynamic: VideoReward VQ↑, MQ↑.
  • VBench: Subject Consistency, Background Consistency, Aesthetic Quality, Imaging Quality, Motion Smoothness, Dynamic Degree.

4.3 Open-source code search and github_ref

Conclusion: no open-source implementation was found.

Search log (2026-05-16):

  • The Hugging Face paper page links only to the arXiv PDF and the project page; there is no model, dataset, or space code link.
  • The code/data/media section of the arXiv page lists no author implementation.
  • The project page has only Paper and LinkedIn post links; no GitHub implementation is given.
  • The CatalyzeX page shows paper metadata and the project page, but no directly accessible implementation URL.
  • GitHub API / web search queries: VGGRPO, "Visual Geometry GRPO", "Towards World-Consistent Video Generation", "2603.26599", "Latent Geometry Model" "VGGRPO". The results were paper lists, blogs, and the project-page source repo (e.g. ZhaochongAn/ZhaochongAn.github.io); no algorithm source repo was found.

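5. Results and analysis

5.1 Main results

| Base backbone | Method | Static VQ↑ | Static MQ↑ | Static Epi.↓ | Dynamic VQ↑ | Dynamic MQ↑ | Sub. Cons.↑ | Bg. Cons.↑ | Aes. Qual.↑ | Img. Qual.↑ | Mot. Smooth.↑ | Dyn. Deg.↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wan2.1-1B | Base | - | - | 0.133 | - | - | 0.7941 | 0.8930 | 0.5233 | 0.6178 | 0.9552 | 0.9231 |
| Wan2.1-1B | SFT | 45.26 | 46.84 | 0.137 | 40.00 | 39.00 | 0.8032 | 0.8896 | 0.5472 | 0.6256 | 0.9646 | 0.8795 |
| Wan2.1-1B | Epipolar-DPO | 54.21 | 55.79 | 0.098 | 45.50 | 43.00 | 0.8125 | 0.8916 | 0.5578 | 0.6461 | 0.9671 | 0.8816 |
| Wan2.1-1B | VideoGPA | 53.68 | 56.32 | 0.105 | 42.50 | 41.00 | 0.8068 | 0.8931 | 0.5562 | 0.6507 | 0.9650 | 0.8734 |
| Wan2.1-1B | VGGRPO | 59.47 | 66.84 | 0.102 | 57.00 | 63.00 | 0.8255 | 0.8974 | 0.5623 | 0.6585 | 0.9753 | 0.9048 |
| Wan2.2-5B | Base | - | - | 0.142 | - | - | 0.8151 | 0.8958 | 0.4837 | 0.6402 | 0.9467 | 0.8692 |
| Wan2.2-5B | SFT | 46.32 | 52.63 | 0.129 | 33.00 | 51.00 | 0.8323 | 0.8925 | 0.4886 | 0.6159 | 0.9548 | 0.9026 |
| Wan2.2-5B | Epipolar-DPO | 52.11 | 58.95 | 0.101 | 38.00 | 54.50 | 0.8407 | 0.9054 | 0.4945 | 0.6275 | 0.9482 | 0.7603 |
| Wan2.2-5B | VideoGPA | 54.74 | 60.53 | 0.098 | 40.00 | 54.00 | 0.8511 | 0.9048 | 0.4920 | 0.6131 | 0.9518 | 0.7645 |
| Wan2.2-5B | VGGRPO | 62.63 | 68.42 | 0.093 | 56.50 | 66.00 | 0.8672 | 0.9056 | 0.5094 | 0.6843 | 0.9619 | 0.8421 |

Main takeaway: on both backbones VGGRPO scores higher than every baseline on Static and Dynamic VQ/MQ (for example, Static MQ of 66.84 vs. 56.32 for the best baseline on Wan2.1-1B), and on Wan2.2-5B it reaches the lowest Static Epi., 0.093. On Wan2.1-1B, Epipolar-DPO keeps the lowest Static Epi. (0.098 vs. 0.102 for VGGRPO). Compared with Epipolar-DPO and VideoGPA, which rely on stronger static-scene assumptions, VGGRPO's advantage is larger on the dynamic split: on Wan2.2-5B, Dynamic MQ rises from 54.00 (VideoGPA) to 66.00.

Reading Figure 3: the qualitative comparison shows first, middle, and last frames for static and dynamic prompts. The base model and the DPO baselines exhibit temporal flicker, geometric drift, or camera instability; with VGGRPO the scene structure is more continuous and the camera trajectory smoother.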

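5.2 Ablation: geometry FM and reward terms

| Study | Variant | VQ↑ | MQ↑ | Epi.↓ |
|---|---|---|---|---|
| Geometry FM | VGGT | 54.96 | 60.61 | 0.090 |
| Geometry FM | Any4D | 59.57 | 67.21 | 0.093 |
| Reward terms | motion reward only | 55.60 | 63.40 | 0.104 |
| Reward terms | motion + geometry reward | 59.57 | 67.21 | 0.093 |

Any4D's dynamic 4D reconstruction capability yields higher VQ/MQ (59.57/67.21 vs. 54.96/60.61), while VGGT is slightly better on static epipolar error (0.090 vs. 0.093). The reward ablation shows that the motion reward alone stabilizes the camera, but adding reprojection consistency improves both geometric quality (Epi. 0.104 to 0.093) and perceptual quality.

Reading Figure 4: when only the motion reward is optimized, the camera trajectory becomes smooth but the scene structure still shows local geometric artifacts. Adding the geometry reward makes the reconstructed geometry more consistent, which indicates that the two rewards are complementary.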

5.3 Generalization and efficiency

VBench generalization (standard VBench captions):

| Model | Sub. Cons.↑ | Bg. Cons.↑ | Aes. Qual.↑ | Img. Qual.↑ | Mot. Smooth.↑ | Dyn. Deg.↑ |
|---|---|---|---|---|---|---|
| Baseline | 0.9542 | 0.9528 | 0.5966 | 0.6733 | 0.9841 | 0.4237 |
| VGGRPO | 0.9644 | 0.9583 | 0.5991 | 0.6861 | 0.9895 | 0.3962 |

Efficiency (reward computation, batch size 4):

| Reward | Time↓ | Peak Mem↓ |
|---|---|---|
| RGB-based | 54.73 s | 76.80 GB |
| Latent reward (VGGRPO) | 41.33 s | 68.57 GB |

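Efficiency takeaway: the latent reward is 13.40 s faster than the RGB-based reward, a 24.5% time reduction, and peak GPU memory drops from 76.80 GB to 68.57 GB (about 10.7%). On the VBench captions, Dynamic Degree is lower than the baseline's (0.3962 vs. 0.4237). The authors attribute this to the motion reward reducing camera jitter, which lowers the RAFT optical flow magnitude. This does not mean the videos are less dynamic in a meaningful sense, because Motion Quality and Motion Smoothness improve at the same time.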
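The efficiency figures can be re-derived from the table with a few lines of arithmetic. The variable names are illustrative; the four input numbers are the ones reported in the efficiency table.

```python
rgb_time, latent_time = 54.73, 41.33   # seconds per reward computation
rgb_mem, latent_mem = 76.80, 68.57     # peak GPU memory in GB

time_saved = rgb_time - latent_time
time_reduction_pct = 100.0 * time_saved / rgb_time

mem_saved = rgb_mem - latent_mem
mem_reduction_pct = 100.0 * mem_saved / rgb_mem

print(f"time: -{time_saved:.2f} s ({time_reduction_pct:.1f}%)")
print(f"memory: -{mem_saved:.2f} GB ({mem_reduction_pct:.1f}%)")
```

The relative memory saving is smaller than the relative time saving, and both are measured on reward computation only, not on the full training step.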

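Reading Figure 5: the supplementary analysis uses latent perturbation to test the robustness of geometry estimation. The RGB-based geometry model degrades quickly even when the decoded frames change very little visually; the LGM, being trained directly on generated latents, is more stable under latent noise and distribution shift.

5.4 Limitations and caveats

  • No source code is publicly available, so the implementation details of the training script, the actual reward weights, and the sampling schedule cannot be verified. The config numbers in these notes come from the paper TeX/PDF, not from a launch config.
  • The LGM is bounded by the capability of its geometry foundation model. Any4D suits dynamic scenes better than VGGT, but it can still be affected by errors in scene flow or static filtering.
  • The rewards mainly constrain camera smoothness and geometric reprojection. They do not directly guarantee object identity, semantic consistency, or physical validity, which may require additional rewards or a world-state representation.
  • The 1536 GPU hours show that online video RL remains expensive. The latent reward lowers the cost of reward computation but does not remove the overall training cost of group sampling.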