RLVR-World: Training World Models with Reinforcement Learning

Authors: Jialong Wu, Shaofeng Yin, Ningya Feng, Mingsheng Long
Affiliations: Tsinghua University (BNRist, School of Software / Zhili College)
GitHub: thuml/RLVR-World
Venue: NeurIPS 2025

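1. Motivation

Core problem: the MLE training objective is misaligned with the world-model task objective

The core task of a world model is to predict state transitions, and it is usually evaluated by prediction accuracy (language) or perceptual quality (vision, e.g. LPIPS, SSIM). Existing world models, however, are almost always trained with Maximum Likelihood Estimation (MLE): next-token prediction for language models, MSE/VQ losses for video models. This causes three key problems:

  1. The surrogate objective does not match the task metric: MLE optimizes log-likelihood rather than the downstream evaluation metric (accuracy, F1, LPIPS), so the two may be only loosely related or may even diverge.
  2. Teacher forcing causes multi-step error accumulation: training conditions on ground-truth prefixes, so the model never sees the errors that accumulate during multi-step prediction.
  3. The likelihood objective induces repetition: the repetition/hallucination problem widely observed in language models also appears in video world models (a repetition rate of about 48.6%).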

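Inspiration

RLVR (Reinforcement Learning with Verifiable Rewards) has been notably successful in LLM reasoning (e.g. DeepSeek-R1 on math and code). Its core idea is to replace a learned reward model with a rule-based, verifiable reward. World modeling is a natural fit for RLVR: prediction accuracy is itself a verifiable reward.

Reading Figure 1: the left side shows the conventional MLE training paradigm (Pre-train + SFT), which is scalable but relies on surrogate optimization, so the training objective is misaligned with the task objective. The right side shows the post-training paradigm proposed by RLVR-World (Pre-train + SFT + RL): predictions are sampled, decoded, and compared with the ground truth to compute a verifiable reward, giving task-aligned optimization. This paradigm is compute-heavy but optimizes the task metric directly.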


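2. Idea

Core idea

Cast world modeling as autoregressive sequence prediction, then post-train with RLVR (specifically the GRPO algorithm), directly optimizing task-specific prediction metrics as verifiable rewards.

Three key design choices

  1. Unified sequence-modeling framework: world models for different modalities (language/video) are all framed as question-response autoregressive generation, with states and actions encoded as token sequences by modality-specific tokenization.
  2. Prediction metrics as verifiable rewards: accuracy/F1-score for language world models, and the negative L1 + LPIPS perceptual loss for video world models.
  3. GRPO: group-relative advantage estimation with no value function; a group of responses is sampled and each response's reward is normalized against the rest of the group to obtain its advantage.

Scope of application

  • Language world models: text-game state prediction, web-page state prediction, model predictive control for web agents
  • Video world models: robot manipulation trajectory prediction (single-step/multi-step), Real2Sim policy evaluation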

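3. Method

Reading Figure 2: the full RLVR-World pipeline. The upper half (a) shows the language world model: the game-state JSON and the action text are tokenized and fed to the Language World Model, a group of predictions is sampled, each is detokenized and its predicted state extracted, and Accuracy/F1 against the ground truth serves as the verifiable reward that GRPO uses to update the model. The lower half (b) shows the video world model: visual observations are encoded by a Visual Encoder and actions are discretized by quantization before entering the Video World Model; the sampled token sequences are decoded into predicted frames by a Visual Decoder, and MSE/LPIPS/SSIM against the ground-truth frames serve as the verifiable reward.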

3.1 Problem Formulation

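The environment is modeled as a Markov decision process (MDP):

  • State space $\mathcal{S}$: text (JSON objects) or visual frames
  • Action space $\mathcal{A}$: text actions or robot control vectors
  • Goal of the world model: learn the transition distribution $p(s' \mid s, a)$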

3.2 World Models as Sequence Modeling

World modeling is cast as autoregressive sequence prediction:

  • Input token sequence ("question"): the tokenized current state and action
  • Output token sequence ("response"): the tokenized predicted next state

Tokenization schemes

  • Language: standard BPE text tokenization
  • Vision: VQGAN-based visual tokenizer (per-frame or compressive)
  • Low-dimensional continuous values (e.g. robot actions): uniform discretization into 256 bins
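The uniform discretization of continuous values can be sketched as follows. This is a minimal illustration, not the repository's code: the value range `[-1, 1]`, the bin-center decoding in `undiscretize`, and both function names are assumptions made for the example.

```python
def discretize(value, low, high, num_bins=256):
    """Map a continuous value in [low, high] to a bin index in [0, num_bins - 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)  # out-of-range values saturate
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)       # value == high falls in the last bin


def undiscretize(index, low, high, num_bins=256):
    """Recover an approximate continuous value as the center of the bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


# Hypothetical 4-dimensional action; every dimension is assumed to lie in [-1, 1].
action = [0.0, -1.0, 1.0, 0.25]
action_tokens = [discretize(a, -1.0, 1.0) for a in action]
print(action_tokens)  # [128, 0, 255, 160]
```

With 256 bins over a range of width 2, the round trip through `discretize` and `undiscretize` loses at most half a bin width (1/256) per dimension.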

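MLE pre-training objective (the standard next-token log-likelihood, with $q$ the question and $o$ the response):

$$J_{\mathrm{MLE}}(\theta) = \sum_{t=1}^{|o|} \log p_\theta(o_t \mid q, o_{<t})$$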

3.3 Prediction Metrics as Verifiable Rewards

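The core of RLVR post-training: given an input $q$, the model samples a group of predictions $\{o_i\}_{i=1}^{G}$, extracts the predicted states $\hat{s}'_i$ through modality-specific decoding, and computes a reward against the ground-truth next state $s'$ with a prediction metric $D$:

$$R_i = \mathrm{sign}(D) \cdot D(\hat{s}'_i, s')$$

where $\mathrm{sign}(D) = -1$ when lower is better (e.g. MSE, LPIPS) and $\mathrm{sign}(D) = +1$ when higher is better (e.g. accuracy).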
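The sign convention above can be illustrated with a small sketch. The function names and the toy metrics here are hypothetical stand-ins chosen for the example; they are not taken from the paper or its code.

```python
def metric_to_reward(metric_value, lower_is_better):
    """Turn a prediction metric into a reward that is always 'higher is better'."""
    sign = -1.0 if lower_is_better else 1.0
    return sign * metric_value


def mean_squared_error(pred, target):
    """Toy MSE over two equally long lists of numbers (lower is better)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def exact_match_accuracy(pred, target):
    """Toy accuracy for a single prediction (higher is better)."""
    return 1.0 if pred == target else 0.0


pixel_reward = metric_to_reward(
    mean_squared_error([0.0, 1.0], [0.5, 1.0]), lower_is_better=True)
state_reward = metric_to_reward(
    exact_match_accuracy({"door": "open"}, {"door": "open"}), lower_is_better=False)
print(pixel_reward, state_reward)  # -0.125 1.0
```

Flipping the sign for error-style metrics lets the same policy-gradient update maximize reward for every modality.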

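3.4 GRPO Optimization

GRPO advantage estimation (group-relative, no value function needed), where $R_i$ is the reward of the $i$-th of $G$ sampled responses:

$$\hat{A}_{i,t} = \frac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}$$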
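A minimal sketch of group-relative normalization follows. The small `eps` term and the use of the population standard deviation are assumptions of this example; actual training frameworks may differ on both points.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Binary rewards for a group of G=5 sampled responses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in advantages])
```

Every token of a response shares that response's advantage. When all rewards in a group are equal, every advantage is zero, so the group contributes no policy-gradient signal.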

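GRPO objective (clipped objective + KL divergence penalty), in its standard form, where $\hat{A}_{i,t}$ is the group-relative advantage:

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \mathrm{clip}\left(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{i,t} \right) - \beta\, D_{\mathrm{KL}}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \right) \right]$$

with the token-level probability ratio

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}$$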
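The clipped objective with a KL penalty can be sketched on plain log-probabilities as follows. This is an illustration rather than the training code: the default values `clip_eps=0.2` and `beta=0.001`, the per-token KL estimator, and the input layout are all assumptions of the example.

```python
import math


def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_eps=0.2, beta=0.001):
    """Clipped surrogate minus a KL penalty for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Non-negative per-token estimate of KL(pi_theta || pi_ref).
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
    return surrogate - beta * kl


def grpo_objective(group, clip_eps=0.2, beta=0.001):
    """Average over responses of the length-normalized token objectives.

    `group` is a list of (advantage, tokens) pairs, where `tokens` is a list
    of (logp_new, logp_old, logp_ref) triples for one sampled response.
    """
    total = 0.0
    for advantage, tokens in group:
        per_token = [grpo_token_objective(n, o, r, advantage, clip_eps, beta)
                     for n, o, r in tokens]
        total += sum(per_token) / len(per_token)
    return total / len(group)


on_policy = (-1.0, -1.0, -1.0)  # new, old, and reference policies agree
print(grpo_objective([(1.0, [on_policy] * 2), (-1.0, [on_policy])]))  # 0.0
```

Clipping caps how much a single update can gain from pushing the ratio away from 1, and the KL term keeps the policy close to the reference model.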

3.5 Language World Model Implementation

Text Game State Prediction

# Pseudocode: text-game RLVR pipeline
# Base model: DeepSeek-R1-Distill-Qwen-1.5B/7B

# Step 1: SFT on CoT data generated by DeepSeek-R1 via rejection sampling
sft_data = reject_sample(deepseek_r1, changed_cases, n=4237)
model = lora_sft(base_model, sft_data, rank=32, epochs=10)

# Step 2: RLVR with a binary or task-specific reward
for step in range(num_steps):
    q = tokenize(current_state, action)                # question
    outputs = sample_group(model, q, G=5)              # sample G responses
    rewards = []                                       # one reward per response
    for o_i in outputs:
        predicted_state = extract_json(detokenize(o_i))
        # Binary reward: 1 only on an exact match
        reward = 1.0 if predicted_state == ground_truth else 0.0
        # Task-specific alternative:
        # reward = 0.1 * acc_all + 1.0 * acc_changed + 0.2 * I(correct)
        rewards.append(reward)
    advantages = group_relative_normalize(rewards)
    update_grpo(model, outputs, advantages)

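Task-specific reward formula

$$R = 0.1 \cdot \mathrm{acc}_{\mathrm{all}} + 1.0 \cdot \mathrm{acc}_{\mathrm{changed}} + 0.2 \cdot \mathbb{I}(\mathrm{correct})$$

where the three terms are those named in the pseudocode above: $\mathrm{acc}_{\mathrm{all}}$ and $\mathrm{acc}_{\mathrm{changed}}$ are accuracy terms, and $\mathbb{I}(\mathrm{correct})$ indicates a correct prediction.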
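One way to read this weighted reward is sketched below. The notes do not define the accuracy terms precisely, so the interpretation here is an assumption: states are flat dictionaries, `acc_all` is per-key accuracy over all keys, `acc_changed` is per-key accuracy over keys whose value differs between the current and the ground-truth next state, and the indicator is an exact match. The function name and the fallback for states with no changes are also hypothetical.

```python
def task_specific_reward(predicted, ground_truth, current,
                         w_all=0.1, w_changed=1.0, w_exact=0.2):
    """Weighted sum of all-key accuracy, changed-key accuracy, and exact match."""
    def is_right(key):
        return key in predicted and predicted[key] == ground_truth[key]

    keys = list(ground_truth)
    acc_all = sum(is_right(k) for k in keys) / len(keys) if keys else 1.0

    # Keys whose value changes in the true transition (assumed definition).
    changed = [k for k in keys if current.get(k) != ground_truth[k]]
    if changed:
        acc_changed = sum(is_right(k) for k in changed) / len(changed)
    else:
        acc_changed = 1.0  # assumption: nothing changed counts as fully right

    exact = 1.0 if predicted == ground_truth else 0.0
    return w_all * acc_all + w_changed * acc_changed + w_exact * exact


current = {"door": "closed", "lamp": "off"}
truth = {"door": "open", "lamp": "off"}
print(task_specific_reward(truth, truth, current))    # perfect prediction
print(task_specific_reward(current, truth, current))  # copied state, missed the change
```

The heavy weight on the changed part penalizes a model that simply copies the current state, which would otherwise score well on mostly unchanged states.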

Web Page State Prediction

# Web page state RLVR pipeline: reward function
# Reward: F1 score between predicted and ground-truth item changes

def compute_f1_reward(predicted_changes, gt_changes):
    """F1 score as a verifiable reward for web page prediction."""
    # Compare as sets so duplicate predictions cannot distort precision.
    predicted, gt = set(predicted_changes), set(gt_changes)
    if not predicted and not gt:
        return 1.0  # nothing changed and nothing predicted: perfect
    tp = len(predicted & gt)
    if tp == 0:
        return 0.0  # no overlap, including the case where one side is empty
    precision = tp / len(predicted)
    recall = tp / len(gt)
    return 2 * precision * recall / (precision + recall)

3.6 Video World Model Implementation

Visual Tokenization

  • Per-frame tokenizer (for single-step prediction): VQGAN + FSQ, 320 tokens per frame
  • Compressive tokenizer (for multi-step prediction): conditional VQGAN with cross-attention, only 80 tokens per frame (4x compression), plus context-frame tokens

Sequence Construction

Single-step prediction

The underlined part of the sequence (321 tokens) contributes to the loss.

Multi-step prediction

Codebook size = 9006.

# Pseudocode: video world model RLVR pipeline
# Autoregressive Transformer: 138M params (LLaMA architecture, GPT-2 small scale)

# Step 1: pre-train with MLE (next-token prediction)
# Single-step: 9.9 × 10^5 steps; multi-step: 4.5 × 10^5 steps

# Step 2: RLVR post-training
for step in range(num_rlvr_steps):  # typically ~100-300 steps
    # Sample a group of G=16 token sequences
    token_seqs = sample_group(model, context_tokens, G=16)

    rewards = []  # one reward per sampled sequence
    for seq in token_seqs:
        # Decode tokens back to frames via the visual decoder
        predicted_frames = visual_decoder(seq)

        # Single-step reward:
        #   R = -L1(predicted, gt) - LPIPS(predicted, gt)
        # Multi-step reward:
        #   R = -sum_{tau=t+1}^{t+7} [L1(s_hat_tau, s_tau) + LPIPS(s_hat_tau, s_tau)]
        # compute_reward is a placeholder for whichever reward above applies
        rewards.append(compute_reward(predicted_frames, gt_frames))

    advantages = group_relative_normalize(rewards)
    update_grpo(model, token_seqs, advantages)
    # NOTE: the visual tokenizer parameters are frozen and not updated

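Reward functions of the video world model

Single-step:

$$R = -\left[ L_1(\hat{s}_{t+1}, s_{t+1}) + \mathrm{LPIPS}(\hat{s}_{t+1}, s_{t+1}) \right]$$

Multi-step:

$$R = -\sum_{\tau=t+1}^{t+7} \left[ L_1(\hat{s}_\tau, s_\tau) + \mathrm{LPIPS}(\hat{s}_\tau, s_\tau) \right]$$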
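The structure of these rewards can be sketched without any deep-learning dependency by passing the perceptual term in as a callable. Everything here is illustrative: frames are flat lists of pixel values, `perceptual_fn` is a hypothetical stand-in for an LPIPS model, and the zero-valued stub exists only to make the example runnable.

```python
def l1_distance(pred, target):
    """Mean absolute error between two frames given as flat lists of pixels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def video_reward(pred_frames, gt_frames, perceptual_fn):
    """Negative sum over frames of L1 plus a perceptual distance.

    One frame gives the single-step reward; several frames give the
    multi-step reward.
    """
    if len(pred_frames) != len(gt_frames):
        raise ValueError("predicted and ground-truth frame counts differ")
    total = 0.0
    for pred, gt in zip(pred_frames, gt_frames):
        total += l1_distance(pred, gt) + perceptual_fn(pred, gt)
    return -total


def zero_perceptual(pred, gt):
    """Stub standing in for LPIPS so the sketch runs on its own."""
    return 0.0


print(video_reward([[0.0, 1.0]], [[0.5, 1.0]], zero_perceptual))  # -0.25
```

Because the reward is computed on decoded frames, it measures what the evaluation metrics measure, while the policy gradient only updates the token-level transformer.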


4. Experimental Setup

4.1 Language World Model

| Setting | Text Game | Web Page |
| --- | --- | --- |
| Dataset | ByteSized32-State-Prediction (76,369 transitions, 31 games, 2954 test) | WebArena (WMA) (~7K samples, 99% train / 1% test) |
| Base model | DeepSeek-R1-Distill-Qwen-1.5B / 7B | DeepSeek-R1-Distill-Qwen-1.5B |
| SFT | LoRA rank=32, α=16, 10 epochs, lr= | LoRA rank=32, α=16, 4 epochs, lr= |
| RLVR | batch=128, group=5, lr=, KL coeff= | batch=64, group=5, lr=, KL coeff= |
| Reward | Binary / Task-specific (weights 0.1, 1, 0.2) | F1 score |
| Evaluation | Accuracy (unchanged / changed / overall) | Precision, Recall, F1 |

4.2 Video World Model

| Setting | Value |
| --- | --- |
| Dataset | RT-1 (87,212 trajectories, 256x320, 13-dim actions), PushT, Rope, Granular |
| Visual tokenizer | VQGAN + FSQ (per-frame: 320 tokens; compressive: 80 tokens) |
| Autoregressive Transformer | 138M params, 12 layers, hidden=768, 12 heads (LLaMA arch) |
| Pre-training | Single-step: 9.9x10^5 steps; multi-step: 4.5x10^5 steps; lr= |
| RLVR | batch=128, group=16, lr=, KL coeff=, ~100-300 steps |
| Reward | Negative L1 + LPIPS (summed over the predicted frames for multi-step) |
| Evaluation | MSE, PSNR, SSIM, LPIPS (scaled by 100) |

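4.3 Downstream Applications

  • Model Predictive Control (Web Agent): the policy model samples 20 candidate actions → the 3 most frequent are kept → the world model predicts the outcome of each → a summarization model extracts the top-10 changes → a value model scores each outcome (1-5), repeated 20 times and averaged; the action with the highest mean score is chosen. The policy, summarization, and value models are all DeepSeek-V3.
  • Real2Sim Policy Evaluation: the video world model replaces the hand-built simulator (SIMPLER) to evaluate RT-1/RT-1-X policies on 6 tasks such as open/close drawer.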
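The model-predictive-control loop for the web agent can be sketched with the four models passed in as plain callables. This is a hedged illustration: the function and parameter names are hypothetical, and the choice to repeat only the value scoring (rather than the whole predict-summarize-score chain) is an assumption, since the description leaves that open.

```python
from collections import Counter


def select_action(sample_action, predict_next_state, summarize, score,
                  num_candidates=20, top_k=3, num_repeats=20):
    """Pick the shortlisted action whose predicted outcome scores highest on average."""
    candidates = [sample_action() for _ in range(num_candidates)]
    # Keep the top_k most frequently sampled actions.
    shortlist = [action for action, _ in Counter(candidates).most_common(top_k)]

    best_action, best_value = None, float("-inf")
    for action in shortlist:
        summary = summarize(predict_next_state(action))
        scores = [score(summary) for _ in range(num_repeats)]
        value = sum(scores) / len(scores)
        if value > best_value:
            best_action, best_value = action, value
    return best_action, best_value


# Deterministic stubs standing in for the policy, world, summarization, and value models.
_pool = iter(["click A"] * 10 + ["click B"] * 6 + ["type C"] * 3 + ["scroll"])
_values = {"click A": 2, "click B": 5, "type C": 3}
choice = select_action(
    sample_action=lambda: next(_pool),
    predict_next_state=lambda action: action,
    summarize=lambda state: state,
    score=lambda summary: _values[summary],
)
print(choice)  # ('click B', 5.0)
```

The world model sits in the middle of the loop, so a more accurate next-state prediction gives the value model better evidence for ranking candidate actions.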

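5. Experimental Results

5.1 Language World Model: Text Game

Reading the text-game training curves figure: the curves on the right show the training reward (green) rising steadily during RLVR, while test accuracy improves quickly on unchanged cases (blue) and steadily on changed cases (red). Training converges in only ~300 steps.

Key results (Table 1):

| Model | Unchanged Acc | Changed Acc | Overall Acc |
| --- | --- | --- | --- |
| Base (R1-Distill-Qwen-1.5B) | 11.98% | 0.08% | 7.11% |
| SFT | 38.88% | 24.21% | 32.87% |
| RLVR-World (binary) | 73.57% | 33.14% | 57.01% |
| RLVR-World (task-specific) | 83.66% | 33.80% | 63.24% |
| Base (R1-Distill-Qwen-7B) | 46.90% | 5.53% | 29.92% |
| SFT (7B) | 65.94% | 31.32% | 51.76% |
| RLVR-World (7B, binary) | 83.08% | 40.33% | 65.53% |
| GPT-4 | 73.90% | 51.60% | 64.76% |

  • 1.5B model: the binary reward improves over SFT by +34.7 percentage points on unchanged cases and +8.9 points on changed cases.
  • The task-specific reward raises the gain over SFT to +44.8 points (unchanged) and +9.6 points (changed).
  • After RLVR, the 7B model exceeds GPT-4 in overall accuracy (65.53% vs 64.76%), although GPT-4 remains ahead on changed cases (51.60% vs 40.33%).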
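The gains quoted for the 1.5B model are absolute differences in accuracy (percentage points) relative to the SFT row of Table 1. A short check using only the table values; the dictionary layout and function name are just for this example:

```python
# Unchanged / changed accuracy (%) for the 1.5B model, copied from Table 1.
table1 = {
    "SFT": {"unchanged": 38.88, "changed": 24.21},
    "RLVR-World (binary)": {"unchanged": 73.57, "changed": 33.14},
    "RLVR-World (task-specific)": {"unchanged": 83.66, "changed": 33.80},
}


def point_gain(model, baseline="SFT"):
    """Absolute gain over the baseline in percentage points, to one decimal."""
    return {split: round(table1[model][split] - table1[baseline][split], 1)
            for split in table1[baseline]}


print(point_gain("RLVR-World (binary)"))         # {'unchanged': 34.7, 'changed': 8.9}
print(point_gain("RLVR-World (task-specific)"))  # {'unchanged': 44.8, 'changed': 9.6}
```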

5.2 Language World Model: Web Page

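Reading the web-page training curves figure: the curves on the right show F1 on the training set (blue) and the test set (red). F1 stabilizes around 0.65 after about 250 steps, with no clear overfitting.

Key results (Table 2):

| Model | Precision | Recall | F1 | Web Agent Success Rate |
| --- | --- | --- | --- | --- |
| Base (R1-Dist.-Qwen-1.5B) | 15.59% | 15.70% | 11.83% | n/a |
| SFT | 48.99% | 56.05% | 49.94% | 12.06% |
| RLVR-World | 72.77% | 64.55% | 65.11% | 14.29% |
| Relative gain over SFT | +48.5% | +15.1% | +30.3% | +18.4% |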

5.3 Video World Model: RT-1

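Reading Figure 3: training curves of the video world model on RT-1. The left panel (single-step) and the right panel (multi-step) show LPIPS during pre-training (MLE) and post-training (RLVR), which are plotted on different x-axis scales. Key observation: RLVR needs only ~100 steps (after the green dashed line) to lower LPIPS substantially, matching or beating what MLE achieves over hundreds of thousands of steps. In the right panel, the orange dashed line shows pre-training extended to 600K steps: LPIPS stays at ~14.5, well short of the ~13.4 reached after RLVR. RLVR is therefore far more step-efficient, needing on the order of 1000x fewer gradient steps.

Key results (Table 3, RT-1):

| Task | Model | Repetition Rate | MSE↓ | PSNR↑ | SSIM↑ | LPIPS↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Single-step | Base | n/a | 0.336 | 25.3 | 81.7 | 13.0 |
| Single-step | RLVR-World | n/a | 0.287 | 25.9 | 83.1 | 12.2 |
| Single-step | Relative gain | n/a | +14.3% | +2.6% | +1.6% | +6.0% |
| Multi-step | Base | 48.6% | 0.659 | 23.1 | 80.9 | 14.8 |
| Multi-step | Base (w/ rep. rejection) | 0.0% | 0.593 | 23.3 | 81.0 | 14.4 |
| Multi-step | RLVR-World | 9.9% | 0.486 | 24.1 | 82.4 | 13.4 |
| Multi-step | Relative gain | +79.6% | +26.1% | +4.5% | +1.9% | +9.2% |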

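5.4 Comparison with SOTA (Table 4)

| Model | PushT LPIPS↓ | PushT SSIM↑ | Rope LPIPS↓ | Granular LPIPS↓ | Granular SSIM↑ |
| --- | --- | --- | --- | --- | --- |
| DINO-WM (Reported) | 0.7 | 98.5 | 0.9 | 3.5 | 94.0 |
| AVDC (Diffusion) | 4.6 | 95.9 | 6.0 | 10.6 | 90.9 |
| Base (Ours) | 0.83 | 98.28 | 3.03 | 3.14 | 94.79 |
| RLVR-World | 0.70 | 98.46 | 2.08 | 2.42 | 95.42 |

After RLVR, the model matches DINO-WM on PushT and clearly surpasses it on Granular (the hardest dataset), while DINO-WM remains ahead on Rope (0.9 vs 2.08 LPIPS).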

5.5 Model Analysis

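Reading Figure 4: three analysis experiments. (a) Test-time scaling: LPIPS of the best of $N$ samples as the number of samples $N$ grows. RLVR-World outperforms the base model's best-of-5 with a smaller sampling budget, but by $N = 100$ the base model gradually catches up and even overtakes it, which suggests the current RLVR method leaves room for improvement. (b) RL training scaling: a larger GRPO group size (4→8→16→32) improves both convergence speed and final performance. (c) Metric-oriented optimization: training with different metrics (MAE/MSE/PSNR/SSIM/LPIPS) as the reward yields a model that is best on its own metric, confirming that RLVR can optimize a specific metric.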
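Best-of-N selection in panel (a) can be sketched as follows. The example assumes the best sample is chosen by scoring each candidate against the ground truth, which makes it an analysis tool rather than a deployable procedure; the function names and the toy metric are hypothetical.

```python
def best_of_n(candidates, ground_truth, metric, lower_is_better=True):
    """Return the candidate with the best metric value, and that value."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [metric(c, ground_truth) for c in candidates]
    pick = min if lower_is_better else max
    best_index = pick(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


def mean_abs_error(pred, target):
    """Toy stand-in for a perceptual metric such as LPIPS (lower is better)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


samples = [[0.0, 0.0], [1.0, 0.5], [1.0, 1.0]]
best, best_score = best_of_n(samples, [1.0, 1.0], mean_abs_error)
print(best, best_score)  # [1.0, 1.0] 0.0
```

As N grows, the best sample of any reasonably diverse model improves, which is one way to read the base model catching up at large N.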

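5.6 Repetition Mitigation

Reading Figure 6 (upper half): qualitative comparison of multi-step video prediction. The top two rows show the ground truth and the RLVR-World prediction; the frames are sharp and keep changing. The bottom row shows the base model's prediction, which at some point starts to repeat frames, with the robot arm completely frozen. RLVR reduces the repetition rate from 48.6% to 9.9%.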

5.7 Real2Sim Policy Evaluation

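Reading Figure 5: Real2Sim policy evaluation results. The horizontal axis is the real success rate and the vertical axis is the success rate measured in simulation (evaluated success rate). Ideally every point lies on the diagonal. Compared with the hand-built SIMPLER simulator (red/blue), the video world model (green/purple) deviates less from the diagonal, indicating that a neural world model approximates the real world better as a simulator. RLVR-World (purple) improves further on the base model (green), giving more accurate policy evaluation.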

5.8 Training Reward Curves

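Reading Figure 8: training curves of RLVR-World for single-step prediction. The left panel shows the reward (negative L1 + LPIPS), the middle panel the L1 loss, and the right panel the LPIPS loss. All three improve steadily within ~300 steps, and the smoothed curve (orange) shows a stable monotonic trend. This supports the effectiveness and training stability of GRPO for visual world models.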


Code-to-Paper Mapping

| Paper component | Code location | Description |
| --- | --- | --- |
| Language WM SFT | lang_wm/verl/examples/sft/ | LoRA supervised fine-tuning |
| Language WM RLVR | lang_wm/verl/examples/grpo_trainer/ | GRPO training with binary/task-specific reward |
| LoRA Merge | lang_wm/verl/merge_lora.py | Merges LoRA weights |
| Model Merge (RLVR) | lang_wm/verl/scripts/model_merger.py | Merges RLVR checkpoints |
| Web Agent MPC | lang_wm/webagent/ | Model Predictive Control for WebArena |
| Text Game Data | lang_wm/data_process/text_game/ | Dataset generation scripts |
| Video Tokenizer | vid_wm/ivideogpt/ | VQGAN + FSQ tokenizer (per-frame & compressive) |
| Video WM Training | vid_wm/ivideogpt/ | Transformer pre-training |
| Video WM RLVR | vid_wm/verl/ | GRPO fine-tuning for video prediction |
| Data Converter | vid_wm/oxe_data_converter.py | RT-1 Open X-Embodiment data preprocessing |
| HuggingFace Models | HuggingFace Hub | SFT/RLVR checkpoints, tokenizers, datasets |

Key dependencies: VERL (RL training framework), iVideoGPT (video tokenizer/transformer), WMA-Agents (web agent), SimplerEnv (policy evaluation)


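Summary and Limitations

Core contributions

  1. First application of the RLVR paradigm to world-model training, validated on both language and video modalities
  2. Unified sequence-modeling framework: world models of different modalities are cast as autoregressive prediction + RLVR post-training
  3. High training efficiency: RLVR needs only a few hundred gradient updates (vs hundreds of thousands of steps for MLE) and yields significant gains

Limitations

  1. Performance barrier: training usually converges within a few hundred steps; breaking through this plateau needs deeper analysis
  2. OOD generalization: whether RLVR improves world-model generalization to out-of-distribution actions, especially for counterfactual reasoning, remains open
  3. General-purpose video world models: training currently uses a single dataset; validation on general video models is future work
  4. More model architectures: GRPO is model-agnostic in principle, but GRPO algorithms for diffusion models are still being developed
  5. Task-aligned rewards: visual metrics (MSE/LPIPS) still do not fully capture physical rules and temporal consistency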