RLVR-World: Training World Models with Reinforcement Learning
Authors: Jialong Wu, Shaofeng Yin, Ningya Feng, Mingsheng Long Affiliations: Tsinghua University (BNRist, School of Software / Zhili College) GitHub: thuml/RLVR-World Venue: NeurIPS 2025
1. Motivation (研究动机)
核心问题:MLE 训练目标与 world model 任务目标不对齐
World model 的核心任务是预测状态转移 ,评估指标通常是 prediction accuracy(语言)或 perceptual quality(视觉,如 LPIPS、SSIM)。然而现有 world model 普遍采用 Maximum Likelihood Estimation (MLE) 作为训练目标(语言模型用 next-token prediction,视频模型用 MSE/VQ loss),这带来了三个关键问题:
- Surrogate objective 与 task metric 不一致:MLE 优化的是 log-likelihood,而非下游评估指标(如 accuracy、F1、LPIPS),两者可能 agnostic 甚至 diverge
- Teacher-forcing 导致多步累积误差:训练时用 ground-truth 前缀,无法感知多步预测中的 error accumulation
- Likelihood 目标引发 repetition 问题:在语言模型中已被广泛观察到的 repetition/hallucination 问题,在 video world model 中同样存在(约 48.6% 的 repetition rate)
启发来源
RLVR (Reinforcement Learning with Verifiable Rewards) 在 LLM 推理领域(如 DeepSeek-R1 的数学/代码推理)已取得显著成功,其核心思想是用 rule-based verifiable reward 替代 learned reward model。World modeling 天然适合 RLVR:prediction accuracy 本身就是一个可验证的 reward。
Figure 1 解读:左侧展示传统 MLE 训练范式(Pre-train + SFT),优点是 scalable 但使用 surrogate optimization,训练目标与任务目标不对齐。右侧展示 RLVR-World 提出的 post-training 范式(Pre-train + SFT + RL),通过采样生成预测、解码后与 ground-truth 比较计算 verifiable reward,实现 task-aligned optimization。这一范式虽然 compute-heavy,但直接优化任务指标。
2. Idea (核心思想)
核心思想
将 world modeling 统一为 autoregressive sequence prediction 问题,然后用 RLVR(具体使用 GRPO 算法) 进行 post-training,直接优化 task-specific prediction metrics 作为 verifiable rewards。
三个关键设计选择
- 统一序列建模框架:不同模态(语言/视频)的 world model 统一为 question-response 的 autoregressive generation,states 和 actions 通过 modality-specific tokenization 编码为 token 序列
- Prediction metrics 作为 verifiable rewards:语言 world model 用 accuracy/F1-score,视频 world model 用负的 L1+LPIPS 感知损失
- GRPO 算法:group-relative advantage estimation,无需 value function,采样一组 responses 后相对排序计算 advantage
应用范围
- Language world model:文本游戏状态预测、网页状态预测、web agent 的 model predictive control
- Video world model:机器人操作轨迹预测(单步/多步)、Real2Sim 策略评估
3. Method (方法)
Figure 2 解读:RLVR-World 框架的完整流程图。上半部分 (a) 展示语言 world model:将游戏状态 JSON 和动作文本 tokenize 后输入 Language World Model,采样一组预测结果,detokenize 后提取预测状态,与 ground-truth 比较计算 Accuracy/F1 作为 verifiable reward,通过 GRPO 更新模型。下半部分 (b) 展示视频 world model:将视觉观测通过 Visual Encoder 编码、动作通过 Quantization 离散化后输入 Video World Model,采样生成 token 序列后通过 Visual Decoder 解码为预测帧,与 ground-truth 帧计算 MSE/LPIPS/SSIM 作为 verifiable reward。
3.1 Problem Formulation
环境建模为 MDP :
- 状态空间 :可以是文本(JSON 对象)或视觉帧
- 动作空间 :文本动作或机器人控制向量
- World model 的目标:学习转移分布
3.2 World Models as Sequence Modeling
将 world model 统一为 autoregressive sequence prediction:
- Input token sequence (“question”):当前状态和动作的 tokenized 序列
- Output token sequence (“response”):预测的下一状态的 tokenized 序列
Tokenization 方案:
- 语言:标准 BPE text tokenization
- 视觉:VQGAN-based visual tokenizer(per-frame 或 compressive)
- 低维连续值(如机器人动作):均匀离散化到 256 bins
MLE pre-training 目标:
3.3 Prediction Metrics as Verifiable Rewards
RLVR post-training 的核心:给定输入 ,模型采样一组预测 ,通过 modality-specific 解码提取预测状态 ,计算 reward:
其中 当指标越低越好(如 MSE、LPIPS), 当指标越高越好(如 accuracy)。
3.4 GRPO 优化
GRPO 的 advantage estimation(group-relative,无需 value function):
GRPO 目标函数(clipped objective + KL divergence penalty):
3.5 Language World Model 具体实现
Text Game State Prediction
# 伪代码:Text Game RLVR Pipeline
# Base model: DeepSeek-R1-Distill-Qwen-1.5B/7B
# Step 1: SFT (用 DeepSeek-R1 生成的 CoT 数据)
sft_data = reject_sample(deepseek_r1, changed_cases, n=4237)
model = lora_sft(base_model, sft_data, rank=32, epochs=10)
# Step 2: RLVR with binary or task-specific reward
for step in range(num_steps):
q = tokenize(current_state, action) # question
outputs = sample_group(model, q, G=5) # sample G responses
for o_i in outputs:
predicted_state = extract_json(detokenize(o_i))
# Binary reward: 完全匹配才给 1
R_binary = 1 if predicted_state == ground_truth else 0
# Task-specific reward:
# R = 0.1 * acc_all + 1.0 * acc_changed + 0.2 * I(correct)
advantages = group_relative_normalize(rewards)
update_grpo(model, outputs, advantages)Task-specific reward 公式:
其中 。
Web Page State Prediction
# 伪代码:Web Page State RLVR Pipeline
# Reward: F1 score between predicted and ground-truth item changes
def compute_f1_reward(predicted_changes, gt_changes):
"""F1 score as verifiable reward for web page prediction"""
tp = len(set(predicted_changes) & set(gt_changes))
precision = tp / len(predicted_changes) if predicted_changes else (1 if not gt_changes else 0)
recall = tp / len(gt_changes) if gt_changes else (1 if not predicted_changes else 0)
return 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 03.6 Video World Model 具体实现
Visual Tokenization
- Per-frame tokenizer(用于 single-step prediction):VQGAN + FSQ,每帧 tokens,codebook size
- Compressive tokenizer(用于 multi-step prediction):conditional VQGAN with cross-attention,每帧仅 tokens(压缩 4x),context frame tokens
Sequence Construction
Single-step prediction :
序列长度 ,其中下划线部分(321 tokens)参与 loss 计算。
Multi-step prediction :
序列长度 ,codebook size = 9006。
# 伪代码:Video World Model RLVR Pipeline
# Autoregressive Transformer: 138M params (LLaMA architecture, GPT-2 small scale)
# Step 1: Pre-train with MLE (next-token prediction)
# Single-step: 9.9 × 10^5 steps; Multi-step: 4.5 × 10^5 steps
# Step 2: RLVR post-training
for step in range(num_rlvr_steps): # typically ~100-300 steps
# Sample group of G=16 token sequences
token_seqs = sample_group(model, context_tokens, G=16)
for seq in token_seqs:
# Decode tokens back to frames via visual decoder
predicted_frames = visual_decoder(seq)
# Single-step reward:
# R = -L1(predicted, gt) - LPIPS(predicted, gt)
# Multi-step reward:
# R = -sum_{tau=t+1}^{t+7} [L1(s_hat_tau, s_tau) + LPIPS(s_hat_tau, s_tau)]
advantages = group_relative_normalize(rewards)
update_grpo(model, token_seqs, advantages)
# NOTE: visual tokenizer 参数冻结,不参与更新Video world model 的 reward 函数:
Single-step:
Multi-step:
4. Experimental Setup (实验设置)
4.1 Language World Model
| 配置项 | Text Game | Web Page |
|---|---|---|
| 数据集 | ByteSized32-State-Prediction (76,369 transitions, 31 games, 2954 test) | WebArena (WMA) (~7K samples, 99% train / 1% test) |
| Base Model | DeepSeek-R1-Distill-Qwen-1.5B / 7B | DeepSeek-R1-Distill-Qwen-1.5B |
| SFT | LoRA rank=32, =16, 10 epochs, lr= | LoRA rank=32, =16, 4 epochs, lr= |
| RLVR | batch=128, group=5, lr=, KL coeff= | batch=64, group=5, lr=, KL coeff= |
| Reward | Binary / Task-specific (=0.1, =1, =0.2) | F1 score |
| Evaluation | Accuracy (unchanged / changed / overall) | Precision, Recall, F1 |
4.2 Video World Model
| 配置项 | Value |
|---|---|
| 数据集 | RT-1 (87,212 trajectories, 256x320, 13-dim actions), PushT, Rope, Granular |
| Visual Tokenizer | VQGAN + FSQ (per-frame: 320 tokens; compressive: 80 tokens) |
| Autoregressive Transformer | 138M params, 12 layers, hidden=768, 12 heads (LLaMA arch) |
| Pre-training | Single-step: 9.9x10^5 steps; Multi-step: 4.5x10^5 steps; lr= |
| RLVR | batch=128, group=16, lr=, KL coeff=, ~100-300 steps |
| Reward | |
| Evaluation | MSE, PSNR, SSIM, LPIPS (scaled by 100) |
4.3 下游应用
- Model Predictive Control (Web Agent):policy model 采样 20 个候选动作 → 取 top-3 频率最高的 → world model 预测 → summarization model 提取 top-10 变化 → value model 评分 (1-5),重复 20 次取平均最高分动作。Policy/summarization/value 均用 DeepSeek-V3
- Real2Sim Policy Evaluation:用 video world model 替代手工 simulator (SIMPLER),评估 RT-1/RT-1-X 策略在 open/close drawer 等 6 个任务上的表现
5. Experimental Results (实验结果)
5.1 Language World Model: Text Game
Figure (Text Game Training Curves) 解读:右侧训练曲线显示,RLVR 训练过程中 training reward(绿线)持续上升,test accuracy 在 unchanged cases(蓝线)上快速提升,在 changed cases(红线)上也有稳定提升。训练仅需 ~300 步即可收敛。
关键结果 (Table 1):
| Model | Unchanged Acc | Changed Acc | Overall Acc |
|---|---|---|---|
| Base (R1-Distill-Qwen-1.5B) | 11.98% | 0.08% | 7.11% |
| SFT | 38.88% | 24.21% | 32.87% |
| RLVR-World (binary) | 73.57% | 33.14% | 57.01% |
| RLVR-World (task-specific) | 83.66% | 33.80% | 63.24% |
| Base (R1-Distill-Qwen-7B) | 46.90% | 5.53% | 29.92% |
| SFT (7B) | 65.94% | 31.32% | 51.76% |
| RLVR-World (7B, binary) | 83.08% | 40.33% | 65.53% |
| GPT-4 | 73.90% | 51.60% | 64.76% |
- 1.5B 模型:binary reward 相比 SFT 提升 +34.7% unchanged, +8.9% changed
- Task-specific reward 进一步提升:+44.8% unchanged, +9.6% changed
- 7B 模型 RLVR 后整体超过 GPT-4(65.53% vs 64.76%)
5.2 Language World Model: Web Page
Figure (Web Page Training Curves) 解读:右侧训练曲线显示 F1 score 在训练集(蓝线)和测试集(红线)上的变化。F1 在约 250 步后稳定在 0.65 左右,无明显过拟合。
关键结果 (Table 2):
| Model | Precision | Recall | F1 | Web Agent Success Rate |
|---|---|---|---|---|
| Base (R1-Dist.-Qwen-1.5B) | 15.59% | 15.70% | 11.83% | n/a |
| SFT | 48.99% | 56.05% | 49.94% | 12.06% |
| RLVR-World | 72.77% | 64.55% | 65.11% | 14.29% |
| +48.5% | +15.1% | +30.3% | +18.4% |
5.3 Video World Model: RT-1
Figure 3 解读:视频 world model 在 RT-1 上的训练曲线。左图 (single-step) 和右图 (multi-step) 展示了 pre-training(MLE,x 轴单位 )和 post-training(RLVR,x 轴单位 )阶段的 LPIPS 变化。关键观察:RLVR 仅需 ~100 步(绿色虚线后)就能显著降低 LPIPS,达到甚至超过 MLE 训练数十万步的效果。右图中橙色虚线表示额外延长 pre-train 到 600K 步,LPIPS 仍停留在 ~14.5,远不如 RLVR 后的 ~13.4。这说明 RLVR 的效率优势极其显著(~1000x 更少的梯度步数)。
关键结果 (Table 3 - RT-1):
| Task | Model | Repetition Rate | MSE↓ | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|---|---|
| Single-step | Base | n/a | 0.336 | 25.3 | 81.7 | 13.0 |
| Single-step | RLVR-World | n/a | 0.287 | 25.9 | 83.1 | 12.2 |
| +14.3% | +2.6% | +1.6% | +6.0% | |||
| Multi-step | Base | 48.6% | 0.659 | 23.1 | 80.9 | 14.8 |
| Multi-step | Base (w/ rep. rejection) | 0.0% | 0.593 | 23.3 | 81.0 | 14.4 |
| Multi-step | RLVR-World | 9.9% | 0.486 | 24.1 | 82.4 | 13.4 |
| +79.6% | +26.1% | +4.5% | +1.9% | +9.2% |
5.4 与 SOTA 对比 (Table 4)
| Model | PushT LPIPS↓ | PushT SSIM↑ | Rope LPIPS↓ | Granular LPIPS↓ | Granular SSIM↑ |
|---|---|---|---|---|---|
| DINO-WM (Reported) | 0.7 | 98.5 | 0.9 | 3.5 | 94.0 |
| AVDC (Diffusion) | 4.6 | 95.9 | 6.0 | 10.6 | 90.9 |
| Base (Ours) | 0.83 | 98.28 | 3.03 | 3.14 | 94.79 |
| RLVR-World | 0.70 | 98.46 | 2.08 | 2.42 | 95.42 |
RLVR 后模型在 PushT 上达到 DINO-WM 水平,在 Granular(最难数据集)上显著超越 DINO-WM。
5.5 Model Analysis
Figure 4 解读:三个分析实验。(a) Test-time scaling:随着采样数 增大,取 best-of- 的 LPIPS,RLVR-World 在 时就优于 base model 的 best-of-5。但当 增大到 100 时,base model 逐渐追上甚至超过 RLVR,说明 RLVR 当前方法仍有提升空间。(b) RL training scaling:增大 GRPO 的 group size (4→8→16→32)可以提升收敛速度和最终性能。(c) Metric-oriented optimization:用不同指标(MAE/MSE/PSNR/SSIM/LPIPS)作为 reward 训练,每个模型在对应指标上取得最优,验证了 RLVR 的 metric-specific 优化能力。
5.6 Repetition 缓解

Figure 6 (上半部分) 解读:多步视频预测的定性对比。上方两行展示 ground-truth 和 RLVR-World 的预测,画面清晰且持续变化。底部行展示 base model 的预测,从 开始出现明显的 repetition(重复帧),机械臂完全停滞。RLVR 将 repetition rate 从 48.6% 降至 9.9%。
5.7 Real2Sim Policy Evaluation
Figure 5 解读:Real2Sim 策略评估结果。横轴为真实成功率(real success rate),纵轴为模拟评估成功率(evaluated success rate)。理想情况下所有点应在对角线上。相比手工 SIMPLER simulator(红色/蓝色),video world model(绿色/紫色)与对角线的偏差更小,说明 neural world model 是更好的 real-world simulator 近似。RLVR-World(紫色)进一步优于 base model(绿色),提供更准确的策略评估。
5.8 Training Reward Curves
Figure 8 解读:RLVR-World 单步预测的训练过程曲线。左图为 reward(负 L1+LPIPS),中图为 L1 loss,右图为 LPIPS loss。三者均在 ~300 步内持续改善,且 smooth 曲线(橙色)显示稳定的单调趋势。这验证了 GRPO 在视觉 world model 上的有效性和训练稳定性。
Code-to-Paper Mapping
| Paper 组件 | 代码位置 | 说明 |
|---|---|---|
| Language WM SFT | lang_wm/verl/examples/sft/ | LoRA supervised fine-tuning |
| Language WM RLVR | lang_wm/verl/examples/grpo_trainer/ | GRPO training with binary/task-specific reward |
| LoRA Merge | lang_wm/verl/merge_lora.py | 合并 LoRA 权重 |
| Model Merge (RLVR) | lang_wm/verl/scripts/model_merger.py | 合并 RLVR checkpoint |
| Web Agent MPC | lang_wm/webagent/ | Model Predictive Control for WebArena |
| Text Game Data | lang_wm/data_process/text_game/ | 数据集生成脚本 |
| Video Tokenizer | vid_wm/ivideogpt/ | VQGAN + FSQ tokenizer (per-frame & compressive) |
| Video WM Training | vid_wm/ivideogpt/ | Transformer pre-training |
| Video WM RLVR | vid_wm/verl/ | GRPO fine-tuning for video prediction |
| Data Converter | vid_wm/oxe_data_converter.py | RT-1 Open X-Embodiment 数据预处理 |
| HuggingFace Models | HuggingFace Hub | SFT/RLVR checkpoints, tokenizers, datasets |
关键依赖:VERL (RL training framework), iVideoGPT (video tokenizer/transformer), WMA-Agents (web agent), SimplerEnv (policy evaluation)
总结与局限性
核心贡献
- 首次将 RLVR 范式应用于 world model training,在语言和视频两种模态上均验证有效
- 统一的序列建模框架:将不同模态的 world model 统一为 autoregressive prediction + RLVR post-training
- 训练效率极高:RLVR 仅需几百步梯度更新(vs MLE 需要数十万步),且性能显著提升
局限性
- Performance barrier:训练通常在几百步后收敛,如何突破需要更深入分析
- OOD 泛化:RLVR 能否提升 world model 对训练域外动作的泛化能力,尤其是反事实推理
- 通用 video world model:当前在单一数据集上训练,未来需在通用视频模型上验证
- 更多模型架构:GRPO 理论上 model-agnostic,但 diffusion model 的 GRPO 算法尚在发展中
- Task-aligned rewards:视觉指标(MSE/LPIPS)仍不完全捕捉物理规则和时序一致性