RLVR-World: Training World Models with Reinforcement Learning

Authors: Jialong Wu, Shaofeng Yin, Ningya Feng, Mingsheng Long
Affiliations: Tsinghua University (BNRist, School of Software / Zhili College)
GitHub: thuml/RLVR-World
Venue: NeurIPS 2025

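1. Motivation

Core problem: the MLE training objective is misaligned with the world model's task objective

The core task of a world model is to predict state transitions $p(s_{t+1} \mid s_{t}, a_{t})$, and it is usually evaluated by prediction accuracy (language) or perceptual quality (vision, e.g. LPIPS, SSIM). Existing world models, however, are generally trained with Maximum Likelihood Estimation (MLE): next-token prediction for language models, MSE/VQ losses for video models. This causes three key problems:

Surrogate objective vs. task metric mismatch: MLE optimizes log-likelihood rather than the downstream evaluation metric (e.g. accuracy, F1, LPIPS); the two can be unrelated or even diverge.
Teacher forcing causes multi-step error accumulation: training conditions on ground-truth prefixes, so the model never sees the errors that accumulate during multi-step prediction.
The likelihood objective induces repetition: the repetition/hallucination problems widely observed in language models also appear in video world models (a repetition rate of about 48.6%).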

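Source of inspiration

RLVR (Reinforcement Learning with Verifiable Rewards) has been highly successful for LLM reasoning (e.g. math and code reasoning in DeepSeek-R1). Its core idea is to replace a learned reward model with a rule-based verifiable reward. World modeling is a natural fit for RLVR: prediction accuracy is itself a verifiable reward.

Figure 1 notes: The left side shows the conventional MLE training paradigm (Pre-train + SFT), which is scalable but relies on surrogate optimization, so the training objective is misaligned with the task objective. The right side shows the post-training paradigm proposed by RLVR-World (Pre-train + SFT + RL): predictions are sampled, decoded, and compared with the ground truth to compute a verifiable reward, which gives task-aligned optimization. This paradigm is compute-heavy but optimizes the task metric directly.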

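2. Idea

Core idea

Cast world modeling as a unified autoregressive sequence prediction problem, then post-train with RLVR (specifically the GRPO algorithm), directly optimizing task-specific prediction metrics as verifiable rewards.

Three key design choices

Unified sequence modeling framework: world models for different modalities (language/video) are cast as question-response autoregressive generation, with states and actions encoded as token sequences via modality-specific tokenization.
Prediction metrics as verifiable rewards: accuracy/F1-score for language world models, and the negative L1 + LPIPS perceptual loss for video world models.
GRPO: group-relative advantage estimation with no value function; a group of responses is sampled and each response's advantage is computed relative to the rest of the group.

Scope of application

Language world model: text game state prediction, web page state prediction, and model predictive control for web agents.
Video world model: robot manipulation trajectory prediction (single-step/multi-step) and Real2Sim policy evaluation.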

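3. Method

Figure 2 notes: The full RLVR-World pipeline. The top half (a) shows the language world model: the game state JSON and the action text are tokenized and fed to the Language World Model, a group of predictions is sampled, the predicted states are extracted after detokenization, and Accuracy/F1 against the ground truth serves as the verifiable reward that GRPO uses to update the model. The bottom half (b) shows the video world model: visual observations are encoded by a Visual Encoder and actions are discretized by quantization before being fed to the Video World Model; the sampled token sequence is decoded into predicted frames by a Visual Decoder, and MSE/LPIPS/SSIM against the ground-truth frames serves as the verifiable reward.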

3.1 Problem Formulation

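The environment is modeled as an MDP $M = (S, A, p, r, \gamma)$:

State space $S$: text (JSON objects) or visual frames
Action space $A$: text actions or robot control vectors
World model objective: learn the transition distribution $p(s_{t+1} \mid s_{t-k+1:t}, a_{t-k+1:t})$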

3.2 World Models as Sequence Modeling

World models are unified as autoregressive sequence prediction:

Input token sequence $q(s, a)$ (the "question"): the tokenized current state and action
Output token sequence $o(s')$ (the "response"): the tokenized predicted next state

Tokenization schemes:

Language: standard BPE text tokenization
Vision: VQGAN-based visual tokenizer (per-frame or compressive)
Low-dimensional continuous values (e.g. robot actions): uniformly discretized into 256 bins
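The uniform discretization of continuous values can be illustrated with a minimal sketch. The per-dimension bounds `low`/`high`, the bin-center decoding, and the helper names `discretize`/`undiscretize` are assumptions made for illustration; the notes only state that 256 uniform bins are used.

```python
def discretize(value: float, low: float, high: float, bins: int = 256) -> int:
    """Map a continuous value in [low, high] to a bin index in [0, bins - 1]."""
    clipped = min(max(value, low), high)      # out-of-range values are clamped
    frac = (clipped - low) / (high - low)     # position in [0, 1]
    return min(int(frac * bins), bins - 1)    # the upper bound falls in the last bin


def undiscretize(index: int, low: float, high: float, bins: int = 256) -> float:
    """Decode a bin index to the center of its bin (an assumed convention)."""
    return low + (index + 0.5) * (high - low) / bins


# Hypothetical 4-dimensional action with every dimension bounded in [-1, 1].
action = [0.0, -1.0, 1.0, 0.25]
tokens = [discretize(a, -1.0, 1.0) for a in action]
print(tokens)
```

Each resulting index can then be treated as one token in the input sequence alongside the visual or text tokens.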

MLE pre-training objective:

$$J_{MLE}(\theta) = \log p_{\theta}(o(s') \mid q(s, a)) = \sum_{l=1}^{|o(s')|} \log p_{\theta}(o_{l}(s') \mid q(s, a), o_{<l}(s'))$$

3.3 Prediction Metrics as Verifiable Rewards

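The core of RLVR post-training: given the input $q(s, a)$, the model samples a group of predictions $\{o_{i}\}_{i=1}^{G}$, the predicted states $\hat{s}'_{i}$ are extracted by modality-specific decoding, and the reward is computed as:

$$R_{i} = \mathrm{sign}(D) \cdot D(\hat{s}'_{i}, s')$$

where $\mathrm{sign}(D) = -1$ when lower values of the metric $D$ are better (e.g. MSE, LPIPS) and $\mathrm{sign}(D) = 1$ when higher values are better (e.g. accuracy).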
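As a small illustration of the sign convention, the sketch below turns an arbitrary metric value into a reward. The function name `verifiable_reward` and the `higher_is_better` flag are hypothetical names introduced here, not identifiers from the paper or its code.

```python
def verifiable_reward(metric_value: float, higher_is_better: bool) -> float:
    """Return sign(D) * D so that a larger reward always means a better prediction."""
    sign = 1.0 if higher_is_better else -1.0
    return sign * metric_value


# Accuracy is higher-is-better, so it is used as-is.
accuracy_reward = verifiable_reward(0.8, higher_is_better=True)
# LPIPS is lower-is-better, so it is negated.
lpips_reward = verifiable_reward(0.13, higher_is_better=False)
print(accuracy_reward, lpips_reward)
```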

3.4 GRPO Optimization

GRPO advantage estimation (group-relative, no value function):

$$\hat{A}_{i,t} = \frac{R_{i} - \mathrm{mean}(\{R_{i}\}_{i=1}^{G})}{\mathrm{std}(\{R_{i}\}_{i=1}^{G})}$$
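The group-relative normalization can be sketched in a few lines. Using the population standard deviation and adding a small `eps` to the denominator are assumptions made here so the sketch stays well-defined when all rewards in a group are equal; the formula in the notes does not specify either detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its own group.

    Every token of response i shares the same advantage, so one value per
    response is enough.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of G = 5 binary rewards, as in the text game setting.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0])
print(advantages)
```

Responses that beat the group average receive a positive advantage and the rest a negative one, which is why no learned value function is needed as a baseline.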

GRPO objective (clipped objective + KL divergence penalty):

$$J_{GRPO}(\theta) = \mathbb{E}_{q \sim D,\ \{o_{i}\}_{i=1}^{G} \sim p_{\theta_{old}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_{i}|} \sum_{t=1}^{|o_{i}|} \left( \min\left( \frac{p_{\theta}^{i,t}}{p_{\theta_{old}}^{i,t}} \hat{A}_{i,t},\ \mathrm{clip}\left( \frac{p_{\theta}^{i,t}}{p_{\theta_{old}}^{i,t}}, 1 - \varepsilon, 1 + \varepsilon \right) \hat{A}_{i,t} \right) - \beta D_{KL}[p_{\theta} \,\|\, p_{ref}] \right) \right]$$
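To make the structure of the objective concrete, here is a scalar sketch for a single sampled group. It assumes that $p_{\theta}^{i,t}$ is the probability the policy assigns to token $t$ of response $i$ (passed in as log-probabilities), that a per-token KL estimate is supplied by the caller, and that $\varepsilon = 0.2$; none of these are stated in the notes. The default `beta=1e-3` mirrors the KL coefficient listed in the experimental setup. The function names are hypothetical.

```python
import math


def grpo_token_objective(logp_new, logp_old, advantage, kl, eps=0.2, beta=1e-3):
    """Clipped surrogate minus the KL penalty for one token."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    return surrogate - beta * kl


def grpo_objective(group, eps=0.2, beta=1e-3):
    """Average over tokens within each response, then over the G responses.

    `group` is a list of responses; each response is a list of
    (logp_new, logp_old, advantage, kl) tuples, one per token.
    """
    per_response = []
    for tokens in group:
        total = sum(grpo_token_objective(*tok, eps=eps, beta=beta) for tok in tokens)
        per_response.append(total / len(tokens))
    return sum(per_response) / len(group)


# With a positive advantage, a ratio of e is clipped to 1 + eps = 1.2.
print(grpo_token_objective(1.0, 0.0, advantage=1.0, kl=0.0))
```

In practice this quantity is maximized by gradient ascent on the policy parameters; the sketch only evaluates it.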

3.5 Language World Model Implementation

Text Game State Prediction

# Pseudocode: text game RLVR pipeline
# Base model: DeepSeek-R1-Distill-Qwen-1.5B/7B
 
# Step 1: SFT on CoT data generated by DeepSeek-R1
sft_data = reject_sample(deepseek_r1, changed_cases, n=4237)
model = lora_sft(base_model, sft_data, rank=32, epochs=10)
 
# Step 2: RLVR with a binary or task-specific reward
for step in range(num_steps):
    q = tokenize(current_state, action)                # question
    outputs = sample_group(model, q, G=5)              # sample G responses
    rewards = []
    for o_i in outputs:
        predicted_state = extract_json(detokenize(o_i))
        # Binary reward: 1 only on an exact match
        R_binary = 1 if predicted_state == ground_truth else 0
        # Task-specific alternative:
        # R = 0.1 * acc_all + 1.0 * acc_changed + 0.2 * I(correct)
        rewards.append(R_binary)
    advantages = group_relative_normalize(rewards)
    update_grpo(model, outputs, advantages)

Task-specific reward formula:

$$R = \alpha_{1} \cdot acc_{all} + \alpha_{2} \cdot acc_{changed} + \alpha_{3} \cdot \mathbb{I}(correct)$$

where $\alpha_{1} = 0.1$, $\alpha_{2} = 1$, $\alpha_{3} = 0.2$.
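The weighted sum above can be sketched for a toy state. The notes do not define how $acc_{all}$ and $acc_{changed}$ are computed, so the sketch assumes a flat dictionary state, takes $acc_{all}$ as the fraction of ground-truth keys predicted correctly, $acc_{changed}$ as the same fraction restricted to keys whose value differs from the previous state, and $\mathbb{I}(correct)$ as an exact match. These definitions, and the name `task_specific_reward`, are illustrative assumptions only.

```python
def task_specific_reward(pred, truth, prev, a1=0.1, a2=1.0, a3=0.2):
    """R = a1 * acc_all + a2 * acc_changed + a3 * I(correct) under assumed definitions."""
    keys = list(truth)
    changed = [k for k in keys if truth[k] != prev.get(k)]

    def accuracy(subset):
        # Treat an empty subset as trivially correct (assumption).
        if not subset:
            return 1.0
        return sum(pred.get(k) == truth[k] for k in subset) / len(subset)

    correct = 1.0 if pred == truth else 0.0
    return a1 * accuracy(keys) + a2 * accuracy(changed) + a3 * correct


# Toy transition: the action "open door" changes exactly one property.
prev = {"door": "closed", "lamp": "off", "key": "table"}
truth = {"door": "open", "lamp": "off", "key": "table"}

r_exact = task_specific_reward(dict(truth), truth, prev)  # perfect prediction
r_stale = task_specific_reward(dict(prev), truth, prev)   # predicts "nothing changed"
print(r_exact, r_stale)
```

Under these assumptions, a model that simply copies the previous state earns only the small $\alpha_{1}$ term, which shows how the weighting favors getting the changed properties right.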

Web Page State Prediction

# Pseudocode: web page state RLVR pipeline
# Reward: F1 score between predicted and ground-truth item changes
 
def compute_f1_reward(predicted_changes, gt_changes):
    """F1 score as verifiable reward for web page prediction."""
    # Compare as sets so duplicated predictions cannot skew precision.
    predicted, truth = set(predicted_changes), set(gt_changes)
    if not predicted and not truth:
        return 1.0  # nothing changed and nothing predicted: perfect match
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

3.6 Video World Model Implementation

Visual Tokenization

Per-frame tokenizer (for single-step prediction): VQGAN + FSQ, $16 \times 20 = 320$ tokens per frame, codebook size $K = 7 \times 5 \times 5 \times 5 \times 5 = 4375$
Compressive tokenizer (for multi-step prediction): conditional VQGAN with cross-attention, only $8 \times 10 = 80$ tokens per frame (4x compression), with $32 \times 40 = 1280$ tokens for the context frame

Sequence Construction

Single-step prediction $p(s_{t+1} \mid s_{t-3:t}, a_{t-3:t})$:

$$x = \mathrm{concat}(z_{t-3}, b_{t-3}, z_{t-2}, b_{t-2}, \dots, z_{t}, b_{t}, [bos], \underline{z_{t+1}}, [eos])$$

Sequence length $= 4 \times (320 + 13) + 1 + 320 + 1 = 1654$; the underlined part (321 tokens) contributes to the loss.

Multi-step prediction $p(s_{t+1:t+7} \mid s_{t}, a_{t:t+6}, s_{c})$:

$$x = \mathrm{concat}(z_{c}, z_{t}, b_{t}, \underline{z_{t+1}}, b_{t+1}, \underline{z_{t+2}}, b_{t+2}, \dots, \underline{z_{t+7}}, b_{t+7})$$

Sequence length $= 1280 + 8 \times (80 + 13) = 2024$, codebook size = 9006.
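The token counts quoted in this subsection can be re-derived with a few lines of arithmetic. Reading the 13 in each term as one token per action dimension is an interpretation (the setup table lists 13-dim actions for RT-1); the constant names are hypothetical.

```python
# Per-frame tokenizer (single-step prediction)
PER_FRAME_TOKENS = 16 * 20              # 320 tokens per frame
PER_FRAME_CODEBOOK = 7 * 5 * 5 * 5 * 5  # FSQ levels multiplied together

# Compressive tokenizer (multi-step prediction)
COMPRESSED_TOKENS = 8 * 10              # 80 tokens per future frame
CONTEXT_TOKENS = 32 * 40                # 1280 tokens for the context frame

ACTION_TOKENS = 13                      # assumed: one token per action dimension

# 4 context (frame, action) pairs + [bos] + predicted frame + [eos]
single_step_len = 4 * (PER_FRAME_TOKENS + ACTION_TOKENS) + 1 + PER_FRAME_TOKENS + 1

# context frame + 8 (frame, action) pairs for steps t .. t+7
multi_step_len = CONTEXT_TOKENS + 8 * (COMPRESSED_TOKENS + ACTION_TOKENS)

print(PER_FRAME_CODEBOOK, single_step_len, multi_step_len)
```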

# Pseudocode: video world model RLVR pipeline
# Autoregressive Transformer: 138M params (LLaMA architecture, GPT-2 small scale)
 
# Step 1: Pre-train with MLE (next-token prediction)
# Single-step: 9.9 × 10^5 steps; Multi-step: 4.5 × 10^5 steps
 
# Step 2: RLVR post-training
for step in range(num_rlvr_steps):  # typically ~100-300 steps
    # Sample a group of G=16 token sequences
    token_seqs = sample_group(model, context_tokens, G=16)
 
    rewards = []
    for seq in token_seqs:
        # Decode tokens back to frames via the visual decoder
        predicted_frames = visual_decoder(seq)
 
        # Single-step reward:
        #   R = -L1(predicted, gt) - LPIPS(predicted, gt)
        # Multi-step reward:
        #   R = -sum_{tau=t+1}^{t+7} [L1(s_hat_tau, s_tau) + LPIPS(s_hat_tau, s_tau)]
        # compute_reward is a placeholder for whichever of the two applies
        rewards.append(compute_reward(predicted_frames, gt_frames))
 
    advantages = group_relative_normalize(rewards)
    update_grpo(model, token_seqs, advantages)
    # NOTE: the visual tokenizer parameters are frozen and are not updated

Reward functions for the video world model:

Single-step: $R(\hat{s}_{t+1}, s_{t+1}) = -L_{1}(\hat{s}_{t+1}, s_{t+1}) - \mathrm{LPIPS}(\hat{s}_{t+1}, s_{t+1})$

Multi-step: $R(\hat{s}_{t+1:t+7}, s_{t+1:t+7}) = -\sum_{\tau=t+1}^{t+7} [L_{1}(\hat{s}_{\tau}, s_{\tau}) + \mathrm{LPIPS}(\hat{s}_{\tau}, s_{\tau})]$
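A minimal sketch of these two rewards follows, with frames represented as flat lists of pixel values. LPIPS needs a pretrained network, so it is passed in as a callable `perceptual`; that parameter, the mean-over-pixels form of L1, and the function names are assumptions made for illustration rather than the paper's implementation.

```python
from typing import Callable, Sequence

Frame = Sequence[float]


def l1_distance(pred: Frame, target: Frame) -> float:
    """Mean absolute pixel difference (averaging over pixels is an assumption)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def video_reward(
    pred_frames: Sequence[Frame],
    gt_frames: Sequence[Frame],
    perceptual: Callable[[Frame, Frame], float],
) -> float:
    """Negative sum of L1 + perceptual distance over all predicted frames.

    With a single frame this is the single-step reward; with seven frames it
    is the multi-step reward.
    """
    total = 0.0
    for pred, gt in zip(pred_frames, gt_frames):
        total += l1_distance(pred, gt) + perceptual(pred, gt)
    return -total


# Stand-in perceptual metric that returns a constant, purely for the demo.
dummy_lpips = lambda pred, gt: 0.1
reward = video_reward([[0.0, 1.0]], [[0.5, 1.0]], dummy_lpips)
print(reward)
```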

4. Experimental Setup

4.1 Language World Model

Setting	Text Game	Web Page
Dataset	ByteSized32-State-Prediction (76,369 transitions, 31 games, 2954 test)	WebArena (WMA) (~7K samples, 99% train / 1% test)
Base Model	DeepSeek-R1-Distill-Qwen-1.5B / 7B	DeepSeek-R1-Distill-Qwen-1.5B
SFT	LoRA rank=32, $\alpha$=16, 10 epochs, lr=$10^{-5}$	LoRA rank=32, $\alpha$=16, 4 epochs, lr=$10^{-5}$
RLVR	batch=128, group=5, lr=$10^{-6}$, KL coeff=$10^{-3}$	batch=64, group=5, lr=$10^{-6}$, KL coeff=$10^{-3}$
Reward	Binary / Task-specific ($\alpha_{1}$=0.1, $\alpha_{2}$=1, $\alpha_{3}$=0.2)	F1 score
Evaluation	Accuracy (unchanged / changed / overall)	Precision, Recall, F1

4.2 Video World Model

Setting	Value
Dataset	RT-1 (87,212 trajectories, 256x320, 13-dim actions), PushT, Rope, Granular
Visual Tokenizer	VQGAN + FSQ (per-frame: 320 tokens; compressive: 80 tokens)
Autoregressive Transformer	138M params, 12 layers, hidden=768, 12 heads (LLaMA arch)
Pre-training	Single-step: $9.9 \times 10^{5}$ steps; Multi-step: $4.5 \times 10^{5}$ steps; lr=$5 \times 10^{-5}$
RLVR	batch=128, group=16, lr=$5 \times 10^{-5}$, KL coeff=$10^{-3}$, ~100-300 steps
Reward	$-L_{1} - \mathrm{LPIPS}$
Evaluation	MSE, PSNR, SSIM, LPIPS (scaled by 100)

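4.3 Downstream Applications

Model Predictive Control (Web Agent): the policy model samples 20 candidate actions → the 3 most frequent are kept → the world model predicts the outcome of each → a summarization model extracts the top-10 changes → a value model scores the result (1-5); scoring is repeated 20 times and the action with the highest average score is chosen. The policy, summarization, and value models all use DeepSeek-V3.
Real2Sim Policy Evaluation: the video world model replaces the hand-crafted simulator (SIMPLER) to evaluate RT-1/RT-1-X policies on 6 tasks such as open/close drawer.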
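The model predictive control loop for the web agent can be sketched with the four models abstracted as callables. The function `select_action` and its parameters are hypothetical, and reading "repeated 20 times" as 20 value-model calls per shortlisted action is an interpretation of the notes, not a confirmed detail of the released code.

```python
import itertools
from collections import Counter
from statistics import mean


def select_action(sample_action, predict_next, summarize, score,
                  n_candidates=20, top_k=3, n_scores=20):
    """Pick the candidate action whose predicted outcome scores highest on average."""
    candidates = [sample_action() for _ in range(n_candidates)]
    # Keep the top_k most frequently proposed actions.
    shortlist = [a for a, _ in Counter(candidates).most_common(top_k)]

    best_action, best_value = None, float("-inf")
    for action in shortlist:
        changes = summarize(predict_next(action))  # world model, then summarizer
        value = mean(score(action, changes) for _ in range(n_scores))
        if value > best_value:
            best_action, best_value = action, value
    return best_action


# Deterministic stand-ins for the policy, world, summarization, and value models.
fake_policy = itertools.cycle(["click A", "click A", "click B", "type X"])
chosen = select_action(
    sample_action=lambda: next(fake_policy),
    predict_next=lambda a: f"page after {a}",
    summarize=lambda page: [page],
    score=lambda a, changes: 4.0 if a == "click B" else 2.0,
)
print(chosen)
```

In this toy run the most frequently proposed action is not the one selected: the world model's predicted outcome, as judged by the value function, decides the final choice.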

5. Experimental Results

5.1 Language World Model: Text Game

Figure (Text Game Training Curves) notes: The training curves on the right show that the training reward (green) rises steadily during RLVR, while test accuracy improves quickly on unchanged cases (blue) and steadily on changed cases (red). Training converges in only ~300 steps.

Key results (Table 1):

Model	Unchanged Acc	Changed Acc	Overall Acc
Base (R1-Distill-Qwen-1.5B)	11.98%	0.08%	7.11%
SFT	38.88%	24.21%	32.87%
RLVR-World (binary)	73.57%	33.14%	57.01%
RLVR-World (task-specific)	83.66%	33.80%	63.24%
Base (R1-Distill-Qwen-7B)	46.90%	5.53%	29.92%
SFT (7B)	65.94%	31.32%	51.76%
RLVR-World (7B, binary)	83.08%	40.33%	65.53%
GPT-4	73.90%	51.60%	64.76%

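1.5B model: the binary reward improves on SFT by +34.7 percentage points on unchanged cases and +8.9 points on changed cases
The task-specific reward improves on SFT further: +44.8 points on unchanged cases, +9.6 points on changed cases
After RLVR, the 7B model surpasses GPT-4 overall (65.53% vs 64.76%)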

5.2 Language World Model: Web Page

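Figure (Web Page Training Curves) notes: The training curves on the right show F1 on the training set (blue) and the test set (red). F1 stabilizes at around 0.65 after about 250 steps, with no obvious overfitting.

Key results (Table 2):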

Model	Precision	Recall	F1	Web Agent Success Rate
Base (R1-Dist.-Qwen-1.5B)	15.59%	15.70%	11.83%	n/a
SFT	48.99%	56.05%	49.94%	12.06%
RLVR-World	72.77%	64.55%	65.11%	14.29%
$\Delta$ (relative to SFT)	+48.5%	+15.1%	+30.3%	+18.4%
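The gains in Table 2 appear to be relative improvements over SFT, whereas the text game gains quoted for Table 1 are absolute differences in percentage points. The short check below, using numbers copied from the two tables, illustrates the distinction; the helper names are hypothetical. The other columns of Table 2 can be checked the same way and agree to within about 0.1, presumably because the tabulated values are rounded.

```python
def relative_gain(before: float, after: float) -> float:
    """Improvement as a percentage of the starting value."""
    return (after - before) / before * 100.0


def absolute_gain(before: float, after: float) -> float:
    """Improvement in percentage points."""
    return after - before


# Table 2, precision: SFT 48.99% -> RLVR-World 72.77%
web_precision_gain = round(relative_gain(48.99, 72.77), 1)

# Table 1, unchanged accuracy (1.5B): SFT 38.88% -> RLVR-World (binary) 73.57%
text_unchanged_gain = round(absolute_gain(38.88, 73.57), 1)

print(web_precision_gain, text_unchanged_gain)
```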

5.3 Video World Model: RT-1

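Figure 3 notes: Training curves of the video world model on RT-1. The left (single-step) and right (multi-step) panels show LPIPS during pre-training (MLE, x-axis in units of $\times 10^{4}$ steps) and post-training (RLVR, x-axis in units of $\times 1$ step). Key observation: RLVR needs only ~100 steps (after the green dashed line) to reduce LPIPS substantially, matching or exceeding what MLE achieves over hundreds of thousands of steps. In the right panel, the orange dashed line marks pre-training extended to 600K steps, where LPIPS still sits at ~14.5, well short of the ~13.4 reached after RLVR. This indicates a very large efficiency advantage for RLVR (roughly 1000x fewer gradient steps).

Key results (Table 3 - RT-1):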

Task	Model	Repetition Rate	MSE↓	PSNR↑	SSIM↑	LPIPS↓
Single-step	Base	n/a	0.336	25.3	81.7	13.0
Single-step	RLVR-World	n/a	0.287	25.9	83.1	12.2
	$Δ$		+14.3%	+2.6%	+1.6%	+6.0%
Multi-step	Base	48.6%	0.659	23.1	80.9	14.8
Multi-step	Base (w/ rep. rejection)	0.0%	0.593	23.3	81.0	14.4
Multi-step	RLVR-World	9.9%	0.486	24.1	82.4	13.4
	$Δ$	+79.6%	+26.1%	+4.5%	+1.9%	+9.2%

5.4 Comparison with SOTA (Table 4)

Model	PushT LPIPS↓	PushT SSIM↑	Rope LPIPS↓	Granular LPIPS↓	Granular SSIM↑
DINO-WM (Reported)	0.7	98.5	0.9	3.5	94.0
AVDC (Diffusion)	4.6	95.9	6.0	10.6	90.9
Base (Ours)	0.83	98.28	3.03	3.14	94.79
RLVR-World	0.70	98.46	2.08	2.42	95.42

After RLVR, the model matches DINO-WM on PushT and clearly surpasses it on Granular (the hardest dataset). On Rope it still trails DINO-WM (2.08 vs 0.9 LPIPS).

5.5 Model Analysis

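Figure 4 notes: Three analysis experiments. (a) Test-time scaling: LPIPS of the best of $N$ samples as $N$ grows; RLVR-World at $N = 1$ already beats the base model's best-of-5. As $N$ grows to 100, however, the base model gradually catches up with and even overtakes RLVR-World, suggesting that the current RLVR method still has room to improve. (b) RL training scaling: increasing the GRPO group size $G$ (4→8→16→32) improves both convergence speed and final performance. (c) Metric-oriented optimization: models are trained with different metrics (MAE/MSE/PSNR/SSIM/LPIPS) as the reward, and each does best on its own metric, confirming that RLVR can optimize a specific metric.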
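The best-of-$N$ protocol in panel (a) can be sketched as follows. The sampler and metric are passed in as callables; `best_of_n` and the toy LPIPS values are hypothetical and only illustrate the selection rule, not the paper's evaluation code.

```python
def best_of_n(sample_prediction, metric, n):
    """Draw n predictions and return the best (lowest) metric value among them."""
    return min(metric(sample_prediction()) for _ in range(n))


# Toy stand-in: each "prediction" is just its LPIPS value (lower is better).
fake_lpips = iter([14.0, 13.2, 15.1, 12.9, 13.7])
best = best_of_n(lambda: next(fake_lpips), metric=lambda x: x, n=5)
print(best)
```

Because the minimum over more samples can only stay the same or decrease, a model with more diverse samples can close the gap at large $N$ even if its single-sample quality is worse, which is consistent with the trend described for the base model.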

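5.6 Repetition Mitigation

Figure 6 (top half) notes: Qualitative comparison of multi-step video prediction. The top two rows show the ground truth and the RLVR-World prediction; the frames are sharp and keep changing. The bottom row shows the base model's prediction, where obvious repetition (repeated frames) appears from $t = 3$ and the robot arm stalls completely. RLVR reduces the repetition rate from 48.6% to 9.9%.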

5.7 Real2Sim Policy Evaluation

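Figure 5 notes: Real2Sim policy evaluation results. The x-axis is the real success rate and the y-axis is the evaluated (simulated) success rate; ideally every point lies on the diagonal. Compared with the hand-crafted SIMPLER simulator (red/blue), the video world models (green/purple) deviate less from the diagonal, suggesting that a neural world model is a better approximation of the real world as a simulator. RLVR-World (purple) improves further on the base model (green) and gives more accurate policy evaluation.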

5.8 Training Reward Curves

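Figure 8 notes: Training curves for RLVR-World single-step prediction. The left panel shows the reward (negative L1 + LPIPS), the middle panel the L1 loss, and the right panel the LPIPS loss. All three keep improving over ~300 steps, and the smoothed curves (orange) show a stable monotonic trend. This supports the effectiveness and training stability of GRPO on visual world models.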

Code-to-Paper Mapping

Paper component	Code location	Description
Language WM SFT	`lang_wm/verl/examples/sft/`	LoRA supervised fine-tuning
Language WM RLVR	`lang_wm/verl/examples/grpo_trainer/`	GRPO training with binary/task-specific reward
LoRA Merge	`lang_wm/verl/merge_lora.py`	Merge LoRA weights
Model Merge (RLVR)	`lang_wm/verl/scripts/model_merger.py`	Merge RLVR checkpoints
Web Agent MPC	`lang_wm/webagent/`	Model Predictive Control for WebArena
Text Game Data	`lang_wm/data_process/text_game/`	Dataset generation scripts
Video Tokenizer	`vid_wm/ivideogpt/`	VQGAN + FSQ tokenizer (per-frame & compressive)
Video WM Training	`vid_wm/ivideogpt/`	Transformer pre-training
Video WM RLVR	`vid_wm/verl/`	GRPO fine-tuning for video prediction
Data Converter	`vid_wm/oxe_data_converter.py`	RT-1 Open X-Embodiment data preprocessing
HuggingFace Models	HuggingFace Hub	SFT/RLVR checkpoints, tokenizers, datasets

Key dependencies: VERL (RL training framework), iVideoGPT (video tokenizer/transformer), WMA-Agents (web agent), SimplerEnv (policy evaluation)

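Summary and Limitations

Core contributions

First application of the RLVR paradigm to world model training, validated on both language and video modalities
Unified sequence modeling framework: world models across modalities are cast as autoregressive prediction + RLVR post-training
Very high training efficiency: RLVR needs only a few hundred gradient updates (vs. hundreds of thousands of steps for MLE) and yields clear performance gains

Limitations

Performance barrier: training typically converges after a few hundred steps; breaking through this plateau requires deeper analysis
OOD generalization: whether RLVR improves a world model's generalization to out-of-distribution actions, especially for counterfactual reasoning, remains open
General-purpose video world models: training is currently done on a single dataset; validation on general video models is left for future work
More model architectures: GRPO is model-agnostic in principle, but GRPO algorithms for diffusion models are still being developed
Task-aligned rewards: visual metrics (MSE/LPIPS) still do not fully capture physical rules and temporal consistency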
