RynnVLA-002: A Unified Vision-Language-Action and World Model

Authors: Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Deli Zhao, Hao Chen
Affiliations: DAMO Academy (Alibaba Group), Hupan Lab, Zhejiang University
arXiv: 2511.17502
GitHub: alibaba-damo-academy/RynnVLA-002

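1. Motivation

Three fundamental shortcomings of VLA models: The standard VLA architecture faces three core problems. (1) It cannot fully understand actions: because actions appear only on the output side, the model never forms an explicit internal representation of action dynamics. (2) It lacks imagination: it cannot predict how the world state changes after a given action, which hinders look-ahead and counterfactual reasoning. (3) It has no explicit understanding of physics: it cannot internalize physical dynamics such as object interaction, contact, and stability.
Missing capability of World Models: A World Model compensates for these gaps by learning to predict future observations, which gives it action-aware internal states, imagination, and a representation of physical dynamics. However, a World Model cannot directly output actions, which limits its use in scenarios that require explicit action planning.
Error accumulation in discrete action chunks: In an autoregressive model, when actions are discretized and placed in the same vocabulary as image/text tokens to generate action chunks, generalization in the action domain is limited, because the pretrained MLLM has seen mostly images and text rather than actions. Later actions are conditioned on earlier generated actions, so errors propagate and accumulate.
Poor real-robot generalization of discrete models: The discrete autoregressive architecture performs well in simulation, but in real-robot experiments it generalizes poorly and infers slowly. A large autoregressive architecture overfits easily on limited real-world data, and the sequential nature of discrete action generation leads to discontinuous motion and high inference latency.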

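2. Idea

The core idea of RynnVLA-002 is to unify the VLA model and the World Model in a single autoregressive Action World Model framework that jointly learns to understand and generate both actions and images. With shared LLM backbone parameters, the VLA branch learns to generate actions from visual observations, while the World Model branch learns to predict future image states from actions and the current image. The two reinforce each other: the World Model's image-generation objective forces the model to attend to the motion of manipulated objects, which strengthens the VLA's attention to object-interaction dynamics; the VLA's visual understanding in turn improves the World Model's image-generation accuracy.

The fundamental differences from existing methods are: (1) a unified vocabulary (65536 tokens) built on the Chameleon architecture natively unifies four modalities: text, image, action, and state; (2) a dedicated action attention mask strategy addresses error accumulation in discrete action chunks; (3) a continuous Action Transformer head compensates for the discrete model's weak generalization and lack of smoothness in the real world.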

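3. Method

3.1 Overall Framework

Figure 1 notes: This figure compares three model paradigms. (a) A VLA Model (e.g., OpenVLA) supports only image understanding and action generation, not image generation or action understanding; (b) a World Model (e.g., iVideoGPT) supports image understanding/generation and action understanding, but not action generation; (c) an Action World Model (e.g., WorldVLA/RynnVLA-002) supports all four capabilities: image understanding, image generation, action understanding, and action generation. Each model has its tokenizers (Image/Text/Action Tokenizer) at the bottom and the matching de-tokenizers at the top.

Figure 2 notes: The overall architecture of RynnVLA-002. A shared LLM backbone (the colored token sequence in the middle) serves two branches. On the left is the VLA Model branch: its inputs come from the Text Tokenizer (language instruction), the State Tokenizer (proprioceptive state $[p, q, g]$), and the Image Tokenizer ($M$ historical frames of front and wrist camera images); its outputs are $K$ steps of discrete actions from the Action De-Tokenizer and $K$ steps of continuous actions $[\Delta x, \Delta\theta, \Delta\text{Grip}]$ generated in parallel by the Action Transformer. On the right is the World Model branch: its inputs come from the Text Tokenizer (generation instruction), the Image Tokenizer (current frame), and the Action Tokenizer ($N$ steps of actions); its output is $N$ predicted future frames from the Image De-Tokenizer. During training, data for the two branches are mixed and optimized jointly.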

3.2 Data Tokenization

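Four tokenizers map into a shared vocabulary of size 65536:

Image Tokenizer: Based on a VQ-GAN model (Esser et al., 2021), with additional perceptual losses that focus on specific image regions (such as faces and salient objects). The compression ratio is 16 and the codebook size is 8192. A 256×256 image yields 256 tokens (a 16×16 grid); a 512×512 image yields 1024 tokens (a 32×32 grid).
Text Tokenizer: The BPE tokenizer inherited from Chameleon.
State & Action Tokenizer: Each dimension of the continuous robot proprioceptive state and action is discretized into 256 bins, with bin widths determined by the range of the training data. The normalization formula is:

$$a_{\text{norm}} = 2 \cdot \frac{a - a_{\min}}{a_{\max} - a_{\min} + \epsilon} - 1$$

The normalized value is then mapped to a bin ID with np.digitize, and the offset of the action start token is added.

Continuous Action: The continuous actions produced by the Action Transformer are raw actions and are not tokenized.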
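The state/action discretization described above can be sketched in plain Python. This is a minimal illustration, not the repository's code: `bisect_right` stands in for `np.digitize` (the two agree for increasing bin edges with the default `right=False`), and `ACTION_TOKEN_OFFSET` is a hypothetical placeholder for the real vocabulary offset.

```python
from bisect import bisect_right

NUM_BINS = 256
# Bin edges equivalent to np.linspace(-1, 1, 256).
BIN_EDGES = [-1.0 + 2.0 * i / (NUM_BINS - 1) for i in range(NUM_BINS)]
# Hypothetical placeholder; the real offset comes from the model vocabulary.
ACTION_TOKEN_OFFSET = 10_000


def normalize(a, a_min, a_max, eps=1e-8):
    """Map a raw value into roughly [-1, 1] using the training-data range."""
    return 2.0 * (a - a_min) / (a_max - a_min + eps) - 1.0


def to_bin(a_norm):
    """Clip to [-1, 1], then return a bin ID in 1..256."""
    clipped = max(-1.0, min(1.0, a_norm))
    return bisect_right(BIN_EDGES, clipped)


def tokenize(values, mins, maxs):
    """Tokenize one state or action vector, one token per dimension."""
    return [
        to_bin(normalize(v, lo, hi)) + ACTION_TOKEN_OFFSET
        for v, lo, hi in zip(values, mins, maxs)
    ]
```

Under these assumptions, a value equal to the training minimum lands in bin 1, a value equal to the training maximum lands in bin 255 (the epsilon keeps it just below 1), and anything clipped from above lands in bin 256.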

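3.3 VLA Model Data Format

Token sequence structure of the VLA model:

$$\{\text{text}\}\ \{\text{state}\}\ \underbrace{\{\text{image-front-wrist}\}}_{\times M}\ \underbrace{\{\text{action}\}}_{\times K}$$

The text input is “What action should the robot take to + <task> + ?”
$M$ is the number of historical frames and $K$ is the action chunk size
The training loss for discrete actions, $L_{\text{dis\_action}}$, is the cross-entropy loss over the discrete action tokens

The VLA policy is formalized as:

$$a_t \sim \pi(a_t \mid l, s_{t-1}, o_{t-h:t}) \tag{1}$$

Here $l$ is the language instruction, $s_{t-1}$ is the proprioceptive state, and $o_{t-h:t}$ are the image observations over the history window.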
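To make the sequence layout concrete, the following sketch counts the state, image, and action tokens in one VLA sample. It assumes the compression ratio of 16 from Section 3.2, two camera views per frame, and 7-dimensional state and action vectors; text tokens and any special boundary tokens are left out because their counts are not given in these notes. The function names are illustrative only.

```python
def image_tokens_per_frame(resolution, compression=16):
    """Tokens for one square image: one token per compressed grid cell."""
    side = resolution // compression
    return side * side


def vla_sequence_budget(m_frames, k_actions, resolution=256,
                        views=2, state_dim=7, action_dim=7):
    """Token counts per segment, excluding text and special tokens."""
    return {
        "state": state_dim,
        "image": m_frames * views * image_tokens_per_frame(resolution),
        "action": k_actions * action_dim,
    }


budget = vla_sequence_budget(m_frames=2, k_actions=10)
print(budget)  # {'state': 7, 'image': 1024, 'action': 70}
```

With $M = 2$ and $K = 10$, image tokens dominate the sequence (1024 of them against 70 action tokens).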

3.4 World Model Data Format

Token sequence structure of the World Model:

$$\{\text{text}\}\ \underbrace{\{\text{images-front-wrist}\}\ \{\text{action}\}\ \overbrace{\{\text{images-front-wrist}\}}^{L_{\text{img}}}}_{\times N}$$

The text prefix is “Generate the next frame based on the current image and the action.”
$N$ is the number of prediction rounds
$L_{\text{img}}$ is the cross-entropy loss over the discrete image tokens

The World Model is formalized as:

$$\hat{o}_t \sim f(o_t \mid o_{t-h:t-1}, a_{t-h:t-1}) \tag{2}$$
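A World Model training sample could be assembled roughly as follows. This is a sketch under two assumptions that these notes do not confirm: each prediction round is a (current image, action, next image) triple, and only the next-image tokens carry loss. The ignore value -100 matches the label convention in the pseudocode of Section 3.8; `build_wm_sequence` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value that cross-entropy skips


def build_wm_sequence(text_tokens, rounds):
    """Build (tokens, labels) for one World Model sample.

    rounds: list of (current_image_tokens, action_tokens, next_image_tokens).
    """
    tokens = list(text_tokens)
    labels = [IGNORE_INDEX] * len(text_tokens)
    for cur_img, action, next_img in rounds:
        # Conditioning part: current image and action, no loss.
        cond = list(cur_img) + list(action)
        tokens += cond
        labels += [IGNORE_INDEX] * len(cond)
        # Prediction target: the next frame's image tokens carry the loss.
        tokens += list(next_img)
        labels += list(next_img)
    return tokens, labels
```

With $N = 1$, as in the training configuration, `rounds` holds a single triple.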

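3.5 Training Objective

VLA and World Model data are mixed for joint training. The total loss of the discrete part is:

$$L_{\text{dis}} = L_{\text{dis\_action}} + L_{\text{img}}$$

Adding the L1 regression loss of the continuous Action Transformer gives the final total loss:

$$L = L_{\text{dis}} + \alpha L_{\text{conti\_action}} = L_{\text{dis\_action}} + L_{\text{img}} + \alpha L_{\text{conti\_action}}$$

where $\alpha = 10$ is the weight of the continuous action loss.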

3.6 Attention Mask for Discrete Action Chunk

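Figure 3 notes: A comparison of three attention masks. (a) The default VLA model uses a standard causal mask: every token sees all preceding tokens, including earlier action tokens, so when an action chunk is generated each later action depends on the predictions for earlier actions and errors accumulate. (b) In the VLA attention mask proposed by RynnVLA-002, action tokens cannot see the tokens of other actions; each action attends only to the preceding text and image tokens. Each action is therefore generated independently from the visual and language input, which breaks the error-propagation chain. (c) The World Model part keeps the conventional causal mask, because image generation needs sequential dependence between tokens.

Code implementation (the generate_att_mask_3 method):

Build a standard lower-triangular causal mask
Identify image blocks (delimited by tokens 8197/8196) and action blocks (delimited by tokens 10004/15004)
Make the action blocks that follow the last image block invisible to one another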
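The three steps above can be tried on a toy sequence. The sketch below is an illustration in plain Python lists, not the repository's `generate_att_mask_3`: block boundaries are passed in directly instead of being located by scanning for delimiter tokens, and `action_chunk_mask` is a hypothetical name.

```python
def action_chunk_mask(seq_len, action_blocks):
    """Causal mask in which the given action blocks cannot attend to each other.

    action_blocks: list of (start, end) index pairs, end exclusive.
    Returns seq_len rows of 0/1; mask[q][k] == 1 means query position q
    may attend to key position k.
    """
    # Step 1: standard lower-triangular causal mask.
    mask = [[1 if k <= q else 0 for k in range(seq_len)] for q in range(seq_len)]
    # Step 2: zero out attention between different action blocks.
    for i, (q_start, q_end) in enumerate(action_blocks):
        for j, (k_start, k_end) in enumerate(action_blocks):
            if i == j:
                continue
            for q in range(q_start, q_end):
                for k in range(k_start, k_end):
                    mask[q][k] = 0
    return mask


# Positions 0-2 are text/image context; two actions of two tokens each follow.
toy = action_chunk_mask(7, [(3, 5), (5, 7)])
for row in toy:
    print(row)
```

In the printed mask, the second action (rows 5 and 6) still sees the context in columns 0 to 2 and its own earlier token, but columns 3 and 4, which belong to the first action, are zero.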

3.7 Action Transformer for Continuous Action Chunk

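Discrete action chunks perform well in simulation but have two key problems on real robots: (1) the large autoregressive architecture overfits severely on limited real-world data; (2) the action mask makes each action be generated independently, so temporal continuity within a chunk is not guaranteed, which causes severe jitter.

Action Transformer design:

Takes the hidden states of the LLM backbone as context
Uses learnable action token embeddings (action_token_embeddings)
Reduces dimensionality with hidden_projection (by default to 0.25× the hidden size)
Processes the result with a multi-layer Transformer encoder (4 attention heads)
Outputs continuous actions through L1RegressionActionHead (an MLP with residual blocks)

Key advantages:

The Action Transformer is much smaller than the base LLM and is less prone to overfitting
Bidirectional attention and parallel decoding: one forward pass outputs the whole action chunk
The generated motion is smoother and more continuous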

3.8 Pseudocode

Algorithm 1: Data Tokenization and Sequence Construction

Algorithm: VLA Data Sequence Construction
Input: language_instruction l, state s, M historical image pairs (front, wrist), K action steps
Output: token_sequence, labels

text_tokens = BPE_tokenize("What action should the robot take to " + l + "?")
state_tokens = []
for d, value in enumerate(s):   # 7-dim proprioceptive state
    s_norm = 2 * (value - s_min[d]) / (s_max[d] - s_min[d] + eps) - 1
    s_bin = digitize(clip(s_norm, -1, 1), linspace(-1, 1, 256))
    state_tokens.append(s_bin + ACTION_START_TOKEN_ID + 1)
image_tokens = []
for i in 1..M:
    front_img = center_crop(front_images[i], 256)
    wrist_img = center_crop(wrist_images[i], 256)
    front_toks = VQGAN_encode(front_img)  # 256 tokens
    wrist_toks = VQGAN_encode(wrist_img)  # 256 tokens
    image_tokens.extend(front_toks)       # keep the sequence flat
    image_tokens.extend(wrist_toks)
action_tokens = []
for k in 1..K:
    for d, value in enumerate(action[k]):  # 7-dim action
        a_norm = 2 * (value - a_min[d]) / (a_max[d] - a_min[d] + eps) - 1
        a_bin = digitize(clip(a_norm, -1, 1), linspace(-1, 1, 256))
        action_tokens.append(a_bin + ACTION_START_TOKEN_ID + 1)
token_sequence = text_tokens + state_tokens + image_tokens + action_tokens
# only action tokens are supervised; -100 is ignored by the loss
labels = [-100] * len(text_tokens + state_tokens + image_tokens) + action_tokens
return token_sequence, labels

Algorithm 2: Action Attention Mask Construction

Algorithm: Action Attention Mask Generation (generate_att_mask_3)
Input: input_ids (batch of token sequences)
Output: attention_mask [batch_size, seq_len, seq_len]

# one standard causal mask per sequence in the batch
mask = lower_triangular(seq_len).repeat(batch_size)
for b, ids in enumerate(input_ids):
    img_blocks = find_blocks(ids, start=8197, end=8196)
    act_blocks = find_blocks(ids, start=10004, end=15004)
    last_img_block = img_blocks[-1]
    # keep only the action blocks AFTER the last image block
    post_img_actions = [blk for blk in act_blocks if blk.start > last_img_block.end]
    # make these action blocks invisible to each other
    for i, block_i in enumerate(post_img_actions):
        for j, block_j in enumerate(post_img_actions):
            if i != j:
                mask[b, block_i.start:block_i.end, block_j.start:block_j.end] = 0
return mask

Algorithm 3: Action Transformer Forward Pass

Algorithm: Action Transformer Inference
Input: hidden_states H from LLM backbone, input_ids, action_dim=7, time_horizon=K
Output: continuous_actions [batch, K, action_dim]

# Extract context hidden states (everything before action tokens)
context_positions = find_positions_before(input_ids, target_token_id=10004)
context_hs = H[context_positions]  # [batch, ctx_len, hidden_dim]
# Project to lower dimension
context_proj = hidden_projection(context_hs)  # compress by 0.25x
# Create learnable action queries
action_queries = action_token_embeddings.expand(batch, K, -1)
# Concatenate context with action queries
combined = concat(context_proj, action_queries, dim=1)
# Process through Transformer encoder (bidirectional attention)
encoded = transformer_encoder(combined)
# Extract action outputs
action_hs = encoded[:, -K:, :]  # last K positions
# Regress continuous actions via MLPResNet
continuous_actions = L1_regression_head(action_hs)  # [batch, K, 7]
return continuous_actions

Algorithm 4: Joint Training

Algorithm: RynnVLA-002 Joint Training
Input: VLA dataset D_vla, World Model dataset D_wm
Output: trained model M_psi

Initialize M_psi from Chameleon pretrained weights
for epoch in 1..num_epochs:
    for batch in shuffle(mix(D_vla, D_wm)):
        if batch is VLA data:
            tokens, labels = construct_vla_sequence(batch)
            att_mask = generate_att_mask_3(tokens)  # action-masked
        else:  # World Model data
            tokens, labels = construct_wm_sequence(batch)
            att_mask = causal_mask(tokens)  # standard causal
        # Forward through shared LLM backbone
        logits, hidden_states = M_psi(tokens, attention_mask=att_mask)
        # Discrete losses
        L_dis_action = cross_entropy(logits[action_positions], labels[action_positions])
        L_img = cross_entropy(logits[image_positions], labels[image_positions])
        # Continuous action loss (only for VLA data)
        if batch is VLA data:
            cont_actions = ActionTransformer(hidden_states, tokens)
            L_conti = L1_loss(cont_actions, ground_truth_actions)
        else:
            L_conti = 0
        # Total loss
        L = L_dis_action + L_img + alpha * L_conti  # alpha = 10
        optimizer.zero_grad()  # clear gradients from the previous step
        L.backward()
        optimizer.step()

3.9 Code-to-Paper Mapping

Paper Concept	Source File	Key Class/Function
Overall model architecture	`rynnvla-002/model/modeling_xllmx_chameleon_ck_action_head.py`	`ChameleonXLLMXForConditionalGeneration_ck_action_head`
Action Transformer Head	`rynnvla-002/model/modeling_xllmx_chameleon_ck_action_head.py`	`ActionHead`, `L1RegressionActionHead`, `MLPResNet`
Action Attention Mask	`rynnvla-002/model/modeling_xllmx_chameleon_ck_action_head.py`	`generate_att_mask_3()`
Model configuration	`rynnvla-002/model/configuration_xllmx_chameleon.py`	`ChameleonXLLMXConfig`
VLA data sequence construction	`rynnvla-002/data/item_processor.py`	`FlexARItemProcessor_Action_State`
Action/State discretization	`rynnvla-002/data/item_processor.py`	`norm_action()`, `np.digitize`
World Model data construction	`rynnvla-002/data/world_model_bi_views_conv_generation.py`	-
Joint training solver	`xllmx/solvers/pretrain/pretrain_ck_action_head.py`	`PretrainSolverBase_ck_action_head`
VLA inference	`rynnvla-002/eval_solver_libero_continous_w_state.py`	`get_action_Chameleon_dis_awm_ck_continous_action()`
LIBERO evaluation	`rynnvla-002/libero_util/run_libero_eval.py`	-
Image Tokenizer (VQGAN)	`rynnvla-002/model/chameleon_vae_ori/vqgan.py`	-
Chameleon Base Model	`rynnvla-002/model/chameleon/modeling_chameleon.py`	`ChameleonForConditionalGeneration`

4. Experimental Setup

Datasets

Simulation (LIBERO Benchmark):

LIBERO-Spatial: spatial-relation tasks; a bowl is placed at different locations
LIBERO-Object: object recognition and manipulation with unique objects
LIBERO-Goal: procedural learning with a fixed set of objects and varying task goals
LIBERO-Long: 10 complex long-horizon manipulation tasks
Data preprocessing: failed trajectories are removed and no-operation actions are filtered out
World Model evaluation: 90%/10% train/validation split

Real robot (LeRobot SO100):

Task 1: Place the block inside the circle (248 demonstrations)
Task 2: Place strawberries into the cup (249 demonstrations)
All trajectories were collected by experts via teleoperation

Baselines

Discrete-action methods: LAPA, TraceVLA, OpenVLA, SpatialVLA, NORA, CoT-VLA, $π_{0}$-FAST, MolmoAct, FlowVLA, UniVLA

Continuous-action methods: Diffusion Policy, Octo, MDT, DiT Policy, MaIL, ThinkAct, $π_{0}$, SmolVLA, OpenVLA-OFT, Seer, UVA

Real-robot baselines: GR00T N1.5, $π_{0}$

Evaluation Metrics

VLA: success rate over 50 deployment rollouts (each with a different initial state)
World Model: FVD (Fréchet Video Distance), PSNR, SSIM, LPIPS
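Two of these metrics are simple enough to sketch directly. The code below uses the standard PSNR definition on flat lists of pixel values and computes a success rate from rollout outcomes. It is an illustration only: the paper's evaluation scripts are not shown in these notes, and FVD, SSIM, and LPIPS need dedicated implementations that are out of scope here.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    if not pred or len(pred) != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def success_rate(outcomes):
    """Percentage of successful rollouts, given a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)


# 40 successes out of 50 rollouts.
print(success_rate([True] * 40 + [False] * 10))  # 80.0
```

Higher PSNR means the predicted frame is closer to the ground truth; with 50 rollouts per task, each rollout moves the success rate by 2 percentage points.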

Training Configuration

Base model: Chameleon 7B
Number of historical frames: $M = 2$
Action chunk size: $K = 10$ (LIBERO-Long, LIBERO-Spatial), $K = 5$ (LIBERO-Object, LIBERO-Goal)
World Model prediction rounds: $N = 1$
Continuous action loss weight: $\alpha = 10$
Image token loss weight: 0.04
VLA image resolution: 256×256
World Model image resolution: 512×512
Action dimension: 7 ($\Delta x, \Delta\theta, \Delta\text{Grip}$)
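For reference, the settings listed above can be collected in one place. The dataclass below is only a convenience sketch: the field names are hypothetical and do not mirror the repository's `ChameleonXLLMXConfig` or its training scripts.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from the list above (field names are illustrative)."""
    base_model: str = "Chameleon 7B"
    history_frames: int = 2          # M
    wm_rounds: int = 1               # N
    alpha: float = 10.0              # continuous action loss weight
    image_loss_weight: float = 0.04
    vla_resolution: int = 256
    wm_resolution: int = 512
    action_dim: int = 7
    # Action chunk size K depends on the LIBERO suite.
    chunk_sizes: dict = field(default_factory=lambda: {
        "LIBERO-Long": 10,
        "LIBERO-Spatial": 10,
        "LIBERO-Object": 5,
        "LIBERO-Goal": 5,
    })

    def chunk_size(self, suite):
        """Return K for a given LIBERO suite name."""
        return self.chunk_sizes[suite]


cfg = TrainConfig()
print(cfg.chunk_size("LIBERO-Long"), cfg.chunk_size("LIBERO-Goal"))  # 10 5
```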

5. Experimental Results

LIBERO Benchmark Main Results (success rate, %)

Method	Pretraining	Action Type	Spatial	Object	Goal	Long	Average
OpenVLA	Yes	Discrete	84.7	88.4	79.2	53.7	76.5
$π_{0}$-FAST	Yes	Discrete	96.4	96.8	88.6	60.2	85.5
UniVLA	Yes	Discrete	96.5	96.8	95.6	92.0	95.2
$π_{0}$	Yes	Continuous	90.0	86.0	95.0	73.0	86.0
OpenVLA-OFT	Yes	Continuous	97.6	98.4	97.9	94.5	97.1
RynnVLA-002-Discrete	No	Discrete	94.2	96.8	94.6	87.6	93.3
RynnVLA-002-Continuous	No	Continuous	99.0	99.8	96.4	94.4	97.4

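Key finding: Without any pretraining, the continuous-action version of RynnVLA-002 reaches a 97.4% average success rate, on par with and slightly above OpenVLA-OFT (97.1%), which uses large-scale pretraining.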
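As a consistency check on the table, the Average column can be recomputed as the unweighted mean of the four suite scores. The numbers below are copied from the table; the dictionary layout and the ASCII method names (`pi0` for $π_{0}$) are just a convenient encoding.

```python
from statistics import mean

# (Spatial, Object, Goal, Long, reported Average), copied from the table.
LIBERO = {
    "OpenVLA": (84.7, 88.4, 79.2, 53.7, 76.5),
    "pi0-FAST": (96.4, 96.8, 88.6, 60.2, 85.5),
    "UniVLA": (96.5, 96.8, 95.6, 92.0, 95.2),
    "pi0": (90.0, 86.0, 95.0, 73.0, 86.0),
    "OpenVLA-OFT": (97.6, 98.4, 97.9, 94.5, 97.1),
    "RynnVLA-002-Discrete": (94.2, 96.8, 94.6, 87.6, 93.3),
    "RynnVLA-002-Continuous": (99.0, 99.8, 96.4, 94.4, 97.4),
}


def recomputed_average(name):
    """Unweighted mean of the four suite scores, rounded to one decimal."""
    *suites, _reported = LIBERO[name]
    return round(mean(suites), 1)


for name, row in LIBERO.items():
    print(f"{name}: reported {row[4]}, recomputed {recomputed_average(name)}")
```

Every reported average matches the recomputed value to one decimal place, so the 0.3-point gap between RynnVLA-002-Continuous and OpenVLA-OFT follows directly from the per-suite numbers.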

Real-Robot Results (SO100, success rate, %)

Task / Setting	GR00T N1.5	$π_{0}$	RynnVLA-002
Block → Circle / Single	90.0	100.0	90.0
Block → Circle / Multi	60.0	70.0	90.0
Block → Circle / w/ Distractors	50.0	50.0	80.0
Strawberry → Cup / Single	50.0	80.0	80.0
Strawberry → Cup / Multi	50.0	70.0	80.0
Strawberry → Cup / w/ Distractors	70.0	40.0	50.0

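Key finding: Without pretraining, RynnVLA-002 outperforms the pretrained baselines by 10 to 30 percentage points in most multi-target and distractor settings, which suggests that joint training with the World Model improves robustness in complex scenes. The exception in the table above is Strawberry → Cup with distractors, where GR00T N1.5 (70.0) is ahead of RynnVLA-002 (50.0).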
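The per-setting margins behind this finding can be read off the table with a few lines of code. The scores are copied from the real-robot table; the margin is RynnVLA-002 minus the stronger of the two baselines, in percentage points. The helper name and the ASCII labels are illustrative.

```python
# Success rates (%) copied from the real-robot table.
REAL_ROBOT = {
    ("Block -> Circle", "Single"): {"GR00T N1.5": 90.0, "pi0": 100.0, "RynnVLA-002": 90.0},
    ("Block -> Circle", "Multi"): {"GR00T N1.5": 60.0, "pi0": 70.0, "RynnVLA-002": 90.0},
    ("Block -> Circle", "Distractors"): {"GR00T N1.5": 50.0, "pi0": 50.0, "RynnVLA-002": 80.0},
    ("Strawberry -> Cup", "Single"): {"GR00T N1.5": 50.0, "pi0": 80.0, "RynnVLA-002": 80.0},
    ("Strawberry -> Cup", "Multi"): {"GR00T N1.5": 50.0, "pi0": 70.0, "RynnVLA-002": 80.0},
    ("Strawberry -> Cup", "Distractors"): {"GR00T N1.5": 70.0, "pi0": 40.0, "RynnVLA-002": 50.0},
}


def margin_over_best_baseline(scores, ours="RynnVLA-002"):
    """Our score minus the best baseline score, in percentage points."""
    best = max(v for name, v in scores.items() if name != ours)
    return scores[ours] - best


for setting, scores in REAL_ROBOT.items():
    print(setting, margin_over_best_baseline(scores))
```

The margins over the best baseline are +20, +30, and +10 points in three of the four multi-target and distractor settings, and negative in the single-target Block → Circle setting and in Strawberry → Cup with distractors.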

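Ablation Study: Key Findings

World Model improves the VLA:

Discrete actions: adding the World Model raises the success rate from 62.8% to 67.2% (Table 3, Line 1→2)
Continuous actions: from 91.6% to 94.6% (Table 4, Line 2→3)
Real robot: models trained without the World Model stay below 30% success; adding it raises success to 80%+ (Table 5)

VLA improves the World Model (Table 6):

The Action World Model beats the pure World Model on FVD, PSNR, SSIM, and LPIPS across all LIBERO subsets
For example, FVD on the Object subset: 1141.6 → 877.2

Effect of the attention mask:

With the action attention mask, discrete-action success rises from 54.0% to 76.6% (Table 3, Line 3→4)
The advantage grows as the chunk size increases (Figure 6)

Wrist camera and proprioceptive state: In the real-robot experiments, removing either one drops the success rate to 0% (Table 5, Line 2-3)

Discrete actions speed up convergence of continuous-action training: Keeping the discrete action token objective not only accelerates convergence but also improves the final success rate (Figure 8)

World Model pretraining: Pretraining with the World Model objective further improves the VLA (Table 8, Long: 23.0% → 30.2%)

Inference efficiency (Table 7): Continuous-action inference is markedly faster than discrete-action inference (for example, 24.94 Hz vs 3.69 Hz at chunk size 5), and its frequency is nearly independent of the chunk size.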
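A rough back-of-the-envelope view of this gap: the reported frequencies give the speedup directly, and counting decoding steps hints at why it arises. The step count assumes one autoregressive decoding step per discrete action token with 7 tokens per action; that is an assumption about the decoding loop, not a figure from the paper. The function names are illustrative.

```python
def speedup(fast_hz, slow_hz):
    """Ratio of two inference frequencies."""
    return fast_hz / slow_hz


def discrete_decode_steps(chunk_size, action_dim=7):
    """Sequential token-generation steps for one discrete action chunk,
    assuming one decoding step per action token."""
    return chunk_size * action_dim


print(round(speedup(24.94, 3.69), 2))   # 6.76
print(discrete_decode_steps(5))         # 35
print(discrete_decode_steps(10))        # 70
```

Under that assumption the discrete path needs 35 sequential steps at chunk size 5 and 70 at chunk size 10, whereas the Action Transformer emits the whole chunk in one forward pass, which is consistent with its frequency staying nearly flat as the chunk size grows.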

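Limitations

The paper does not discuss the effect of pretraining on larger-scale cross-embodiment data
The World Model uses only single-step prediction ($N = 1$); multi-step prediction is not fully explored
The real-robot experiments are limited to a single platform (SO100) and two pick-and-place tasks

A Note on Reinforcement Learning

This work does not use Reinforcement Learning (RL). RynnVLA-002 is trained entirely with supervised learning (imitation learning). Its training objectives are:

Cross-entropy loss on discrete action/image tokens (supervised by expert demonstrations)
L1 regression loss on continuous actions (supervised by ground-truth actions from expert demonstrations)

It involves no Reward Model, no zero-shot VLM judgment, and no reward signal design.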
