Unified Video Action Model (UVA)

Authors: Shuang Li, Yihuai Gao, Dorsa Sadigh, Shuran Song
Affiliations: Stanford University
arXiv: 2503.00200
Project Page: unified-video-action-model.github.io
GitHub: ShuangLI59/unified_video_action
Venue: RSS 2025

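1. Motivation

Conflicting requirements of video generation and action prediction: Action modeling needs high temporal frequency to capture dense, fine-grained motion, whereas video generation needs high spatial resolution to produce high-fidelity visual output, which tends to slow processing. Existing methods struggle to satisfy both requirements at once.
Limitations of action-only methods: Pure action methods such as Diffusion Policy skip video generation entirely. This reduces computational cost but gives up the benefits of video generation: video supervision helps the model learn scene dynamics, reduces overfitting to the action history, and improves robustness to visual perturbations.
Limitations of video-generation methods: Hierarchical methods such as UniPi, which first generate video and then predict actions, suffer from slow inference and error propagation. Errors in the generated video accumulate and carry over into the action prediction.
Core question: How can video and action be modeled together in one unified framework so that action learning still benefits from video generation while action inference remains fast?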

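2. Idea

The core insight of UVA: jointly train a unified latent representation of video and action, but decouple the two at decoding time.

Concretely, UVA achieves this through three key designs:

Unified latent video-action representation: A Transformer learns a joint latent space that fuses visual and action information, so that interactions between the video and action domains are modeled at the latent level.
Decoupled video-action diffusion: Two separate lightweight diffusion heads decode video and action respectively. At inference time, video generation can be skipped and actions decoded directly from the joint latent, giving inference speed close to that of action-only methods.
Masked training for versatility: By randomly masking combinations of inputs and outputs, a single model can act as a policy, a video model, a forward or inverse dynamics model, and a combined policy and planner.

The essential difference from prior methods: UVA neither generates video first and then predicts actions in a hierarchy (as UniPi does) nor discards video information altogether (as Diffusion Policy does). Instead, it fuses video and action in the latent space and decouples them flexibly at decoding time.

On reinforcement learning: This paper does not use RL. UVA is an imitation-learning framework trained with supervised diffusion losses; it involves no reward model, no VLM-based judgment, and no reward signal of any kind.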

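3. Method

3.1 Overall Framework

Figure 1: Panel (a), on the left, gives an overview of UVA: historical observations and actions are fed into a joint sequence model (a Transformer) that produces the joint latent representation, which two separate diffusion heads then decode into actions and video observations. Panel (b), on the right, shows how different mask combinations yield different functions: policy (input: historical observations and actions; output: future actions), video model (input: historical observations; output: future video), forward dynamics (input: observations and actions; output: future observations), inverse dynamics (input: an observation sequence; output: actions), and policy plus planner (input: history; output: both future video and actions).

Figure 2: The detailed network architecture. The left side shows input encoding: historical observations are encoded by the VAE and flattened into $N$ tokens, the historical action chunk is repeated $M$ times to match the number of visual tokens, and some tokens of the future observation frames are randomly masked. These tokens are concatenated channel-wise, passed to the temporal-wise concatenation module, and then processed by the Transformer to produce the joint latent $Z$. The right side shows decoding: each latent token is fed to the video diffusion head and decoded into a patch of the future observation frame, while all latent tokens are aggregated and fed to the action diffusion head, which decodes the future action sequence.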

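3.2 Encode History

Visual observation encoding:

The historical image observations $\{O_{t - h + 1}, \dots, O_{t}\}$ are encoded by a pretrained VAE encoder (kl-f16) into latent maps in $R^{w \times h \times c}$
Each image is flattened and projected by an FC layer into $N$ tokens of dimension $d$ ($N = 256$, corresponding to a $16 \times 16$ spatial grid)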
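
The token count $N = 256$ follows from the VAE downsampling factor. A minimal sketch of that arithmetic, assuming the 256x256 input resolution and the 16x downsampling of kl-f16 listed in the training configuration; the function name `num_visual_tokens` is hypothetical and not part of the UVA codebase:

```python
def num_visual_tokens(image_size: int, downsample: int) -> tuple[int, int]:
    """Return (grid_side, token_count) for a square image and a square latent grid."""
    if image_size % downsample != 0:
        raise ValueError("image_size must be divisible by downsample")
    side = image_size // downsample
    return side, side * side


# 256x256 image, 16x downsampling: a 16x16 grid, i.e. 256 tokens per frame.
side, n_tokens = num_visual_tokens(256, 16)
print(side, n_tokens)  # 16 256
```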

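Action encoding:

Actions are sampled at a higher frequency than visual observations; each observation corresponds to $L$ actions
The action chunk $A_{t} \in R^{L \times m}$ is repeated $M$ times to match the number of visual tokens
An FC layer projects the result into $N$ action tokens of dimension $d$

Positional encoding:

Learnable temporal and spatial position embeddings are used, combined additively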

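3.3 Masked Autoencoder for Observation Prediction

UVA's video prediction combines a Masked Autoencoder (MAE) with diffusion, inspired by MaskGIT and MAR:

The future observation frames $\{O_{t + 1}, \dots, O_{t + h}\}$ are likewise encoded by the VAE into latent tokens
Training: a random subset of visual tokens is masked (the mask ratio is sampled from a truncated Gaussian centered at 100% with std 0.25 and a minimum of 0.7), and the model learns to reconstruct the masked tokens
Inference: generation starts with every token masked and fills in all tokens progressively over autoregressive steps
Consistent masking: to avoid information leakage, all video frames use the same mask pattern at the same spatial positions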
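
The training-time masking can be sketched as follows. This is a minimal standard-library illustration, not the repository's implementation: the helper names are hypothetical, rejection sampling stands in for whatever truncated-Gaussian sampler the code actually uses, and rounding the masked count up is an assumption.

```python
import math
import random


def sample_mask_ratio(rng, loc=1.0, std=0.25, lo=0.7, hi=1.0):
    """Rejection-sample a Gaussian(loc, std) truncated to [lo, hi]."""
    while True:
        ratio = rng.gauss(loc, std)
        if lo <= ratio <= hi:
            return ratio


def make_consistent_mask(rng, num_frames, tokens_per_frame, ratio):
    """Build a per-frame mask (1 = masked) that is identical for every frame."""
    num_masked = math.ceil(ratio * tokens_per_frame)  # assumption: round up
    order = list(range(tokens_per_frame))
    rng.shuffle(order)  # random spatial order, shared by all frames
    masked = set(order[:num_masked])
    frame_mask = [1 if i in masked else 0 for i in range(tokens_per_frame)]
    return [list(frame_mask) for _ in range(num_frames)]


rng = random.Random(0)
ratio = sample_mask_ratio(rng)
# 4 frames of 256 tokens each are illustrative sizes.
mask = make_consistent_mask(rng, num_frames=4, tokens_per_frame=256, ratio=ratio)
```

Because the ratio is at least 0.7, most future tokens are hidden in every training step, and because every frame shares one spatial pattern, a masked patch cannot be read off from the same position in a neighboring frame.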

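Temporal concatenation: The latent representations from the $h$ time steps are concatenated along the time dimension to form a latent sequence of length $N \times h$, which is fed into the Transformer together with the historical visual and action tokens.

Language conditioning: For tasks that require language instructions (such as Libero10), the instruction is encoded by the CLIP text encoder into $d$-dimensional tokens, repeated $M$ times, appended to the sequence, and fused through the Transformer's cross-attention.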

3.4 Decoupled Video and Action Diffusions

This is one of UVA's most important designs. Unlike conventional approaches that run denoising through the entire model, UVA confines denoising to two lightweight decoders.

Joint latent $Z$: the joint video-action latent representation $\{Z_{t + 1}, \dots, Z_{t + h}\}$ output by the Transformer; each $Z_{t + 1}$ contains $N$ latent tokens.

Video diffusion decoder:

For each latent token $z_{i} \in Z_{t + 1} = \{z_{1}, \dots, z_{N}\}$, the video diffusion head predicts the corresponding patch
The predictions are reshaped and passed to the VAE decoder to reconstruct the full frame $O_{t + 1}$
Architecture: SimpleMLPAdaLN, a lightweight MLP with AdaLN modulation, comprising a TimestepEmbedder and several ResBlocks

Action diffusion decoder:

All latent tokens are spatially aggregated by a Conv + FC network (the $16 \times 16$ spatial grid is reduced to $4 \times 4$ and then flattened) to produce the action latent
Temporal interpolation (from 4 frames to 16 action steps) followed by a refinement MLP yields the final action condition
A SimpleMLPAdaLN with the same structure as the video diffusion head serves as the diffusion head that generates the action chunk $A_{t}$
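
The temporal interpolation step in the action decoder (4 frame-level latents stretched to 16 action steps) can be illustrated with a small sketch. It uses plain lists instead of tensors, the function name `linear_interp` is hypothetical, and the endpoint-preserving ("align corners") convention is an assumption; the repository may use a different interpolation mode.

```python
def linear_interp(frames, out_len):
    """Linearly resample a list of equal-length vectors to out_len steps.

    The first and last vectors are preserved exactly (assumed convention).
    """
    n = len(frames)
    if n < 2 or out_len < 2:
        raise ValueError("need at least two input frames and two output steps")
    out = []
    for j in range(out_len):
        pos = j * (n - 1) / (out_len - 1)  # fractional index into frames
        lo = min(int(pos), n - 2)          # left neighbor
        w = pos - lo                       # weight of the right neighbor
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[lo + 1])])
    return out


# Four 1-dimensional frame latents expanded to 16 action-step conditions.
frame_latents = [[0.0], [3.0], [6.0], [9.0]]
action_conditions = linear_interp(frame_latents, 16)
print(len(action_conditions))  # 16
```

Each of the 16 interpolated vectors then conditions the diffusion head for one action step, which is how the action stream can run at a higher temporal rate than the video frames.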

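3.5 Loss Functions

Action diffusion loss:

$$L_{action}(Z, A) = E_{\epsilon, k}\left[\lVert \epsilon - \epsilon_{\theta}(A^{(k)} \mid k, Z) \rVert^{2}\right]$$

where $A^{(k)}$ is the noised action, $k$ is the diffusion timestep, $\epsilon$ is the added noise, and $\epsilon_{\theta}(A^{(k)} \mid k, Z)$ is the noise prediction conditioned on the joint latent $Z$.

Video diffusion loss:

$$L_{video}(Z, O) = E_{\epsilon, k}\left[\frac{1}{N} \sum_{i = 1}^{N} \lVert \epsilon_{i} - \epsilon_{\phi}(O^{i, (k)} \mid k, z_{i}) \rVert^{2}\right]$$

where $O^{i, (k)}$ is the noised version of the $i$-th visual token at diffusion timestep $k$, and $z_{i}$ is the corresponding latent token.

Total loss: the loss at each time step is the sum of the two diffusion losses:

$$L = L_{action} + L_{video}$$

The overall loss sums this over all $h$ time steps.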
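
Both losses are standard noise-prediction objectives. The sketch below works through one sample of such a loss in plain Python. It assumes a DDPM-style forward process with a linear beta schedule (the schedule and its constants are assumptions, not taken from the paper), and all function names are hypothetical. The `cond` argument plays the role of the joint latent $Z$ (or of a single token $z_{i}$ for the video head).

```python
import math
import random


def alpha_bar(k, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) for i <= k under a linear beta schedule."""
    prod = 1.0
    for i in range(k + 1):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(x0, eps, k):
    """Forward process: x_k = sqrt(a_bar_k) * x0 + sqrt(1 - a_bar_k) * eps."""
    a = alpha_bar(k)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]


def diffusion_loss(x0, predictor, cond, rng, num_steps=1000):
    """One-sample estimate of E[|| eps - eps_pred(x_k | k, cond) ||^2]."""
    k = rng.randrange(num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    eps_pred = predictor(add_noise(x0, eps, k), k, cond)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def make_oracle(x0):
    """A predictor that knows x0 and therefore recovers the noise exactly."""
    def predictor(xk, k, cond):
        a = alpha_bar(k)
        return [(x - math.sqrt(a) * c) / math.sqrt(1.0 - a) for x, c in zip(xk, x0)]
    return predictor


rng = random.Random(0)
action = [0.5, -1.0, 2.0]  # a toy "action chunk"
action_loss = diffusion_loss(action, make_oracle(action), cond=None, rng=rng)
```

A predictor that recovers the noise exactly drives the loss to zero; a real diffusion head only sees the noised input, the timestep, and the conditioning latent, so its loss measures how much information about the clean target the latent carries.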

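3.6 Masked Training with Flexible Objectives

Randomly choosing a different task mode during training lets a single model support multiple functions:

Task Mode	Input / mask	Output	Loss
`policy_model`	Historical observations + actions; future video masked	Future actions	$L_{action}$
`video_model`	Historical observations; future video partially masked	Future video	$L_{video}$
`dynamic_model`	Historical observations + actions (including future actions)	Future video	$L_{video}$
`inverse_model`	Historical + future observations; actions masked	Actions	$L_{action}$
`full_dynamic_model`	Full input	Video + actions	$L_{video} + L_{action}$

During training, one task mode is sampled at random for each batch, and components that are not used in that mode are replaced by learned mask tokens.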

3.7 Pseudocode

Algorithm 1: UVA Forward Pass (Training)

Algorithm: UVA Training Forward Pass
Input: imgs (B, T, C, H, W), cond (B, T, C, H, W), history_nactions, nactions, text_latents, task_mode
Output: loss, video_loss, action_loss
 
# Patchify images and conditions via VAE
x = patchify(imgs)           # (B, T, seq_len, token_dim)
cond = patchify(cond)        # (B, T, seq_len, token_dim)
gt_latents = x.clone()

# Generate random mask (consistent across temporal frames)
orders = sample_random_orders(B, seq_len)
mask_rate ~ TruncatedGaussian(loc=1.0, std=0.25, min=0.7)
mask = create_spatial_mask(orders, mask_rate)  # same mask for all T frames

# === MAE Encoder ===
if task_mode == "policy_model":
    cond_tokens = z_proj_cond(cond)
    x_tokens = fake_latent_x.expand(B, T*S, d)   # all masked
elif task_mode == "inverse_model":
    x_tokens = z_proj(x)                           # future obs visible
    cond_tokens = fake_latent_x.expand(...)         # history masked
else:
    cond_tokens = z_proj_cond(cond)
    x_tokens = z_proj(x)
    x_tokens[mask == 1] = fake_latent_x            # partial mask

# Concatenate all modality tokens
action_tokens = action_proj_cond(nactions) if dynamic_model else fake_action_latent
combined = cat([x_tokens, cond_tokens, action_tokens], dim=-1)
combined = proj_cond_x_layer(combined)
combined = combined + temporal_pos_embed + spatial_pos_embed

# Optional: append language tokens
if text_latents is not None:
    combined = cat([text_proj(text_latents), combined], dim=1)

# Transformer Encoder
for block in encoder_blocks:
    combined = block(combined)
encoder_out = encoder_norm(combined)

# === MAE Decoder ===
z = decoder_embed(encoder_out)
z = z + decoder_pos_embed
for block in decoder_blocks:
    z = block(z)
z = decoder_norm(z) + diffusion_pos_embed

# === Compute Losses ===
video_loss = diffloss(z, gt_latents, mask)
action_loss = diffactloss(z, nactions)
loss = video_loss + action_loss  # depends on task_mode
return loss, video_loss, action_loss

Algorithm 2: Action Diffusion Head (DiffActLoss)

Algorithm: Action Diffusion Decoder
Input: z (B, T*S, d) - joint latent, target (B, num_actions, act_dim) - ground truth actions
Output: action_loss
 
# Spatial aggregation: reshape latent to spatial grid
z = rearrange(z, "B (T S) d -> (B T) S d", T=4)
z = rearrange(z, "BT (W H) d -> BT d W H", W=16)

# Conv + Pool to reduce spatial dimensions
z = Conv2d(z)                    # (B*T, d, 16, 16) -> (B*T, d, 4, 4)
z = AdaptiveAvgPool2d(z, (4,4))
z = flatten(z)                   # (B*T, d*4*4)
z = FC(z)                        # (B*T, d)

# Temporal interpolation: 4 frames -> 16 action steps
z = rearrange(z, "(B T) d -> B T d", T=4)
z = Linear_interp(z, 4 -> 16)    # (B, 16, d)
z = refine_MLP(z)                 # (B, 16, d)

# Diffusion training loss (one timestep per action step)
z = rearrange(z, "B T d -> (B T) d")
target = rearrange(target, "B T act_dim -> (B T) act_dim")
t = randint(0, num_diffusion_steps, (B*T,))
loss = diffusion_training_loss(SimpleMLPAdaLN, target, t, condition=z)
return mean(loss)

Algorithm 3: Video Diffusion Head (DiffLoss)

Algorithm: Video Diffusion Decoder
Input: z (B, T*S, d) - joint latent per token, target (B, T*S, token_dim) - GT visual tokens, mask (B, T*S)
Output: video_loss
 
# Reshape for per-token diffusion (S here spans all T*S tokens)
target = rearrange(target, "B S d -> (B S) d")
z = rearrange(z, "B S d -> (B S) d")
mask = rearrange(mask, "B S -> (B S)")

# Sample diffusion timestep per token
t = randint(0, num_diffusion_steps, (B*S,))

# Compute diffusion loss (noise prediction MSE)
loss = diffusion_training_loss(SimpleMLPAdaLN, target, t, condition=z)

# Only compute loss on masked positions
loss = (loss * mask).sum() / mask.sum()
return loss

Algorithm 4: Masked Training Strategy

Algorithm: Masked Training Mode Selection
Input: batch data, task_modes list
Output: total loss
 
# Randomly select a task mode for this batch
selected_mode = random.choice(task_modes)
# task_modes = ["video_model", "dynamic_model", "policy_model",
#               "inverse_model", "full_dynamic_model"]

# Process data: VAE encode, normalize, extract trajectories
x, z, cond = get_vae_latent(batch_images, vae_model)
history_traj, future_traj = get_trajectory(actions)

# Forward pass with selected mode
loss, video_loss, act_loss = model(z, cond, history_traj, future_traj,
                                    text_latents, task_mode=selected_mode)

# In forward_loss:
if selected_mode in ["video_model", "dynamic_model"]:
    total_loss = video_diffusion_loss          # only video
elif selected_mode in ["policy_model", "inverse_model"]:
    total_loss = action_diffusion_loss          # only action
elif selected_mode == "full_dynamic_model":
    total_loss = video_diffusion_loss + action_diffusion_loss  # both

return total_loss

3.8 Code-to-Paper Mapping Table

Paper Concept	Source File	Key Class/Function
Top-level policy wrapper	`policy/unified_video_action_policy.py`	`UnifiedVideoActionPolicy`
MAR Transformer (Joint Latent)	`model/autoregressive/mar_con_unified.py`	`MAR`
MAE Encoder	`model/autoregressive/mar_con_unified.py`	`MAR.forward_mae_encoder()`
MAE Decoder	`model/autoregressive/mar_con_unified.py`	`MAR.forward_mae_decoder()`
Video Diffusion Head	`model/autoregressive/diffusion_loss.py`	`DiffLoss`, `SimpleMLPAdaLN`
Action Diffusion Head	`model/autoregressive/diffusion_action_loss.py`	`DiffActLoss`
Gaussian diffusion implementation	`model/autoregressive/diffusion/gaussian_diffusion.py`	`GaussianDiffusion`
Loss computation (task mode routing)	`model/autoregressive/mar_con_unified.py`	`MAR.forward_loss()`
Training workspace	`workspace/train_unified_video_action_workspace.py`	`TrainUnifiedVideoActionWorkspace.run()`
Random Masking	`model/autoregressive/mar_con_unified.py`	`MAR.random_masking()`
Token sampling (inference)	`model/autoregressive/mar_con_unified.py`	`MAR.sample_tokens()`
VAE Encoder/Decoder	`vae/vaekl.py`	`AutoencoderKL`
Data processing and normalization	`utils/data_utils.py`	`process_data()`, `get_vae_latent()`
Training configuration	`config/model/uva.yaml`	YAML config

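4. Experimental Setup

Datasets and Tasks

Simulation:

PushT: push a gray T-shaped block into alignment with the target pose (single task); success rate of the best checkpoint over 50 rollouts
ToolHang: insert a hook into a base and hang a wrench on it (single task); one of the hardest RoboMimic tasks
PushT-M (multi-goal): an extension of PushT with varied target T poses (multi-task)
Libero10: 10 long-horizon tasks whose goals are described by language instructions, evaluated in 50 different environments

Real world:

Public datasets collected with the handheld UMI device
3 tasks: Cup Arrangement, Towel Folding, Mouse Arrangement
Tested on an ARX X5 robot arm; the evaluation includes OOD scenarios (different objects, backgrounds, and gripper colors)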

Baselines

Method	Type	Description
DP-C (Diffusion Policy CNN)	Action-only	CNN-based Diffusion Policy
DP-T (Diffusion Policy Transformer)	Action-only	Transformer-based Diffusion Policy
DP-UMI	Action-only	Diffusion Policy optimized for UMI data, with CLIP ViT-B/16
OpenVLA	VLA	Vision-Language-Action model built on 7B Llama 2
UniPi	Video-based	Generates video first, then extracts actions
$π_{0}$ / $π_{0}$ -FAST	VLA	3.0B/3.3B flow matching VLA model
UVA-action	Ablation	UVA with the video generation part removed

Training Configuration

Model size: MAR-Base backbone (~171.27M parameters; 0.5B total parameters)
VAE: pretrained kl-f16 (16x downsampling, 16-dim latent)
Image resolution: 256x256
Diffusion steps: 1000 training steps (action); 100 sampling steps (simulation), 16 sampling steps (real world)
Optimizer: AdamW, lr=1e-4, weight_decay=0.02, betas=(0.9, 0.95)
LR schedule: cosine with 1000 warmup steps
Batch size: 32
Mixed precision: FP16
EMA: enabled, power=0.75, max_value=0.9999
Hardware: 8x H100 GPUs; about 2 days of video pretraining plus about 2 days of joint training (UMI dataset)
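
To make the schedule settings concrete, here is a small sketch of a cosine learning-rate schedule with linear warmup and of a warmup-style EMA decay. Only `lr=1e-4`, the 1000 warmup steps, `power=0.75`, and `max_value=0.9999` come from the configuration above. The total step count, the `inv_gamma` parameter, and the exact EMA formula are assumptions (the formula is a commonly used warmup form, not confirmed for this codebase), and both function names are hypothetical.

```python
import math


def lr_at(step, base_lr=1e-4, warmup_steps=1000, total_steps=100_000):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps.

    total_steps is an illustrative value, not taken from the source.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


def ema_decay(step, power=0.75, max_value=0.9999, inv_gamma=1.0):
    """Assumed warmup form: 1 - (1 + step / inv_gamma) ** -power, clipped to [0, max_value]."""
    value = 1.0 - (1.0 + step / inv_gamma) ** -power
    return max(0.0, min(max_value, value))


for step in (0, 500, 1000, 50_000, 100_000):
    print(step, lr_at(step), ema_decay(step))
```

Under these assumptions the learning rate peaks at step 1000, while the EMA decay starts at zero (the average simply tracks the weights) and approaches its cap of 0.9999 as training proceeds.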

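5. Experimental Results

5.1 Policy Learning: Simulation

Figure 3: All simulation evaluation environments, covering single-task settings (PushT, ToolHang) and multi-task settings (multi-goal PushT-M, multi-goal Libero10).

Method	PushT (single task)	ToolHang (single task)	PushT-M (multi-task)	Libero10 (multi-task)	Speed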
DP-C	0.91	0.95	0.68	0.53	0.50s
DP-T	0.78	0.76	0.63	0.58	0.36s
OpenVLA	0.35	0.18	0.22	0.54	1.52s
UniPi	0.42	0.00	0.19	0.00	24.07s
$π_{0}$	-	-	-	0.85	0.09s
$π_{0}$ -FAST	-	-	-	0.60	0.09s
UVA-action	0.45	0.62	0.46	0.86	0.22s
UVA	0.98	0.88	0.88	0.90	0.23s

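Key findings:

UVA is especially strong in multi-task settings: it beats the best baseline by 20 percentage points on PushT-M and by 5 percentage points on Libero10
In single-task settings UVA is competitive with the state of the art (PushT: 0.98 vs. 0.91 for DP-C; ToolHang: 0.88 vs. 0.95 for DP-C)
Inference time (0.23s) is in the range of action-only methods such as DP-T (0.36s)
UVA-action (UVA without video generation) performs markedly worse on PushT, ToolHang, and PushT-M, which supports the value of joint video-action training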
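
The multi-task margins quoted above can be reproduced directly from the results table. The sketch below copies the PushT-M and Libero10 success rates from the table and leaves out UVA-action, since it is an ablation of UVA rather than a baseline; the function name and the ASCII keys `pi0` and `pi0-FAST` are hypothetical.

```python
results = {
    "PushT-M": {"DP-C": 0.68, "DP-T": 0.63, "OpenVLA": 0.22, "UniPi": 0.19, "UVA": 0.88},
    "Libero10": {
        "DP-C": 0.53, "DP-T": 0.58, "OpenVLA": 0.54, "UniPi": 0.00,
        "pi0": 0.85, "pi0-FAST": 0.60, "UVA": 0.90,
    },
}


def margin_over_best_baseline(scores, method="UVA"):
    """Return (best baseline name, method score minus best baseline score)."""
    baselines = [(name, s) for name, s in scores.items() if name != method]
    best_name, best_score = max(baselines, key=lambda item: item[1])
    return best_name, round(scores[method] - best_score, 2)


for task, scores in results.items():
    print(task, margin_over_best_baseline(scores))
# PushT-M ('DP-C', 0.2)
# Libero10 ('pi0', 0.05)
```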

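5.2 Policy Learning: Real World

Method	Cup (single task)	Cup (OOD, multi-task)	Towel (OOD)	Mouse (OOD)	Speed
DP-UMI	0.95	0.50	0.70	0.40	70ms
UVA	0.85	0.65	0.70	0.80	95ms

Figure 4: Real-world OOD test scenarios, including different initial configurations (10%), different objects or distractors (10%), different backgrounds (8.3%), a different gripper color (an unseen green), and entirely different tasks (Mouse 33.3%, Towel 33.3%).

Key findings:

In the multi-task OOD setting UVA clearly outperforms DP-UMI on Cup (+15 percentage points) and Mouse (+40 percentage points), and ties on Towel
In the single-task setting DP-UMI is slightly better (0.95 vs. 0.85); the data contains a large amount of failure-recovery data, which suits models that do not depend on history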

5.3 Visual Robustness

Method	Normal	BgColor	BgObject	GoalColor
DP-C	0.91	0.12	0.21	0.17
DP-T	0.78	0.22	0.17	0.28
OpenVLA	0.35	0.17	0.13	0.32
UniPi	0.42	0.31	0.36	0.40
UVA	0.98	0.35	0.31	0.64

Methods that generate video (UVA, UniPi) hold up better under visual perturbations. Under goal-color changes UVA reaches 64%, well above the 32% of OpenVLA.

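5.4 Robustness to History Length

Figure 6: On PushT-M, the performance of DP-C drops as the history length grows (overfitting), whereas UVA stays stable or even improves. This suggests that joint video-action modeling helps the model make better use of long temporal context.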

5.5 Video Generation

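Method	Libero10 (FVD, lower is better)	CupArrange (FVD, lower is better)
UniPi	56.55	71.37
UVA (1 step)	89.36	51.34
UVA (8 steps)	51.10	29.72

With 8 autoregressive steps UVA beats UniPi on both datasets. With a single step it is better than UniPi on CupArrange but worse on Libero10, so the additional autoregressive steps improve quality substantially.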

5.6 Forward Dynamics Model

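Method	R-R	R-G	G-R	G-G	Avg
DP-C	0.20	0.50	0.60	0.20	0.38
UVA	0.80	0.70	0.50	0.40	0.60
GT-Dynamics	0.80	0.80	0.70	0.70	0.75

Used as a forward dynamics model to guide the DP-C policy, UVA raises the average success rate from 38% to 60%; guidance with ground-truth dynamics reaches 75%.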

5.7 Inverse Dynamics Model

Method	Position (L2)	Rotation (L2)
UniPi Inverse	1.92 cm	2.21°
UVA	0.75 cm	1.11°
Visual Inertial SLAM	0.41 cm	0.30°

UVA is far more accurate than UniPi's inverse dynamics model. It is somewhat less accurate than SLAM but simpler to implement.

5.8 Inference Speed Breakdown

Module	Time (ms)
VAE Image Encoder	40
Transformer (Attention)	40
Transformer (Flash Attention)	30
Action Diffusion (16 steps)	15
Action Diffusion (100 steps)	93
Video Diffusion (16 steps)	100
UVA total (16 steps)	95
UVA total (100 steps)	173
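
The reported totals are consistent with summing the VAE encoder, the Transformer with standard attention, and the action diffusion head, with video diffusion skipped. A quick check, using hypothetical dictionary keys and a hypothetical function name:

```python
latency_ms = {
    "vae_encoder": 40,
    "transformer_attention": 40,
    "transformer_flash_attention": 30,
    "action_diffusion_16": 15,
    "action_diffusion_100": 93,
}


def policy_latency(action_key, transformer_key="transformer_attention"):
    """Sum the modules on the action-only inference path (video diffusion skipped)."""
    return latency_ms["vae_encoder"] + latency_ms[transformer_key] + latency_ms[action_key]


print(policy_latency("action_diffusion_16"))   # 95
print(policy_latency("action_diffusion_100"))  # 173
# Derived estimate, not a figure reported in the table:
print(policy_latency("action_diffusion_16", "transformer_flash_attention"))  # 85
```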

5.9 Limitations

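UVA does not yet exploit large-scale action-free video data for pretraining, which limits its generalization on real-world tasks
In the single-task real-world setting it is only comparable to DP-UMI (failure-recovery data favors DP-UMI)
Attention in the Transformer accounts for roughly 40% of inference time (40 ms of the 95 ms total at 16 steps) and can be reduced with Flash Attention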
