DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation

Authors: Jiangran Lyu, Ziming Li, Xuesong Shi, Chaoyi Xu, Yizhou Wang, He Wang
Affiliations: Center on Frontiers of Computing Studies & School of Computer Science, Peking University; Galbot; Inst. for Artificial Intelligence, Peking University; State Key Laboratory of General Artificial Intelligence, Peking University
arXiv: 2503.16806
Project Page: pku-epic.github.io/DyWA
GitHub: jiangranlv/DyWA
Venue: ICCV 2025

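1. Motivation

Why non-prehensile manipulation matters: Non-prehensile actions such as pushing, sliding, and flipping are essential for handling objects that are too thin, too large, or otherwise ungraspable, and they greatly extend what a robot can do in unstructured environments.

Two limitations of existing learning-based methods:

Reliance on multi-view cameras and accurate pose tracking: Existing methods (e.g., HACMan, CORN) depend heavily on multi-view setups and an accurate object pose tracking module. In real deployments a multi-view setup is not always available, and the tracking module is often inaccurate.
No generalization across physical conditions: Existing models focus mainly on geometry and ignore the underlying dynamics properties (e.g., object mass, table friction coefficient), so performance drops sharply when physical conditions change.

Problems with the teacher-student distillation framework: An RL teacher policy performs well when it has privileged information, but the distilled student policy degrades substantially under partial observability, for three reasons:

A single view causes severe loss of geometric information.
A Markovian student model can only learn an "averaged" behavior across different dynamics conditions.
Conventional distillation supervises only latent features and final actions, which is not enough to learn the underlying dynamics of contact-rich interaction.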

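2. Idea

DyWA's core insight is to model action learning jointly with future state prediction while adaptively capturing dynamics information from the trajectory history.

Concretely, DyWA proposes three key innovations:

World Action Model: Extends the conventional action model into a world action model that predicts both the action and the next state. Next state prediction provides an extra supervision signal, so action learning and world modeling reinforce each other.
Dynamics Adaptation Module: Inspired by RMA, it extracts a dynamics embedding from historical observation-action pairs, capturing variation in physical parameters such as object mass and friction coefficient.
FiLM Conditioning: Uses Feature-wise Linear Modulation to inject the dynamics embedding into the world action model, giving structured adaptation to different dynamics conditions.

Key difference from existing methods: DyWA performs end-to-end 6D non-prehensile manipulation using only a single-view point cloud and no pose tracking, transfers zero-shot from sim to real, and generalizes across object geometries and physical conditions.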

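3. Method

3.1 Overall Framework

Figure 2 explained: The figure shows the full DyWA pipeline. On the left is the vision-based student policy (used at deployment). It takes the current partial point cloud, end-effector pose, joint state, and goal point cloud as input, and each input passes through its own encoder to produce a feature representation. These features feed the World Action Model, which jointly predicts the action $A_{t}^{s}$ and the next state $S_{t + 1}$. At the top, the Adaptation Module encodes the history of observation-action pairs and decodes it into a Dynamics Embedding, which modulates the intermediate representations of the World Action Model through FiLM layers. On the right is the oracle teacher policy (used only during training), which has privileged information such as the full point cloud, physical parameters, and task state, and produces the teacher action $A_{t}^{t}$ through its encoders and policy net. The teacher policy provides three supervision signals: Imitation Loss (action imitation), World Model Loss (state prediction), and Adaptation Loss (adaptation embedding alignment).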

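3.2 Task Formulation

Task goal: move a tabletop object from its initial 6D pose to a target 6D pose using non-prehensile actions (pushing, flipping, etc.).

Goal Pose: defined as a 6DoF transform $G$ relative to the initial pose
Task State: $S_{t}$ is the relative transform between the object's current pose and the goal pose
Observations: partial point cloud $P_{t}$, joint states $J_{t}$, and end-effector pose $E_{t}$
Success criterion: the object's final pose is within 0.05 m and 0.1 radians of the goal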
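
A minimal sketch of the task state and the success check, assuming poses are 4x4 homogeneous matrices, position error is the distance between pose translations, and rotation error is the geodesic angle of the relative rotation. The function names and the left-multiplication convention for the relative transform are illustrative assumptions, not taken from the paper or repository.

```python
import numpy as np

def relative_transform(T_current, T_goal):
    """Task state sketch: the transform S with S @ T_current == T_goal (assumed convention)."""
    return T_goal @ np.linalg.inv(T_current)

def is_success(T_current, T_goal, pos_tol=0.05, rot_tol=0.1):
    """Check the 0.05 m / 0.1 rad thresholds from the task definition."""
    pos_err = np.linalg.norm(T_goal[:3, 3] - T_current[:3, 3])
    R_rel = T_goal[:3, :3] @ T_current[:3, :3].T
    # Geodesic angle of the relative rotation; clip guards against round-off.
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.arccos(cos_angle)
    return bool(pos_err < pos_tol and rot_err < rot_tol)

def pose_z(angle, translation):
    """Helper for the example: rotation about z plus a translation."""
    c, s = np.cos(angle), np.sin(angle)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    T[:3, 3] = translation
    return T

goal = pose_z(0.50, [0.10, 0.0, 0.0])
near = pose_z(0.45, [0.08, 0.0, 0.0])   # 0.02 m and 0.05 rad away from the goal
far = pose_z(0.00, [0.00, 0.0, 0.0])    # 0.10 m and 0.50 rad away from the goal
print(is_success(near, goal), is_success(far, goal))
```

Both thresholds must hold at once, so a pose that is close in position but rotated by more than 0.1 rad still counts as a failure.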

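3.3 Training Pipeline

Training follows the standard teacher-student policy distillation framework:

Stage 1: Train the RL teacher policy (200K iterations, PPO)

Input is privileged information: full point cloud, physical parameters (mass, friction coefficient), and the exact task state
Trained with PPO; the reward design follows CORN (see the supplementary material)
Variable Impedance Control is the low-level action execution mechanism

Stage 2: Train the student policy (500K iterations, DAgger)

Distill with DAgger: rollouts initially execute teacher actions, and the weight on student actions is gradually increased
The student uses only partial observations (single-view point cloud, joint states, end-effector pose)
Domain Randomization: randomize object mass, scale, friction coefficient, and restitution coefficient
Inject small perturbations into torque commands, point clouds, and the goal pose to improve sim-to-real transfer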
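
The DAgger mixing in Stage 2 can be sketched as below. The linear shape of the schedule is an assumption made for illustration (the notes only say the student weight increases gradually), and `anneal_alpha` and `mixed_action` are hypothetical names.

```python
import numpy as np

def anneal_alpha(step, total_steps):
    """Student weight: 0 means teacher only, 1 means student only.

    A linear ramp is assumed here; the actual schedule is not specified in the notes.
    """
    return float(np.clip(step / total_steps, 0.0, 1.0))

def mixed_action(teacher_action, student_action, alpha):
    """Blend the two actions with weight alpha on the student."""
    return (1.0 - alpha) * teacher_action + alpha * student_action

teacher = np.array([1.0, 1.0, 1.0])
student = np.array([0.0, 0.0, 0.0])
for step in (0, 250_000, 500_000):
    alpha = anneal_alpha(step, 500_000)
    print(step, alpha, mixed_action(teacher, student, alpha))
```

Early in training the rollout stays close to the teacher's state distribution; as alpha grows, the student sees the states its own actions lead to, which is the point of DAgger.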

3.4 World Action Model

The World Action Model is a policy model that predicts both the action and the future state. The core idea is that next state prediction creates a synergistic learning effect.

Observation and Goal Encoding:

Partial Point Cloud → simplified PointNet++ → $f_{t}^{P}$
Joint positions/velocities → shallow MLPs → $f_{t}^{J}$
End-effector pose → shallow MLP → $f_{t}^{E}$
Goal Description: the initial point cloud $P_{0}$ is transformed by the goal pose to give $P_{G} = G P_{0}$, which is encoded by the same (shared) point cloud encoder
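
As an illustration of the goal description $P_{G} = G P_{0}$, the sketch below applies a 4x4 homogeneous transform to an (N, 3) point cloud. The matrix layout and the name `transform_point_cloud` are assumptions for this example.

```python
import numpy as np

def transform_point_cloud(points, G):
    """Apply a 4x4 homogeneous transform G to an (N, 3) array of points."""
    R, t = G[:3, :3], G[:3, 3]
    # Row-vector form of R @ p + t for every point p.
    return points @ R.T + t

# Goal: rotate 90 degrees about z, then lift by 0.1 along z.
G = np.array([
    [0.0, -1.0, 0.0, 0.0],
    [1.0,  0.0, 0.0, 0.0],
    [0.0,  0.0, 1.0, 0.1],
    [0.0,  0.0, 0.0, 1.0],
])
P0 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])
PG = transform_point_cloud(P0, G)
print(PG)  # [[ 0.   1.   0.1], [-1.   0.   0.1]]
```

Because the goal is expressed as a point cloud in the same frame as the observation, the student never needs an explicit object pose estimate.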

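State-based World Modeling: the observation and goal embeddings pass through MLPs that output both the action $A_{t}$ and the next task state $S_{t + 1}$.

The task state uses an object-centric representation (rather than high-dimensional visual signals), so the world model focuses on task-relevant dynamics. Rotations use a 9D representation.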

World Model Loss:

$$L_{world} = \| T_{t+1} - \hat{T}_{t+1} \|_{2}^{2} + \| R_{t+1} - \hat{R}_{t+1} \|_{1} \quad (1)$$

where $T_{t+1} \in \mathbb{R}^{3}$ and $R_{t+1} \in SO(3)$ are the predicted values, and $\hat{T}_{t+1}$ and $\hat{R}_{t+1}$ are the ground truth provided by the simulator.
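
A small numeric sketch of Eq. (1), assuming the L1 term is taken elementwise over the nine entries of the rotation matrix (matching the 9D rotation representation) and that no batch averaging is applied. Both are assumptions for illustration.

```python
import numpy as np

def world_model_loss(T_pred, R_pred, T_gt, R_gt):
    """Squared L2 error on translation plus L1 error on the 3x3 rotation entries."""
    trans_term = np.sum((T_pred - T_gt) ** 2)
    rot_term = np.sum(np.abs(R_pred - R_gt))
    return trans_term + rot_term

T_pred = np.array([0.1, 0.0, 0.0])
T_gt = np.zeros(3)
R_pred = np.eye(3)
R_gt = np.array([[0.0, -1.0, 0.0],      # ground truth: 90 degrees about z
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
loss = world_model_loss(T_pred, R_pred, T_gt, R_gt)
print(loss)  # 0.1**2 + 4 differing entries of magnitude 1
```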

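Imitation Loss:

$$L_{imitation} = \| A_{t}^{s} - A_{t}^{t} \|^{2} \quad (2)$$

Figure 3 explained: The loss curves show the synergy between Dynamics Adaptation (D.A.) and the World Model. Left: imitation loss with D.A. only versus D.A. plus World Model. Adding the World Model makes the imitation loss converge faster and to a lower value, which indicates that next state prediction helps action learning. Right: world model loss with World Model only versus World Model plus D.A. Here D.A. helps the world model predict future states more accurately. Together the two plots support the claim that the modules are complementary.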

3.5 Dynamics Adaptation

Inspired by RMA (Rapid Motor Adaptation), the Adaptation Module extracts environment dynamics information from the trajectory history.

Computing the adaptation embedding:

At each timestep, the observation embedding $f_{t}^{O} = \{f_{t}^{P}, f_{t}^{J}, f_{t}^{E}\}$ is concatenated with the previous step's action embedding $f_{t-1}^{A}$. This gives an observation-action sequence of length $L$, from which a 1D CNN extracts the adaptation embedding:

$$z_{t} = \mathrm{Embed}\left( \left[ \mathrm{concat}(f_{t-i-1}^{O}, f_{t-i-2}^{A}) \right]_{i=1}^{L} \right) \quad (3)$$
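
The indexing in Eq. (3) can be made concrete with a small sketch that assembles the length-$L$ window from per-timestep embedding buffers. The buffer layout and the name `build_history` are hypothetical, and how early timesteps are padded is not specified in the notes, so this version simply rejects windows that reach before step 0.

```python
import numpy as np

def build_history(obs_feats, act_feats, t, L):
    """Stack concat(f^O_{t-i-1}, f^A_{t-i-2}) for i = 1..L, as written in Eq. (3).

    obs_feats[k] and act_feats[k] hold the embeddings at timestep k.
    """
    if t - L - 2 < 0:
        raise ValueError("not enough history for this window")
    rows = [
        np.concatenate([obs_feats[t - i - 1], act_feats[t - i - 2]])
        for i in range(1, L + 1)
    ]
    return np.stack(rows)  # shape (L, obs_dim + act_dim), most recent pair first

# Toy buffers: observation embedding at step k is [k, k], action embedding is [100 + k].
obs_feats = np.arange(10)[:, None] * np.ones((10, 2))
act_feats = 100 + np.arange(10)[:, None] * np.ones((10, 1))
window = build_history(obs_feats, act_feats, t=6, L=3)
print(window)
```

Each row pairs an observation with the action taken one step before it, which is the information needed to infer how the object responded to that action.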

Adaptation Loss: supervises the adaptation embedding to match the teacher encoder's embeddings of the full point cloud and the physical parameters:

$$L_{adapt} = \| z_{t}^{Geo,Phy} - \mathrm{concat}(f_{t}^{Geo}, f_{t}^{Phy}) \|^{2} \quad (4)$$

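3.6 FiLM Conditioning

After the adaptation embedding is decoded into the Dynamics Embedding, it is injected into the World Action Model through Feature-wise Linear Modulation (FiLM).

Each FiLM block contains two shallow MLPs that map the dynamics embedding to modulation parameters $γ$ and $β$, which apply an affine transformation to the intermediate feature $f$:

$$\mathrm{FiLM}(f \mid γ, β) = γ f + β \quad (5)$$

FiLM blocks are densely integrated into the first few layers of the World Action Model, while the later layers remain unconditioned. This design has been shown to be effective for integrating language guidance into vision encoders.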
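
A worked numeric instance of Eq. (5), with hand-picked values standing in for the MLP outputs (the numbers are illustrative only):

```python
import numpy as np

def film(f, gamma, beta):
    """Feature-wise affine modulation: each channel is scaled and shifted independently."""
    return gamma * f + beta

f = np.array([[1.0, -2.0, 0.5]])       # one sample, three feature channels
gamma = np.array([[2.0, 1.0, 0.0]])    # amplify, keep, suppress
beta = np.array([[0.0, 0.5, 1.0]])
out = film(f, gamma, beta)
print(out)  # [[ 2.  -1.5  1. ]]
```

Setting gamma to 1 and beta to 0 leaves the features unchanged, so the conditioning can fall back to the unconditioned network when the dynamics embedding carries no useful signal.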

3.7 Action Space with Variable Impedance

The action space consists of the end-effector subgoal residual $Δ T_{ee} \in SE(3)$ and the joint-space impedance parameters:

Position gains $P \in \mathbb{R}^{7}$
Damping factors $ρ \in \mathbb{R}^{7}$, with velocity gains $D = ρ P$

The target joint positions are solved with damped least squares inverse kinematics:

$$q_{d} = q_{t} + IK(Δ T_{ee}) \quad (6)$$

The joint-space impedance controller is implemented with the Polymetis API.
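
A minimal sketch of Eq. (6) followed by a joint-space impedance law. It assumes $Δ T_{ee}$ has already been converted to a 6D displacement vector, takes the Jacobian and the gains as given inputs, and omits gravity and Coriolis compensation. The function names and the toy Jacobian are illustrative, and the code does not use the Polymetis API.

```python
import numpy as np

def dls_ik_step(J, delta_x, damping=0.05):
    """Damped least squares: dq = J^T (J J^T + lambda^2 I)^-1 dx."""
    JJt = J @ J.T
    reg = (damping ** 2) * np.eye(JJt.shape[0])
    return J.T @ np.linalg.solve(JJt + reg, delta_x)

def impedance_torque(q, dq, q_d, kp, kd):
    """Spring toward the target joint position, damping on joint velocity."""
    return kp * (q_d - q) - kd * dq

J = np.hstack([np.eye(6), np.zeros((6, 1))])   # toy 6x7 Jacobian, not a real robot's
delta_x = np.full(6, 0.01)                      # desired end-effector displacement
q = np.zeros(7)
q_d = q + dls_ik_step(J, delta_x, damping=0.0)  # Eq. (6)
tau = impedance_torque(q, np.zeros(7), q_d,
                       kp=np.full(7, 100.0), kd=np.full(7, 10.0))
print(q_d, tau)
```

The damping term keeps the joint update bounded near singular configurations, at the cost of slightly undershooting the requested displacement.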

3.8 Overall Training Objective

$$L = L_{imitation} + L_{world} + L_{adapt} \quad (7)$$

3.9 Pseudocode

Algorithm 1: Teacher Policy Training (PPO)

Algorithm: Teacher Policy Training via PPO
Input: IsaacGym env with privileged state (full point cloud, physics params, task state),
       StateEncoder, PiNet (actor), VNet (critic)
Output: Trained teacher policy π_teacher

Initialize PPO agent with StateEncoder, PiNet, VNet
for step in range(200_000):
    # Collect rollouts with privileged information
    obs = env.get_privileged_obs()  # full point cloud, physics params, task state, joint & EE
    features = StateEncoder(obs)
    action_dist = PiNet(features)   # Gaussian distribution
    action = action_dist.sample()
    old_log_prob = action_dist.log_prob(action)
    value = VNet(features)
    next_obs, reward, done, info = env.step(action)

    # Reward bootstrapping on timeout, applied before the transition is stored
    if info['timeout']:
        reward += gamma * VNet(StateEncoder(next_obs))
    # Store transition (obs, action, reward, done, value, old_log_prob)

    # PPO update with GAE over the stored rollout
    advantages = gae_ax1(rewards, values, dones, gamma, gae_lambda)
    returns = advantages + values
    for epoch in range(ppo_epochs):
        features = StateEncoder(stored_obs)
        new_log_prob = PiNet(features).log_prob(stored_actions)
        ratio = exp(new_log_prob - old_log_probs)
        surr1 = ratio * advantages
        surr2 = clip(ratio, 1 - eps, 1 + eps) * advantages
        policy_loss = -min(surr1, surr2).mean()
        value_loss = ((returns - VNet(features)) ** 2).mean()
        loss = policy_loss + value_coef * value_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Algorithm 2: Student Policy Distillation (DAgger + World Action Model)

Algorithm: DyWA Student Distillation via DAgger
Input: Trained teacher π_teacher, env, PointNet++, AdaptationModule, WorldActionModel
Output: Trained student policy (DyWA)

Initialize student encoders, WorldActionModel, AdaptationModule
for step in range(500_000):
    # DAgger mixing weight, annealed from 0 (teacher) to 1 (student) over training
    alpha = anneal_schedule(step)

    # Get observations
    partial_pc, joint_state, ee_pose = env.get_student_obs()
    goal_pc = transform(initial_pc, goal_pose)  # P_G = G * P_0

    # Encode observations
    f_P = PointNetPP(partial_pc)   # point cloud feature
    f_J = MLP_joint(joint_state)   # joint feature
    f_E = MLP_ee(ee_pose)          # end-effector feature
    f_G = PointNetPP(goal_pc)      # goal feature (shared encoder)

    # Adaptation: encode L history obs-action pairs (Eq. 3)
    f_O = concat(f_P, f_J, f_E)
    history = [(f_{t-i-1}^O, f_{t-i-2}^A) for i in 1..L]
    z_t = Conv1D_Adapter(history)  # adaptation embedding
    dynamics_emb = Decoder(z_t)    # dynamics embedding

    # FiLM conditioning: each early layer has its own FiLM block
    features = concat(f_P, f_J, f_E, f_G)
    for layer, film in zip(WorldActionModel.early_layers, film_blocks):
        gamma, beta = film(dynamics_emb)
        features = gamma * layer(features) + beta
    for layer in WorldActionModel.late_layers:
        features = layer(features)  # unconditioned

    # Predict the student action and the next task state (translation T, rotation R)
    A_s, (T_pred, R_pred) = WorldActionModel.heads(features)

    # Get teacher action with privileged info
    A_t = teacher.get_action(env.get_privileged_obs())

    # Execute mixed action (DAgger)
    action = (1 - alpha) * A_t + alpha * A_s
    env.step(action)
    # T_gt, R_gt: next task state reported by the simulator after this step

    # Compute losses (Eqs. 1, 2, 4, 7)
    L_imitation = ||A_s - A_t||^2
    L_world = ||T_pred - T_gt||_2^2 + ||R_pred - R_gt||_1
    L_adapt = ||z_t^{Geo,Phy} - concat(f_t^{Geo}, f_t^{Phy})||^2
    L_total = L_imitation + L_world + L_adapt

    optimizer.zero_grad()
    L_total.backward()
    optimizer.step()

Algorithm 3: FiLM Conditioning Module

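Algorithm: FiLM Dynamics Conditioning
Input: latent feature f, dynamics_embedding z
Output: conditioned feature f'

from torch import nn

def MLP(in_dim, out_dim, hidden_dim=64):
    # Stand-in for the repo's MLP helper; the hidden width is a placeholder.
    return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                         nn.Linear(hidden_dim, out_dim))

class FiLMBlock(nn.Module):
    def __init__(self, feature_dim, dynamics_dim):
        super().__init__()  # must run before submodules are registered
        self.gamma_mlp = MLP(dynamics_dim, feature_dim)  # scaling
        self.beta_mlp = MLP(dynamics_dim, feature_dim)   # shifting

    def forward(self, f, dynamics_emb):
        gamma = self.gamma_mlp(dynamics_emb)
        beta = self.beta_mlp(dynamics_emb)
        return gamma * f + beta  # affine modulation (Eq. 5)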

Algorithm 4: Adaptation Module

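Algorithm: Dynamics Adaptation via History Encoding
Input: history of observations {f_O} and actions {f_A} of length L
Output: adaptation embedding z_t, dynamics embedding

import torch
from torch import nn

def MLP(in_dim, out_dim, hidden_dim=64):
    # Stand-in for the repo's MLP helper; the hidden width is a placeholder.
    return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                         nn.Linear(hidden_dim, out_dim))

class AdaptationModule(nn.Module):
    # hidden_dim, dynamics_dim, and kernel_size were undefined in the original sketch;
    # they are exposed as arguments here, and their defaults are placeholders.
    def __init__(self, obs_dim, action_dim, embed_dim,
                 hidden_dim=128, dynamics_dim=64, kernel_size=3):
        super().__init__()
        self.action_encoder = MLP(action_dim, embed_dim)
        self.conv1d = nn.Sequential(
            nn.Conv1d(obs_dim + embed_dim, hidden_dim, kernel_size=kernel_size),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, embed_dim, kernel_size=kernel_size),
        )
        self.decoder = MLP(embed_dim, dynamics_dim)  # z -> dynamics embedding

    def forward(self, obs_history, action_history):
        # obs_history: [B, L, obs_dim], action_history: [B, L, action_dim]
        # The two unpadded convolutions need L >= 2 * kernel_size - 1.
        f_A = self.action_encoder(action_history)
        pairs = torch.cat([obs_history, f_A], dim=-1)  # [B, L, obs_dim + embed_dim]
        pairs = pairs.transpose(1, 2)                  # [B, C, L] for Conv1d
        z_t = self.conv1d(pairs).mean(dim=-1)          # [B, embed_dim]
        dynamics_emb = self.decoder(z_t)
        return z_t, dynamics_emb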

3.10 Code-to-Paper Mapping Table

Paper Concept	Source File	Key Class/Function
PPO Teacher Training	`dywa/exp/train/train_ppo_arm.py`	`inner_main()`, `load_agent()`
PPO Algorithm	`dywa/src/models/rl/v6/ppo.py`	`PPO`, `gae_ax1()`
Student Distillation (DAgger)	`dywa/exp/train/train_rma.py`	`DAggerTrainerEnv`, `RMATrainerEnv`
Student Agent (RMA + World Model)	`dywa/exp/train/distill.py`	`StudentAgentRMA`
Point Cloud Encoder (PointNet++)	`dywa/src/models/pointnet2.py`	PointNet++
Point Cloud (PointMAE)	`dywa/src/models/cloud/`	Cloud models
Common NN Modules (MLP, GRU, CNN, Attention)	`dywa/src/models/common.py`	`MLP`, `SingleGRU`, `SimpleCNN`, `FiLM`*
Environment Wrappers (Reward, Obs)	`dywa/src/env/env/wrap/`	60+ wrapper classes
Robot Controllers	`dywa/src/env/robot/`	Franka, UR5 implementations
Domain Randomization / Config	`dywa/src/data/cfg/`	YAML configs
CUDA Kinematics	`dywa/c_src/`	Franka/UR5 forward kinematics

*Note: The concrete FiLM implementation may live in StudentAgentRMA or in a separate model file; common.py contains the basic NN building blocks.

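4. Experimental Setup

Simulation Environment

Simulator: IsaacGym
Training objects: DexGraspNet 323-object asset
Test objects: 10 geometrically diverse unseen objects, each scaled to 5 sizes, for 50 evaluation objects in total
Domain Randomization: object mass, scale, friction coefficient, and restitution coefficient are randomized

Evaluation Dimensions

Dimension	Setting
Viewpoint	Single-view (1 camera) vs. Multi-view (3 cameras)
State	Known state (GT pose) vs. Unknown state (goal point cloud)
Objects	Seen (training set) vs. Unseen (test set)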

Baselines

Method	Type	Description
HACMan	Primitive-based	Point-cloud-based contact location selection + motion primitives
CORN	Closed-loop	Teacher-student distillation built on an object representation
CORN (PN++)	Closed-loop	CORN with its point cloud encoder replaced by PointNet++

Ablation Variants

DAgger (baseline): no World Model, no D.A., no FiLM
World Model only
RMA (D.A.) only
Ours w/o W.M.: D.A. + FiLM, no World Model
Ours w/o FiLM: W.M. + D.A., no FiLM

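Training Configuration

Teacher: PPO, 200K iterations
Student: DAgger, 500K iterations
Success criterion: position error < 0.05 m and rotation error < 0.1 rad

Real-World Setup

Robot: Franka robot arm
Camera: RealSense D435 (side view)
Objects: 10 unseen real objects (including slippery objects, a half-full water bottle, etc.)
Pose evaluation: Iterative Closest Point (ICP)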

5. Experimental Results

5.1 Main Simulation Results (Table 1)

Methods	Action Type	Known State (3 view) Seen/Unseen	Unknown State (3 view) Seen/Unseen	Unknown State (1 view) Seen/Unseen
HACMan	Primitive	3.8(42.2) / 5.7(39.4)	3.0(23.6) / 4.1(26.5)	1.5(17.9) / 2.9(18.3)
CORN	Closed-loop	86.8 / 79.9	46.0 / 47.8	29.0 / 29.8
CORN (PN++)	Closed-loop	87.3 / 84.3	76.1 / 75.7	50.7 / 49.4
Ours	Closed-loop	87.9 / 85.0	85.8 / 82.3	82.2 / 75.0

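Key findings:

DyWA outperforms the baselines on every evaluation track. The gain is largest in the most challenging setting, Unknown State + 1 view: +31.5 percentage points in success rate over CORN (PN++) on seen objects (82.2 vs. 50.7).
Unknown state + single view is the setting that best shows the benefit of DyWA's dynamics modeling.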

5.2 Ablation Study (Table 2) — Unknown State, 1 View

Methods	W.M.	D.A.	FiLM	Seen	Unseen
DAgger	-	-	-	59.9	57.5
World Model	✓	-	-	61.6	59.4
RMA	-	✓	-	65.6	57.9
Ours w/o W.M.	-	✓	✓	70.0	63.7
Ours w/o FiLM	✓	✓	-	73.3	59.4
Ours (full)	✓	✓	✓	82.2	75.0

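Key findings:

World Model and Dynamics Adaptation are complementary: on seen objects, either one alone gives only a small gain over DAgger (+1.7 / +5.7 percentage points), but combining them lifts the success rate from 59.9% to 73.3% (+13.4 points).
Effectiveness of FiLM: FiLM provides more effective structured conditioning than direct concatenation, adding a further +8.9 points (73.3% → 82.2%).
The full three-module combination reaches the best performance, 82.2% / 75.0%.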
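
The gains quoted above follow directly from the Seen column of Table 2. A short check, with the values copied from the table and a hypothetical helper `gain`:

```python
# Seen-object success rates (%) from Table 2.
seen = {
    "DAgger": 59.9,
    "World Model": 61.6,
    "RMA": 65.6,
    "Ours w/o W.M.": 70.0,
    "Ours w/o FiLM": 73.3,
    "Ours (full)": 82.2,
}

def gain(variant, reference):
    """Difference in percentage points, rounded to one decimal place."""
    return round(seen[variant] - seen[reference], 1)

print(gain("World Model", "DAgger"))         # World Model alone
print(gain("RMA", "DAgger"))                 # Dynamics Adaptation alone
print(gain("Ours w/o FiLM", "DAgger"))       # both, without FiLM
print(gain("Ours (full)", "Ours w/o FiLM"))  # contribution of FiLM
```

The combined gain (13.4) is larger than the sum of the two individual gains (1.7 + 5.7 = 7.4), which is the basis for calling the modules complementary.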

5.3 Real-World Experiments (Table 3)

Methods	Mug	Bulldozer	Card	Book	Dinosaur	Chips Can	Switch	YCB-Bottle	Half-full Bottle	Coffee jar	Avg.
CORN w/ tracking	1/5	3/5	4/5	4/5	2/5	0/5	2/5	0/5	0/5	2/5	36%
Ours	3/5	4/5	4/5	4/5	3/5	2/5	4/5	3/5	4/5	3/5	68%

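Key findings:

Without any external pose tracking, DyWA reaches a 68% average success rate, well above CORN with tracking (36%).
The advantage is most pronounced on the slippery object (YCB-Bottle, 3/5 vs. 0/5) and the object with non-uniform mass (Half-full Bottle, 4/5 vs. 0/5).
The policy transfers zero-shot from sim to real.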
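
The averages in Table 3 can be reproduced from the per-object counts (five trials per object, ten objects). The helper name `success_rate` is illustrative:

```python
# Successes out of 5 trials per object, in the column order of Table 3.
corn_with_tracking = [1, 3, 4, 4, 2, 0, 2, 0, 0, 2]
ours = [3, 4, 4, 4, 3, 2, 4, 3, 4, 3]

def success_rate(successes, trials_per_object=5):
    """Average success rate in percent over all objects and trials."""
    return 100.0 * sum(successes) / (trials_per_object * len(successes))

print(success_rate(corn_with_tracking))  # 18 of 50 trials
print(success_rate(ours))                # 34 of 50 trials
```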

5.4 Robustness to Friction Coefficient (Table 4)

Methods	$μ_{1}$ S.R./Time	$μ_{2}$ S.R./Time	$μ_{3}$ S.R./Time	$μ_{4}$ S.R./Time
Ours w/o D.A.	3/5, 65s	3/5, 81s	4/5, 96s	3/5, 124s
Ours	4/5, 45s	4/5, 50s	4/5, 49s	4/5, 51s

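Key finding: without D.A., execution time varies widely across friction coefficients (65 s to 124 s), whereas the full DyWA keeps a stable success rate (4/5 in every setting) and execution time (45 s to 51 s).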

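5.5 VLM Applications

Figure 5 explained: An example of integrating DyWA with Vision-Language Models (VLMs). The SoFar model converts a natural-language instruction (e.g., "Put the grip of the electric drill into a person's hand") into a semantic object pose, which serves as DyWA's goal input. This shows that DyWA's goal-conditioned policy can be combined with a VLM for language-driven manipulation.

Figure 6 explained: DyWA used as a pre-grasping step. For objects that are hard to grasp directly (e.g., a thin card lying flat, or a cracker box wider than the gripper span), DyWA first flips or rotates the object into a graspable pose, and a grasping model then completes the grasp, which markedly improves grasp success.

5.6 Limitations

The method relies only on point clouds as visual input, so symmetric objects are geometrically ambiguous.
Transparent and specular objects are difficult because their depth information is incomplete.
Future direction: add appearance information to provide richer visual cues.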

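RL Usage Analysis

The paper uses Reinforcement Learning (RL), but only to train the teacher policy, not the student policy that is ultimately deployed.

Reward Model

The paper states that the reward design follows CORN, with details in the supplementary material.
The RL teacher is trained with PPO on a state-based reward in the IsaacGym simulation environment.
The reward is computed by the environment wrapper system (the code shows wrappers such as AddWrenchPenalty, AddTrackingReward, and AddSuccessAsObs), not by a separate reward model.

VLM as Reward Judge

No. The paper does not use a VLM as a zero-shot reward judge.
The VLM is used only at the application stage (Section 4.4; covered under 5.5 above) to convert natural-language instructions into a goal pose. It takes no part in reward computation.

Specific Reward Signals

The paper does not spell out the reward function (it points to CORN and the supplementary material).
Judging from the code structure, the reward includes:
- Tracking Reward: based on the distance between the object pose and the goal pose
- Wrench Penalty: penalizes excessive contact force/torque
- Success Signal: a success signal when the goal pose thresholds are met
This is a standard simulation-based reward, not a learned reward model.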
