DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
Authors: Jiangran Lyu, Ziming Li, Xuesong Shi, Chaoyi Xu, Yizhou Wang, He Wang Affiliations: Center on Frontiers of Computing Studies & School of Computer Science, Peking University, Galbot, Inst. for Artificial Intelligence, Peking University, State Key Laboratory of General Artificial Intelligence, Peking University arXiv: 2503.16806 Project Page: pku-epic.github.io/DyWA GitHub: jiangranlv/DyWA Venue: ICCV 2025
1. Motivation (研究动机)
Non-prehensile manipulation(非抓取操作)的现实需求: 推、滑、翻转等非抓取操作对于处理太薄、太大或无法抓取的物体至关重要,极大地扩展了机器人在非结构化环境中的能力。
现有学习方法的两大局限:
- 依赖多视角相机和精确 Pose Tracking: 现有方法(如 HACMan、CORN)严重依赖多视角设置和精确的物体姿态追踪模块。在实际部署中,多视角设置不总是可用,追踪模块也经常不准确。
- 无法泛化到不同物理条件: 现有模型主要关注几何形状,忽略了底层动力学特性(如物体质量、桌面摩擦系数),导致在物理条件变化时性能严重下降。
Teacher-Student Distillation 框架的问题: 虽然 RL Teacher Policy 在获得 privileged information 时表现优秀,但蒸馏得到的 Student Policy 在 partial observability 下性能大幅下降,原因有三:
- 单视角导致严重的几何信息缺失
- Markovian Student Model 只能学习跨不同动力学条件的”平均”行为
- 传统蒸馏方法仅监督 latent features 和最终 actions,不足以学习 contact-rich 交互的底层动力学
2. Idea (核心思想)
DyWA 的核心洞察是:将动作学习与未来状态预测联合建模,同时从历史轨迹中自适应地捕获动力学信息。
具体来说,DyWA 提出了三个关键创新:
- World Action Model: 将传统的 action model 扩展为同时预测动作和下一步状态的 world action model,通过 next state prediction 提供额外的监督信号,形成 action learning 和 world modeling 的协同效应。
- Dynamics Adaptation Module: 受 RMA 启发,利用历史 observation-action pairs 提取 dynamics embedding,捕获物体质量、摩擦系数等物理参数的变化。
- FiLM Conditioning: 使用 Feature-wise Linear Modulation 将 dynamics embedding 注入 world action model,实现对不同动力学条件的结构化适应。
与现有方法的根本区别:DyWA 实现了仅使用单视角点云、无需 Pose Tracking 的端到端 6D 非抓取操作,并能零样本 Sim-to-Real 迁移,跨越不同物体几何形状和物理条件泛化。
3. Method (方法)
3.1 Overall Framework(整体框架)
Figure 2 解读: 此图展示了 DyWA 的完整 pipeline。左侧为 Vision-based Student Policy(部署时使用),接收当前 partial point cloud、end-effector pose、joint state 和 goal point cloud 作为输入,通过各自的 Encoder 得到特征表示。这些特征输入 World Action Model,同时预测动作 和下一步状态 。上方的 Adaptation Module 编码历史 observation-action pairs,解码为 Dynamics Embedding,通过 FiLM 层调制 World Action Model 的中间表示。右侧为 Oracle Teacher Policy(仅训练时使用),拥有完整点云、物理参数、任务状态等 privileged information,通过 Encoders 和 Policy Net 产生教师动作 。教师策略提供三种监督信号:Imitation Loss(动作模仿)、World Model Loss(状态预测)、Adaptation Loss(adaptation embedding 对齐)。
3.2 Task Formulation(任务定义)
任务目标:通过非抓取操作(推、翻转等)将桌面物体从初始 6D pose 移动到目标 6D pose。
- Goal Pose: 定义为相对于初始 pose 的 6DoF 变换
- Task State: 为物体当前 pose 与目标 pose 之间的相对变换
- Observations: 包含 partial point cloud 、joint states 、end-effector pose
- 成功标准: 物体最终 pose 与目标相差 0.05m 以内且 0.1 radians 以内
3.3 Training Pipeline(训练流程)
采用标准的 Teacher-Student Policy Distillation 框架:
Stage 1: 训练 RL Teacher Policy(200K iterations, PPO)
- 输入 privileged information:完整点云、物理参数(质量、摩擦系数)、精确 task state
- 使用 PPO 算法训练,reward design 与 CORN 一致(详见补充材料)
- 采用 Variable Impedance Control 作为底层动作执行机制
Stage 2: 训练 Student Policy(500K iterations, DAgger)
- 使用 DAgger 进行蒸馏:初始使用教师动作执行,逐渐增加学生动作权重
- Student 仅使用 partial observation(单视角点云、关节状态、end-effector pose)
- Domain Randomization:随机化物体质量、缩放比例、摩擦系数、恢复系数
- 注入小扰动到 torque commands、点云和 goal pose 以增强 Sim-to-Real 迁移
3.4 World Action Model(世界动作模型)
World Action Model 是一种同时预测动作和未来状态的策略模型,核心思想是通过 next state prediction 创造协同学习效应。
Observation and Goal Encoding:
- Partial Point Cloud → 简化的 PointNet++ →
- Joint positions/velocities → shallow MLPs →
- End-effector pose → shallow MLP →
- Goal Description: 将初始点云 通过目标 pose 变换得到 ,共享同一 point cloud encoder
State-based World Modeling: observation 和 goal embeddings 经过 MLPs 同时产生动作 和下一步 task state 。
采用 object-centric 的 task state 表示(而非高维视觉信号),使 world model 聚焦于 task-relevant dynamics。旋转表示采用 9D representation。
World Model Loss:
其中 和 为预测值, 和 为仿真器提供的 ground truth。
Imitation Loss:
Figure 3 解读: Loss 曲线展示了 Dynamics Adaptation(D.A.)和 World Model 的协同效应。左图:对比仅用 D.A. 和同时加入 World Model 的 Imitation Loss,可以看到加入 World Model 后 imitation loss 收敛更快更低,表明 next state prediction 对 action learning 有促进作用。右图:对比仅用 World Model 和同时加入 D.A. 的 World Model Loss,D.A. 帮助 world model 更好地预测未来状态。这验证了两个模块的互补性。
3.5 Dynamics Adaptation(动力学自适应)
受 RMA(Rapid Motor Adaptation)启发,设计了 Adaptation Module 来从历史轨迹中提取环境动力学信息。
Adaptation Embedding 计算:
在每个 timestep,将 observation embedding 与上一步 action embedding 拼接,构造长度为 的 observation-action 序列,通过 1D CNN 提取 adaptation embedding:
Adaptation Loss: 监督 adaptation embedding 对齐教师 encoder 的完整点云和物理参数 embedding:
3.6 FiLM Conditioning(动力学条件注入)
Adaptation embedding 解码为 Dynamics Embedding 后,通过 Feature-wise Linear Modulation (FiLM) 注入 World Action Model。
每个 FiLM block 包含两个 shallow MLPs,从 dynamics embedding 产生调制参数 和 ,对中间特征 进行 affine transformation:
FiLM blocks 密集地集成在 World Action Model 的前几层,后面几层保持无条件。这种设计在视觉编码器中集成语言引导时已被证明高效。
3.7 Action Space with Variable Impedance(变阻抗动作空间)
动作空间包含 end-effector 的 subgoal residual 以及 joint-space impedance 参数:
- 位置增益
- 阻尼因子 ,速度增益
目标关节位置通过 damped least squares 逆运动学求解:
使用 Polymetis API 实现 joint-space impedance controller。
3.8 Overall Training Objective(总体训练目标)
3.9 Pseudocode
Algorithm 1: Teacher Policy Training (PPO)
Algorithm: Teacher Policy Training via PPO
Input: IsaacGym env with privileged state (full PC, physics params, task state),
StateEncoder, PiNet (actor), VNet (critic)
Output: Trained teacher policy π_teacher
1: Initialize PPO agent with StateEncoder, PiNet, VNet
2: for step = 1 to 200K:
3: # Collect rollouts with privileged information
4: obs = env.get_privileged_obs() # full point cloud, physics params, task state, joint & EE
5: features = StateEncoder(obs)
6: action_dist = PiNet(features) # Gaussian distribution
7: action = action_dist.sample()
8: next_obs, reward, done, info = env.step(action)
9: value = VNet(features)
10: # Store transition (obs, action, reward, done, value, log_prob)
11:
12: # PPO update with GAE
13: advantages = gae_ax1(rewards, values, dones, gamma, gae_lambda)
14: returns = advantages + values
15: for epoch in range(ppo_epochs):
16: ratio = exp(new_log_prob - old_log_prob)
17: surr1 = ratio * advantages
18: surr2 = clip(ratio, 1-eps, 1+eps) * advantages
19: policy_loss = -min(surr1, surr2).mean()
20: value_loss = (returns - VNet(features))^2
21: loss = policy_loss + value_coef * value_loss
22: optimizer.step(loss)
23:
24: # Reward bootstrapping on timeout
25: if info['timeout']:
26: reward += gamma * VNet(next_obs)Algorithm 2: Student Policy Distillation (DAgger + World Action Model)
Algorithm: DyWA Student Distillation via DAgger
Input: Trained teacher π_teacher, env, PointNet++, AdaptationModule, WorldActionModel
Output: Trained student policy (DyWA)
1: Initialize student encoders, WorldActionModel, AdaptationModule
2: alpha = 0 # DAgger mixing: 0=teacher, 1=student
3: for step = 1 to 500K:
4: # Anneal alpha from 0 to 1 over training
5: alpha = anneal_schedule(step)
6:
7: # Get observations
8: partial_pc, joint_state, ee_pose = env.get_student_obs()
9: goal_pc = transform(initial_pc, goal_pose) # P_G = G * P_0
10:
11: # Encode observations
12: f_P = PointNetPP(partial_pc) # point cloud feature
13: f_J = MLP_joint(joint_state) # joint feature
14: f_E = MLP_ee(ee_pose) # end-effector feature
15: f_G = PointNetPP(goal_pc) # goal feature (shared encoder)
16:
17: # Adaptation: encode L history obs-action pairs
18: f_O = concat(f_P, f_J, f_E)
19: history = [(f_{t-i-1}^O, f_{t-i-2}^A) for i in 1..L]
20: z_t = Conv1D_Adapter(history) # adaptation embedding
21: dynamics_emb = Decoder(z_t) # dynamics embedding
22:
23: # FiLM conditioning on WorldActionModel
24: features = concat(f_P, f_J, f_E, f_G)
25: for layer in WorldActionModel.early_layers:
26: gamma, beta = FiLM_MLP(dynamics_emb)
27: features = gamma * layer(features) + beta
28: for layer in WorldActionModel.late_layers:
29: features = layer(features) # unconditioned
30:
31: A_s, S_hat_{t+1} = WorldActionModel.heads(features) # predict action + next state
32:
33: # Get teacher action with privileged info
34: A_t = teacher.get_action(env.get_privileged_obs())
35:
36: # Execute mixed action (DAgger)
37: action = (1 - alpha) * A_t + alpha * A_s
38: env.step(action)
39:
40: # Compute losses
41: L_imitation = ||A_s - A_t||^2
42: L_world = ||T_{t+1} - T_hat_{t+1}||_2^2 + ||R_{t+1} - R_hat_{t+1}||_1
43: L_adapt = ||z_t^{Geo,Phy} - concat(f_t^{Geo}, f_t^{Phy})||^2
44: L_total = L_imitation + L_world + L_adapt
45:
46: optimizer.step(L_total)Algorithm 3: FiLM Conditioning Module
Algorithm: FiLM Dynamics Conditioning
Input: latent feature f, dynamics_embedding z
Output: conditioned feature f'
1: class FiLMBlock(nn.Module):
2: def __init__(self, feature_dim, dynamics_dim):
3: self.gamma_mlp = MLP(dynamics_dim, feature_dim) # scaling
4: self.beta_mlp = MLP(dynamics_dim, feature_dim) # shifting
5:
6: def forward(self, f, dynamics_emb):
7: gamma = self.gamma_mlp(dynamics_emb)
8: beta = self.beta_mlp(dynamics_emb)
9: return gamma * f + beta # affine modulationAlgorithm 4: Adaptation Module
Algorithm: Dynamics Adaptation via History Encoding
Input: history of observations {f_O} and actions {f_A} of length L
Output: adaptation embedding z_t, dynamics embedding
1: class AdaptationModule(nn.Module):
2: def __init__(self, obs_dim, action_dim, embed_dim, L):
3: self.action_encoder = MLP(action_dim, embed_dim)
4: self.conv1d = nn.Sequential(
5: Conv1d(obs_dim + embed_dim, hidden_dim, kernel_size=k),
6: ReLU(),
7: Conv1d(hidden_dim, embed_dim, kernel_size=k),
8: )
9: self.decoder = MLP(embed_dim, dynamics_dim) # z -> dynamics embedding
10:
11: def forward(self, obs_history, action_history):
12: # obs_history: [B, L, obs_dim], action_history: [B, L, action_dim]
13: f_A = self.action_encoder(action_history)
14: pairs = concat(obs_history, f_A, dim=-1) # [B, L, obs_dim + embed_dim]
15: pairs = pairs.transpose(1, 2) # [B, C, L] for Conv1d
16: z_t = self.conv1d(pairs).mean(dim=-1) # [B, embed_dim]
17: dynamics_emb = self.decoder(z_t)
18: return z_t, dynamics_emb3.10 Code-to-Paper Mapping Table
| Paper Concept | Source File | Key Class/Function |
|---|---|---|
| PPO Teacher Training | dywa/exp/train/train_ppo_arm.py | inner_main(), load_agent() |
| PPO Algorithm | dywa/src/models/rl/v6/ppo.py | PPO, gae_ax1() |
| Student Distillation (DAgger) | dywa/exp/train/train_rma.py | DAggerTrainerEnv, RMATrainerEnv |
| Student Agent (RMA + World Model) | dywa/exp/train/distill.py | StudentAgentRMA |
| Point Cloud Encoder (PointNet++) | dywa/src/models/pointnet2.py | PointNet++ |
| Point Cloud (PointMAE) | dywa/src/models/cloud/ | Cloud models |
| Common NN Modules (MLP, GRU, CNN, Attention) | dywa/src/models/common.py | MLP, SingleGRU, SimpleCNN, FiLM* |
| Environment Wrappers (Reward, Obs) | dywa/src/env/env/wrap/ | 60+ wrapper classes |
| Robot Controllers | dywa/src/env/robot/ | Franka, UR5 implementations |
| Domain Randomization / Config | dywa/src/data/cfg/ | YAML configs |
| CUDA Kinematics | dywa/c_src/ | Franka/UR5 forward kinematics |
*注: FiLM 的具体实现可能在 StudentAgentRMA 或单独的模型文件中,common.py 包含基础 NN 组件。
4. Experimental Setup (实验设置)
仿真环境
- Simulator: IsaacGym
- 训练物体: DexGraspNet 323-object asset
- 测试物体: 10 个几何多样的 unseen 物体,每个缩放至 5 种尺寸 → 共 50 个评估物体
- Domain Randomization: 物体质量、缩放、摩擦系数、恢复系数随机化
评估维度
| 维度 | 设置 |
|---|---|
| 视角 | Single-view (1 camera) vs. Multi-view (3 cameras) |
| 状态 | Known state (GT pose) vs. Unknown state (goal point cloud) |
| 物体 | Seen (训练集) vs. Unseen (测试集) |
Baselines
| 方法 | 类型 | 说明 |
|---|---|---|
| HACMan | Primitive-based | 基于点云的接触位置选择 + 运动原语 |
| CORN | Closed-loop | 基于物体表示的 teacher-student 蒸馏 |
| CORN (PN++) | Closed-loop | 替换 CORN 的 point cloud encoder 为 PointNet++ |
Ablation 变体
- DAgger (baseline): 无 World Model、无 D.A.、无 FiLM
- World Model only
- RMA (D.A.) only
- Ours w/o W.M.: 有 D.A. + FiLM,无 World Model
- Ours w/o FiLM: 有 W.M. + D.A.,无 FiLM
训练配置
- Teacher: PPO, 200K iterations
- Student: DAgger, 500K iterations
- 成功标准: 位置误差 < 0.05m 且旋转误差 < 0.1 rad
真实世界设置
- 机器人: Franka robot arm
- 相机: RealSense D435 (side view)
- 物体: 10 个 unseen 真实物体(含 slippery 物体、半满水瓶等)
- Pose 评估: Iterative Closest Point (ICP)
5. Experimental Results (实验结果)
5.1 仿真主实验结果 (Table 1)
| Methods | Action Type | Known State (3 view) Seen/Unseen | Unknown State (3 view) Seen/Unseen | Unknown State (1 view) Seen/Unseen |
|---|---|---|---|---|
| HACMan | Primitive | 3.8(42.2) / 5.7(39.4) | 3.0(23.6) / 4.1(26.5) | 1.5(17.9) / 2.9(18.3) |
| CORN | Closed-loop | 86.8 / 79.9 | 46.0 / 47.8 | 29.0 / 29.8 |
| CORN (PN++) | Closed-loop | 87.3 / 84.3 | 76.1 / 75.7 | 50.7 / 49.4 |
| Ours | Closed-loop | 87.9 / 85.0 | 85.8 / 82.3 | 82.2 / 75.0 |
关键发现:
- DyWA 在所有评估 track 上均超越 baselines,最具挑战性的 Unknown State + 1 view 场景下提升最为显著(+31.5% success rate)
- Unknown state + single view 是最能体现 DyWA 动力学建模优势的场景
5.2 Ablation Study (Table 2) — Unknown State, 1 View
| Methods | W.M. | D.A. | FiLM | Seen | Unseen |
|---|---|---|---|---|---|
| DAgger | - | - | - | 59.9 | 57.5 |
| World Model | ✓ | - | - | 61.6 | 59.4 |
| RMA | - | ✓ | - | 65.6 | 57.9 |
| Ours w/o W.M. | - | ✓ | ✓ | 70.0 | 63.7 |
| Ours w/o FiLM | ✓ | ✓ | - | 73.3 | 59.4 |
| Ours (full) | ✓ | ✓ | ✓ | 82.2 | 75.0 |
关键发现:
- World Model 和 Dynamics Adaptation 互补性: 单独使用其中一个仅带来微小提升(+1.7% / +5.7%),但两者结合后性能从 59.9% 跳升到 73.3%(+13.4%)
- FiLM 的有效性: FiLM 提供比 direct concatenation 更有效的结构化条件注入,额外贡献 +8.9% 提升(73.3% → 82.2%)
- 三模块完整组合达到最佳性能 82.2% / 75.0%
5.3 真实世界实验 (Table 3)
| Methods | Mug | Bulldozer | Card | Book | Dinosaur | Chips Can | Switch | YCB-Bottle | Half-full Bottle | Coffee jar | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CORN w/ tracking | 1/5 | 3/5 | 4/5 | 4/5 | 2/5 | 0/5 | 2/5 | 0/5 | 0/5 | 2/5 | 36% |
| Ours | 3/5 | 4/5 | 4/5 | 4/5 | 3/5 | 2/5 | 4/5 | 3/5 | 4/5 | 3/5 | 68% |
关键发现:
- DyWA 在无需外部 Pose Tracking 的情况下,平均成功率 68%,远超依赖 tracking 的 CORN(36%)
- 在 slippery 物体(YCB-Bottle)和 non-uniform mass 物体(Half-full Bottle)上优势尤为明显
- 实现了零样本 Sim-to-Real 迁移
5.4 摩擦系数鲁棒性 (Table 4)
| Methods | S.R./Time | S.R./Time | S.R./Time | S.R./Time |
|---|---|---|---|---|
| Ours w/o D.A. | 3/5, 65s | 3/5, 81s | 4/5, 96s | 3/5, 124s |
| Ours | 4/5, 45s | 4/5, 50s | 4/5, 49s | 4/5, 51s |
关键发现:无 D.A. 的模型随摩擦系数变化执行时间大幅波动(65s→124s),而完整的 DyWA 保持稳定的成功率和执行时间。
5.5 VLM 应用

Figure 5 解读: DyWA 与 Vision-Language Models (VLMs) 集成的应用示例。通过 SoFar 模型将自然语言指令(如 “Put the grip of the electric drill into a person’s hand”)转换为语义物体 pose,作为 DyWA 的 goal 输入。这展示了 DyWA 的 goal-conditioned policy 可以与 VLM 结合实现自然语言驱动的操作。

Figure 6 解读: DyWA 作为 pre-grasping 步骤的应用。对于难以直接抓取的物体(如平放的薄卡片、超过夹爪跨度的饼干盒),DyWA 先将其翻转/旋转到适合抓取的姿态,再配合 grasping model 完成抓取,显著提高抓取成功率。
5.6 局限性
- 仅依赖点云作为视觉输入,对称物体存在几何歧义
- 透明和镜面物体因深度信息不完整而存在困难
- 未来方向:引入外观信息提供更丰富的视觉线索
RL 使用情况分析
本文使用了 Reinforcement Learning (RL),但仅用于训练 Teacher Policy,而非直接用于最终部署的 Student Policy。
Reward Model
- 论文声明 reward design 与 CORN 一致,具体细节在补充材料中
- RL Teacher 使用 PPO 算法训练,基于 IsaacGym 仿真环境中的 state-based reward
- Reward 由环境 wrapper 系统计算(代码中可见
AddWrenchPenalty,AddTrackingReward,AddSuccessAsObs等 wrapper),而非单独的 Reward Model
VLM 作为 Reward Judge
- 否,本文未使用 VLM 作为零样本 reward judge
- VLM 仅在应用阶段(Section 4.4)用于将自然语言指令转换为 goal pose,不参与 reward 计算
具体 Reward Signals
- 论文未详细说明具体 reward function(标注为参考 CORN 和补充材料)
- 从代码结构推断,reward 包括:
- Tracking Reward: 基于物体 pose 与目标 pose 之间的距离
- Wrench Penalty: 惩罚过大的接触力/扭矩
- Success Signal: 达到目标 pose 阈值时的成功信号
- 这是标准的 simulation-based reward,非学习型 Reward Model