DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation

Authors: Jiangran Lyu, Ziming Li, Xuesong Shi, Chaoyi Xu, Yizhou Wang, He Wang
Affiliations: Center on Frontiers of Computing Studies & School of Computer Science, Peking University; Galbot; Inst. for Artificial Intelligence, Peking University; State Key Laboratory of General Artificial Intelligence, Peking University
arXiv: 2503.16806
Project Page: pku-epic.github.io/DyWA
GitHub: jiangranlv/DyWA
Venue: ICCV 2025

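1. Motivation

Why non-prehensile manipulation matters: Non-prehensile actions such as pushing, sliding, and flipping are essential for handling objects that are too thin, too large, or otherwise ungraspable, and they greatly extend what a robot can do in unstructured environments.

Two limitations of existing learning-based methods:

  1. Reliance on multi-view cameras and accurate pose tracking: Existing methods (e.g., HACMan, CORN) depend heavily on multi-view setups and accurate object pose tracking modules. In real deployments, multi-view setups are not always available and tracking modules are often inaccurate.
  2. No generalization across physical conditions: Existing models focus mainly on geometry and ignore the underlying dynamics properties (e.g., object mass, table friction coefficient), so performance degrades severely when physical conditions change.

Problems with the teacher-student distillation framework: An RL teacher policy performs well when it has privileged information, but the distilled student policy degrades sharply under partial observability, for three reasons:

  • A single view causes severe loss of geometric information
  • A Markovian student model can only learn an "averaged" behavior across different dynamics conditions
  • Conventional distillation supervises only latent features and final actions, which is not enough to learn the underlying dynamics of contact-rich interaction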

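2. Idea

The core insight of DyWA is to model action learning jointly with future state prediction while adaptively capturing dynamics information from the trajectory history.

Concretely, DyWA proposes three key innovations:

  1. World Action Model: Extends the conventional action model into a world action model that predicts both the action and the next state. Next state prediction provides an additional supervision signal, so that action learning and world modeling reinforce each other.
  2. Dynamics Adaptation Module: Inspired by RMA, it extracts a dynamics embedding from historical observation-action pairs, capturing variations in physical parameters such as object mass and friction coefficient.
  3. FiLM Conditioning: Uses Feature-wise Linear Modulation to inject the dynamics embedding into the world action model, giving structured adaptation to different dynamics conditions.

Fundamental difference from existing methods: DyWA performs end-to-end 6D non-prehensile manipulation from a single-view point cloud, without pose tracking, transfers zero-shot from sim to real, and generalizes across object geometries and physical conditions.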


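3. Method

3.1 Overall Framework

Figure 2 walkthrough: The figure shows the complete DyWA pipeline. On the left is the vision-based student policy (used at deployment). It takes the current partial point cloud, end-effector pose, joint state, and goal point cloud as input and passes each through its own encoder to obtain feature representations. These features feed the World Action Model, which jointly predicts the action and the next state (written $A_s$ and $\hat{S}_{t+1}$ in the pseudocode of Section 3.9). Above it, the Adaptation Module encodes the history of observation-action pairs and decodes it into a dynamics embedding, which modulates the intermediate representations of the World Action Model through FiLM layers. On the right is the oracle teacher policy (used only during training). It has access to privileged information such as the full point cloud, physical parameters, and task state, and produces the teacher action ($A_t$ in the pseudocode) through its encoders and policy network. The teacher side provides three supervision signals: an imitation loss (action imitation), a world model loss (state prediction), and an adaptation loss (alignment of the adaptation embedding).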

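3.2 Task Formulation

Task objective: move a tabletop object from an initial 6D pose to a target 6D pose using non-prehensile actions (pushing, flipping, etc.).

  • Goal pose: defined as a 6DoF transformation relative to the initial pose
  • Task state: the relative transformation between the object's current pose and the goal pose
  • Observations: partial point cloud, joint states, and end-effector pose
  • Success criterion: the final object pose is within 0.05 m and within 0.1 radians of the goal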
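
The success criterion can be sketched as a small check on position and rotation error. This is a minimal illustration, assuming poses are given as a 3D position plus a 3x3 rotation matrix and that rotation error is measured as the geodesic angle; the function names `rotation_angle`, `is_success`, and `rot_z` are hypothetical and not taken from the DyWA code.

```python
import math

def rotation_angle(R_a, R_b):
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the sum of elementwise products of the two matrices.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for numerical safety
    return math.acos(cos_theta)

def is_success(pos, R, goal_pos, goal_R, pos_tol=0.05, rot_tol=0.1):
    """True when the pose is within 0.05 m and 0.1 rad of the goal."""
    pos_err = math.dist(pos, goal_pos)
    rot_err = rotation_angle(R, goal_R)
    return pos_err < pos_tol and rot_err < rot_tol

def rot_z(theta):
    """Rotation about the z axis, used here only to build example poses."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

print(is_success((0.02, 0.0, 0.0), rot_z(0.05), (0.0, 0.0, 0.0), rot_z(0.0)))
print(is_success((0.02, 0.0, 0.0), rot_z(0.20), (0.0, 0.0, 0.0), rot_z(0.0)))
```

Both thresholds must hold at once, so a pose that is close in position but rotated by 0.2 rad still counts as a failure.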

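3.3 Training Pipeline

Training follows the standard teacher-student policy distillation framework:

Stage 1: Train the RL teacher policy (200K iterations, PPO)

  • Input is privileged information: full point cloud, physical parameters (mass, friction coefficient), and the exact task state
  • Trained with PPO; the reward design follows CORN (see the supplementary material)
  • Variable impedance control serves as the low-level action execution mechanism

Stage 2: Train the student policy (500K iterations, DAgger)

  • Distillation uses DAgger: rollouts start out executing teacher actions, and the weight of the student action is increased gradually
  • The student uses only partial observations (single-view point cloud, joint states, end-effector pose)
  • Domain randomization: object mass, scale, friction coefficient, and restitution coefficient are randomized
  • Small perturbations are injected into the torque commands, the point cloud, and the goal pose to improve sim-to-real transfer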
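
The gradual hand-over from teacher to student actions in Stage 2 can be sketched as follows. The linear shape of the schedule is an assumption (the notes only say the student weight increases gradually), and `anneal_alpha` and `mix_actions` are hypothetical helper names.

```python
def anneal_alpha(step, total_steps):
    """Student weight alpha in [0, 1]; linear annealing is an assumed schedule."""
    return min(1.0, max(0.0, step / total_steps))

def mix_actions(teacher_action, student_action, alpha):
    """Blend teacher and student actions: alpha=0 is pure teacher, alpha=1 pure student."""
    return [(1.0 - alpha) * a_t + alpha * a_s
            for a_t, a_s in zip(teacher_action, student_action)]

total_steps = 500_000  # student training length stated above
for step in (0, 250_000, 500_000):
    alpha = anneal_alpha(step, total_steps)
    print(step, alpha, mix_actions([1.0, 0.0], [0.0, 1.0], alpha))
```

Early rollouts therefore visit states the teacher reaches, while late rollouts visit states the student itself reaches, which is the point of DAgger-style data collection.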

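3.4 World Action Model

The World Action Model is a policy model that predicts both the action and the future state. The core idea is to create a synergistic learning effect through next state prediction.

Observation and Goal Encoding:

  • Partial point cloud → simplified PointNet++ → point cloud feature $f^P$
  • Joint positions/velocities → shallow MLPs → joint feature $f^J$
  • End-effector pose → shallow MLP → end-effector feature $f^E$
  • Goal description: the initial point cloud $P_0$ is transformed by the goal pose $G$ to obtain $P_G = G \cdot P_0$, which is encoded by the same point cloud encoder into the goal feature $f^G$

State-based World Modeling: the observation and goal embeddings pass through MLPs that jointly produce the action $A_s$ and the next task state $\hat{S}_{t+1}$.

The task state uses an object-centric representation (rather than a high-dimensional visual signal), so the world model focuses on task-relevant dynamics. Rotations use a 9D representation.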
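
To make the 9D rotation representation concrete: a 9D vector is a flattened 3x3 matrix, and a raw network output in this form is generally not a valid rotation. One common way to map it back to SO(3) is SVD orthogonalization, sketched below with NumPy. Whether DyWA applies this projection to its predictions is an assumption; the function name `rotation_from_9d` is hypothetical.

```python
import numpy as np

def rotation_from_9d(x):
    """Project a raw 9D vector onto the nearest rotation matrix via SVD."""
    M = np.asarray(x, dtype=float).reshape(3, 3)
    U, _, Vt = np.linalg.svd(M)
    # Flip the last singular direction if needed so that det(R) = +1.
    d = np.sign(np.linalg.det(U @ Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt

# A slightly perturbed identity, standing in for a noisy network prediction.
noisy = np.eye(3).reshape(-1) + 0.05 * np.arange(9) / 9.0
R = rotation_from_9d(noisy)
print(np.round(R @ R.T, 6))   # orthogonality check
print(np.linalg.det(R))       # determinant check
```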

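World Model Loss:

$$\mathcal{L}_{world} = \| T_{t+1} - \hat{T}_{t+1} \|_2^2 + \| R_{t+1} - \hat{R}_{t+1} \|_1$$

where $\hat{T}_{t+1}$ and $\hat{R}_{t+1}$ are the predicted translation and rotation of the next task state, and $T_{t+1}$, $R_{t+1}$ are the ground truth provided by the simulator.

Imitation Loss:

$$\mathcal{L}_{imitation} = \| A_s - A_t \|_2^2$$

where $A_s$ is the student action and $A_t$ is the teacher action.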
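
The two losses can be sketched in plain Python, following the form used in the pseudocode of Section 3.9 (squared L2 on translation, L1 on the 9D rotation, squared L2 on actions). The function names and the example numbers are illustrative assumptions.

```python
def world_model_loss(T_true, T_pred, R_true, R_pred):
    """Squared L2 error on translation plus L1 error on the flattened 9D rotation."""
    trans_term = sum((a - b) ** 2 for a, b in zip(T_true, T_pred))
    rot_term = sum(abs(a - b) for a, b in zip(R_true, R_pred))
    return trans_term + rot_term

def imitation_loss(student_action, teacher_action):
    """Squared L2 distance between student and teacher actions."""
    return sum((a - b) ** 2 for a, b in zip(student_action, teacher_action))

identity_9d = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
pred_9d = [0.9, 0.0, 0.0, 0.0, 1.0, 0.1, 0.0, 0.0, 1.0]
print(world_model_loss([0.0, 0.0, 0.0], [0.1, 0.0, 0.2], identity_9d, pred_9d))
print(imitation_loss([0.5, 0.5], [0.0, 1.0]))
```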

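Figure 3 walkthrough: The loss curves show the synergy between Dynamics Adaptation (D.A.) and the World Model. Left: imitation loss with D.A. alone versus D.A. plus the World Model. Adding the World Model makes the imitation loss converge faster and to a lower value, indicating that next state prediction helps action learning. Right: world model loss with the World Model alone versus the World Model plus D.A. Here D.A. helps the world model predict future states more accurately. Together the two plots support the claim that the modules are complementary.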

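3.5 Dynamics Adaptation

Inspired by RMA (Rapid Motor Adaptation), an Adaptation Module extracts information about the environment dynamics from the trajectory history.

Adaptation Embedding:

At each timestep, the observation embedding is concatenated with the previous step's action embedding. The last $L$ such observation-action pairs form a sequence, from which a 1D CNN extracts the adaptation embedding:

$$z_t = \mathrm{Conv1D}\big( [f^O_k ; f^A_{k-1}]_{k = t-L+1}^{t} \big)$$

Adaptation Loss: supervises the adaptation embedding so that it aligns with the teacher encoder's embeddings of the full point cloud and the physical parameters:

$$\mathcal{L}_{adapt} = \| z_t^{Geo,Phy} - [f_t^{Geo} ; f_t^{Phy}] \|_2^2$$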
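
The rolling window of observation-action pairs can be sketched with a fixed-length buffer. Zero-padding at the start of an episode and the class name `HistoryBuffer` are assumptions made for illustration; the notes do not say how the first few steps are handled.

```python
from collections import deque

class HistoryBuffer:
    """Keeps the last `length` pairs [obs_embedding ; previous_action_embedding]."""

    def __init__(self, length, pair_dim):
        # Assumption: pad with zeros until enough real steps have been observed.
        self.pairs = deque([[0.0] * pair_dim for _ in range(length)], maxlen=length)

    def push(self, obs_emb, prev_action_emb):
        # Each entry joins the current observation with the action taken one step earlier.
        self.pairs.append(list(obs_emb) + list(prev_action_emb))

    def as_sequence(self):
        """Oldest-to-newest list of pairs, ready to be fed to a 1D CNN."""
        return list(self.pairs)

buf = HistoryBuffer(length=3, pair_dim=4)
buf.push([1.0, 2.0], [0.0, 0.0])   # first step: no previous action yet
buf.push([3.0, 4.0], [0.5, 0.5])
print(buf.as_sequence())
```

Because the buffer has a fixed `maxlen`, pushing a new pair drops the oldest one automatically, so the sequence length seen by the CNN never changes.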

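3.6 FiLM Conditioning

After the adaptation embedding is decoded into a dynamics embedding, it is injected into the World Action Model through Feature-wise Linear Modulation (FiLM).

Each FiLM block contains two shallow MLPs that produce the modulation parameters $\gamma$ and $\beta$ from the dynamics embedding and apply an affine transformation to the intermediate feature $f$:

$$f' = \gamma \odot f + \beta$$

FiLM blocks are densely integrated into the first few layers of the World Action Model, while the later layers remain unconditioned. This design has been shown to be effective for integrating language guidance into vision encoders.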
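
The layer placement described above (conditioned early layers, unconditioned late layers) can be sketched with NumPy. Everything concrete here is an assumption for illustration: the layer counts, the dimensions, the random weights, and the use of a single linear map in place of each shallow MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_linear(in_dim, out_dim):
    """Random weights and zero bias; stands in for a trained layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

def apply_linear(params, x):
    weight, bias = params
    return x @ weight + bias

class FiLMConditionedMLP:
    def __init__(self, feat_dim, dyn_dim, n_film=2, n_plain=2):
        # Early layers: each has its own gamma/beta generators (one FiLM block per layer).
        self.film_layers = [
            {"layer": make_linear(feat_dim, feat_dim),
             "gamma": make_linear(dyn_dim, feat_dim),
             "beta": make_linear(dyn_dim, feat_dim)}
            for _ in range(n_film)
        ]
        # Late layers: no dynamics conditioning.
        self.plain_layers = [make_linear(feat_dim, feat_dim) for _ in range(n_plain)]

    def forward(self, features, dynamics_emb):
        for block in self.film_layers:
            hidden = np.maximum(apply_linear(block["layer"], features), 0.0)
            gamma = apply_linear(block["gamma"], dynamics_emb)
            beta = apply_linear(block["beta"], dynamics_emb)
            features = gamma * hidden + beta  # feature-wise affine modulation
        for params in self.plain_layers:
            features = np.maximum(apply_linear(params, features), 0.0)
        return features

model = FiLMConditionedMLP(feat_dim=16, dyn_dim=8)
features = rng.normal(size=(4, 16))
out_a = model.forward(features, rng.normal(size=(4, 8)))
out_b = model.forward(features, rng.normal(size=(4, 8)))
print(out_a.shape, np.abs(out_a - out_b).max())
```

The same input features give different outputs under different dynamics embeddings, which is the behavior the conditioning is meant to provide.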

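3.7 Action Space with Variable Impedance

The action space consists of the end-effector subgoal residual together with the joint-space impedance parameters:

  • Position gain
  • Damping factor and velocity gain

The target joint position is obtained by solving inverse kinematics with damped least squares, which in its standard form reads:

$$q_d = q_t + J^\top (J J^\top + \lambda^2 I)^{-1} \Delta x$$

where $q_t$ is the current joint position, $J$ the manipulator Jacobian, $\lambda$ the damping constant, and $\Delta x$ the end-effector subgoal residual.

The joint-space impedance controller is implemented with the Polymetis API.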
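
A minimal NumPy sketch of one damped least squares step, assuming the standard form q + J^T (J J^T + λ² I)^{-1} Δx. The Jacobian below is a made-up 6x7 matrix rather than a real Franka Jacobian, and `dls_joint_target` is a hypothetical name; the damping value actually used by DyWA is not given in these notes.

```python
import numpy as np

def dls_joint_target(q, J, dx, damping=0.05):
    """One damped least squares IK step: q + J^T (J J^T + damping^2 I)^-1 dx."""
    J = np.asarray(J, dtype=float)
    JJt = J @ J.T
    # Solving the linear system avoids forming an explicit matrix inverse.
    dq = J.T @ np.linalg.solve(JJt + damping**2 * np.eye(JJt.shape[0]), dx)
    return np.asarray(q, dtype=float) + dq

# Toy 6x7 Jacobian (assumption): identity on six joints plus a redundant seventh column.
J = np.hstack([np.eye(6), np.ones((6, 1))])
q = np.zeros(7)
dx = np.array([0.01, 0.0, -0.02, 0.0, 0.0, 0.05])  # desired end-effector displacement

q_target = dls_joint_target(q, J, dx, damping=1e-3)
print(np.round(J @ (q_target - q), 5))  # first-order end-effector motion achieved
```

The damping term trades tracking accuracy for bounded joint steps near singular configurations: larger damping gives smaller, safer joint updates.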

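3.8 Overall Training Objective

The student is trained on the sum of the three losses, matching the pseudocode in Section 3.9:

$$\mathcal{L}_{total} = \mathcal{L}_{imitation} + \mathcal{L}_{world} + \mathcal{L}_{adapt}$$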

3.9 Pseudocode

Algorithm 1: Teacher Policy Training (PPO)

Algorithm: Teacher Policy Training via PPO
Input: IsaacGym env with privileged state (full PC, physics params, task state),
       StateEncoder, PiNet (actor), VNet (critic)
Output: Trained teacher policy π_teacher

Initialize PPO agent with StateEncoder, PiNet, VNet
for step in range(200_000):
    # Collect rollouts with privileged information
    obs = env.get_privileged_obs()  # full point cloud, physics params, task state, joint & EE
    features = StateEncoder(obs)
    action_dist = PiNet(features)  # Gaussian distribution
    action = action_dist.sample()
    old_log_prob = action_dist.log_prob(action)
    value = VNet(features)
    next_obs, reward, done, info = env.step(action)

    # Reward bootstrapping on timeout: a time-limit cutoff is not a true terminal
    # state, so add the discounted value of the next state before computing GAE
    if info['timeout']:
        reward += gamma * VNet(StateEncoder(next_obs))
    # Store transition (obs, action, reward, done, value, old_log_prob)

    # PPO update with GAE over the stored rollout
    advantages = gae_ax1(rewards, values, dones, gamma, gae_lambda)
    returns = advantages + values
    for epoch in range(ppo_epochs):
        batch_features = StateEncoder(obs_batch)
        new_log_prob = PiNet(batch_features).log_prob(actions)
        ratio = exp(new_log_prob - old_log_probs)
        surr1 = ratio * advantages
        surr2 = clip(ratio, 1 - eps, 1 + eps) * advantages
        policy_loss = -min(surr1, surr2).mean()
        value_loss = ((returns - VNet(batch_features)) ** 2).mean()
        loss = policy_loss + value_coef * value_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
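
The GAE step in the PPO update can be sketched in plain Python. This is an independent illustration of generalized advantage estimation for a single trajectory, not a reproduction of the repository's `gae_ax1`; the name `compute_gae` and its argument layout are assumptions.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards[t], values[t], dones[t] describe step t; last_value is the critic's
    estimate for the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0  # no bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

adv, ret = compute_gae([1.0, 1.0], [0.0, 0.0], [False, True],
                       last_value=5.0, gamma=0.5, lam=1.0)
print(adv, ret)
```

Because the second step ends the episode, `last_value` is ignored there, which is why timeouts need the separate reward bootstrapping shown in the algorithm.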

Algorithm 2: Student Policy Distillation (DAgger + World Action Model)

Algorithm: DyWA Student Distillation via DAgger
Input: Trained teacher π_teacher, env, PointNet++, AdaptationModule, WorldActionModel
Output: Trained student policy (DyWA)

Initialize student encoders, WorldActionModel, AdaptationModule
alpha = 0  # DAgger mixing: 0=teacher, 1=student
for step in range(500_000):
    # Anneal alpha from 0 to 1 over training
    alpha = anneal_schedule(step)

    # Get observations
    partial_pc, joint_state, ee_pose = env.get_student_obs()
    goal_pc = transform(initial_pc, goal_pose)  # P_G = G * P_0

    # Encode observations
    f_P = PointNetPP(partial_pc)      # point cloud feature
    f_J = MLP_joint(joint_state)      # joint feature
    f_E = MLP_ee(ee_pose)             # end-effector feature
    f_G = PointNetPP(goal_pc)         # goal feature (shared encoder)

    # Adaptation: encode the last L observation-action pairs; each pair joins
    # an observation embedding with the previous step's action embedding
    f_O[t] = concat(f_P, f_J, f_E)
    history = [(f_O[t - i], f_A[t - i - 1]) for i in range(L)]
    z_t = Conv1D_Adapter(history)       # adaptation embedding
    dynamics_emb = Decoder(z_t)         # dynamics embedding

    # FiLM conditioning on WorldActionModel
    features = concat(f_P, f_J, f_E, f_G)
    for layer in WorldActionModel.early_layers:
        gamma, beta = FiLM_MLP[layer](dynamics_emb)  # each FiLM block has its own MLPs
        features = gamma * layer(features) + beta
    for layer in WorldActionModel.late_layers:
        features = layer(features)  # unconditioned

    A_s, S_hat_next = WorldActionModel.heads(features)  # predict action + next state
    T_hat_next, R_hat_next = S_hat_next  # predicted translation and 9D rotation

    # Get teacher action with privileged info (no gradient flows into the teacher)
    A_t = teacher.get_action(env.get_privileged_obs())

    # Execute mixed action (DAgger)
    action = (1 - alpha) * A_t + alpha * A_s
    env.step(action)
    T_next, R_next = env.get_task_state()  # simulator ground truth (hypothetical accessor)

    # Compute losses
    L_imitation = ||A_s - A_t||^2
    L_world = ||T_next - T_hat_next||_2^2 + ||R_next - R_hat_next||_1
    L_adapt = ||z_t^{Geo,Phy} - concat(f_t^{Geo}, f_t^{Phy})||^2
    L_total = L_imitation + L_world + L_adapt

    optimizer.zero_grad()
    L_total.backward()
    optimizer.step()

Algorithm 3: FiLM Conditioning Module

Algorithm: FiLM Dynamics Conditioning
Input: latent feature f, dynamics_embedding z
Output: conditioned feature f'

import torch.nn as nn

# MLP is a placeholder for a shallow multi-layer perceptron helper
class FiLMBlock(nn.Module):
    def __init__(self, feature_dim, dynamics_dim):
        super().__init__()  # required before registering submodules
        self.gamma_mlp = MLP(dynamics_dim, feature_dim)  # scaling
        self.beta_mlp = MLP(dynamics_dim, feature_dim)   # shifting

    def forward(self, f, dynamics_emb):
        gamma = self.gamma_mlp(dynamics_emb)
        beta = self.beta_mlp(dynamics_emb)
        return gamma * f + beta  # affine modulation

Algorithm 4: Adaptation Module

Algorithm: Dynamics Adaptation via History Encoding
Input: history of observations {f_O} and actions {f_A} of length L
Output: adaptation embedding z_t, dynamics embedding

import torch
import torch.nn as nn

# MLP is a placeholder for a shallow multi-layer perceptron helper
class AdaptationModule(nn.Module):
    def __init__(self, obs_dim, action_dim, embed_dim, hidden_dim, dynamics_dim, k):
        super().__init__()
        self.action_encoder = MLP(action_dim, embed_dim)
        # Two unpadded convolutions shrink the sequence, so L must be at least 2k - 1
        self.conv1d = nn.Sequential(
            nn.Conv1d(obs_dim + embed_dim, hidden_dim, kernel_size=k),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, embed_dim, kernel_size=k),
        )
        self.decoder = MLP(embed_dim, dynamics_dim)  # z -> dynamics embedding

    def forward(self, obs_history, action_history):
        # obs_history: [B, L, obs_dim], action_history: [B, L, action_dim]
        f_A = self.action_encoder(action_history)
        pairs = torch.cat([obs_history, f_A], dim=-1)  # [B, L, obs_dim + embed_dim]
        pairs = pairs.transpose(1, 2)  # [B, C, L] for Conv1d
        z_t = self.conv1d(pairs).mean(dim=-1)  # pool over time -> [B, embed_dim]
        dynamics_emb = self.decoder(z_t)
        return z_t, dynamics_emb

3.10 Code-to-Paper Mapping Table

Paper Concept | Source File | Key Class/Function
PPO Teacher Training | dywa/exp/train/train_ppo_arm.py | inner_main(), load_agent()
PPO Algorithm | dywa/src/models/rl/v6/ppo.py | PPO, gae_ax1()
Student Distillation (DAgger) | dywa/exp/train/train_rma.py | DAggerTrainerEnv, RMATrainerEnv
Student Agent (RMA + World Model) | dywa/exp/train/distill.py | StudentAgentRMA
Point Cloud Encoder (PointNet++) | dywa/src/models/pointnet2.py | PointNet++
Point Cloud (PointMAE) | dywa/src/models/cloud/ | Cloud models
Common NN Modules (MLP, GRU, CNN, Attention) | dywa/src/models/common.py | MLP, SingleGRU, SimpleCNN, FiLM*
Environment Wrappers (Reward, Obs) | dywa/src/env/env/wrap/ | 60+ wrapper classes
Robot Controllers | dywa/src/env/robot/ | Franka, UR5 implementations
Domain Randomization / Config | dywa/src/data/cfg/ | YAML configs
CUDA Kinematics | dywa/c_src/ | Franka/UR5 forward kinematics

*Note: The concrete FiLM implementation may live in StudentAgentRMA or in a separate model file; common.py contains the basic NN components.


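4. Experimental Setup

Simulation Environment

  • Simulator: IsaacGym
  • Training objects: DexGraspNet 323-object asset
  • Test objects: 10 geometrically diverse unseen objects, each scaled to 5 sizes, for 50 evaluation objects in total
  • Domain randomization: object mass, scale, friction coefficient, and restitution coefficient are randomized

Evaluation Dimensions

Dimension | Setting
View | Single-view (1 camera) vs. Multi-view (3 cameras)
State | Known state (GT pose) vs. Unknown state (goal point cloud)
Object | Seen (training set) vs. Unseen (test set)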

Baselines

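Method | Type | Description
HACMan | Primitive-based | Point-cloud-based contact location selection + motion primitives
CORN | Closed-loop | Teacher-student distillation based on an object representation
CORN (PN++) | Closed-loop | CORN with its point cloud encoder replaced by PointNet++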

Ablation Variants

  • DAgger (baseline): no World Model, no D.A., no FiLM
  • World Model only
  • RMA (D.A.) only
  • Ours w/o W.M.: D.A. + FiLM, no World Model
  • Ours w/o FiLM: W.M. + D.A., no FiLM

Training Configuration

  • Teacher: PPO, 200K iterations
  • Student: DAgger, 500K iterations
  • Success criterion: position error < 0.05 m and rotation error < 0.1 rad

Real-World Setup

  • Robot: Franka robot arm
  • Camera: RealSense D435 (side view)
  • Objects: 10 unseen real objects (including slippery objects, a half-full water bottle, etc.)
  • Pose evaluation: Iterative Closest Point (ICP)

5. Experimental Results

5.1 Main Simulation Results (Table 1)

Methods | Action Type | Known State (3 view) Seen/Unseen | Unknown State (3 view) Seen/Unseen | Unknown State (1 view) Seen/Unseen
HACMan | Primitive | 3.8(42.2) / 5.7(39.4) | 3.0(23.6) / 4.1(26.5) | 1.5(17.9) / 2.9(18.3)
CORN | Closed-loop | 86.8 / 79.9 | 46.0 / 47.8 | 29.0 / 29.8
CORN (PN++) | Closed-loop | 87.3 / 84.3 | 76.1 / 75.7 | 50.7 / 49.4
Ours | Closed-loop | 87.9 / 85.0 | 85.8 / 82.3 | 82.2 / 75.0

Key findings:

  • DyWA outperforms the baselines on every evaluation track. The gain is largest in the most challenging setting, Unknown State + 1 view, where it is +31.5 percentage points in success rate on seen objects (82.2 vs. 50.7 for CORN (PN++))
  • Unknown state + single view is the setting that best exposes the benefit of DyWA's dynamics modeling
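
The headline gain can be recomputed directly from the Unknown State (1 view) column of Table 1. HACMan is left out because the meaning of its parenthesized values is not explained in these notes.

```python
# Success rates (seen, unseen) in the Unknown State, 1 view track of Table 1.
unknown_state_1view = {
    "CORN": (29.0, 29.8),
    "CORN (PN++)": (50.7, 49.4),
    "Ours": (82.2, 75.0),
}

ours_seen, ours_unseen = unknown_state_1view["Ours"]
baselines = {k: v for k, v in unknown_state_1view.items() if k != "Ours"}
best_seen = max(v[0] for v in baselines.values())
best_unseen = max(v[1] for v in baselines.values())

# Differences are in percentage points, rounded to the table's precision.
gain_seen = round(ours_seen - best_seen, 1)
gain_unseen = round(ours_unseen - best_unseen, 1)
print(gain_seen, gain_unseen)
```

The gain on unseen objects is smaller than on seen objects but still large relative to the strongest baseline.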

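5.2 Ablation Study (Table 2): Unknown State, 1 View

Methods | W.M. | D.A. | FiLM | Seen | Unseen
DAgger | - | - | - | 59.9 | 57.5
World Model | ✓ | - | - | 61.6 | 59.4
RMA | - | ✓ | - | 65.6 | 57.9
Ours w/o W.M. | - | ✓ | ✓ | 70.0 | 63.7
Ours w/o FiLM | ✓ | ✓ | - | 73.3 | 59.4
Ours (full) | ✓ | ✓ | ✓ | 82.2 | 75.0

Key findings (seen objects):

  • World Model and Dynamics Adaptation are complementary: each one alone gives only a small gain over DAgger (+1.7 and +5.7 percentage points), but combining them lifts performance from 59.9% to 73.3% (+13.4 points)
  • Effectiveness of FiLM: FiLM provides a more effective, structured way to inject the condition than direct concatenation, contributing a further +8.9 points (73.3% → 82.2%)
  • The full combination of all three modules achieves the best performance, 82.2% / 75.0%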
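
The ablation arithmetic can be checked with a few lines of Python using the Seen column of Table 2. The variable names are illustrative only.

```python
# Seen-object success rates (%) from Table 2, Unknown State + 1 view.
seen = {
    "DAgger": 59.9,
    "World Model": 61.6,
    "RMA": 65.6,
    "Ours w/o W.M.": 70.0,
    "Ours w/o FiLM": 73.3,
    "Ours (full)": 82.2,
}

# Gain of every variant over the plain DAgger baseline, in percentage points.
gains = {name: round(rate - seen["DAgger"], 1) for name, rate in seen.items()}

# If the two modules were merely additive, their combined gain would be this sum.
additive_estimate = round(gains["World Model"] + gains["RMA"], 1)
film_gain = round(seen["Ours (full)"] - seen["Ours w/o FiLM"], 1)

print(gains)
print(additive_estimate, gains["Ours w/o FiLM"], film_gain)
```

The combined gain of W.M. + D.A. exceeds the sum of the individual gains, which is the quantitative basis for calling the two modules complementary.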

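5.3 Real-World Experiments (Table 3)

Methods | Mug | Bulldozer | Card | Book | Dinosaur | Chips Can | Switch | YCB-Bottle | Half-full Bottle | Coffee jar | Avg.
CORN w/ tracking | 1/5 | 3/5 | 4/5 | 4/5 | 2/5 | 0/5 | 2/5 | 0/5 | 0/5 | 2/5 | 36%
Ours | 3/5 | 4/5 | 4/5 | 4/5 | 3/5 | 2/5 | 4/5 | 3/5 | 4/5 | 3/5 | 68%

Key findings:

  • Without any external pose tracking, DyWA reaches an average success rate of 68%, far above the tracking-dependent CORN (36%)
  • The advantage is especially clear on the slippery object (YCB-Bottle, 3/5 vs. 0/5) and the object with non-uniform mass (Half-full Bottle, 4/5 vs. 0/5)
  • The policy transfers zero-shot from sim to real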
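
The averages in Table 3 follow from the per-object counts, as this short tally shows (5 trials per object, 10 objects).

```python
objects = ["Mug", "Bulldozer", "Card", "Book", "Dinosaur", "Chips Can",
           "Switch", "YCB-Bottle", "Half-full Bottle", "Coffee jar"]
trials_per_object = 5

# Successful trials per object, in the column order of Table 3.
successes = {
    "CORN w/ tracking": [1, 3, 4, 4, 2, 0, 2, 0, 0, 2],
    "Ours": [3, 4, 4, 4, 3, 2, 4, 3, 4, 3],
}

rates = {
    method: 100 * sum(counts) / (trials_per_object * len(objects))
    for method, counts in successes.items()
}
print(rates)

# Objects on which DyWA succeeds strictly more often than CORN with tracking.
better = [name for name, c, o in zip(objects, successes["CORN w/ tracking"], successes["Ours"])
          if o > c]
print(better)
```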

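5.4 Robustness to Friction Coefficient (Table 4)

Methods | S.R./Time | S.R./Time | S.R./Time | S.R./Time
Ours w/o D.A. | 3/5, 65s | 3/5, 81s | 4/5, 96s | 3/5, 124s
Ours | 4/5, 45s | 4/5, 50s | 4/5, 49s | 4/5, 51s

Key finding: the execution time of the model without D.A. varies widely as the friction coefficient changes (65s → 124s), while the full DyWA keeps a stable success rate (4/5 throughout) and execution time (45s to 51s).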

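5.5 VLM Applications

Figure 5 walkthrough: An example of integrating DyWA with Vision-Language Models (VLMs). The SoFar model converts a natural language instruction (e.g., "Put the grip of the electric drill into a person's hand") into a semantic object pose, which serves as the goal input to DyWA. This shows that DyWA's goal-conditioned policy can be combined with a VLM for language-driven manipulation.

Figure 6 walkthrough: DyWA used as a pre-grasping step. For objects that are hard to grasp directly (e.g., a thin card lying flat, or a cracker box wider than the gripper span), DyWA first flips or rotates the object into a graspable pose and then a grasping model completes the grasp, which significantly improves the grasp success rate.

5.6 Limitations

  • The method relies on point clouds alone as visual input, so symmetric objects cause geometric ambiguity
  • Transparent and specular objects are difficult because their depth information is incomplete
  • Future direction: incorporate appearance information to provide richer visual cues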

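Analysis of RL Usage

The paper uses Reinforcement Learning (RL), but only to train the teacher policy, not the student policy that is finally deployed.

Reward Model

  • The paper states that the reward design follows CORN, with details in the supplementary material
  • The RL teacher is trained with PPO on a state-based reward in the IsaacGym simulation environment
  • The reward is computed by the environment wrapper system (the code contains wrappers such as AddWrenchPenalty, AddTrackingReward, and AddSuccessAsObs), not by a separate reward model

VLM as Reward Judge

  • No. The paper does not use a VLM as a zero-shot reward judge
  • The VLM is used only at the application stage (Section 4.4) to convert natural language instructions into a goal pose, and it plays no part in reward computation

Specific Reward Signals

  • The paper does not spell out the reward function (it refers to CORN and the supplementary material)
  • Judging from the code structure, the reward includes:
    • Tracking reward: based on the distance between the object pose and the goal pose
    • Wrench penalty: penalizes excessive contact forces/torques
    • Success signal: emitted when the goal pose threshold is reached
  • This is a standard simulation-based reward, not a learned reward model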