Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Paper: arXiv:2601.12993
Code: BeingBeyond/Being-H
Code reference: main @ 66b959e0 (2026-05-04)

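1. Motivation

The core bottleneck of current VLAs is not whether a policy can be learned on a single robot, but the "physical gap" that appears across embodiments: robots differ in DoF, end effectors, joint limits, control frequency, sensor viewpoints, and action units, so naive joint training maps the same semantic action to mutually conflicting control vectors. Existing approaches often use platform-specific action heads or train only on a few high-resource robots, which leaves the model an expert in one "hardware dialect" rather than a general policy that transfers physical common sense.

The specific problem the paper addresses is how to put heterogeneous data (human hand motion, single-arm, dual-arm, dexterous-hand, and mobile-base robots) into one trainable, executable VLA framework, so that low-resource robots can achieve few-shot adaptation by drawing on human interaction data and high-resource robot data. To this end, the paper proposes the UniHand-2.0 data recipe, a Unified State-Action Space, a Mixture-of-Transformers/Mixture-of-Flow architecture, and the MPG/UAC stabilization mechanisms aimed at real deployment latency.

The problem matters because robotics data is far smaller in scale than NLP/vision data; if every robot needs large-scale demonstrations collected from scratch, VLAs are hard to scale. Being-H0.5 treats human interaction trajectories as the "mother tongue" of physical interaction: the goal is for the model to learn a physical grammar shared across morphologies (grasping, contact, carrying, obstacle avoidance) and then map these capabilities onto specific robots.

Figure 1 interpretation: The overview figure ties the paper's logic together. On the left is the scaled-up UniHand-2.0 data, covering human motion, robot manipulation, and visual-text understanding; in the middle are the Unified Action Space and Being-H0.5's Und. Expert / Gen. Expert; on the right, a single checkpoint is deployed across PND Adam-U, Franka+Inspire, Unitree G1, BeingBeyond D1, LeRobot SO-101, and other platforms. The emphasis is not "yet another robot policy" but a unified action language that brings the control problems of many embodiments under one modeling interface.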

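2. Idea

Core insight: although robots differ mechanically, many interaction skills share the same physical semantics, such as approaching, aligning, contacting, gripping, transporting, and placing. Rather than learning a separate action head per robot, the paper projects both human-hand and robot actions into a Unified Action Space with semantically consistent slots, so the model learns vision-language understanding and continuous action generation within the same token sequence.

The key innovations come in three layers. First, UniHand-2.0 provides a human-centric pre-training recipe of 35,000+ hours, 400M samples, and 30 embodiments. Second, Being-H0.5 uses MoT/MoF to decouple multimodal understanding from action generation while sharing attention. Third, MPG and UAC handle, respectively, action-manifold drift under sensory shift and the temporal mismatch between inference latency and action chunking on real robots.

Compared with methods such as $\pi_{0}$, GR00T-N1, and OpenVLA, Being-H0.5 does not merely concatenate image/text/state as input to an action head; it explicitly constructs a cross-embodiment, slot-level state/action vocabulary and treats human hand motion as a "generalized embodiment". Compared with platform-specific action heads, the Unified Action Space aims to align the semantics of action dimensions before learning starts, reducing negative transfer at the source.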

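3. Method

3.1 Overall framework

The overall Being-H0.5 framework can be divided into four stages:

Data: UniHand-2.0 combines 16K hours of human data, 14K hours of robot data, and visual-text understanding data into a training recipe of 35K+ hours / 400M samples; UniCraftor provides the data collection system, with multi-view RGBD, pedal-triggered events, and semi-automatic annotation.
Representation: the Unified State-Action Space uses $\Phi_{e}$ to project embodiment-specific state/action into global sparse slots; unused slots are set to zero, so human wrist pose, finger articulation, robot EEF, joint, and base velocity signals can coexist.
Architecture: in the MoT backbone, the Und. Expert handles multimodal reasoning and the Act. Expert handles flow-based action generation; shared self-attention ensures that action generation is fully conditioned on visual/language context.
Deployment: MPG modulates the strength of the context residual according to the distribution discrepancy between observation and action embeddings; UAC lets robots with different control frequencies and latency budgets execute chunked actions safely.

Figure 6 interpretation: The figure splits the Being-H0.5 method into three threads. At the top is the Unified State-Action Space, which maps the human hand and multiple robot actions into unified slots. In the middle is UniHand-2.0's QA-style serialization, which feeds vision/text/state/action into the model as sequence segments. On the right is Mixture-of-Flow, which first learns shared motor primitives inside the action expert and then adapts to embodiment/task-specific dynamics through routed specialized experts. The Und. Expert and Act. Expert are not two fully isolated networks; they share attention inside the MoT, so action generation can use the visual-semantic context directly.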

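3.2 UniHand-2.0 and UniCraftor

UniHand-2.0 brings a human-centric interaction prior into VLA pre-training. The paper reports a recipe of 400M samples and 35,000+ hours, including roughly 16,000 hours of human data and 14,000 hours of robot data, covering 30 distinct robotic embodiments. It is more than a robot trajectory library: it also spans cross-embodiment physical control and general visual-text understanding, so the model learns both "describing" and "acting" within the same sequence modeling task.

Figure 2 interpretation: The figure positions UniHand-2.0 by data scale. The paper presents it as the largest embodied pre-training recipe and stresses that hours, samples, and embodiments are scaled together. The key point for reading the paper is that human data is not an add-on augmentation but the main source of the cross-morphology physical prior.

Figure 3 interpretation: The figure breaks down the composition of UniHand-2.0: human interaction traces, robot demonstrations, and visual-language data together form the pre-training mixture. The paper's hypothesis is that human motion carries rich contact and manipulation semantics, and that low-resource robots can inherit these priors through the unified action space.

Figure 5 interpretation: UniCraftor is the data collection system. Multi-view observation, native depth, pedal event signals, and an annotation pipeline with automatic and manual verification together reduce labeling cost. It is the engineering support behind the paper's goal of sustainably scaling human/robot data.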

3.3 Unified State-Action Space

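The paper maps the raw state and action of each embodiment $e$ into global vectors:

$$s = \Phi_{e}(s^{(e)}), \quad a = \Phi_{e}(a^{(e)}).$$

Here $\Phi_{e}$ is a sparse slot routing: relevant signals go into their corresponding global slots and unused slots are set to zero. Training samples are then serialized into modality-tagged segments:

$$S = [x_{1}, x_{2}, \dots, x_{K}], \quad x_{k} = \langle m_{k}, C_{k} \rangle,$$

$$\mathcal{M} = \{\text{vision}, \text{text}, \text{state}, \text{action}\}.$$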

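Intuitively, this step turns the problem "different robots have different action dimensions" into the question "is a given semantic slot used by the current embodiment?". For example, human wrist pose is aligned to the robot EEF subspace and finger articulation is mapped to fine-manipulation slots; Cartesian actions use relative delta displacement, rotations use axis-angle, and joint-space actions use absolute radians. The model therefore sees interpretable physical quantities rather than a platform-specific set of normalized indices.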
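To make the slot idea concrete, here is a minimal sketch of sparse slot routing for two embodiments. The slot names, widths, the 14-dimensional layout, and the `project` helper are all hypothetical and chosen only for illustration; they are not the schema used in the released code.

```python
# Hypothetical global slot layout: name -> (start, width). Not the released schema.
SLOTS = {
    "eef_delta_xyz": (0, 3),    # relative Cartesian displacement
    "eef_axis_angle": (3, 3),   # rotation as axis-angle
    "finger_joints": (6, 5),    # fine-manipulation slots
    "gripper": (11, 1),
    "base_velocity": (12, 2),
}
UNIFIED_DIM = 14


def project(raw: dict[str, list[float]]) -> tuple[list[float], list[bool]]:
    """Phi_e sketch: copy each named signal into its global slot, zero the rest."""
    vec = [0.0] * UNIFIED_DIM
    used = [False] * UNIFIED_DIM
    for name, values in raw.items():
        start, width = SLOTS[name]
        if len(values) != width:
            raise ValueError(f"{name}: expected {width} values, got {len(values)}")
        vec[start:start + width] = values
        used[start:start + width] = [True] * width
    return vec, used


# A human hand and a parallel-gripper arm share the wrist/EEF slots only.
human_hand = {
    "eef_delta_xyz": [0.01, 0.0, -0.02],
    "eef_axis_angle": [0.0, 0.0, 0.1],
    "finger_joints": [0.2, 0.4, 0.4, 0.3, 0.1],
}
gripper_arm = {
    "eef_delta_xyz": [0.01, 0.0, -0.02],
    "eef_axis_angle": [0.0, 0.0, 0.1],
    "gripper": [1.0],
}

human_vec, human_used = project(human_hand)
robot_vec, robot_used = project(gripper_arm)
shared = [i for i in range(UNIFIED_DIM) if human_used[i] and robot_used[i]]
print(shared)  # indices of slots that both embodiments populate
```

In this toy layout the two embodiments overlap exactly on the six EEF dimensions, which is where transfer between human wrist motion and robot end-effector motion would be expected to happen.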

3.4 Flow Matching, MoF, MPG, and UAC

Continuous action generation uses Flow Matching. For a target action $a_{i}$, noise $x_{0} \sim \mathcal{N}(0, I)$, and $t \in [0, 1]$, the paper defines the linear path:

$$x_{t} = (1 - t) x_{0} + t a_{i},$$

The ideal velocity field is:

$$u_{t}(x_{t}) = a_{i} - x_{0},$$

and the Flow Matching loss is:

$$\mathcal{L}_{FM} = \sum_{i \in \Omega_{FM}} \left\| v_{\theta}(x_{t}, t, c) - (a_{i} - x_{0}) \right\|_{2}^{2}.$$
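The following toy example works through the linear path and its target velocity on a 3-dimensional action. It is only a numeric illustration of the three formulas above; the helper names `interpolate`, `target_velocity`, and `fm_loss` are made up for this sketch, and no learned network is involved.

```python
import random


def interpolate(x0, a, t):
    """Linear path x_t = (1 - t) * x0 + t * a."""
    return [(1.0 - t) * n + t * ai for n, ai in zip(x0, a)]


def target_velocity(x0, a):
    """Ideal velocity u_t = a - x0, constant along the linear path."""
    return [ai - n for n, ai in zip(x0, a)]


def fm_loss(pred_velocity, x0, a):
    """Squared L2 error between a predicted velocity and the target a - x0."""
    u = target_velocity(x0, a)
    return sum((p - ui) ** 2 for p, ui in zip(pred_velocity, u))


random.seed(0)
a = [0.5, -0.2, 0.1]                        # toy target action
x0 = [random.gauss(0.0, 1.0) for _ in a]    # noise sample

# Euler-integrate the ideal velocity from t = 0 (noise) to t = 1 (action).
num_steps = 10
x = list(x0)
for _ in range(num_steps):
    u = target_velocity(x0, a)
    x = [xi + ui / num_steps for xi, ui in zip(x, u)]

endpoint_error = max(abs(xi - ai) for xi, ai in zip(x, a))
print(endpoint_error)  # close to zero: following u_t from x0 lands on a
```

Because the target velocity is constant along the path, even a few Euler steps recover the action when the velocity is exact, which is one reason a low number of denoising steps is plausible at deployment.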

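MoF splits the action expert into shared foundation layers and routed specialized experts. The former learn cross-embodiment motor primitives such as reaching, grasping, and collision avoidance; the latter use gating to handle embodiment/task-specific dynamics. The intuition is that robots share low-level physical regularities but differ in their higher-level action manifolds and constraints: if all layers are fully shared, inter-embodiment interference is likely; if all layers are fully isolated, transfer is lost.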
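A tiny sketch of the "shared layers plus routed experts" pattern is given below. The notes only state that specialized experts are selected by gating, so the softmax gate, the top-k selection, the residual connection, and the helper name `mof_layer` are assumptions of this sketch rather than the paper's exact design.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a dense layer given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mof_layer(x, shared_w, expert_ws, gate_w, top_k=1):
    """Shared transform followed by a gated mix of the top-k specialized experts."""
    h = linear(shared_w, x)                  # shared motor-primitive features
    probs = softmax(linear(gate_w, h))       # one routing score per expert
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = list(h)                            # residual path keeps the shared features
    for i in chosen:
        expert_out = linear(expert_ws[i], h)
        out = [o + (probs[i] / norm) * e for o, e in zip(out, expert_out)]
    return out, chosen


identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    [[0.5, 0.0], [0.0, 0.5]],   # expert 0: scales the features
    [[0.0, 1.0], [1.0, 0.0]],   # expert 1: swaps the features
]
gate = [[4.0, 0.0], [0.0, 4.0]]  # routes by the dominant feature

out_a, chosen_a = mof_layer([1.0, 0.0], identity, experts, gate)
out_b, chosen_b = mof_layer([0.0, 1.0], identity, experts, gate)
print(chosen_a, out_a)
print(chosen_b, out_b)
```

The two inputs pass through the same shared layer but are sent to different experts, which mirrors the trade-off described above: shared computation for transfer, routed computation to limit interference.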

MPG aims to keep the action flow from being pulled off the action manifold by anomalous observation features under sensory shift. Let $H$ be the suffix token features, $Z^{nf}$ the zero-noise action-token embeddings, and $\bar{Z} = \mathrm{MeanPool}(Z^{nf})$. The paper estimates the observation/action embedding discrepancy with the Sliced Wasserstein Distance:

$$D(\mu_{\hat{H}}, \mu_{\hat{Z}}) \approx \frac{1}{M} \sum_{m=1}^{M} \left\| \mathrm{sort}(\theta_{m}^{\top} \hat{H}) - \mathrm{sort}(\theta_{m}^{\top} \hat{Z}) \right\|_{2}^{2},$$

where $\hat{H} = \mathrm{LN}(E_{obs}(H))$ and $\hat{Z} = \mathrm{LN}(E_{act}(\bar{Z}))$. The gate is:

$$g = \exp(-D / \tau) \in (0, 1].$$

Unlike ordinary output gating, MPG scales the feature-conditioned residual before the projection and keeps an ungated learned prior offset; gate jitter therefore does not amplify the bias at the same time, and trajectories are smoother.
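The gate's behavior can be seen in a small numeric sketch: a sliced Wasserstein estimate between two equal-sized point sets, followed by $g = \exp(-D/\tau)$. The point sets, the shift sizes, and the helper names are hypothetical; the estimate also averages over points (not only over projections), which is a normalization choice of this sketch.

```python
import math
import random


def sliced_wasserstein_sq(points_a, points_b, num_proj=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    Both point sets must have the same size so sorted projections pair up.
    """
    if len(points_a) != len(points_b):
        raise ValueError("point sets must have equal size")
    rng = random.Random(seed)
    dim = len(points_a[0])
    total = 0.0
    for _ in range(num_proj):
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in theta)) or 1.0
        theta = [v / norm for v in theta]                     # random unit direction
        proj_a = sorted(sum(t * p for t, p in zip(theta, pt)) for pt in points_a)
        proj_b = sorted(sum(t * p for t, p in zip(theta, pt)) for pt in points_b)
        total += sum((x - y) ** 2 for x, y in zip(proj_a, proj_b)) / len(proj_a)
    return total / num_proj


def gate(distance, tau=1.0):
    """g = exp(-D / tau), which lies in (0, 1] for D >= 0."""
    return math.exp(-distance / tau)


rng = random.Random(1)
action_like = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]
in_dist = [[v + 0.05 for v in pt] for pt in action_like]   # small perturbation
shifted = [[v + 3.0 for v in pt] for pt in action_like]    # large sensory shift

g_near = gate(sliced_wasserstein_sq(in_dist, action_like))
g_far = gate(sliced_wasserstein_sq(shifted, action_like))
print(g_near, g_far)  # the gate closes as the discrepancy grows
```

When observation features stay close to the action prior the gate remains near 1 and the context residual passes through; under a large shift the gate shrinks toward 0 and the residual is suppressed.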

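UAC addresses the latency mismatch of action chunking. For embodiment $e$ with control period $\Delta t^{(e)}$ and expected inference latency budget $L^{(e)}$, the effective number of delay steps scales as:

$$\lceil L^{(e)} / \Delta t^{(e)} \rceil.$$

During training, a delay is sampled:

$$d \sim \pi^{(e)}(d), \quad d \in \{0, 1, \dots, d_{max}^{(e)} - 1\},$$

and the chunk is split into a committed prefix $A_{< d}$ and a predicted postfix $A_{\geq d}$; the loss is applied only to the postfix. At deployment, the paper requires:

$$d \geq \lceil t_{inference} / t_{control}^{(e)} \rceil + \epsilon_{safety}.$$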
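As a worked example of the two ceiling formulas, the sketch below computes the committed prefix length for one inference latency on two controllers. The 120 ms latency, the 50 ms and 20 ms control periods, the one-step safety margin, and the helper names are illustrative assumptions, not values from the paper.

```python
def delay_steps(latency_ms: int, control_period_ms: int) -> int:
    """Effective delay in control steps, ceil(L / dt), using integer arithmetic."""
    return -(-latency_ms // control_period_ms)


def min_committed_prefix(t_inference_ms: int, t_control_ms: int, safety_steps: int = 1) -> int:
    """Deployment bound d >= ceil(t_inference / t_control) + eps_safety."""
    return delay_steps(t_inference_ms, t_control_ms) + safety_steps


def split_chunk(chunk, d):
    """Split a chunk into the committed prefix A_{<d} and the postfix A_{>=d}."""
    d = max(0, min(d, len(chunk)))
    return chunk[:d], chunk[d:]


# Hypothetical setup: 120 ms inference, 20 Hz (50 ms) and 50 Hz (20 ms) controllers.
results = {}
for period_ms in (50, 20):
    d = min_committed_prefix(120, period_ms)
    prefix, postfix = split_chunk(list(range(16)), d)
    results[period_ms] = (d, len(prefix), len(postfix))
    print(period_ms, d, len(prefix), len(postfix))
```

The same latency commits more steps on the faster controller, so a 16-step chunk leaves a shorter postfix to refine. This is why the delay has to be embodiment-specific rather than a single global constant.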

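Figure 7 interpretation: The left half is MPG: the observation embedding and the action prior embedding are projected into the SWD space to obtain the discrepancy-guided gate $g$. The right half is UAC: the action chunk is split into prefix and postfix according to the embodiment-specific delay $d$; the prefix is already in the execution queue, so it must be locked during denoising and only the postfix is updated. The figure explains why Being-H0.5 does not focus only on offline benchmarks but also builds the latency and jitter of real control systems into its training and inference protocol.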

3.5 Real-world deployment infrastructure

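In real deployment, the robot executes continuously while the model produces chunks intermittently. Being-H0.5 uses a dual-thread buffer to decouple the inference thread from the execution thread. MPG matters most under low-step denoising, because real-time deployment usually cannot afford many refinement steps; UAC ensures that a new chunk only overwrites the postfix that has not yet been executed, which avoids control stutter.

Figure 8 interpretation: The deployment figure shows the closed loop from perception through policy inference to robot execution. The key point is the prefix commitment of the action buffer: actions already sent to the controller are never rewritten by new inference results, and only the postfix can be stitched back. This lets the same VLA adapt to different robot control frequencies.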
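The prefix-commitment rule can be sketched as a small lock-protected buffer shared by an execution thread and an inference thread. The class name `ActionBuffer`, its methods, and the step bookkeeping are hypothetical and are not taken from the released deployment code; the demo runs single-threaded so the outcome is deterministic.

```python
import threading


class ActionBuffer:
    """Pending actions shared by an execution thread and an inference thread."""

    def __init__(self, initial_chunk):
        self._lock = threading.Lock()
        self._actions = list(initial_chunk)  # pending actions, next one first
        self._cursor = 0                     # absolute index of the next step to run

    def pop_next(self):
        """Execution side: take the next action, or None if the buffer ran dry."""
        with self._lock:
            if not self._actions:
                return None
            self._cursor += 1
            return self._actions.pop(0)

    def stitch(self, new_chunk, chunk_start_step, committed_steps):
        """Inference side: overwrite only the uncommitted postfix.

        new_chunk[i] targets absolute step chunk_start_step + i. The first
        committed_steps pending actions are left untouched.
        """
        with self._lock:
            committed = min(committed_steps, len(self._actions))
            first_free = self._cursor + committed
            offset = first_free - chunk_start_step
            if offset < 0 or offset >= len(new_chunk):
                return False                 # chunk does not line up or arrived too late
            self._actions = self._actions[:committed] + list(new_chunk[offset:])
            return True


initial = [("old", s) for s in range(8)]       # covers absolute steps 0..7
buf = ActionBuffer(initial)
executed = [buf.pop_next() for _ in range(5)]  # steps 0..4 run while inference is busy

# Inference started from the observation at step 2, so its chunk targets steps 2..9.
new_chunk = [("new", s) for s in range(2, 10)]
stitched = buf.stitch(new_chunk, chunk_start_step=2, committed_steps=1)

while (action := buf.pop_next()) is not None:
    executed.append(action)
print(executed)
```

Step 5 is still served from the old chunk because it was committed, and the new chunk takes over from step 6 without a gap or a repeated step.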

3.6 Pseudocode (based on released code)

Differences between the paper and the released code: the paper's experiment text says LIBERO uses chunk size 8, 45k steps, and 4×A800, whereas the released example Being-H05/scripts/train/train_libero_example.sh uses ACTION_CHUNK_LENGTH=16, MAX_STEPS=60000, and 4 GPUs. The paper stresses that the Unified Action Space preserves raw physical magnitudes, but the released StateActionTransform also supports normalization modes such as quantile / meanstd / minmax / abs / binary, so whether values are normalized depends on the dataset config. In the code, BeingH fixes unified_state_dim=200 and unified_action_dim=200.

import torch


def unified_slot_project(raw_state, raw_action, embodiment_map, state_dim=200, action_dim=200):
    """Sketch of Phi_e used by Being-H05 data path.

    embodiment_map["state"] / ["action"] are lists of (src_slice, dst_slice) pairs.
    """
    # Unused slots stay zero, so every embodiment shares one fixed-width vector.
    state = torch.zeros(state_dim, dtype=raw_state.dtype, device=raw_state.device)
    action = torch.zeros(action_dim, dtype=raw_action.dtype, device=raw_action.device)
    for src_slice, dst_slice in embodiment_map["state"]:
        state[dst_slice] = raw_state[src_slice]
    for src_slice, dst_slice in embodiment_map["action"]:
        action[dst_slice] = raw_action[src_slice]
    return state, action

import torch


def beingh_flow_train_step(model, batch, use_training_time_rtc=False, max_delay=0):
    """Mirrors BeingH.forward: sample noise/time, encode action tokens, predict velocity."""
    actions = batch["padded_action"]          # [B * chunk, 200]
    states = batch["padded_state"]            # [B, 200]
    chunk = model.action_chunk_length
    B = actions.shape[0] // chunk
    noise = torch.randn_like(actions)

    base_t = model.sample_time(B, actions.device, actions.dtype)   # one time per sample, [B]
    t = base_t[:, None].expand(B, chunk).reshape(-1).clone()         # per-token time, [B * chunk]
    postfix_mask = torch.ones_like(t, dtype=torch.bool)
    if use_training_time_rtc:
        delays = torch.randint(0, max_delay + 1, (B,), device=actions.device)
        for b in range(B):
            lo = b * chunk
            d = min(int(delays[b]), chunk)
            t[lo:lo + d] = 1.0                 # clean committed prefix
            postfix_mask[lo:lo + d] = False    # no loss on the prefix

    noisy = (1.0 - t[:, None]) * noise + t[:, None] * actions
    velocity_target = actions - noise
    # Pass per-token times so prefix tokens (t = 1) and postfix tokens differ.
    # An encoder that accepts a [B, chunk] time tensor is an assumption of this sketch.
    action_features = model.action_encoder(noisy.view(B, chunk, -1), t.view(B, chunk))
    pred_velocity = model.forward_action_branch(states, action_features, batch)
    sq_err = (pred_velocity.reshape_as(actions) - velocity_target) ** 2
    return sq_err[postfix_mask].mean()         # loss on postfix tokens only

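import torch
import torch.nn.functional as F


def mpg_enhance_context(H, clean_action_emb, obs_proj, act_proj, residual, tau=1.0, lam=0.1, num_proj=32):
    """MPG logic reflected by BeingH model/layers.py: SWD gate + gated residual.

    Assumed shapes for this sketch: H is [B, N, D], clean_action_emb is [B, T, D].
    """
    z_bar = clean_action_emb.mean(dim=1, keepdim=True)        # mean-pooled prior, [B, 1, D]
    h_proj, z_proj = obs_proj(H), act_proj(z_bar)
    H_hat = F.layer_norm(h_proj, h_proj.shape[-1:])            # [B, N, d]
    Z_hat = F.layer_norm(z_proj, z_proj.shape[-1:])            # [B, 1, d]
    dirs = torch.randn(num_proj, H_hat.shape[-1], device=H.device, dtype=H_hat.dtype)
    dirs = dirs / dirs.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    # Sort along the token axis (not the batch axis) so each sample gets its own distance.
    h_sorted = torch.sort(H_hat @ dirs.T, dim=1).values        # [B, N, num_proj]
    z_sorted = torch.sort(Z_hat @ dirs.T, dim=1).values        # [B, 1, num_proj], broadcasts
    D = ((h_sorted - z_sorted) ** 2).mean(dim=(1, 2))          # per-sample discrepancy, [B]
    g = torch.exp(-D / tau)                                    # in (0, 1] because D >= 0
    return H + lam * g[:, None, None] * residual(H)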

import torch


@torch.no_grad()
def beingh_iterative_action_inference(model, batch, prev_chunk=None, prefix_mask=None, num_steps=10):
    """Mirrors BeingH.get_action: iterative velocity updates with optional prefix overwrite."""
    state = batch["padded_state"]
    B = state.shape[0]
    # Start from pure noise (t = 0) and integrate toward the clean action (t = 1).
    actions = torch.randn(B, model.action_chunk_length, model.unified_action_dim, device=state.device)
    has_prefix = prev_chunk is not None and prefix_mask is not None
    if has_prefix:
        # The committed prefix is already queued for execution, so lock it from the start.
        actions = torch.where(prefix_mask[..., None], prev_chunk, actions)
    dt = 1.0 / num_steps
    predicted_clean_emb = None
    for step in range(num_steps):
        t = torch.full((B,), step / num_steps, device=actions.device)
        action_features = model.action_encoder(actions, t)
        if model.use_mpg and predicted_clean_emb is not None:
            action_features = model.apply_mpg(state, action_features, predicted_clean_emb)
        pred_velocity = model.predict_velocity(batch, action_features, t)
        actions = actions + dt * pred_velocity   # Euler step along the flow
        if has_prefix:
            actions = torch.where(prefix_mask[..., None], prev_chunk, actions)
        # t = 1 is the clean (zero-noise) end of the path x_t = (1 - t) x_0 + t a.
        predicted_clean_emb = model.action_encoder(actions, torch.ones_like(t))
    return actions

Code reference: main @ 66b959e0 (2026-05-04). The pseudocode above and the mapping below are based on this commit.

Paper Concept	Source File	Key Class/Function
MoT + Being-H model wrapper	`Being-H05/BeingH/model/beingvla.py`	`BeingHConfig`, `BeingH.forward`, `BeingH.get_action`
Flow Matching action encoder/decoder	`Being-H05/BeingH/model/beingvla.py`, `Being-H05/BeingH/model/layers.py`	`ActionEncoder`, `SimpleMLP`, velocity target construction
Unified state/action dimensions	`Being-H05/BeingH/model/beingvla.py`	`unified_state_dim=200`, `unified_action_dim=200`
MPG / SWD gate	`Being-H05/BeingH/model/layers.py`, `Being-H05/BeingH/model/beingvla.py`	`SlicedWassersteinDistance`, MPG projection/enhancement logic
UAC / RTC-style delay	`Being-H05/BeingH/model/beingvla.py`	`use_training_time_rtc`, `simulated_delay`, prefix mask, inference prefix overwrite
Dataset and state/action transforms	`Being-H05/BeingH/dataset/datasets/vla_dataset.py`, `Being-H05/BeingH/dataset/transform/state_action.py`	`LeRobotIterableDataset`, `StateActionTransform`, `Normalizer`
Training launch config	`Being-H05/scripts/train/train_libero_example.sh`	4 GPUs, `MAX_STEPS=60000`, `ACTION_CHUNK_LENGTH=16`, MPG flags

4. Experimental Setup

Datasets

UniHand-2.0 pre-training recipe: 400M samples, 35,000+ hours, 30 embodiments; the paper's abstract and contributions break this down into 16,000 hours of human data and 14,000 hours of robot data, plus general visual-text understanding data.
Simulation: LIBERO and RoboCasa. LIBERO uses 4 suites (L-Spatial, L-Object, L-Goal, L-Long) with 50 evaluation episodes per task; RoboCasa uses 24 long-horizon household tasks in the Human-50 few-shot setting, with 50 human demonstrations per task, 5 held-out scenes, and 50 trials per task.
Real-world: 5 real platforms, listed in the paper as PND Adam-U, Franka+Inspire, Unitree G1, BeingBeyond D1, and LeRobot SO-101; 30 to 60 minutes of real-robot demonstrations are collected per task.

Baselines

On LIBERO, the comparison covers Diffusion Policy, OpenVLA, SpatialVLA, CoT-VLA, $\pi_{0}$-Fast, GR00T-N1, $\pi_{0}$, F1, InternVLA-M1, Discrete Diffusion VLA, $\pi_{0.5}$, OpenVLA-OFT, X-VLA, and EO1. On RoboCasa, it covers 3DA, DP3, GWM, BC, GR00T-N1, $\pi_{0.5}$, and $\pi_{0}$. Few-shot adaptation compares Native VLM initialization against Human-Centric pretraining initialization.

Metrics

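Success Rate (%): fraction of tasks completed; simulation tables average it per suite/category.
LIBERO Avg. / RoboCasa Total Avg.: average success rate across suites or task categories.
Few-shot adaptation gain $\Delta$: absolute percentage-point improvement of Human-Centric initialization over Native VLM initialization.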

Training config

Paper experiment configuration:

LIBERO: RGB-only 224×224, 2B backbone, multi-view RGB observations, action chunk size 8, packed sequences of 7,680 tokens/GPU, effective batch size 128; the specialist trains for 45k steps on 4×A800; the generalist trains on a LIBERO+RoboCasa joint mixture for about 2× the steps; evaluation uses 500 trials per suite.
RoboCasa: RGB-only 224×224, without depth or point clouds; the specialist uses the same training budget as LIBERO and the generalist uses 2× the steps.

Released code config anchor:

Config path: Being-H05/scripts/train/train_libero_example.sh
Hardware / launch: NUM_GPUS=4, torchrun --nproc_per_node=4
Optimization: MAX_STEPS=60000, LEARNING_RATE=1e-4, WEIGHT_DECAY=1e-5, WARMUP_RATIO=0.05, lr_scheduler=cosine, gradient_accumulation_steps=1
Input / sequence: FORCE_IMAGE_SIZE=224, MAX_NUM_TOKENS=8704, EXPECTED_NUM_TOKENS=8192, ATTN_MODE=causal
Action / MPG / RTC: ACTION_CHUNK_LENGTH=16, USE_MPG=True, MPG_LAMBDA=0.1, MPG_NUM_PROJECTIONS=32, MPG_REFINEMENT_ITERS=1, USE_TRAINING_TIME_RTC=False, SIMULATED_DELAY=0
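To see what the optimization settings imply in practice, the sketch below evaluates a learning-rate curve from MAX_STEPS, LEARNING_RATE, and WARMUP_RATIO. It assumes the common "linear warmup, then cosine decay to zero" form; the exact scheduler used by the released trainer is not confirmed here (it may, for instance, use a nonzero floor), and `lr_at` is a hypothetical helper.

```python
import math

MAX_STEPS = 60_000
LEARNING_RATE = 1e-4
WARMUP_RATIO = 0.05
WARMUP_STEPS = round(WARMUP_RATIO * MAX_STEPS)  # 5% of the run


def lr_at(step: int) -> float:
    """Learning rate at a given step under linear warmup + cosine decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return 0.5 * LEARNING_RATE * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, MAX_STEPS // 2, MAX_STEPS):
    print(step, lr_at(step))
```

Under these assumptions the warmup lasts 3,000 of the 60,000 steps, after which the rate decays smoothly from 1e-4 toward zero.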

5. Experimental Results

Main simulation results

LIBERO success rate (%)

Method	L-Spatial	L-Object	L-Goal	L-Long	Avg.
GR00T-N1	94.4	97.6	93.0	90.6	93.9
$π_{0.5}$	98.8	98.2	98.0	92.4	96.9
X-VLA	98.2	98.6	97.8	97.6	98.1
EO1	99.7	99.8	99.2	94.8	98.2
Being-H0.5 (generalist)	97.0	98.2	99.0	96.2	97.6
Being-H0.5 (specialist)	99.2	99.6	99.4	97.4	98.9

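The Being-H0.5 specialist reaches a 98.9% average, with 97.4% on LIBERO-Long. The generalist, a single checkpoint jointly trained on LIBERO+RoboCasa, still reaches a 97.6% average, which indicates that competition for capacity across benchmarks does not significantly hurt LIBERO performance.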
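As a quick consistency check on the LIBERO table, the short script below recomputes each row's average as the plain mean of the four suite columns. That the Avg. column is an unweighted mean is an assumption of this check. With the values as listed, every row agrees to one-decimal rounding except EO1, whose four listed values average 98.375 against a listed 98.2; the notes keep the listed value, and the source gives no explanation for the gap.

```python
# (L-Spatial, L-Object, L-Goal, L-Long), listed Avg. as they appear in the table.
LIBERO = {
    "GR00T-N1": ((94.4, 97.6, 93.0, 90.6), 93.9),
    "pi_0.5": ((98.8, 98.2, 98.0, 92.4), 96.9),
    "X-VLA": ((98.2, 98.6, 97.8, 97.6), 98.1),
    "EO1": ((99.7, 99.8, 99.2, 94.8), 98.2),
    "Being-H0.5 (generalist)": ((97.0, 98.2, 99.0, 96.2), 97.6),
    "Being-H0.5 (specialist)": ((99.2, 99.6, 99.4, 97.4), 98.9),
}
TOLERANCE = 0.051  # allows for one-decimal rounding of the listed average

mismatches = []
for method, (suites, listed) in LIBERO.items():
    mean = sum(suites) / len(suites)
    if abs(mean - listed) > TOLERANCE:
        mismatches.append(method)
    print(f"{method}: mean={mean:.3f} listed={listed}")
print(mismatches)
```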

RoboCasa 24-task success rate (%)

Modality	Method	Pick & Place	Doors/Drawers	Others	Total Avg.
3D	3DA	0.0	2.3	13.1	5.5
3D	DP3	1.5	41.7	32.0	22.8
3D	GWM	14.8	54.3	49.8	39.3
RGB	BC	4.3	47.0	42.2	28.9
RGB	GR00T-N1	18.6	50.2	39.1	36.0
RGB	$π_{0.5}$	21.5	57.8	44.9	41.4
RGB	$π_{0}$	14.0	53.1	58.5	42.4
RGB	Being-H0.5 (generalist)	40.0	73.0	52.0	53.3
RGB	Being-H0.5 (specialist)	36.0	71.7	57.6	53.9

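The key RoboCasa takeaway is that, using only 224×224 RGB, Being-H0.5 surpasses the 3D/multimodal baselines: the specialist reaches a 53.9% total average and the generalist holds 53.3%. This supports the paper's claim that the Unified Action Space combined with human-centric pretraining can handle long-horizon kitchen interactions.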

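Figure 9 interpretation: The figure shows simulation trajectories and benchmark performance trends, complementing the averages in the tables. It emphasizes that Being-H0.5's advantage is larger on tasks that need multi-step spatial reasoning and long-range action coherence, rather than only on short-horizon pick-and-place.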

Human-centric pretraining ablations

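Single-task LIBERO 5-shot adaptation: with Und+Proj+ViT frozen, Human-Centric initialization reaches Avg. 69.0% versus 57.9% for Native VLM, a gain of +11.1; with Und+ViT frozen the gain is largest, with Avg. rising from 51.3% to 77.1%, $\Delta = +25.8$; with Full FT, Avg. rises from 77.2% to 81.8%, $\Delta = +4.6$.

Multi-task-suite LIBERO 5-shot adaptation: with Und+Proj+ViT frozen, Avg. rises from 60.7% to 72.4%, $\Delta = +11.7$; with Und Only frozen, Avg. rises from 72.8% to 82.1%, $\Delta = +9.3$; with Full FT, Avg. rises from 84.1% to 85.1%, $\Delta = +1.0$.

These results indicate that human-centric pretraining is most valuable in parameter-constrained, low-shot settings; when full fine-tuning unlocks all parameters, the initialization advantage remains but shrinks.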
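The gains quoted above follow directly from the definition of $\Delta$ as an absolute percentage-point difference. The short script below recomputes them from the averages given in the text; the setting labels and the `gains` helper are naming choices of this sketch.

```python
# (setting, Native VLM Avg. %, Human-Centric Avg. %) as quoted in the text.
single_task = [
    ("Und+Proj+ViT frozen", 57.9, 69.0),
    ("Und+ViT frozen", 51.3, 77.1),
    ("Full FT", 77.2, 81.8),
]
multi_suite = [
    ("Und+Proj+ViT frozen", 60.7, 72.4),
    ("Und Only frozen", 72.8, 82.1),
    ("Full FT", 84.1, 85.1),
]


def gains(rows):
    """Absolute percentage-point gain of Human-Centric over Native VLM init."""
    return {name: round(human - native, 1) for name, native, human in rows}


single_gains = gains(single_task)
multi_gains = gains(multi_suite)
print(single_gains)
print(multi_gains)
```

In both settings the smallest gain belongs to Full FT, which matches the reading that the pretraining prior matters most when few parameters are allowed to adapt.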

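Figure 12 interpretation: The more Action Expert layers are frozen, the harder it is for the model to turn the human-centric prior into actions for the target embodiment. The paper notes that freezing 0 to 7 layers has little effect, with most suites still above 80%; beyond 14 layers performance drops quickly, and with the expert fully frozen it falls below 20%. This supports the MoF design of shared lower motor primitives plus trainable upper specialization.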

Real-world deployment

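Figure 10 interpretation: Real-robot results are shown along different capability dimensions; Being-H0.5 outperforms both $\pi_{0.5}$ and the variant without pretraining on Spatial, Long Horizon, Bimanual, Generalization, and other dimensions. The figure corresponds to the paper's central extrapolation: human-centric data and the Unified Action Space bring practical deployment benefits for cross-embodiment few-shot adaptation.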

Limitations

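The paper does not open-source all real-deployment details and the full 35K hours of data to a level that allows the experiments to be reproduced. The released repo provides the training/inference pipeline, simulation scripts, and entry points for weight and data releases, but a full reproduction of the paper-level pretraining recipe and of the real-robot deployment infrastructure still depends on external assets. Another limitation is that the UAS requires embodiment slot mappings to be defined or maintained by hand; when the action semantics of new hardware do not fall naturally into existing slots, the global state/action schema may still need to be extended.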
