UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Authors: Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia Affiliations: HKUST, CUHK, Tsinghua University, Kling Team (Kuaishou Technology) arXiv: 2512.07831 Project Page: jackailab.github.io/Projects/UnityVideo GitHub: dvlab-research/UnityVideo Year: 2024

Figure 1 解读: UnityVideo 是一个统一的多任务多模态视频生成框架，支持三大能力：(1) 联合生成 (Joint Generation) — 从文本同时生成 RGB 视频和任意辅助模态（DensePose, Depth, Skeleton, Segmentation）；(2) 模态估计 (Estimator) — 从参考视频估计深度、光流、DensePose、骨架、分割等；(3) 可控生成 (Controllable Generation) — 从深度图、骨架、光流等条件生成视频。右上角雷达图显示 UnityVideo 在多项指标上全面超越 VACE、Full-DiT、Aether、Geo4D、SeC 等方法。

1. Motivation (研究动机)

单模态训练的局限性: 当前视频生成模型主要以 RGB 视频为单一训练目标，缺乏对深度 (depth)、光流 (optical flow)、分割 (segmentation)、骨架 (skeleton)、DensePose 等辅助模态的联合学习，限制了模型对物理世界的全面理解。
跨模态交互不足: 现有方法通常将模态视为单向条件（如从条件模态生成 RGB，或从 RGB 估计模态），缺乏双向交互和互补增强。
缺乏统一训练范式: 不同任务（Text-to-Video 生成、可控生成、模态估计）使用不同的训练流程，难以实现跨任务知识迁移，且逐阶段训练容易产生灾难性遗忘。
类比 LLM 的统一学习: 正如 LLM 通过统一自然语言、代码、数学等文本子模态实现了 emergent reasoning，视频生成领域同样有望通过统一多种视觉子模态来增强世界建模能力。

2. Idea (核心思想)

UnityVideo 的核心思想是：在单个 diffusion transformer 中统一多模态、多任务的视频生成与理解，通过动态噪声注入策略和模态自适应架构实现任务统一与模态统一，从而加速收敛并增强 zero-shot 泛化能力。

与现有方法的根本区别在于：(1) 不是将不同模态和任务分开训练或顺序训练，而是在每个 batch 中同时采样三种任务（条件生成、模态估计、联合生成）；(2) 通过 In-Context Learner 和 Modality-Adaptive Switcher 两个互补设计，在语义层面和架构层面同时实现模态区分。

3. Method (方法)

3.1 整体框架

Figure 3 解读: 左侧展示了 UnityVideo 的整体架构。RGB 视频 $V_{r}$ 和辅助模态 $V_{m}$ 通过共享 VAE 编码后，经各自的 Expert Input Layer 进入共享的 DiT blocks。DiT block 内部包含 Self-Attention（RGB 和模态 token 在 width 维度拼接后交互）、双分支 Cross-Attention（分别处理内容 prompt $C_{r}$ 和模态类型 prompt $C_{m}$ ）、以及 Modality-Aware AdaLN Table 实现模态级调制。右侧展示了四种训练/推理模式：(A) 条件生成、(B) 模态估计、(C) 联合生成、(D) Any2Any 推理。

3.2 统一多任务 (Unifying Multiple Tasks)

框架支持三种训练范式，通过动态噪声调度 (Dynamic Noise Scheduling) 统一：

任务	RGB token 噪声	模态 token 噪声	描述
条件生成 (Conditional Generation)	$t \sim [0, 1]$	$t = 0$ (clean)	从辅助模态生成 RGB
模态估计 (Video Estimation)	$t = 0$ (clean)	$t \sim [0, 1]$	从 RGB 估计辅助模态
联合生成 (Joint Generation)	独立加噪	独立加噪	从文本同时生成 RGB 和模态

Dynamic Task Routing: 每个 iteration 按概率 $p_{cond}, p_{est}, p_{joint}$ 采样一种任务类型，其中 $p_{cond} < p_{est} < p_{joint}$ （按学习难度逆序分配概率），避免顺序训练导致的灾难性遗忘。

3.3 统一多模态 (Unifying Multiple Modalities)

两个互补设计解决模态区分问题：

In-Context Learner

利用模型的上下文推理能力，注入模态类型文本 prompt $C_{m}$ （如 “depth map”, “human skeleton”）而非内容描述 caption $C_{r}$ 。通过双分支 Cross-Attention 分别处理：

V_{r}^{'} = CrossAttn (V_{r}, C_{r}) (RGB 特征 + 内容 caption)

V_{m}^{'} = CrossAttn (V_{m}, C_{m}) (模态特征 + 模态类型描述)

这种设计使模型学习模态级别的语义，例如训练时见到 “two persons” 可泛化到分割任务中的 “two objects”。

Modality-Adaptive Switcher

引入可学习的模态 embedding 列表 $L_{m} = {L_{1}, L_{2}, \dots, L_{k}}$ ，在每个 DiT block 的 AdaLN-Zero 层中生成模态特定的调制参数：

γ_{m}, β_{m}, α_{m} = MLP (L_{m} + t_{emb})

其中 $L_{m} \in P_{m}$ 是模态 embedding， $t_{emb}$ 是时间步 embedding。这实现了架构级别的模态区分，支持推理时即插即用的模态选择。

此外，每种模态有独立的 Expert Input Layer 和 Expert Output Layer（编码和预测头），减少模态混淆。

3.4 训练策略 (Training Strategy)

Curriculum Learning

分两组模态、两阶段训练：

Stage 1: 仅训练像素对齐模态（depth, optical flow, DensePose），使用精选的单人数据集（500K clips, 16K steps），建立空间对应基础。
Stage 2: 加入所有模态（额外包括 segmentation, skeleton），使用完整 1.3M 数据集（40K steps），覆盖人体场景和通用场景。

OpenUni 数据集

Figure 4 解读: OpenUni 数据集包含 1.3M 对多模态视频数据，来源包括 Koala36M (~500K, depth/optical flow), 单人数据 (~400K, DensePose/depth/flow/skeleton), OpenS2V (~300K, depth/optical flow), 多人数据 (~100K, segmentation/skeleton/DensePose)。数据经过 OCR 过滤、运动滤波、美学评分等质量筛选，使用 Depth Anything V2、RAFT、SAM、DWPose、DensePose (Meta) 等预训练模型提取多模态标注。

3.5 训练目标 (Training Objective)

基于 Conditional Flow Matching [Lipman et al., ICLR 2023]，三种模式的损失函数：

条件生成:

L_{cond} (θ; t) = E [∥ u_{θ} (r_{t}, [m_{0}, c_{txt}], t) - v_{r} ∥^{2}]

模态估计:

L_{est} (θ; t) = E [∥ u_{θ} (m_{t}, r_{0}, t) - v_{m} ∥^{2}]

联合生成:

L_{joint} (θ; t) = E [∥ u_{θ} ([r_{t}, m_{t}], c_{txt}, t) - [v_{r}, v_{m}] ∥^{2}]

其中 $r_{t} = (1 - t) r_{0} + t r_{1}$ , $m_{t} = (1 - t) m_{0} + t m_{1}$ 是 flow matching 的插值 latent， $v_{r} = r_{1} - r_{0}$ , $v_{m} = m_{1} - m_{0}$ 是速度场， $r_{0}, m_{0}$ 是 clean latent， $r_{1}, m_{1} \sim N (0, I)$ 是高斯噪声。

3.6 关键组件伪代码

Dynamic Task Routing

def dynamic_task_routing(batch, p_cond, p_est, p_joint):
    # 注：论文未公开具体概率值，仅说明 p_cond < p_est < p_joint
    """每个 iteration 随机选择一种任务类型"""
    task = random.choices(
        ['cond', 'est', 'joint'],
        weights=[p_cond, p_est, p_joint]
    )[0]
 
    t = random.uniform(0, 1)  # 采样时间步
 
    if task == 'cond':
        # 条件生成: RGB 加噪, 模态 clean
        rgb_t = (1 - t) * rgb_clean + t * noise_rgb
        modal_t = modal_clean  # t=0
        target = velocity_rgb
    elif task == 'est':
        # 模态估计: RGB clean, 模态加噪
        rgb_t = rgb_clean  # t=0
        modal_t = (1 - t) * modal_clean + t * noise_modal
        target = velocity_modal
    else:
        # 联合生成: 两者独立加噪
        t_rgb, t_modal = random.uniform(0,1), random.uniform(0,1)
        rgb_t = (1 - t_rgb) * rgb_clean + t_rgb * noise_rgb
        modal_t = (1 - t_modal) * modal_clean + t_modal * noise_modal
        target = concat(velocity_rgb, velocity_modal)
 
    return rgb_t, modal_t, target, task

Modality-Adaptive Switcher (AdaLN with Modality Embedding)

class ModalityAdaptiveAdaLN(nn.Module):
    def __init__(self, hidden_dim, num_modalities):
        super().__init__()
        # 可学习的模态 embedding 表
        self.modality_embeddings = nn.Embedding(num_modalities, hidden_dim)
        # AdaLN MLP: 输入 = modality_emb + timestep_emb
        self.adaln_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(hidden_dim, 3 * hidden_dim)  # gamma, beta, alpha
        )
 
    def forward(self, x, t_emb, modality_id):
        """
        x: [B, N, D] 模态 token 特征
        t_emb: [B, D] 时间步 embedding
        modality_id: int, 模态索引
        """
        L_m = self.modality_embeddings(modality_id)  # [D]
        condition = L_m + t_emb  # [B, D]
        gamma, beta, alpha = self.adaln_mlp(condition).chunk(3, dim=-1)
 
        # AdaLN-Zero 调制
        x = gamma * layer_norm(x) + beta
        return x, alpha  # alpha 用于残差连接的 gate

In-Context Learner (Dual-Branch Cross-Attention)

class InContextLearner(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.cross_attn_rgb = CrossAttention(hidden_dim)    # RGB 分支
        self.cross_attn_modal = CrossAttention(hidden_dim)  # 模态分支
 
    def forward(self, V_r, V_m, C_r, C_m):
        """
        V_r: RGB token, V_m: 模态 token
        C_r: 内容描述 caption embedding (e.g., "A cat running...")
        C_m: 模态类型描述 embedding (e.g., "depth map")
        """
        V_r_prime = self.cross_attn_rgb(query=V_r, key_value=C_r)
        V_m_prime = self.cross_attn_modal(query=V_m, key_value=C_m)
        return V_r_prime, V_m_prime

Curriculum Training Pipeline

def train_unityvideo(model, dataset_stage1, dataset_stage2):
    # Stage 1: 像素对齐模态 + 单人数据
    pixel_aligned_modalities = ['depth', 'optical_flow', 'densepose']
    for step in range(16000):
        batch = sample_batch(dataset_stage1, modalities=pixel_aligned_modalities)
        rgb_t, modal_t, target, task = dynamic_task_routing(batch)
        pred = model(rgb_t, modal_t, task)
        loss = mse_loss(pred, target)
        loss.backward(); optimizer.step()
 
    # Stage 2: 所有模态 + 完整数据
    all_modalities = ['depth', 'optical_flow', 'densepose', 'segmentation', 'skeleton']
    for step in range(40000):
        # 均匀采样模态和数据源, 每 batch 分为 4 组
        batch = sample_balanced_batch(dataset_stage2, modalities=all_modalities)
        rgb_t, modal_t, target, task = dynamic_task_routing(batch)
        pred = model(rgb_t, modal_t, task)
        loss = mse_loss(pred, target)
        loss.backward(); optimizer.step()

3.7 Code-to-Paper 映射表

Paper 概念	代码对应	说明
Dynamic Noising (Sec 3.1)	`dynamic_task_routing()`	根据任务类型决定 RGB/模态 token 的噪声策略
In-Context Learner (Sec 3.2)	`InContextLearner` - dual cross-attn	双分支 cross-attention，分别处理内容和模态类型 prompt
Modality-Adaptive Switcher (Sec 3.2)	`ModalityAdaptiveAdaLN`	可学习模态 embedding + AdaLN 调制
Modality Expert I/O (Sec 3.2)	Expert Input/Output Layer	每种模态独立的编码头和预测头
Curriculum Learning (Sec 3.3)	两阶段训练流程	Stage 1 像素对齐模态 → Stage 2 全模态
Flow Matching Loss (Sec 3.4)	$L_{cond}, L_{est}, L_{joint}$	三种模式的速度场回归损失

4. Experimental Setup (实验设置)

模型配置

Backbone: 内部 DiT 架构，10B 参数
训练: Batch size 32, 学习率 $5 \times 1 0^{- 5}$
推理: 50 DDIM steps, CFG scale 7.5

训练数据 (OpenUni)

总计 1.3M 视频 clips，5 种模态
来源: 370K 单人 + 97K 双人 + 489K Koala36M + 344K OpenS2V

评测基准

基准	用途	规模
VBench	Text-to-Video 质量评估	公开 benchmark
UniBench (本文提出)	统一多模态评估	200 UE 合成 + 200 真实视频

Baselines

Text-to-Video: Kling1.6, OpenSora2, HunyuanVideo-13B, Wan2.1-14B, Aether
可控生成: VACE, Full-DiT
深度估计: DepthCrafter, Geo4D, Aether
视频分割: SAMWISE, SeC

评测指标

视频质量: Subject Consistency, Background Consistency, Aesthetic Quality, Overall Consistency, Dynamic Degree
深度估计: Abs Rel $↓$ , $δ < 1.25$ $↑$
视频分割: mIoU $↑$ , mAP $↑$

5. Experimental Results (实验结果)

5.1 主实验结果 (Table 1)

Figure 5 解读: 与 SOTA 方法在多种任务上的定性比较。(A) T2V 中 UnityVideo 展现更强的物理理解（如折射现象）；(B) 可控生成中更忠实地遵循深度引导；(C) 分割任务中有更细粒度的识别能力；(D) 3D 点云重建更准确；(E) 泛化性——仅在人体数据上训练却能泛化到新物体。

任务	模型	BG Consist.	Aesthetic	Overall Consist.	Dynamic Deg.	mIoU	mAP	Abs Rel	$δ < 1.25$
T2V	Kling1.6	95.33	60.48	21.76	47.05	-	-	-	-
T2V	Wan2.1-14B	96.78	63.66	21.53	34.31	-	-	-	-
T2V	Aether	95.78	48.25	20.26	37.32	-	-	0.025	97.95
ControlGen	Full-DiT	95.58	54.82	20.12	49.50	-	-	-	-
Depth Est.	Geo4D	-	-	-	-	-	-	0.053	97.94
Seg.	SeC	-	-	-	-	65.52	22.23	-	-
UnityVideo (ControlGen)	Ours	97.44	54.63	23.57	47.76	-	-	-	-
UnityVideo (JointGen)	Ours	97.97	64.12	23.75	64.42	68.82	23.25	0.022	98.98

关键发现：

Text-to-Video: 联合生成模式在所有指标上取得最佳结果，归功于多任务联合训练带来的协同增强
可控生成: Background Consistency 和 Dynamic Degree 上显著优于 VACE 和 Full-DiT
深度估计: Abs Rel 0.022 优于 Geo4D 的 0.053， $δ < 1.25$ 达到 98.98%
视频分割: mIoU 68.82 和 mAP 23.25 均超过专门的分割模型 SAMWISE 和 SeC

5.2 消融实验

多模态 vs 单模态训练 (Table 2)

Figure 6 解读: 单模态训练 vs 统一多模态训练的对比。统一学习提供互补监督，同时增强运动理解和几何感知。

配置	Subject Consist.	BG Consist.	Imaging Quality	Overall Consist.
Baseline	96.51	96.06	64.99	23.17
Only Flow	97.82	97.14	67.34	23.70
Only Depth	98.13	97.29	69.09	23.48
Ours (Flow modality)	97.97 (+1.46)	97.19 (+1.13)	69.36 (+4.37)	23.74 (+0.57)
Ours (Depth modality)	98.01 (+1.50)	97.24 (+1.18)	69.18 (+4.19)	23.75 (+0.58)

结论：联合训练多种模态带来的提升是互相增强的，不是简单叠加。

多任务 vs 单任务训练 (Table 3)

配置	Subject Consist.	BG Consist.	Temporal Flicker.	Motion Smooth.
Baseline	96.51	96.06	98.73	99.30
Only ControlGen	96.53	95.58	98.45	99.28
Only JointGen	98.01	97.24	99.10	99.44
Ours-ControlGen	96.53 (+0.02)	96.08 (+0.02)	98.79 (+0.06)	99.38 (+0.08)
Ours-JointGen	97.94 (+1.43)	97.18 (+0.63)	99.13 (+0.40)	99.48 (+0.18)

结论：单独训练 ControlGen 反而性能下降，但统一多任务训练恢复并超越了 baseline。

架构设计消融 (Table 4)

配置	Subject Consist.	BG Consist.	Temporal Flicker.	Motion Smooth.
Baseline	96.51	96.06	98.73	99.30
w/ In-Context Learner	97.92	97.08	99.04	99.42
w/ Modality Switcher	97.94	97.18	99.13	99.48
Ours (Both)	98.31	97.54	99.35	99.54

结论：两种设计互补，组合后取得最大增益。

5.3 收敛分析

Figure 2 解读: 统一多模态多任务联合训练（Unified, 蓝线）在 RGB 视频生成上的 training loss 最低，收敛最快，优于单模态联合训练（Flow, Depth, DensePose）和 RGB SFT baseline。这验证了多模态监督能加速和改善 RGB 视频生成的学习。

5.4 用户研究 (Table 5)

模型	Physical Quality	Semantic Quality	Overall Pref.
Kling1.6	10.15	21.25	20.20
HunyuanVideo	24.15	26.10	20.35
Wan2.1	27.20	22.40	27.65
Ours	38.50	30.25	31.80

结论：UnityVideo 在物理质量、语义质量和总体偏好上均获得最高胜率。

5.5 可扩展性 (Appendix Table 3)

随着模态数量从 1 (depth) → 3 (depth+flow+DensePose) → 5 (全部) 增加，Joint Generation 和 Control Generation 的所有指标单调递增，证明框架能有效吸收更多模态的监督信号而不产生负面干扰。

5.6 局限性

当前 VAE 偶尔引入重建伪影
未扩展到更大 backbone 或更多视觉模态
模态数量增多时偶尔出现模态混淆（通过 modality-specific output layers 缓解）

Paper Notes

探索