Yume-1.5: A Text-Controlled Interactive World Generation Model

Authors: Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang Affiliations: Shanghai AI Laboratory, Fudan University, Shanghai Innovation Institute arXiv: 2512.22096 Project Page: stdstu12.github.io/YUME-Project GitHub: stdstu12/YUME

训练推理代码都已经开源,提供 Yume-5B-720P 和 Yume-I2V-540P-14B 权重


1. Motivation (研究动机)

现有交互式世界生成方法面临三大瓶颈:

  • 有限的泛化能力(Limited Generalizability):MatrixGame、WORLDMEM 等方法主要在游戏数据集上训练,存在严重的 domain gap,难以生成逼真的动态城市街景等真实场景
  • 过高的生成延迟(High Generation Latency):扩散模型的高计算量阻碍了实时连续生成。随着视频时长增加,历史上下文不断膨胀,推理速度急剧下降——无法支持无限探索
  • 文本控制能力不足(Insufficient Text Control):现有方法仅支持键盘/鼠标控制相机运动,无法通过文本指令触发场景中的事件(如”一只鬼出现了”),交互维度受限

Figure 1 解读:展示 Yume-1.5 的三种交互生成模式。上行:Text-to-World——从文本描述”A stylish woman walks down a Tokyo street…”出发,通过 WASD 键盘控制人物移动和方向键控制相机旋转,自回归生成连续的街景探索视频。中行:Image-to-World——从一张静态图像出发生成可探索世界。下行:Event Editing——在生成过程中通过文本描述”A ghost appeared”触发场景事件。每帧下方的控制面板显示了当前的键盘输入状态。

2. Idea (核心思想)

Yume-1.5 的核心 insight:通过联合时空-通道压缩(TSCM)将不断增长的历史上下文压缩到固定大小,结合 Self-Forcing 蒸馏实现少步推理,再通过文本分解(Event Description + Action Description)引入事件控制能力

具体做法分三个维度:

  1. TSCM(Joint Temporal-Spatial-Channel Modeling):对历史帧做两级压缩——先在时空维度随机采样 + 高压缩率 Patchify(越远的帧压缩越狠),再在通道维度进一步降维到 96 维,用 linear attention 与当前预测帧融合。这样无论历史多长,输入 token 数基本恒定
  2. Self-Forcing + DMD 加速:Generator 用自身生成的帧(而非 GT)作为历史上下文进行训练,消除 train-inference mismatch;然后通过 Distribution Matching Distillation 将多步扩散蒸馏为 4 步推理
  3. 文本事件控制:将 caption 分解为 Event Description(描述场景/事件,仅首帧编码一次)和 Action Description(描述键盘动作,有限集合可预计算缓存),通过 T5 分别编码后拼接,实现低开销的文本控制

与 Self-Forcing、CausVid 的根本区别:不使用 KV cache,而是用 TSCM 压缩历史帧后通过 linear attention 融合,推理速度不随视频长度增长

3. Method (方法)

3.1 整体架构

Figure 3 解读:Yume-1.5 的四个核心组件。(a) DiT Block:标准 self-attention 处理当前预测帧,cross-attention 处理文本条件,底部增加 linear attention 分支融合历史压缩 token ——先将预测帧 token 通过 FC 降维,与 拼接后用 linear attention 融合,结果通过 FC 恢复维度后 element-wise 加回 。(b) Training pipeline:视频经 WanEncoder 编码后与条件图像 拼接,Caption 分解为 Event Description 和 Action Description 分别通过 T5 编码后拼接为 ,与时间步嵌入一起送入 N 个 DiT Block。(c) History Tokens Downsampling:对不同时间距离的历史帧施加不同压缩率——最近 2 帧 (1,2,2),稍远帧 (1,4,4),更远帧 (1,8,8),以此类推,通过插值 Patchify 权重实现。(d) Inference:chunk-based 自回归生成,每次预测新 chunk 时将之前所有生成帧压缩为低压缩率(近期)和高压缩率(远期)两种 memory。

3.2 基础模型(Architecture Preliminary)

基于 Wan2.2-5B(T2V + I2V 统一模型):

  • T2V 模式:噪声 ,文本嵌入 送入 DiT backbone
  • I2V 模式:条件图像/视频 zero-pad 到与 同尺寸,构建 binary mask ,融合为
  • 文本编码策略:与 Wan 不同,不直接把整个 caption 送 T5,而是拆分为 Event Description 和 Action Description 分别编码后拼接。Action Description 集合有限(WASD 组合),可预计算缓存,大幅降低 T5 推理开销
  • 训练使用 Rectified Flow loss

3.3 TSCM: 联合时空-通道压缩

核心问题:自回归生成中历史帧 的帧数 不断增长,标准 attention 的计算量无法承受。现有方案(滑窗截断、FramePack 压缩、相机轨迹检索)各有缺陷。

Step 1: Temporal-Spatial Compression(时空压缩)

对历史帧 做两步处理:

  1. 随机帧采样:32 帧中只采 1 帧
  2. 自适应 Patchify 压缩:根据时间距离施加不同压缩率

表示时间维 、空间高 、空间宽 下采样。通过插值 Patchify 权重实现不同压缩率,无需额外参数。

压缩后的表示 与当前预测帧 (用原始 Patchify 处理)拼接后送入 DiT Block 的 standard self-attention。

Step 2: Channel Compression(通道压缩)

在时空压缩基础上进一步压缩:

  1. 用压缩率 的 Patchify 将 压缩,通道维降至 96,得到
  2. DiT Block 内部:预测帧 token 经 cross-attention 后通过 FC 降维,与 拼接得到
  3. 通过 Linear Attention 融合:

其中 是 ReLU 激活函数。Linear attention 的计算量对 token 数线性,对通道维敏感——所以先压缩通道到 96 再用 linear attention。

  1. 融合结果 通过 FC 恢复通道维度,element-wise 加回

设计哲学:Standard attention 对 token 数敏感 → 时空压缩减少 token 数;Linear attention 对通道维敏感 → 通道压缩降低维度。两者互补。

3.4 Self-Forcing + DMD 实时加速

Figure 4 解读:左(Generator):自回归生成过程。第一个 chunk 从首帧出发,后续 chunk 将自身生成的帧(非 GT)经 TSCM 压缩后作为历史上下文——这就是 Self-Forcing 的核心,用自身输出替代 GT 消除 train-inference mismatch。右(Distillation):Real Model(teacher)处理真实数据计算 score ,Fake Model(student)处理加噪生成数据计算 score ,通过 Distribution Matching Loss 的梯度更新 Generator ,将多步扩散蒸馏为少步(4 步)推理。

训练流程

  1. Foundation Model 训练:在混合数据集上交替训练 T2V 和 I2V 任务,分辨率 704×1280,16 FPS,batch 40,Adam lr=1e-5,A100 GPU 训练 10K iterations
  2. Self-Forcing + TSCM 训练:在 foundation model 基础上,用自身生成帧作为历史上下文(替代 KV cache),通过 TSCM 压缩后训练 600 iterations

DMD Loss

关键区别:DMD 使用模型预测数据(而非真实数据)作为 video conditioning,从而与 Self-Forcing 的训练范式保持一致。

3.5 数据处理

三个数据源:

  1. Sekai-Real-HQ:真实世界街景行走视频,包含相机轨迹和语义标注。将轨迹映射为离散键盘控制词表(vocab_camera 9 种相机动作 + vocab_human 10 种人物移动)。用 InternVL3-78B 为 I2V 训练重新标注 event-focused caption
  2. Synthetic Dataset:从 Openvid-1M 采 80K caption,用 Wan 2.1-14B 合成 80K 视频(720p),VBench 筛选 top 50K,用于维持 T2V 通用能力
  3. Event Dataset:招募标注员撰写 4 类事件描述(日常/科幻/奇幻/天气),收集 10K 第一人称图像,用 Wan 2.2-I2V-14B 合成 10K 视频,人工筛选 4K 高质量样本用于 T2V 训练

Figure 2 解读:数据集重标注示例。Original caption 描述静态场景上下文(“A vivid European-style city street scene…“),用于 T2V 训练。New caption 由 VLM 重新生成,聚焦视频中发生的动态事件(“People are moving aside to avoid the street sprinkler”),用于 I2V 训练——使模型学会根据事件描述生成动态内容。

3.6 Python Pseudocode

import torch
import torch.nn as nn
import torch.nn.functional as F
 
# ============================================================
# 1. TSCM: Joint Temporal-Spatial-Channel Modeling
# ============================================================
 
class TemporalSpatialCompression(nn.Module):
    """Step 1: Compress history frames with adaptive Patchify rates."""
 
    def __init__(self, in_channels, base_patch_size=(1, 2, 2)):
        super().__init__()
        self.base_patchify = nn.Conv3d(in_channels, in_channels,
                                        kernel_size=base_patch_size,
                                        stride=base_patch_size)
 
    def get_compression_rate(self, temporal_distance):
        """Eq.3: farther frames get higher compression."""
        if temporal_distance <= 2:
            return (1, 2, 2)   # 1x temporal, 2x spatial
        elif temporal_distance <= 6:
            return (1, 4, 4)   # 1x temporal, 4x spatial
        elif temporal_distance <= 23:
            return (1, 8, 8)   # 1x temporal, 8x spatial
        else:
            return (1, 16, 16) # maximum compression
 
    def adaptive_patchify(self, z_c, rate):
        """Achieve variable compression by interpolating Patchify weights."""
        # Interpolate base_patchify weights to target rate
        base_weight = self.base_patchify.weight  # (C_out, C_in, 1, 2, 2)
        target_weight = F.interpolate(base_weight, size=rate, mode='trilinear')
        return F.conv3d(z_c, target_weight, stride=rate)
 
    def forward(self, z_c, current_frame_idx):
        """
        z_c: (B, C, T_history, H, W) - all history frames
        Returns: z_hat_c concatenated with z_hat_p
        """
        # Random frame sampling: keep 1 out of 32
        sampled_indices = list(range(0, z_c.shape[2], 32))
 
        compressed_chunks = []
        for idx in sampled_indices:
            dist = current_frame_idx - idx
            rate = self.get_compression_rate(dist)
            chunk = z_c[:, :, idx:idx+1]
            compressed_chunks.append(self.adaptive_patchify(chunk, rate))
 
        z_hat_c = torch.cat(compressed_chunks, dim=2)
        return z_hat_c
 
 
class ChannelCompression(nn.Module):
    """Step 2: Further compress to 96-dim channels for linear attention."""
 
    def __init__(self, in_channels, compressed_channels=96):
        super().__init__()
        # High compression Patchify: (8, 4, 4) rate, output 96 channels
        self.high_compress_patchify = nn.Conv3d(
            in_channels, compressed_channels,
            kernel_size=(8, 4, 4), stride=(8, 4, 4)
        )
 
    def forward(self, z_c):
        """
        z_c: (B, C, T, H, W) - history frames
        Returns: z_linear (B, N_tokens, 96)
        """
        z_linear = self.high_compress_patchify(z_c)  # (B, 96, T', H', W')
        B, C, T, H, W = z_linear.shape
        z_linear = z_linear.reshape(B, C, -1).permute(0, 2, 1)  # (B, N, 96)
        return z_linear
 
 
class LinearAttentionFusion(nn.Module):
    """Eq.4: Linear attention for fusing history tokens with prediction tokens."""
 
    def __init__(self, dim, compressed_dim=96):
        super().__init__()
        self.fc_down = nn.Linear(dim, compressed_dim)  # reduce pred tokens dim
        self.fc_up = nn.Linear(compressed_dim, dim)     # restore dim after fusion
        self.q_proj = nn.Linear(compressed_dim, compressed_dim)
        self.k_proj = nn.Linear(compressed_dim, compressed_dim)
        self.v_proj = nn.Linear(compressed_dim, compressed_dim)
        self.norm = nn.LayerNorm(compressed_dim)
        self.out_proj = nn.Linear(compressed_dim, compressed_dim)
 
    def phi(self, x):
        """Kernel function: ReLU activation."""
        return F.relu(x)
 
    def forward(self, z_pred, z_linear):
        """
        z_pred: (B, N_pred, D) - prediction frame tokens after cross-attn
        z_linear: (B, N_hist, 96) - compressed history tokens
        Returns: (B, N_pred, D) - fused result to add back
        """
        z_p_down = self.fc_down(z_pred)  # (B, N_pred, 96)
        z_fus = torch.cat([z_linear, z_p_down], dim=1)  # (B, N_hist+N_pred, 96)
 
        q = self.q_proj(z_fus)
        k = self.k_proj(z_fus)
        v = self.v_proj(z_fus)
 
        # Normalize q, k for stability
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
 
        # Linear attention: O(N*d^2) instead of O(N^2*d)
        phi_k = self.phi(k)        # (B, N, d)
        phi_q = self.phi(q)        # (B, N, d)
 
        # Compute: (sum_i v_i * phi(k_i)^T) * phi(q) / (sum_j phi(k_j)^T * phi(q))
        kv = torch.einsum('bnd,bnc->bdc', phi_k, v)     # (B, d, d)
        numerator = torch.einsum('bdc,bnc->bnd', kv, phi_q)  # (B, N, d)
        denominator = torch.einsum('bnd,bnd->bn', phi_k.sum(1, keepdim=True).expand_as(phi_q), phi_q)
        denominator = denominator.unsqueeze(-1).clamp(min=1e-6)
 
        o = numerator / denominator
        o = self.norm(o)
        o = self.out_proj(o)
 
        # Extract only prediction token positions, restore dim
        o_pred = o[:, -z_pred.shape[1]:]  # (B, N_pred, 96)
        return self.fc_up(o_pred)  # (B, N_pred, D) - add to z_l
 
 
# ============================================================
# 2. Text Encoding: Event + Action Description
# ============================================================
 
class DualTextEncoder(nn.Module):
    """Decompose caption into Event Description and Action Description."""
 
    def __init__(self, t5_encoder):
        super().__init__()
        self.t5 = t5_encoder
        # Pre-compute all possible action descriptions
        self.action_vocab = self._build_action_cache()
 
    def _build_action_cache(self):
        """Pre-compute T5 embeddings for all keyboard combinations."""
        actions = [
            "Camera moves forward (W).", "Camera moves left (A).",
            "Camera moves backward (S).", "Camera moves right (D).",
            "Camera moves forward and left (W+A).",
            "Camera moves forward and right (W+D).",
            "Camera turns right (→).", "Camera turns left (←).",
            "Camera tilts up (↑).", "Camera tilts down (↓).",
            "Camera stands still (·).",
        ]
        cache = {}
        with torch.no_grad():
            for action in actions:
                cache[action] = self.t5.encode(action)
        return cache
 
    def forward(self, event_description, action_description):
        """
        event_description: str - scene/event text (encoded once at start)
        action_description: str - keyboard action (looked up from cache)
        Returns: c_a (B, L, D) - concatenated text embedding
        """
        c_event = self.t5.encode(event_description)   # only at first frame
        c_action = self.action_vocab[action_description]  # from cache
        c_a = torch.cat([c_event, c_action], dim=1)
        return c_a
 
 
# ============================================================
# 3. Self-Forcing Training with TSCM
# ============================================================
 
class SelfForcingTrainer:
    """Train generator using its own outputs as history context."""
 
    def __init__(self, generator, real_model, fake_model):
        self.G = generator       # G_theta: few-step generator
        self.G_real = real_model  # teacher: multi-step diffusion
        self.G_fake = fake_model  # student: tracks G_theta via EMA
 
    def generate_long_video(self, first_frame, caption, num_chunks):
        """Autoregressive generation with Self-Forcing."""
        all_frames = [first_frame]
 
        for chunk_idx in range(num_chunks):
            # Use OWN generated frames as history (not GT)
            history = torch.cat(all_frames, dim=2)
 
            # Compress via TSCM
            history_tokens = self.G.tscm(history, chunk_idx)
 
            # Generate new chunk (4-step inference)
            noise = torch.randn_like(first_frame)
            new_chunk = self.G(noise, history_tokens, caption, num_steps=4)
 
            all_frames.append(new_chunk)
 
        return torch.cat(all_frames, dim=2)
 
    def distillation_step(self, z_t, history_tokens, caption, t):
        """
        DMD loss: match score functions of real and fake models.
        Eq.6: gradient of distribution matching loss.
        """
        # Real model score (teacher, on real data distribution)
        with torch.no_grad():
            G_real_output = self.G_real(z_t, history_tokens, caption, t)
            s_real = (z_t - G_real_output) / (t ** 2)
 
        # Fake model score (student, on generated data distribution)
        with torch.no_grad():
            G_fake_output = self.G_fake(z_t, history_tokens, caption, t)
            s_fake = (z_t - G_fake_output) / (t ** 2)
 
        # Generator forward
        G_theta_output = self.G(z_t, history_tokens, caption, t)
 
        # DMD gradient: (s_real - s_fake) * dG/dtheta
        loss = ((s_real - s_fake) * G_theta_output).sum()
        return loss
 
 
# ============================================================
# 4. DiT Block with TSCM Integration
# ============================================================
 
class YumeDiTBlock(nn.Module):
    """Single DiT block with standard attn + linear attn for history."""
 
    def __init__(self, hidden_dim, num_heads, compressed_dim=96):
        super().__init__()
        # Standard components
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * 4),
            nn.GELU(),
            nn.Linear(hidden_dim * 4, hidden_dim)
        )
        # TSCM: linear attention branch
        self.linear_attn = LinearAttentionFusion(hidden_dim, compressed_dim)
        # AdaLN modulation
        self.adaln = nn.Linear(hidden_dim, hidden_dim * 6)
 
    def forward(self, z_l, z_linear, text_emb, time_emb):
        """
        z_l: (B, N, D) - current prediction + compressed history tokens
        z_linear: (B, N_hist, 96) - channel-compressed history tokens
        text_emb: (B, L, D) - concatenated event + action embeddings
        time_emb: (B, D) - timestep embedding
        """
        # AdaLN modulation
        mod = self.adaln(time_emb).unsqueeze(1)
        shift1, scale1, gate1, shift2, scale2, gate2 = mod.chunk(6, dim=-1)
 
        # Self-attention (standard, on prediction + spatially-compressed history)
        h = z_l * (1 + scale1) + shift1
        h = h + gate1 * self.self_attn(h, h, h)[0]
 
        # Cross-attention with text
        h = h + self.cross_attn(h, text_emb, text_emb)[0]
 
        # Linear attention fusion with channel-compressed history
        h_pred = h[:, -z_l.shape[1]:]  # extract prediction tokens
        h = h + self.linear_attn(h_pred, z_linear)
 
        # FFN
        h2 = h * (1 + scale2) + shift2
        h = h + gate2 * self.ffn(h2)
 
        return h

3.7 Code Mapping(代码映射)

论文组件代码位置说明
DiT Backbone (Wan-based)fastvideo/models/hunyuan/modules/model.pyWanModel基于 Wan 架构的 DiT 主干,支持 T2V/I2V 双模式
Self-Attention Blockfastvideo/models/hunyuan/modules/model.pyWanAttentionBlock包含 self-attn + cross-attn + FFN,含 RoPE 位置编码
Self-Attention (standard)fastvideo/models/hunyuan/modules/model.pyWanSelfAttentionQK-Norm + RoPE + Flash Attention
Cross-Attention (T2V)fastvideo/models/hunyuan/modules/model.pyWanT2VCrossAttention文本条件注入
Cross-Attention (I2V)fastvideo/models/hunyuan/modules/model.pyWanI2VCrossAttention图像条件注入,分离 image/text stream
Flash Attention 实现fastvideo/models/flash_attn_no_pad.py无 padding 的 Flash Attention
Parallel Attention (SP)fastvideo/models/hunyuan/modules/attenion.pyparallel_attention()序列并行下的分布式注意力
Distillation 训练fastvideo/distill_model.pydistill_one_step()GAN-based 蒸馏 + EMA teacher,包含 discriminator loss
训练脚本scripts/推理和训练入口脚本
数据预处理fastvideo/data_preprocess/相机轨迹→键盘控制转换等
交互式 Demowebapp_single_gpu.py单 GPU Windows 部署 Demo(RTX 4090, 16GB VRAM)

注意:TSCM(linear attention + channel compression)和 Self-Forcing 的完整实现在开源代码中尚未完全体现——model.py 中的 WanModel 是基础 Wan 架构,TSCM 相关的 linear attention 融合模块可能在训练代码或未公开的扩展模块中。开源仓库主要提供了推理和基础蒸馏流程。

4. Experimental Setup (实验设置)

  • 预训练模型:Wan2.2-5B-T2V
  • 训练配置:704×1280, 16 FPS, batch 40, Adam lr=1e-5, A100 GPU
  • Foundation model: 10K iterations; Self-Forcing+TSCM: 600 iterations
  • 评估:Yume-Bench(6 个 fine-grained 指标),测试数据 544×960, 16 FPS, 96 帧,4 步推理
  • 对比方法:Wan-2.1, MatrixGame, Yume (1.0)

5. Experimental Results (实验结果)

5.1 Image-to-Video 生成质量

ModelTime(s)↓IF↑SC↑BC↑MS↑AQ↑IQ↑
Wan-2.16110.0570.8590.8990.9610.4940.695
MatrixGame9710.2710.9110.9320.9830.4350.750
Yume5720.6570.9320.9410.9860.5180.739
Yume-1.580.8360.9320.9450.9850.5060.728

关键发现:

  • Instruction Following 大幅领先:Yume-1.5 的 IF=0.836,远超 Yume(0.657) 和 Wan-2.1(0.057),说明文本控制和键盘指令跟随能力显著增强
  • 推理速度碾压式优势:仅需 8 秒(4 步推理),对比 Wan-2.1 需 611 秒、MatrixGame 需 971 秒
  • 在单张 A100 上实现 540p 12 fps 实时生成
  • 其他指标(SC, BC, MS)与 Yume 持平或略优,AQ/IQ 略低于 MatrixGame 但差距不大

5.2 长视频生成性能

Figure 5 解读:长视频生成中 Aesthetic Score 随 chunk 数量的变化。红线(Self-Forcing without TSCM)从第 2 个 chunk 开始急剧下降,到第 6 个 chunk 时降至 0.44。蓝线(Self-Forcing with TSCM)在第 4-6 个 chunk 保持稳定,最终美学分数 0.523——比无 TSCM 高 18%。说明 TSCM 有效利用了长程历史信息,抑制了质量退化。

Figure 6 解读:Image Quality 随 chunk 数量的变化。同样的趋势——有 TSCM 的模型在第 5-6 个 chunk 保持更一致的图像质量(0.601 vs 0.542),进一步验证 TSCM 对长程生成的稳定作用。

5.3 TSCM 消融实验

ModelIF↑SC↑BC↑MS↑AQ↑IQ↑
TSCM0.8360.9320.9450.9850.5060.728
Spatial Compression Only0.7670.9350.9450.9730.5040.733

TSCM 在关键指标 IF 上提升 9%(0.836 vs 0.767),说明通道压缩 + linear attention 融合有效减少了历史帧运动方向对当前预测的干扰。其他指标差异较小。

5.4 推理速度对比

Figure 7 解读:三种方法随 chunk 数量增加的推理时间变化(704×1280)。Full Context(蓝线)推理时间线性增长,到第 8 个 chunk 时约 9 秒/chunk。Spatial Compression(绿线)增长较慢但仍有趋势。TSCM(红线)保持基本恒定(约 3 秒/chunk),验证了 TSCM 将历史上下文压缩到固定大小的设计目标——推理速度不随视频长度增长。

5.5 定性结果

Figure 8 解读:544×960 分辨率下四种方法的定性对比。Wan-2.1 缺乏交互控制能力,生成结果与指令不匹配。MatrixGame 有一定控制能力但画面质量较低。Yume 画面质量较好但运动控制精度有限。Yume-1.5 在仅用 4 步采样的情况下,同时实现了高画面质量和精确的相机/人物控制——其他方法均使用 50 步采样。

5.6 Conclusion(总结与展望)

核心贡献

  1. TSCM:联合时空-通道压缩 + linear attention,实现恒定速度的无限长视频生成
  2. Self-Forcing + DMD 加速:消除 train-inference mismatch + 4 步蒸馏,单 A100 实现 540p 12fps
  3. 文本事件控制:通过 Caption 分解 + Event Dataset,首次在交互世界模型中引入文本触发事件能力

局限性

  • 倒车、倒退行走等反向运动仍有 artifact
  • 极高人群密度场景质量下降
  • 5B 参数规模限制了生成能力上限;考虑 MoE 架构(Wan2.2 启发)兼顾参数量和延迟

未来方向:更复杂的世界交互、MoE 架构扩展、虚拟环境和仿真系统应用。