Yume-1.5: A Text-Controlled Interactive World Generation Model

Authors: Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang Affiliations: Shanghai AI Laboratory, Fudan University, Shanghai Innovation Institute arXiv: 2512.22096 Project Page: stdstu12.github.io/YUME-Project GitHub: stdstu12/YUME

训练推理代码都已经开源，提供 Yume-5B-720P 和 Yume-I2V-540P-14B 权重

1. Motivation (研究动机)

现有交互式世界生成方法面临三大瓶颈：

有限的泛化能力（Limited Generalizability）：MatrixGame、WORLDMEM 等方法主要在游戏数据集上训练，存在严重的 domain gap，难以生成逼真的动态城市街景等真实场景
过高的生成延迟（High Generation Latency）：扩散模型的高计算量阻碍了实时连续生成。随着视频时长增加，历史上下文不断膨胀，推理速度急剧下降——无法支持无限探索
文本控制能力不足（Insufficient Text Control）：现有方法仅支持键盘/鼠标控制相机运动，无法通过文本指令触发场景中的事件（如”一只鬼出现了”），交互维度受限

Figure 1 解读：展示 Yume-1.5 的三种交互生成模式。上行：Text-to-World——从文本描述”A stylish woman walks down a Tokyo street…”出发，通过 WASD 键盘控制人物移动和方向键控制相机旋转，自回归生成连续的街景探索视频。中行：Image-to-World——从一张静态图像出发生成可探索世界。下行：Event Editing——在生成过程中通过文本描述”A ghost appeared”触发场景事件。每帧下方的控制面板显示了当前的键盘输入状态。

2. Idea (核心思想)

Yume-1.5 的核心 insight：通过联合时空-通道压缩（TSCM）将不断增长的历史上下文压缩到固定大小，结合 Self-Forcing 蒸馏实现少步推理，再通过文本分解（Event Description + Action Description）引入事件控制能力。

具体做法分三个维度：

TSCM（Joint Temporal-Spatial-Channel Modeling）：对历史帧做两级压缩——先在时空维度随机采样 + 高压缩率 Patchify（越远的帧压缩越狠），再在通道维度进一步降维到 96 维，用 linear attention 与当前预测帧融合。这样无论历史多长，输入 token 数基本恒定
Self-Forcing + DMD 加速：Generator 用自身生成的帧（而非 GT）作为历史上下文进行训练，消除 train-inference mismatch；然后通过 Distribution Matching Distillation 将多步扩散蒸馏为 4 步推理
文本事件控制：将 caption 分解为 Event Description（描述场景/事件，仅首帧编码一次）和 Action Description（描述键盘动作，有限集合可预计算缓存），通过 T5 分别编码后拼接，实现低开销的文本控制

与 Self-Forcing、CausVid 的根本区别：不使用 KV cache，而是用 TSCM 压缩历史帧后通过 linear attention 融合，推理速度不随视频长度增长。

3. Method (方法)

3.1 整体架构

Figure 3 解读：Yume-1.5 的四个核心组件。(a) DiT Block：标准 self-attention 处理当前预测帧，cross-attention 处理文本条件，底部增加 linear attention 分支融合历史压缩 token $z_{l in e a r}$ ——先将预测帧 token $z_{p}^{l}$ 通过 FC 降维，与 $z_{l in e a r}$ 拼接后用 linear attention 融合，结果通过 FC 恢复维度后 element-wise 加回 $z_{l}$ 。(b) Training pipeline：视频经 WanEncoder 编码后与条件图像 $z_{c}$ 拼接，Caption 分解为 Event Description 和 Action Description 分别通过 T5 编码后拼接为 $c_{a}$ ，与时间步嵌入一起送入 N 个 DiT Block。(c) History Tokens Downsampling：对不同时间距离的历史帧施加不同压缩率——最近 2 帧 (1,2,2)，稍远帧 (1,4,4)，更远帧 (1,8,8)，以此类推，通过插值 Patchify 权重实现。(d) Inference：chunk-based 自回归生成，每次预测新 chunk 时将之前所有生成帧压缩为低压缩率（近期）和高压缩率（远期）两种 memory。

3.2 基础模型（Architecture Preliminary）

基于 Wan2.2-5B（T2V + I2V 统一模型）：

T2V 模式：噪声 $z \in R^{C \times f_{t} \times h \times w}$ ，文本嵌入 $c$ 和 $z$ 送入 DiT backbone
I2V 模式：条件图像/视频 $z_{c} \in R^{C \times f_{t} \times h \times w}$ zero-pad 到与 $z$ 同尺寸，构建 binary mask $M_{c} \in {0, 1}^{1 \times f_{t} \times h \times w}$ ，融合为 $M_{c} \cdot z_{c} + (1 - M_{c}) \cdot z$
文本编码策略：与 Wan 不同，不直接把整个 caption 送 T5，而是拆分为 Event Description 和 Action Description 分别编码后拼接。Action Description 集合有限（WASD 组合），可预计算缓存，大幅降低 T5 推理开销
训练使用 Rectified Flow loss

3.3 TSCM: 联合时空-通道压缩

核心问题：自回归生成中历史帧 $z_{c}$ 的帧数 $f_{t}$ 不断增长，标准 attention 的计算量无法承受。现有方案（滑窗截断、FramePack 压缩、相机轨迹检索）各有缺陷。

Step 1: Temporal-Spatial Compression（时空压缩）

对历史帧 $z_{c}$ 做两步处理：

随机帧采样：32 帧中只采 1 帧
自适应 Patchify 压缩：根据时间距离施加不同压缩率

Frames t - 1 to t - 2 : (1, 2, 2), t - 3 to t - 6 : (1, 4, 4), t - 7 to t - 23 : (1, 8, 8), \dots (3)

$(1, 2, 2)$ 表示时间维 $1 \times$ 、空间高 $2 \times$ 、空间宽 $2 \times$ 下采样。通过插值 Patchify 权重实现不同压缩率，无需额外参数。

压缩后的表示 $\overset{z}{^}_{c}$ 与当前预测帧 $\overset{z}{^}_{p}$ （用原始 Patchify $(1, 2, 2)$ 处理）拼接后送入 DiT Block 的 standard self-attention。

Step 2: Channel Compression（通道压缩）

在时空压缩基础上进一步压缩：

用压缩率 $(8, 4, 4)$ 的 Patchify 将 $z_{c}$ 压缩，通道维降至 96，得到 $z_{l in e a r}$
DiT Block 内部：预测帧 token $z_{p}^{l}$ 经 cross-attention 后通过 FC 降维，与 $z_{l in e a r}$ 拼接得到 $z_{f u s}$
$z_{f u s}$ 通过 Linear Attention 融合：

o^{l} = \frac{( \sum _{i = 1}^{N} v _{i}^{l} ϕ ( k _{i}^{l} ) ^{T} ) ϕ ( q ^{l} )}{( \sum _{j = 1}^{N} ϕ ( k _{j}^{l} ) ^{T} ) ϕ ( q ^{l} )} (4)

其中 $ϕ$ 是 ReLU 激活函数。Linear attention 的计算量对 token 数线性，对通道维敏感——所以先压缩通道到 96 再用 linear attention。

融合结果 $z_{f u s}^{l}$ 通过 FC 恢复通道维度，element-wise 加回 $z_{l}$ ： $z_{f u s}^{l} [:, - N_{l} :] + z_{l}$

设计哲学：Standard attention 对 token 数敏感 → 时空压缩减少 token 数；Linear attention 对通道维敏感 → 通道压缩降低维度。两者互补。

3.4 Self-Forcing + DMD 实时加速

Figure 4 解读：左（Generator）：自回归生成过程。第一个 chunk 从首帧出发，后续 chunk 将自身生成的帧（非 GT）经 TSCM 压缩后作为历史上下文——这就是 Self-Forcing 的核心，用自身输出替代 GT 消除 train-inference mismatch。右（Distillation）：Real Model（teacher）处理真实数据计算 score $s_{re a l}$ ，Fake Model（student）处理加噪生成数据计算 score $s_{f ak e}$ ，通过 Distribution Matching Loss 的梯度更新 Generator $G_{θ}$ ，将多步扩散蒸馏为少步（4 步）推理。

训练流程：

Foundation Model 训练：在混合数据集上交替训练 T2V 和 I2V 任务，分辨率 704×1280，16 FPS，batch 40，Adam lr=1e-5，A100 GPU 训练 10K iterations
Self-Forcing + TSCM 训练：在 foundation model 基础上，用自身生成帧作为历史上下文（替代 KV cache），通过 TSCM 压缩后训练 600 iterations

DMD Loss：

\nabla L_{DMD} = - E_{t} [\int (s_{real} (F (G_{θ} (z_{t}), t), t) - s_{fake} (F (G_{a} (z_{t}), t), t)) \frac{d G _{θ} ( z )}{d θ} d z] (6)

关键区别：DMD 使用模型预测数据（而非真实数据）作为 video conditioning，从而与 Self-Forcing 的训练范式保持一致。

3.5 数据处理

三个数据源：

Sekai-Real-HQ：真实世界街景行走视频，包含相机轨迹和语义标注。将轨迹映射为离散键盘控制词表（vocab_camera 9 种相机动作 + vocab_human 10 种人物移动）。用 InternVL3-78B 为 I2V 训练重新标注 event-focused caption
Synthetic Dataset：从 Openvid-1M 采 80K caption，用 Wan 2.1-14B 合成 80K 视频（720p），VBench 筛选 top 50K，用于维持 T2V 通用能力
Event Dataset：招募标注员撰写 4 类事件描述（日常/科幻/奇幻/天气），收集 10K 第一人称图像，用 Wan 2.2-I2V-14B 合成 10K 视频，人工筛选 4K 高质量样本用于 T2V 训练

Figure 2 解读：数据集重标注示例。Original caption 描述静态场景上下文（“A vivid European-style city street scene…“），用于 T2V 训练。New caption 由 VLM 重新生成，聚焦视频中发生的动态事件（“People are moving aside to avoid the street sprinkler”），用于 I2V 训练——使模型学会根据事件描述生成动态内容。

3.6 Python Pseudocode

import torch
import torch.nn as nn
import torch.nn.functional as F
 
# ============================================================
# 1. TSCM: Joint Temporal-Spatial-Channel Modeling
# ============================================================
 
class TemporalSpatialCompression(nn.Module):
    """Step 1: Compress history frames with adaptive Patchify rates."""
 
    def __init__(self, in_channels, base_patch_size=(1, 2, 2)):
        super().__init__()
        self.base_patchify = nn.Conv3d(in_channels, in_channels,
                                        kernel_size=base_patch_size,
                                        stride=base_patch_size)
 
    def get_compression_rate(self, temporal_distance):
        """Eq.3: farther frames get higher compression."""
        if temporal_distance <= 2:
            return (1, 2, 2)   # 1x temporal, 2x spatial
        elif temporal_distance <= 6:
            return (1, 4, 4)   # 1x temporal, 4x spatial
        elif temporal_distance <= 23:
            return (1, 8, 8)   # 1x temporal, 8x spatial
        else:
            return (1, 16, 16) # maximum compression
 
    def adaptive_patchify(self, z_c, rate):
        """Achieve variable compression by interpolating Patchify weights."""
        # Interpolate base_patchify weights to target rate
        base_weight = self.base_patchify.weight  # (C_out, C_in, 1, 2, 2)
        target_weight = F.interpolate(base_weight, size=rate, mode='trilinear')
        return F.conv3d(z_c, target_weight, stride=rate)
 
    def forward(self, z_c, current_frame_idx):
        """
        z_c: (B, C, T_history, H, W) - all history frames
        Returns: z_hat_c concatenated with z_hat_p
        """
        # Random frame sampling: keep 1 out of 32
        sampled_indices = list(range(0, z_c.shape[2], 32))
 
        compressed_chunks = []
        for idx in sampled_indices:
            dist = current_frame_idx - idx
            rate = self.get_compression_rate(dist)
            chunk = z_c[:, :, idx:idx+1]
            compressed_chunks.append(self.adaptive_patchify(chunk, rate))
 
        z_hat_c = torch.cat(compressed_chunks, dim=2)
        return z_hat_c
 
 
class ChannelCompression(nn.Module):
    """Step 2: Further compress to 96-dim channels for linear attention."""
 
    def __init__(self, in_channels, compressed_channels=96):
        super().__init__()
        # High compression Patchify: (8, 4, 4) rate, output 96 channels
        self.high_compress_patchify = nn.Conv3d(
            in_channels, compressed_channels,
            kernel_size=(8, 4, 4), stride=(8, 4, 4)
        )
 
    def forward(self, z_c):
        """
        z_c: (B, C, T, H, W) - history frames
        Returns: z_linear (B, N_tokens, 96)
        """
        z_linear = self.high_compress_patchify(z_c)  # (B, 96, T', H', W')
        B, C, T, H, W = z_linear.shape
        z_linear = z_linear.reshape(B, C, -1).permute(0, 2, 1)  # (B, N, 96)
        return z_linear
 
 
class LinearAttentionFusion(nn.Module):
    """Eq.4: Linear attention for fusing history tokens with prediction tokens."""
 
    def __init__(self, dim, compressed_dim=96):
        super().__init__()
        self.fc_down = nn.Linear(dim, compressed_dim)  # reduce pred tokens dim
        self.fc_up = nn.Linear(compressed_dim, dim)     # restore dim after fusion
        self.q_proj = nn.Linear(compressed_dim, compressed_dim)
        self.k_proj = nn.Linear(compressed_dim, compressed_dim)
        self.v_proj = nn.Linear(compressed_dim, compressed_dim)
        self.norm = nn.LayerNorm(compressed_dim)
        self.out_proj = nn.Linear(compressed_dim, compressed_dim)
 
    def phi(self, x):
        """Kernel function: ReLU activation."""
        return F.relu(x)
 
    def forward(self, z_pred, z_linear):
        """
        z_pred: (B, N_pred, D) - prediction frame tokens after cross-attn
        z_linear: (B, N_hist, 96) - compressed history tokens
        Returns: (B, N_pred, D) - fused result to add back
        """
        z_p_down = self.fc_down(z_pred)  # (B, N_pred, 96)
        z_fus = torch.cat([z_linear, z_p_down], dim=1)  # (B, N_hist+N_pred, 96)
 
        q = self.q_proj(z_fus)
        k = self.k_proj(z_fus)
        v = self.v_proj(z_fus)
 
        # Normalize q, k for stability
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
 
        # Linear attention: O(N*d^2) instead of O(N^2*d)
        phi_k = self.phi(k)        # (B, N, d)
        phi_q = self.phi(q)        # (B, N, d)
 
        # Compute: (sum_i v_i * phi(k_i)^T) * phi(q) / (sum_j phi(k_j)^T * phi(q))
        kv = torch.einsum('bnd,bnc->bdc', phi_k, v)     # (B, d, d)
        numerator = torch.einsum('bdc,bnc->bnd', kv, phi_q)  # (B, N, d)
        denominator = torch.einsum('bnd,bnd->bn', phi_k.sum(1, keepdim=True).expand_as(phi_q), phi_q)
        denominator = denominator.unsqueeze(-1).clamp(min=1e-6)
 
        o = numerator / denominator
        o = self.norm(o)
        o = self.out_proj(o)
 
        # Extract only prediction token positions, restore dim
        o_pred = o[:, -z_pred.shape[1]:]  # (B, N_pred, 96)
        return self.fc_up(o_pred)  # (B, N_pred, D) - add to z_l
 
 
# ============================================================
# 2. Text Encoding: Event + Action Description
# ============================================================
 
class DualTextEncoder(nn.Module):
    """Decompose caption into Event Description and Action Description."""
 
    def __init__(self, t5_encoder):
        super().__init__()
        self.t5 = t5_encoder
        # Pre-compute all possible action descriptions
        self.action_vocab = self._build_action_cache()
 
    def _build_action_cache(self):
        """Pre-compute T5 embeddings for all keyboard combinations."""
        actions = [
            "Camera moves forward (W).", "Camera moves left (A).",
            "Camera moves backward (S).", "Camera moves right (D).",
            "Camera moves forward and left (W+A).",
            "Camera moves forward and right (W+D).",
            "Camera turns right (→).", "Camera turns left (←).",
            "Camera tilts up (↑).", "Camera tilts down (↓).",
            "Camera stands still (·).",
        ]
        cache = {}
        with torch.no_grad():
            for action in actions:
                cache[action] = self.t5.encode(action)
        return cache
 
    def forward(self, event_description, action_description):
        """
        event_description: str - scene/event text (encoded once at start)
        action_description: str - keyboard action (looked up from cache)
        Returns: c_a (B, L, D) - concatenated text embedding
        """
        c_event = self.t5.encode(event_description)   # only at first frame
        c_action = self.action_vocab[action_description]  # from cache
        c_a = torch.cat([c_event, c_action], dim=1)
        return c_a
 
 
# ============================================================
# 3. Self-Forcing Training with TSCM
# ============================================================
 
class SelfForcingTrainer:
    """Train generator using its own outputs as history context."""
 
    def __init__(self, generator, real_model, fake_model):
        self.G = generator       # G_theta: few-step generator
        self.G_real = real_model  # teacher: multi-step diffusion
        self.G_fake = fake_model  # student: tracks G_theta via EMA
 
    def generate_long_video(self, first_frame, caption, num_chunks):
        """Autoregressive generation with Self-Forcing."""
        all_frames = [first_frame]
 
        for chunk_idx in range(num_chunks):
            # Use OWN generated frames as history (not GT)
            history = torch.cat(all_frames, dim=2)
 
            # Compress via TSCM
            history_tokens = self.G.tscm(history, chunk_idx)
 
            # Generate new chunk (4-step inference)
            noise = torch.randn_like(first_frame)
            new_chunk = self.G(noise, history_tokens, caption, num_steps=4)
 
            all_frames.append(new_chunk)
 
        return torch.cat(all_frames, dim=2)
 
    def distillation_step(self, z_t, history_tokens, caption, t):
        """
        DMD loss: match score functions of real and fake models.
        Eq.6: gradient of distribution matching loss.
        """
        # Real model score (teacher, on real data distribution)
        with torch.no_grad():
            G_real_output = self.G_real(z_t, history_tokens, caption, t)
            s_real = (z_t - G_real_output) / (t ** 2)
 
        # Fake model score (student, on generated data distribution)
        with torch.no_grad():
            G_fake_output = self.G_fake(z_t, history_tokens, caption, t)
            s_fake = (z_t - G_fake_output) / (t ** 2)
 
        # Generator forward
        G_theta_output = self.G(z_t, history_tokens, caption, t)
 
        # DMD gradient: (s_real - s_fake) * dG/dtheta
        loss = ((s_real - s_fake) * G_theta_output).sum()
        return loss
 
 
# ============================================================
# 4. DiT Block with TSCM Integration
# ============================================================
 
class YumeDiTBlock(nn.Module):
    """Single DiT block with standard attn + linear attn for history."""
 
    def __init__(self, hidden_dim, num_heads, compressed_dim=96):
        super().__init__()
        # Standard components
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * 4),
            nn.GELU(),
            nn.Linear(hidden_dim * 4, hidden_dim)
        )
        # TSCM: linear attention branch
        self.linear_attn = LinearAttentionFusion(hidden_dim, compressed_dim)
        # AdaLN modulation
        self.adaln = nn.Linear(hidden_dim, hidden_dim * 6)
 
    def forward(self, z_l, z_linear, text_emb, time_emb):
        """
        z_l: (B, N, D) - current prediction + compressed history tokens
        z_linear: (B, N_hist, 96) - channel-compressed history tokens
        text_emb: (B, L, D) - concatenated event + action embeddings
        time_emb: (B, D) - timestep embedding
        """
        # AdaLN modulation
        mod = self.adaln(time_emb).unsqueeze(1)
        shift1, scale1, gate1, shift2, scale2, gate2 = mod.chunk(6, dim=-1)
 
        # Self-attention (standard, on prediction + spatially-compressed history)
        h = z_l * (1 + scale1) + shift1
        h = h + gate1 * self.self_attn(h, h, h)[0]
 
        # Cross-attention with text
        h = h + self.cross_attn(h, text_emb, text_emb)[0]
 
        # Linear attention fusion with channel-compressed history
        h_pred = h[:, -z_l.shape[1]:]  # extract prediction tokens
        h = h + self.linear_attn(h_pred, z_linear)
 
        # FFN
        h2 = h * (1 + scale2) + shift2
        h = h + gate2 * self.ffn(h2)
 
        return h

3.7 Code Mapping（代码映射）

论文组件	代码位置	说明
DiT Backbone (Wan-based)	`fastvideo/models/hunyuan/modules/model.py` → `WanModel`	基于 Wan 架构的 DiT 主干，支持 T2V/I2V 双模式
Self-Attention Block	`fastvideo/models/hunyuan/modules/model.py` → `WanAttentionBlock`	包含 self-attn + cross-attn + FFN，含 RoPE 位置编码
Self-Attention (standard)	`fastvideo/models/hunyuan/modules/model.py` → `WanSelfAttention`	QK-Norm + RoPE + Flash Attention
Cross-Attention (T2V)	`fastvideo/models/hunyuan/modules/model.py` → `WanT2VCrossAttention`	文本条件注入
Cross-Attention (I2V)	`fastvideo/models/hunyuan/modules/model.py` → `WanI2VCrossAttention`	图像条件注入，分离 image/text stream
Flash Attention 实现	`fastvideo/models/flash_attn_no_pad.py`	无 padding 的 Flash Attention
Parallel Attention (SP)	`fastvideo/models/hunyuan/modules/attenion.py` → `parallel_attention()`	序列并行下的分布式注意力
Distillation 训练	`fastvideo/distill_model.py` → `distill_one_step()`	GAN-based 蒸馏 + EMA teacher，包含 discriminator loss
训练脚本	`scripts/`	推理和训练入口脚本
数据预处理	`fastvideo/data_preprocess/`	相机轨迹→键盘控制转换等
交互式 Demo	`webapp_single_gpu.py`	单 GPU Windows 部署 Demo（RTX 4090, 16GB VRAM）

注意：TSCM（linear attention + channel compression）和 Self-Forcing 的完整实现在开源代码中尚未完全体现——model.py 中的 WanModel 是基础 Wan 架构，TSCM 相关的 linear attention 融合模块可能在训练代码或未公开的扩展模块中。开源仓库主要提供了推理和基础蒸馏流程。

4. Experimental Setup (实验设置)

预训练模型：Wan2.2-5B-T2V
训练配置：704×1280, 16 FPS, batch 40, Adam lr=1e-5, A100 GPU
Foundation model: 10K iterations; Self-Forcing+TSCM: 600 iterations
评估：Yume-Bench（6 个 fine-grained 指标），测试数据 544×960, 16 FPS, 96 帧，4 步推理
对比方法：Wan-2.1, MatrixGame, Yume (1.0)

5. Experimental Results (实验结果)

5.1 Image-to-Video 生成质量

Model	Time(s)↓	IF↑	SC↑	BC↑	MS↑	AQ↑	IQ↑
Wan-2.1	611	0.057	0.859	0.899	0.961	0.494	0.695
MatrixGame	971	0.271	0.911	0.932	0.983	0.435	0.750
Yume	572	0.657	0.932	0.941	0.986	0.518	0.739
Yume-1.5	8	0.836	0.932	0.945	0.985	0.506	0.728

关键发现：

Instruction Following 大幅领先：Yume-1.5 的 IF=0.836，远超 Yume(0.657) 和 Wan-2.1(0.057)，说明文本控制和键盘指令跟随能力显著增强
推理速度碾压式优势：仅需 8 秒（4 步推理），对比 Wan-2.1 需 611 秒、MatrixGame 需 971 秒
在单张 A100 上实现 540p 12 fps 实时生成
其他指标（SC, BC, MS）与 Yume 持平或略优，AQ/IQ 略低于 MatrixGame 但差距不大

5.2 长视频生成性能

Figure 5 解读：长视频生成中 Aesthetic Score 随 chunk 数量的变化。红线（Self-Forcing without TSCM）从第 2 个 chunk 开始急剧下降，到第 6 个 chunk 时降至 0.44。蓝线（Self-Forcing with TSCM）在第 4-6 个 chunk 保持稳定，最终美学分数 0.523——比无 TSCM 高 18%。说明 TSCM 有效利用了长程历史信息，抑制了质量退化。

Figure 6 解读：Image Quality 随 chunk 数量的变化。同样的趋势——有 TSCM 的模型在第 5-6 个 chunk 保持更一致的图像质量（0.601 vs 0.542），进一步验证 TSCM 对长程生成的稳定作用。

5.3 TSCM 消融实验

Model	IF↑	SC↑	BC↑	MS↑	AQ↑	IQ↑
TSCM	0.836	0.932	0.945	0.985	0.506	0.728
Spatial Compression Only	0.767	0.935	0.945	0.973	0.504	0.733

TSCM 在关键指标 IF 上提升 9%（0.836 vs 0.767），说明通道压缩 + linear attention 融合有效减少了历史帧运动方向对当前预测的干扰。其他指标差异较小。

5.4 推理速度对比

Figure 7 解读：三种方法随 chunk 数量增加的推理时间变化（704×1280）。Full Context（蓝线）推理时间线性增长，到第 8 个 chunk 时约 9 秒/chunk。Spatial Compression（绿线）增长较慢但仍有趋势。TSCM（红线）保持基本恒定（约 3 秒/chunk），验证了 TSCM 将历史上下文压缩到固定大小的设计目标——推理速度不随视频长度增长。

5.5 定性结果

Figure 8 解读：544×960 分辨率下四种方法的定性对比。Wan-2.1 缺乏交互控制能力，生成结果与指令不匹配。MatrixGame 有一定控制能力但画面质量较低。Yume 画面质量较好但运动控制精度有限。Yume-1.5 在仅用 4 步采样的情况下，同时实现了高画面质量和精确的相机/人物控制——其他方法均使用 50 步采样。

5.6 Conclusion（总结与展望）

核心贡献：

TSCM：联合时空-通道压缩 + linear attention，实现恒定速度的无限长视频生成
Self-Forcing + DMD 加速：消除 train-inference mismatch + 4 步蒸馏，单 A100 实现 540p 12fps
文本事件控制：通过 Caption 分解 + Event Dataset，首次在交互世界模型中引入文本触发事件能力

局限性：

倒车、倒退行走等反向运动仍有 artifact
极高人群密度场景质量下降
5B 参数规模限制了生成能力上限；考虑 MoE 架构（Wan2.2 启发）兼顾参数量和延迟

未来方向：更复杂的世界交互、MoE 架构扩展、虚拟环境和仿真系统应用。

Paper Notes

探索

Yume-1.5: A Text-Controlled Interactive World Generation Model

Yume-1.5: A Text-Controlled Interactive World Generation Model

1. Motivation (研究动机)

2. Idea (核心思想)

3. Method (方法)

3.1 整体架构

3.2 基础模型（Architecture Preliminary）

3.3 TSCM: 联合时空-通道压缩

3.4 Self-Forcing + DMD 实时加速

3.5 数据处理

3.6 Python Pseudocode

3.7 Code Mapping（代码映射）

4. Experimental Setup (实验设置)

5. Experimental Results (实验结果)

5.1 Image-to-Video 生成质量

5.2 长视频生成性能

5.3 TSCM 消融实验

5.4 推理速度对比

5.5 定性结果

5.6 Conclusion（总结与展望）

目录