Yume-1.5: A Text-Controlled Interactive World Generation Model
Authors: Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang Affiliations: Shanghai AI Laboratory, Fudan University, Shanghai Innovation Institute arXiv: 2512.22096 Project Page: stdstu12.github.io/YUME-Project GitHub: stdstu12/YUME
训练推理代码都已经开源,提供 Yume-5B-720P 和 Yume-I2V-540P-14B 权重
1. Motivation (研究动机)
现有交互式世界生成方法面临三大瓶颈:
- 有限的泛化能力(Limited Generalizability):MatrixGame、WORLDMEM 等方法主要在游戏数据集上训练,存在严重的 domain gap,难以生成逼真的动态城市街景等真实场景
- 过高的生成延迟(High Generation Latency):扩散模型的高计算量阻碍了实时连续生成。随着视频时长增加,历史上下文不断膨胀,推理速度急剧下降——无法支持无限探索
- 文本控制能力不足(Insufficient Text Control):现有方法仅支持键盘/鼠标控制相机运动,无法通过文本指令触发场景中的事件(如”一只鬼出现了”),交互维度受限
Figure 1 解读:展示 Yume-1.5 的三种交互生成模式。上行:Text-to-World——从文本描述”A stylish woman walks down a Tokyo street…”出发,通过 WASD 键盘控制人物移动和方向键控制相机旋转,自回归生成连续的街景探索视频。中行:Image-to-World——从一张静态图像出发生成可探索世界。下行:Event Editing——在生成过程中通过文本描述”A ghost appeared”触发场景事件。每帧下方的控制面板显示了当前的键盘输入状态。
2. Idea (核心思想)
Yume-1.5 的核心 insight:通过联合时空-通道压缩(TSCM)将不断增长的历史上下文压缩到固定大小,结合 Self-Forcing 蒸馏实现少步推理,再通过文本分解(Event Description + Action Description)引入事件控制能力。
具体做法分三个维度:
- TSCM(Joint Temporal-Spatial-Channel Modeling):对历史帧做两级压缩——先在时空维度随机采样 + 高压缩率 Patchify(越远的帧压缩越狠),再在通道维度进一步降维到 96 维,用 linear attention 与当前预测帧融合。这样无论历史多长,输入 token 数基本恒定
- Self-Forcing + DMD 加速:Generator 用自身生成的帧(而非 GT)作为历史上下文进行训练,消除 train-inference mismatch;然后通过 Distribution Matching Distillation 将多步扩散蒸馏为 4 步推理
- 文本事件控制:将 caption 分解为 Event Description(描述场景/事件,仅首帧编码一次)和 Action Description(描述键盘动作,有限集合可预计算缓存),通过 T5 分别编码后拼接,实现低开销的文本控制
与 Self-Forcing、CausVid 的根本区别:不使用 KV cache,而是用 TSCM 压缩历史帧后通过 linear attention 融合,推理速度不随视频长度增长。
3. Method (方法)
3.1 整体架构
Figure 3 解读:Yume-1.5 的四个核心组件。(a) DiT Block:标准 self-attention 处理当前预测帧,cross-attention 处理文本条件,底部增加 linear attention 分支融合历史压缩 token ——先将预测帧 token 通过 FC 降维,与 拼接后用 linear attention 融合,结果通过 FC 恢复维度后 element-wise 加回 。(b) Training pipeline:视频经 WanEncoder 编码后与条件图像 拼接,Caption 分解为 Event Description 和 Action Description 分别通过 T5 编码后拼接为 ,与时间步嵌入一起送入 N 个 DiT Block。(c) History Tokens Downsampling:对不同时间距离的历史帧施加不同压缩率——最近 2 帧 (1,2,2),稍远帧 (1,4,4),更远帧 (1,8,8),以此类推,通过插值 Patchify 权重实现。(d) Inference:chunk-based 自回归生成,每次预测新 chunk 时将之前所有生成帧压缩为低压缩率(近期)和高压缩率(远期)两种 memory。
3.2 基础模型(Architecture Preliminary)
基于 Wan2.2-5B(T2V + I2V 统一模型):
- T2V 模式:噪声 ,文本嵌入 和 送入 DiT backbone
- I2V 模式:条件图像/视频 zero-pad 到与 同尺寸,构建 binary mask ,融合为
- 文本编码策略:与 Wan 不同,不直接把整个 caption 送 T5,而是拆分为 Event Description 和 Action Description 分别编码后拼接。Action Description 集合有限(WASD 组合),可预计算缓存,大幅降低 T5 推理开销
- 训练使用 Rectified Flow loss
3.3 TSCM: 联合时空-通道压缩
核心问题:自回归生成中历史帧 的帧数 不断增长,标准 attention 的计算量无法承受。现有方案(滑窗截断、FramePack 压缩、相机轨迹检索)各有缺陷。
Step 1: Temporal-Spatial Compression(时空压缩)
对历史帧 做两步处理:
- 随机帧采样:32 帧中只采 1 帧
- 自适应 Patchify 压缩:根据时间距离施加不同压缩率
表示时间维 、空间高 、空间宽 下采样。通过插值 Patchify 权重实现不同压缩率,无需额外参数。
压缩后的表示 与当前预测帧 (用原始 Patchify 处理)拼接后送入 DiT Block 的 standard self-attention。
Step 2: Channel Compression(通道压缩)
在时空压缩基础上进一步压缩:
- 用压缩率 的 Patchify 将 压缩,通道维降至 96,得到
- DiT Block 内部:预测帧 token 经 cross-attention 后通过 FC 降维,与 拼接得到
- 通过 Linear Attention 融合:
其中 是 ReLU 激活函数。Linear attention 的计算量对 token 数线性,对通道维敏感——所以先压缩通道到 96 再用 linear attention。
- 融合结果 通过 FC 恢复通道维度,element-wise 加回 :
设计哲学:Standard attention 对 token 数敏感 → 时空压缩减少 token 数;Linear attention 对通道维敏感 → 通道压缩降低维度。两者互补。
3.4 Self-Forcing + DMD 实时加速
Figure 4 解读:左(Generator):自回归生成过程。第一个 chunk 从首帧出发,后续 chunk 将自身生成的帧(非 GT)经 TSCM 压缩后作为历史上下文——这就是 Self-Forcing 的核心,用自身输出替代 GT 消除 train-inference mismatch。右(Distillation):Real Model(teacher)处理真实数据计算 score ,Fake Model(student)处理加噪生成数据计算 score ,通过 Distribution Matching Loss 的梯度更新 Generator ,将多步扩散蒸馏为少步(4 步)推理。
训练流程:
- Foundation Model 训练:在混合数据集上交替训练 T2V 和 I2V 任务,分辨率 704×1280,16 FPS,batch 40,Adam lr=1e-5,A100 GPU 训练 10K iterations
- Self-Forcing + TSCM 训练:在 foundation model 基础上,用自身生成帧作为历史上下文(替代 KV cache),通过 TSCM 压缩后训练 600 iterations
DMD Loss:
关键区别:DMD 使用模型预测数据(而非真实数据)作为 video conditioning,从而与 Self-Forcing 的训练范式保持一致。
3.5 数据处理
三个数据源:
- Sekai-Real-HQ:真实世界街景行走视频,包含相机轨迹和语义标注。将轨迹映射为离散键盘控制词表(vocab_camera 9 种相机动作 + vocab_human 10 种人物移动)。用 InternVL3-78B 为 I2V 训练重新标注 event-focused caption
- Synthetic Dataset:从 Openvid-1M 采 80K caption,用 Wan 2.1-14B 合成 80K 视频(720p),VBench 筛选 top 50K,用于维持 T2V 通用能力
- Event Dataset:招募标注员撰写 4 类事件描述(日常/科幻/奇幻/天气),收集 10K 第一人称图像,用 Wan 2.2-I2V-14B 合成 10K 视频,人工筛选 4K 高质量样本用于 T2V 训练
Figure 2 解读:数据集重标注示例。Original caption 描述静态场景上下文(“A vivid European-style city street scene…“),用于 T2V 训练。New caption 由 VLM 重新生成,聚焦视频中发生的动态事件(“People are moving aside to avoid the street sprinkler”),用于 I2V 训练——使模型学会根据事件描述生成动态内容。
3.6 Python Pseudocode
import torch
import torch.nn as nn
import torch.nn.functional as F
# ============================================================
# 1. TSCM: Joint Temporal-Spatial-Channel Modeling
# ============================================================
class TemporalSpatialCompression(nn.Module):
"""Step 1: Compress history frames with adaptive Patchify rates."""
def __init__(self, in_channels, base_patch_size=(1, 2, 2)):
super().__init__()
self.base_patchify = nn.Conv3d(in_channels, in_channels,
kernel_size=base_patch_size,
stride=base_patch_size)
def get_compression_rate(self, temporal_distance):
"""Eq.3: farther frames get higher compression."""
if temporal_distance <= 2:
return (1, 2, 2) # 1x temporal, 2x spatial
elif temporal_distance <= 6:
return (1, 4, 4) # 1x temporal, 4x spatial
elif temporal_distance <= 23:
return (1, 8, 8) # 1x temporal, 8x spatial
else:
return (1, 16, 16) # maximum compression
def adaptive_patchify(self, z_c, rate):
"""Achieve variable compression by interpolating Patchify weights."""
# Interpolate base_patchify weights to target rate
base_weight = self.base_patchify.weight # (C_out, C_in, 1, 2, 2)
target_weight = F.interpolate(base_weight, size=rate, mode='trilinear')
return F.conv3d(z_c, target_weight, stride=rate)
def forward(self, z_c, current_frame_idx):
"""
z_c: (B, C, T_history, H, W) - all history frames
Returns: z_hat_c concatenated with z_hat_p
"""
# Random frame sampling: keep 1 out of 32
sampled_indices = list(range(0, z_c.shape[2], 32))
compressed_chunks = []
for idx in sampled_indices:
dist = current_frame_idx - idx
rate = self.get_compression_rate(dist)
chunk = z_c[:, :, idx:idx+1]
compressed_chunks.append(self.adaptive_patchify(chunk, rate))
z_hat_c = torch.cat(compressed_chunks, dim=2)
return z_hat_c
class ChannelCompression(nn.Module):
"""Step 2: Further compress to 96-dim channels for linear attention."""
def __init__(self, in_channels, compressed_channels=96):
super().__init__()
# High compression Patchify: (8, 4, 4) rate, output 96 channels
self.high_compress_patchify = nn.Conv3d(
in_channels, compressed_channels,
kernel_size=(8, 4, 4), stride=(8, 4, 4)
)
def forward(self, z_c):
"""
z_c: (B, C, T, H, W) - history frames
Returns: z_linear (B, N_tokens, 96)
"""
z_linear = self.high_compress_patchify(z_c) # (B, 96, T', H', W')
B, C, T, H, W = z_linear.shape
z_linear = z_linear.reshape(B, C, -1).permute(0, 2, 1) # (B, N, 96)
return z_linear
class LinearAttentionFusion(nn.Module):
"""Eq.4: Linear attention for fusing history tokens with prediction tokens."""
def __init__(self, dim, compressed_dim=96):
super().__init__()
self.fc_down = nn.Linear(dim, compressed_dim) # reduce pred tokens dim
self.fc_up = nn.Linear(compressed_dim, dim) # restore dim after fusion
self.q_proj = nn.Linear(compressed_dim, compressed_dim)
self.k_proj = nn.Linear(compressed_dim, compressed_dim)
self.v_proj = nn.Linear(compressed_dim, compressed_dim)
self.norm = nn.LayerNorm(compressed_dim)
self.out_proj = nn.Linear(compressed_dim, compressed_dim)
def phi(self, x):
"""Kernel function: ReLU activation."""
return F.relu(x)
def forward(self, z_pred, z_linear):
"""
z_pred: (B, N_pred, D) - prediction frame tokens after cross-attn
z_linear: (B, N_hist, 96) - compressed history tokens
Returns: (B, N_pred, D) - fused result to add back
"""
z_p_down = self.fc_down(z_pred) # (B, N_pred, 96)
z_fus = torch.cat([z_linear, z_p_down], dim=1) # (B, N_hist+N_pred, 96)
q = self.q_proj(z_fus)
k = self.k_proj(z_fus)
v = self.v_proj(z_fus)
# Normalize q, k for stability
q = F.normalize(q, dim=-1)
k = F.normalize(k, dim=-1)
# Linear attention: O(N*d^2) instead of O(N^2*d)
phi_k = self.phi(k) # (B, N, d)
phi_q = self.phi(q) # (B, N, d)
# Compute: (sum_i v_i * phi(k_i)^T) * phi(q) / (sum_j phi(k_j)^T * phi(q))
kv = torch.einsum('bnd,bnc->bdc', phi_k, v) # (B, d, d)
numerator = torch.einsum('bdc,bnc->bnd', kv, phi_q) # (B, N, d)
denominator = torch.einsum('bnd,bnd->bn', phi_k.sum(1, keepdim=True).expand_as(phi_q), phi_q)
denominator = denominator.unsqueeze(-1).clamp(min=1e-6)
o = numerator / denominator
o = self.norm(o)
o = self.out_proj(o)
# Extract only prediction token positions, restore dim
o_pred = o[:, -z_pred.shape[1]:] # (B, N_pred, 96)
return self.fc_up(o_pred) # (B, N_pred, D) - add to z_l
# ============================================================
# 2. Text Encoding: Event + Action Description
# ============================================================
class DualTextEncoder(nn.Module):
"""Decompose caption into Event Description and Action Description."""
def __init__(self, t5_encoder):
super().__init__()
self.t5 = t5_encoder
# Pre-compute all possible action descriptions
self.action_vocab = self._build_action_cache()
def _build_action_cache(self):
"""Pre-compute T5 embeddings for all keyboard combinations."""
actions = [
"Camera moves forward (W).", "Camera moves left (A).",
"Camera moves backward (S).", "Camera moves right (D).",
"Camera moves forward and left (W+A).",
"Camera moves forward and right (W+D).",
"Camera turns right (→).", "Camera turns left (←).",
"Camera tilts up (↑).", "Camera tilts down (↓).",
"Camera stands still (·).",
]
cache = {}
with torch.no_grad():
for action in actions:
cache[action] = self.t5.encode(action)
return cache
def forward(self, event_description, action_description):
"""
event_description: str - scene/event text (encoded once at start)
action_description: str - keyboard action (looked up from cache)
Returns: c_a (B, L, D) - concatenated text embedding
"""
c_event = self.t5.encode(event_description) # only at first frame
c_action = self.action_vocab[action_description] # from cache
c_a = torch.cat([c_event, c_action], dim=1)
return c_a
# ============================================================
# 3. Self-Forcing Training with TSCM
# ============================================================
class SelfForcingTrainer:
"""Train generator using its own outputs as history context."""
def __init__(self, generator, real_model, fake_model):
self.G = generator # G_theta: few-step generator
self.G_real = real_model # teacher: multi-step diffusion
self.G_fake = fake_model # student: tracks G_theta via EMA
def generate_long_video(self, first_frame, caption, num_chunks):
"""Autoregressive generation with Self-Forcing."""
all_frames = [first_frame]
for chunk_idx in range(num_chunks):
# Use OWN generated frames as history (not GT)
history = torch.cat(all_frames, dim=2)
# Compress via TSCM
history_tokens = self.G.tscm(history, chunk_idx)
# Generate new chunk (4-step inference)
noise = torch.randn_like(first_frame)
new_chunk = self.G(noise, history_tokens, caption, num_steps=4)
all_frames.append(new_chunk)
return torch.cat(all_frames, dim=2)
def distillation_step(self, z_t, history_tokens, caption, t):
"""
DMD loss: match score functions of real and fake models.
Eq.6: gradient of distribution matching loss.
"""
# Real model score (teacher, on real data distribution)
with torch.no_grad():
G_real_output = self.G_real(z_t, history_tokens, caption, t)
s_real = (z_t - G_real_output) / (t ** 2)
# Fake model score (student, on generated data distribution)
with torch.no_grad():
G_fake_output = self.G_fake(z_t, history_tokens, caption, t)
s_fake = (z_t - G_fake_output) / (t ** 2)
# Generator forward
G_theta_output = self.G(z_t, history_tokens, caption, t)
# DMD gradient: (s_real - s_fake) * dG/dtheta
loss = ((s_real - s_fake) * G_theta_output).sum()
return loss
# ============================================================
# 4. DiT Block with TSCM Integration
# ============================================================
class YumeDiTBlock(nn.Module):
"""Single DiT block with standard attn + linear attn for history."""
def __init__(self, hidden_dim, num_heads, compressed_dim=96):
super().__init__()
# Standard components
self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads)
self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads)
self.ffn = nn.Sequential(
nn.Linear(hidden_dim, hidden_dim * 4),
nn.GELU(),
nn.Linear(hidden_dim * 4, hidden_dim)
)
# TSCM: linear attention branch
self.linear_attn = LinearAttentionFusion(hidden_dim, compressed_dim)
# AdaLN modulation
self.adaln = nn.Linear(hidden_dim, hidden_dim * 6)
def forward(self, z_l, z_linear, text_emb, time_emb):
"""
z_l: (B, N, D) - current prediction + compressed history tokens
z_linear: (B, N_hist, 96) - channel-compressed history tokens
text_emb: (B, L, D) - concatenated event + action embeddings
time_emb: (B, D) - timestep embedding
"""
# AdaLN modulation
mod = self.adaln(time_emb).unsqueeze(1)
shift1, scale1, gate1, shift2, scale2, gate2 = mod.chunk(6, dim=-1)
# Self-attention (standard, on prediction + spatially-compressed history)
h = z_l * (1 + scale1) + shift1
h = h + gate1 * self.self_attn(h, h, h)[0]
# Cross-attention with text
h = h + self.cross_attn(h, text_emb, text_emb)[0]
# Linear attention fusion with channel-compressed history
h_pred = h[:, -z_l.shape[1]:] # extract prediction tokens
h = h + self.linear_attn(h_pred, z_linear)
# FFN
h2 = h * (1 + scale2) + shift2
h = h + gate2 * self.ffn(h2)
return h3.7 Code Mapping(代码映射)
| 论文组件 | 代码位置 | 说明 |
|---|---|---|
| DiT Backbone (Wan-based) | fastvideo/models/hunyuan/modules/model.py → WanModel | 基于 Wan 架构的 DiT 主干,支持 T2V/I2V 双模式 |
| Self-Attention Block | fastvideo/models/hunyuan/modules/model.py → WanAttentionBlock | 包含 self-attn + cross-attn + FFN,含 RoPE 位置编码 |
| Self-Attention (standard) | fastvideo/models/hunyuan/modules/model.py → WanSelfAttention | QK-Norm + RoPE + Flash Attention |
| Cross-Attention (T2V) | fastvideo/models/hunyuan/modules/model.py → WanT2VCrossAttention | 文本条件注入 |
| Cross-Attention (I2V) | fastvideo/models/hunyuan/modules/model.py → WanI2VCrossAttention | 图像条件注入,分离 image/text stream |
| Flash Attention 实现 | fastvideo/models/flash_attn_no_pad.py | 无 padding 的 Flash Attention |
| Parallel Attention (SP) | fastvideo/models/hunyuan/modules/attenion.py → parallel_attention() | 序列并行下的分布式注意力 |
| Distillation 训练 | fastvideo/distill_model.py → distill_one_step() | GAN-based 蒸馏 + EMA teacher,包含 discriminator loss |
| 训练脚本 | scripts/ | 推理和训练入口脚本 |
| 数据预处理 | fastvideo/data_preprocess/ | 相机轨迹→键盘控制转换等 |
| 交互式 Demo | webapp_single_gpu.py | 单 GPU Windows 部署 Demo(RTX 4090, 16GB VRAM) |
注意:TSCM(linear attention + channel compression)和 Self-Forcing 的完整实现在开源代码中尚未完全体现——model.py 中的 WanModel 是基础 Wan 架构,TSCM 相关的 linear attention 融合模块可能在训练代码或未公开的扩展模块中。开源仓库主要提供了推理和基础蒸馏流程。
4. Experimental Setup (实验设置)
- 预训练模型:Wan2.2-5B-T2V
- 训练配置:704×1280, 16 FPS, batch 40, Adam lr=1e-5, A100 GPU
- Foundation model: 10K iterations; Self-Forcing+TSCM: 600 iterations
- 评估:Yume-Bench(6 个 fine-grained 指标),测试数据 544×960, 16 FPS, 96 帧,4 步推理
- 对比方法:Wan-2.1, MatrixGame, Yume (1.0)
5. Experimental Results (实验结果)
5.1 Image-to-Video 生成质量
| Model | Time(s)↓ | IF↑ | SC↑ | BC↑ | MS↑ | AQ↑ | IQ↑ |
|---|---|---|---|---|---|---|---|
| Wan-2.1 | 611 | 0.057 | 0.859 | 0.899 | 0.961 | 0.494 | 0.695 |
| MatrixGame | 971 | 0.271 | 0.911 | 0.932 | 0.983 | 0.435 | 0.750 |
| Yume | 572 | 0.657 | 0.932 | 0.941 | 0.986 | 0.518 | 0.739 |
| Yume-1.5 | 8 | 0.836 | 0.932 | 0.945 | 0.985 | 0.506 | 0.728 |
关键发现:
- Instruction Following 大幅领先:Yume-1.5 的 IF=0.836,远超 Yume(0.657) 和 Wan-2.1(0.057),说明文本控制和键盘指令跟随能力显著增强
- 推理速度碾压式优势:仅需 8 秒(4 步推理),对比 Wan-2.1 需 611 秒、MatrixGame 需 971 秒
- 在单张 A100 上实现 540p 12 fps 实时生成
- 其他指标(SC, BC, MS)与 Yume 持平或略优,AQ/IQ 略低于 MatrixGame 但差距不大
5.2 长视频生成性能
Figure 5 解读:长视频生成中 Aesthetic Score 随 chunk 数量的变化。红线(Self-Forcing without TSCM)从第 2 个 chunk 开始急剧下降,到第 6 个 chunk 时降至 0.44。蓝线(Self-Forcing with TSCM)在第 4-6 个 chunk 保持稳定,最终美学分数 0.523——比无 TSCM 高 18%。说明 TSCM 有效利用了长程历史信息,抑制了质量退化。
Figure 6 解读:Image Quality 随 chunk 数量的变化。同样的趋势——有 TSCM 的模型在第 5-6 个 chunk 保持更一致的图像质量(0.601 vs 0.542),进一步验证 TSCM 对长程生成的稳定作用。
5.3 TSCM 消融实验
| Model | IF↑ | SC↑ | BC↑ | MS↑ | AQ↑ | IQ↑ |
|---|---|---|---|---|---|---|
| TSCM | 0.836 | 0.932 | 0.945 | 0.985 | 0.506 | 0.728 |
| Spatial Compression Only | 0.767 | 0.935 | 0.945 | 0.973 | 0.504 | 0.733 |
TSCM 在关键指标 IF 上提升 9%(0.836 vs 0.767),说明通道压缩 + linear attention 融合有效减少了历史帧运动方向对当前预测的干扰。其他指标差异较小。
5.4 推理速度对比
Figure 7 解读:三种方法随 chunk 数量增加的推理时间变化(704×1280)。Full Context(蓝线)推理时间线性增长,到第 8 个 chunk 时约 9 秒/chunk。Spatial Compression(绿线)增长较慢但仍有趋势。TSCM(红线)保持基本恒定(约 3 秒/chunk),验证了 TSCM 将历史上下文压缩到固定大小的设计目标——推理速度不随视频长度增长。
5.5 定性结果

Figure 8 解读:544×960 分辨率下四种方法的定性对比。Wan-2.1 缺乏交互控制能力,生成结果与指令不匹配。MatrixGame 有一定控制能力但画面质量较低。Yume 画面质量较好但运动控制精度有限。Yume-1.5 在仅用 4 步采样的情况下,同时实现了高画面质量和精确的相机/人物控制——其他方法均使用 50 步采样。
5.6 Conclusion(总结与展望)
核心贡献:
- TSCM:联合时空-通道压缩 + linear attention,实现恒定速度的无限长视频生成
- Self-Forcing + DMD 加速:消除 train-inference mismatch + 4 步蒸馏,单 A100 实现 540p 12fps
- 文本事件控制:通过 Caption 分解 + Event Dataset,首次在交互世界模型中引入文本触发事件能力
局限性:
- 倒车、倒退行走等反向运动仍有 artifact
- 极高人群密度场景质量下降
- 5B 参数规模限制了生成能力上限;考虑 MoE 架构(Wan2.2 启发)兼顾参数量和延迟
未来方向:更复杂的世界交互、MoE 架构扩展、虚拟环境和仿真系统应用。