Helios: Real Real-Time Long Video Generation Model

Authors: Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, Li Yuan Affiliations: Peking University, ByteDance China, Canva, Chengdu Anu Intelligence

1. Motivation (研究动机)

当前视频生成领域追求实时、长时间、高质量三者兼得,但现有方法都存在根本性局限:

  • 速度瓶颈:主流模型(如 Wan2.1 14B)生成 5 秒视频需约 50 分钟,远达不到实时。即便号称实时的方法也存在问题——Krea-RealTime-14B 在 H100 上仅 6.7 FPS 且漂移严重
  • 小模型困局:现有实时长视频方法(CausVid, Self-Forcing, Rolling Forcing 等)几乎都基于 ~1.3B 小模型,容量不足以建模复杂运动和高频细节
  • 反漂移代价高:Self-Forcing 需要 train-as-infer rollout,Error-Banks 需要额外模块,且它们的鲁棒性与训练时的 rollout 长度强耦合——训练 5 秒片段,推理超过 5 秒就严重漂移
  • 加速技术有副作用:KV-Cache、Causal Masking、Sparse Attention 等改变了双向预训练模型的推理范式,限制了可达质量上限

核心矛盾:大模型质量好但太慢,小模型够快但质量差

Figure 1 解读:各模型在单卡 H100 上的端到端吞吐量(FPS)对比。深蓝色为 Helios-Distilled(19.53 FPS),浅蓝色为 Helios-Base(0.54 FPS)。斜线填充为 ≤1.3B 小模型,灰色为 ≥2B 大模型。Helios-Distilled 作为 14B 大模型,达到了与 1.3B 小模型蒸馏版(Reward Forcing 22.13、Self Forcing 21.20)相当的速度,远超同尺寸的 Krea(6.74)和 FastVideo(5.37)。

Figure 2 解读:左图为长视频生成的 Total Score,右图为短视频生成的 Total Score。Helios-Base(6.57)和 Helios-Distilled(6.34)在长视频上全面领先;短视频上 Helios-Base(6.35)匹配 Wan 2.1(6.35),Helios-Distilled(6.00)超越所有蒸馏模型。


2. Idea (核心思想)

Helios 的核心 insight:14B 大模型可以比 1.3B 小模型更便宜、更快、同时更强

实现路径是从三个独立维度同时发力,每个维度都避开了前人的常见陷阱:

  1. Unified History Injection:不用 Causal Masking(破坏双向注意力),而是通过 Representation Control + Guidance Attention 将双向模型改造为自回归生成器
  2. Easy Anti-Drifting:不用 Self-Forcing(训练昂贵 + 长度依赖),而是直接在训练中模拟漂移的三种表现形式(位置/颜色/恢复偏移)
  3. Deep Compression Flow:不用 KV-Cache 或 Sparse Attention,而是从 token 数量(Multi-Term Memory Patchification + Pyramid UniPC)和采样步数(Adversarial Hierarchical Distillation)两个维度深度压缩计算量

最终结果:14B 模型在单卡 H100 上 19.5 FPS,计算量降到与 1.3B 模型可比,质量超越所有同类方法。


3. Method (方法)

3.1 整体架构

Figure 4 解读:整体架构图。左下方:历史上下文(绿色)和噪声上下文(灰色)拼接为输入,经 Multi-Term Memory Patchification 分层压缩(Long/Mid/Short 三级 Conv3d)。中间:Autoregressive Video Diffusion Transformer 主体。右上方:Guidance Attention 的两种形式——Self-Attention 中历史和噪声 token 共同参与但用 调制历史 key;Cross-Attention 中仅噪声 token 接收文本信息。右侧:Pyramid Unified Predictor Corrector 从低分辨率逐步上采样到全分辨率。底部:Representation Control 通过零填充/帧填充自动切换 T2V/I2V/V2V 任务。

Helios 是一个 14B Autoregressive Video Diffusion Transformer,基于 Wan-2.1-T2V-14B 初始化。核心架构改动:

  • 输入拼接:历史上下文 (干净,经 Multi-Term Memory Patchification 压缩)+ 噪声上下文 (待去噪)
  • 注意力机制:Guidance Self-Attention(区分历史/噪声 token 的角色)+ Guidance Cross-Attention(仅对噪声 token 注入文本语义)
  • 任务统一:通过 的内容自动切换 T2V/I2V/V2V

三阶段渐进训练:

  • Stage 1 (Base):架构改造(Unified History Injection + Easy Anti-Drifting + Multi-Term Memory Patchification)
  • Stage 2 (Mid):token 压缩(引入 Pyramid Unified Predictor Corrector)
  • Stage 3 (Distilled):step 蒸馏(Adversarial Hierarchical Distillation,50 步 → 3 步)

3.2 Unified History Injection

Representation Control:将长视频生成建模为视频续写任务。

其中 。模型对 去噪,以 为条件。任务切换通过 内容自动实现:全零 → T2V;仅最后一帧非零 → I2V;多帧非零 → V2V。

Guidance Attention:历史上下文(干净信号)和噪声上下文(待去噪)统计特性不同,必须区别对待。

Self-Attention 中引入 head-wise amplification token 来调制历史 key,选择性放大或抑制每个 attention head 中的历史信息:

Cross-Attention 仅作用于 (历史帧已包含文本语义,重复注入是冗余的):

历史帧的 timestep 固定为 0(表示干净、无噪声),整个去噪过程中不会被修改。

def guidance_attention_block(hidden_states, encoder_hidden_states, amp, navit_mask):
    """Per transformer block: Guidance Self-Attention + Cross-Attention + FFN."""
    # --- Self-Attention with history amplification ---
    Q, K, V = qkv_proj(norm1(hidden_states))
    Q_noisy, Q_hist = split(Q, at=noisy_seq_len)
    K_noisy, K_hist = split(K, at=noisy_seq_len)
    V_noisy, V_hist = split(V, at=noisy_seq_len)
 
    K_hist = K_hist * amp  # head-wise amplification (learnable per-head)
 
    attn_out = flash_attention(
        Q=concat(Q_noisy, Q_hist),
        K=concat(K_noisy, K_hist),
        V=concat(V_noisy, V_hist),
    )
    hidden_states = hidden_states + attn_out
 
    # --- Cross-Attention: only noisy tokens attend to text ---
    noisy_states = hidden_states[:noisy_seq_len]
    Q_noisy = q_proj(norm2(noisy_states))
    K_text, V_text = kv_proj(encoder_hidden_states)
    cross_out = attention(Q_noisy, K_text, V_text, mask=navit_mask)
    hidden_states[:noisy_seq_len] += cross_out
 
    # --- FFN ---
    hidden_states = hidden_states + ffn(norm3(hidden_states))
    return hidden_states

源码参考:helios/modules/transformer_helios.pyHeliosTransformerBlock.forward()HeliosAttnProcessorscale_key 实现 amp 放大。

3.3 Easy Anti-Drifting

Figure 5 解读:三种典型漂移模式的可视化。(a) Position Shift:画面在长视频中突然跳到不相关的场景(位置编码超出训练分布)。(b) Color Shift:画面整体色调随时间逐渐偏移(霓虹灯场景变暗/变色)。(c-d) Restoration Shift:画面出现类似图像修复的伪影——(c) 加噪感(粗糙颗粒)和 (d) 模糊感(细节丢失),这是因为模型在不完美的自身输出上反复条件化导致的误差累积。

论文系统地总结了长视频生成中的三种典型漂移模式,并针对性地提出简单有效的解决方案:

a) Position Shift → Relative RoPE

绝对 RoPE 的问题:训练在 5 秒(~109 帧)上,生成 1440 帧视频需要用到 index 0:1399,远超训练分布。此外 RoPE 的周期性与 Multi-Head Attention 交互会导致重复运动(repetitive motion)。

解决方案:限制历史的时间索引为 ,噪声部分为 。无论生成多长视频,每次去噪窗口的位置编码范围都固定,消除了分布外和周期性问题。

b) Color Shift → First-Frame Anchor

Figure 6 解读:四个子图分别追踪正常视频(蓝色)和漂移视频(红色)的 Saturation、Aesthetic Score、RGB Mean、RGB Std 随时间的变化。正常视频的所有指标保持平稳,而漂移视频在某个时间点后出现突变——Saturation 骤降、Aesthetic 下降、RGB 均值和方差剧烈波动。这说明漂移不是渐进的,而是在某个临界点后突然恶化,且几乎不在生成开头发生(支持 First-Frame Anchor 的设计动机)。

通过跟踪 saturation、aesthetic score 和 RGB 统计量,发现漂移视频的分布会在某个时间点突然偏移,而正常视频保持稳定。并且漂移几乎不在开头发生。

解决方案:训练和推理时始终保留第一帧 中,作为全局视觉锚点,约束后续帧的统计分布。

c) Restoration Shift → Frame-Aware Corrupt

模型训练在干净历史上,推理在自身不完美输出上——train-test mismatch 导致误差累积。

解决方案:训练时对历史帧独立施加随机扰动:

def frame_aware_corrupt(history_latents, p_a, p_b, p_c, p_d,
                        a_range, b_range, c_range):
    """Training-time corruption to simulate imperfect inference history."""
    # Skip frame 0 — it's the first-frame anchor, always kept clean
    for t in range(1, history_latents.shape[2]):
        frame = history_latents[:, :, t, :, :]
        r = torch.rand(1).item()
 
        if r < p_a:  # exposure adjustment
            scale = uniform(a_range[0], a_range[1])  # e.g. [0.3, 1.7]
            frame *= scale
        elif r < p_a + p_b:  # add noise
            sigma = uniform(b_range[0], b_range[1])  # e.g. [0, 0.33]
            frame += torch.randn_like(frame) * sigma
        elif r < p_a + p_b + p_c:  # downsample + upsample (blur)
            ratio = uniform(c_range[0], c_range[1])  # e.g. [0, 0.1]
            frame = F.interpolate(
                F.interpolate(frame, scale_factor=ratio),
                size=frame.shape[-2:])
        else:  # keep clean (probability p_d)
            pass
 
    return history_latents

源码参考:helios/utils/utils_helios_base.pycorrupt_history_latents()add_saturation_to_history_latents()。实际实现中 is_frame_independent=True 时每帧独立采样 noise sigma。Stage 1 配置:;Stage 3 配置:

3.4 Deep Compression Flow — Token 维度

Multi-Term Memory Patchification

Figure 7 解读:三个子图展示 Multi-Term Memory Patchification 的效果。左图:随着历史长度增加,Naive 方法(蓝色)的 token 数线性增长,而本文方法(红色)保持恒定(约 2000 tokens)。中图:Naive 方法在历史长度 >6 时 OOM,本文方法的训练显存始终稳定在约 71 GB。右图:推理延迟——Naive 方法线性增长,本文方法保持约 5ms 恒定。

核心观察:预测未来帧时,近期历史提供局部运动连续性,远期历史提供粗粒度全局上下文。因此越远的历史可以压缩得越狠。

分为 short/mid/long 三段( 帧),对每段用不同压缩比的 3D Conv kernel:

def multi_term_memory_patchify(X_hist, T1=16, T2=2, T3=2):
    """Compress history with increasing patchification for distant frames."""
    # Split into three temporal segments
    hist_short = X_hist[:, :, -T1:, :, :]         # most recent T1 frames
    hist_mid   = X_hist[:, :, -T1-T2:-T1, :, :]   # middle T2 frames
    hist_long  = X_hist[:, :, :-T1-T2, :, :]      # oldest T3 frames
 
    # Apply 3D Conv with increasing compression ratio
    tok_short = patch_short(hist_short)                          # Conv3d(1,2,2) → 4x
    tok_mid   = patch_mid(pad_for_3d_conv(hist_mid, (2,4,4)))   # Conv3d(2,4,4) → 32x
    tok_long  = patch_long(pad_for_3d_conv(hist_long, (4,8,8))) # Conv3d(4,8,8) → 256x
 
    # Flatten spatial dims → (B, num_tokens, D)
    for tok in [tok_short, tok_mid, tok_long]:
        tok = tok.flatten(2).transpose(1, 2)
 
        # Compute per-segment RoPE with relative frame indices
        rope_short = rotary_pos_emb(frame_indices=range(0, T1), H=H1, W=W1)
        rope_mid   = rotary_pos_emb(frame_indices=range(0, T2), H=H2, W=W2)
        rope_long  = rotary_pos_emb(frame_indices=range(0, T3), H=H3, W=W3)
 
        # Concatenate: long (coarse) → mid → short (fine)
        return torch.cat([tok_long, tok_mid, tok_short], dim=1)

实际配置(从源码确认):

分段帧数Conv3d kernel/stride压缩比源码定义
Short ()16(1, 2, 2)4xself.patch_short
Mid ()2(2, 4, 4)32xself.patch_mid
Long ()2(4, 8, 8)256xself.patch_long

历史 token 总量减少约 8x,Attention FLOPs 减少约 64x

源码参考:helios/modules/transformer_helios.pyHeliosTransformer3DModel.__init__() 定义 Conv3d,process_input_hidden_states() 执行拼接和 RoPE。pad_for_3d_conv() 对不整除的帧数做 padding。

Pyramid Unified Predictor Corrector

Figure 8 解读:Pyramid Unified Predictor Corrector 的三阶段流程。从左到右:Low Resolution Stage(,关注全局布局和颜色)→ Mid Resolution Stage(平衡质量和效率)→ High Resolution Stage(,细化纹理和边缘)。每个阶段结束时通过 Upsample & Renoise 过渡到下一分辨率,并 reset UniPC state buffer 避免跨尺度的 correction 伪影。

对噪声上下文 采用 coarse-to-fine 多分辨率策略。分 阶段:低分辨率关注全局结构,高分辨率细化纹理。

每阶段学习从 scale 到 scale 的 velocity field:

def pyramid_unipc_inference(noise, X_hist, y, K=3):
    """Coarse-to-fine multi-resolution denoising with UniPC solver."""
    x = noise  # starts at lowest resolution: (B, C, T, H/4, W/4)
    boundaries = [1000, T1, T2, 0]  # partition timestep [0, 1000] into K stages
 
    for k in range(1, K + 1):
        timesteps = linspace(boundaries[k-1], boundaries[k], N_k)
        unipc_buffer = UniPCBuffer()  # fresh buffer per stage
 
        for t_prev, t_curr in pairwise(timesteps):
            v = dit(x, X_hist, y, t=t_prev, stage=k)  # predict velocity field
            x = unipc_buffer.update(x, v, t_prev, t_curr)  # multi-step corrector
 
            if k < K:  # transition to next resolution
            x = F.interpolate(x, scale_factor=2, mode='nearest')  # 2x upsample
            x = renoise_correction(x)  # fix distribution shift from upsampling
            unipc_buffer.reset()  # cross-scale predictions are invalid
 
            return x  # full resolution (B, C, T, H, W)

多尺度推理总 token 数相比单尺度减少约 2.29x

源码参考:helios/scheduler/scheduling_helios.py 管理多阶段 timestep 分配和 UniPC state buffer reset。

3.5 Deep Compression Flow — Step 维度

Adversarial Hierarchical Distillation(基于 DMD 框架):

Figure 9 解读:蒸馏 pipeline。左侧:Few-Step Generator(带 EMA 更新)接收噪声 和历史 (经 Frame-Aware Corrupt),生成 。中间上方: 和真实数据 分别经 Dynamic Re-noise 得到 。右侧:Real-Score Estimator(frozen)输出条件和无条件分数用于 ;Fake-Score Estimator(trainable)用 训练。底部:Multi-head GAN Discriminator 提供 ,打破对 teacher 分布的依赖。

四项关键改进:

  1. Pure Teacher Forcing:用 Helios-Base 作 teacher(已具备长视频生成能力),配合 Easy Anti-Drifting,仅需每步生成一段即可——不需要 Self-Forcing 的多段 rollout
  2. Staged Backward Simulation:将 backward simulation 分解为 阶段独立估计:
  3. Coarse-to-Fine Learning:(a) Staged ODE Init (b) Dynamic Re-noise(Beta 分布 + cosine 参数调度)
  4. Adversarial Post-Training:在 DiT 的第 5/15/25/35/39 层添加 GAN 判别器头(dim=768)
def adversarial_hierarchical_distillation_step(x_0, y, G_theta, p_real, p_fake, D_heads, step):
    """One training step of Adversarial Hierarchical Distillation (DMD + GAN)."""
    # --- Forward: generate fake sample via staged backward simulation ---
    z_t = torch.randn_like(x_0)
    x_0_staged = G_theta.staged_backward_sim(z_t, y)  # K-stage backward sim
 
    # --- Dynamic Re-noise (Beta distribution with cosine schedule) ---
    tau = Beta(alpha=cosine_schedule(step), beta=cosine_schedule(step)).sample()
    x_tau_fake = add_noise(x_0_staged, tau)
    x_tau_real = add_noise(x_0, tau)
 
    # --- Score estimation ---
    s_real_cond, s_real_uncond = p_real(x_tau_real, y, tau)  # frozen teacher
    s_fake_cond = p_fake(x_tau_fake, y, tau)                  # trainable
 
    # --- DMD loss (distribution matching distillation) ---
    L_DMD = distribution_matching_loss(s_real_cond, s_real_uncond, s_fake_cond)
 
    # --- GAN loss (adversarial post-training) ---
    crop_real = random_crop(x_tau_real, H // 2, W // 2)  # half-res to save memory
    crop_fake = random_crop(x_tau_fake, H // 2, W // 2)
 
    L_D = (torch.log(D_heads(crop_real)).mean()
           - torch.log(D_heads(crop_fake)).mean()
           + 100.0 * approx_r1(D_heads, crop_real, sigma=0.1))  # λ_D=100, σ_D=0.1
    L_G = torch.log(D_heads(crop_fake)).mean()
 
    # --- Flow matching loss (stabilizer for p_fake) ---
    L_Flow = F.mse_loss(p_fake.predict_velocity(x_tau_fake), v_target)
 
    # --- Combined losses ---
    L_generator = L_DMD + 0.05 * L_G    # w_G = 5e-2
    L_estimator = L_Flow + 0.01 * L_D   # w_D = 1e-2
 
    # --- TTUR: update p_fake every step, G_θ every 5 steps ---
    optimizer_fake.zero_grad(); L_estimator.backward(); optimizer_fake.step()
    if step % 5 == 0:
        optimizer_G.zero_grad(); L_generator.backward(); optimizer_G.step()

源码参考:train_helios.py_generator_loss()_critic_loss();GAN heads 定义在 transformer_helios.py。实际使用 Sharded EMA(ZeRO-3)+ Async VRAM Freeing + Cache Grad for GAN 来优化显存。

Figure 10 解读:Cache Grad for GAN 的流程示意图,结合前文的 adversarial post-training 设计,说明其通过缓存梯度相关中间量来降低显存占用,并支持判别器头的训练。

3.6 其他推理技术

  • Adaptive Sampling:用 EMA 跟踪 latent 的全局均值 和方差

当偏差超阈值 时,对下一段历史施加 Frame-Aware Corrupt

  • Interactive Interpolation:用户切换 prompt 时,线性插值 步平滑过渡:

3.7 伪代码:Helios 自回归推理(总体流程)

def helios_autoregressive_inference(y, num_sections, T_noisy=33):
    """Generate arbitrarily long video via autoregressive chunk generation."""
    history_latents = []
    image_latents = None   # first-frame anchor
    mu_ema, var_ema = None, None  # EMA statistics for drift detection
    video_chunks = []
 
    for k in range(num_sections):
        # Step A: Compress history with multi-term patchification
        X_hist = multi_term_memory_patchify(history_latents, T1=16, T2=2, T3=2)
 
        # Step B: Prepend first-frame anchor
        if image_latents is not None:
            X_hist = torch.cat([image_latents, X_hist], dim=1)
 
            # Step C: Denoise via Pyramid UniPC (coarse-to-fine)
            noise = torch.randn(B, C, T_noisy, H, W)
            latents = pyramid_unipc_inference(noise, X_hist, y, K=3)
 
            # Step D: Adaptive drift detection
            mu_k, var_k = latents.mean(), latents.var()
            if mu_ema is not None:
                if (mu_k - mu_ema).norm() > delta_mu or (var_k - var_ema).norm() > delta_var:
                    history_latents = frame_aware_corrupt(history_latents, ...)
                    mu_ema = ema_update(mu_ema, mu_k, rho=0.9)
                    var_ema = ema_update(var_ema, var_k, rho=0.9)
 
                    # Step E: Save first frame as anchor (once)
                    if k == 0 and image_latents is None:
                        image_latents = latents[:, :, 0:1, :, :]
 
                        history_latents = torch.cat([history_latents, latents], dim=2)
                        video_chunks.append(vae.decode(latents))
 
                        return torch.cat(video_chunks, dim=2)  # full video

源码参考:helios/pipelines/pipeline_helios.pyfor k in range(num_latent_sections) 循环。每次生成 33 帧/chunk。

3.8 代码-论文映射

论文概念源码文件关键类/函数
Guidance Self-Attentionhelios/modules/transformer_helios.pyHeliosTransformerBlock.forward(), HeliosAttnProcessor (scale_key)
Guidance Cross-Attentionhelios/modules/transformer_helios.pyguidance_cross_attn flag + navit_encoder_attention_mask
Multi-Term Memory Patchificationhelios/modules/transformer_helios.pyself.patch_short/mid/long (Conv3d), process_input_hidden_states()
Representation Controlhelios/modules/transformer_helios.pyHeliosTransformer3DModel.forward() — zero padding for T2V
Relative RoPEhelios/modules/transformer_helios.pyHeliosRotaryPosEmb — per-segment frame_indices
Triton RoPE kernelhelios/modules/helios_kernels/triton_rope.pyFlash RoPE(fused cos/sin)
Triton Norm kernelhelios/modules/helios_kernels/triton_norm.pyFlash LayerNorm/RMSNorm
LoRA adaptationhelios/modules/transformer_helios.pyLoRALinearLayer (down→up)
Frame-Aware Corrupthelios/utils/utils_helios_base.pycorrupt_history_latents(), add_saturation_to_history_latents()
First-Frame Anchorhelios/pipelines/pipeline_helios.pyimage_latents = latents[:,:,0:1,:,:] + is_keep_x0 flag
Pyramid UniPC Schedulerhelios/scheduler/scheduling_helios.py多阶段 timestep 分配 + buffer reset
Autoregressive inference loophelios/pipelines/pipeline_helios.pyfor k in range(num_latent_sections)
GAN discriminator headshelios/modules/transformer_helios.pyDiT layers [5,15,25,35,39], dim=768
Adversarial Distillation trainingtrain_helios.py_generator_loss(), _critic_loss()
Diffusers 集成helios/diffusers_version/pipeline_helios_diffusers.pyHuggingFace Diffusers 兼容

4. Experimental Setup (实验设置)

  • 训练数据: 0.8M 视频片段,时长 <10 秒,分辨率 384×640,109 帧/片段
  • 基础模型: Wan-2.1-T2V-14B
  • 训练硬件: 64-128 NVIDIA H100 GPU
  • 训练精度: BFloat16
  • 三阶段训练配置:
阶段StepsGPUBatchLRLoRA Rank关键参数
Stage 1 init5.5k64×H1001285e-5128
Stage 1 post7.5k64×H1001283e-5128同上
Stage 2 init16k64×H1002561e-4256-
Stage 2 post20k64×H1001923e-5256-
Stage 3 ODE3759128×H1001282e-6256EMA decay=0.99
Stage 3 post2250128×H1001282e-6/4e-7256GAN layers=[5,15,25,35,39], dim=768
  • Benchmark: HeliosBench(自建),240 prompts,4 个时长:very short (81帧), short (240帧), medium (720帧), long (1440帧)
  • 评估维度: Aesthetic(LAION)、Dynamic(Farneback)、Motion Smoothness(RAFT)、Semantic(ViCLIP)、Naturalness(OpenS2V-Eval);每个 metric 映射到 10 分制
  • 推理配置: Stage 1-2: UniPC 50步 + CFG 5.0 + -prediction; Stage 3: 3步 + CFG 1.0 + -prediction
  • Baselines: 30+ 模型,含 base models(SANA Video, CogVideoX, Mochi, Wan, HV Video, LTX Video, Kandinsky 等)和 distilled models(FastVideo, TurboDiffusion, CausVid, Self-Forcing, Rolling Forcing, LongLive, Reward Forcing, Krea 等)

5. Experimental Results (实验结果)

短视频(81 帧)

ModelParamsFPS↑Total↑AestheticDynamicSmoothnessSemanticNaturalness
Wan 2.2 14B14B0.336.1586965
HV Video 1.58.3B0.246.90810857
CausVid1.3B24.414.5087942
Self Forcing1.3B21.205.7589954
Reward Forcing1.3B22.135.5587954
Krea14B6.745.95910954
Helios-Base14B0.546.3588956
Helios-Distilled14B19.536.00871055
  • Helios-Distilled 总分 6.00,超越所有 distilled 模型
  • 19.53 FPS 实时速度,比同尺寸 FastVideo(5.37 FPS)快 3.6x
  • Semantic=10 是所有模型中最高的

长视频(120-1440 帧)

Figure 11 解读:短视频和长视频的 Throughput Score vs Quality Score 散点图。左图(长视频):Helios-Base 和 Helios-Distilled 位于右上角(高质量+高速度),明显优于所有 baselines。右图(短视频):Helios-Distilled 在速度上与 1.3B 蒸馏模型持平,质量上与 14B base 模型持平。

ModelParamsFPS↑Total*↑Naturalness↑Drift Aesthetic↑Drift Naturalness↑
CausVid1.3B24.414.56312
Self Forcing1.3B21.204.30513
Reward Forcing1.3B22.136.18576
Krea14B6.745.80171
Helios-Base14B0.546.477105
Helios-Distilled14B19.536.147107
  • Helios-Distilled 总分 7.08(含 Throughput Score 6),超越最强 baseline Reward Forcing(6.88)
  • 漂移 Naturalness 得分 7,远超 CausVid(2)和 Krea(1)
  • Drift Aesthetic 得分 10,说明视觉质量在长视频中保持稳定

消融实验关键发现

Helios-Base 消融(长视频质量):

变体TotalDrift Naturalness关键影响
Full model6.477-
w/ Causal Mask on Guidance Attn--训练不稳定,无法收敛
w/o Guidance Attention6.232漂移大幅恶化
w/o First-Frame Anchor5.512颜色漂移
w/o Frame-Aware Corrupt4.701恢复偏移严重(影响最大)

Helios-Distilled 消融

变体TotalDrift Naturalness关键影响
Full model6.347-
w/ Self-Forcing (替代 Teacher Forcing)6.116Teacher Forcing 更优
w/ Bidirectional Teacher4.755必须用自回归 teacher
w/ Staged Backward Sim*--训练不稳定
w/o Coarse-to-Fine Learning5.314Curriculum 很重要
w/o Adversarial Post-Training6.319GAN 独特改善 Semantic
w/ Decouple DMD5.214联合训练更好
w/ Reward-weighted Regression6.234DMD 蒸馏更优

关键结论

  1. Frame-Aware Corrupt 是最关键的反漂移组件(去掉后 Total 降 1.77,Drift Naturalness 降到 1)
  2. Easy Anti-Drifting + Pure Teacher Forcing 优于 Self-Forcing rollout(6.34 vs 6.11)
  3. Adversarial Post-Training 在 Semantic 上有独特贡献(去掉后 Drift Naturalness 反升到 9,但 Semantic 下降)
  4. Bidirectional teacher 不行(4.75),必须用自回归 teacher(Helios-Base)
  5. Coarse-to-Fine Learning curriculum 对蒸馏稳定性至关重要(去掉后 Total 降 1.03)

局限性

  • Helios-Distilled 在 Dynamic 和 Smoothness 上略低于部分 base 模型(distillation 的常见 trade-off)
  • 训练资源需求大(128×H100 用于 Stage 3)
  • 当前只支持 384×640 分辨率