LongLive: Real-time Interactive Long Video Generation

1. Motivation (研究动机)

长视频生成的核心挑战是质量、效率和交互性三者兼得：

Diffusion/Diffusion-Forcing 模型：质量好，但双向注意力无法使用 KV cache，推理效率极低。例如 SkyReels-V2 在 H100 上生成 60 秒视频需约 50 分钟
Causal AR 模型：可以用 KV cache 加速，但普遍采用 train-short-test-long 策略——训练在 5 秒短片上，推理时误差累积导致长视频质量逐渐退化（content drift）
交互式生成：用户需要在生成过程中切换 prompt，但现有模型在 prompt 切换时要么视觉不连续（清空 KV cache），要么忽略新 prompt（保留旧 KV cache）
MAGI-1 虽然支持交互，但 prompt 切换需要手动调整不同步骤的 KV-cache 窗口，不实用

核心矛盾：train-short 导致 test-long 质量退化，KV cache 中的旧 prompt 语义导致交互切换失败。

2. Idea (核心思想)

LongLive 的核心 insight：通过 KV-recache 刷新缓存语义 + streaming long tuning 对齐训练推理 + frame sink 保持长程一致性，可以用 1.3B 小模型实现实时交互式长视频生成。

三个关键设计：

KV-recache：prompt 切换时，用已生成视频 + 新 prompt 重新计算 KV cache，擦除旧 prompt 残留语义，同时保持视觉连续性
Streaming Long Tuning：train-long-test-long 策略，训练时就让模型在自身不完美输出上滚动生成长序列，消除 train-test mismatch
Short Window Attention + Frame Sink：限制注意力窗口到近邻帧（效率），同时保留第一帧 chunk 的 KV 作为全局锚点（一致性）

最终：1.3B 模型，单卡 H100 上 20.7 FPS，支持 240 秒视频，32 GPU-days 完成微调。

3. Method (方法)

3.1 整体架构

Figure 2 解读：左侧展示了 LongLive 的 Short Window Attention + Frame Sink 机制——attention mask 呈阶梯状，每个 query 只关注最近 $W$ 帧（蓝色区域）和最底部的 frame sink（绿色区域）。右侧展示了 KV-recache 机制——prompt 切换时，用已生成视频 + 新 prompt 重新计算 KV cache 中的 token（橙色 → 绿色）。

LongLive 是基于 Wan2.1-T2V-1.3B 的 frame-level causal AR video diffusion model。核心架构改动：

将双向注意力改为 causal attention，继承 KV cache 机制
通过 Self-Forcing + DMD 蒸馏初始化为 few-step 生成器
引入 KV-recache、Streaming Long Tuning、Short Window Attention + Frame Sink

训练流程分两阶段：

Stage 1 (Init)：Self-Forcing + DMD 蒸馏，将双向模型转为 causal AR few-step 模型，同时启用 short window attention 和 frame sink
Stage 2 (Long Tuning)：Streaming Long Tuning，在 60 秒长序列上用 KV-recache 训练交互能力

3.2 KV-Recache

Figure 3 解读：对比三种 prompt 切换策略。(a) 清空 KV cache：Prompt 2 立即生效（人物开始新动作），但画面出现突变，背景和人物外观不连续。(b) 保留 KV cache：画面连续流畅，但 Prompt 2 被忽略——人物仍然按 Prompt 1 的指令行动。(c) KV-recache（本文方法）：画面连续 + Prompt 2 立即生效，人物在保持外观一致的同时执行了新动作。

问题诊断：在 DiT 架构中，cross-attention 层不断将 prompt 语义注入 KV cache，self-attention 层将其传播。prompt 切换后，旧 prompt 的残留语义仍留在 cache 中，导致模型无法跟随新 prompt。

清空 KV cache（图 3a）：新 prompt 立即生效，但视觉不连续
保留 KV cache（图 3b）：视觉连续，但新 prompt 被忽略或延迟响应
KV-recache（图 3c）：视觉连续 + 新 prompt 立即生效

机制：在 prompt 切换边界，用已生成的视频帧 + 新 prompt 重新前向传播一次，重建 KV cache。旧 prompt 语义被擦除，视觉上下文被保留。

def kv_recache(model, kv_cache, generated_frames, new_prompt, window_size):
    """Refresh KV cache at prompt switch boundary."""
    # Take the most recent W generated frames as visual context
    visual_context = generated_frames[-window_size:]
 
    # Encode new prompt
    new_prompt_emb = text_encoder(new_prompt)
 
    # Recompute KV cache: visual context + new prompt semantics
    # This erases old prompt residual while keeping visual continuity
    kv_cache_new = model.recompute_kv(
        video_frames=visual_context,
        prompt_embedding=new_prompt_emb,
        kv_cache=kv_cache,  # rewrite in-place
    )
    # Subsequent frames generated normally with refreshed cache
    return kv_cache_new

源码参考：wan/modules/causal_model.py → _apply_cache_updates() 实现 roll_and_insert（滚动插入）和 direct_insert（直接插入），在 prompt 切换时触发 recompute。Recache 仅在切换边界调用一次，额外开销约 6%。

训练对齐：训练时也集成 recache 操作——每个训练样本包含一次 prompt 切换，切换时执行 recache + 用新 prompt 给 teacher 提供监督，确保 train-inference 一致。

3.3 Streaming Long Tuning

Figure 4 解读：对比三种训练策略。(a) Short Tuning：每次只训练独立的 5s clip，Teacher 能可靠监督，但模型从未见过长序列。(b) Naive Long Tuning：直接在 60s 长序列上训练，但 Teacher 对长序列不可靠（它也只见过 5s），且反向传播整个序列会 OOM。(c) Streaming Long Tuning（本文方法）：每次迭代从 KV cache 中取出历史上下文（detach，不回传梯度），生成下一个 5s clip，Teacher 只监督当前 clip（可靠）。这样训练 schedule 完全镜像推理流程。

问题：现有 AR 模型在 5 秒短片上训练（short tuning），推理时 rolling context window 中误差累积，长视频质量逐渐退化。直接在长序列上训练（naive long tuning）会 OOM 且 teacher 对长序列不可靠。

解决方案：Streaming Long Tuning——迭代式生成+监督：

def streaming_long_tuning(generator, teacher, prompt_pair, l_video=60, l_clip=5):
    """Train on long sequences with streaming KV cache and per-clip DMD supervision."""
    kv_cache = initialize_empty()
    current_length = 0
 
    while current_length < l_video:
        # Select active prompt (switch at random point)
        if current_length < switch_time:
            p_active = prompt_pair[0]
        else:
            p_active = prompt_pair[1]
 
        # At switch boundary: perform KV recache
        if current_length == switch_time:
            kv_cache = recache(generator, kv_cache, generated_frames, p_active)
 
        # Generate next 5s clip, conditioned on KV cache (detached, no grad)
        with torch.no_grad():
            clip = generator.rollout(kv_cache, p_active, num_frames=l_clip)
 
        # DMD supervision: teacher supervises only current 5s clip (reliable)
        loss = dmd_loss(generator, teacher, clip, p_active, kv_cache)
        loss.backward()  # gradients only for current clip
        optimizer.step()
 
        # Extend KV cache with generated clip (detached as constant context)
        kv_cache = extend_cache(kv_cache, clip.detach())
        current_length += l_clip
 
        # Reset when reaching max length
        if current_length >= l_video:
            kv_cache = initialize_empty()
            current_length = 0

关键设计：

每次迭代只生成一个 5 秒 clip，teacher 对 5 秒片段是可靠的
之前生成的 clip detach 后作为常量上下文，梯度只流过当前 clip → 避免 OOM
训练 schedule 完全镜像推理流程，消除 train-test mismatch

源码参考：pipeline/streaming_training.py → StreamingTrainingPipeline，KV cache 通过 _initialize_kv_cache() 预分配，通过 clear_kv_cache() 重置。pipeline/self_forcing_training.py 用于 Stage 1 初始化。

3.4 Short Window Attention + Frame Sink

Figure 5 解读：每列展示同一个 prompt 在 20 秒内的生成效果。第一行（Window 21，全窗口）：质量好但计算量大。第二行（Window 12，缩小窗口）：效率提升但人物外观开始漂移（发型、衣服颜色变化）。第三行（Window 9 + Sink 3，本文方法）：窗口更小但加入 frame sink 后，人物一致性恢复到接近全窗口水平。

Short Window Attention：将 causal attention 限制到最近 $W$ 帧（而非全序列），注意力复杂度从 $O (L^{2})$ 降到 $O (W \cdot L)$ ，KV cache 大小与窗口成正比而非与总长度成正比。

问题：缩小窗口提升效率但损失长程一致性（远处的关键视觉线索被丢弃）。

Frame Sink：保留第一帧 chunk（ $S$ 个 latent frame）的 KV token，永远不被驱逐出 cache。这些 sink token 拼接到每层的 KV cache 中，即使在短窗口下也全局可见。

def short_window_attention_with_frame_sink(q, k, v, kv_cache,
                                           local_window=9, sink_size=3,
                                           frame_seqlen=1560):
    """Attention with local window + persistent frame sink tokens."""
    sink_tokens = sink_size * frame_seqlen  # first 3 frames always visible
 
    # Attention mask: allow current frame to attend to:
    # 1. Sink tokens (always visible, frames 0..S-1)
    # 2. Local window (most recent W frames)
    def mask_fn(q_idx, kv_idx):
        is_sink = kv_idx < sink_tokens
        is_local = (kv_idx >= ends[q_idx] - local_window * frame_seqlen) & (kv_idx < ends[q_idx])
        return is_sink | is_local
 
    # KV cache eviction: roll out old tokens but protect sink region
    if cache_full:
        num_evicted = num_new_tokens + local_end - cache_size
        num_rolled = local_end - num_evicted - sink_tokens
        # Roll only the non-sink, non-recent portion
        cache["k"][sink_tokens:sink_tokens+num_rolled] = cache["k"][num_evicted+sink_tokens:local_end]
 
    output = flex_attention(q, k, v, block_mask=mask_fn)
    return output

实际配置：Window 9 + Sink 3 = 有效窗口 12 帧，效果接近 Window 21（全注意力），但计算减少 28%，显存减少 17%。

源码参考：wan/modules/causal_model.py → attention_mask() 定义 flex_attention mask，_apply_cache_updates() 中 roll_and_insert 保护 sink region。

3.5 代码-论文映射

论文概念	源码文件	关键类/函数
KV-recache 机制	`wan/modules/causal_model.py`	`_apply_cache_updates()`, `roll_and_insert`, `direct_insert`
Frame Sink	`wan/modules/causal_model.py`	`sink_size`, sink token 保护逻辑
Short Window Attention	`wan/modules/causal_model.py`	`attention_mask()` + `flex_attention`
Causal AR forward	`wan/modules/causal_model.py`	`_forward_inference()`, `patch_embedding`
Streaming Long Tuning	`pipeline/streaming_training.py`	`StreamingTrainingPipeline`, `_initialize_kv_cache()`
Self-Forcing + DMD init	`pipeline/self_forcing_training.py`	Self-Forcing 训练 pipeline
DMD distillation	`model/dmd.py`	DMD 蒸馏逻辑
Interactive inference	`pipeline/interactive_causal_inference.py`	多 prompt 交互推理
Training entry	`train.py`	训练入口
Inference entry	`inference.py`, `interactive_inference.py`	推理入口

4. Experimental Setup (实验设置)

基础模型: Wan2.1-T2V-1.3B（16 FPS，832×480，5s clips）
训练硬件: 64 H100 GPUs
训练时间: ~12 小时（32 GPU-days）
训练数据: 不需要额外视频数据，采用 self-supervised 方法。用 Qwen2-72B-Instruct 从 VidProM 数据集生成 follow-up prompt pairs
训练配置:
- Stage 1 (Init): Self-Forcing + DMD 蒸馏，启用 short window attention (window=9) + frame sink (sink=3)
- Stage 2 (Long Tuning): 60s 序列，每 batch 包含一次 prompt 切换（切换时间 5s~55s 均匀采样），3000 iterations
- Optimizer: AdamW，lr_actor = $1.0 \times 1 0^{- 5}$ ，lr_critic = $2.0 \times 1 0^{- 6}$
- LoRA rank 256（约 350M 可训练参数，占 27%）
- EMA decay 0.99，batch size 64（1 sample/GPU × 64 GPUs）
Benchmark:
- 短视频: VBench（5s clips）
- 长视频: VBench-Long（30s single-prompt）
- 交互式: 自建 160 个 60s 交互视频数据集（6 个 10s segments，含 prompt 切换）
评估维度: Total Score, Quality Score, Semantic Score（VBench）; CLIP Score（交互式语义一致性）; Background/Subject Consistency（VBench-Long）
Baselines: LTX-Video (1.9B), Wan2.1 (1.3B), SkyReels-V2 (1.3B), MAGI-1 (4.5B), CausVid (1.3B), NOVA (0.6B), Pyramid Flow (2B), Self-Forcing chunk/frame-wise (1.3B), FramePack

5. Experimental Results (实验结果)

短视频（5s, VBench）

Model	Params	Resolution	FPS↑	Total↑	Quality↑	Semantic↑
LTX-Video	1.9B	768×512	8.98	80.00	82.30	70.79
Wan2.1	1.3B	832×480	0.78	84.26	85.30	80.09
SkyReels-V2	1.3B	960×540	0.49	82.67	84.70	74.53
MAGI-1	4.5B	832×480	0.19	79.18	82.04	67.74
CausVid	1.3B	832×480	17.0	81.20	84.05	69.80
Self-Forcing (chunk)	1.3B	832×480	17.0	84.31	85.07	81.28
Self-Forcing (frame)	1.3B	832×480	8.9	84.26	85.25	80.30
LongLive	1.3B	832×480	20.7	84.87	86.97	76.47

LongLive 以 20.7 FPS 达到最快速度，同时 Total Score 84.87 匹配最强 baselines
Quality Score 86.97 是所有模型中最高的

长视频（30s, VBench-Long）

Model	Total↑	Quality↑	Semantic↑	FPS↑
SkyReels-V2	75.29	80.77	53.37	0.49
FramePack	81.95	83.61	75.32	0.92
Self-Forcing (chunk)	81.59	83.82	72.70	17.0
LongLive	83.52	85.44	75.82	20.7

LongLive 在所有指标上都是 SOTA，且速度最快

交互式长视频（60s, 多 prompt 切换）

Method	Quality Score↑	CLIP 0-10s↑	CLIP 10-20s↑	CLIP 20-30s↑	CLIP 30-40s↑	CLIP 40-50s↑	CLIP 50-60s↑
SkyReels-V2	80.49	20.96	22.51	25.78	18.45	19.57	19.61
Self-Forcing	82.46	28.46	24.89	23.53	22.96	23.07	23.19
LongLive	84.38	28.85	25.68	24.64	24.23	24.32	24.32

LongLive 在长时段（30-60s）的 CLIP score 显著优于 baselines，说明 KV-recache 有效保持了 prompt 遵循
用户研究（26 人，1248 judgments）：LongLive 在 Overall (69.2%)、Motion Quality (67.8%)、Instruction Following (69.8%)、Visual Quality (63.2%) 上全面领先

消融实验

KV-recache 消融（Table 4）：

Method	Background Consistency↑	Subject Consistency↑	CLIP Score↑
No KV cache	92.75	89.59	28.95
KV cache (保留)	94.77	93.69	25.92
KV recache	94.81	94.04	27.87

KV recache 同时实现最好的一致性和较高的语义对齐

Streaming Long Tuning 消融（隐含在 Table 1 中）：

Self-Forcing (chunk/frame) 代表 short tuning baseline（仅训练 5s clip）
LongLive 在 Total Score 上从 84.31（Self-Forcing chunk, short tuning）提升到 84.87（streaming long tuning），同时 FPS 从 17.0 提升到 20.7
论文 Section 3.2 指出：streaming long tuning 不仅提升长视频质量，也是启用高效推理策略（window attention + frame sink）的前提条件

Window Size + Frame Sink 消融（Figure 7）：

Figure 7 解读：左图展示 Consistency Score 随窗口大小的变化——蓝色虚线（无 frame sink）在小窗口时急剧下降，红色实线（有 frame sink, Window 9 + Sink 3）在 Window=9 时就达到 94.1，接近 Window=24 的饱和值 94.2。右图展示计算时间和显存随窗口大小线性增长，Window 9 + Sink 3 在效率和质量之间取得最佳平衡。

Consistency 在 Window=24 时饱和（~94.2）
Window 9 + Sink 3 达到 94.2，接近 Window 21（94.0），但计算时间减少 28%，显存减少 17%
无 Frame Sink 时，Window 9 的 Consistency 仅 90.6

局限性（Appendix M）

性能上限受基础模型（Wan2.1-1.3B）约束，短片段质量不会超过 base model
采用 self-supervised 微调，不使用额外真实视频数据，无法纠正 base model 的系统性错误
增益主要在长程适应和稳定化，而非绝对质量天花板

Paper Notes

探索