ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Paper: arXiv:2603.25746 / HF Papers
Code: KlingAIResearch/ShotStream
Code reference: main @ 7ddcfaef (2026-03-31)

1. Motivation

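Multi-shot video generation is not about producing one longer clip. It is about generating several shots in sequence within a single narrative while letting the user add new directorial instructions partway through. Existing multi-shot methods typically feed the whole script, every shot prompt, and the reference frames into a bidirectional architecture in one pass. This gives strong global consistency, but nothing can be viewed until the full video has been generated offline. The paper notes that a bidirectional method such as HoloCine needs about 25 minutes to generate a 240-frame multi-shot video, so neither its interactivity nor its latency suits real-time storytelling.

ShotStream recasts multi-shot generation as next-shot generation: each step generates only the next shot, conditioned on the historical context and the current or newly arrived streaming prompt. The system can therefore introduce new characters, change the scene, or alter the action while the story is in progress, and still preserve character, background, and narrative continuity.

Reading of Figure 1: ShotStream autoregressively generates a narrative video of 5 consecutive shots, 405 frames in total, one shot at a time. The figure stresses that this is not extrapolation of a single long clip but consistency of characters, background, and narrative state across shots.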

2. Idea

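Core insight: replace one-pass bidirectional generation of the whole film with causal next-shot generation conditioned on history, then distill the quality of a bidirectional teacher into a few-step causal student. The key is not merely switching the attention mask to causal. Two problems specific to causal generation must be handled together: memory must persist across shots, and errors must not amplify over the autoregressive rollout.

Compared with bidirectional or quasi-bidirectional multi-shot methods such as HoloCine, CineTrans, and Mask2DiT, ShotStream does not require the full narrative to be fixed in advance. Compared with causal long-video baselines such as Self Forcing, Rolling Forcing, and Infinity-RoPE, it models shot boundaries explicitly and uses a global/local dual cache to separate cross-shot consistency from local continuity within the current shot.

Reading of Figure 2: the workflow takes the user's streaming prompts one at a time. The model outputs the current shot while pushing historical frames into the context cache, so a new prompt does not require regenerating the whole video.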

3. Method

3.1 Bidirectional next-shot teacher

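The teacher is built on Wan2.1-T2V-1.3B. Given sparse historical context frames $V_{context}$ and the noisy latent $z_{t}$ of the target shot, the context is first encoded with the same 3D VAE encoder $\mathcal{E}$:

$$z_{context} = \mathcal{E}(V_{context}),$$

then the context latent and the target latent are patchified:

$$x_{j} = \mathrm{Patchify}(z_{j}), \quad z_{j} \in \{z_{context}, z_{t}\},$$

and the results are concatenated along the time axis:

$$x_{input} = \mathrm{FrameConcat}(x_{context}, x_{t}).$$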
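As a rough illustration of what temporal concatenation does to sequence length, the sketch below counts tokens before and after FrameConcat at the 832 × 480 resolution used in these notes. The 8× spatial VAE stride, the 2 × 2 spatial patch size, and the choice of 6 context latents and 21 target latents are assumptions made for the example (the last two echo `max_context_frames` and `num_training_frames` in the released config). The helper names are hypothetical and not taken from the repository.

```python
# Sketch only: token bookkeeping for FrameConcat under assumed shapes.

def tokens_per_latent_frame(height, width, vae_stride=8, patch=2):
    """Tokens produced by patchifying one latent frame (assumed strides)."""
    latent_h, latent_w = height // vae_stride, width // vae_stride
    return (latent_h // patch) * (latent_w // patch)

def frame_concat(context_tokens, target_tokens):
    """Concatenate along the time axis: context frames first, then target."""
    return context_tokens + target_tokens

per_frame = tokens_per_latent_frame(480, 832)

# Each latent frame is represented by a (source, index) tag instead of a tensor.
x_context = [("context", i) for i in range(6)]   # assumed 6 context latents
x_target = [("target", i) for i in range(21)]    # assumed 21 target latents
x_input = frame_concat(x_context, x_target)

total_tokens = len(x_input) * per_frame
print(per_frame, len(x_input), total_tokens)
```

Under these assumptions the context adds 6 latent frames of tokens in front of the target shot, so the self-attention cost grows with the context budget rather than with the full length of the history.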

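The paper stresses that the context frames must not be conditioned only on the target shot's caption: the frames of each historical shot also attend to that shot's own local caption, plus the global caption. This lets the 3D self-attention bind each segment of visual history to the text that describes it, which is what allows the teacher to learn in-context conditioning that is usable for the next shot.

Reading of Figure 3: the teacher remains a slow, high-quality bidirectional next-shot model, and training optimizes only the 3D self-attention layers in the DiT blocks. It supplies the high-quality target distribution that the causal student later imitates.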
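The caption binding described above can be pictured as a per-frame text assignment. The following is a minimal sketch, assuming each frame carries a shot id and that "global plus local caption" is realized by simple string concatenation. The function name and the example captions are invented for illustration and are not from the paper or the repository.

```python
# Sketch only: which text each frame is paired with under multi-caption conditioning.

def captions_for_frames(frame_shot_ids, shot_captions, global_caption):
    """Return, per frame, the global caption followed by that frame's own shot caption."""
    return [f"{global_caption} {shot_captions[s]}" for s in frame_shot_ids]

global_caption = "A detective investigates a harbor at night."
shot_captions = {
    0: "Wide shot of the foggy harbor.",
    1: "Close-up of the detective's face.",
    2: "The detective opens a crate.",   # target shot
}
# Two context frames from shot 0, one from shot 1, then three target frames of shot 2.
frame_shot_ids = [0, 0, 1, 2, 2, 2]

paired = captions_for_frames(frame_shot_ids, shot_captions, global_caption)
for shot_id, text in zip(frame_shot_ids, paired):
    print(shot_id, "->", text)
```

The point of the sketch is only that context frames are not paired with the target shot's caption; how the real model routes text to frames inside attention is not shown here.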

3.2 4-step causal student + DMD

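The teacher needs about 50 denoising steps. ShotStream uses Distribution Matching Distillation (DMD) to distill it into a 4-step causal generator. The supplementary material writes DMD as an approximation of the reverse-KL gradient between the smoothed data distribution and the student distribution:

$$\nabla_{\phi} L_{DMD} \approx -\mathbb{E}_{t}\left[\left(s_{data}(\Psi(G_{\phi}(\epsilon), t), t) - s_{gen,\xi}(\Psi(G_{\phi}(\epsilon), t), t)\right)\frac{d G_{\phi}(\epsilon)}{d\phi}\right].$$

Here $G_{\phi}$ is the causal student, $\Psi(\cdot, t)$ is the forward noising of the student's sample to timestep $t$, $s_{data}$ is the teacher/data score, and $s_{gen,\xi}$ is the critic that estimates the score of the student's (fake) samples. In the code, model/dmd_frameconcat.py together with trainer/distillation_frameconcat.py and trainer/distillation_frameconcat_streaming.py implements this generator-critic update.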
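To see why the score difference in the DMD gradient pulls the student toward the data, a one-dimensional toy case helps. The sketch below is not the repository's trainer: it assumes the data is a point mass, the generator outputs a single scalar `phi`, and both are smoothed by the same Gaussian noise, so both scores are closed-form instead of being given by a teacher network and a learned critic.

```python
# Toy 1-D illustration of the DMD update direction (not the repository's trainer).
# Assumptions: the "data" is a point mass at MU_DATA, the generator outputs a
# point mass at phi, and both are smoothed with Gaussian noise of std SIGMA.

MU_DATA = 2.0
SIGMA = 1.0

def gaussian_score(x, mean, sigma=SIGMA):
    """Score d/dx log N(x; mean, sigma^2)."""
    return -(x - mean) / sigma**2

def dmd_grad(phi, noisy_sample):
    """-(s_data - s_gen) * dG/dphi, with G_phi = phi so dG/dphi = 1."""
    s_data = gaussian_score(noisy_sample, MU_DATA)
    s_gen = gaussian_score(noisy_sample, phi)
    return -(s_data - s_gen)

phi = -1.0
for _ in range(200):
    # Any perturbation works here: in this toy the score gap is independent of it.
    noisy_sample = phi + 0.3
    phi -= 0.1 * dmd_grad(phi, noisy_sample)

print(round(phi, 4))
```

In this toy the score gap reduces to `(MU_DATA - phi) / SIGMA**2`, so gradient descent moves `phi` to the data mean. In the actual method the expectation runs over timesteps and noise, and the fake score comes from a critic trained alongside the generator.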

3.3 Dual-cache memory + RoPE discontinuity

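During chunk-wise generation the causal student maintains two kinds of KV cache:

global context cache: stores sparse conditional frames sampled from historical shots, for inter-shot consistency.
local context cache: stores the chunks already generated within the current shot, for intra-shot continuity.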
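The two caches listed above can be pictured as a small container with different lifetimes: the global part is rebuilt at each shot boundary, while the local part grows chunk by chunk. The sketch below is a minimal illustration with hypothetical names (`DualCache`, `start_shot`, `append_chunk`); it stores labels rather than key/value tensors, and it truncates the history to the budget instead of reproducing the repository's sampling.

```python
# Sketch only: a dual-cache container; names are hypothetical, not the repo's classes.
from dataclasses import dataclass, field

@dataclass
class DualCache:
    max_context_frames: int = 6                          # budget for the global cache
    global_cache: list = field(default_factory=list)     # sparse frames from past shots
    local_cache: list = field(default_factory=list)      # chunks of the current shot

    def start_shot(self, sampled_history):
        """Refill the global cache from history and clear the local cache."""
        self.global_cache = list(sampled_history)[: self.max_context_frames]
        self.local_cache = []

    def append_chunk(self, chunk):
        """Record a finished chunk of the current shot."""
        self.local_cache.append(chunk)

    def attention_context(self):
        """Entries visible to the next chunk: history first, then the current shot."""
        return self.global_cache + self.local_cache

cache = DualCache()
cache.start_shot([f"shot0_frame{i}" for i in range(8)])
cache.append_chunk("shot1_chunk0")
cache.append_chunk("shot1_chunk1")
print(len(cache.global_cache), len(cache.local_cache))
```

Keeping the two lists separate is what makes it possible to treat their positions differently, which is the role of the RoPE discontinuity discussed in this subsection.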

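If the two caches shared one continuous RoPE time axis, the model would confuse "a historical shot" with "earlier content in the current shot". ShotStream therefore injects a discrete phase jump at every shot boundary:

$$\Theta_{t} = \phi t + k\theta,$$

where $t$ is the latent position within the current shot, $k$ is the shot index, and $\theta$ is the shot-boundary discontinuity. In this formula $\phi$ is the coefficient that multiplies the position, that is, it plays the role of the rotary frequency and is unrelated to the student parameters $\phi$ of Section 3.2. On the code side, pipeline/causal_inference_ar.py maintains shot_flags_for_rope and the context/current caches, while wan/modules/causal_model_change_rope.py implements the RoPE start offset and the KV-cache update in causal_rope_apply(..., start_frame=...) and in the context-cache branch.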
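A small numeric sketch of the formula above: two tokens at the same in-shot position but in different shots receive rotation angles that differ by exactly one shot-boundary jump. The values chosen for the frequency and for the jump are arbitrary assumptions for illustration, and `rope_angle` and `rotate` are hypothetical helpers, not the repository's `causal_rope_apply`.

```python
# Sketch only: temporal RoPE angle with a per-shot phase jump, Theta_t = phi * t + k * theta.
import math

def rope_angle(t, k, phi=0.1, theta=math.pi / 2):
    """Angle for latent position t inside shot k (phi and theta are assumed values)."""
    return phi * t + k * theta

def rotate(pair, angle):
    """Apply a 2-D rotation, the basic RoPE operation on one feature pair."""
    x, y = pair
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

# Same in-shot position t=3, consecutive shots: the angles differ by theta.
a0 = rope_angle(3, k=0)
a1 = rope_angle(3, k=1)
print(a1 - a0)
print(rotate((1.0, 0.0), a0), rotate((1.0, 0.0), a1))
```

Because attention scores under RoPE depend on angle differences, the extra `k * theta` term separates "same position, earlier shot" from "same position, current shot" even when `t` coincides.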

3.4 Two-stage self-forcing distillation

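Stage 1, intra-shot self-forcing: the global context is sampled from ground-truth historical shots, while the current shot is generated autoregressively, chunk by chunk, by the causal generator. The local cache holds the chunks already generated for the current shot rather than ground-truth chunks. This stage first establishes causal generation of a single next shot.

Stage 2, inter-shot self-forcing: the model generates the first shot from scratch, and every later shot is conditioned entirely on the shots the model itself generated before it. DMD is applied only to the newly generated shot. This rollout matches test time, which narrows the train-test gap and reduces error accumulation.

Reading of Figure 4: the global cache handles historical shots and the local cache handles the current shot. Stage 1 makes chunk-by-chunk generation within the current shot causal. Stage 2 extends this to the cross-shot rollout by using self-generated history.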
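The difference between the two stages comes down to where the history for the global cache is taken from. The sketch below states that difference as code. `history_for_shot` is a hypothetical helper written for these notes, and shots are represented by string labels rather than video tensors.

```python
# Sketch only: where the history comes from in each training stage (hypothetical helper).

def history_for_shot(stage, shot_idx, gt_shots, generated_shots):
    """Stage 1 conditions on ground-truth history; Stage 2 on self-generated history."""
    if stage == 1:
        return gt_shots[:shot_idx]
    if stage == 2:
        return generated_shots[:shot_idx]
    raise ValueError(f"unknown stage: {stage}")

gt_shots = ["gt0", "gt1", "gt2"]
generated_shots = ["gen0", "gen1"]      # produced so far by the student

stage1_history = history_for_shot(1, 2, gt_shots, generated_shots)
stage2_history = history_for_shot(2, 2, gt_shots, generated_shots)
print(stage1_history, stage2_history)
```

In both stages the chunks inside the current shot are self-generated. Only Stage 2 also exposes the student to its own errors in the cross-shot history, which is the condition it faces at inference time.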

3.5 Inference pseudocode aligned with the code

# Pseudocode based on pipeline/causal_inference_ar.py, pipeline/self_forcing_training.py,
# and tools/inference/causal_fewsteps.sh; helper names are illustrative.
load ShotStream causal student from ckpts/shotstream_merged.pt
previous_generated_frames = []
for shot_idx, shot_prompt in enumerate(streaming_prompts):
    # 1) Build the global context from history (all zeros for the first shot).
    if shot_idx == 0:
        condition_frames = zeros(max_context_frames)
        shot_flags_for_rope = [0] * max_context_frames
    else:
        condition_frames = dynamic_sample(previous_generated_frames,
                                          budget=max_context_frames)
        shot_flags_for_rope = shot_ids_of_sampled_context_frames

    condition_latents = VAE.encode(condition_frames)
    caption = global_caption + shot_prompt
    global_context_cache = prefill_context_kv_cache(condition_latents, caption,
                                                    rope_flags=shot_flags_for_rope)

    # 2) Generate the current shot chunk by chunk.
    noise = randn(num_training_frames=21)  # 7 chunks × 3 latent frames
    local_cache = empty_kv_cache()
    current_shot_latents = []
    current_frame_index = 0
    for chunk in split(noise, num_frame_per_block=3):
        latent_chunk = generator(
            chunk,
            kv_cache=local_cache,
            kv_cache_context=global_context_cache,
            crossattn_cache=crossattn_cache,  # text cross-attention cache for `caption`
            current_start=current_frame_index,
            shot_flags_for_rope=rope_flags_for_current_chunk,
        )
        local_cache.update_with_clean_rerun(latent_chunk)
        current_shot_latents.append(latent_chunk)
        current_frame_index += 3  # advance by num_frame_per_block

    # 3) Decode and stream the shot.
    frames = VAE.decode(current_shot_latents)
    previous_generated_frames.extend(frames)
    emit(frames)  # the current shot can be returned before the user's next prompt arrives

Code reference: main @ 7ddcfaef (2026-03-31). The pseudocode above and the mapping below are based on this commit.

| Paper concept | Source file | Key class/function |
| --- | --- | --- |
| Next-shot teacher / frame concat | `trainer/wan_frameconcat.py`, `tools/train/config/1_basemodel.yaml` | `Trainer`, config `trainer: wan_frame_concat`, `only_train3d: True` |
| Teacher ODE pair sampling | `Teacher_Ode_Sample.py`, `get_ode_csv.py` | ODE samples for causal initialization |
| Causal adaptation initialization | `trainer/ode_regression.py`, `tools/train/config/2_ode_init.yaml` | `Trainer`, `trainer: ode_regression` |
| DMD / intra-shot self-forcing | `model/dmd_frameconcat.py`, `trainer/distillation_frameconcat.py`, `tools/train/config/3_dmd.yaml` | `DMDFrameConcat`, `score_distillation_frameconcat` |
| Inter-shot self-forcing | `trainer/distillation_frameconcat_streaming.py`, `pipeline/self_forcing_training.py`, `tools/train/config/4_dmd_long.yaml` | `score_distillation_frameconcat_stream`, `generate_chunk_with_cache`, LoRA stage |
| Streaming causal inference | `Inference_Causal.py`, `pipeline/causal_inference_ar.py` | `CausalInferenceArPipeline.inference`, `generator(..., kv_cache_context=...)` |
| Dual cache / RoPE offset | `pipeline/causal_inference_ar.py`, `wan/modules/causal_model_change_rope.py` | `_initialize_context_kv_cache`, `causal_rope_apply`, cache update branches |
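The pseudocode in Section 3.5 calls a `dynamic_sample` step without spelling out its policy, and these notes do not document it either. The sketch below shows one plausible budgeted sampler, under the assumption that the budget is split evenly across previous shots and that frames are picked at evenly spaced positions inside each shot. This policy is a guess for illustration and may differ from what the repository does.

```python
# Sketch only: one possible budgeted sampler; the real dynamic_sample may differ.

def dynamic_sample(shot_lengths, budget):
    """Pick at most `budget` (shot_id, frame_idx) pairs, spread across shots."""
    if not shot_lengths or budget <= 0:
        return []
    picks = []
    base, extra = divmod(budget, len(shot_lengths))
    for shot_id, num_frames in enumerate(shot_lengths):
        # Later shots receive the leftover slots (an arbitrary choice for this sketch).
        quota = base + (1 if shot_id >= len(shot_lengths) - extra else 0)
        quota = min(quota, num_frames)
        for j in range(quota):
            # Evenly spaced frame indices inside the shot.
            frame_idx = (2 * j + 1) * num_frames // (2 * quota)
            picks.append((shot_id, frame_idx))
    return picks

# Three previous shots with 81 frames each, budget of 6 context frames.
sampled = dynamic_sample([81, 81, 81], budget=6)
shot_flags_for_rope = [shot_id for shot_id, _ in sampled]
print(sampled)
print(shot_flags_for_rope)
```

The second list mirrors the role of `shot_flags_for_rope` in the pseudocode: each sampled context frame keeps the id of the shot it came from, so the RoPE offset can be applied per shot.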

4. Experimental Setup

4.1 Training data and model

Backbone: Wan2.1-T2V-1.3B, generating $832 \times 480$ video.
Teacher data: 320K internal multi-shot videos, each with 2–5 shots and at most 250 frames; annotations include a global caption and shot-level captions.
Teacher training: only the 3D self-attention in the DiT blocks is optimized; Adam, 10,000 steps, LR $1 \times 10^{-5}$, batch size 64.
Causal initialization: initialized from the teacher weights; 5K teacher ODE solution pairs are sampled and all parameters are trained for 2,000 steps, LR $1 \times 10^{-6}$, batch size 64.

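4.2 Causal distillation configuration

The supplementary material reports about 500 steps at batch size 32 for Stage 1. In the released code, tools/train/config/3_dmd.yaml and tools/train/config/4_dmd_long.yaml provide runnable training entry points, but the default demo config uses demo/data/sample.csv and is therefore not equivalent to the paper's full internal training set.

Key code settings: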

change_rope: True, dynamic_sample_frames: True, max_context_frames: 6, multi_caption: True
num_frame_per_block: 3, num_training_frames: 21, restrict_max_length: 81
DMD denoising steps: [1000, 740, 500, 260], warp_denoising_step: True
guidance_scale: 3.0, timestep_shift: 8.0, generator LR 2e-6, critic LR 4e-7
Stage 2 released config uses LoRA: rank 256, alpha 256, dropout 0, bf16, apply_to_critic: true
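A few quantities follow directly from the settings listed above. The short calculation below derives the number of chunks per shot and the number of denoising passes per shot (ignoring any extra pass used to refresh the cache). The last line additionally assumes that `restrict_max_length` counts pixel frames and that the VAE compresses time by 4× with the first frame kept separately; that reading is an assumption, although it is consistent with Figure 1's 405 frames over 5 shots, that is, 81 frames per shot.

```python
# Sketch only: quantities implied by the listed config values.
num_training_frames = 21       # latent frames per shot
num_frame_per_block = 3        # latent frames per chunk
denoising_steps = [1000, 740, 500, 260]
restrict_max_length = 81

chunks_per_shot = num_training_frames // num_frame_per_block
denoising_passes_per_shot = chunks_per_shot * len(denoising_steps)

# Assumption: restrict_max_length counts pixel frames and the VAE maps
# pixel frames to latent frames as latent = 1 + (pixel - 1) / 4.
latent_frames_from_pixels = 1 + (restrict_max_length - 1) // 4

print(chunks_per_shot, denoising_passes_per_shot, latent_frames_from_pixels)
```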

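4.3 Evaluation protocol

The evaluation set consists of 100 diverse multi-shot prompts generated by Gemini 2.5 Pro and rewritten to match each baseline's input format. Metrics cover intra-shot consistency, inter-shot semantic/subject/background consistency, transition control, text alignment, aesthetic quality, and dynamic degree. For the user study, 24 multi-shot prompts were drawn at random, and 54 participants chose, with multiple selections allowed, the results they preferred among 8 anonymized methods on visual consistency, prompt following, and visual quality.

Baselines comprise a bidirectional group (Mask2DiT, EchoShot, CineTrans) and a causal group (Self Forcing, LongLive, Rolling Forcing, Infinity-RoPE). All FPS figures are measured on a single H200.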

5. Experimental Results

5.1 Main results

In Table 1 ShotStream reaches 15.95 FPS, far faster than the bidirectional baselines: Mask2DiT 0.149 FPS, EchoShot 0.643 FPS, CineTrans 0.413 FPS. The paper reports more than 25× the throughput of the bidirectional models, and the speed is close to that of the causal long-video baselines.

Quantitatively, ShotStream is best on most quality metrics: intra-shot subject/background 0.825/0.819, inter-shot semantic/subject/background 0.762/0.654/0.645, transition control 0.978, text alignment 0.234, and aesthetic quality 0.571. Its dynamic degree of 63.56 is below EchoShot's 65.92 but ranks second.

Reading of Figure 5: the qualitative comparison shows that ShotStream keeps characters and backgrounds more continuous than the baselines and produces more natural transitions between adjacent shots. This matches its advantage in inter-shot consistency and transition control in Table 1.
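The throughput ratios can be checked directly from the FPS values quoted above. The calculation below uses only those four numbers. It gives roughly 107× over Mask2DiT, 38.6× over CineTrans, and 24.8× over EchoShot, so the reported "more than 25×" is best read as approximate with respect to the fastest bidirectional baseline.

```python
# Throughput ratios computed from the Table 1 FPS values quoted above.
shotstream_fps = 15.95
baseline_fps = {"Mask2DiT": 0.149, "EchoShot": 0.643, "CineTrans": 0.413}

speedups = {name: shotstream_fps / fps for name, fps in baseline_fps.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```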

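5.2 User study

In the user study ShotStream receives the highest preference rates: visual consistency 87.69%, prompt following 76.15%, and visual quality 83.08%. This indicates that, beyond the automatic metrics, users also perceive its advantage in cross-shot consistency, prompt adherence, and image quality.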

5.3 Ablation

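The teacher ablation supports four design choices: dynamic sampling of historical context frames, injecting the corresponding shot captions into the condition frames, temporal rather than spatial concatenation, and training only the 3D self-attention. In the student ablation, the RoPE offset distinguishes the two caches better than either no indicator or a learnable embedding, and two-stage training beats Stage 1 alone or Stage 2 alone on inter-shot consistency and overall dynamic degree.

Reading of Figure 6: the core conclusion of the ablation is that the two caches must be distinguishable and the training rollout must match the test rollout. Otherwise the result is either semantic drift across shots or errors that accumulate step by step over the autoregressive history.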

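5.4 Limitations and reproducibility

The training set is 320K internal multi-shot videos. The public repository provides demo data, training/inference code, and a checkpoint download script, but the experimental data cannot be fully reproduced.
The paper's training details and the released config differ in scale: the supplementary material reports about 500 steps for Stage 1 and 1,000 LoRA steps for Stage 2, whereas the public config defaults to max_iters: 30000 and uses the demo CSV. Notes should keep the paper-reported schedule separate from the repo's runnable config.
Interactive generation depends on historical frame sampling and the cache budget. When the user drastically changes the characters or scene midway, the global cache provides consistency but may also retain old visual state that should no longer appear.
The roughly 16 FPS figure (15.95 in Table 1) is the value reported on a single H200. A real interactive system is also affected by VAE decode, I/O, prompt arrival, and the front-end streaming playback stack.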
