Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Authors: Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, Shijian Lu Affiliations: Nanyang Technological University, ARC Lab Tencent PCG Year: 2026

1. Motivation (研究动机)

现代视频扩散模型（如 Sora、Wan、Movie-Gen）在短视频生成上表现出色，但交互式应用（世界模型、神经游戏引擎、XR 环境）需要实时流式生成长视频——逐帧输出、低延迟、长时间保持视觉一致性。

现有自回归视频扩散方法面临的核心矛盾如下：

方法范式	一致性	流式能力	长视频能力	代表工作
History Corruption	受损	支持	支持	SkyReels-V2, Diffusion Forcing
Planning Generation	支持	不支持	支持	Zhang & Agrawala 2025
Self Forcing	支持	支持	严重退化	Self Forcing (Huang et al.)
Rolling Forcing (本文)	支持	支持	支持	本文

Figure 2 解读：四种自回归视频生成范式对比。(a) History Corruption 通过向历史帧注入噪声来减少对历史的过度依赖，但会破坏时序一致性；(b) Planning Generation 先生成关键帧再插值，打破了严格的顺序输出要求，不适合实时流式；(c) Self Forcing 逐帧因果生成，能保持一致性和流式，但随着生成长度增加，误差严重累积；(d) Rolling Forcing 通过滚动窗口联合去噪，同时满足一致性、流式性和长视频生成三个需求。

Self Forcing（CausVid 的后续改进）将预训练的双向视频扩散模型蒸馏为快速因果自回归生成器。其核心是 Distribution Matching Distillation (DMD) 损失：

\nabla_{θ} L_{DMD} \approx - E_{t} [\int (s_{data} (Ψ (G_{θ} (ϵ), t), t) - s_{gen} (Ψ (G_{θ} (ϵ), t), t)) \frac{d G _{θ} ( ϵ )}{d θ} d ϵ]

其中 $s_{data}$ 和 $s_{gen}$ 分别是在真实数据和生成数据上训练的 score function。Self Forcing 通过在训练时以自生成历史为条件来缓解 exposure bias，但当生成长度超出训练窗口时，误差仍然严重累积。

Figure 1 (Teaser)

Figure 1 解读：Rolling Forcing 在单 GPU 上以约 16 fps 实时流式生成多分钟视频。三个示例 prompt 的生成结果展示了从 0 到 2 分钟的连续输出，视觉质量和时序一致性均保持良好。

2. Idea (核心思想)

Rolling Forcing 的核心可以概括为三点：滚动扩散窗口（Rolling Diffusion Window）、时序与全局历史上下文（Temporal & Global Context）、以及高效训练算法。

Figure 3 解读：Rolling Forcing 去噪过程示意图（T=4）。每一行代表一个去噪步骤，每一列代表一个视频帧。红色框表示需要计算梯度的去噪窗口，绿色框表示不计算梯度的窗口。关键观察：(1) 每个去噪窗口包含连续多帧，且噪声水平递增；(2) 每次前向传播后输出一帧干净帧（最左侧帧），窗口向右滚动一帧；(3) 最近帧的 KV cache（橙色虚线圆）作为时序上下文，初始帧的 KV cache（蓝色虚线圆）作为全局上下文锚点。

从直觉上看，Rolling Forcing 不是让每一帧孤立地完成全部去噪，而是让一个包含多帧的窗口一起参与联合修正：窗口内帧之间通过双向注意力交换信息，左侧帧先完成去噪并输出，随后窗口右移，新的高噪声帧补入。这样做的直接目标，是在保持因果流式输出的同时，减少逐帧生成带来的误差累积。

3. Method (方法)

3.1 Rolling Diffusion Window

核心思想：将 Self Forcing 的单帧去噪扩展为多帧联合去噪窗口。

在 Self Forcing 中，第 $i$ 帧的去噪过程为：

x_{t_{j - 1}}^{i} = Ψ (G_{θ} (x_{t_{j}}^{i}, t_{j}, x_{0}^{< i}), t_{j - 1})

每帧独立经历完整的去噪步骤，严格的因果关系导致误差逐帧传播。

Rolling Forcing 将单帧扩展为包含 $T$ 帧的滚动窗口，窗口内各帧被赋予递增的噪声水平：

p_{θ} (x_{t_{0 : T - 1}}^{i : i + T - 1} ∣ x_{t_{1 : T}}^{i : i + T - 1}, x_{0}^{< i}) = Ψ (G_{θ} (x_{t_{1 : T}}^{i : i + T - 1}, t_{1 : T}, x_{0}^{< i}), t_{0 : T - 1})

关键设计要点：

窗口长度 $L_{win} = T$ （等于去噪步数）
窗口内帧通过双向注意力连接，允许相互修正局部误差
每次前向传播后，最左侧帧完成去噪并输出，窗口右滚一帧，新帧从纯高斯噪声开始
通过 DMD 蒸馏将去噪步数 $T$ 压缩到仅 5 步，使窗口足够小以在单 GPU 上实时运行

噪声时间表： $[t_{5}, t_{4}, t_{3}, t_{2}, t_{1}] = [1000, 800, 600, 400, 200]$

3.2 Temporal & Global Context（Attention Sink）

随着生成进行，干净历史帧 $x_{0}^{< i}$ 不断增长，直接处理计算开销过大。Rolling Forcing 采用 KV cache 机制：

时序上下文（Temporal Context）：缓存最近 $L_{tem}$ 帧的 KV 状态，保持短期时序一致性。

全局上下文（Global Context / Attention Sink）：缓存初始 $L_{glo}$ 帧的 KV 状态作为全局锚点，类似于 LLM 中的 attention sink 机制。

总注意力窗口： $L_{tem} + L_{glo} + L_{win} = L_{bidirectional}$ （与双向 teacher 模型匹配）

Dynamic RoPE 调整：直接缓存初始帧会导致位置编码溢出（因为随着 $i$ 增大，初始帧与当前帧的相对距离超出训练范围）。解决方案：

缓存全局上下文帧的 key states 时不应用 RoPE
推理时动态应用 RoPE，将全局上下文帧的位置索引设为 $i - L_{tem} - L_{glo} : i - L_{tem} - 1$
即视全局上下文帧为紧接在时序上下文之前，保持相对位置不变

实现参数： $L_{tem} = 3, L_{glo} = 3, L_{win} = 15$ latent frames

3.3 Efficient Training Algorithm

问题：训练时需要对每个窗口做反向传播来计算 DMD 损失，但总共有 $N$ 个窗口（ $N$ 为视频帧数），全部反向传播内存不可承受。

解决方案：从每 $T$ 个连续窗口中随机选一个进行梯度计算，其余窗口仅做前向传播。这样梯度计算量从 $N$ 降至 $⌈ N / T ⌉$ 。

Mixed Training Strategy：以等概率交替使用两种训练目标：

Self Forcing 训练（Eq. 3）: 每帧从同一噪声水平 $t_{j}$ 去噪，作为正则化项鼓励自然的相机运动
Rolling Forcing 训练（Eq. 5）: 每帧从不同噪声水平去噪，但仅在非重叠窗口上计算梯度

3.4 Python 伪代码

import torch
 
def rolling_forcing_inference(
    generator,       # G_θ: AR diffusion model (returns KV embeddings)
    noise_schedule,  # [t_1, t_2, ..., t_T], e.g., [200, 400, 600, 800, 1000]
    num_frames,      # N: total frames to generate
    forward_diffusion,  # Ψ: adds noise to clean frames
):
    T = len(noise_schedule)  # denoising steps = window length
    t_list = [0] + noise_schedule  # t_0=0, t_1, ..., t_T
 
    # === Phase 1: Initialize first window ===
    # Generate initial T frames via full denoising from pure noise
    x_noisy = torch.randn(T, C, H, W)  # [T frames, all at t_T noise level]
    kv_cache = initialize_kv_cache()
    output_frames = []
 
    # Initialize: denoise first window with generator
    x_clean = generator(x_noisy, t_list[1:T+1], kv_cache)
 
    # === Phase 2: Rolling generation ===
    for i in range(num_frames):
        # Step 1: Sample fresh noise for new frame
        new_noise = torch.randn(1, C, H, W)  # pure Gaussian at t_T
 
        # Step 2: Construct denoising window
        # Window contains T frames with noise levels [t_1, t_2, ..., t_T]
        # - Frames 0..T-2: carried from previous step (noise re-injected to next level)
        # - Frame T-1: fresh pure noise
        window_noisy = torch.cat([
            forward_diffusion(x_clean[1:], t_list[1:T]),  # re-noise T-1 previous frames
            new_noise                                       # append new frame at t_T
        ], dim=0)  # shape: [T, C, H, W]
 
        # Step 3: Joint denoising with bidirectional attention
        x_clean = generator(window_noisy, t_list[1:T+1], kv_cache)
        # x_clean[0] is at noise level t_0=0 (fully denoised)
 
        # Step 4: Emit clean frame and update KV cache
        output_frames.append(x_clean[0])
 
        # Update temporal context KV cache (recent L_tem frames)
        kv_cache.update_temporal(generator.get_kv(x_clean[0], t_0=0))
 
        # Keep global context KV cache (initial L_glo frames, with dynamic RoPE)
        # Global context keys stored WITHOUT RoPE, applied dynamically at attention time
 
    return torch.stack(output_frames)

def rolling_forcing_training_step(
    generator,           # G_θ: student model
    score_data,          # s_data: score function on real data
    score_gen,           # s_gen: score function on generated data
    forward_diffusion,   # Ψ
    video_length,        # N: number of frames in training video
    T,                   # number of denoising steps (= window length)
):
    # Step 1: Randomly choose training mode (50/50 mix)
    use_rolling_forcing = (torch.rand(1) > 0.5)
 
    # Step 2: Random exit point for gradient (which window gets gradient)
    j = torch.randint(0, T, (1,))  # random noise level index
 
    # Step 3: Autoregressively generate full video
    kv_cache = initialize_kv_cache()
    predicted_clean_video = []
 
    for i in range(video_length):
        if use_rolling_forcing:
            # Rolling Forcing: construct window with progressive noise levels
            window_noisy = build_rolling_window(prev_clean, t_list[1:T+1])
 
            # Gradient only for windows where i ≡ j (mod T)
            if i % T == j:
                x_clean = generator(window_noisy, t_list[1:T+1], kv_cache)  # with grad
            else:
                with torch.no_grad():
                    x_clean = generator(window_noisy, t_list[1:T+1], kv_cache)
 
            # Select frame at position j within window as prediction for frame i
            predicted_clean_video.append(x_clean[0])
        else:
            # Self Forcing: standard single-frame denoising
            noise_level = t_list[j + 1]  # same noise level for all frames
            x_noisy_i = forward_diffusion(generator_output_i, noise_level)
            x_clean_i = generator(x_noisy_i, noise_level, kv_cache)
            predicted_clean_video.append(x_clean_i)
 
        # Update KV cache with self-generated history (NOT ground truth)
        kv_cache.update(x_clean_i)
 
    # Step 4: Compute DMD loss on predicted video
    predicted_video = torch.stack(predicted_clean_video)
 
    # DMD: match distribution of generated video to real data distribution
    loss = compute_dmd_loss(predicted_video, score_data, score_gen, forward_diffusion)
 
    loss.backward()
    return loss

3.5 Dynamic RoPE for Attention Sink

def apply_dynamic_rope_for_global_context(
    global_kv_cache,    # KV states of initial L_glo frames (stored WITHOUT RoPE)
    temporal_kv_cache,  # KV states of recent L_tem frames (stored WITH RoPE)
    current_window,     # current denoising window frames
    current_frame_idx,  # i: index of current denoising window start
    L_tem, L_glo,       # temporal and global context sizes
    rope_freqs,         # RoPE frequency table
):
    # Global context frames: assign positions immediately before temporal context
    # Positions: [i - L_tem - L_glo, ..., i - L_tem - 1]
    global_positions = range(current_frame_idx - L_tem - L_glo,
                            current_frame_idx - L_tem)
 
    # Apply RoPE to global context keys at query time (not at cache time)
    global_keys_roped = apply_rope(global_kv_cache.keys, global_positions, rope_freqs)
 
    # Temporal context already has correct RoPE from when it was cached
    # Current window frames get RoPE at positions [i, i+1, ..., i+T-1]
 
    # Concatenate for attention: [global | temporal | current_window]
    all_keys = torch.cat([global_keys_roped, temporal_kv_cache.keys, current_window_keys])
    all_values = torch.cat([global_kv_cache.values, temporal_kv_cache.values, current_window_values])
 
    return all_keys, all_values

3.6 代码仓库映射

GitHub 仓库: TencentARC/RollingForcing

核心文件结构

论文概念	代码文件	说明
Rolling Forcing 推理 (Algorithm 2)	`pipeline/rolling_forcing_inference.py`	`CausalInferencePipeline.inference_rolling_forcing()` — 滚动窗口推理主循环
Rolling Forcing 训练 (Algorithm 1)	`pipeline/rolling_forcing_training.py`	`RollingForcingTrainingPipeline` — 滚动窗口训练，含梯度采样和混合训练
Attention Sink + Dynamic RoPE (Sec 3.3)	`wan/modules/causal_model.py`	`CausalWanSelfAttention` — KV cache 管理、sink token 保留、`causal_rope_apply()` 动态 RoPE
DMD Loss (Eq. 1)	`model/dmd.py` + `utils/loss.py`	DMD 损失计算，score function 差值
训练入口	`train.py` → `trainer/distillation.py`	`ScoreDistillationTrainer` — 支持 DMD/SiD/GAN 多种蒸馏模式
推理入口	`inference.py`	加载 checkpoint，调用 `inference_rolling_forcing()`
基础模型封装	`utils/wan_wrapper.py` + `wan/modules/model.py`	Wan2.1 DiT 模型封装
因果注意力模型	`wan/modules/causal_model.py`	因果 attention mask、KV cache 逻辑
VAE 编解码	`wan/modules/vae.py`	视频 latent 编解码
配置文件	`configs/rolling_forcing_dmd.yaml`	超参数配置

关键代码片段对应

滚动窗口构建 (pipeline/rolling_forcing_inference.py):

window_num = num_blocks + rolling_window_length_blocks - 1
for window_index in range(window_num):
    start_block = max(0, window_index - rolling_window_length_blocks + 1)
    end_block = min(num_blocks - 1, window_index)

梯度采样 (pipeline/rolling_forcing_training.py):

exit_flag = torch.randint(high=rolling_window_length_blocks, ...)
require_grad = window_index % rolling_window_length_blocks == exit_flag

Attention Sink 实现 (wan/modules/causal_model.py):

sink_tokens = 1 * self.block_length  # 保留第一个 block
# 溢出时驱逐旧 token 但保留 sink tokens
kv_cache["k"][:, sink_tokens:sink_tokens + num_rolled_tokens] = \
    kv_cache["k"][:, sink_tokens + num_evicted_tokens:...].clone()

Dynamic RoPE (wan/modules/causal_model.py):

# 全局上下文: 缓存时不应用 RoPE
if local_start_index == 0:
    kv_cache["k"][:, :block_length] = k[..., :block_length]  # unroped
# 注意力计算时动态应用 RoPE
input_key = torch.cat([
    causal_rope_apply(kv_cache["k"][:, :block_length], ..., start_frame=adjusted_pos),
    working_cache_key,
    roped_key
], dim=1)

4. Experimental Setup (实验设置)

4.1 训练配置

基础模型：Wan2.1-T2V-1.3B
每 chunk：3 latent frames
去噪步数 / 窗口长度： $T = 5$
训练窗口：27 latent frames
batch size：8
训练步数：3,000
优化器：AdamW
学习率：generator lr = $1.5 \times 1 0^{- 6}$ ，fake score lr = $4.0 \times 1 0^{- 7}$

4.2 评测配置

对比模型：SkyReels-V2、MAGI-1、CausVid、Self Forcing
定量指标：Throughput (FPS)、Latency (s)、Temporal、Subject、Background、Motion、Aesthetic、Imaging、 $Δ_{Drift}^{Quality}$
定性场景：跑步者场景、乌龟赛车场景、珊瑚礁水下场景、交互式 prompt 切换
推理设定：单 GPU 实时流式生成，关注长时间一致性与质量漂移

5. Experimental Results (实验结果)

5.1 定量比较

Model	Params	Throughput (FPS)	Latency (s)	Temporal	Subject	Background	Motion	Aesthetic	Imaging	$Δ_{Drift}^{Quality} ↓$
SkyReels-V2	1.3B	0.49	112	97.43	89.23	93.45	98.76	61.55	62.90	5.59
MAGI-1	4.5B	0.19	282	98.21	90.86	93.25	99.20	59.91	59.87	2.15
CausVid	1.3B	15.38	0.78	96.84	87.99	89.99	98.09	60.95	66.38	2.18
Self Forcing	1.3B	15.38	0.78	97.49	86.48	90.29	98.47	60.54	68.68	1.66
Rolling Forcing	1.3B	15.79	0.76	97.61	92.80	93.71	98.70	62.39	70.75	0.01

关键发现：Rolling Forcing 在几乎所有质量指标上都取得最优，特别是 $Δ_{Drift}^{Quality}$ 仅为 0.01（Self Forcing 为 1.66），说明其在抑制长视频误差累积方面具有明显优势。

Figure 4 解读：定性对比。上半部分展示 120 秒生成结果（跑步者场景），Rolling Forcing 在 2 分钟内保持高保真度，而 Self Forcing 在 60s 后出现严重退化，CausVid 和 SkyReels-V2 更早崩溃。下半部分展示另一个场景（乌龟赛车），Rolling Forcing 的色彩、细节、运动一致性显著优于其他方法。

5.2 消融实验

Model	Temp.	Subj.	Back.	Mot.	Aes.	Img.	$Δ_{Drift}^{Quality} ↓$
w/o RF inference	95.45	86.01	89.94	97.36	57.59	65.19	5.53
w/o RF training	95.91	87.50	90.86	98.05	60.41	69.24	0.89
w/o SF training	90.83	83.27	88.14	95.63	55.30	62.00	1.62
w/o attention sink	97.53	83.22	87.99	98.56	58.99	67.30	4.63
Ours full	97.61	92.80	93.71	98.70	62.39	70.75	0.01

Figure 5 解读：消融实验可视化（珊瑚礁水下场景，30 秒）。(1) 去除 RF 推理（仅用逐帧去噪）导致 30s 内即出现明显退化；(2) 去除 RF 训练（用逐帧范式训练）质量漂移虽有改善但仍存在；(3) 去除 SF 训练（移除混合训练中的正则化项）导致不自然的相机运动和一致性下降；(4) 去除 attention sink 导致色彩偏移和全局一致性丧失。

5.3 交互式视频流

Figure 6 解读：Rolling Forcing 支持交互式视频流——用户可在生成过程中实时更换 prompt。实现方式很简洁：丢弃 cross-attention cache 中的旧 prompt，换入新 prompt 的 embeddings 即可。

5.4 小结与局限

Rolling Forcing 通过滚动窗口联合去噪，在长视频场景下有效抑制误差累积。
Attention sink 与 Dynamic RoPE 让长期上下文缓存更稳定。
依赖 DMD 蒸馏，训练阶段需要维护 teacher score function。
全局上下文仅锚定初始帧，对于场景剧烈变化的超长视频可能不够。
$Δ_{Drift}^{Quality} = 0.01$ 的结果很强，但也可能部分受益于评测指标的局限性（仅比较首尾 5 秒）。

Paper Notes

探索

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

1. Motivation (研究动机)

Figure 1 (Teaser)

2. Idea (核心思想)

3. Method (方法)

3.1 Rolling Diffusion Window

3.2 Temporal & Global Context（Attention Sink）

3.3 Efficient Training Algorithm

3.4 Python 伪代码

3.5 Dynamic RoPE for Attention Sink

3.6 代码仓库映射

核心文件结构

关键代码片段对应

4. Experimental Setup (实验设置)

4.1 训练配置

4.2 评测配置

5. Experimental Results (实验结果)

5.1 定量比较

5.2 消融实验

5.3 交互式视频流

5.4 小结与局限

目录