Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh Affiliations: UCLA, ByteDance Seed, University of Central Florida GitHub: guandeh17/Self-Forcing

1. Motivation (研究动机)

现有自回归长视频生成方法面临两层 train-inference mismatch，导致超出训练时长后质量急剧下降：

Temporal mismatch（时间维度不对齐）：训练时 student 只生成 ≤5 秒的短片（teacher 的最大能力），推理时需要生成远超 5 秒的长视频。Self-Forcing 虽然部分缓解了 CausVid 的 over-exposure 问题，但仍然被 teacher 的 5 秒极限所束缚
Supervision mismatch（监督维度不对齐）：训练时 teacher 对每一帧都提供密集监督，student 几乎不接触误差；推理时 student 必须在自身不完美的输出上持续 rollout，误差不断复合。表现为 motion stagnation（运动停滞）、scene freezing（场景冻结）、catastrophic fidelity loss（质量崩溃）
关键观察：虽然长 rollout 质量退化，但视频往往保留了结构连贯性——错误主要是可恢复的退化（如运动停滞、曝光异常），而非结构崩溃。这说明自回归机制本身没坏，teacher 具有从退化视频中纠正错误的能力

此外，论文发现 VBench 评估长视频存在 bias——它倾向于给 over-exposed 和退化的帧更高的 Image Quality 和 Aesthetic 分数，导致评估不可靠。

Figure 1 解读：左侧展示 Self-Forcing++ 生成的 4 分钟视频截帧（每分钟一组），画面在整个 4 分钟内保持高质量和动态运动。右上角雷达图对比各方法在 CLIP Score、Temporal Quality、Overall Consistency 等维度的表现——Self-Forcing++ 全面领先。右下角动态度曲线显示其他方法（CausVid、MAGI-1、SkyReels-V2、Self-Forcing）的 Dynamic Degree 在 20-40 秒后急剧下降（运动停滞），而 Self-Forcing++ 在 100 秒内始终保持高动态度。

2. Idea (核心思想)

Self-Forcing++ 的核心 insight：短时长 teacher 虽然不能直接生成长视频，但它拥有从退化视频中纠正错误的知识——利用这种纠错能力来训练 student 应对长 rollout 中的误差累积。

具体做法极其简洁（三个组件）：

Backward Noise Initialization：不从随机噪声起步，而是将 student 自身 rollout 的退化输出重新加噪作为起点，确保 DMD 训练的 trajectory 保留了历史上下文
Extended DMD：将 DMD 的监督范围从固定 5 秒窗口扩展到 student rollout 的任意位置——在 $N ≫ T$ 的长 rollout 中均匀采样窗口进行蒸馏
GRPO with Optical Flow Reward：用光流幅度作为 reward 信号，通过 Group Relative Policy Optimization 微调 student，抑制滑动窗口带来的突变和运动不连续

与 LongLive 等方法的根本区别：不需要 attention sink、frame sink 或特殊的 KV cache 管理，只需简单地扩展训练时的 rollout 长度和 DMD 采样范围。

3. Method (方法)

3.1 整体框架

Figure 2 解读：对比三种方法的训练流程。左（CausVid）：用真实数据训练，固定训练时长，推理时用 overlapping frame 做自回归解码——存在严重的 train-inference mismatch。中（Self-Forcing）：student self-rollout + backward noise initialization，但 DMD 监督仍限制在固定 5 秒窗口内，推理超出 5 秒后质量崩溃。右（Self-Forcing++，本文）：student self-rollout 到 $N ≫ 5 s$ （如 100 秒），用 rolling KV cache 保持上下文，然后在长 rollout 的任意位置均匀采样 $K$ 帧窗口进行 Extended DMD。底部时间线：CausVid 和 Self-Forcing 都存在 mismatch（红色叉），Self-Forcing++ 实现了完全的 train-inference alignment（绿色勾）。

Self-Forcing++ 基于 Wan2.1-T2V-1.3B，通过三步训练：

Stage 0 (Init)：ODE 初始化 + Self-Forcing 短视频蒸馏（继承 Self-Forcing 基线）
Stage 1 (Extended DMD)：长 rollout + Backward Noise Init + Extended DMD 扩展训练
Stage 2 (GRPO, 可选)：用光流 reward 微调，提升长程平滑性

3.2 Backward Noise Initialization

短视频蒸馏中，student 从随机噪声 $ϵ \sim N (0, I)$ 起步预测 $x_{0}$ ，teacher 直接在同一噪声上提供监督。但长视频中，每个新 chunk 的起始状态不是随机噪声，而是前一个 chunk 的自回归延续。如果训练时从随机噪声起步，就割裂了与历史上下文的依赖关系。

解决方案：将 student 自身 rollout 的干净输出 $x_{0}^{S}$ 按 diffusion noise schedule 重新加噪：

x_{t} = (1 - σ_{t}) x_{0} + σ_{t} ϵ, where x_{0} = x_{t - 1} - σ_{t - 1} \overset{ϵ}{^}_{θ} (x_{t - 1}, t - 1) (1)

这保证了：(a) 噪声 trajectory 保留了 student 的历史生成轨迹，(b) teacher 评估的是在正确上下文下的分布差异。

def backward_noise_initialization(student_clean_frames, noise_schedule, noise):
    """Re-inject noise into student's clean rollout to create training trajectories."""
    # student_clean_frames: (N, C, H, W) - student's self-generated clean frames
    # noise_schedule: {sigma_t} for each diffusion timestep
    # noise: epsilon ~ N(0, I)
 
    x_t = {}
    for t in noise_schedule.timesteps:
        sigma_t = noise_schedule.sigma(t)
        # Linear interpolation between clean frame and noise
        x_t[t] = (1 - sigma_t) * student_clean_frames + sigma_t * noise
 
    return x_t  # noisy trajectories for DMD training

3.3 Extended Distribution Matching Distillation

Self-Forcing 将 DMD 限制在前 $T$ 帧（≤5 秒）。Self-Forcing++ 将 student rollout 扩展到 $N ≫ T$ 帧，然后均匀采样窗口位置进行蒸馏：

\nabla_{θ} L_{DMD_{extended}} = E_{t} E_{z} [\nabla_{θ} KL (p_{θ, t}^{S} (z) ∥ p_{t}^{T} (z))]

\approx - E_{t} E_{i \sim Unif {1, \dots, N - K + 1}} [\int (s^{T} (Φ (G_{θ} (z_{i}), t), t) - s_{θ}^{S} (Φ (G_{θ} (z_{i}), t), t)) \frac{d G _{θ} ( z _{i} )}{d θ} d z_{i}] (2)

其中 $i$ 从 student 长 rollout 的 ${1, \dots, N - K + 1}$ 中均匀采样， $K$ 是 teacher 能可靠监督的窗口大小（通常 $K = T \approx 5 s$ ）。

核心 insight：teacher 虽然只能生成 5 秒视频，但它在海量训练数据上学到了视频的分布。任何 5 秒片段都可以视为长视频边缘分布的一个样本。因此 teacher 可以在 student 长 rollout 的任意位置提供可靠监督。

def extended_dmd_training_step(student, teacher, cache_size, rollout_length, window_size):
    """One training step of Self-Forcing++ with extended DMD."""
    # Step 1: Student self-rollout to N >> T frames with rolling KV cache
    V = student.rollout(length=rollout_length, cache_size=cache_size)  # N clean frames
 
    # Step 2: Uniformly sample a window position from the long rollout
    i = random.randint(0, rollout_length - window_size)  # uniform sampling
    W = V[i : i + window_size]  # K-frame window
 
    # Step 3: Backward noise initialization on the sampled window
    t = sample_timestep(noise_schedule)
    x_t = backward_noise_init(W, t)  # re-inject noise
 
    # Step 4: Compute DMD loss (student vs teacher distributions)
    L_DMD = dmd_loss(
        student_output=student(x_t, t),
        teacher_output=teacher(x_t, t),
    )
 
    # Step 5: Update student
    L_DMD.backward()
    optimizer.step()

源码参考：推理代码与 Self-Forcing 相同。训练代码尚未发布。

Rolling KV Cache：训练时使用与推理完全相同的 rolling KV cache——student rollout 时维护固定大小的 cache，每生成一个 chunk 就更新 cache。不需要 CausVid 的 overlapping frame 重计算，也不需要 Self-Forcing 的 latent frame masking。

3.4 Improving Long-Term Smoothness via GRPO

滑动窗口 + sparse attention 可能导致窗口边界处的突变（abrupt scene transitions）。Self-Forcing++ 使用 Group Relative Policy Optimization (GRPO) 进行后处理微调：

Reward: 光流幅度（optical flow magnitude）作为 temporal smoothness 的代理指标
Policy: student 生成器 $π_{θ}$ ，其 log probability 通过 diffusion noise prediction 计算
Importance weight: $ρ_{t, i} = \frac{π _{θ} ( a _{t, i} ∣ s _{t, i} )}{π _{θ_{old}} ( a _{t, i} ∣ s _{t, i} )}$

def grpo_update(student, student_old, generated_video):
    """GRPO fine-tuning with optical flow reward."""
    # Compute optical flow reward between consecutive frames
    R = optical_flow_reward(generated_video)  # scalar reward
 
    # Compute importance weights (policy ratio)
    for t, i in all_timesteps_and_frames:
        rho = student.log_prob(a_t_i, s_t_i) / student_old.log_prob(a_t_i, s_t_i)
 
    # GRPO objective (clipped policy gradient)
    L_GRPO = -E[min(rho * advantage, clip(rho, 1-eps, 1+eps) * advantage)]
 
    L_GRPO.backward()
    optimizer.step()

3.5 Visual Stability 评估指标

Figure 3 解读：左图展示 VBench 的 Image Quality 偏差——退化的后期帧（77.13 分）反而比正常早期帧（69.03 分）得分更高。右图展示 Aesthetic Quality 偏差——over-exposed 帧（64.01 分）比正常曝光帧（50.88 分）得分更高。这说明 VBench 对长视频评估不可靠。Self-Forcing++ 提出 Visual Stability 指标：使用 Gemini-2.5-Pro 评估 over-exposure 和 error accumulation，输出 0-100 分。

3.6 Training Budget Scaling

Figure 6 解读：横轴为训练预算（1x-25x），纵轴为视频时长（0-255 秒）。左列（Self-Forcing 和 ODE 初始化）只能生成低质量短片。随着训练预算增加，Self-Forcing++ 的视频质量和可达时长同步提升：4x 时保持语义一致性，8x 时生成详细背景，20x 时生产高保真 50+ 秒视频，25x 时生成 255 秒视频且质量损失极小。这是首次展示训练计算可以直接兑换为视频时长的 scaling property。

3.7 代码-论文映射

论文概念	源码/参考	说明
ODE 初始化	Self-Forcing	继承 Self-Forcing 的初始化流程
Self-Forcing 基线训练	Self-Forcing	DMD + rolling KV cache
Backward Noise Init	论文 Eq. 1, Algorithm 1 line 5	对 student rollout 重新加噪
Extended DMD	论文 Eq. 2, Algorithm 1 lines 3-6	均匀采样长 rollout 中的窗口
Rolling KV Cache	推理代码同 Self-Forcing	训练和推理完全一致
GRPO + 光流 reward	论文 Section 3.3, Algorithm 1 line 9	可选的后处理微调
Visual Stability 指标	论文 Section 3.4	Gemini-2.5-Pro 评估

4. Experimental Setup (实验设置)

基础模型: Wan2.1-T2V-1.3B（16 FPS，832×480）
训练流程: ODE Init → Self-Forcing → Self-Forcing++（Extended DMD）→ GRPO（可选）
Rollout 长度: $N$ 帧，从 50s 扩展到 100s（论文实验），scaling 到 255s（4min15s）
Window 大小: $K = T$ （teacher 的 5 秒窗口）
吞吐量: 17.0 FPS（与 Self-Forcing 相同，推理代码不变）
Baselines: NOVA (0.6B), Pyramid Flow (2B), MAGI-1 (4.5B), SkyReels-V2 (1.3B), CausVid (1.3B), Self-Forcing (1.3B), LTX-Video (1.9B), Wan2.1 (1.3B)
评估设置:
- 短视频: VBench 标准 946 prompts × 16 dimensions
- 长视频: 128 prompts（MovieGen），50s/75s/100s 三个时长
- 指标: Text Alignment, Temporal Quality, Dynamic Degree, Visual Stability (新), Framewise Quality
训练预算: 1x = 生成 5s 视频所需的计算量；论文测试 1x-25x

5. Experimental Results (实验结果)

短视频（5s, VBench）

Model	Params	FPS↑	Total↑	Quality↑	Semantic↑
LTX-Video	1.9B	8.98	80.00	82.30	70.79
Wan2.1	1.3B	0.78	84.67	85.69	80.60
NOVA	0.6B	0.88	80.12	80.39	79.05
Pyramid Flow	2B	6.7	81.72	84.74	69.62
CausVid	1.3B	17.0	82.46	83.61	77.84
Self-Forcing	1.3B	17.0	83.00	83.71	80.14
Self-Forcing++	1.3B	17.0	83.11	83.79	80.37

短视频上 Self-Forcing++ 与 Self-Forcing 持平（83.11 vs 83.00），说明长视频训练不损害短视频质量

长视频（50s/75s/100s）

50s:

Model	Text Align↑	Temporal Quality↑	Dynamic Degree↑	Visual Stability↑	Framewise Quality
CausVid	25.25	89.34	37.35	40.47	61.56
Self-Forcing	24.77	88.17	34.35	40.12	61.06
SkyReels-V2	23.73	88.78	39.15	51.25	54.13
MAGI-1	26.04	88.34	28.49	51.25	54.20
Ours	26.37	91.03	55.36	90.94	60.82

75s:

Model	Text Align↑	Temporal Quality↑	Dynamic Degree↑	Visual Stability↑	Framewise Quality
NOVA	23.37	86.32	31.24	34.06	31.53
MAGI-1	24.95	87.89	24.82	43.28	52.04
SkyReels-V2	22.70	88.99	39.89	55.47	51.55
CausVid	24.76	89.14	35.82	39.84	60.96
Self-Forcing	23.39	87.79	29.15	35.00	60.02
Ours	26.31	91.00	55.62	86.10	60.67

100s:

Model	Text Align↑	Temporal Quality↑	Dynamic Degree↑	Visual Stability↑	Framewise Quality
CausVid	24.41	89.06	34.60	39.21	61.01
Self-Forcing	22.00	87.39	26.41	32.03	58.25
SkyReels-V2	22.05	88.80	38.75	56.72	50.48
Ours	26.04	90.87	54.12	84.22	60.66

关键发现：

Dynamic Degree：Self-Forcing++ 在 100s 时仍达 54.12，而 Self-Forcing 仅 26.41（运动停滞），CausVid 34.60
Visual Stability: 90.94（50s）和 84.22（100s），远超所有 baselines（最高 56.72）
Text Alignment: 100s 时 26.04，超越 CausVid（+6.67%）和 Self-Forcing（+18.36%）

消融实验

Attention Window Size（Table 3, Visual Stability on 50s）:

CausVid	Self-Forcing	Attn-15	Attn-12	Attn-9	Ours
40.47	40.12	44.69	42.19	52.50	90.94

缩小 attention window（Attn-9）有一定帮助（52.50 vs 40.12），但远不如 Self-Forcing++（90.94）
Extended DMD 是主要贡献，window 大小只是辅助手段

GRPO 效果（Figure 5）：

无 GRPO：光流曲线存在尖锐峰值（窗口边界处的突变），variance≈24.52（Figure 5 中标注）
有 GRPO：光流曲线平滑，variance≈20.82（Figure 5 中标注），峰值被有效抑制

Training Budget Scaling：

1x：只能生成低质量短片
4x：保持语义一致性
8x：详细背景 + 准确主体
20x：50+ 秒高保真视频
25x：255 秒（4min15s）视频，利用了 99.9% 的 position embedding 容量

Paper Notes

探索

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

1. Motivation (研究动机)

2. Idea (核心思想)

3. Method (方法)

3.1 整体框架

3.2 Backward Noise Initialization

3.3 Extended Distribution Matching Distillation

3.4 Improving Long-Term Smoothness via GRPO

3.5 Visual Stability 评估指标

3.6 Training Budget Scaling

3.7 代码-论文映射

4. Experimental Setup (实验设置)

5. Experimental Results (实验结果)

短视频（5s, VBench）

长视频（50s/75s/100s）

消融实验

目录