Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

1. Motivation (研究动机)

1.1 长视频生成的核心瓶颈：Error Accumulation

当前 video DiT (如 Wan 2.1) 单次生成长度有限 (通常约 5 秒)。要生成更长视频，需要 autoregressive 方式逐 clip 生成，但 error accumulation (误差累积) 会导致：

图像质量退化 (blur, color shift)
运动漂移 / 语义失控
场景同质化 (scene homogeneity bias)

1.2 现有方案的局限

现有三大路线都只是 缓解 (mitigate) error，而非 纠正 (correct)：

路线	代表方法	局限
Noise modification	FreeNoise, Rolling Diffusion	改 noise schedule，无法消除根本误差
Frame anchoring	StreamingT2V, FramePack	用 clean reference image 做 anchor，仍受限于单 prompt 单场景
Improved sampling	FramePack (Zhang & Agrawala, 2025)	Anti-drifting sampling，长度仍受限 (~1 min)

1.3 关键洞察：Training-Test Hypothesis Gap

Figure 1 解读：对比三种模型的 train-test 假设：

(a) Video Generative DiT：训练时 trajectory 上的中间态 $X_{t}$ 基于 clean latent $X_{vid}$ 构建 (error-free)，但推理时 autoregressive 条件下 $\tilde{X}_{t}$ 来自上一次自身的 imperfect 输出 → train-test hypothesis gap
(b) Image Restoration DiT：训练和测试都接收 degraded input → 没有 gap → 天然具备 error robustness
(c) SVI (本文)：借鉴 restoration 思路，在训练时主动注入 self-generated errors，使 train 和 test 都面对 error-injected input → 消除 gap，同时保留 generation 能力

核心理念：与其训练一个 fragile 的 generator 然后设法让推理不出错，不如训练一个 能在有错条件下自动纠错的 generator。

2. Idea (核心思想)

一句话总结：提出 Error-Recycling Fine-Tuning (ERFT)，将 DiT 自身产生的 error 回收为 supervisory prompt 注入训练，使模型学会在 autoregressive 推理时主动识别并纠正自己的错误，从而支持无限长度视频生成。

核心设计的三步闭环：

Error Injection — 将历史 error 注入 clean input，模拟推理时的 degradation
Bidirectional Error Curation — 单步双向积分高效计算 error
Error Replay Memory — 动态存储与重采样 error，形成闭环

关键优势：

无限长度：error 被主动纠正，不再随时间累积
零额外推理开销：只训练 LoRA adapter，推理与原始 pipeline 一致
多模态兼容：支持 text / audio / skeleton 等多种 condition
数据高效：仅需 300-6k 短视频 LoRA 微调

3. Method (方法)

3.1 整体架构

Figure 3 解读：SVI 整体架构由三个核心模块组成：

(a) Error-Recycling Fine-Tuning (Sec 4.1)：从 3D VAE 获取 $X_{vid}$ , $X_{img}$ , $X_{noi}$ ，从 Error Replay Memory 中采样 $E_{vid}$ , $E_{noi}$ , $E_{img}$ ，按概率注入 clean input 得到 error-injected 版本 $\tilde{X}_{vid}$ , $\tilde{X}_{noi}$ , $\tilde{X}_{img}$ 。拼接后送入 DiT blocks (Self-Attention + Cross-Attention) 预测 velocity $\hat{V}_{t}$
(b) Bidirectional Error Curation (Sec 4.2)：用 predicted velocity 单步前向/后向积分，近似预测 latent 和 noise，与 ground-truth 作差得到 $E_{vid}$ 和 $E_{noi}$
(c) Error Replay Memory (Sec 4.3)：维护两个 memory bank $B_{vid}$ 和 $B_{noi}$ ，按 discretized timestep 存储 error，训练时选择性重采样

3.2 Training-Test Hypothesis Gap 的数学分析

Figure 2 解读：展示 flow matching 中两种 error 的来源：

(a) Error-Free Training：训练假设中间态 trajectory 无误差
(b) Single-Clip Predictive Error：推理时 regressive nature 导致 predicted velocity $\hat{V}_{t}$ 与 ground-truth $V_{t}$ 之间存在偏差 $E_{t}$ ，使 trajectory 终点偏离 $X_{vid}$
(c) Cross-Clip Conditional Error：autoregressive 多 clip 生成时，error-included reference image $\tilde{X}_{img}$ 进一步偏移 trajectory 起点 → 两种 error 相互放大

数学定义：

训练目标 (Error-Free Flow Matching)：

L = E_{X_{noi}, X_{vid}, X_{img}, C, t} ∣ u (X_{t}, X_{img}, C, t; θ) - V_{t} ∣^{2}

其中 $X_{t} = t \cdot X_{vid} + (1 - t) \cdot X_{noi}$ ， $V_{t} = X_{vid} - X_{noi}$ 。

推理时步进积分： $X_{t_{k + 1}} = X_{t_{k}} + (t_{k + 1} - t_{k}) \cdot u (X_{t_{k}}, X_{img}, t_{k}; θ)$

问题根源：训练时 $X_{t}$ 基于 clean $X_{vid}$ 构建，但推理时实际是 $\tilde{X}_{t}$ (基于上一步 imperfect 预测)。

# Pseudocode: Error-Free Flow Matching Training (Baseline)
def flow_matching_train_step(x_vid, x_img, x_noi, condition, t, model):
    """标准训练：假设所有 input 都是 clean 的"""
    x_t = t * x_vid + (1 - t) * x_noi          # clean 中间态
    v_t = x_vid - x_noi                          # ground-truth velocity
    v_hat = model(x_t, x_img, condition, t)      # predicted velocity
    loss = mse(v_hat, v_t)
    return loss
 
# Pseudocode: Autoregressive Inference (暴露 hypothesis gap)
def autoregressive_inference(model, ref_image, prompts, n_steps=50):
    """推理时 error 不断累积"""
    x_img = encode(ref_image)  # 初始 clean
    videos = []
    for prompt in prompts:
        x_noi = torch.randn_like(x_img)
        x_t = x_noi  # t=0
        for k in range(n_steps):
            t_k = k / n_steps
            v_hat = model(x_t, x_img, prompt, t_k)  # x_t 已含 error!
            x_t = x_t + (1/n_steps) * v_hat
        x_vid_hat = x_t  # 含 accumulated error
        videos.append(decode(x_vid_hat))
        x_img = x_vid_hat[-1]  # error-included → 传递到下一 clip
    return videos

3.3 Error-Recycling Fine-Tuning (ERFT)

核心目标：预测 error-recycled velocity $V_{t}^{rcy} = X_{vid} - \tilde{X}_{noi}$ ，无论当前 trajectory 历史是否含 error，都指向 clean latent $X_{vid}$ 。

3.3.1 Error Injection

将从 replay memory 采样的三类 error 按概率注入 clean input：

\tilde{X}_{vid} = X_{vid} + I_{vid} \cdot E_{vid}, \tilde{X}_{noi} = X_{noi} + I_{noi} \cdot E_{noi}, \tilde{X}_{img} = X_{img} + I_{img} \cdot E_{img}

其中 $I_{*} = 1$ w.p. $p_{*}$ ，否则为 0。默认 $p = 0.5$ ，保留一半 clean input 维持生成能力。

# Pseudocode: Error Injection
def inject_errors(x_vid, x_noi, x_img, error_bank_vid, error_bank_noi, t, p=0.5):
    """从 replay memory 采样 error 并按概率注入"""
    # 将训练 timestep t 映射到最近的 test discretized timestep t_n
    t_n = find_nearest_test_timestep(t)
 
    # 选择性采样 error (Sec 4.3)
    E_vid = uniform_sample(error_bank_vid[t_n])      # timestep-aligned
    E_noi = uniform_sample(error_bank_noi[t_n])      # same timestep
    E_img = uniform_sample_across_T(error_bank_vid)   # 跨 timestep 采样
 
    # 按概率注入
    mask_vid = (torch.rand(1) < p).float()
    mask_noi = (torch.rand(1) < p).float()
    mask_img = (torch.rand(1) < p).float()
 
    x_vid_tilde = x_vid + mask_vid * E_vid
    x_noi_tilde = x_noi + mask_noi * E_noi
    x_img_tilde = x_img + mask_img * E_img
 
    return x_vid_tilde, x_noi_tilde, x_img_tilde

3.3.2 Control Injection & Velocity Prediction

SVI 通过两种方式注入额外控制信号：

Visual Control $C_{vis}$ ：如 skeleton，通过 element-wise addition 注入 tokenized input → 精确空间控制 (SVI-Dance)
Embedding Control $C_{emb}$ ：如 text, audio，通过 cross-attention 层注入 → 多模态语义控制 (SVI-Talk)

# Pseudocode: SVI Forward Pass
def svi_forward(x_vid_tilde, x_noi_tilde, x_img_tilde, condition, t, model,
                c_vis=None, c_emb=None):
    """Error-injected 输入 → DiT → predicted velocity"""
    # 构建 noisy latent
    x_t_tilde = t * x_vid_tilde + (1 - t) * x_noi_tilde
 
    # 拼接 image condition
    x_input = concat(x_t_tilde, x_img_tilde)  # along channel
 
    # 可选 visual control (element-wise add after tokenization)
    if c_vis is not None:
        x_input = x_input + c_vis  # e.g., skeleton encoding
 
    # DiT blocks with optional embedding control in cross-attention
    v_hat = model.dit_blocks(x_input, condition, t, c_emb=c_emb)
    return v_hat

3.4 Bidirectional Error Curation

Figure 4 解读：展示三种 error injection 场景下的 error 计算方式：

(a) No Injected Error：无注入时，模拟 single-clip predictive error。前向积分 $\hat{X}_{vid} = \tilde{X}_{t} + \int_{t}^{1} \hat{V}_{s} d s$ 得到 predicted latent，后向积分得到 predicted noise。 $E_{vid} = \hat{X}_{vid} - X_{vid}$ ， $E_{noi} = \hat{X}_{noi}^{img} - X_{noi}^{img}$
(b) Error-Injected Start Point：error 注入到 noise / reference image，模拟 cross-clip conditional error
(c) Error-Injected End Point：error 注入到 video latent，模拟累积 error 视角

核心近似：避免完整 ODE 求解，用 单步积分 近似预测：

\hat{X}_{vid} = \tilde{X}_{t} + \int_{t}^{1} \hat{V}_{s} d s \approx \tilde{X}_{t} + (1 - t) \hat{V}_{t}

\hat{X}_{noi}^{img} = \tilde{X}_{t} - \int_{0}^{t} \hat{V}_{s} d s \approx \tilde{X}_{t} - t \hat{V}_{t}

Error 统一公式：

E_{vid} = \hat{X}_{vid} - X_{vid}^{rcy}, E_{noi} = \hat{X}_{noi}^{img} - X_{noi}^{rcy}, E_{img} = Unif_{T} (E_{vid})

# Pseudocode: Bidirectional Error Curation
def compute_errors(x_t_tilde, v_hat, t, x_vid, x_noi_img, x_vid_rcy, x_noi_rcy):
    """单步双向积分计算 error"""
    # 前向积分：近似预测 video latent
    x_vid_hat = x_t_tilde + (1 - t) * v_hat
 
    # 后向积分：近似预测 noise (image-conditioned start)
    x_noi_img_hat = x_t_tilde - t * v_hat
 
    # 计算 latent error 和 noise error
    E_vid = x_vid_hat - x_vid_rcy      # video latent error
    E_noi = x_noi_img_hat - x_noi_rcy  # noise error
 
    # Image error: 从 E_vid 跨 temporal axis 均匀采样
    E_img = uniform_sample_temporal(E_vid)
 
    return E_vid, E_noi, E_img

3.5 Error Replay Memory

维护两个按 discretized timestep 索引的 memory bank： $B_{vid}$ 和 $B_{noi}$ 。

关键设计：

Timestep 对齐：将训练 timestep $T_{tra}$ ( $N_{tra} = 1000$ ) 映射到 test timestep $T_{test}$ ( $N_{test} = 50$ )，存储到对应位置
容量控制：每个 bank 上限 $Z = 500$ ，满时用 $L_{2}$ 距离替换最相似的 entry，保持 diversity
Warmup：跨机 cross-machine gathering 加速初始 bank 填充 (借鉴 Federated Learning)

选择性采样策略：

Error 类型	采样策略	原因
$E_{vid}$	从 $B_{vid, n}$ (timestep-aligned) 均匀采样	step-wise error 与 timestep 强相关
$E_{noi}$	从 $B_{noi, n}$ (同 timestep) 均匀采样	noise-latent 对偶性
$E_{img}$	从 $B_{vid}$ 跨 timestep 均匀采样	reference image error 是全 trajectory 累积的

# Pseudocode: Error Replay Memory
class ErrorReplayMemory:
    def __init__(self, n_test=50, max_size=500):
        self.n_test = n_test
        self.max_size = max_size
        # 按 test timestep 索引的 bank
        self.B_vid = {n: [] for n in range(n_test)}
        self.B_noi = {n: [] for n in range(n_test)}
 
    def update(self, E_vid, E_noi, t_train):
        """将新计算的 error 存入对应 timestep slot"""
        t_n = self._align_timestep(t_train)
        for bank, error in [(self.B_vid, E_vid), (self.B_noi, E_noi)]:
            if len(bank[t_n]) < self.max_size:
                bank[t_n].append(error)
            else:
                # 替换 L2 距离最近的 entry，保持 diversity
                dists = [torch.norm(error - e) for e in bank[t_n]]
                idx = torch.argmin(torch.tensor(dists))
                bank[t_n][idx] = error
 
    def sample(self, t_train):
        """选择性采样三类 error"""
        t_n = self._align_timestep(t_train)
        E_vid = random.choice(self.B_vid[t_n])
        E_noi = random.choice(self.B_noi[t_n])
        # E_img 跨 timestep 采样
        all_vid_errors = [e for slot in self.B_vid.values() for e in slot]
        E_img = random.choice(all_vid_errors)
        return E_vid, E_noi, E_img
 
    def _align_timestep(self, t_train):
        """将连续训练 timestep 映射到最近的 discrete test timestep"""
        test_steps = torch.linspace(0, 1, self.n_test)
        return torch.argmin(torch.abs(test_steps - t_train)).item()

3.6 Optimization 目标

最终训练 loss：

L_{SVI} = E_{\tilde{X}_{vid}, \tilde{X}_{noi}, \tilde{X}_{img}, C, t} ∣ u (\tilde{X}_{t}, \tilde{X}_{img}, C, t; θ) - V_{t}^{rcy} ∣^{2}

其中 $\tilde{X}_{t} = t \tilde{X}_{vid} + (1 - t) \tilde{X}_{noi}$ ， $V_{t}^{rcy} = X_{vid} - \tilde{X}_{noi}$ 指向 clean latent。

只训练 LoRA adapter，冻结原始 DiT 权重 → 用户可灵活切换。

# Pseudocode: Full SVI Training Loop
def svi_training_step(model, lora, vae, data, memory, p_inject=0.5):
    """完整的 SVI 单步训练"""
    video_clip, ref_image, condition = data
 
    # 1. Encode
    x_vid = vae.encode(video_clip)
    x_img = vae.encode(ref_image)  # with padding
    x_noi = torch.randn_like(x_vid)
    t = torch.rand(1)
 
    # 2. Error injection (from replay memory)
    if memory.is_ready():
        E_vid, E_noi, E_img = memory.sample(t)
        x_vid_t, x_noi_t, x_img_t = inject_errors(
            x_vid, x_noi, x_img, E_vid, E_noi, E_img, p=p_inject)
    else:
        x_vid_t, x_noi_t, x_img_t = x_vid, x_noi, x_img
 
    # 3. Construct error-injected intermediate state
    x_t = t * x_vid_t + (1 - t) * x_noi_t
 
    # 4. Forward through DiT + LoRA
    v_hat = model(x_t, x_img_t, condition, t, lora=lora)
 
    # 5. Error-recycled velocity target (always points to CLEAN latent)
    v_target = x_vid - x_noi_t  # V_t^rcy
 
    # 6. Loss
    loss = F.mse_loss(v_hat, v_target)
 
    # 7. Bidirectional error curation & memory update
    with torch.no_grad():
        E_vid_new, E_noi_new, _ = compute_errors(
            x_t, v_hat, t, x_vid, x_noi, x_vid, x_noi)
        memory.update(E_vid_new, E_noi_new, t)
 
    return loss

3.7 SVI Family — 多应用变体

变体	用途	训练改动	Error 注入概率 ( $E_{img}, E_{vid}, E_{noi}$ )
SVI-Shot	单场景长视频	Padding + random image as anchor	0.9, 0.9, 0.01
SVI-Film	多场景 storyline 长片	5 motion frames, padding frame → zero latent	0.9, 0.9, 0.01
SVI-Talk	Audio-conditioned talking	Audio cross-attention ( $C_{emb}$ )	0.9, 0.9, 0.01
SVI-Dance	Skeleton-guided dancing	Skeleton → visual control ( $C_{vis}$ )	0.9, 0.9, 0.01

3.8 Code Mapping

论文概念	代码位置 (GitHub)
Error-Recycling Fine-Tuning	`train/` 目录下的训练脚本
Error Replay Memory	训练代码中的 memory bank 实现
SVI-Shot / SVI-Film inference	`comfyui_workflow/` (ComfyUI workflow)
LoRA weights	提供预训练 LoRA checkpoint
Benchmark evaluation	`eval/` 目录

4. Experimental Setup (实验设置)

4.1 Base Model & Training

Base model: Wan 2.1 (14B video DiT)
Fine-tuning: 仅训练 LoRA adapter
训练数据: 仅需 300-6k 短视频
训练效率: 数据量极小，任何人都可以制作自己的 SVI

4.2 Benchmarks

三大 benchmark，涵盖 consistent / creative / conditional 设定：

Benchmark	场景数	时长	描述
Consistent Video Generation	Single	50s / 250s	单 prompt 单场景，不换场景
Creative Video Generation	Multiple	50s / 250s	Storyline-based prompt stream，多场景切换
Multimodal Conditional Generation	Single	300s (talk) / 50s (dance)	Audio / skeleton 条件控制

自动 Prompt Stream 引擎：

Keyword → Image (自动检索)
Image → Storyline (Qwen2.5 MLLM 自动生成 prompt stream)
Storyline → Short Film (SVI 迭代生成)

4.3 Baselines

Generic: Wan 2.1, StreamingT2V, HistoryGuidance, FramePack
Audio-conditioned talking: MultiTalk
Skeleton-conditioned dancing: UniAnimate-DiT

4.4 Metrics

视频质量 (VBench++): Subject Consistency, Background Consistency, Aesthetic Quality, Imaging Quality, Dynamic Degree, Motion Smoothness
条件生成: Sync-C, Sync-D, FVD, PSNR, SSIM

5. Experimental Results (实验结果)

5.1 主实验：Long Video Generation in the Wild

Figure 7 解读：定性对比三类生成任务：

(a) Creative Video Generation: StreamingT2V / FramePack 在场景切换时严重退化，SVI-Film 保持平滑 transition 和高视觉质量
(b) Consistent Video Generation: 现有方法出现 color shift、motion drift、静态退化，SVI-Shot 保持时间一致性和合理动态
(c) Multimodal Conditional Generation: SVI-Talk / SVI-Dance 无需专门设计即可在长视频中保持条件一致性

Consistent Video Generation (Table 1 上半):

Method	Sub. Cons.	Back. Cons.	Aest. Qual.	Img. Qual.	Dyn. Deg.	Motion Sm.
Wan 2.1	87.03%	92.45%	56.40%	65.70%	12.68%	98.51%
FramePack	93.08%	94.72%	63.57%	66.72%	7.75%	99.57%
SVI-Shot	98.13%	98.19%	63.84%	71.88%	17.61%	98.93%

SVI-Shot 在大多数指标上达到 SOTA，尤其 consistency (+5.05%) 和 image quality (+5.16%) 相对 FramePack 提升显著。

Creative Video Generation (Table 1 下半):

Method	Sub. Cons.	Back. Cons.	Aest. Qual.	Img. Qual.	Dyn. Deg.	Motion Sm.
Wan 2.1	81.44%	89.81%	51.33%	53.09%	61.97%	98.57%
SVI-Film	84.27%	90.68%	57.02%	57.04%	57.04%	99.12%
SVI-Shot	93.52%	95.86%	58.07%	62.81%	55.63%	98.42%

现有 long video 方法在多场景切换时全面失败，SVI 显著领先。

Conditional Generation:

Task	Method	Key Metrics
Audio-talking	SVI-Talk	Sync-C: 6.12, Sync-D: 8.74, FVD: 390
Skeleton-dance	SVI-Dance	PSNR: 20.01, SSIM: 0.71, FVD: 299

5.2 Stability Analysis

Figure 5 解读：随着生成帧数增加 (800 → 4800)，所有现有方法的 consistency 和 quality 均明显下降，而 SVI 保持稳定 (几乎水平线)。这证明 SVI 真正实现了 error correction 而非仅仅 mitigation。

5.3 Error 可视化

Figure 6 解读：

上行: Wan 2.1 对自身 error 敏感， $\hat{E}_{vid}$ (decoded error) 显示明显结构化 degradation → SVI fine-tuning 后 error 大幅减小
下行: 注入 error ( $E_{noi}$ by Wan 2.1) 可以很好地模拟 autoregressive drifting → 验证 error recycling 策略的有效性

5.4 Ablation Study

Method	Sub. Cons.	Back. Cons.	Aest. Qual.	Img. Qual.
Wan 2.1 (baseline)	66.73%	82.83%	43.95%	42.31%
SVI w/o $E_{img}$	73.82%	84.21%	49.58%	57.63%
SVI w/o $E_{noi}$	94.22%	94.87%	59.80%	69.90%
SVI w/o $E_{vid}$	93.56%	95.01%	58.99%	71.50%
SVI full	94.69%	95.39%	61.88%	71.22%

关键发现：

去除 $E_{img}$ 导致所有指标大幅下降 → reference image error injection 是核心 (直接模拟 cross-clip conditional error)
$E_{vid}$ 和 $E_{noi}$ 起辅助作用，三者结合效果最佳

5.5 定性展示

Figure 8 解读：展示 SVI-Film 的定性结果，重点是多场景长视频中的连续性与场景切换表现。

Figure 9 解读：展示 cat creative short 的定性结果，重点是创意视频生成中的画面连贯性。

Figure 10 解读：展示 baby creative short 的定性结果，重点是创意视频生成中的主体稳定性。

Figure 11 解读：展示 zoo creative short 的定性结果，重点是多场景生成中的整体连贯性。

Figure 12 解读：展示 motorcycle creative short 的定性结果，重点是动作与画面的一致性。

Figure 13 解读：展示 airplane creative short 的定性结果，重点是长视频生成中的视觉稳定性。

Figure 14 解读：展示 dance_034 的定性结果，重点是 skeleton-guided dancing 的动作控制效果。

Figure 15 解读：展示 talkingface_vis 的定性结果，重点是 audio-conditioned talking face 的一致性。

Figure 16 解读：展示 tom_vis 的定性结果，重点是长视频中的视觉连贯性与条件一致性。

Paper Notes

探索

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

1. Motivation (研究动机)

1.1 长视频生成的核心瓶颈：Error Accumulation

1.2 现有方案的局限

1.3 关键洞察：Training-Test Hypothesis Gap

2. Idea (核心思想)

3. Method (方法)

3.1 整体架构

3.2 Training-Test Hypothesis Gap 的数学分析

3.3 Error-Recycling Fine-Tuning (ERFT)

3.3.1 Error Injection

3.3.2 Control Injection & Velocity Prediction

3.4 Bidirectional Error Curation

3.5 Error Replay Memory

3.6 Optimization 目标

3.7 SVI Family — 多应用变体

3.8 Code Mapping

4. Experimental Setup (实验设置)

4.1 Base Model & Training

4.2 Benchmarks

4.3 Baselines

4.4 Metrics

5. Experimental Results (实验结果)

5.1 主实验：Long Video Generation in the Wild

5.2 Stability Analysis

5.3 Error 可视化

5.4 Ablation Study

5.5 定性展示

目录