HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

Authors: Kai Zou, Dian Zheng, Hongbo Liu, Tiankai Hang, Bin Liu, Nenghai Yu Affiliations: University of Science and Technology of China (USTC), The Chinese University of Hong Kong, Tongji University, Tencent Hunyuan, Anhui Province Key Laboratory of Digital Security (USTC) arXiv: 2603.08703 Project Page: jacky-hate.github.io/HiAR GitHub: Jacky-hate/HiAR Year: 2026

1. Motivation (研究动机)

自回归 (AR) 视频扩散模型能够生成理论上无限长的视频,但面临一个核心矛盾:

  • 时序连续性 vs. 质量退化:现有 AR 方法(如 Self-Forcing)在生成下一个 block 时,将前一个 block 完全去噪至 (即 clean context)作为条件。虽然 clean context 提供了最大的条件信号,但同时也以最大置信度传播了累积预测误差 ,导致逐 block 的分布漂移(color shift、over-saturation、motion repetition 等)。

  • 核心洞察:作者指出 高度 clean 的 context 并非必要条件。双向扩散模型(如 Wan2.1)在相同噪声水平下对所有帧联合去噪,仍能保持时序连贯性。这说明 noisy context 已经提供了足够的条件信号,且能有效衰减误差传播。

  • 最优噪声水平分析:context 噪声水平 控制一个 bias-information trade-off。将 context 呈现为 (当前去噪步的输出噪声水平)是满足时序因果性的最噪条件,在最大化误差衰减的同时保留所有必要的条件信息。

2. Idea (核心思想)

HiAR 提出 Hierarchical Denoising(分层去噪),将传统 AR 的 “block-first” 生成顺序翻转为 “step-first”:

  1. 去噪顺序翻转:不再逐 block 完成所有去噪步,而是在每个去噪步内对所有 block 执行因果生成。这样每个 block 的 context 自然处于与自身相同的噪声水平 ,实现了误差衰减。

  2. 流水线并行:在 的 (block, step) 网格中,同一反对角线上的 位置互相独立,可以分配到不同 GPU 实现 pipelined parallelism,获得 wall-clock 加速。

  3. Forward-KL 正则化:self-rollout distillation 在 hierarchical denoising 下会放大 reverse-KL 固有的 low-motion shortcut。作者引入 forward-KL regulariser,在 bidirectional-attention 模式下计算,保持运动多样性而不干扰 causal DMD 目标。

3. Method (方法)

3.1 整体框架

Figure 2 解读:左侧为现有 AR rollout(如 Self-Forcing),按 block-first 顺序逐 block 完成所有去噪步,每步条件化于 clean context(),放大误差传播。右侧为 HiAR 的 hierarchical denoising,按 step-first 顺序在每个去噪步内对所有 block 因果生成,context 处于匹配的噪声水平 。下方为训练流程:causal rollout 结合 reverse-KL (DMD) loss,加上在 bidirectional-attention 模式下计算的 forward-KL regulariser 以保持运动多样性。

3.2 Context Noise Level 与误差传播分析

误差分解:考虑 block 在步骤 (从 去噪到 )。设 为前一 block 的 ground-truth clean latent, 为模型预测。Context 在噪声水平 下构造为:

展开后分解为三项:

  • 真实信号和传播偏差共享系数 ,随机扰动系数为
  • 控制 bias-information trade-off:增大 衰减偏差但同时减少有用条件信号
  • (Self-Forcing 做法),误差无衰减直接传播

时序因果性约束:context 必须至少包含当前 block 在步骤 后所拥有的信息量。SNR 单调递增要求:

最优选择:取约束边界即最噪条件:

# Pseudocode: Context noise level selection
def get_context_noise_level(step_j, schedule):
    """Optimal context noise = output noise level of current step."""
    # schedule: [t_1=1, t_2, ..., t_S≈0]
    t_j = schedule[step_j]          # input noise level
    t_j1 = schedule[step_j + 1]    # output noise level
    # Optimal: use output level (noisiest that preserves causality)
    t_c = t_j1
    return t_c

3.3 Hierarchical Denoising 推理

核心改变:将去噪步 放在外循环,block 放在内循环。每步中,block 在噪声水平 下的状态为 context。

Algorithm 1 — Hierarchical Denoising:

def hierarchical_denoising(model, schedule, initial_noise, N):
    """
    Args:
        model: denoiser v_theta
        schedule: [t_1=1, t_2, ..., t_S≈0], S denoising steps
        initial_noise: {x_t1^(n)} for n=1..N
        N: number of blocks
    Returns:
        generated blocks {x_hat_0^(n)} for n=1..N
    """
    S = len(schedule) - 1
    x = initial_noise  # x[n] = x_{t_1}^{(n)}
 
    for j in range(S):  # denoising steps (outer loop)
        t_j = schedule[j]
        t_j1 = schedule[j + 1]
        sigma_j = get_sigma(t_j)
        sigma_j1 = get_sigma(t_j1)
 
        for n in range(N):  # causal block sweep (inner loop)
            # Context: blocks B_{<n} at noise level t_{j+1}
            context = get_kv_cache(blocks[:n])
            # Euler step
            v_pred = model(x[n], t_j, context=context)
            x[n] = x[n] + v_pred * (sigma_j1 - sigma_j)
            # Update KV cache with x[n] at t_{j+1}
            update_kv_cache(n, x[n])
 
    return {n: x[n] for n in range(N)}

3.4 Pipelined Parallelism

Figure: Pipeline Parallelism 解读:在 网格中,block 在步骤 仅依赖 在步骤 以及 在步骤 的结果。同一反对角线上的位置互相独立,可分配给不同 GPU。每个 stage 负责一个去噪步,stage 间通过异步点对点通信交换中间 latent。

KV cache 融合优化:在因果 attention 下,更新 的 KV cache 和去噪 可融合为一次 forward call,将每 stage 的 forward pass 从 减少到 (第一个 block 单独去噪 + 次融合 + 1 次尾部 cache 写入),在 4-step 设置下实现 加速。

# Pseudocode: DAC (Denoise-AddNoise-Cache) pipeline parallel
def pipeline_parallel_inference(model, schedule, noise, N, rank, world_size):
    """
    Each rank handles one denoising step (stage).
    Diagonal traversal of (block, step) grid.
    """
    S = len(schedule) - 1
    assert world_size == S  # one GPU per denoising step
 
    for diag in range(N + S - 1):
        n = diag - rank  # block index for this rank
        if 0 <= n < N:
            j = rank  # this rank's denoising step
            t_j, t_j1 = schedule[j], schedule[j + 1]
 
            # Receive x[n] at t_j from previous stage
            if rank > 0:
                x_n = recv(src=rank - 1)
            else:
                x_n = noise[n]
 
            # Phase D: Denoise
            v_pred = model(x_n, t_j, context=kv_cache[:n])
            x_denoised = x_n + v_pred * (sigma(t_j1) - sigma(t_j))
 
            # Phase A: Send to next stage (overlapped)
            if rank < world_size - 1:
                send(x_denoised, dst=rank + 1)
 
            # Phase C: Update KV cache (overlapped with next stage's D)
            update_kv_cache(n, x_denoised)

3.5 Training with Forward-KL Regulation

训练采用 self-rollout distillation,在 hierarchical denoising pipeline 下进行。

DMD Loss (Reverse-KL):student model 先 rollout 生成 ,再用作下一 block 的 context。训练目标为 reverse-KL (DMD):

Low-motion shortcut 问题:reverse-KL 的 mode-seeking 性质使 student 倾向于生成低运动输出以最小化 loss。Hierarchical denoising 放大了此效应,因为条件化于多水平 noisy context 增加了学习难度,给 mode-seeking 目标更多迭代以收敛到低运动模式。

Forward-KL Regulariser:从 teacher 采样 dense denoising trajectory ,监督 student 匹配相邻步之间的 velocity:

因为 target 来自 teacher 分布,优化此目标等价于最小化 forward-KL,鼓励 student 覆盖 teacher 的输出模式,保持运动多样性。

解耦设计

  1. 仅 bidirectional-attention 模式 仅在 bidirectional-attention 下计算,不修改 causal self-rollout DMD loss。因为 bidirectional 和 causal 模式下的 motion dynamics 强正相关(Pearson ),正则化前者等效约束后者。
  2. 仅前 :motion dynamics 主要由低频结构决定(earliest denoising steps),因此 仅应用于前 步。

总训练目标

其中 ,critic 和 generator 以 5:1 比例交替更新。

# Pseudocode: HiAR training step
def train_step(student, teacher, critic, batch, ode_pairs):
    """
    Combined DMD + Forward-KL training.
    Args:
        student: student generator v_theta
        teacher: frozen teacher (Wan2.1-1.3B)
        critic: fake score network (Wan2.1-14B)
        batch: training data with text prompts
        ode_pairs: pre-sampled teacher ODE trajectories
    """
    # === Phase 1: DMD Loss (causal rollout) ===
    # Self-rollout under hierarchical denoising
    x_gen = hierarchical_rollout(student, batch, causal_attention=True)
    # Compute reverse-KL gradient via critic
    dmd_loss = compute_dmd_loss(student, critic, x_gen, batch)
 
    # === Phase 2: Forward-KL Loss (bi-attn mode) ===
    # Sample teacher trajectory pair (x_ti, x_ti+1)
    x_ref_t, x_ref_t1, t_i, t_i1 = sample_ode_pair(ode_pairs)
    # Student prediction in bidirectional-attention mode
    v_pred = student(x_ref_t, t_i, bidirectional_attention=True)
    # Target velocity from teacher trajectory
    v_target = (x_ref_t1 - x_ref_t) / (sigma(t_i1) - sigma(t_i))
    fkl_loss = mse_loss(v_pred, v_target)
 
    # === Combined objective ===
    lambda_fkl = 0.1
    total_loss = dmd_loss + lambda_fkl * fkl_loss
    total_loss.backward()

Figure 4 解读:训练过程中 bidirectional dynamic score 与 causal dynamic score 的相关性散点图。每个点代表一个 training checkpoint,颜色编码训练步数。Pearson 的强正相关证实 low-motion shortcut 同时影响两种 attention 模式,正则化 bidirectional 模式能有效约束 causal 模式的运动多样性。

3.6 Code-to-Paper 映射表

Paper ConceptSource FileKey Class/Function
Hierarchical Denoising 推理pipeline/causal_inference.pyCausalInferencePipeline.inference()
Timestep-first training looppipeline/timestep_forcing_training.pyTimestepForcingTrainingPipeline._timestep_first_loop()
Frame-first training looppipeline/timestep_forcing_training.pyTimestepForcingTrainingPipeline._frame_first_loop()
DMD loss (Reverse-KL)model/dmd.pyDMD.compute_distribution_matching_loss()
Forward-KL regulariser ()model/timestep_forcing_dmd.pyTimestepForcingDMD.progressive_distillation_loss()
KV cache 融合 forwardmodel/timestep_forcing_dmd.pyTimestepForcingDMD._pd_single_forward()
Pipeline parallel inference (DAC)scripts/pipeline_parallel_inference.py_run_dac()
Pipeline parallel inference (baseline)scripts/pipeline_parallel_inference.py_run_baseline()
Critic lossmodel/dmd.pyDMD.critic_loss()
Generator lossmodel/dmd.pyDMD.generator_loss()
训练主循环trainer/distillation.pyDistillationTrainer.train(), fwdbwd_one_step()
Teacher ODE trajectory 生成scripts/generate_ode_pairs.py
Noise schedule ()wan/utils/fm_solvers.py

4. Experimental Setup (实验设置)

Base model: Wan2.1-1.3B,teacher 为 Wan2.1-14B

训练配置

  • 4-step denoising schedule ()
  • Chunk-wise 实现,每 chunk 3 个 latent frames
  • 16k ODE solution pairs,causal attention masking fine-tune
  • Forward-KL:从 Wan2.1-1.3B base model 采样 20k denoising trajectories(50 ODE steps each),仅约束第 1 个去噪步 (),
  • Learning rate: ,batch size 64,训练 20k steps on 5-second clips
  • 推理时使用 sliding-window KV cache(5s window)

评测指标

  • VBench:16 维度,分为 Quality score 和 Semantic score,所有模型采样 20s
  • Drift Score(本文提出):将 20s 视频均分为 5 段,计算每段的感知质量(MUSIQ、CLIP-IQA)、时序连贯性(DINOv2 cosine similarity、LPIPS)、低级统计量(HSV saturation mean、Laplacian variance),对每个指标做线性回归取斜率,归一化加权求和

Baselines

  • 双向扩散:LTX-Video、Wan2.1-1.3B
  • AR 扩散:NOVA、Pyramid Flow、SkyReels-V2 (1.3B)、MAGI-1 (4.5B)
  • Distilled AR:CausVid、Self-Forcing、Causal Forcing

5. Experimental Results (实验结果)

5.1 主实验 (Table 1: 20s generation)

ModelThru. (fps)Lat. (s)Total↑Quality↑Semantic↑Dynamic↑Drift↓
LTX-Video8.9813.50.7660.7890.6850.458
Wan2.1-1.3B0.781030.8020.8130.7660.690
NOVA0.884.10.7730.7770.7570.444
Pyramid Flow6.702.50.7750.8040.6700.161
SkyReels-V2 (1.3B)0.491120.7880.8080.7070.333
MAGI-1 (4.5B)0.192820.7570.7850.6470.486
CausVid170.690.7640.7710.7400.6210.842
Self-Forcing170.690.8050.8290.7080.5420.355
Causal Forcing170.690.8100.8370.7010.6720.615
HiAR (Ours)300.300.8210.8460.7230.6860.257

关键发现

  • HiAR 在所有方法中取得最高 Total score (0.821),最高 Quality score (0.846),最低 Drift (0.257)
  • 相比 Self-Forcing,Drift 降低 27.6%(0.257 vs. 0.355)
  • 吞吐量 30 fps,per-chunk latency 0.30s,相比同 backbone 的 distilled AR 模型实现 加速
  • Dynamic score 0.686,接近 Wan2.1-1.3B teacher (0.690),远超 Self-Forcing (0.542)

Figure 3 解读:distilled AR 模型在 20s 生成上的定性对比。CausVid 退化最严重(色调偏移至 neon green/yellow),Self-Forcing 和 Causal Forcing 存在明显的色彩过饱和。HiAR 在所有 prompt 类型上保持稳定的色彩保真度和结构连贯性,无可感知的 drift。

5.2 Ablation: Context Noise Level (Table 2)

Context noise Quality↑Semantic↑Smooth.↑Drift↓
(input level)0.7990.6920.9780.184
(output level; default)0.8460.7230.9880.257
(Self-Forcing)0.8290.7080.9910.355
  • :最低 drift (0.184) 但质量显著下降 (0.799),运动不流畅 (Smooth 0.978)
  • (HiAR default):最佳质量-稳定性平衡
  • (Self-Forcing):最高 drift (0.355)

5.3 Ablation: Forward-KL Regulariser Design (Table 3)

ConfigurationQuality↑Semantic↑Dynamic↑Drift↓
bi-attn + 1 step (default)0.8460.7230.6860.257
causal + 1 step0.8280.7010.6250.271
bi-attn + 2 steps0.8350.7080.6930.296
bi-attn + 4 steps0.8130.6840.6910.306
w/o 0.8390.7320.4450.218
w/o re-training0.7670.5590.5120.309
w/o hier. denoising (Self-Forcing)0.8290.7080.5420.355
  • 去除 :Dynamic 暴跌至 0.445,确认 low-motion shortcut 的存在
  • 仅推理时使用 hierarchical denoising (w/o re-training):Drift 改善 (0.309 vs. 0.355) 但质量损失大 (0.767)
  • Causal 模式下的 效果弱于 bidirectional 模式 (Dynamic 0.625 vs. 0.686)
  • 增加约束步数 带来的 Dynamic 边际收益递减但质量和 drift 恶化

Figure 1 解读:Motivation 总览。(a) Bidirectional diffusion(Wan2.1)证明 shared noise level 下即可保持时序连贯,但长度固定。(b) 标准 AR(Self-Forcing)可扩展长度但 clean context 导致质量退化。(c) HiAR 的 hierarchical denoising 在推理时(w/o training)即可缓解 drift,经过重训练(w/ training)后实现稳定质量、可扩展长度和无缝连续性。