FAR: Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Authors: Yuchao Gu, Weijia Mao, Mike Zheng Shou
Affiliations: National University of Singapore (Show Lab)
GitHub: showlab/FAR
Year: 2025

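1. Motivation

1.1 Background

Current video generation models (e.g., Wan, Cosmos) are trained mainly on short clips of about 5 seconds. They capture short-term temporal consistency (object motion, human actions) but cannot maintain long-term consistency (environment memory, scene persistence). To serve as a true world simulator, a model needs to be trained directly on long videos so that it can capture long-range dependencies.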

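Core challenges:

- The number of visual tokens grows rapidly with context length (128 frames already exceed 8K tokens), which makes training very expensive
- Test-time extrapolation (sliding window, RoPE extrapolation, etc.) performs far worse than training directly on long videos
- Existing hybrid AR-Diffusion models suffer from a training-inference gap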

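1.2 Key Contributions

- FAR (Frame AutoRegressive) baseline: frame-level autoregression with per-frame flow matching; it converges faster than Video DiT and generates higher-quality video than Token-AR models
- Context redundancy: in video autoregression, nearby frames are critical for temporal consistency, while distant frames serve mainly as memory and are highly redundant
- Long Short-Term Context Modeling with Asymmetric Patchify Kernels: distant context is compressed with a large patchify kernel while nearby context keeps the standard resolution, cutting training cost by about 5x and memory by about 6x
- Stochastic Clean Context: closes the training-inference gap of AR-Diffusion models
- Multi-Level KV Cache: speeds up long-video inference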

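1.3 Personal Thoughts and Summary

The essential insight of FAR is context redundancy in video autoregression. It is a natural and powerful observation: distant frames mainly provide memory of "which scene we are in" and do not need pixel-level detail, whereas nearby frames must carry precise motion information to keep the video temporally consistent. Asymmetric patchify is the most direct engineering realization of this observation.

What makes Stochastic Clean Context elegant: by introducing a special timestep $t = -1$ and a random replacement strategy, it closes the training-inference gap of AR-Diffusion models at no extra compute cost. SCC is more efficient than the ACDIT/MAGI approach, which doubles the training cost.

- vs Video DiT (Latte): FAR uses independent per-frame noise with causal attention, whereas Latte uses one shared noise level with full attention; FAR converges faster (Fig. 5)
- vs Token-AR (MAGViT, TATS): a continuous latent space avoids the information loss of VQ, and generation quality is markedly higher
- vs MAGI/ACDIT: same Frame-AR paradigm, but FAR avoids the doubled training cost and adds long short-term context modeling
- vs TECO: TECO applies aggressive downscaling uniformly to all frames and sacrifices prediction accuracy on nearby frames; FAR's asymmetric strategy preserves nearby detail

- No large-scale text-to-video experiments: validated so far only on UCF-101, BAIR, Minecraft, and DMLab, not trained on large-scale T2V data
- Limited video length: at most 300 frames (about 20 seconds); minute-long videos have not been validated

Future directions:
- Scale up to large-scale T2V data and compare directly with Wan, Cosmos, and similar models
- Build datasets and benchmarks for minute-long videos
- Explore video-level in-context learning (leveraging the long-context capability)

Implications for RL for Video Generation:

- Generating each frame can be viewed as a decision step, with rewards given per frame or per segment
- The KV cache makes rollouts efficient, which suits online RL
- Long short-term context can serve as the memory mechanism of an RL agent
- The idea behind Stochastic Clean Context could transfer to teacher-forcing strategies in RL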

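2. Idea

2.1 FAR Framework Overview

Figure 2 explained: FAR training and inference pipeline. Left, short-video training: the input frames are encoded into latent space by the VAE, each frame independently samples a timestep $t \sim U(0, 1)$, and the model learns to denoise via flow matching. Some frames are randomly replaced with clean context (timestep marked as $-1$). Right, long-video training: a long-term context window (large patchify kernel, only 4 tokens per frame) and a short-term context window (standard patchify, 64 tokens per frame) give asymmetric token compression. Inference generates frames one at a time, autoregressively.

Core architectural design:

- Built on the DiT/SiT architecture with causal spatiotemporal attention
- Full attention within a frame, causal attention across frames (each frame sees only itself and earlier frames)
- Difference from Latte: no alternating spatial/temporal attention; a single unified causal spatiotemporal attention instead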
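
The attention pattern described above (full attention inside a frame, causal across frames) can be illustrated with a small mask builder. This is a minimal sketch in plain Python, not code from the FAR repository; the function name `frame_causal_mask` and the toy sizes are assumptions chosen for illustration.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Boolean mask: mask[q][k] is True if query token q may attend to key token k.

    A token may attend to every token in its own frame and in all earlier frames.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


# Toy example: 3 frames with 2 tokens each.
mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

The printed matrix is block lower-triangular: each 2x2 diagonal block is fully on (intra-frame full attention), and everything to the right of it is off (no access to future frames).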

2.2 Context Redundancy

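Core idea, context redundancy:

- Generation of the current frame depends mainly on nearby frames (local motion consistency)
- Distant frames mainly provide scene memory and do not need fine-grained detail
- A larger patchify kernel can therefore compress the tokens of distant frames

2.3 Stochastic Clean Context

Training strategy:

- Randomly replace a subset of noised frames with their clean latents
- Give clean context frames a special timestep embedding $t = -1$ (outside the flow matching range $[0, 1]$)
- Clean context frames are excluded from the loss
- At inference, all context frames use clean latents ($t = -1$), matching the training distribution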
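
To make the training and inference sides of this strategy concrete, the following sketch builds per-frame timestep vectors for both cases. It is an illustrative stand-in in plain Python, assuming a replacement probability `scc_ratio`; the helper names (`sample_training_timesteps`, `inference_timesteps`, `loss_mask`) are hypothetical and not taken from the FAR code.

```python
import random

CLEAN_T = -1.0  # special timestep that marks a clean context frame


def sample_training_timesteps(num_frames, scc_ratio, rng):
    """Per-frame timesteps for training: U(0, 1) values, some replaced by CLEAN_T."""
    timesteps = []
    for _ in range(num_frames):
        if rng.random() < scc_ratio:
            timesteps.append(CLEAN_T)       # frame is fed as a clean latent
        else:
            timesteps.append(rng.random())  # frame is noised at level t
    return timesteps


def inference_timesteps(num_context_frames, current_t):
    """At inference every context frame is clean; only the last frame is noisy."""
    return [CLEAN_T] * num_context_frames + [current_t]


def loss_mask(timesteps):
    """Clean context frames do not contribute to the loss."""
    return [t != CLEAN_T for t in timesteps]


rng = random.Random(0)
train_t = sample_training_timesteps(num_frames=16, scc_ratio=0.1, rng=rng)
infer_t = inference_timesteps(num_context_frames=3, current_t=0.5)
```

Because the network sees `t = -1` frames during training, the all-clean context in `infer_t` is no longer out of distribution at inference.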

Quantitative effect (Table 6, UCF-101):

Method	SSIM↑	PSNR↑	LPIPS↓	FVD↓
w/o SCC	0.540	16.42	0.211	399
w/ SCC	0.596	18.46	0.187	347

2.4 Long Short-Term Context Modeling

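Figure 6 explained: relationship between token context length and vision context length. (a) With long short-term context modeling, the token count grows much more slowly with the number of frames than under uniform context modeling, because distant frames are compressed. (b) Training time: FAR-Long takes about 5x less training time than Video DiT. (c) Training memory: FAR-Long uses about 6x less memory.

Asymmetric Patchify Kernels:

- Short-term context window (e.g., the most recent 16 frames): standard patchify kernel, 64 tokens per frame
- Long-term context window (earlier frames): large patchify kernel (e.g., $4 \times 4$), only 4 tokens per frame
- The two windows use separate projection layers (inspired by MM-DiT)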
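
The effect of the two windows on sequence length can be worked out directly from the per-frame token counts above. The sketch below is simple arithmetic, assuming a 16-frame short-term window, 64 tokens per short-term frame, and 4 tokens per long-term frame; the function names are made up for this example.

```python
def uniform_tokens(num_frames, tokens_per_frame=64):
    """Token count when every frame uses the standard patchify kernel."""
    return num_frames * tokens_per_frame


def long_short_tokens(num_frames, short_window=16, short_tokens=64, long_tokens=4):
    """Token count when only the most recent `short_window` frames keep full resolution."""
    num_short = min(num_frames, short_window)
    num_long = num_frames - num_short
    return num_long * long_tokens + num_short * short_tokens


for n in (16, 128, 256):
    print(n, uniform_tokens(n), long_short_tokens(n))
```

Under these assumptions a 256-frame context shrinks from 16384 tokens to 1984 tokens, and each additional distant frame adds 4 tokens instead of 64.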

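Patchify kernel selection criterion:

$$c \times c \times d \leq D$$

where $c$ is the patchify kernel size, $d$ is the latent dimension (32), and $D$ is the model hidden dimension. For example, $4 \times 4 \times 32 = 512 < 768$, so the projection is essentially lossless.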
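
The criterion is easy to check for each candidate kernel. The snippet below is a small sketch assuming $d = 32$ and $D = 768$ (the FAR-B hidden size); `fits_hidden_dim` is a hypothetical helper name.

```python
def fits_hidden_dim(kernel, latent_dim=32, hidden_dim=768):
    """True if one patch (kernel x kernel x latent_dim values) fits in the hidden dimension."""
    return kernel * kernel * latent_dim <= hidden_dim


for c in (1, 2, 4, 8):
    print(c, c * c * 32, fits_hidden_dim(c))
```

With these values, kernels 1, 2, and 4 satisfy the criterion (32, 128, and 512 values per patch), while kernel 8 does not (2048 values per patch must be squeezed into 768 dimensions).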

Patchify kernel ablation (Table 7):

Patchify Kernel	SSIM↑	PSNR↑	LPIPS↓	FVD↓	Training Memory
[1,1]	-	-	-	-	OOM
[2,2]	0.570	19.1	0.156	38	38.9 G
[4,4]	0.576	19.3	0.153	34	15.3 G
[8,8]	0.558	18.6	0.171	33	0.9 G

The $[4, 4]$ kernel gives the best trade-off between performance and efficiency.

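3. Method

3.1 Training Objective: Flow Matching

Given a data sample $x_{0} \sim p_{data}(x)$ and noise $x_{1} \sim N(0, I)$, construct the linear interpolation path:

$$x(t) = (1 - t) x_{0} + t x_{1}, \quad t \in [0, 1]$$

The corresponding constant velocity field is:

$$\frac{d x(t)}{d t} = v^{*} = x_{1} - x_{0}$$

The training objective is the per-frame flow matching loss:

$$L(\theta) = E_{x_{0}, x_{1}, t} \left[ \| v_{\theta}(x(t), t) - v^{*} \|^{2} \right]$$

Key difference: FAR samples a timestep independently for each frame (as in diffusion forcing), rather than one shared timestep as in Video DiT. This is what makes frame-level autoregression possible.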
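
The linear path and constant velocity defined in this section can be checked numerically on a scalar toy case. The sketch below assumes an oracle that returns the exact target velocity $x_{1} - x_{0}$ in place of a trained network $v_{\theta}$, and integrates from $t = 1$ (noise) to $t = 0$ (data) with plain Euler steps; all function names are illustrative.

```python
def interpolate(x0, x1, t):
    """Point on the linear path x(t) = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Constant velocity of the linear path."""
    return x1 - x0


def euler_denoise(x_noise, velocity_fn, num_steps):
    """Integrate dx/dt = v backwards in time, from t = 1 down to t = 0."""
    x = x_noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt
        x = x - dt * velocity_fn(x, t)
    return x


x0, x1 = 2.0, -0.5  # toy "data" and "noise" values


def oracle_velocity(x, t):
    # Stand-in for a perfectly trained v_theta on this single pair.
    return target_velocity(x0, x1)


recovered = euler_denoise(x1, oracle_velocity, num_steps=10)
```

Because the velocity is constant along the path, Euler integration recovers `x0` up to floating-point error regardless of the step count; a learned velocity field would only approximate this.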

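3.2 Autoregressive Modeling

Figure 3 explained: attention mask visualization. (a) Short-video training: standard frame-level causal attention, where each frame sees all tokens of itself and of every earlier frame. (b) Long-video training: distant frames (long-term context) use fewer tokens (aggressive patchification), while nearby frames (short-term context) keep the full token count.

FAR factorizes video modeling into frame-level autoregression:

$$p(x_{1}, x_{2}, \dots, x_{n}) = \prod_{i=1}^{n} p(x_{i} \mid x_{1}, x_{2}, \dots, x_{i-1})$$

Here the subscript indexes frames, unlike $x_{0}$ and $x_{1}$ in Section 3.1, which denote data and noise. Each $p(x_{i} \mid \text{context})$ is realized by flow matching (conditional denoising).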
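
The factorization can be read as a sampling loop: each new frame starts from noise and is denoised conditioned on all frames generated so far. The following is a toy sketch with scalar "frames"; `toy_velocity` is a hypothetical stand-in for the trained network (it pretends the clean next frame is always the previous frame plus 1), and none of the names come from the FAR code.

```python
import random


def sample_next_frame(context, velocity_fn, num_steps, rng):
    """Denoise one frame from t = 1 to t = 0, conditioned on the context frames."""
    x = rng.gauss(0.0, 1.0)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt
        x -= dt * velocity_fn(x, t, context)
    return x


def generate_video(context_frames, num_new_frames, velocity_fn, num_steps=20, seed=0):
    """Frame-level autoregressive rollout: x_i is sampled given x_1 .. x_{i-1}."""
    rng = random.Random(seed)
    frames = list(context_frames)
    for _ in range(num_new_frames):
        frames.append(sample_next_frame(frames, velocity_fn, num_steps, rng))
    return frames


def toy_velocity(x, t, context):
    # On the linear path, x(t) - x0 = t * (x1 - x0), so v = (x(t) - x0) / t.
    clean_target = context[-1] + 1.0
    return (x - clean_target) / t


video = generate_video([0.0], num_new_frames=4, velocity_fn=toy_velocity)
```

With this idealized velocity the rollout produces approximately `[0, 1, 2, 3, 4]`; in the real model the conditioning on `context` happens through causal attention rather than an explicit argument.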

3.3 Inference-Time KV Cache

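Figure 7 explained: KV cache for the short-video model. During autoregressive generation, each frame is first denoised along the flow matching schedule to obtain a clean latent, then a caching step ($t = -1$) encodes the clean frame into the KV cache. Later frames reuse the cached KV directly and avoid recomputation.

Figure 8 explained: Multi-Level KV Cache for the long-video model, in three steps. (Step 1) When a frame leaves the short-term window, it is encoded with the large patchify kernel into the L2 cache (4 tokens per frame). (Step 2) The L1 cache of all frames in the short-term window is re-encoded. (Step 3) The current frame is denoised using the L1 + L2 cache. In the actual implementation the three steps can be merged into a single forward pass.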
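
The bookkeeping behind the two cache levels can be sketched with a small data structure. This is a toy model that only tracks which frame lives in which level and how many tokens are cached; it does not hold key/value tensors and does not model the re-encoding in Step 2. The class name `MultiLevelCache` and its defaults (16-frame window, 64 and 4 tokens per frame) are assumptions for illustration.

```python
from collections import deque


class MultiLevelCache:
    """L1 holds the short-term window at full resolution; L2 holds compressed older frames."""

    def __init__(self, short_window=16, l1_tokens=64, l2_tokens=4):
        self.short_window = short_window
        self.l1_tokens = l1_tokens
        self.l2_tokens = l2_tokens
        self.l1 = deque()  # frame ids in the short-term window
        self.l2 = []       # frame ids that have left the window

    def add_clean_frame(self, frame_id):
        """Cache a newly generated clean frame and demote the oldest L1 frame if needed."""
        self.l1.append(frame_id)
        if len(self.l1) > self.short_window:
            self.l2.append(self.l1.popleft())

    def cached_tokens(self):
        """Total number of cached tokens the next frame attends to."""
        return len(self.l1) * self.l1_tokens + len(self.l2) * self.l2_tokens


cache = MultiLevelCache()
for frame_id in range(255):  # context available when generating frame 256
    cache.add_clean_frame(frame_id)
print(cache.cached_tokens())
```

Under these assumptions the 256th frame attends to 1980 cached tokens, compared with 255 x 64 = 16320 if every frame stayed at full resolution.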

KV cache inference speedup (Figure 11):

- No KV cache, no long short-term context: about 1341 s for 256 frames
- With KV cache: about 171 s (7.8x speedup)
- With long short-term context and Multi-Level KV Cache: about 104 s (12.9x speedup)
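
The quoted speedup factors follow directly from the three timings. A quick check, using only the approximate numbers listed above:

```python
baseline_s = 1341.0     # no KV cache, no long short-term context
kv_cache_s = 171.0      # with KV cache
multi_level_s = 104.0   # with long short-term context + Multi-Level KV Cache

speedup_kv = round(baseline_s / kv_cache_s, 1)
speedup_multi = round(baseline_s / multi_level_s, 1)
speedup_incremental = round(kv_cache_s / multi_level_s, 1)
print(speedup_kv, speedup_multi, speedup_incremental)
```

Most of the gain comes from the KV cache itself; moving from the plain KV cache to the multi-level variant adds a further factor of roughly 1.6.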

3.4 Training Pseudocode

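# FAR Training Pipeline (Short-Video)
import torch

def train_step(video_frames, vae, model, scc_ratio=0.1):
    # `vae` and `model` are placeholders for the VAE encoder and the FAR network.
    # 1. Encode to latent space
    Z = vae.encode(video_frames)  # Z: [B, T, H, W, C]
    B, T = Z.shape[:2]

    # 2. Sample independent timestep per frame
    t = torch.rand(B, T, device=Z.device)  # t[b,i] ~ U(0,1), independent per frame

    # 3. Stochastic Clean Context: randomly mark some frames as clean
    clean_mask = torch.rand(B, T, device=Z.device) < scc_ratio  # e.g., ratio=0.1

    # 4. Add noise via linear interpolation (flow matching)
    noise = torch.randn_like(Z)
    t_b = t.view(B, T, 1, 1, 1)  # broadcast per-frame t over H, W, C
    Z_noisy = (1 - t_b) * Z + t_b * noise  # per-frame interpolation
    Z_noisy[clean_mask] = Z[clean_mask]  # clean frames unchanged
    t = t.masked_fill(clean_mask, -1.0)  # special timestep for clean context

    # 5. Forward pass with causal spatiotemporal attention
    v_pred = model(Z_noisy, t, causal_mask=True)

    # 6. Compute loss only on non-clean frames
    v_target = noise - Z  # target velocity
    per_frame = ((v_pred - v_target) ** 2).mean(dim=(2, 3, 4))  # [B, T]
    keep = (~clean_mask).to(per_frame.dtype)
    loss = (per_frame * keep).sum() / keep.sum().clamp(min=1)  # average over noisy frames
    return loss

# FAR Long-Video Training with Asymmetric Patchify
def patchify(Z, kernel):
    # Group each kernel x kernel patch of a frame into one token:
    # [B, T, H, W, C] -> [B, T, (H/k)*(W/k), k*k*C]
    B, T, H, W, C = Z.shape
    Z = Z.reshape(B, T, H // kernel, kernel, W // kernel, kernel, C)
    Z = Z.permute(0, 1, 2, 4, 3, 5, 6)
    return Z.reshape(B, T, (H // kernel) * (W // kernel), kernel * kernel * C)

def train_step_long(video_frames, vae, proj_long, proj_short, n=16):
    # `proj_long` / `proj_short` are placeholder projection layers (e.g., nn.Linear).
    Z = vae.encode(video_frames)  # Z: [B, T, H, W, C]
    T = Z.shape[1]

    # Split into long-term and short-term context
    Z_long = Z[:, :T-n]   # distant frames
    Z_short = Z[:, T-n:]  # recent n frames (e.g., n=16)

    # Asymmetric patchification (token counts assume an 8x8 latent)
    tokens_long = patchify(Z_long, kernel=4)    # 4 tokens/frame
    tokens_short = patchify(Z_short, kernel=1)  # 64 tokens/frame

    # Separate projection layers map both to the model hidden dimension D
    tokens_long = proj_long(tokens_long).flatten(1, 2)    # [B, (T-n)*4, D]
    tokens_short = proj_short(tokens_short).flatten(1, 2)  # [B, n*64, D]

    # Concatenate and apply causal attention + flow matching loss
    tokens = torch.cat([tokens_long, tokens_short], dim=1)
    # ... (same flow matching loss as above)
    return tokens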

4. Experimental Setup

4.1 Model Configurations

Model variants (Table 1):

Model	Layers	Hidden Size	MLP	Heads	Params
FAR-B	12	768	3072	12	130M
FAR-M	12	1024	4096	16	230M
FAR-L	24	1024	4096	16	457M
FAR-XL	28	1152	4608	18	674M
FAR-B-Long	12	768	3072	12	158M
FAR-M-Long	12	1024	4096	16	280M

The Long variants add separate long-term projection layers, so they have slightly more parameters.
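
As a rough sanity check on the parameter counts of the four base variants in the table, one can estimate the size of the transformer blocks alone. The sketch below assumes a standard DiT-style block with adaLN modulation (about $4D^2$ for attention, $8D^2$ for an MLP with 4x expansion, and $6D^2$ for the modulation layer); that breakdown is an assumption about the architecture, not something stated in these notes, and embeddings and output layers are ignored.

```python
def estimate_params(layers, hidden):
    """Approximate block parameters: attention 4*D^2 + MLP 8*D^2 + adaLN modulation 6*D^2."""
    per_block = 18 * hidden * hidden
    return layers * per_block


# (layers, hidden size, reported params in millions) for the base variants in Table 1
reported_millions = {
    "FAR-B": (12, 768, 130),
    "FAR-M": (12, 1024, 230),
    "FAR-L": (24, 1024, 457),
    "FAR-XL": (28, 1152, 674),
}

for name, (layers, hidden, reported) in reported_millions.items():
    estimate = estimate_params(layers, hidden) / 1e6
    print(f"{name}: estimated {estimate:.1f}M, reported {reported}M")
```

Under this assumption the estimates land within a few percent of the reported sizes, which suggests the table is internally consistent with a DiT-style backbone.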

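4.2 Training Configuration

Training configuration (selected from Table 8):

- VAE: DC-AE (8x8 compression), latent dim = 32
- Short-video generation (UCF-101): 16-frame sequences, 400K steps, LR $1 \times 10^{-4}$, batch size 32, SCC ratio 0.1
- Long-video prediction (Minecraft/DMLab): 300-frame sequences, short-term window of 16 frames, patchify kernel [4,4], 1M steps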

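4.3 Mapping Between Code and Paper

Paper Content	Code Path (showlab/FAR)	Notes
FAR model architecture (Table 1)	`far/models/`	DiT/SiT backbone + causal spatiotemporal attention
Causal attention mask (Fig. 3)	`far/models/`	Frame-level causal attention mask
Flow matching loss (Eq. 3)	`far/losses/`	Per-frame flow matching training objective
Stochastic Clean Context	`far/pipelines/`	Randomly replaces frames with clean latents (t=-1)
Asymmetric Patchify Kernels	`far/models/`	Separate long-term/short-term projection layers
KV Cache & Multi-Level Cache	`far/models/`	Inference-time KV cache + L1/L2 multi-level cache
DC-AE encoder	`train_dcae.py`	Image VAE training (8x8 compression, latent dim=32)
Training entry point	`train.py`	Main training script, supports distributed training via accelerate
Evaluation entry point	`test.py`	FVD/SSIM/PSNR/LPIPS evaluation
Data loading	`far/data/`	UCF-101, BAIR, Minecraft, DMLab data processing
Training configs	`options/`	Hyperparameter config files for each experiment
Evaluation metrics	`far/metrics/`	FVD, SSIM, PSNR, LPIPS computation
Training loop	`far/trainers/`	Distributed training trainer
Utilities	`far/utils/`	General-purpose helpers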

4.4 Available Checkpoints

Model	Resolution	Task	FVD	HuggingFace
FAR-L (457M)	128x128	Uncond Gen	280±11.7	guyuchao/FAR_Models
FAR-L (457M)	256x256	Uncond Gen	303±13.5	guyuchao/FAR_Models
FAR-XL (674M)	256x256	Uncond Gen	279±9.2	guyuchao/FAR_Models
FAR-B (130M)	64x64	Prediction	194.1	guyuchao/FAR_Models

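5. Experimental Results

5.1 Short-Video Generation (UCF-101)

Figure 5 explained: FVD convergence curves of FAR-L and Video DiT-L on unconditional video generation on UCF-101. FAR stays ahead of Video DiT throughout training, converging faster and reaching a lower final FVD, which supports frame-level autoregression with flow matching.

UCF-101 quantitative results (Table 3):

Method	Type	Params	Cond. FVD↓	Uncond. FVD↓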
MAGViTv2-MLM	Non-AR	307M	58†	-
MAGViTv2-AR	Token-AR	840M	109†	-
TATS	Token-AR	331M	332	420
Latte	Video-DiT	674M	-	478
OmniTokenizer	Token-AR	650M	191	-
MAGI	Frame-AR	850M	-	421
FAR-L (Ours)	Frame-AR	457M	99 (57†)	280
FAR-XL (Ours)	Frame-AR	674M	108	279

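FAR achieves the lowest unconditional FVD in the table (280 for FAR-L and 279 for FAR-XL, vs. 421 for MAGI at 850M parameters) and does not need the doubled training cost.

5.2 Short-Video Prediction (UCF-101, BAIR)

Selected from Table 4 (UCF-101, c=4, p=12):

Method	SSIM↑	PSNR↑	LPIPS↓	FVD↓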
MCVD-cp	0.658	21.82	0.088	468.1
ExtDM-K2	0.754	23.89	0.056	394.1
FAR-B (Ours)	0.818	25.64	0.037	194.1

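FAR leads by a wide margin on every metric, without elaborate multi-scale fusion or optical-flow designs.

5.3 Long-Video Prediction (Minecraft, DMLab)

Figure 1 explained: FVD vs. LPIPS scatter plot for long-video prediction (Minecraft, 300 frames). FAR sits in the lower-left corner (low FVD and low LPIPS), clearly ahead of all compared methods (FitVid, CW-VAE, Latent FDM, TECO, Perceiver AR), which demonstrates its advantage in long-range video modeling.

Test-time extrapolation vs. long-video training (Table 2):

Method	SSIM↑	PSNR↑	LPIPS↓	FVD↓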
Sliding Window	0.365	12.3	0.415	161
Naive RoPE Ext.	0.372	12.2	0.397	396
RIFLEx	0.372	12.2	0.398	391
FAR-B-Long	0.576	19.3	0.153	34

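Training directly on long videos far outperforms every test-time extrapolation method: FVD drops from 161 (sliding window, the best extrapolation baseline) to 34.

Long-video prediction quantitative results (Table 5, c=144, p=156):

Method	DMLab LPIPS↓	DMLab FVD↓	Minecraft LPIPS↓	Minecraft FVD↓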
TECO	0.157	48	0.340	116
Latent FDM	0.222	181	0.429	167
FitVid	0.491	176	0.519	956
FAR-B-Long	0.104	64	0.251	39

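On Minecraft, FAR reaches FVD = 39 (TECO: 116) and LPIPS = 0.251 (TECO: 0.340), showing strong long-range consistency. On DMLab, FAR has the best LPIPS (0.104), while TECO has the lower FVD (48 vs. 64).

Figure 9/12 explained: qualitative comparison of long-video prediction on DMLab. Given 144 context frames, the model predicts 156 frames. Across the full 300-frame sequence, FAR's predictions stay closest to the ground truth in scene structure and color consistency, whereas TECO, Latent FDM, and other methods show severe scene drift and blur in later frames.

Figure 13 explained: qualitative comparison of long-video prediction on Minecraft. FAR keeps terrain structure, sky color, and tree textures consistent in the predicted frames, whereas other methods (especially FitVid and CW-VAE) degrade severely in distant frames.

5.4 Ablation Studies

Short-term context window size (Figure 10):

Figure 10 explained: ablation of the short-term context window size. As the local context length grows from 1 to 8 frames, FVD and PSNR keep improving. Beyond 8 frames performance saturates, which supports the context redundancy hypothesis: adding more nearby frames does not keep improving performance. Hence 8 frames is chosen as the best short-term window size.

KV cache ablation (Figure 11):

Figure 11 explained: effect of the KV cache on inference speed. The four curves are: FAR without KV cache (slowest, about 1341 s for 256 frames), FAR with KV cache (about 171 s), FAR-Long without Multi-Level Cache (about 200+ s), and FAR-Long with Multi-Level KV Cache (fastest, about 104 s). The Multi-Level KV Cache gives the largest speedup for long-video inference.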

5.5 Key Numbers at a Glance

Metric	Value
FAR-L UCF-101 Cond. FVD	99 (57†)
FAR-XL UCF-101 Uncond. FVD	279
FAR-B UCF-101 Prediction FVD	194.1
FAR-B-Long Minecraft FVD (300 frames)	39
FAR-B-Long DMLab FVD (300 frames)	64
Long-video training vs. test-time extrapolation FVD	34 vs 161 (Sliding Window)
Training time reduction	~5x (vs Video DiT, 256 frames)
Training memory reduction	~6x (vs Video DiT, 256 frames)
KV Cache inference speedup	~12.9x (256 frames, with Multi-Level)
Best short-term window size	8 frames
Long-term patchify kernel	[4,4] (4 tokens/frame vs 64)
SCC ratio	0.1
