SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Authors: Samir Khaki (NVIDIA/U of Toronto), Junxian Guo (MIT), Jiaming Tang (MIT), Shang Yang (MIT), Yukang Chen (NVIDIA), Konstantinos N. Plataniotis (U of Toronto), Yao Lu (NVIDIA), Song Han (NVIDIA/MIT), Zhijian Liu (NVIDIA/UC San Diego)

Affiliations: NVIDIA, MIT, UC San Diego, University of Toronto

Venue: ICCV 2025 | arXiv: 2510.17777

Code: not yet released (no official GitHub repository found as of 2026-03-12)

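One-sentence summary: SparseVILA decouples visual token sparsification across the prefill and decoding stages. Prefill prunes redundant visual tokens in a query-agnostic way, while decoding retrieves only the tokens relevant to the current question in a query-aware way. This yields up to 4.0x prefill, 2.5x decoding, and 2.6x end-to-end speedup while maintaining or even improving accuracy on multi-turn conversation and long-video understanding.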

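1. Motivation

1.1 VLM Inference Bottlenecks

When VLMs (Vision Language Models) process high-resolution images or long videos, the number of visual tokens can reach tens or even hundreds of thousands, which causes:

Prefill stage: compute-bound, with visual tokens dominating the computation
Decoding stage: memory-bound, with the many visual entries in the KV cache creating a memory-bandwidth bottleneck
Multi-turn conversation: the same visual context must be reprocessed every turn, so latency accumulates

Figure 3 note: The share of latency spent in prefill versus decoding differs markedly across tasks. Prefill accounts for 85% on image tasks, while decoding accounts for 65% on video tasks and 58% on reasoning tasks. A single uniform sparsity strategy therefore cannot serve both stages; each stage has to be optimized for its own computational profile.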

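1.2 Limitations of Existing Methods

Existing token pruning methods fall into two categories, each with a critical flaw:

Category	Representative methods	Strength	Critical flaw
Query-Agnostic	VisionZip, PruMerge, HIRED	Independent of the text input; stable across turns	Cannot adapt to the needs of different queries; loses fine-grained information at high sparsity
Query-Aware	FastV, PDrop, SparseVLM	Dynamically selects tokens relevant to the query	Permanently deleted tokens cannot be recovered; performance drops sharply in multi-turn conversation

Figure 2 note: Even a query-aware oracle (which greedily selects the optimal token subset that keeps the output identical to the full model) loses accuracy steadily over a multi-turn conversation (about a 10% drop from R-0 to R-15 on POPE). Query-aware pruning applied at prefill is irreversible: once a token is deleted, later turns cannot recover it.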

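2. Idea

2.1 Core Insight

Key insight: Visual sparsity should not be applied uniformly across the inference pipeline; it should be decoupled according to the different computational characteristics of prefill and decoding.

Prefill stage: lightweight query-agnostic pruning that removes globally redundant tokens while keeping sufficient visual coverage
Decoding stage: aggressive query-aware retrieval that activates only the tokens relevant to the current query from the full visual KV cache kept after prefill

2.2 Design Motivation for Decoupled Sparsity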

┌─────────────────────────────────────────────────────────────┐
│                     SparseVILA Pipeline                     │
│                                                             │
│  ┌──────────────────────┐   ┌────────────────────────────┐  │
│  │   Prefill Stage      │   │   Decoding Stage (per turn)│  │
│  │                      │   │                            │  │
│  │  Visual Encoder      │   │  Query Embedding           │  │
│  │       ↓              │   │       ↓                    │  │
│  │  Salience Score      │   │  Attn(Q, V_KV) → r_i       │  │
│  │       ↓              │   │       ↓                    │  │
│  │  Prune low-salience  │   │  Select top-scoring tokens │  │
│  │  (query-agnostic)    │   │  (query-aware)             │  │
│  │       ↓              │   │       ↓                    │  │
│  │  LLM Prefill         │   │  Pack into contiguous KV   │  │
│  │  → Full Visual KV    │   │       ↓                    │  │
│  │    Cache retained    │   │  Autoregressive Decoding   │  │
│  └──────────────────────┘   └────────────────────────────┘  │
│                                                             │
│  Sparsity: light (60-75%)   Sparsity: aggressive (75-95%)   │
│  Once, reused across turns  Re-selected every turn          │
└─────────────────────────────────────────────────────────────┘

Table 1 validation: prefill-only versus decoupled sparsity on RoboVQA:

Prefill sparsity	Decode sparsity	Prefill speedup	Decode speedup	E2E speedup	RoboVQA
0%	0%	1.0x	1.0x	1.0x	86.4
90%	0%	14.6x	1.1x	1.4x	80.0
70%	85%	4.9x	1.2x	1.4x	89.1

Reallocating sparsity from prefill to decoding raises accuracy from 80.0 to 89.1 at the same 1.4x E2E speedup, which even exceeds the 86.4 of the unsparsified baseline.
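The reason two very different per-stage speedups can produce a similar end-to-end number can be illustrated with an Amdahl-style estimate. The sketch below is not from the paper: the function name `e2e_speedup` and the 25% prefill share are illustrative assumptions, and the model ignores every cost outside prefill and decoding. Only the per-stage speedups are taken from Table 1.

```python
def e2e_speedup(prefill_fraction, prefill_speedup, decode_speedup):
    """Amdahl-style end-to-end speedup of a two-stage (prefill + decode) pipeline.

    prefill_fraction is the share of *baseline* latency spent in prefill;
    the remainder is attributed to decoding (all other costs are ignored).
    """
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must lie in [0, 1]")
    decode_fraction = 1.0 - prefill_fraction
    new_latency = prefill_fraction / prefill_speedup + decode_fraction / decode_speedup
    return 1.0 / new_latency


# Hypothetical workload: prefill is 25% of baseline latency (an assumption, not a paper figure).
PREFILL_FRACTION = 0.25
prefill_only = e2e_speedup(PREFILL_FRACTION, 14.6, 1.1)  # stage speedups from Table 1, row 2
decoupled = e2e_speedup(PREFILL_FRACTION, 4.9, 1.2)      # stage speedups from Table 1, row 3
print(f"prefill-only: {prefill_only:.2f}x, decoupled: {decoupled:.2f}x")
```

Under this assumed latency split both configurations land between 1.4x and 1.5x: once prefill is a minority of the latency, pushing its speedup from 4.9x to 14.6x buys little, while a small decoding gain compensates for it.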

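2.3 Discovery of Visual Attention Sinks and Retrieval Tokens

Figure 8 note: Attention analysis of visual tokens across LLM layers of LLaVA-1.5. In Layer 2 (shallow), attention concentrates on a few fixed tokens. These are Visual Attention Sinks, "anchor" tokens that stay stable across queries. In Layer 19 (deep), the attention pattern changes with the query and Visual Retrieval Tokens appear, responding dynamically to query content. SparseVILA's decoupled design captures both kinds: prefill keeps the sink tokens (high global importance) and decoding retrieves the retrieval tokens (high query relevance). IoU analysis: the cross-query IoU is 70% at Layer 2 (stable sinks), 21% at Layer 19 (variable retrieval tokens), and 40% for SparseVILA overall.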
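To make the cross-query IoU concrete, here is a minimal sketch of how the overlap between the token sets chosen for two different queries could be measured. The helper names and the toy scores are hypothetical; the note does not say how the paper computes its IoU, so this is only the standard intersection-over-union of two top-k index sets.

```python
def topk_indices(scores, k):
    """Indices of the k highest-scoring tokens, as a set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def selection_iou(first, second):
    """Intersection-over-union of two sets of selected token indices."""
    first, second = set(first), set(second)
    union = first | second
    return len(first & second) / len(union) if union else 1.0


# Toy attention scores over six visual tokens for two different queries (made-up numbers).
scores_query_a = [0.90, 0.80, 0.10, 0.70, 0.20, 0.05]
scores_query_b = [0.85, 0.10, 0.75, 0.80, 0.05, 0.20]

iou = selection_iou(topk_indices(scores_query_a, 3), topk_indices(scores_query_b, 3))
print(iou)  # tokens {0, 1, 3} vs {0, 2, 3}: two shared out of four distinct, so 0.5
```

A high IoU means the same tokens are picked regardless of the query (sink-like behaviour); a low IoU means the selection follows the query (retrieval-like behaviour).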

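3. Method

3.1 Prefill Stage: Query-Agnostic Pruning

Goal: Without knowing the user query, remove unimportant tokens based on visual redundancy.

Token Salience Estimation

The salience computation depends on the type of vision encoder:

Case 1: Encoders with a summary token (e.g., CLIP)

s_{i} = \mathrm{Attn}([\mathrm{CLS}], x_{i}) = \frac{\exp(q_{\mathrm{CLS}} \cdot k_{i} / \sqrt{d})}{\sum_{j} \exp(q_{\mathrm{CLS}} \cdot k_{j} / \sqrt{d})}

where $s_{i}$ is the attention that the global [CLS] token assigns to the $i$-th visual token, $q_{\mathrm{CLS}}$ is the [CLS] query, $k_{i}$ is the key of token $i$, and $d$ is the head dimension.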
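A minimal pure-Python sketch of the Case 1 score follows. It assumes a single attention head and the standard $\sqrt{d}$ scaling; the function names and the toy vectors are illustrative and are not taken from any released SparseVILA code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def cls_salience(q_cls, keys):
    """Salience s_i = softmax_i(q_cls . k_i / sqrt(d)) for a single head."""
    scale = 1.0 / math.sqrt(len(q_cls))
    logits = [scale * sum(q * k for q, k in zip(q_cls, key)) for key in keys]
    return softmax(logits)


# Toy example: three visual tokens with head dimension d = 4 (made-up values).
q_cls = [1.0, 0.0, 0.0, 0.0]
keys = [
    [2.0, 0.0, 0.0, 0.0],   # aligned with the [CLS] query
    [0.0, 2.0, 0.0, 0.0],   # orthogonal to it
    [-2.0, 0.0, 0.0, 0.0],  # anti-aligned
]
salience = cls_salience(q_cls, keys)
print(salience)
```

The scores form a probability distribution over the visual tokens, so the token most aligned with the [CLS] query receives the largest salience.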

Case 2: Encoders with multiple summary tokens (e.g., RADIO)

s_{i} = \frac{1}{|S|} \sum_{s \in S} \mathrm{Attn}(s, x_{i})

The attention from all summary tokens in $S$ is averaged.

Case 3: Encoders without a summary token (e.g., SigLIP, QwenVL)

s_{i} = \frac{1}{N} \sum_{j=1}^{N} \mathrm{Attn}(x_{j}, x_{i})

The intra-visual attention that token $i$ receives from all $N$ visual tokens is averaged.
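Cases 2 and 3 share one structure: average, over a set of query vectors, the attention each visual token receives. The sketch below assumes a single head and that query and key vectors are supplied directly; `mean_attention_salience` is a hypothetical name. Passing the summary-token queries gives Case 2, and passing the visual tokens' own queries gives Case 3.

```python
import math


def attention_row(query, keys):
    """Softmax attention of one query over all keys (single head, sqrt(d) scaling)."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def mean_attention_salience(queries, keys):
    """s_i = mean over queries of Attn(query, x_i)."""
    salience = [0.0] * len(keys)
    for query in queries:
        for i, prob in enumerate(attention_row(query, keys)):
            salience[i] += prob
    return [s / len(queries) for s in salience]


# Case 3 toy example: tokens 0 and 1 are duplicates, token 2 is different.
# For simplicity the same vectors serve as queries and keys (an assumption).
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
salience = mean_attention_salience(tokens, tokens)
print(salience)
```

Because every attention row sums to one, the averaged salience also sums to one, which keeps scores comparable across inputs of different length.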

Pruning rule: Rank tokens by salience and remove the $(1 - r_{p}) \times N$ tokens with the lowest salience, where $r_{p}$ is the prefill retention ratio.
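The pruning rule can be sketched as a top-k selection. The function name, the rounding of $r_{p} \times N$ to the nearest integer, and returning the kept indices in their original order are all assumptions of this sketch rather than details stated in the note.

```python
def prune_by_salience(salience, keep_ratio):
    """Return the indices of the tokens kept after dropping the least salient ones.

    keep_ratio plays the role of r_p; indices come back in original order so the
    spatial/temporal ordering of the surviving tokens is preserved.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in (0, 1]")
    num_tokens = len(salience)
    num_keep = max(1, round(keep_ratio * num_tokens))
    ranked = sorted(range(num_tokens), key=lambda i: salience[i], reverse=True)
    return sorted(ranked[:num_keep])


# Ten tokens with made-up salience scores; keep 40% of them.
salience = [0.05, 0.30, 0.10, 0.25, 0.02, 0.08, 0.04, 0.03, 0.07, 0.06]
kept = prune_by_salience(salience, keep_ratio=0.4)
print(kept)  # the four most salient tokens: [1, 2, 3, 5]
```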

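Efficient Triton Kernel Implementation

For long videos (hundreds of thousands of tokens), explicitly materializing the full attention matrix exhausts memory. The paper implements a custom Triton kernel:

Streaming softmax normalization plus salience accumulation, with no need to materialize the full attention matrix
About 3x speedup for SigLIP-style encoders and about 10x for QwenVL-style encoders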
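The streaming idea can be illustrated in plain Python. This is not the paper's Triton kernel: it is a sketch that assumes a two-pass scheme per query (an online-softmax pass over key blocks to obtain the row maximum and normalizer, then a second pass that adds normalized probabilities to the salience vector), so only one block of logits exists at a time. The function names and `block_size` are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def streaming_salience(queries, keys, block_size=2):
    """Mean attention received by each key, without building the N x N matrix."""
    num_keys = len(keys)
    scale = 1.0 / math.sqrt(len(keys[0]))
    salience = [0.0] * num_keys
    for query in queries:
        # Pass 1: online softmax statistics (running max and normalizer) over key blocks.
        running_max, running_sum = -math.inf, 0.0
        for start in range(0, num_keys, block_size):
            logits = [scale * dot(query, key) for key in keys[start:start + block_size]]
            new_max = max(running_max, max(logits))
            # Rescale the old normalizer to the new max before adding this block.
            running_sum = running_sum * math.exp(running_max - new_max)
            running_sum += sum(math.exp(x - new_max) for x in logits)
            running_max = new_max
        # Pass 2: recompute logits block by block and accumulate normalized weights.
        for start in range(0, num_keys, block_size):
            for offset, key in enumerate(keys[start:start + block_size]):
                weight = math.exp(scale * dot(query, key) - running_max) / running_sum
                salience[start + offset] += weight
    return [s / len(queries) for s in salience]


queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
blocked = streaming_salience(queries, keys, block_size=2)
unblocked = streaming_salience(queries, keys, block_size=len(keys))
print(blocked)
```

The blocked and unblocked results agree up to floating-point error, which is the property that lets a kernel trade recomputation for memory: peak storage is one block of logits plus the salience vector.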

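Figure 5 note: Triton kernel performance. (a) Vision encoder side: for SigLIP at 512 frames, the naive implementation takes about 300 ms versus about 100 ms for the Triton kernel (3x speedup); for QwenVL at a sequence length of 256K, the naive implementation takes about 15 s versus about 1.5 s (10x speedup). (b) LLM decoder side: the Triton kernel gives about 1.5x speedup on a Llama2 backbone and about 1.8x on a Qwen2 backbone.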

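3.2 Decoding Stage: Query-Aware Retrieval

Goal: During decoding, retrieve from the retained visual KV cache only the tokens most relevant to the current query.

Query-Aware Token Selection

Before decoding begins, compute the aggregate attention strength between the query embeddings and the visual KV cache entries:

r_{i} = \sum_{q \in Q} \mathrm{Attn}(q, v_{i}^{KV})

where $Q$ is the set of query tokens and $v_{i}^{KV}$ is the KV cache entry of the $i$-th visual token. The $(1 - r_{d}) \times N^{'}$ tokens with the highest $r_{i}$ are selected, where $r_{d}$ is the decoding sparsity and $N^{'}$ is the number of visual tokens remaining after prefill pruning. Note that $r_{p}$ above is a retention ratio, whereas $r_{d}$ is a sparsity.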
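A minimal sketch of the selection step follows. It assumes a single head and that each query token's attention is normalized over the visual keys only; both are simplifications not stated in the note, and the function names are hypothetical.

```python
import math


def attention_over_visual(query, visual_keys):
    """Softmax attention of one query token over the cached visual keys."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in visual_keys]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def retrieval_scores(query_tokens, visual_keys):
    """r_i = sum over query tokens of Attn(q, v_i)."""
    scores = [0.0] * len(visual_keys)
    for query in query_tokens:
        for i, prob in enumerate(attention_over_visual(query, visual_keys)):
            scores[i] += prob
    return scores


def select_for_decoding(scores, decode_sparsity):
    """Keep the (1 - r_d) fraction of cached visual tokens with the highest r_i."""
    num_keep = max(1, round((1.0 - decode_sparsity) * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


# Four cached visual keys and a two-token text query (made-up vectors).
visual_keys = [[3.0, 0.0], [0.0, 3.0], [1.0, 1.0], [-1.0, -1.0]]
query_tokens = [[1.0, 0.0], [2.0, 0.0]]
scores = retrieval_scores(query_tokens, visual_keys)
selected = select_for_decoding(scores, decode_sparsity=0.5)
print(selected)  # tokens 0 and 2 point most strongly along the query direction
```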

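Key design points:

Selected KV entries are packed compactly into a contiguous memory region, avoiding sparse access patterns
Unselected tokens are not deleted; they stay in the KV cache and can be retrieved in later turns
The selection runs concurrently with the FlashAttention2 prefill path, at an extra cost of only 1.5x that of a naive implementation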
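The first two design points, packing without deleting, can be sketched with a toy cache. The class `VisualKVCache` and its `pack` method are hypothetical; a real implementation would gather rows of a GPU tensor into a contiguous buffer, whereas Python lists here only mimic the gather-and-keep behaviour.

```python
class VisualKVCache:
    """Toy visual KV cache: selection never mutates the stored entries."""

    def __init__(self, keys, values):
        self.keys = list(keys)
        self.values = list(values)

    def pack(self, selected_indices):
        """Gather the selected entries into new, densely packed lists."""
        order = sorted(set(selected_indices))
        packed_keys = [self.keys[i] for i in order]
        packed_values = [self.values[i] for i in order]
        return packed_keys, packed_values


cache = VisualKVCache(keys=["k0", "k1", "k2", "k3"], values=["v0", "v1", "v2", "v3"])
turn1_keys, turn1_values = cache.pack([2, 0])  # the turn-1 query selects tokens 0 and 2
turn2_keys, turn2_values = cache.pack([1, 3])  # turn 2 can still reach tokens 1 and 3
print(turn1_keys, turn2_keys, len(cache.keys))
```

Because each turn reads from the untouched full cache, a token skipped in one turn remains available to the next, which is what distinguishes retrieval from permanent pruning.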

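Rotary Position Embedding Handling

Different VLMs use different position-encoding schemes, so pruning requires scheme-specific handling:

Position encoding	Models	Handling
Unified RoPE	LLaVA-NeXT, LongVILA	Keep a contiguous position-index range for the selected tokens
Multimodal RoPE	Qwen2.5-VL	Rebuild a minimal contiguous position grid along the temporal/height/width dimensions and shift the subsequent text positions to preserve global continuity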
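One plausible reading of the Multimodal RoPE row is a per-axis rank compaction: along each of the temporal, height, and width axes, the surviving coordinate values are renumbered 0, 1, 2, ... in order. The sketch below implements that reading only; the function names are hypothetical, and placing the first text token at the largest compacted index plus one is an assumption about how text positions are shifted.

```python
def compact_axis(values):
    """Map the distinct values on one axis to consecutive ranks 0, 1, 2, ..."""
    rank = {value: r for r, value in enumerate(sorted(set(values)))}
    return [rank[value] for value in values]


def compact_mrope_positions(positions):
    """positions: (t, h, w) triples of the selected visual tokens.

    Returns the compacted triples and the position assumed for the first
    text token that follows the visual block.
    """
    t_axis = compact_axis([p[0] for p in positions])
    h_axis = compact_axis([p[1] for p in positions])
    w_axis = compact_axis([p[2] for p in positions])
    compacted = list(zip(t_axis, h_axis, w_axis))
    next_text_position = max(max(triple) for triple in compacted) + 1
    return compacted, next_text_position


# Four surviving tokens from a sparse (t, h, w) grid (made-up coordinates).
selected_positions = [(0, 0, 0), (0, 0, 3), (4, 2, 3), (9, 2, 5)]
compacted, next_text_position = compact_mrope_positions(selected_positions)
print(compacted, next_text_position)
```

Relative order along every axis is preserved while the gaps left by pruned tokens disappear, so the text that follows starts right after a dense visual block.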

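3.3 Overall Implementation Notes

Prefill performs global pruning once and retains the full visual KV cache
Decoding retrieves the relevant tokens each turn and packs them into a contiguous KV cache
The two-stage sparsity targets the compute-bound and memory-bound bottlenecks separately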

4. Experimental Setup

4.1 Evaluated Models and Sparsity Configurations

Image benchmark: LLaVA-NeXT-7B, Prefill 60% / Decode 75%
Video understanding: LongVILA-7B (256f), Qwen2.5-VL-7B (4fps), Nemotron-Nano-VL-8B (256f)
Video captioning: LongVILA-7B (256f), Prefill 75% / Decode 90%
Physical reasoning: Cosmos-Reason1-7B (24fps), Prefill 75% / Decode 95%

4.2 Evaluation Tasks and Benchmarks

Image benchmarks: AI2D, ChartQA, DocVQA, GQA, InfoVQA, MME, POPE, ScienceQA, TextVQA
Video understanding: LVB, MLVU, NExT-QA, Video-MME
Video captioning: Video-ChatGPT metrics CI, DO, CU, TU, C, Overall
Physical reasoning: HoloAssist, RoboFail, RoboVQA
Visual retrieval: V-NIAH (Visual Needle-in-a-Haystack)
Efficiency analysis: kernel runtime, overhead, theoretical versus measured speedup

4.3 Remarks

The paper reports the prefill/decoding latency split for different tasks, but the exact prompt templates, decoding hyperparameters, and some implementation details are not covered in this note.
All results below are compiled from the tables and figures in the paper.

5. Experimental Results

5.1 Image Benchmark (Table 2)

Method	Prefill sparsity (P)	Decode sparsity (D)	E2E speedup	AI2D	ChartQA	DocVQA	GQA	InfoVQA	MME	POPE	SQA	TextVQA
Baseline	0	0	1.0x	63.9	53.0	63.6	63.5	28.4	1857.8	84.5	69.3	58.2
FastV	.80	0	1.2x	61.8	31.6	33.5	55.3	22.0	1568.2	76.7	66.7	52.7
SparseVLM	.75	0	1.2x	63.2	39.9	41.8	59.7	22.2	1823.9	83.4	69.6	57.6
VisionZip	.80	0	1.2x	62.9	38.2	48.5	60.3	24.2	1727.4	84.1	67.9	57.1
SparseVILA	.60	.75	1.2x	64.1	47.8	58.0	62.7	25.6	1831.0	85.8	69.6	59.1

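Key finding: At the same 1.2x E2E speedup, SparseVILA reduces accuracy degradation on document and chart understanding tasks by more than 15% relative to prior art. On AI2D, POPE, ScienceQA, and TextVQA it even exceeds the baseline.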

5.2 Video Understanding (Table 3)

Model	Prefill sparsity (P)	Decode sparsity (D)	Prefill speedup	Decode speedup	E2E speedup	LVB	MLVU	NExT-QA	Video-MME
LongVILA-7B (256f) baseline	0	0	1.0x	1.0x	1.0x	53.8	64.9	78.6	58.8
+ SparseVILA	.75	.90	5.1x	1.6x	2.1x	54.1	65.3	79.0	58.7
Qwen2.5-VL-7B (4fps) baseline	0	0	1.0x	1.0x	1.0x	59.2	65.5	76.0	62.3
+ SparseVILA	.75	.90	6.0x	2.0x	1.9x	60.1	70.7	81.9	66.3
Nemotron-Nano-VL-8B (256f) baseline	0	0	1.0x	1.0x	1.0x	55.3	60.9	75.8	55.2
+ SparseVILA	.75	.95	4.0x	2.5x	2.6x	55.9	63.1	76.6	56.6

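Key findings:

Across the video understanding benchmarks, SparseVILA not only preserves accuracy but generally improves it (on Qwen2.5-VL: MLVU +5.2, NExT-QA +5.9)
Across the three models in the table, the speedups reach up to 6.0x for prefill, 2.5x for decoding, and 2.6x end to end
The accuracy gains are attributed to the compact KV cache, which helps the model focus on the key visual cues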

5.3 Video Captioning (Table 4)

Model: LongVILA-7B (256f) | Prefill 75% / Decode 90%

Method	Prefill speedup	Decode speedup	E2E speedup	CI	DO	CU	TU	C	Overall
Baseline	1.0x	1.0x	1.0x	2.34	2.21	2.81	1.70	2.46	2.31
VisionZip	28.5x	1.5x	2.1x	2.04	2.03	2.56	1.71	2.11	2.09
PruMerge	28.5x	1.5x	2.1x	2.07	2.00	2.57	1.75	2.10	2.10
SparseVILA	5.1x	1.6x	2.1x	2.35	2.27	2.85	1.90	2.39	2.35

At the same 2.1x speedup as the pruning baselines, SparseVILA raises the overall Video-ChatGPT score from the unsparsified 2.31 to 2.35 (temporal understanding +0.2).

5.4 Physical Reasoning (Table 6)

Model: Cosmos-Reason1-7B (24fps) | Prefill 75% / Decode 95%

Method	Prefill speedup	Decode speedup	E2E speedup	HoloAssist	RoboFail	RoboVQA	Average
Baseline	1.0x	1.0x	1.0x	72.0	54.0	88.2	71.4
FastV	1.0x	1.1x	1.3x	46.0	37.0	80.9	52.5
SparseVILA	0.4x	2.0x	1.9x	75.0	58.0	94.5	75.9

SparseVILA achieves a 1.9x E2E speedup together with a 4.5-point accuracy gain (average from 71.4 to 75.9).

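5.5 Visual Retrieval (V-NIAH)

Figure 6 note: Heatmaps of Visual Needle-in-a-Haystack retrieval accuracy. SparseVLM and FastV break down beyond 32 frames (an 8K context limit), with large regions of the heatmap indicating retrieval failure. SparseVILA keeps near-perfect retrieval accuracy up to 200 frames (the heatmap is almost entirely green), showing a clear advantage in long-context settings. This follows from the decoupled design: prefill retains the full KV cache and decoding retrieves from it dynamically.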

5.6 Efficiency Analysis

Figure 7 note: Speedup of the decoding attention kernel. SparseVILA achieves a substantial speedup on every model/task combination: LLaVA-NeXT [Image] 3.3x, InternVL-3.5 [Image] 4.7x, LongVILA [Video] 4.2x, Nemotron [Video] 11.4x, Qwen2.5-VL [Video] 4.2x, Cosmos [Reason] 6.8x. The speedup comes from shrinking the effective KV cache size, which reduces memory traffic in memory-bound decoding.

Overhead comparison (Table 7):

Method	LongVILA-7B CUDA time	Overhead %	Theoretical speedup	Measured speedup
PruMerge	448.3ms	1.84%	7.41x	6.49x
VisionZip	206.3ms	0.85%	7.41x	6.94x
SparseVILA	94.9ms	0.39%	3.98x	3.91x

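SparseVILA's overhead is only about one fifth of PruMerge's (94.9 ms vs 448.3 ms), and its gap between theoretical and measured speedup is the smallest (3.98x vs 3.91x), which indicates an efficient kernel design.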
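The gap between theoretical and measured speedup is easier to compare as a ratio. The short calculation below uses only the Table 7 numbers; the helper name `realized_fraction` is illustrative.

```python
def realized_fraction(theoretical, measured):
    """Share of the theoretical speedup that is actually observed."""
    return measured / theoretical


# (theoretical speedup, measured speedup) from Table 7.
table7 = {
    "PruMerge": (7.41, 6.49),
    "VisionZip": (7.41, 6.94),
    "SparseVILA": (3.98, 3.91),
}
fractions = {name: realized_fraction(*pair) for name, pair in table7.items()}
for name, value in fractions.items():
    print(f"{name}: {value:.1%}")
```

This gives roughly 87.6% for PruMerge, 93.7% for VisionZip, and 98.2% for SparseVILA, which matches the ordering of the overhead column: the less time spent on token selection, the closer the measured speedup gets to the theoretical one.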

Memory savings: Token pruning at prefill additionally reduces KV memory by 72.5% and linear FLOPs by 87.6% (LongVILA-7B).

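6. Code Implementation Mapping

Note: As of 2026-03-12, the official SparseVILA code has not been released. The modules below are guesses at the key implementation components based on the paper's description.

Paper component	Guessed implementation location	Key technique
Token Salience Estimation	`visual_encoder/salience.py`	Triton kernel with streaming softmax + accumulation
Prefill Pruning	`prefill/prune.py`	Top-k selection on salience scores
Query-Aware Retrieval	`decode/retrieval.py`	Attn(Q, V_KV) computation, concurrent with FlashAttention2
KV Cache Packing	`decode/kv_packing.py`	Contiguous memory packing to avoid sparse access
RoPE Reconstruction	`decode/rope_handler.py`	Unified RoPE: keep contiguous indices; M-RoPE: rebuild a minimal T/H/W grid
Multi-Turn KV Eviction	`eval/multi_turn.py`	Each turn evicts only the previous turn's Q/A KV entries
Inference Pipeline	Built on TinyChat	W8A8 (SmoothQuant) + W4A16 (AWQ) quantization

Related open-source projects:

SparseVLMs: a different paper (SparseVLM) with a related idea, query-aware token pruning
mit-han-lab/Block-Sparse-Attention: sparse attention kernels from the same group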
