V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval

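1. Motivation

This paper studies real-time inference acceleration for streaming video LLMs on edge devices. A streaming video LLM (e.g., VideoLLM-Online) must continuously ingest video frames and answer user questions in real time; its core bottleneck is that the KV cache keeps growing with the video stream.

The authors identify fundamental problems with existing approaches at three levels:

KV cache memory explosion: the KV cache grows as $O(N^{2} T)$ (where $N^{2}$ is the spatial resolution and $T$ is the temporal length). Under an edge-GPU setting of 10 FPS and batch size 4, a few minutes of video exceeds the 32 GB memory capacity;
Existing KV cache retrieval algorithms break down in the prefill stage: methods such as InfiniGen optimize only the generation stage, whereas the main bottleneck of a streaming video LLM is the iterative prefill stage. At a KV cache length of 40K, 83% of the latency comes from prefill, of which 74% comes from KV retrieval;
Fixed top-k selection is inflexible: GPU-oriented algorithms (FlexGen, InfiniGen, ReKV) rely on fixed top-k selection. But the token-importance distribution varies widely across transformer layers and attention heads (the share of important tokens ranges from 4.2% to 44.0%), so a fixed k either wastes bandwidth or loses accuracy.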
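
To make the memory growth concrete, here is a rough back-of-the-envelope sketch. The layer count, KV head count, and head dimension are the public Llama-3 8B configuration; `tokens_per_frame = 10` is an illustrative assumption and not a number from the paper, so the absolute result matters less than the linear growth with video length.

```python
# Per-token KV cache size for a Llama-3-8B-style decoder:
# 32 layers, 8 KV heads (grouped-query attention), head dimension 128, BF16 values.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # BF16


def kv_bytes_per_token() -> int:
    # The factor 2 covers both keys and values.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_gib(seconds: float, fps: float, tokens_per_frame: int, batch: int) -> float:
    """KV cache size in GiB after `seconds` of video."""
    tokens = seconds * fps * tokens_per_frame * batch
    return tokens * kv_bytes_per_token() / 2**30


# tokens_per_frame=10 is an assumption made for illustration only.
per_token_kib = kv_bytes_per_token() / 1024
five_minutes = kv_cache_gib(seconds=300, fps=10, tokens_per_frame=10, batch=4)
print(f"{per_token_kib:.0f} KiB per token, {five_minutes:.2f} GiB after 5 minutes")
```

The cache size scales with every factor in the product, so a higher spatial token count per frame or a longer stream pushes it past a fixed memory budget proportionally sooner; the model weights also have to fit in the same memory.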

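The central question of the paper is therefore: how can a streaming video LLM run in real time (>2 FPS) on a resource-constrained edge device while keeping accuracy nearly lossless?

2. Idea

The core idea of V-Rex is software-hardware co-design (SW-HW co-design): on the algorithm side, ReSV reduces the amount of KV cache that must be retrieved; on the hardware side, the DRE accelerates the irregular retrieval computation.

More concretely, V-Rex innovates along three dimensions:

ReSV algorithm (training-free): exploits the spatio-temporal similarity between video frames. Hash-bit key clustering groups similar tokens together, and WiCSum thresholding then dynamically selects the most important clusters instead of a fixed top-k. Different layers and heads can thus adaptively select different numbers of tokens;
DRE hardware engine: a compact hardware unit designed for ReSV's irregular computation (bit-level XOR, conditional branches, early-exit sorting), taking only 2.0% of chip area and 2.4% of power;
Hierarchical KV cache management (KVMU): a hash-cluster-based memory mapping that stores the tokens of one cluster contiguously, maximizing PCIe bandwidth utilization.

Compared with software-only schemes such as ReKV and InfiniGen, the fundamental difference is that V-Rex does not force a retrieval algorithm onto a GPU; it designs dedicated hardware for the retrieval operation, cutting KV prediction latency from 23% of total computation to 0.5%.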

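3. Method

3.1 Streaming Video LLM Architecture and Workflow

A streaming video LLM consists of three core modules: a Vision Tower (e.g., SigLIP-ViT-L-384), an MLP Projector, and an LLM (e.g., Llama-3 8B).

Reading Figure 2: the figure shows the workflow of a streaming video LLM. The video stream arrives continuously, the user may ask a question at any moment (e.g., "How do I make French toast step by step?"), and the model must answer in real time. Unlike an offline video LLM, frames arrive one at a time in streaming mode, and each frame requires one iterative prefill.

Reading Figure 3: the figure details the model architecture. Each frame passes through the Vision Tower and the MLP to produce visual tokens, which then enter the LLM for iterative prefill. The key point is that the prefill of every frame attends to the KV cache of all previous frames, so the KV cache keeps accumulating until the video ends. In the generation stage, the model produces the answer from the accumulated frame KV cache and the question tokens only.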

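3.2 Challenges of Existing KV Cache Retrieval

Reading Figure 4: the figure uses three sets of data to expose the core bottlenecks. (a) The KV cache memory footprint grows linearly with video duration; at 10 FPS and batch size 4 it exceeds the edge GPU's 32 GB of memory within minutes. (b) As the KV cache sequence length grows, the share of prefill in end-to-end latency rises from ~60% to ~83% (at 80K). (c) At a 40K KV cache, InfiniGen's KV retrieval accounts for only 23% of the operations but 79% of the latency, of which 40% is KV prediction computation and 39% is KV cache fetch.

Three pain points of existing approaches:

Pain point 1: retrieval algorithms perform poorly in the prefill stage. Methods such as InfiniGen are designed for the generation stage (one query token per step), but the prefill stage of a streaming video LLM has many query tokens (all visual tokens of a frame), which greatly increases the KV prediction workload.

Pain point 2: GPU-oriented algorithms incur high compute and transfer overhead. In the prefill stage, KV prediction needs the full $Q \times K^{T}$ matrix multiplication to score each token's importance, which is itself expensive. PCIe transfers between CPU and GPU (4-32 GB/s, far below the 1-2 TB/s of GPU memory bandwidth) make the latency worse.

Pain point 3: fixed top-k is inflexible. The number of tokens needed differs greatly across layers and heads. A fixed k leads either to over-fetching (wasted bandwidth and energy) or to under-fetching (lost accuracy).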
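
The bandwidth gap in pain point 2 can be illustrated with simple arithmetic. The sketch below estimates how long it would take to move the KV cache of 40K tokens over links of different speeds. The 128 KiB per-token size assumes a Llama-3-8B-style configuration in BF16 (32 layers, 8 KV heads, head dimension 128); it is an assumption of this example, not a figure quoted from the paper.

```python
def transfer_seconds(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and latency."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# Assumption: 2 (K and V) * 32 layers * 8 KV heads * 128 dims * 2 bytes (BF16).
KV_BYTES_PER_TOKEN = 131_072
cached_tokens = 40_000
total_bytes = cached_tokens * KV_BYTES_PER_TOKEN

links = [("PCIe, 4 GB/s", 4), ("PCIe, 32 GB/s", 32), ("GPU memory, 1 TB/s", 1000)]
for name, bandwidth in links:
    print(f"{name}: {transfer_seconds(total_bytes, bandwidth) * 1e3:.1f} ms")
```

Even under these idealized assumptions, fetching the whole cache over the slowest link takes longer than the 500 ms per-frame budget implied by a 2 FPS target, which is why fetching only a selected subset matters.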

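3.3 V-Rex's Software-Hardware Co-design Strategy

Reading Figure 5: the figure shows V-Rex's three levels of optimization. (i) Vanilla KV Cache on Storage: the simplest offloading, where each layer first loads its KV and then runs Attention+FFN. (ii) +SW Optimization (adding ReSV): KV Prediction runs while KV is being loaded and predicts the tokens the next layer needs, which overlaps the pipeline stages, but KV prediction itself still costs time. (iii) +HW Optimization (adding DRE): KV Prediction runs on the dedicated DRE hardware and fully overlaps with the LLM's Attention+FFN, giving the largest latency reduction.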
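
A toy latency model can make the three configurations easier to compare. The formulas below are a simplified reading of the description above, and every timing value is made up for illustration; none of it is taken from the paper.

```python
def per_layer_latency(mode: str, t_load_full: float, t_load_sel: float,
                      t_pred_sw: float, t_pred_hw: float, t_compute: float) -> float:
    """Toy per-layer latency under three overlap assumptions (arbitrary time units)."""
    if mode == "vanilla":
        # (i) Load the full KV cache of the layer, then compute.
        return t_load_full + t_compute
    if mode == "sw":
        # (ii) Prediction still occupies the main engine; the load of the
        # selected KV overlaps with Attention+FFN.
        return t_pred_sw + max(t_load_sel, t_compute)
    if mode == "hw":
        # (iii) Prediction and prefetch run on a separate engine, next to compute.
        return max(t_pred_hw + t_load_sel, t_compute)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical timings, chosen only to show the trend.
timings = dict(t_load_full=10.0, t_load_sel=3.0, t_pred_sw=4.0, t_pred_hw=0.5, t_compute=5.0)
for mode in ("vanilla", "sw", "hw"):
    print(mode, per_layer_latency(mode, **timings))
```

In this model the hardware variant becomes compute-bound as soon as prediction plus prefetch fit under the Attention+FFN time, which matches the intent of configuration (iii).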

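3.4 The ReSV Algorithm

Reading Figure 6: the figure is the overall flow of ReSV. The algorithm has two stages: (1) the KV Cache Retrieval Stage (run right after the current layer generates QKV, prefetching KV for the next layer) and (2) the Execution Stage (light attention over the retrieved KV). KV prediction has two core steps: hash-bit key clustering groups similar tokens, then WiCSum thresholding dynamically selects the most important clusters.

The two core steps of ReSV: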

3.4.1 Hash-bit Key Clustering

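Reading Figure 7: the figure gives the rationale for hash-bit clustering. (a) The cosine-similarity heatmap between key tokens of adjacent frames shows strong similarity along the diagonal, meaning that tokens across video frames are highly similar in space and time and can be clustered effectively. (b) A scatter plot shows a correlation of 0.8 between cosine similarity and hash-bit Hamming distance, indicating that Hamming distance over low-dimensional hash-bits can approximate high-dimensional cosine similarity and thus cut the computation substantially.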
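
The link between cosine similarity and Hamming distance is a standard property of random-hyperplane hashing: for one random hyperplane, two vectors at angle $\theta$ land on opposite sides with probability $\theta / \pi$. The sketch below uses only the Python standard library; the dimension, seeds, and noise level are arbitrary choices for illustration, and the function names are hypothetical.

```python
import math
import random


def make_hyperplanes(dim: int, n_hp: int, seed: int = 0) -> list[list[float]]:
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_hp)]


def hash_bits(key: list[float], hyperplanes: list[list[float]]) -> int:
    """Pack one sign bit per hyperplane into an int (> 0 -> 1, <= 0 -> 0)."""
    bits = 0
    for plane in hyperplanes:
        projection = sum(k * p for k, p in zip(key, plane))
        bits = (bits << 1) | (1 if projection > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    # XOR marks the differing bits, counting the ones is the popcount.
    return bin(a ^ b).count("1")


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(1)
dim, n_hp = 64, 32
planes = make_hyperplanes(dim, n_hp)
key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
near = [k + 0.05 * rng.gauss(0.0, 1.0) for k in key]   # slightly perturbed copy
opposite = [-k for k in key]                            # cosine similarity -1

for name, other in [("near", near), ("opposite", opposite)]:
    dist = hamming(hash_bits(key, planes), hash_bits(other, planes))
    print(f"{name}: cosine={cosine(key, other):.3f}, hamming={dist}")
```

A nearly identical key differs in few bits, while the negated key flips every bit, so the integer distance tracks the angle without any floating-point work at comparison time.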

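Reading Figure 8: the figure details the data flow of hash-bit key clustering in two steps. (1) Hash-bit Generation: the key matrix is multiplied by $N_{hp}$ random hyperplanes and binarized ($\leq 0 \to 0$, $> 0 \to 1$), yielding key hash-bits whose dimension is only $\leq 0.5\%$ of the original key dimension. (2) Hamming Distance Clustering: the hash-bit of the current key is XORed with the hash-bits of existing clusters to compute Hamming distances, and a token whose distance is below the threshold $Th_{hd}$ joins that cluster. The result is stored in the Hash Cluster (HC) Table, which holds the cluster index, token indices, $Key_{cluster}$ (the mean of the keys in the cluster), hash-bit, and token count.

Hash-bit generation is very cheap:

The original key dimension $N_{embedding}$ is compressed to $N_{hp}$ bits ($N_{hp} \leq 0.5\% \times N_{embedding}$)
Hamming distance is computed with bit-wise XOR + popcount, with no floating-point arithmetic
Clustering is updated incrementally as each frame arrives, with no recomputation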
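
The incremental update can be sketched as a small table of clusters. This is a minimal illustration under stated assumptions: the `Cluster` fields loosely mirror the HC Table columns described above, the representative hash-bit is taken from the first member (the notes do not say how it is maintained), and the hash-bits in the example are hand-picked 8-bit values rather than real projections.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    hashbit: int                      # representative hash-bit (first member; an assumption)
    key_mean: list[float]             # running mean of member keys (Key_cluster)
    token_ids: list[int] = field(default_factory=list)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def assign(clusters: list[Cluster], token_id: int, hashbit: int,
           key: list[float], th_hd: int) -> int:
    """Join the nearest cluster if its Hamming distance is below th_hd, else open a new one."""
    best, best_dist = -1, None
    for idx, cluster in enumerate(clusters):
        dist = hamming(hashbit, cluster.hashbit)
        if best_dist is None or dist < best_dist:
            best, best_dist = idx, dist
    if best_dist is not None and best_dist < th_hd:
        cluster = clusters[best]
        n = len(cluster.token_ids)
        # Incremental mean: no need to revisit earlier members.
        cluster.key_mean = [(m * n + k) / (n + 1) for m, k in zip(cluster.key_mean, key)]
        cluster.token_ids.append(token_id)
        return best
    clusters.append(Cluster(hashbit, list(key), [token_id]))
    return len(clusters) - 1


clusters: list[Cluster] = []
stream = [
    (0, 0b11110000, [1.0, 0.0]),
    (1, 0b11110001, [3.0, 0.0]),   # Hamming distance 1 from token 0
    (2, 0b00001111, [0.0, 2.0]),   # Hamming distance 8 from token 0
]
for token_id, hashbit, key in stream:
    assign(clusters, token_id, hashbit, key, th_hd=7)

for idx, cluster in enumerate(clusters):
    print(idx, cluster.token_ids, cluster.key_mean)
```

Each arriving token costs one pass over the existing cluster hash-bits, and only the affected cluster's mean and member list change.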

3.4.2 WiCSum Thresholding

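Reading Figure 9: the figure shows the full data flow of WiCSum thresholding in two steps. (1) Compute $Query \times Key_{cluster}^{T}$ to obtain the $Score_{cluster}$ matrix (much smaller, because only the cluster representative keys are used). (2) Apply weighted cumulative sum thresholding to each row: sort the clusters by score in descending order, then accumulate score $\times$ token count and stop selecting once the sum exceeds the threshold $Th_{wics}$.

The core formulas are: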

Weighted sum (Equation 1):

$$Sum_{i} = \sum_{j=0}^{N_{cluster}-1} Score_{cluster_{i,j}} \cdot TC_{j}$$

Threshold (Equation 2):

$$Th_{wics_{i}} = Sum_{i} \cdot Th_{r-wics}$$

Cumulative threshold check (Equation 3):

$$Acc_{i}(t) = \sum_{j=0}^{t} Score_{cluster_{i,\sigma(j)}} \cdot TC_{\sigma(j)}, \quad Acc_{i}(t) > Th_{wics_{i}}$$

where $\sigma$ is the permutation that sorts $Score_{cluster}$ in descending order, $TC_{j}$ is the token count of the $j$-th cluster, and $N_{cluster}$ is the number of clusters.

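The key advantage of WiCSum is that different layers and heads automatically select different numbers of tokens, because the threshold is computed dynamically from the score distribution between the current query and all clusters. As a result, the retrieval ratio can vary from 4.2% to 44.0% across layers.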
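
A small worked example shows how the same threshold ratio yields different selection sizes for different score rows. This is a sketch of the weighted cumulative-sum rule as described in these notes, assuming non-negative scores; the score values and token counts are invented for illustration.

```python
def wicsum_select(scores: list[float], token_counts: list[int], th_r_wics: float) -> list[int]:
    """Return the cluster indices chosen for one query row."""
    weighted = [s * tc for s, tc in zip(scores, token_counts)]
    threshold = sum(weighted) * th_r_wics                  # weighted sum times the ratio
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    selected, acc = [], 0.0
    for j in order:                                        # descending score order
        selected.append(j)
        acc += weighted[j]
        if acc > threshold:                                # stop once the threshold is passed
            break
    return selected


token_counts = [2, 1, 5, 2]
row_a = wicsum_select([4.0, 3.0, 2.0, 1.0], token_counts, th_r_wics=0.3)
row_b = wicsum_select([1.0, 4.0, 2.0, 3.0], token_counts, th_r_wics=0.3)
print(row_a)  # [0]: weighted sum 23, threshold 6.9, cluster 0 alone contributes 8
print(row_b)  # [1, 3]: weighted sum 22, threshold 6.6, clusters 1 and 3 contribute 4 + 6
```

With a fixed top-k both rows would fetch the same number of clusters; here the first row stops after one cluster and the second needs two.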

3.5 V-Rex Hardware Architecture

Reading Figure 10: the figure is the complete V-Rex hardware architecture. The left side is the system-level view: the V-Rex accelerator contains the LLM Execution Engine (LXE) and the Dynamic KV Cache Retrieval Engine (DRE), and the DRE is further divided into the KV Cache Prediction Unit (KVPU) and the KV Cache Management Unit (KVMU). The right side shows the internals of the DRE: the KVPU contains the Hash-bit Cluster Unit (HCU, which uses XOR accumulators for Hamming distance computation) and the WiCSum Threshold Unit (WTU, which uses bucket sorters with early exit for dynamic threshold selection). The execution flow is: (1) the LXE generates hash-bits → (2) the HCU performs Hamming distance clustering → (3) the LXE computes $Q \times Key_{cluster}^{T}$ → (4) the WTU performs WiCSum thresholding → (5) the KVMU prefetches the selected KV → (6) the retrieved KV is used for attention.

Hardware parameters (per V-Rex core):

DPE: $N_{DPE-h} = 64, N_{DPE-w} = 64$ (BF16 MAC trees)
VPE: $N_{VPE-h} = 1, N_{VPE-w} = 64$
HCU: $N_{HCU-h} = 1, N_{HCU-w} = 16$ (XOR accumulators)
WTU: $N_{WTU-h} = 1, N_{WTU-w} = 16$ (bucket sorters + adder trees)

3.5.1 Early-Exit Sorting

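Reading Figure 11: the figure shows the early-exit sorting mechanism in the WTU. The WTU first preprocesses each row (computing the weighted sum, min/max, token count, and $Th_{wics}$), then runs bucket sort plus cumulative sum starting from the highest-score bucket. As soon as the cumulative sum exceeds the threshold $Th_{wics}$, sorting terminates early. This is very efficient because a few high-score clusters usually account for most of the weighted sum (on average only 16% of the clusters need to be processed).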
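
The early-exit idea can be sketched in software: distribute clusters into score buckets, then sort and accumulate one bucket at a time from the top, stopping once the threshold is crossed. The bucket count, the linear bucketing rule, and the example values are assumptions of this sketch; the actual WTU datapath is not described at this level in these notes.

```python
def early_exit_bucket_select(scores: list[float], token_counts: list[int],
                             th_r_wics: float, num_buckets: int = 4):
    """Return (selected cluster indices, number of buckets that had to be sorted)."""
    weighted = [s * tc for s, tc in zip(scores, token_counts)]
    threshold = sum(weighted) * th_r_wics
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0                       # avoid division by zero when all scores match
    buckets = [[] for _ in range(num_buckets)]
    for j, s in enumerate(scores):
        b = min(int((s - lo) / span * num_buckets), num_buckets - 1)
        buckets[b].append(j)

    selected, acc, buckets_sorted = [], 0.0, 0
    for bucket in reversed(buckets):              # highest-score bucket first
        buckets_sorted += 1
        for j in sorted(bucket, key=lambda j: scores[j], reverse=True):
            selected.append(j)
            acc += weighted[j]
            if acc > threshold:                   # early exit: lower buckets are never sorted
                return selected, buckets_sorted
    return selected, buckets_sorted


scores = [9.0, 1.0, 8.0, 2.0, 0.5, 1.5, 3.0, 1.0]
token_counts = [3, 1, 2, 4, 2, 1, 1, 2]
selected, buckets_sorted = early_exit_bucket_select(scores, token_counts, th_r_wics=0.6)
print(selected, buckets_sorted)
```

In this example the two top-scoring clusters already exceed 60% of the weighted sum, so only one of the four buckets is ever sorted.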

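3.5.2 Hierarchical Memory System and Cluster-wise Memory Mapping

Reading Figure 12: the figure shows V-Rex's hierarchical memory system. The on-chip memory of V-Rex holds the most recent KV cache, and older KV that exceeds its capacity is offloaded to CPU memory or storage. The key innovation of the KVMU is cluster-wise memory mapping: the tokens of one hash cluster are laid out contiguously in storage, so prefetching a cluster is a single transfer, which maximizes PCIe bandwidth utilization. The reordering is done inside V-Rex in a streaming fashion and requires no access to CPU memory or storage.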
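
The benefit of cluster-wise mapping can be shown by counting how many contiguous address runs a prefetch needs before and after reordering. The layout function and the run counter below are hypothetical helpers written for this illustration; the cluster assignment is made up.

```python
def cluster_wise_layout(cluster_of_token: list[int]):
    """Return (storage order of token ids, cluster -> (start, length)) with members contiguous."""
    # sorted() is stable, so arrival order is preserved inside each cluster.
    order = sorted(range(len(cluster_of_token)), key=lambda t: cluster_of_token[t])
    extents: dict[int, tuple[int, int]] = {}
    for pos, token in enumerate(order):
        cluster = cluster_of_token[token]
        start, length = extents.get(cluster, (pos, 0))
        extents[cluster] = (start, length + 1)
    return order, extents


def count_runs(positions: list[int]) -> int:
    """Number of contiguous address runs (one transfer each) covering the positions."""
    positions = sorted(positions)
    return sum(1 for i, p in enumerate(positions) if i == 0 or p != positions[i - 1] + 1)


# Tokens in arrival order, interleaved across three clusters.
cluster_of_token = [0, 1, 0, 2, 1, 0, 2, 1]
order, extents = cluster_wise_layout(cluster_of_token)

wanted = 1
arrival_positions = [t for t, c in enumerate(cluster_of_token) if c == wanted]
start, length = extents[wanted]
remapped_positions = list(range(start, start + length))
print(count_runs(arrival_positions), "runs in arrival order,",
      count_runs(remapped_positions), "run after remapping")
```

Fewer, longer runs mean fewer small transfers, which is what lets a link with high per-transfer overhead approach its peak bandwidth.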

3.6 Pseudocode

3.6.1 Overall ReSV Flow

import torch


def resv_kv_prediction(query, key_new, hc_table, hyperplanes, th_hd, th_r_wics, token_offset=0):
    """ReSV: predict the KV token indices to retrieve for the current query.

    hc_table and xor_popcount are pseudocode helpers reconstructed from the paper's description.
    token_offset is the global index of the first token of the new frame.
    """

    # ---- Step 1: Hash-bit Key Clustering (incremental, run once per new frame) ----
    # Generate hash-bits for the keys of the new frame
    key_hp = key_new @ hyperplanes          # [N_tokens, N_hp]
    key_hashbit = (key_hp > 0).int()        # binarize: [N_tokens, N_hp]

    for local_idx, hashbit in enumerate(key_hashbit):
        token_idx = token_offset + local_idx
        if len(hc_table.cluster_hashbits) == 0:
            hc_table.create_new_cluster(token_idx, hashbit)   # the first token opens the first cluster
            continue

        # Hamming distance to every existing cluster (bit-wise XOR + popcount)
        distances = xor_popcount(hashbit, hc_table.cluster_hashbits)
        min_dist, min_cluster = distances.min(dim=0)

        if min_dist < th_hd:
            hc_table.add_to_cluster(min_cluster, token_idx)  # join an existing cluster
            hc_table.update_key_cluster(min_cluster)          # update the Key_cluster mean
        else:
            hc_table.create_new_cluster(token_idx, hashbit)   # open a new cluster

    # ---- Step 2: WiCSum Thresholding ----
    # Score the query against cluster representative keys (far fewer columns than all keys)
    score_cluster = query @ hc_table.key_clusters.T     # [N_query, N_cluster]
    token_counts = hc_table.token_counts                # [N_cluster]

    selected_indices = set()
    for i in range(score_cluster.shape[0]):              # one query row at a time
        # Weighted sum and threshold for this row
        weighted = score_cluster[i] * token_counts
        threshold = weighted.sum() * th_r_wics

        # Sort clusters by score in descending order
        sorted_idx = score_cluster[i].argsort(descending=True)

        # Accumulate until the threshold is exceeded (early exit)
        acc = 0.0
        for j in sorted_idx.tolist():
            acc += weighted[j].item()
            selected_indices.update(hc_table.get_token_indices(j))
            if acc > threshold:
                break

    return sorted(selected_indices)  # deduplicated union over all query rows

3.6.2 V-Rex End-to-End Inference Pipeline

def vrex_streaming_inference(video_stream, questions, vrex_accelerator):
    """Main loop of V-Rex streaming video LLM inference (pseudocode; helpers are not a real API)."""
    kv_cache = HierarchicalKVCache(on_chip=vrex_accelerator.memory,
                                    off_chip="cpu_memory_or_storage")
    hc_table = HashClusterTable()

    for frame in video_stream:
        # ---- Iterative Prefill Stage ----
        visual_tokens = vision_tower(frame)               # SigLIP-ViT-L-384
        hidden = mlp_projector(visual_tokens)
        retrieved_kv = {}                                 # layer index -> KV prefetched for that layer

        for layer_idx in range(num_layers):
            # Current layer: generate QKV from the current hidden states
            q, k, v = compute_qkv(hidden, layer_idx)

            # [Parallel path 1] LXE: light attention + FFN over the retrieved KV.
            # Layer 0 has no prediction from an earlier layer; in this sketch it
            # attends only to the current frame's K/V (an assumption).
            hidden = light_attention(q, retrieved_kv.get(layer_idx), k, v)
            hidden = ffn(hidden, layer_idx)

            # [Parallel path 2] DRE: predict the KV needed by the next layer
            # HCU: hash-bit clustering (bit-level XOR, hardware accelerated)
            vrex_accelerator.hcu.update_clusters(k, hc_table)
            # LXE: Q × Key_cluster^T
            score_cluster = q @ hc_table.key_clusters.T
            # WTU: WiCSum thresholding with early-exit (hardware accelerated)
            selected = vrex_accelerator.wtu.threshold_select(score_cluster, hc_table)
            # KVMU: prefetch selected KV from off-chip (nothing to prefetch after the last layer)
            if layer_idx + 1 < num_layers:
                retrieved_kv[layer_idx + 1] = kv_cache.prefetch(selected)

            # Update the KV cache (old entries are offloaded automatically)
            kv_cache.append(k, v, layer_idx)

    # ---- Generation Stage (when a user question arrives) ----
    for question in questions:
        question_tokens = tokenize(question)
        # Generation also retrieves with ReSV, at a lower retrieval ratio (1.4%-2.9%)
        answer = generate_with_resv(question_tokens, kv_cache, hc_table)
        yield answer

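3.7 Concept-to-Code Mapping

Paper concept	Description	Notes
ReSV Algorithm	Training-free KV cache retrieval algorithm	Not open-sourced; paper Section IV
Hash-bit Key Clustering	Clustering via random hyperplanes + binarization + XOR Hamming distance	Accelerated by the HCU
WiCSum Thresholding	Dynamic threshold selection via weighted cumulative sum	Accelerated by the WTU, with early exit
DRE (Dynamic Retrieval Engine)	Dedicated hardware engine: HCU + WTU	Only 2.0% of area and 2.4% of power
LXE (LLM Execution Engine)	Main LLM compute engine: DPE + VPE	BF16, 384 KB on-chip memory
KVMU	KV cache management unit	Cluster-wise memory mapping
HC Table	Hash Cluster Table	Stores cluster info; only 1.67% of the KV cache size
VideoLLM-Online	Baseline streaming video LLM	Uses Llama-3 8B + SigLIP-ViT-L-384
COIN Benchmark	Main evaluation dataset	Top-1 accuracy on 5 benchmarks

Note: this is a software-hardware co-design paper (HPCA 2026) with no public source code. The pseudocode above is reconstructed from the paper's description and does not correspond to a real code repository.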

4. Experimental Setup

4.1 Setup

Configuration	Edge (V-Rex$^{8}$)	Server (V-Rex$^{48}$)
V-Rex Cores	8	48
Peak Throughput (TFLOPS)	53.3	319.5
Memory Bandwidth	LPDDR5: 204.8 GB/s	HBM2e: 1935 GB/s
Memory Capacity	32 GB	80 GB
PCIe Bandwidth	4 GB/s	32 GB/s
Power	35 W	203.68 W
Process	14nm, 0.8V, 800MHz	14nm, 0.8V, 800MHz

Baselines: AGX Orin + FlexGen / InfiniGen / InfiniGenP / ReKV, and the NVIDIA A100 GPU.

Model: Llama-3 8B + SigLIP-ViT-L-384 (VideoLLM-Online).

Evaluation dataset: COIN (5 benchmarks).

KV cache sizes: 1K, 5K, 10K, 20K, 40K tokens.

ReSV hyperparameters: $N_{hp} = 32$, $Th_{r-wics} = 0.3$, $Th_{hd} = 7$.

5. Experimental Results

5.1 Latency and Energy Efficiency

Reading Figure 13: the figure is the paper's central performance comparison, split into Edge GPU (AGX Orin) and Server GPU (A100) groups. The two left columns show latency and the two right columns show energy efficiency (log scale).

Edge GPU results:

Frame processing (Batch 1): V-Rex$^{8}$ has a per-frame latency of 121-254 ms (1K-40K), i.e., 3.9-8.3 FPS, and is the only scheme that stays real-time ($\geq 2$ FPS) at every KV cache size. AGX+FlexGen reaches ~2400 ms at 40K;
Text generation (Batch 1): TPOT is 89-97 ms, a 1.9-15.1$\times$ speedup;
Batch 4 @ Frame: 2.1-13.8$\times$ speedup;
Energy efficiency: 5.5-10.2$\times$ improvement for frame processing and 4.3-18.5$\times$ for text generation.
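
The FPS figures follow directly from the per-frame latencies. The short check below converts the latencies quoted above into frame rates and compares them with the 2 FPS real-time target; the function name is a hypothetical helper.

```python
def frames_per_second(latency_ms: float) -> float:
    return 1000.0 / latency_ms


REAL_TIME_FPS = 2.0
# 121 ms and 254 ms are the V-Rex latencies at 1K and 40K; ~2400 ms is AGX+FlexGen at 40K.
for latency_ms in (121, 254, 2400):
    rate = frames_per_second(latency_ms)
    print(f"{latency_ms} ms/frame -> {rate:.1f} FPS, real-time: {rate >= REAL_TIME_FPS}")
```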

Server GPU results:

Batch 1: 20-48 ms per frame, a 2.6-7.3$\times$ speedup;
Batch 8: 3.4-19.7$\times$ speedup, TPOT of 14-15 ms;
Energy efficiency: 9.0-29.7$\times$ for frame processing (Batch 1) and 13.2-70.6$\times$ for text generation (Batch 8).

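5.2 End-to-End Latency Breakdown

Reading Figure 14: the figure compares the end-to-end latency breakdown of AGX Orin and V-Rex$^{8}$ at different KV cache lengths. Key findings: the latency of AGX+FlexGen blows up as the KV cache grows, with prefill taking an ever larger share; software-only optimizations (InfiniGenP, ReKV) are even slower than FlexGen in the 1K-20K range (because of KV prediction overhead); V-Rex$^{8}$ offloads KV prediction to hardware, making its latency negligible, and achieves a 5.4$\times$ end-to-end latency reduction at 40K.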

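5.3 Comparison with a SOTA Accelerator

Reading Figure 15: the figure compares the throughput of V-Rex$^{8}$ against Oaken (a SOTA LLM accelerator based on 4-bit KV cache quantization) and AGX Orin (Batch 16 @ Frame). At a sequence length of 1K, V-Rex is 1.5$\times$ faster than AGX Orin and 1.1$\times$ faster than Oaken. Beyond 20K, however, Oaken runs out of memory because even the quantized KV cache keeps growing, whereas V-Rex, through retrieval + offloading, keeps running to 40K and beyond at about 7 FPS. This demonstrates V-Rex's core advantage in long-sequence scenarios.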

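5.4 Ablation Study

Reading Figure 16: the figure shows the ablation of V-Rex's components (40K KV cache, Batch 1). The left side shows latency and speedup: with AGX+FlexGen as the baseline, AGX+ReSV achieves a 2.8$\times$ speedup through software optimization, V-Rex$^{8}$_KVPU (adding the DRE hardware) achieves 6.0$\times$, and V-Rex$^{8}$_All (adding the KVMU to optimize PCIe bandwidth) achieves 8.1$\times$. The right side shows energy: V-Rex$^{8}$_All achieves a 10.2$\times$ energy-efficiency improvement. The key conclusion is that software-only optimization (ReSV on GPU) solves only part of the problem, and hardware acceleration (DRE) is necessary.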

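5.5 Bandwidth Analysis

Reading Figure 17: the figure analyzes the DRAM bandwidth usage of V-Rex$^{48}$ during one layer of frame processing. KV prediction briefly spikes to ~600 GB/s of bandwidth, but the burst is short and can be fully hidden behind attention/FFN. KV retrieval (from CPU memory to DRAM) goes over PCIe and uses only ~1% of the DRAM bandwidth, so it can run fully concurrently with the LLM computation.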

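5.6 Roofline Analysis

Reading Figure 18: the figure is a roofline model analysis of AGX Orin and V-Rex$^{8}$ (40K KV cache, Batch 4, operational intensity 15.2 Op/B). AGX+FlexGen reaches only 6.6% of the theoretical peak (limited by the PCIe bottleneck), AGX+ReKV reaches 15% (still limited by what GPU plus software optimization can do), and V-Rex$^{8}$ reaches 71.5% (a 10.8$\times$ improvement), successfully moving from the I/O-bound region to the compute-bound region.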

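5.7 Accuracy Analysis

Reading Figure 19: this is the most important accuracy table. The upper half is the Top-1 accuracy on the 5 COIN benchmarks: VideoLLM-Online (no retrieval) averages 50.5%, and V-Rex's ReSV reaches 50.5% (an average accuracy loss of only 0.8%). InfiniGen is lossless but cannot accelerate prefill, InfiniGenP loses 3.4% accuracy, and ReKV is close in accuracy but has a much higher retrieval ratio. The lower half is the retrieval ratio: ReSV retrieves 25.1%-36.1% in the frame processing stage and only 1.4%-2.9% in the text generation stage, far below InfiniGenP (50.8%/50.8%) and ReKV (58.4%/31.2%).

Reading Figure 20: the figure is the ablation of the ReSV algorithm (40K KV cache). Three configurations are compared: VideoLLM-Online (no retrieval, baseline), ReSV w/o Clustering (WiCSum thresholding only, without hash-bit clustering), and ReSV (full version). Without clustering, the speedup is 1.6$\times$ with a 0.3% accuracy loss; with clustering, the speedup jumps to 9.4$\times$ (because the computation drops from $Q \times K^{T}$ to $Q \times Key_{cluster}^{T}$) and the accuracy loss rises to 0.8%. This shows that hash-bit clustering is the key to the speedup.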

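5.8 Per-layer/Per-head Retrieval Ratio Analysis

Figure 20 of the paper compares the retrieval ratio of ReSV, InfiniGenP, and ReKV across layers and heads. InfiniGenP and ReKV use a fixed 50% retrieval ratio on every layer and head, whereas ReSV shows wide variation: some layers need only 4.2% of the tokens, critical layers need 44.0%, and the heads also differ significantly from one another. On average, ReSV retrieves 3.0$\times$ fewer tokens than ReKV. This confirms the core advantage of WiCSum's dynamic threshold.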

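5.9 Hardware Overhead

Reading Figure 21: each V-Rex core occupies 1.89 mm$^{2}$ and consumes 2.61 W. The DRE (HCU+WTU+KVMU) accounts for only 2.0% of the area and 2.4% of the power, so the hardware cost of the dedicated retrieval engine is very small. V-Rex$^{8}$ has a total area of 15.12 mm$^{2}$ (far smaller than AGX Orin's 200 mm$^{2}$) and a power of 35 W (below AGX Orin's 40 W, an 11.4% saving). V-Rex$^{48}$ has an area of 90.57 mm$^{2}$ (far smaller than the A100's 826 mm$^{2}$) and a power of 203.68 W (below the A100's 300 W, a 32.1% saving).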

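5.10 Key Contributions and Summary

Key contributions

First software-hardware co-designed accelerator for streaming video LLMs: V-Rex addresses the memory and compute bottlenecks of the KV cache at both the algorithm and hardware levels, rather than through software optimization alone;
ReSV algorithm: training-free; it exploits the spatio-temporal similarity between video frames through hash-bit clustering + WiCSum dynamic thresholding, achieving adaptive per-layer/per-head token selection with an average accuracy loss of only 0.8%;
DRE hardware engine: at a very small cost in chip area (2.0%) and power (2.4%), it cuts KV prediction latency from 23% to 0.5%, so the retrieval computation fully overlaps with LLM inference;
Real-time edge inference: 3.9-8.3 FPS on an edge device, with 1.9-19.7$\times$ speedup and 3.1-18.5$\times$ energy-efficiency improvement.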

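Limitations and reflections

Limited evaluation scope: evaluated only on the COIN dataset, without more diverse streaming video benchmarks (e.g., EgoSchema, ActivityNet-QA);
Fixed model: only Llama-3 8B + SigLIP was tested, with no validation on larger models (70B) or different architectures (Qwen3);
Simulation rather than real silicon: evaluated with a custom cycle-level simulator + DRAMSim3 + MQSim, with no validation on a fabricated chip;
Combination with software-only KV cache compression: the paper does not explore whether ReSV can be combined with quantization (e.g., Oaken's 4-bit KV) for larger gains;
Sensitivity to clustering hyperparameters: the paper gives no systematic analysis of whether $N_{hp} = 32$, $Th_{hd} = 7$, and $Th_{r-wics} = 0.3$ are sensitive to different video content or models.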

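Implications for future research

The core insight of V-Rex is that the bottleneck of a streaming video LLM lies not in the LLM computation itself but in managing and retrieving the KV cache. This implies that:

Future video LLM accelerator designs should focus on the memory hierarchy and data movement, not just on stacking more compute units;
The idea of using hash-bit clustering to exploit inter-frame similarity can be extended to other settings with temporal redundancy (e.g., robotics, autonomous driving);
WiCSum's dynamic threshold suits highly heterogeneous attention patterns better than fixed top-k, which is also relevant to long-context LLM inference.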
