StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

Authors: Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, Mengye Ren
Affiliations: Meta AI, New York University
arXiv: 2508.15717
Project Page: yangyanl.ai/streammem

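1. Motivation

StreamMem targets query-agnostic streaming video understanding. The authors point out that long-video understanding faces two difficulties at once:

Too many visual tokens: long videos produce a huge number of visual tokens, so storing the KV cache and decoding against it are both very expensive;
Future questions are unknown: in real streaming / multi-turn QA settings, the model usually does not know what the user will ask while it is encoding the video, so many query-aware compression methods are impractical.

Existing approaches have clear limitations:

Some methods must see the entire video before compressing;
Some methods need the query in advance to score token importance;
Among streaming methods, some store the full KV cache as ReKV does (which requires offloading), while others, such as certain FIFO strategies, simply discard old KV entries and tend to forget early information.

The question the paper sets out to answer is therefore: without any knowledge of future queries, and using only the video itself and the system template, can the streaming KV cache be compressed to a fixed budget with high quality?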

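2. Idea

The core idea of StreamMem is to treat chat template tokens as a generic proxy query. The authors observe that, as a result of MLLM pretraining, the chat template implicitly triggers a tendency to "describe the video content", so the cross-attention from these template tokens to the visual tokens approximates a notion of "general importance".

Based on this observation, the authors propose a training-free framework:

Input Frame Filtering: remove highly similar neighboring frames at the input level;
Attention-based KV Pruning: keep the top-k visual tokens ranked by the cross-attention they receive from the chat template tokens;
Frame-wise KV Merging: additionally build one prototype key/value per frame to retain a global summary;
YaRN Position Extension: disturb the original spatiotemporal position information as little as possible.

Compared with query-aware methods, StreamMem's main advantage is that it does not have to wait for a question before compressing, nor does it have to re-run the full visual context.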

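3. Method

3.1 Overall framework

Figure 1 walkthrough: Figure 1(a) shows the overall StreamMem pipeline. Each incoming video clip first passes through frame filtering and is then fed to the vision encoder and the MLLM; the resulting KV entries are combined with the historical memory and handed to the KV compression module, which updates the fixed-size KV cache memory. Figure 1(b) shows the compression details: proxy-query attention drives token pruning, while frame-wise prototype merging condenses the tokens of each frame into a more compact frame-level summary. This preserves both salient local tokens and a frame-level global signal.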

3.2 Input Frame Filtering

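The authors first apply a simple but effective redundancy removal at the input level. For consecutive frames, the vision encoder extracts embeddings, and the cosine similarity between adjacent frames is computed:

If the similarity exceeds the threshold $δ$, the two frames are considered redundant;
Their representations are then merged by simple averaging.

This step is somewhat similar to the temporal compression in LongVU, but StreamMem's goal is to save budget for the subsequent KV cache rather than merely to reduce the number of frames.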

3.3 Query-agnostic KV pruning

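Figure 2 walkthrough: Figure 2 compares three kinds of query: chat template, generic question, and specific question. The visual regions attended to by the three overlap heavily, indicating that even without a real user question, the chat template tokens alone steer the model toward largely the same important regions. This experiment supports StreamMem's key assumption: the chat template can serve as a proxy query for query-agnostic compression.

Concretely, the authors append the following chat template tokens after the visual tokens: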

<|im_end|><|im_start|>assistant\n

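Let $Q$ denote the query representation of these template tokens and $K_{t}^{i}$ the keys of the current clip at layer $i$. The cross-attention is then:

$$ A_{t}^{i} = \mathrm{Softmax}\left(\frac{Q (K_{t}^{i})^{\top}}{\sqrt{d}}\right), $$

where $d$ is the key dimension. Aggregating these attention weights over the query dimension yields an importance score for each visual token, and the top-k tokens are kept.

The authors stress that this proxy query is not an arbitrary trick: it exploits an inductive bias that MLLMs acquire from video-captioning pretraining. Even without an explicit question, the model tends to produce a generic description of "what happens in the video".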

3.4 Frame-wise KV Merging

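Pruning alone can still lose the overall semantics of each frame, so StreamMem additionally builds a frame-level prototype. For frame $t$ at layer $i$:

$$ \bar{K}_{t}^{i} = \sum_{j=1}^{n} α_{j}^{i} \cdot K_{t,j}^{i}, $$

$$ \bar{V}_{t}^{i} = \sum_{j=1}^{n} α_{j}^{i} \cdot V_{t,j}^{i}, $$

where $α_{j}^{i}$ is the normalized importance of token $j$ and the sum runs over the $n$ tokens of the frame. Each frame thus keeps not only a few salient tokens but also one frame-level summary token, balancing local detail with a global summary.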

3.5 KV cache memory under global budget

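The authors write the compressed memory at each layer as:

$$ {K'}_{t}^{i}, {V'}_{t}^{i} = \mathrm{Compress}({K'}_{t-1}^{i}, K_{t}^{i}, {V'}_{t-1}^{i}, V_{t}^{i}), $$

and require it to satisfy a global memory constraint:

$$ \sum_{i=1}^{L} \| {K'}_{t}^{i} \|_{0} \leq M, $$

where $L$ is the number of layers and $M$ is the total memory budget. In other words, StreamMem does not aim to compress any single layer optimally in isolation; it keeps updating the memory so that the model as a whole stays within a fixed total budget.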
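The notes do not spell out how Compress enforces the budget. As a rough sketch only, assume that every layer receives an equal share M / L of the budget and that an importance score is available for both old and new entries; neither assumption comes from the paper. Under those assumptions, a budgeted update can concatenate the old memory with the new clip's KV and keep the highest-scoring entries. The helper names `compress_layer` and `total_entries` are hypothetical.

```python
import numpy as np

def compress_layer(K_mem, V_mem, s_mem, K_new, V_new, s_new, budget):
    """Merge old memory with new KV and keep the `budget` highest-scoring entries.

    Hypothetical helper: ranking old and new entries on one score scale is an assumption.
    """
    K = np.concatenate([K_mem, K_new], axis=0)
    V = np.concatenate([V_mem, V_new], axis=0)
    s = np.concatenate([s_mem, s_new], axis=0)
    keep = np.sort(np.argsort(-s)[:budget])  # top entries, original order preserved
    return K[keep], V[keep], s[keep]

def total_entries(memory):
    """Left-hand side of the global constraint: stored keys summed over layers."""
    return sum(K.shape[0] for K, _, _ in memory)

# Toy setup: L = 4 layers, key dimension 8, global budget M = 40 entries.
rng = np.random.default_rng(0)
L, d, M = 4, 8, 40
per_layer = M // L  # equal split across layers (assumption)
memory = [(np.zeros((0, d)), np.zeros((0, d)), np.zeros(0)) for _ in range(L)]

for _ in range(5):  # five incoming clips of 16 tokens each
    for i in range(L):
        K_new, V_new = rng.normal(size=(16, d)), rng.normal(size=(16, d))
        s_new = rng.random(16)
        memory[i] = compress_layer(*memory[i], K_new, V_new, s_new, per_layer)
    assert total_entries(memory) <= M  # the constraint holds after every clip

print(total_entries(memory))  # 40
```

The memory size stops growing after the first clip, which is the property that separates this setting from storing the full KV cache.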

3.6 Positional embedding / YaRN

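Many streaming methods reassign position IDs to the retained tokens after compression, which destroys the original spatiotemporal position information. StreamMem takes a different route:

Extend the visual context window with YaRN;
Preserve the original positional consistency as far as possible;
Avoid hurting long-video understanding through naive position re-indexing.

The authors' experiments show that a suitable YaRN scaling factor substantially improves performance: in the ablation in Section 5.4, the score rises from 61.5 at $λ = 1$ to 66.9 at $λ = 8$.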
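For readers unfamiliar with YaRN, the sketch below shows its "NTK-by-parts" frequency interpolation for ordinary 1D RoPE, as described in the YaRN paper. It is not StreamMem's code: the values of `base`, `orig_ctx`, `alpha` and `beta` are illustrative assumptions, and the notes do not say how the scaling is combined with each MLLM's own positional scheme.

```python
import numpy as np

def yarn_inv_freq(dim, scale, base=10000.0, orig_ctx=32768, alpha=1.0, beta=32.0):
    """Per-dimension RoPE frequencies after YaRN "NTK-by-parts" interpolation.

    dim: rotary dimension (even); scale: context extension factor.
    base, orig_ctx, alpha and beta are illustrative values, not StreamMem's settings.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # theta_d of plain RoPE
    wavelength = 2 * np.pi / inv_freq                 # positions per full rotation
    r = orig_ctx / wavelength                         # rotations inside the original context
    gamma = np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)
    # gamma = 1: high-frequency dims stay untouched.
    # gamma = 0: low-frequency dims are interpolated by 1 / scale.
    return (1 - gamma) * inv_freq / scale + gamma * inv_freq

def yarn_attention_scale(scale):
    """Factor sqrt(1/t) = 0.1 * ln(scale) + 1 that the YaRN paper recommends."""
    return 0.1 * np.log(scale) + 1.0

base_freq = yarn_inv_freq(128, scale=1)  # scale 1 leaves RoPE unchanged
scaled = yarn_inv_freq(128, scale=8)
print(scaled[0] / base_freq[0], scaled[-1] / base_freq[-1])
```

With these illustrative settings the fastest-rotating dimension keeps its frequency while the slowest one is divided by the scale factor, which is how YaRN stretches the usable position range without relabeling the retained tokens.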

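3.7 Pseudocode (no open-source implementation found; compiled from the paper's description)

No public repository was found by code search. The pseudocode below is compiled from the method section of the paper's paper.tex and is not a reproduction of the authors' source code.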

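Component A: Input frame filtering

# Algorithm: input frame filtering
# Input: consecutive frames f_1 ... f_T (arrays of equal shape)
# Output: filtered frames, with near-duplicate neighbors averaged together

import numpy as np

def cosine_similarity(a, b):
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def frame_filter(frames, vision_encoder, delta=0.95):
    if len(frames) == 0:
        return []
    embeds = [vision_encoder(f) for f in frames]
    kept = [np.asarray(frames[0], dtype=float)]
    for i in range(1, len(frames)):
        sim = cosine_similarity(embeds[i - 1], embeds[i])
        if sim > delta:
            # Redundant neighbor: merge it into the last kept frame by averaging.
            kept[-1] = (kept[-1] + np.asarray(frames[i], dtype=float)) / 2
        else:
            kept.append(np.asarray(frames[i], dtype=float))
    return kept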

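Component B: Query-agnostic KV pruning

# Algorithm: query-agnostic KV pruning with the chat template
# Input: visual tokens, chat template tokens, keep budget k
# Output: keys/values of the top-k salient visual tokens, plus their scores

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kv_prune(visual_tokens, chat_template_tokens, model, k):
    # model.query and model.kv stand in for one layer's Q and K/V projections.
    Q = model.query(chat_template_tokens)        # (n_query, d)
    K, V = model.kv(visual_tokens)               # (n_visual, d) each
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (n_query, n_visual)
    scores = A.mean(axis=0)                      # aggregate over queries (mean is an assumption)
    idx = np.sort(np.argsort(-scores)[:k])       # top-k, kept in original token order
    return K[idx], V[idx], scores[idx]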

Component C: Frame-wise KV merging

# Algorithm: frame-wise KV prototype merging
# Input: K_t, V_t of shape (n, d); normalized importance alpha of shape (n,)
# Output: prototype key and value, each of shape (d,)

import numpy as np

def kv_merge(K_t, V_t, alpha):
    alpha = np.asarray(alpha)
    K_bar = alpha @ K_t  # sum_j alpha[j] * K_t[j]
    V_bar = alpha @ V_t  # sum_j alpha[j] * V_t[j]
    return K_bar, V_bar

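Component D: Streaming memory update

# Algorithm: streaming video encoding and memory update
# Input: incoming clip, previous compressed memory, memory budget M
# Output: updated fixed-size KV memory
# Placeholders not specified in these notes: model.encode (clip -> visual tokens),
# chat_template_tokens() (embedded template tokens), compress() (budgeted memory update).

def streammem_step(clip, memory, model, M):
    clip = frame_filter(clip, model.vision_encoder)  # drop near-duplicate frames
    visual_tokens = model.encode(clip)
    K_keep, V_keep, scores = kv_prune(visual_tokens, chat_template_tokens(), model, k=M)
    # kv_prune returns scores for the kept tokens only, so the prototype here is
    # built from those tokens; the weights are normalized to sum to 1.
    alpha = scores / scores.sum()
    K_proto, V_proto = kv_merge(K_keep, V_keep, alpha)
    memory = compress(memory, K_keep, V_keep, K_proto, V_proto, budget=M)
    return memory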

3.8 Code-to-paper mapping table

Paper Concept	Source File	Key Class/Function
Full method implementation	Not released	No open-source implementation found by code search
Input frame filtering	Paper `paper.tex`	`Input Frame Filtering` section
Query-agnostic KV pruning	Paper `paper.tex`	`KV Cache Memory` section
Frame-wise merging	Paper `paper.tex`	Eq. (2)
Streaming loop	Paper `paper.tex`	Algorithm 1

4. Experimental Setup

4.1 Benchmarks

The authors evaluate on three offline benchmarks and two streaming benchmarks:

Offline
- MLVU
- EgoSchema
- VideoMME
Streaming
- RVS-Ego
- RVS-Movie

4.2 Models

StreamMem is applied to three open-source MLLMs:

LLaVA-OneVision-7B
Qwen2-VL-7B
Qwen2.5-VL-3B

4.3 Main settings

The video stream is processed at 0.5 FPS by default (following ReKV)
Each video clip contains 8 frames
Main KV budget: 6K
All experiments can run on a single A100 GPU
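To put the 6K budget in perspective, the following back-of-the-envelope calculation counts the visual tokens a stream produces before any compression. The sampling rate and clip length come from the settings above; `tokens_per_frame=196` is an assumed illustrative value (the real count depends on the model and input resolution), and "6K" is read here as 6,000 entries.

```python
def visual_token_count(duration_s, fps=0.5, frames_per_clip=8, tokens_per_frame=196):
    """Frames, clips and visual tokens produced by a stream, before compression.

    fps and frames_per_clip follow the settings above; tokens_per_frame is an assumption.
    """
    frames = int(duration_s * fps)
    clips = -(-frames // frames_per_clip)  # ceiling division
    return frames, clips, frames * tokens_per_frame

frames, clips, tokens = visual_token_count(3600)  # a one-hour video
budget = 6_000                                    # "6K", read as 6,000 (assumption)
print(frames, clips, tokens, round(tokens / budget, 1))  # 1800 225 352800 58.8
```

Under these assumptions a one-hour stream yields roughly 59 times more visual tokens than the memory can hold, which is why the cache has to be compressed continuously rather than once at the end.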

4.4 Baselines

The authors compare against the following strong baselines:

LiveVLM
InfiniPot-V
ReKV (oracle-like upper bound, since it keeps the full KV cache)
MovieChat+
Dispider
LongVU

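5. Experimental Results

5.1 Offline long-video understanding

Figure 3 walkthrough: Figure 3 is a conceptual diagram of StreamMem's application scenario. Under a fixed memory bound, the model keeps receiving new clips and keeps compressing the KV cache memory instead of storing all history indefinitely. It highlights StreamMem's goal: to preserve video understanding when three constraints hold at once, namely unknown queries, streaming input, and limited GPU memory.

The main table in the paper reports: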

Base MLLM	KV Size	MLVU	EgoSchema	VideoMME Medium	VideoMME Long	VideoMME All
LLaVA-OneVision-7B + StreamMem	6K	66.9	63.0	56.6	50.1	59.4
Qwen2-VL-7B + StreamMem	6K	65.9	67.2	62.4	52.3	62.1
Qwen2.5-VL-3B + StreamMem	6K	62.3	62.2	60.1	49.1	59.5

Comparison with key baselines (MLVU / EgoSchema / VideoMME Medium / VideoMME Long / VideoMME All):

LLaVA-OneVision-7B
- LiveVLM: 66.3 / 63.0 / 56.4 / 48.8 / 57.3
- ReKV (upper bound): 68.5 / 60.7 / - / - / -
Qwen2-VL-7B
- InfiniPot-V: 65.8 / 65.6 / 60.8 / 53.4 / 62.8
- Full KV: 65.8 / 65.2 / - / - / 63.9
Qwen2.5-VL-3B
- InfiniPot-V: 62.1 / 61.8 / - / - / 59.3
- Full KV: 63.3 / 64.4 / - / - / 60.3

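These results show that:

On LLaVA-OneVision-7B, StreamMem beats LiveVLM on MLVU (66.9 vs. 66.3) and on every VideoMME split (59.4 vs. 57.3 overall), and ties it on EgoSchema (63.0);
On Qwen2-VL-7B, it is broadly competitive with InfiniPot-V but does not lead on every VideoMME metric;
The paper's text says that Qwen2-VL trails only on VideoMME-long, but in the table the VideoMME-All score of 62.1 is also below InfiniPot-V's 62.8, so the table should be taken as authoritative when reading.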
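The comparisons above are easy to check mechanically. The snippet below copies the MLVU, EgoSchema and VideoMME-All numbers from the tables in this section and prints StreamMem minus the compressed-cache baseline for each base model; the dictionary layout and the `deltas` helper are only for illustration.

```python
# Scores copied from the tables above: (MLVU, EgoSchema, VideoMME-All).
streammem = {
    "LLaVA-OneVision-7B": (66.9, 63.0, 59.4),
    "Qwen2-VL-7B": (65.9, 67.2, 62.1),
    "Qwen2.5-VL-3B": (62.3, 62.2, 59.5),
}
baseline = {
    "LLaVA-OneVision-7B": ("LiveVLM", (66.3, 63.0, 57.3)),
    "Qwen2-VL-7B": ("InfiniPot-V", (65.8, 65.6, 62.8)),
    "Qwen2.5-VL-3B": ("InfiniPot-V", (62.1, 61.8, 59.3)),
}

def deltas(model):
    """StreamMem minus baseline, per benchmark, rounded to one decimal."""
    _, base = baseline[model]
    return tuple(round(s - b, 1) for s, b in zip(streammem[model], base))

for model in streammem:
    print(model, "vs", baseline[model][0], deltas(model))
```

The only negative entry is VideoMME-All on Qwen2-VL-7B (-0.7), which matches the caveat about reading the table rather than the paper's text.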

5.2 Streaming Video QA

Method	RVS-Ego Acc	RVS-Ego Score	RVS-Movie Acc	RVS-Movie Score
StreamMem	57.6	3.8	52.7	3.4
ReKV	63.7	4.0	54.4	3.6
ReKV w/o offloading	55.8	3.3	50.8	3.4
Flash-VStream	57.0	4.0	53.1	3.3
InfiniPot-V	57.9	3.5	51.4	3.5

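On streaming QA, StreamMem's advantage is smaller than on the offline benchmarks, but the results at least indicate that:

Under a fixed memory budget, query-agnostic compression can approach full-cache baselines: StreamMem trails ReKV with offloading (57.6 vs. 63.7 accuracy on RVS-Ego, 52.7 vs. 54.4 on RVS-Movie) but beats ReKV without offloading (55.8 and 50.8);
It is better suited as a stable, general, deployable streaming memory mechanism than as a way to chase the best single number on any one benchmark.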

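5.3 Scaling the KV budget: why it can exceed Full KV

On Qwen2-VL-7B + MLVU:

StreamMem: 6K = 65.9, 12K = 66.0, 24K = 66.3
InfiniPot-V: 6K = 65.8, 12K = 66.0, 24K = 65.7
Full KV 50K = 65.9

This points to an interesting phenomenon: a larger raw KV cache is not necessarily better. When StreamMem's budget grows to 24K, it reaches 66.3 and surpasses Full KV at 65.9. This suggests that selective compression does more than save GPU memory; it may also act as denoising and redundancy removal.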

5.4 Ablation studies

(1) Proxy query

Query Type	Holistic	S.D.	M.D.	All
True Query	68.1	71.6	46.4	68.1
Generic Text Query	66.7	71.3	43.0	66.7
Chat Template Query	66.9	71.5	43.0	66.9

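This experiment shows directly that, although the true query is best (68.1 overall), the chat template query (66.9) is on par with the generic text query (66.7), so it can indeed serve as a query-agnostic proxy.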

(2) KV merging

KV Merging Strategy	Holistic	S.D.	M.D.	All
No Merging	77.3	69.7	42.8	65.6
Avg. Merging	80.5	70.4	41.3	66.3
Weighted Merging	78.8	71.5	43.0	66.9

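Weighted merging is best overall (66.9), which suggests that weighting the prototype by attention scores is more effective than simple averaging (66.3), although average merging scores highest on the Holistic subset (80.5).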

(3) Input filtering

Input Filtering	All
No Filtering	65.4
$δ = 0.90$	66.1
$δ = 0.95$	66.9
$δ = 0.97$	66.6

The threshold $δ = 0.95$ is the sweet spot.

(4) YaRN scaling

$λ = 1$ : 61.5
$λ = 2$ : 65.4
$λ = 4$ : 66.8
$λ = 8$ : 66.9

This shows that appropriate context extension is critical in long-video streaming scenarios.

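5.5 Limitations / Conclusion

StreamMem's strengths are clear:

It genuinely satisfies the query-agnostic streaming setting;
It is entirely training-free;
It achieves strong offline long-video results under a fixed KV budget.

The results also show its limits:

On some benchmarks it does not consistently beat query-aware or full-KV methods;
Its ceiling is still bounded by the quality of the proxy query, since the chat template is only an approximation of a generic query;
As of 2026-03-09, the project page is public, but no clearly identified open-source code repository was found, so reproduction still takes effort.

Overall, StreamMem's main contribution is a clean answer: even when future questions are unknown, proxy query + pruning + merging can retain a sufficiently strong long-term video memory under a fixed memory budget.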
