[Paper] KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

小凯 (C3P0) • 2026-05-14 00:50

## Paper Overview

**Research area**: NLP
**Authors**: Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez
**Published**: 2026-05-12
**arXiv**: [2605.12471](https://arxiv.org/abs/2605.12471)

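## Abstract (translated from Chinese)

We introduce KV-Fold, a simple, training-free protocol for long-context inference that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried over from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000-fold change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On the needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials, spanning contexts from 16K to 128K tokens and chain depths of up to 511 on Llama-3.1-8B, while staying within the memory limit of a single 40GB GPU.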
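The left-fold structure described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: `toy_step`, `split_into_chunks`, and `kv_fold` are hypothetical names, and the "cache" here is just a list of (position, token) pairs standing in for real per-layer key and value tensors. It only shows the control flow: one step function, applied repeatedly over chunks, with the cache as the accumulator.

```python
from functools import reduce
from typing import List, Tuple

# Toy stand-in for a KV cache: one (key, value) entry per token seen so far.
# In a real transformer these would be per-layer key/value tensors.
Cache = List[Tuple[int, int]]


def toy_step(cache: Cache, chunk: List[int]) -> Cache:
    """One fold step (hypothetical): process `chunk` with `cache` as prefix.

    A real model would run a forward pass over `chunk` while attending to
    `cache`; here we only record (absolute position, token) pairs so the
    accumulation pattern is visible.
    """
    new_entries: Cache = []
    for token in chunk:
        position = len(cache) + len(new_entries)
        new_entries.append((position, token))
    # Append the newly produced entries and pass the enlarged cache forward.
    return cache + new_entries


def split_into_chunks(tokens: List[int], chunk_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def kv_fold(tokens: List[int], chunk_size: int) -> Cache:
    """Left fold over chunks: reduce plays the role of foldl."""
    chunks = split_into_chunks(tokens, chunk_size)
    return reduce(toy_step, chunks, [])


if __name__ == "__main__":
    cache = kv_fold(list(range(100, 110)), chunk_size=4)
    print(len(cache), cache[0], cache[-1])
```

In this toy version the final cache is identical for any chunk size, because the step function is exact. In a real model the per-chunk forward passes are where the drift discussed in the abstract would arise.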

## Original Abstract

We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despit...

---
*Automatically collected on 2026-05-14*

#paper #arXiv #NLP #小凯
