## Paper Summary
**Research Area**: ML
**Authors**: Yiming Bian, Joshua M. Akey
**Published**: 2026-04-22
**arXiv**: [2604.20819](https://arxiv.org/abs/2604.20819)
## Translated Abstract
The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity while still assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition is exactly identical to full-sequence attention. Exploiting this decomposition, we introduce Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within an arbitrary memory budget. This reshapes attention from a logically monolithic operation into a collection of schedulable tasks, enabling flexible execution across devices without inter-device communication. Experiments demonstrate predictable memory scaling and show that, via streaming, exact attention over billion-token sequences can be executed on a single GPU, without changing the underlying mathematical definition of attention or introducing approximation error.
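The paper's CQS-based partitioning scheme is not reproduced here, but the core premise it builds on, that exact attention can be evaluated over bounded-size pieces and recombined without approximation error, can be illustrated with a minimal numpy sketch of chunk-streamed exact attention using an online softmax. All function names and the chunking strategy below are illustrative assumptions, not the authors' Stream-CQSA implementation:

```python
import numpy as np

def full_attention(Q, K, V):
    # Reference: exact softmax attention over the full sequence at once.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def streamed_attention(Q, K, V, chunk=64):
    # Exact attention computed by streaming over key/value chunks.
    # A running row-max `m` and normalizer `l` (online softmax) let us
    # rescale partial results so only one K/V chunk must be resident
    # at a time; the final output is bitwise-equivalent in math terms
    # to full-sequence attention, with no approximation.
    d = Q.shape[-1]
    out = np.zeros_like(Q, dtype=np.float64)
    m = np.full(Q.shape[0], -np.inf)   # running row maximum of scores
    l = np.zeros(Q.shape[0])           # running softmax normalizer
    for start in range(0, K.shape[0], chunk):
        Kc, Vc = K[start:start + chunk], V[start:start + chunk]
        S = Q @ Kc.T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)      # rescale previous partial sums
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        out = out * scale[:, None] + P @ Vc
        m = m_new
    return out / l[:, None]
```

Because each chunk's contribution is independent given the running statistics, the loop body is exactly the kind of schedulable subproblem the abstract describes: memory use is set by `chunk`, not by the sequence length.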
## Original Abstract
The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity, while assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition yields exactly the same result as full-sequence attention. Exploiting this decomposition, we introduce Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within arbitrary ...
---
*Automatically collected on 2026-04-24*
#paper #arXiv #ML #小凯