[Paper] Elastic Attention Cores for Scalable Vision Transformers

Paper Overview

Field: CV; Authors: Alan Z. Song, Yinjie Chen, Mu Nan, Rui Zhang, Jiahang Cao, Weijian Mai, Muquan Yu, Hossein Adeli, Deva Ramanan, Michael J. Tarr, Andrew F. Luo; Published: 2026-05-12; arXiv: 2605.12491

Abstract (translated from Chinese)

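Vision Transformers (ViTs) achieve strong data-driven scaling through all-to-all self-attention, but the computational cost of this flexibility grows quadratically with image resolution, which limits ViTs in high-resolution domains. This paper challenges the assumption that pairwise token interactions are necessary for rich visual-semantic representations, and shows that effective visual representations can be learned without any direct patch-to-patch interaction. The authors propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time, core-periphery structured attention enabled by a small set of learned cores. In VECA, the cores act as a communication interface: patch tokens exchange information only through the core tokens, and the cores are initialized from scratch and propagated across layers. Because the N image patches interact directly only with a predetermined number C of learned "core" embeddings, the complexity is linear, O(N) in the number of patches for fixed C, which sidesteps quadratic scaling. Unlike prior cross-attention architectures, VECA maintains and iteratively updates the full set of N input tokens, avoiding a C-way bottleneck. Combined with nested training along the core axis, the model can elastically trade compute for accuracy at inference time. On classification and dense tasks, VECA achieves performance comparable to recent vision foundation models at reduced computational cost.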
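The core-periphery pattern described in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names (`cross_attend`, `core_periphery_layer`, `score_counts`) are hypothetical, and several details are assumptions made only for illustration: the order of the two steps (cores read from patches, then patches read from cores), the residual updates, the use of a prefix of the core list to mimic elastic inference, and the omission of learned projections, normalization, multiple heads, and MLP blocks.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query vector attends over every context vector.

    Assumption: keys and values are the raw context vectors; a real model
    would apply learned query/key/value projections here.
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)])
    return outputs


def add(xs, ys):
    """Element-wise sum of two equally shaped lists of vectors (residual update)."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def core_periphery_layer(patches, cores, num_active_cores=None):
    """One illustrative layer with no direct patch-to-patch attention.

    Assumption: elastic inference is mimicked by keeping only the first
    `num_active_cores` cores (a nested prefix of the core list).
    """
    active = cores if num_active_cores is None else cores[:num_active_cores]
    # Step 1: cores gather information from all patches (C x N scores).
    new_cores = add(active, cross_attend(active, patches))
    # Step 2: every patch reads back from the cores only (N x C scores).
    new_patches = add(patches, cross_attend(patches, new_cores))
    return new_patches, new_cores


def score_counts(num_patches, num_cores):
    """Attention-score entries per layer: full self-attention vs. this sketch."""
    return {
        "self_attention": num_patches ** 2,
        "core_periphery": 2 * num_patches * num_cores,
    }


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # N = 4 toy tokens
    cores = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.0]]                # C = 3 toy cores
    new_patches, new_cores = core_periphery_layer(patches, cores, num_active_cores=2)
    print(len(new_patches), len(new_cores))
    print(score_counts(1024, 16))
```

In this sketch all N patch tokens are kept and updated in every layer, so the output still has one vector per patch, while the number of score entries grows as 2·N·C rather than N². For N = 1024 patches and C = 16 cores, that is 32,768 entries against 1,048,576 for full self-attention.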

Original Abstract

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch ...

--- *Automatically collected on 2026-05-14*

#paper #arXiv #CV #小凯