Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Paper Overview

Research area: CV | Authors: Shuhong Zheng, Michael Oechsle, Erik Sandström | Published: 2026-05-26 | arXiv: 2505.21383

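Abstract (translated from Chinese)

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, jointly predicting multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length because of the global attention layers inside these models, which limits both their scalability and their efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify the frames that should be preserved. Second, an intra-frame selection step further discards the more redundant tokens within the selected frames. Our analysis highlights the advantages of a diversity-based inter-frame selection strategy, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is needed, with the selection process guided by the entropy of the global attention patterns. Our method offers a better speed-accuracy trade-off than existing solutions. Extensive experiments show that it accelerates visual geometry transformers by more than 85% on scenes with 500 images while maintaining or even improving baseline performance, suggesting that our token selection strategy can play a key role in future applications of visual geometry transformers. Project website: https://zsh2000.github.io/good-token-hunting.github.io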
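To make the two-stage idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the algorithms, so greedy farthest-point sampling stands in for "diversity-based" inter-frame selection, and a linear map from normalized attention entropy to a per-layer keep ratio stands in for "entropy-guided, layer-aware" intra-frame selection. All function names, the per-frame descriptors, the token scores, and the direction of the entropy mapping (diffuse attention keeps more tokens) are assumptions made for illustration.

```python
import math


def farthest_point_frames(frame_feats, k):
    """Inter-frame step (assumed): greedily pick k mutually distant frames.

    frame_feats is a list of per-frame descriptor vectors (hypothetical input).
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    n = len(frame_feats)
    selected = [0]  # arbitrary starting frame
    # Distance from every frame to its nearest already-selected frame.
    min_dist = [dist(f, frame_feats[0]) for f in frame_feats]
    while len(selected) < min(k, n):
        candidates = [i for i in range(n) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, f in enumerate(frame_feats):
            min_dist[i] = min(min_dist[i], dist(f, frame_feats[nxt]))
    return sorted(selected)


def attention_entropy(row):
    """Shannon entropy (in nats) of one normalized attention row."""
    return -sum(w * math.log(w) for w in row if w > 0)


def keep_ratio_from_entropy(attn_rows, min_ratio=0.1, max_ratio=1.0):
    """Intra-frame budget for one layer (assumed mapping).

    Mean row entropy is normalized by log(n), the entropy of uniform
    attention over n keys (n > 1), then mapped linearly to a keep ratio.
    """
    n = len(attn_rows[0])
    mean_h = sum(attention_entropy(r) for r in attn_rows) / len(attn_rows)
    norm_h = mean_h / math.log(n)
    return min_ratio + (max_ratio - min_ratio) * norm_h


def select_tokens(token_scores, ratio):
    """Keep the top ceil(ratio * n) tokens by score; returns sorted indices."""
    k = max(1, math.ceil(ratio * len(token_scores)))
    order = sorted(range(len(token_scores)),
                   key=lambda i: token_scores[i], reverse=True)
    return sorted(order[:k])


# Toy usage: five frames with 2-D descriptors, two of them near-duplicates.
frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
kept_frames = farthest_point_frames(frames, 3)

# A layer with sharply peaked attention gets the minimum budget.
peaked_layer = [[1.0, 0.0, 0.0, 0.0]]
ratio = keep_ratio_from_entropy(peaked_layer)
kept_tokens = select_tokens([0.1, 0.9, 0.5, 0.3], 0.5)
```

With N query tokens and K selected key/value tokens, the attention score computation scales with N·K rather than N², which is where a selection scheme of this kind would save work.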

Original Abstract

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step further discards more redundant tok...

--- *Automatically collected on 2026-05-26*

#paper #arXiv #CV #小凯
