[Paper] Quantifying Concentration Phenomena of Mean-Field Transformers in the ...

Paper Overview

Field: ML Authors: Albert Alcalde, Leon Bungert, Konstantin Riedl Published: 2025-05-09 arXiv: 2505.07241

Abstract (translated from the Chinese)

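Transformers, with self-attention modules as their core components, have become an integral architecture in modern large language and foundation models. This paper studies the evolution of tokens in deep encoder-only transformers at inference time, which in the large-token limit is described by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance between the two distributions scales like sqrt(log(beta+1)/beta) * exp(Ct) + exp(-ct) in the temperature parameter beta^(-1) -> 0 and the inference time t >= 0. To prove this, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as t -> infinity, and use a stability estimate in Wasserstein space combined with a quantitative Laplace principle to couple the two equations. Our results show that, on time scales of order log(beta), the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and further complement our theory by showing that, for finite beta and large t, the dynamics enter a different terminal phase dominated by the spectrum of the value matrix.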
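
To see why the abstract singles out time scales of order log(beta), the quoted bound can be evaluated numerically. The sketch below is only an illustration: the abstract does not give the constants C and c, so the values used here are placeholders, and the function names and the choice t = kappa * log(beta) are assumptions made for this example. With that choice the first term behaves like sqrt(log(beta)) * beta^(C*kappa - 1/2), so it decays in beta only when kappa < 1/(2C), while the second term equals beta^(-c*kappa).

```python
import math


def wasserstein_bound(beta, t, C=1.0, c=1.0):
    """Evaluate sqrt(log(beta+1)/beta) * exp(C*t) + exp(-c*t).

    C and c are placeholder constants (assumed here); the abstract
    does not state their values.
    """
    growth = math.sqrt(math.log(beta + 1.0) / beta) * math.exp(C * t)
    decay = math.exp(-c * t)
    return growth + decay


def bound_at_log_time(beta, kappa=0.25, C=1.0, c=1.0):
    """Evaluate the bound at t = kappa * log(beta) (hypothetical choice of t)."""
    t = kappa * math.log(beta)
    return wasserstein_bound(beta, t, C, c)


if __name__ == "__main__":
    # With kappa < 1/(2C), both terms shrink as beta grows.
    for beta in (1e2, 1e4, 1e6):
        print(f"beta={beta:.0e}  bound at t=0.25*log(beta): {bound_at_log_time(beta):.4f}")
    # For fixed beta and much larger t, the exp(C*t) factor dominates
    # and the bound stops being informative.
    print(f"beta=1e4, t=10: {wasserstein_bound(1e4, 10.0):.1f}")
```

Under these placeholder constants the bound shrinks as beta increases along t = 0.25 * log(beta), while for fixed beta and much larger t it grows without limit, which is consistent with the abstract's remark that a different regime takes over for finite beta and large t.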

Original Abstract

Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like sqrt(log(beta+1)/beta) * exp(Ct)...

--- *Automatically collected on 2026-05-13*

#paper #arXiv #ML #小凯