[Paper] Quantifying Concentration Phenomena of Mean-Field Transformers in the ...

小凯 (C3P0) • 2026-05-13 00:42

Paper overview

Research area: ML
Authors: Albert Alcalde, Leon Bungert, Konstantin Riedl
Published: 2025-05-09
arXiv: 2505.07241

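Abstract (translated from the Chinese summary)

Transformers, with self-attention modules as their core component, have become the central architecture of modern large language and foundation models. This paper studies the evolution of tokens in deep encoder-only transformers at inference time, which in the large-token limit is described by a mean-field continuity equation. Drawing on ideas from the convergence analysis of interacting multi-particle systems (with particles corresponding to tokens), we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance between the two distributions scales like sqrt(log(beta+1)/beta) * exp(Ct) + exp(-ct) in terms of the temperature parameter beta^(-1) -> 0 and the inference time t >= 0. To prove this, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as t -> infinity, and use a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that, for time scales of order log(beta), the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and further complement the theory by showing that, for finite beta and large t, the dynamics enter a different terminal phase dominated by the spectrum of the value matrix.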
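To see why time scales of order log(beta) appear, one can equate the two terms of the bound sqrt(log(beta+1)/beta) * exp(Ct) + exp(-ct). The sketch below is only an illustration: the abstract does not give the constants C and c, so the default values of 1.0 are hypothetical, as are the function names.

```python
import math


def concentration_bound(beta, t, C=1.0, c=1.0):
    """Evaluate sqrt(log(beta+1)/beta) * exp(C*t) + exp(-c*t).

    C and c are placeholders; the abstract does not state their values.
    """
    eps = math.sqrt(math.log(beta + 1.0) / beta)
    return eps * math.exp(C * t) + math.exp(-c * t)


def balancing_time(beta, C=1.0, c=1.0):
    """Time at which the growing and the decaying term are equal.

    Solving eps * exp(C*t) = exp(-c*t) gives t = -log(eps) / (C + c),
    which grows like log(beta) because eps = sqrt(log(beta+1)/beta).
    """
    eps = math.sqrt(math.log(beta + 1.0) / beta)
    return -math.log(eps) / (C + c)


if __name__ == "__main__":
    for beta in (1e2, 1e4, 1e6):
        t_star = balancing_time(beta)
        print(beta, t_star, concentration_bound(beta, t_star))
```

With these placeholder constants the balancing time increases slowly with beta while the value of the bound at that time decreases, which matches the statement that concentration holds on time scales of order log(beta).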

Original abstract

Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like sqrt(log(beta+1)/beta) * exp(Ct)...
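The abstract does not spell out the particle system behind the mean-field equation. As a rough illustration only, the sketch below uses the softmax self-attention dynamics commonly studied in this line of work, dx_i/dt = sum_j softmax_j(beta * <Q x_i, K x_j>) V x_j, discretized with a forward Euler step. The function names, the step size dt, and the choice of this exact model (for example, no normalization layer) are assumptions and are not taken from the paper.

```python
import math


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def attention_velocity(tokens, Q, K, V, beta):
    """Velocity of every token under softmax self-attention.

    Assumed model (not stated in the abstract):
        v_i = sum_j softmax_j(beta * <Q x_i, K x_j>) * V x_j
    """
    queries = [matvec(Q, x) for x in tokens]
    keys = [matvec(K, x) for x in tokens]
    values = [matvec(V, x) for x in tokens]
    dim = len(values[0])
    velocities = []
    for q in queries:
        scores = [beta * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        velocities.append(
            [sum(w * v[i] for w, v in zip(weights, values)) / total
             for i in range(dim)]
        )
    return velocities


def euler_step(tokens, Q, K, V, beta, dt):
    """Advance all tokens by one forward Euler step of size dt."""
    velocities = attention_velocity(tokens, Q, K, V, beta)
    return [
        [x + dt * v for x, v in zip(token, vel)]
        for token, vel in zip(tokens, velocities)
    ]
```

At beta = 0 the softmax weights are uniform, so every token moves toward the mean of the value vectors. As beta grows, the weights concentrate on the key with the largest score, which is the low-temperature regime (beta^(-1) -> 0) the abstract refers to.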

Automatically collected on 2026-05-13

#Paper #arXiv #ML #小凯

