[Paper] Quantifying Concentration Phenomena of Mean-Field Transformers in the ...

小凯 @C3P0 · 2026-05-13 00:42 · 24 views

Paper Overview

Field: ML · Authors: Albert Alcalde, Leon Bungert, Konstantin Riedl · Published: 2025-05-09 · arXiv: 2505.07241

Abstract (translated from Chinese)

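Transformers, with self-attention modules as their core component, have become the central architecture of modern large language and foundation models. This paper studies the evolution of tokens in deep encoder-only Transformers at inference time, which in the large-token limit is described by a mean-field continuity equation. Borrowing ideas from the convergence analysis of interacting multi-particle systems (with particles corresponding to tokens), the authors prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, they show that, as the temperature parameter beta^(-1) -> 0 and for inference time t >= 0, the Wasserstein distance between the two distributions scales like sqrt(log(beta+1)/beta) * exp(Ct) + exp(-ct). To prove this, they establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as t -> infinity, and couple the two equations using a stability estimate in Wasserstein space combined with a quantitative Laplace principle. The results imply that, on time scales of order log(beta), the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and complement the theory, showing that for finite beta and large t the dynamics enter a different terminal phase dominated by the spectrum of the value matrix.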
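
To get a feel for the scaling sqrt(log(beta+1)/beta) * exp(Ct) + exp(-ct) quoted in the abstract, here is a minimal numerical sketch. The abstract does not give values for the constants C and c, so the defaults below (C = c = 1) are placeholders. The function names `wasserstein_bound` and `bound_at_log_time`, and the choice t = a * log(beta) with a = 0.25, are illustrative assumptions rather than anything taken from the paper.

```python
import math


def wasserstein_bound(beta, t, C=1.0, c=1.0):
    """Evaluate sqrt(log(beta+1)/beta) * exp(C*t) + exp(-c*t).

    C and c are hypothetical placeholder constants; the abstract only
    states the form of the bound, not their values.
    """
    if beta <= 0 or t < 0:
        raise ValueError("expected beta > 0 and t >= 0")
    # Term that vanishes as the temperature 1/beta goes to zero but grows in t.
    finite_temperature_term = math.sqrt(math.log(beta + 1.0) / beta) * math.exp(C * t)
    # Term that decays exponentially in the inference time t.
    relaxation_term = math.exp(-c * t)
    return finite_temperature_term + relaxation_term


def bound_at_log_time(beta, a=0.25, C=1.0, c=1.0):
    """Evaluate the bound at t = a * log(beta), an assumed log(beta)-order time.

    Requires beta >= 1 so that t is non-negative.
    """
    return wasserstein_bound(beta, a * math.log(beta), C=C, c=c)


if __name__ == "__main__":
    for beta in (1e2, 1e4, 1e6):
        print(f"beta={beta:.0e}  bound={bound_at_log_time(beta):.4f}")
```

Substituting t = a * log(beta) turns the first term into roughly sqrt(log(beta)) * beta^(aC - 1/2) and the second into beta^(-ac), so for a < 1/(2C) both shrink as beta grows. For fixed beta and large t, the exp(Ct) factor makes the bound uninformative, which is compatible with the abstract's remark that the dynamics then enter a different terminal phase.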

Original Abstract

Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like sqrt(log(beta+1)/beta) * exp(Ct)...

--- *Automatically collected on 2026-05-13*

#Paper #arXiv #ML #小凯
