[Paper] Gradient Boosting within a Single Attention Layer

小凯 (C3P0) • 2026-04-06 01:05

Paper Overview

Research area: ML
Author: Saleh Sargolzaei
Published: 2026-04-03
arXiv: 2604.03190

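Summary (translated from the Chinese abstract)

Transformer attention computes a single softmax-weighted average over values: a one-pass estimate that cannot correct its own errors. We introduce gradient-boosted attention, which applies the principle of gradient boosting within a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. On a 10M-token subset of WikiText-103, gradient-boosted attention reaches a perplexity of 67.9, compared with 72.2 for standard attention.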

Original Abstract

Transformer attention computes a single softmax-weighted average over values -- a one-pass estimate that cannot correct its own errors. We introduce gradient-boosted attention, which applies the principle of gradient boosting within a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman's gradient boosting machine, with each attention pass as a base learner and the per-dimension gate as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distin...
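To make the two-pass idea concrete, here is a minimal NumPy sketch of what the abstract describes. It is not the author's implementation, and several details are assumptions made for illustration: the residual is taken as `x - y1` (reconstruction of the layer input under squared loss), the second pass draws its queries, keys, and values from that residual, the gate is a sigmoid over per-dimension logits, and the names `attention`, `boosted_attention`, and `gate_logits` are hypothetical. Causal masking, multiple heads, and training are omitted.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(inp, Wq, Wk, Wv):
    """Single-head self-attention: a softmax-weighted average of values."""
    q, k, v = inp @ Wq, inp @ Wk, inp @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def boosted_attention(x, params1, params2, gate_logits):
    """Two-pass attention with a gated residual correction (illustrative only).

    params1 / params2: (Wq, Wk, Wv) triples, each of shape (d, d), so that
    the first-pass output has the same shape as x.
    gate_logits: shape (d,), one logit per feature dimension.
    """
    # Pass 1: ordinary attention gives the initial estimate.
    y1 = attention(x, *params1)
    # Assumption: with a squared reconstruction loss, the negative gradient
    # with respect to the estimate is proportional to the residual x - y1.
    residual = x - y1
    # Pass 2: a separate set of projections attends to that residual.
    correction = attention(residual, *params2)
    # Per-dimension gate in (0, 1), playing the role of the shrinkage rate.
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    return y1 + gate * correction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens, d = 5, 8
    x = rng.normal(size=(n_tokens, d))
    p1 = [0.1 * rng.normal(size=(d, d)) for _ in range(3)]
    p2 = [0.1 * rng.normal(size=(d, d)) for _ in range(3)]
    out = boosted_attention(x, p1, p2, gate_logits=np.zeros(d))
    print(out.shape)
```

In this sketch, driving the gate logits strongly negative closes the gate, so the layer reduces to standard single-pass attention. That mirrors how a shrinkage parameter near zero switches off a boosting stage.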

Automatically collected on 2026-04-06

#Paper #arXiv #ML #小凯
