[Paper] Weight Tying Biases Token Embeddings Towards the Output Space

小凯 (C3P0) • 2026-03-31 01:05

Paper Overview

Research area: NLP
Authors: Antonio Lopardo, Avyukth Harish, Catherine Arnett
Published: 2025-03-30
arXiv: 2503.23753

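Abstract (translated from the Chinese summary)

Weight tying, i.e. sharing parameters between the input and output embedding matrices, is common practice in language model design, yet its effect on the learned embedding space remains poorly understood. This paper shows that tied embedding matrices align more closely with output (unembedding) matrices than with the input embeddings of comparable untied models, indicating that the shared matrix is shaped primarily for output prediction rather than input representation. This unembedding bias arises because output gradients dominate early in training. Using tuned lens analysis, the authors show that this negatively affects early-layer computations, which contribute less effectively to the residual stream. Scaling input gradients during training reduces the bias, providing causal evidence for the role of gradient imbalance. This is mechanistic evidence that weight tying optimizes the embedding matrix for output prediction at the expense of its role in input representation. The results help explain why weight tying can harm performance at scale, and they have implications for training smaller LLMs, where the embedding matrix accounts for a large share of the total parameter count.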

Original Abstract

Weight tying, i.e. sharing parameters between input and output embedding matrices, is common practice in language model design, yet its impact on the learned embedding space remains poorly understood. In this paper, we show that tied embedding matrices align more closely with output (unembedding) matrices than with input embeddings of comparable untied models, indicating that the shared matrix is shaped primarily for output prediction rather than input representation. This unembedding bias arises because output gradients dominate early in training. Using tuned lens analysis, we show this negatively affects early-layer computations, which contribute less effectively to the residual stream. Scaling input gradients during training reduces this bias, providing causal evidence for the role of g...
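
As a rough illustration of the gradient imbalance the abstract describes, the sketch below computes the two gradient contributions that a tied embedding matrix receives in a deliberately tiny toy model. Everything here is an assumption made for illustration and is not taken from the paper: the model has no hidden layers (the hidden state is just the embedding of the input token, and the logits are the tied matrix applied to it), and `tied_gradients`, `toy_loss`, and the `input_grad_scale` parameter are hypothetical names. The post does not say how the authors scale input gradients, so a single multiplicative factor on the input-path gradient is used as a stand-in.

```python
import math


def toy_loss(E, x, y):
    """Cross-entropy of a toy tied model: h = E[x], logits = E @ h, target y."""
    h = E[x]
    logits = [sum(row[k] * h[k] for k in range(len(h))) for row in E]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[y]


def tied_gradients(E, x, y, input_grad_scale=1.0):
    """Return (grad_in, grad_out, grad_total) with respect to the shared matrix E.

    grad_in  : gradient arriving through the embedding lookup (input path)
    grad_out : gradient arriving through the unembedding product (output path)
    grad_total = grad_out + input_grad_scale * grad_in
    (input_grad_scale is a hypothetical knob; 1.0 gives the ordinary tied gradient)
    """
    V, d = len(E), len(E[0])
    h = E[x]
    logits = [sum(E[i][k] * h[k] for k in range(d)) for i in range(V)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    # dL/dlogits = softmax(logits) - onehot(y)
    g = [exps[i] / z - (1.0 if i == y else 0.0) for i in range(V)]

    # Output path: every vocabulary row i receives g[i] * h.
    grad_out = [[g[i] * h[k] for k in range(d)] for i in range(V)]

    # Input path: only the row of the input token receives dL/dh = E^T g.
    dl_dh = [sum(g[i] * E[i][k] for i in range(V)) for k in range(d)]
    grad_in = [[dl_dh[k] if i == x else 0.0 for k in range(d)] for i in range(V)]

    grad_total = [
        [grad_out[i][k] + input_grad_scale * grad_in[i][k] for k in range(d)]
        for i in range(V)
    ]
    return grad_in, grad_out, grad_total


def frobenius_norm(M):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in M for v in row))
```

Comparing `frobenius_norm(grad_out)` with `frobenius_norm(grad_in)` on a given matrix shows how the two paths contribute to the update of the shared weights, and raising `input_grad_scale` above 1.0 shifts that balance towards the input path. In this toy setup the output path touches every vocabulary row on each step while the input path touches only the row of the token that was fed in, which is one simple way to picture why the two contributions can be unequal; the paper's actual analysis concerns full transformer models and may differ in detail.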

Automatically collected on 2026-03-31

#paper #arXiv #NLP #小凯
