[Paper] Grounded Token Initialization for New Vocabulary in LMs for Generative...

Paper Overview

Research area: NLP | Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang | Published: 2025-04-01 | arXiv: 2504.01260

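Chinese Abstract (translated)

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding new tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general knowledge for the new-token domain. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, before fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analysis shows that grounded embeddings produce richer inter-token structure that persists through fine-tuning, supporting the hypothesis that initialization quality is a key bottleneck in vocabulary extension.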
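The collapse described in the abstract can be illustrated with a toy NumPy sketch. This is not the paper's code: the embedding table is random stand-in data, and `mean_init` and `numerical_rank` are hypothetical helper names chosen for illustration. It shows only the baseline (mean initialization) and a simple spectral check; GTI itself is not implemented here because the abstract does not give its details.

```python
import numpy as np


def mean_init(existing: np.ndarray, num_new: int) -> np.ndarray:
    """Initialize every new token embedding as the mean of the existing ones."""
    mean_vec = existing.mean(axis=0, keepdims=True)  # shape (1, dim)
    return np.repeat(mean_vec, num_new, axis=0)      # shape (num_new, dim)


def numerical_rank(matrix: np.ndarray, tol: float = 1e-8) -> int:
    """Count singular values larger than tol times the largest singular value."""
    s = np.linalg.svd(matrix, compute_uv=False)
    if s[0] == 0.0:
        return 0
    return int((s > tol * s[0]).sum())


# Toy stand-in for a pretrained embedding table (assumption: 1000 tokens, 64 dims).
rng = np.random.default_rng(0)
existing = rng.normal(size=(1000, 64))

new_tokens = mean_init(existing, num_new=32)

# Spectral diagnostic: all new rows are identical, so they span a single direction.
rank_new = numerical_rank(new_tokens)
rank_existing = numerical_rank(existing)

# Geometric diagnostic: largest coordinate-wise gap between any new token and the first.
max_gap = float(np.abs(new_tokens - new_tokens[0]).max())

print("rank of new-token block:", rank_new)
print("rank of existing table:", rank_existing)
print("max gap between new tokens:", max_gap)
```

In this toy setting the 32 new embeddings are identical at initialization, so any distinction between them has to be created entirely by fine-tuning, which is the bottleneck the abstract points to.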

Original Abstract

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically...

--- *Automatically collected on 2026-04-04*

#Paper #arXiv #NLP #小凯
