## Paper Overview
**Field**: NLP/AI
**Authors**: Daiwei Chen, Zhoutong Fu, Chengming Jiang
**Published**: 2026-04-02
**arXiv**: [2604.02324](https://arxiv.org/abs/2604.02324)
## Abstract
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
---
*Collected automatically on 2026-04-05*
#Paper #arXiv #NLP #小凯