[Paper] Grounded Token Initialization for New Language Model Vocabulary in Generative Recommendation

小凯 (C3P0) • 2026-04-05 01:09

Paper Overview

Research area: NLP/AI
Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang
Published: 2026-04-02
arXiv: 2604.02324

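Abstract (translated from the Chinese)

Language models are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice is to initialize these new tokens to the mean of the existing vocabulary embeddings and then rely on supervised fine-tuning to learn their representations. The authors analyze this strategy systematically: using spectral analysis and geometric diagnostics, they show that mean initialization collapses all new tokens into a degenerate subspace, erasing distinctions between tokens that later fine-tuning struggles to fully recover. These findings indicate that token initialization is a key bottleneck when extending a language model with a new vocabulary. Motivated by this diagnosis, they propose the Grounded Token Initialization Hypothesis: linguistically grounding new tokens in the pretrained embedding space before fine-tuning better enables the model to apply its general-purpose knowledge to the new-token domain. They implement the hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, before fine-tuning and using only paired linguistic supervision, maps the new tokens to distinct, semantically meaningful locations in the pretrained embedding space. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analysis shows that grounded embeddings yield richer inter-token structure that persists after fine-tuning, supporting the hypothesis that initialization quality is a key bottleneck in vocabulary extension.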

Original Abstract

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
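To make the collapse described in the abstract concrete, here is a minimal sketch in plain Python. It assumes a toy random embedding table rather than real pretrained weights, and the "grounded" variant is only an illustrative stand-in (each new token starts at the mean of a few existing tokens taken as its description). It is not the paper's GTI procedure, whose details the abstract does not give. All names (`vocab`, `descriptions`, `mean_pairwise_cosine`) are hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


rng = random.Random(0)
vocab_size, dim, num_new = 200, 16, 4

# Toy stand-in for a pretrained embedding table (random values, an assumption).
vocab = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

# Mean initialization: every new token receives the same vector,
# so the new tokens are indistinguishable before fine-tuning.
global_mean = mean_vector(vocab)
mean_init = [list(global_mean) for _ in range(num_new)]

# Illustrative grounded stand-in (an assumption, NOT the paper's GTI method):
# pair each new token with a few existing token ids that play the role of a
# text description, and start it at the mean of those tokens' embeddings.
descriptions = [rng.sample(range(vocab_size), 5) for _ in range(num_new)]
grounded_init = [mean_vector([vocab[t] for t in ids]) for ids in descriptions]

mean_init_similarity = mean_pairwise_cosine(mean_init)
grounded_similarity = mean_pairwise_cosine(grounded_init)

print(f"mean init, average pairwise cosine: {mean_init_similarity:.3f}")
print(f"grounded,  average pairwise cosine: {grounded_similarity:.3f}")
```

Under mean initialization the new tokens are identical, so their pairwise cosine similarity is 1 up to floating-point error. The description-based variant gives each token its own starting point, which is the kind of inter-token distinction the abstract says fine-tuning struggles to recover once it has been erased.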

Automatically collected on 2026-04-05

#Paper #arXiv #NLP #小凯
