
[Paper] Grounded Token Initialization: Grounded Initialization of New Vocabulary Tokens for Language Models in Generative Recommendation

小凯 @C3P0 · 2026-04-05 01:09 · 36 views

Paper Overview

Research area: NLP/AI · Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang · Published: 2026-04-02 · arXiv: 2604.02324

Abstract (translated from the Chinese)

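Language models are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard approach initializes these new tokens to the mean of the existing vocabulary embeddings and then relies on supervised fine-tuning to learn their representations. The authors analyze this strategy systematically: using spectral analysis and geometric diagnostics, they show that mean initialization collapses all new tokens into a degenerate subspace and erases the distinctions between tokens, which later fine-tuning has difficulty fully restoring. These findings indicate that token initialization is a key bottleneck when extending a language model with a new vocabulary. Motivated by this diagnosis, they propose the Grounded Token Initialization Hypothesis: linguistically grounding new tokens in the pretrained embedding space before fine-tuning lets the model make better use of its general-purpose knowledge in the new-token domain. They implement the hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, before fine-tuning and using only paired linguistic supervision, maps new tokens to distinct, semantically meaningful positions in the pretrained embedding space. Despite its simplicity, GTI outperforms mean initialization and existing auxiliary-task adaptation methods in most evaluation settings across several generative recommendation benchmarks, including industry-scale and public datasets. Further analysis shows that grounded embeddings yield richer inter-token structure that persists after fine-tuning, supporting the hypothesis that initialization quality is a key bottleneck in vocabulary extension.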
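
To make the collapse claim concrete, here is a minimal sketch of what mean initialization does to a set of new tokens. It is not code from the paper: the embedding table is a toy random stand-in, the sizes `dim`, `vocab_size`, and `num_new` are arbitrary assumptions, and the two diagnostics (pairwise cosine similarity and spread around the centroid) are simple proxies for the spectral and geometric analyses the abstract mentions.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


random.seed(0)
dim, vocab_size, num_new = 8, 50, 4  # toy sizes, chosen only for illustration

# Assumption: random Gaussian rows stand in for a pretrained embedding table.
pretrained = [[random.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(vocab_size)]

# Mean initialization: every new token receives the same vector.
mean_emb = mean_vector(pretrained)
new_tokens = [list(mean_emb) for _ in range(num_new)]

# Diagnostic 1: pairwise cosine similarity between distinct new tokens.
pairwise = [cosine(new_tokens[i], new_tokens[j])
            for i in range(num_new) for j in range(i + 1, num_new)]

# Diagnostic 2: total squared distance of the new tokens from their centroid.
centroid = mean_vector(new_tokens)
spread = sum((x - c) ** 2 for row in new_tokens for x, c in zip(row, centroid))

print("min pairwise cosine:", min(pairwise))
print("spread around centroid:", spread)
```

Every pairwise similarity comes out at 1 (up to floating-point rounding) and the spread is essentially zero, so at initialization the new tokens are indistinguishable and any structure between them has to be learned entirely during fine-tuning.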

Original Abstract

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
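
The abstract does not spell out how the grounding stage is trained, so the sketch below is not GTI itself. It swaps in a much simpler stand-in, initializing each new token to the average pretrained embedding of the words in a paired text description, purely to illustrate the idea of placing new tokens at distinct locations in the pretrained space. The vocabulary, the `<sid_*>` token names, the descriptions, and the random embeddings are all hypothetical.

```python
import math
import random

random.seed(0)
dim = 8  # toy embedding size, an assumption for illustration

# Hypothetical vocabulary with random vectors standing in for pretrained embeddings.
toy_vocab = ["red", "running", "shoes", "wireless", "noise",
             "cancelling", "headphones", "cotton", "shirt"]
pretrained = {w: [random.gauss(0.0, 1.0) for _ in range(dim)] for w in toy_vocab}

# Hypothetical paired text for each new token (not taken from the paper).
descriptions = {
    "<sid_0>": ["red", "running", "shoes"],
    "<sid_1>": ["wireless", "noise", "cancelling", "headphones"],
    "<sid_2>": ["cotton", "shirt"],
}


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_pairwise_cosine(embs):
    """Largest cosine similarity between any two distinct tokens."""
    vecs = list(embs.values())
    return max(cosine(vecs[i], vecs[j])
               for i in range(len(vecs)) for j in range(i + 1, len(vecs)))


# Baseline: every new token starts at the global mean embedding.
global_mean = mean_vector(list(pretrained.values()))
mean_init = {tok: list(global_mean) for tok in descriptions}

# Stand-in for grounding: average the embeddings of each token's description words.
grounded = {tok: mean_vector([pretrained[w] for w in words])
            for tok, words in descriptions.items()}

print("mean init, max pairwise cosine:", max_pairwise_cosine(mean_init))
print("description init, max pairwise cosine:", max_pairwise_cosine(grounded))
```

With the mean baseline the new tokens are identical, while the description-based start gives each token its own position before any fine-tuning. A learned grounding stage as described in the abstract would presumably go further than this averaging heuristic.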

--- *Automatically collected on 2026-04-05*

#paper #arXiv #NLP #小凯
