Paper overview
Research area: NLP
Authors: Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An, Yihang Chen, Cho-Jui Hsieh
Published: 2026-06-09
arXiv: 2606.11189
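Summary
Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory, but an observed token can be non-unique, noisy, or misaligned with the model prior. This paper reinterprets SFT as target distribution design: it analyzes the token-level target that the loss drives the model to match. It introduces the Q-target framework, which decomposes SFT supervision into two choices: how strongly to rely on the observed token, and how to allocate the remaining probability mass over alternatives. This view unifies many existing SFT variants as implicit choices of the target distribution Q. The proposed Target-SFT consistently outperforms baselines across 10 reasoning dataset-model settings.
Original abstract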
Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit cho...
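The two choices named in the abstract can be made concrete with a small sketch. This is an illustration only, not the paper's formulation: the mixture form Q = alpha * one_hot + (1 - alpha) * r, the names `q_target`, `soft_cross_entropy`, `alpha`, and `alt_weights`, and the toy probabilities are all assumptions introduced here. `alpha` stands for how strongly the target relies on the observed token, and `alt_weights` stands for how the remaining mass is shared among the alternatives.

```python
import math

def q_target(observed, alpha, alt_weights):
    """Hypothetical token-level target: alpha on the observed token,
    the remaining (1 - alpha) spread over the other tokens in
    proportion to alt_weights (an assumed parameterization)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Choice 2: shape of the mass over alternatives (observed token excluded).
    alt = [0.0 if i == observed else float(w) for i, w in enumerate(alt_weights)]
    total = sum(alt)
    if total <= 0.0:
        raise ValueError("alternatives need positive total weight")
    q = [(1.0 - alpha) * w / total for w in alt]
    # Choice 1: reliance on the observed token.
    q[observed] += alpha
    return q

def soft_cross_entropy(q, probs):
    """Cross-entropy between a target distribution q and model probabilities."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, probs) if qi > 0.0)

# Toy model distribution over a 4-token vocabulary (made-up numbers).
model_probs = [0.5, 0.3, 0.15, 0.05]
observed = 1

one_hot = q_target(observed, 1.0, [1.0] * 4)       # standard SFT target
uniform_rest = q_target(observed, 0.9, [1.0] * 4)  # uniform over alternatives
prior_rest = q_target(observed, 0.9, model_probs)  # alternatives follow the model prior

for name, q in [("one-hot", one_hot), ("uniform", uniform_rest), ("prior", prior_rest)]:
    print(name, [round(x, 4) for x in q], round(soft_cross_entropy(q, model_probs), 4))
```

With `alpha = 1.0` the target collapses to the one-hot case of ordinary SFT. The uniform variant resembles label smoothing, and the prior-shaped variant gives the leftover mass to tokens the model already prefers. Which targets the paper actually proposes cannot be recovered from the truncated abstract.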
Automatically collected on 2026-06-11
#Paper #arXiv #NLP #小凯