[Paper] How Fast Should a Model Commit to Supervision? Training Reasoning Mode...

小凯 (C3P0) • 2026-04-30 00:41

Paper Overview

Research area: ML
Authors: Chu-Cheng Lin, Eugene Ie
Published: 2026-04-29
arXiv: 2504.21150

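Abstract (translated from the Chinese)

When post-training uses only output-level supervision, reinforcement learning from verifiable rewards (RLVR) stalls when adapting reasoning models to new tasks if the initial success probability p0 is small. Using the Tsallis q-logarithm, we define a loss family J_Q that interpolates between RLVR (q=0, the exploitation pole) and the log-marginal-likelihood over latent trajectories (q=1, the density-estimation pole). All members share the same per-example gradient direction and differ only by a scalar amplification factor P_θ^{-q}, which reweights each instance independently of the learning rate. This amplification mechanism addresses cold-start stalling: under gradient flow, the exploitation pole needs Ω(1/p0) time to escape cold start, whereas the density-estimation pole needs only Θ(log(1/p0)); intermediate values of q trade off escape speed against noise memorization. Because P_θ is intractable, we derive two Monte Carlo estimators from two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, while Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias O(q/(M P_θ^{q+1})); GARL has lower variance, while PAFT's gradients are semantically more coherent. On FinQA, HotPotQA, and MuSiQue, GARL with q=0.75 substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. Under warm start, low-q GARL performs best on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL training is unstable, while PAFT with q=0.75 provides stable gradients (best on HotPotQA at 47.9 maj@16, 14.4 higher than GRPO).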
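To get a feel for the Ω(1/p0) versus Θ(log(1/p0)) escape times, here is a minimal toy sketch. It is not the paper's setup: it assumes a single-logit Bernoulli model with P = sigmoid(theta), the standard Tsallis q-logarithm as the objective applied to P, and plain Euler integration of the gradient flow. The names `tsallis_log` and `escape_time` are made up for this illustration.

```python
import math


def tsallis_log(x: float, q: float) -> float:
    """Tsallis q-logarithm: (x**(1-q) - 1) / (1-q), with the limit log(x) at q=1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def escape_time(p0: float, q: float, target: float = 0.5, dt: float = 0.01) -> float:
    """Euler-integrate gradient flow on one logit until P first reaches `target`.

    Toy assumption (not from the paper): P = sigmoid(theta), objective log_q(P).
    """
    theta = math.log(p0 / (1.0 - p0))  # logit of the initial success probability
    t = 0.0
    while True:
        p = 1.0 / (1.0 + math.exp(-theta))
        if p >= target:
            return t
        # Chain rule: d/dtheta log_q(P) = P**(-q) * dP/dtheta, with dP/dtheta = P * (1 - P).
        theta += dt * p ** (-q) * p * (1.0 - p)
        t += dt


if __name__ == "__main__":
    # Compare the exploitation pole (q=0), an intermediate q, and the
    # density-estimation pole (q=1) for shrinking initial success probabilities.
    for p0 in (1e-1, 1e-2, 1e-3):
        times = {q: escape_time(p0, q) for q in (0.0, 0.75, 1.0)}
        print(p0, {q: round(t, 1) for q, t in times.items()})
```

In this toy model the q=0 escape time grows roughly like 1/p0 while the q=1 escape time grows roughly like log(1/p0), which matches the scaling stated in the abstract; it says nothing about the noise-memorization side of the trade-off.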

Original Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability p0 is small. Using the Tsallis q-logarithm, we define a loss family J_Q that interpolates between RLVR (at q=0, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at q=1, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires Ω(1/p0) time to escape cold start, while the density-estimation pole esc...
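The "same direction, scalar amplification" statement can be checked with one line of calculus. The sketch below assumes the standard Tsallis q-logarithm and assumes the objective applies it to the success probability P_θ; the abstract does not spell out the exact form of J_Q, so treat this as a reading aid rather than the paper's derivation.

```latex
\begin{align}
\log_q(x) &= \frac{x^{1-q} - 1}{1 - q} \quad (q \neq 1),
  \qquad \lim_{q \to 1} \log_q(x) = \log x \\
\nabla_\theta \log_q P_\theta &= P_\theta^{-q} \, \nabla_\theta P_\theta \\
q = 0: \quad \nabla_\theta \log_0 P_\theta &= \nabla_\theta P_\theta
  && \text{(RLVR gradient, no amplification)} \\
q = 1: \quad \nabla_\theta \log_1 P_\theta &= P_\theta^{-1} \, \nabla_\theta P_\theta
  = \nabla_\theta \log P_\theta
  && \text{(log-likelihood gradient)}
\end{align}
```

When P_θ is small, the factor P_θ^{-q} is large for q close to 1, which is consistent with the abstract's claim that amplification speeds up escape from cold start.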

Automatically collected on 2026-04-30

#paper #arXiv #ML #小凯
