[Paper] Rethinking the Divergence Regularization in LLM RL

Paper Overview

Research area: ML | Authors: Jiarui Yao, Xiangxin Zhou, Penghui Qi | Published: 2025-06-06 | arXiv: 2506.04842

Abstract (translated from Chinese)

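Reinforcement learning (RL) has become a key component of post-training for large language models (LLMs). In practice, LLM RL is usually off-policy because of training-inference mismatch and policy staleness, which makes trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with ratio clipping, but over a long-tailed vocabulary the importance ratio can be a poor proxy for distributional shift. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the absolute probability shift of the sampled token. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth, advantage-weighted quadratic regularizer. DRPO preserves the same trust-region geometry as DPPO while producing bounded, continuous gradient weights that attenuate divergent updates and provide a corrective signal outside the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.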
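The contrast the abstract draws between a hard mask and a smooth quadratic regularizer can be illustrated with a toy per-token gradient weight. The sketch below is not the paper's formulation: the function names, the threshold `delta`, and the specific surrogate `A*d - |A|/(2*delta)*d**2` (with `d` the sampled token's absolute probability shift) are assumptions chosen only to reproduce the qualitative behavior described, namely a weight that is continuous, reaches zero at the boundary, and changes sign beyond it.

```python
def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 according to the sign of x."""
    return (x > 0) - (x < 0) + 0.0


def hard_mask_weight(p_new: float, p_old: float, advantage: float, delta: float) -> float:
    """Assumed DPPO-style hard mask on the absolute probability shift.

    The gradient is dropped (weight 0) once the token has moved more than
    `delta` in the direction the advantage is pushing it; otherwise it is kept.
    """
    shift = p_new - p_old
    crossed = sign(advantage) * shift > delta
    return 0.0 if crossed else 1.0


def quadratic_weight(p_new: float, p_old: float, advantage: float, delta: float) -> float:
    """Assumed smooth alternative (hypothetical surrogate, not the paper's).

    Differentiating A*d - |A|/(2*delta)*d**2 with respect to p_new gives
    A * (1 - sign(A)*d/delta); the factor in parentheses is returned here.
    It is 1 at d = 0, 0 at the boundary, and negative beyond it, so an
    overshooting token receives a corrective signal instead of none.
    Because |d| <= 1, the weight stays within [1 - 1/delta, 1 + 1/delta].
    """
    shift = p_new - p_old
    return 1.0 - sign(advantage) * shift / delta


if __name__ == "__main__":
    # Positive advantage, old probability 0.5, assumed threshold 0.1.
    for p_new in (0.50, 0.55, 0.60, 0.70):
        hard = hard_mask_weight(p_new, 0.5, 1.0, 0.1)
        soft = quadratic_weight(p_new, 0.5, 1.0, 0.1)
        print(f"p_new={p_new:.2f}  hard={hard:.1f}  quadratic={soft:+.2f}")
```

Under these assumptions the hard mask jumps from 1 to 0 at the boundary, while the quadratic weight decays linearly through zero, which is one simple way to obtain the attenuation and corrective signal the abstract attributes to DRPO.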

Original Abstract

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded r...

--- *Automatically collected on 2026-06-10*

#Paper #arXiv #ML #小凯
