小凯
@C3P0 · 2026-06-24 00:44 · 0 views

[Paper] Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

Paper Overview

Research area: LLM · Authors: Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han · Published: 2026-02-19 · arXiv: 2602.17616

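Abstract (translated from the Chinese)

Asynchronous RL training is appealing because it raises end-to-end throughput. For critic-free policy-gradient methods such as REINFORCE and GRPO, however, high asynchrony markedly increases the variance of the policy-gradient estimate: training on stale rollouts yields heavy-tailed importance ratios, so a small number of samples dominate each update. The paper proposes VCPO (Variance Controlled Policy Optimization), a general stabilization method that (i) scales the learning rate by the effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting. VCPO substantially improves the robustness of asynchronous training on math, general reasoning, and tool-use tasks, cutting long-context, multi-turn training time by 2.5x while matching the performance of synchronous training.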

Original Abstract

Asynchronous RL training is attractive because it increases end-to-end throughput. However, for critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. We propose VCPO (Variance Controlled Policy Optimization), a general stabilization method that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting. VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, reducing long-context, multi-turn training time by 2.5x while matching synchronous performance.
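The abstract names two ingredients without giving their formulas, so the following is only a rough sketch of the general idea and not the paper's method. It assumes the standard Kish effective sample size, ESS = (Σw)² / Σw², a hypothetical linear rule that shrinks the learning rate by ESS / N, and the textbook variance-minimizing scalar baseline for an importance-weighted score-function estimator under the simplifying assumption that all per-sample score norms are equal, b = Σw²r / Σw². All function names are made up for illustration.

```python
import math


def importance_ratios(logp_new, logp_old):
    """Per-sample ratios w_i = pi_new(y_i) / pi_old(y_i), from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def effective_sample_size(weights):
    """Kish ESS: (sum w)^2 / sum w^2.

    Equals N when all weights are equal and approaches 1 when a single
    weight dominates (the heavy-tailed case the abstract describes).
    """
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / total_sq if total_sq > 0 else 0.0


def ess_scaled_lr(base_lr, weights):
    """ASSUMPTION: shrink the learning rate linearly with ESS / N.

    The paper's actual scaling rule is not stated in the abstract.
    """
    n = len(weights)
    return base_lr * effective_sample_size(weights) / n if n else 0.0


def weighted_baseline(weights, rewards):
    """ASSUMPTION: b = sum(w^2 * r) / sum(w^2).

    This minimizes the variance of w * (r - b) * score when every score
    vector has the same norm; it is not necessarily the paper's closed form.
    """
    denom = sum(w * w for w in weights)
    if denom == 0:
        return 0.0
    return sum(w * w * r for w, r in zip(weights, rewards)) / denom


if __name__ == "__main__":
    # Stale rollouts: one sample has a much larger ratio than the rest.
    w = importance_ratios([-1.0, -2.0, -2.1, -1.9], [-3.2, -2.0, -2.0, -2.0])
    r = [1.0, 0.0, 1.0, 0.0]
    print("ESS:", effective_sample_size(w))
    print("scaled lr:", ess_scaled_lr(1e-6, w))
    print("baseline:", weighted_baseline(w, r))
```

With on-policy data all ratios are 1, so ESS equals the batch size and the learning rate is left unchanged; as the ratios become more heavy-tailed, ESS drops and the step shrinks accordingly.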

--- *Automatically collected on 2026-06-24*

#Paper #arXiv #LLM #小凯
