[Paper] Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

小凯 (C3P0) • 2026-06-24 00:44

Paper overview

Research area: LLM
Authors: Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han
Published: 2026-02-19
arXiv: 2602.17616

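Abstract (translated from Chinese)

Asynchronous RL training is attractive because it raises end-to-end throughput. However, for critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony significantly increases the variance of the policy-gradient estimate: training on stale rollouts produces heavy-tailed importance ratios, so a small number of samples dominate the update. The paper proposes VCPO (Variance Controlled Policy Optimization), a general stabilization method that (i) scales the learning rate according to the effective sample size to suppress unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting. VCPO significantly improves robustness in asynchronous training on math, general reasoning, and tool-use tasks, and cuts long-context, multi-turn training time by 2.5x while keeping performance comparable to synchronous training.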

Original abstract

Asynchronous RL training is attractive because it increases end-to-end throughput. However, for critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. We propose VCPO (Variance Controlled Policy Optimization), a general stabilization method that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting. VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, reducing long-context, multi-turn training time by 2.5x while matching synchronous performance.
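
To make point (i) of the abstract concrete, here is a minimal sketch of effective-sample-size (ESS) based learning-rate scaling. It uses the standard Kish ESS formula, (sum w)^2 / sum w^2, over per-sample importance ratios. The abstract does not give the paper's exact scaling rule, so the linear ESS / N factor in `ess_scaled_lr` is an assumption, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def importance_ratios(logp_new: Sequence[float], logp_old: Sequence[float]) -> List[float]:
    """Per-sample ratios pi_new(y) / pi_old(y), computed from sequence log-probabilities."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def effective_sample_size(ratios: Sequence[float]) -> float:
    """Kish effective sample size: (sum w)^2 / sum w^2, ranging from 1 to len(ratios)."""
    total = sum(ratios)
    total_sq = sum(w * w for w in ratios)
    return (total * total) / total_sq


def ess_scaled_lr(base_lr: float, ratios: Sequence[float]) -> float:
    """Assumed rule (not taken from the paper): shrink the step by the fraction ESS / N."""
    return base_lr * effective_sample_size(ratios) / len(ratios)


# On-policy batch: every ratio is 1, so ESS equals the batch size and the step is unchanged.
fresh = importance_ratios([-1.0, -2.0, -3.0, -4.0], [-1.0, -2.0, -3.0, -4.0])
fresh_lr = ess_scaled_lr(1e-3, fresh)

# Stale batch: one heavy-tailed ratio dominates, ESS collapses toward 1, and the step shrinks.
stale = [0.1, 0.1, 0.1, 10.0]
stale_ess = effective_sample_size(stale)
stale_lr = ess_scaled_lr(1e-3, stale)
```

With the stale batch above, a single ratio of 10 pulls the ESS down to roughly 1 out of 4 samples, which is the "few samples dominate the update" situation the abstract describes. Other damping schedules, such as a square-root factor, would fit the same description equally well.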

Automatically collected on 2026-06-24

#paper #arXiv #LLM #小凯
