[Paper] Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

Paper Overview

Research area: LLM | Authors: Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han | Published: 2026-02-19 | arXiv: 2602.17616

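Abstract (translated from Chinese)

Asynchronous RL training is attractive because it raises end-to-end throughput. However, for critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony significantly increases the variance of the policy-gradient estimate: training on stale rollouts produces heavy-tailed importance ratios, so a few samples dominate the update. The paper proposes VCPO (Variance Controlled Policy Optimization), a general stabilization method that (i) scales the learning rate by the effective sample size to suppress unreliable updates and (ii) applies a closed-form minimum-variance baseline for the off-policy setting. VCPO markedly improves robustness in asynchronous training on math, general reasoning, and tool-use tasks, and cuts long-context, multi-turn training time by 2.5x while matching the performance of synchronous training.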

Original Abstract

Asynchronous RL training is attractive because it increases end-to-end throughput. However, for critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. We propose VCPO (Variance Controlled Policy Optimization), a general stabilization method that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting. VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, reducing long-context, multi-turn training time by 2.5x while matching synchronous performance.
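To make point (i) concrete, here is a minimal sketch of how an effective sample size (ESS) could be computed from importance ratios and used to shrink a learning rate. The ESS formula is the standard Kish definition, (sum of weights)^2 / (sum of squared weights). The abstract does not give VCPO's exact scaling rule, so the linear `ESS / N` factor and the function names `effective_sample_size` and `ess_scaled_lr` are assumptions for illustration only, not the authors' implementation.

```python
import math


def effective_sample_size(log_ratios):
    """Kish effective sample size from per-sample log importance ratios.

    log_ratios[i] is assumed to be log(pi_current / pi_behavior) for sample i.
    """
    if not log_ratios:
        raise ValueError("log_ratios must be non-empty")
    # ESS is invariant to rescaling the weights, so subtract the max
    # log-ratio before exponentiating to avoid overflow.
    m = max(log_ratios)
    weights = [math.exp(lr - m) for lr in log_ratios]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def ess_scaled_lr(base_lr, log_ratios):
    """Hypothetical rule: scale the learning rate by ESS / N (an assumption)."""
    n = len(log_ratios)
    return base_lr * effective_sample_size(log_ratios) / n


# On-policy batch: all ratios equal 1, so ESS equals the batch size.
on_policy = [0.0, 0.0, 0.0, 0.0]
# Stale batch: one sample has a very large ratio and dominates the batch.
stale = [10.0, 0.0, 0.0, 0.0]

ess_on = effective_sample_size(on_policy)
ess_stale = effective_sample_size(stale)
lr_on = ess_scaled_lr(1e-6, on_policy)
lr_stale = ess_scaled_lr(1e-6, stale)
```

In the on-policy batch the learning rate is left unchanged, while in the stale batch the ESS collapses toward 1 and the step size shrinks to roughly a quarter of the base value, which is the damping behavior the abstract describes for unreliable updates.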

--- *Automatically collected on 2026-06-24*

#Paper #arXiv #LLM #小凯
