
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

✨步子哥 (steper) · December 7, 2025, 12:38


Chujie Zheng, Junrong Lin, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Feng Hu, Yuyong Liu, Hao Lin, An Yang, Jingren Zhou, Chen Wu, Junyang Lin, Qwen Team, Alibaba Inc.

Abstract

This paper proposes a novel formulation for reinforcement learning (RL) with large language models (LLMs), explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Our approach addresses the instability issues commonly encountered in RL training of LLMs, providing both theoretical foundations and practical implementations.
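For background on this claim, the sequence-level objective and its REINFORCE (score-function) gradient can be written as below; this is the standard identity for an autoregressive policy, not the paper's own formulation, which may differ in its details:

```latex
% Sequence-level objective for a prompt x and sampled response y = (y_1, \dots, y_T):
J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, R(x, y) \,\big]
% Since \log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t}),
% the REINFORCE gradient of the sequence-level reward decomposes into token-level terms:
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[\, R(x, y)
      \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(y_t \mid x, y_{<t}\big) \Big]
```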

Introduction

Reinforcement learning (RL) has become a key paradigm for enhancing the ability of large language models (LLMs) to tackle complex problem-solving tasks. However, RL training of LLMs often suffers from instability, leading to suboptimal performance and training difficulties. This paper addresses these challenges by proposing a novel formulation that stabilizes the RL process.
[Figure: LLM concept]

Methodology

Our approach introduces a first-order approximation perspective to understand and solve the instability puzzle in RL training of LLMs. We derive theoretical conditions under which token-level objectives can effectively optimize sequence-level rewards.
The formulation provides practical guidelines for implementing stable RL training of LLMs.
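As a concrete point of reference, here is a minimal sketch of a token-level surrogate loss for REINFORCE-style training, assuming PyTorch; the function and tensor names are hypothetical, and the paper's actual objective and practices may differ:

```python
import torch

def token_level_surrogate_loss(
    token_logprobs: torch.Tensor,  # (batch, seq_len): log pi_theta(y_t | x, y_<t)
    reward: torch.Tensor,          # (batch,): sequence-level reward R(x, y)
    mask: torch.Tensor,            # (batch, seq_len): 1 for response tokens, 0 for padding
    baseline: float = 0.0,         # optional constant baseline to reduce gradient variance
) -> torch.Tensor:
    # Broadcast the sequence-level (reward - baseline) to every response token;
    # the gradient of this surrogate matches the REINFORCE gradient of the
    # sequence-level objective in expectation.
    advantage = (reward - baseline).unsqueeze(-1)         # (batch, 1)
    per_token_loss = -advantage * token_logprobs * mask   # (batch, seq_len)
    # Normalize by the number of valid response tokens in the batch.
    return per_token_loss.sum() / mask.sum().clamp(min=1)
```

Whether the loss is normalized per token or per sequence, and how the sequence-level reward (or advantage) is broadcast to tokens, are exactly the kinds of choices that affect gradient scale and, in turn, training stability.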
[Figure: Methodology diagram]

Results

Experimental results demonstrate that the proposed method significantly improves training stability compared with conventional approaches, achieving better convergence properties and higher final performance on various benchmark tasks.
The first-order approximation effectively captures the essential dynamics of the RL process while maintaining computational efficiency.
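For intuition about where a first-order approximation typically enters such analyses (shown here as generic background; the paper's exact derivation is not reproduced), consider the importance-weighted form of the sequence-level objective, whose gradient at the sampling policy coincides with the REINFORCE gradient:

```latex
% Importance-weighted (off-policy) form of the sequence-level objective,
% with responses sampled from a fixed behavior policy \pi_{\theta_{\mathrm{old}}}:
J(\theta) = \mathbb{E}_{y \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}
  \left[ \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\mathrm{old}}}(y \mid x)} \, R(x, y) \right]
% First-order (Taylor) expansion of the ratio around \theta = \theta_{\mathrm{old}}:
\frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\mathrm{old}}}(y \mid x)}
  \approx 1 + (\theta - \theta_{\mathrm{old}})^{\top}
    \nabla_\theta \log \pi_\theta(y \mid x)\,\big|_{\theta = \theta_{\mathrm{old}}}
% so the gradient of J at \theta_{\mathrm{old}} reduces to the REINFORCE gradient above.
```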
[Figure: Results visualization]

Conclusion

This paper presents a novel formulation for stabilizing RL with LLMs, addressing the instability issues from a first-order approximation perspective. Both theoretical analysis and experimental results validate the effectiveness of our approach.
This work provides valuable insights for the development of more stable and efficient RL training methods for LLMs.

