Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

Abstract

This paper proposes a novel formulation for reinforcement learning (RL) with large language models (LLMs) that explains why, and under what conditions, the true sequence-level reward can be optimized through a surrogate token-level objective in policy gradient methods such as REINFORCE. The formulation targets the instability commonly encountered in RL training of LLMs and provides both theoretical foundations and practical implementations.

Introduction

Reinforcement learning (RL) has become a key technique for improving the ability of large language models (LLMs) to solve complex problems. However, RL training of LLMs is often unstable, which degrades final performance and makes training difficult.

This paper addresses these challenges by proposing a novel formulation that stabilizes the RL process.

Methodology

Our approach introduces a first-order approximation perspective to understand and solve the instability puzzle in RL training of LLMs. We derive theoretical conditions under which token-level objectives can effectively optimize sequence-level rewards.
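As an illustration of what a first-order approximation can mean here, the sketch below works through one standard reading. It is an assumption made for exposition, not the paper's own derivation, and the notation is hypothetical: x is a prompt, y = (y_1, ..., y_T) a sampled response, R(x, y) the sequence-level reward, \pi_\theta the policy being trained, and \mu the policy that generated the samples.

```latex
% Sequence-level objective, rewritten with samples drawn from \mu:
\begin{align}
J(\theta)
  &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\bigl[ R(x, y) \bigr]
   = \mathbb{E}_{y \sim \mu(\cdot \mid x)}
     \Bigl[ \tfrac{\pi_\theta(y \mid x)}{\mu(y \mid x)} \, R(x, y) \Bigr].
\end{align}

% The sequence-level ratio factorizes over tokens (autoregressive model):
\begin{align}
\frac{\pi_\theta(y \mid x)}{\mu(y \mid x)}
  = \prod_{t=1}^{T} r_t,
\qquad
r_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})}
    = 1 + \delta_t .
\end{align}

% Expanding the product and keeping only first-order terms in \delta_t:
\begin{align}
\prod_{t=1}^{T} (1 + \delta_t)
  = 1 + \sum_{t=1}^{T} \delta_t + O(\delta^2)
  = 1 + \sum_{t=1}^{T} (r_t - 1) + O(\delta^2).
\end{align}

% Dropping the O(\delta^2) terms and the parts that do not depend on \theta
% leaves a token-level surrogate:
\begin{align}
J(\theta) \approx
  \mathbb{E}_{y \sim \mu(\cdot \mid x)}
  \Bigl[ R(x, y) \sum_{t=1}^{T} r_t \Bigr] + \text{const}.
\end{align}
```

Under this reading, the token-level surrogate tracks the sequence-level objective only while every \delta_t stays small, that is, while \pi_\theta remains close to \mu on the sampled tokens.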

The formulation provides practical guidelines for implementing stable RL training in LLMs.

Results

Experimental results demonstrate that our proposed method significantly improves training stability compared to conventional approaches. Our method achieves better convergence properties and higher final performance on various benchmark tasks.

The first-order approximation effectively captures the essential dynamics of the RL process while maintaining computational efficiency.
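A small numerical check can make the idea of a first-order approximation concrete. The sketch below is only an illustration under an assumed setting, not the paper's experiments: it treats the sequence-level quantity as a product of per-token ratios close to 1 and compares it with the sum-based first-order expansion. The function names are hypothetical.

```python
import math


def sequence_ratio(token_ratios):
    """Exact sequence-level ratio: the product of the per-token ratios."""
    return math.prod(token_ratios)


def first_order_ratio(token_ratios):
    """First-order expansion of the product around 1: 1 + sum of (r_t - 1)."""
    return 1.0 + sum(r - 1.0 for r in token_ratios)


def approximation_error(token_ratios):
    """Absolute gap between the exact product and its first-order expansion."""
    return abs(sequence_ratio(token_ratios) - first_order_ratio(token_ratios))


if __name__ == "__main__":
    # Assumed toy inputs: ten tokens whose ratios all deviate from 1 by the
    # same amount. Larger deviations make the first-order expansion worse.
    for deviation in (0.001, 0.01, 0.1):
        ratios = [1.0 + deviation] * 10
        print(f"deviation={deviation}: error={approximation_error(ratios):.6f}")
```

The gap grows roughly with the square of the per-token deviation, which is why an approximation of this kind is only trustworthy while the per-token ratios stay near 1.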

Conclusion

This paper presents a novel formulation for stabilizing RL with LLMs, addressing the instability issues from a first-order approximation perspective. Both theoretical analysis and experimental results validate the effectiveness of our approach.

This work provides valuable insights for the development of more stable and efficient RL training methods for LLMs.

