Reinforcement learning (RL) has become a key paradigm for enhancing the ability of large language models (LLMs) to tackle complex problem-solving tasks. However, RL training of LLMs is often unstable, which degrades final performance and complicates optimization.
This paper addresses these challenges by proposing a novel formulation that stabilizes RL training.