
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

✨步子哥 (steper) December 7, 2025, 12:38
Chujie Zheng, Junrong Lin, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Feng Hu, Yuyong Liu, Hao Lin, An Yang, Jingren Zhou, Chen Wu, Junyang Lin (Qwen Team, Alibaba Inc.)

Abstract

This paper proposes a novel formulation for reinforcement learning (RL) with large language models (LLMs), explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy-gradient methods such as REINFORCE. Our approach addresses the instability issues commonly encountered in RL training of LLMs, providing both theoretical foundations and practical implementations.
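For background on why a token-level surrogate can carry a sequence-level reward signal, the standard REINFORCE identity is a useful reference point. This is a textbook derivation, not the paper's specific formulation, and the notation (x the prompt, y the sampled response, R the sequence-level reward) is introduced here only for illustration: because the log-probability of a sequence is the sum of its token log-probabilities, the gradient of the expected reward decomposes over tokens.

```latex
\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R(x, y) \big]
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \big]
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[ R(x, y) \sum_{t=1}^{|y|} \nabla_\theta \log \pi_\theta\big(y_t \mid x,\, y_{<t}\big) \Big]
```

Each summand weights a single token's log-probability gradient by the same sequence-level reward, which is exactly the shape of the token-level objectives discussed in the sections below.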
Introduction

Reinforcement learning (RL) has become a key technical paradigm for enhancing the ability of large language models (LLMs) to tackle complex problem-solving tasks. However, RL training of LLMs often suffers from instability, leading to suboptimal performance and training difficulties. This paper addresses these challenges by proposing a novel formulation that stabilizes the RL process.

Methodology

Our approach introduces a first-order approximation perspective to understand and resolve the instability puzzle in RL training of LLMs. We derive theoretical conditions under which token-level objectives can effectively optimize sequence-level rewards, and the formulation yields practical guidelines for implementing stable RL training of LLMs.
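To make the idea of a token-level surrogate driven by a sequence-level reward concrete, here is a minimal PyTorch-style sketch of a plain REINFORCE loss in which one scalar reward per sampled response is broadcast to every response token. It is an illustrative sketch under stated assumptions (the tensor names, masking convention, and optional baseline are ours), not the paper's implementation or its stability techniques.

```python
# Illustrative sketch only: tensor names, masking convention, and the optional
# baseline are assumptions for this example, not the paper's implementation.
from typing import Optional

import torch


def token_level_reinforce_loss(token_logprobs: torch.Tensor,
                               rewards: torch.Tensor,
                               response_mask: torch.Tensor,
                               baseline: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Token-level surrogate loss for a sequence-level reward (plain REINFORCE).

    token_logprobs: (batch, seq_len) log pi_theta(y_t | x, y_<t) of the sampled tokens.
    rewards:        (batch,) sequence-level reward R(x, y), e.g. a verifier score.
    response_mask:  (batch, seq_len) 1.0 for response tokens, 0.0 for prompt/padding.
    baseline:       optional (batch,) baseline (e.g. group-mean reward) for variance reduction.
    """
    advantage = rewards if baseline is None else rewards - baseline
    # Broadcast the per-sequence advantage to every token and treat it as a constant,
    # so gradients flow only through the token log-probabilities.
    per_token_loss = -advantage.detach().unsqueeze(1) * token_logprobs
    # Average over valid response tokens, then over the batch.
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1.0)


# Toy usage: 2 sampled responses, 4 response tokens each, binary sequence rewards.
logp = torch.randn(2, 4, requires_grad=True)   # stand-in values for token log-probs
loss = token_level_reinforce_loss(logp,
                                  rewards=torch.tensor([1.0, 0.0]),
                                  response_mask=torch.ones(2, 4))
loss.backward()  # the zero-reward response contributes zero gradient here (no baseline)
```

Broadcasting one scalar reward per response to every token, while treating it as a constant in the backward pass, is what makes the objective token-level even though the reward itself is sequence-level; subtracting a baseline such as a group-mean reward is a common variance-reduction choice.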
<div class="key-point"> This work provides valuable insights for the development of more stable and efficient RL training methods for LLMs. </div> </div> </section> <!-- References section --> <section class="section full-width"> <h2 class="section-title"> <i class="material-icons">menu_book</i> References </h2> <div class="section-content"> [1] OpenAI (2024). Reinforcement Learning from Human Feedback.<br> [2] Guo et al. (2025). Challenges in RL Training of LLMs. </div> </section> </div> <!-- Footer --> <footer class="footer"> © 2025 Qwen Team, Alibaba Inc. All rights reserved. </footer> </div> </body> </html>
