<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Stabilizing Reinforcement Learning with LLMs</title>
<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@300;400;500;700&display=swap" rel="stylesheet">
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet">
<style>
:root {
--primary: #1565c0;
--primary-light: #5e92f3;
--primary-dark: #003c8f;
--secondary: #2979ff;
--text-on-primary: #ffffff;
--text-primary: #212121;
--text-secondary: #424242;
--background: #f5f7fa;
--card-bg: #ffffff;
--accent: #29b6f6;
}
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: 'Roboto', sans-serif;
background-color: var(--background);
color: var(--text-primary);
line-height: 1.6;
}
.poster-container {
width: 720px;
min-height: 960px;
margin: 0 auto;
background: linear-gradient(135deg, #f5f7fa 0%, #e4eaf5 100%);
position: relative;
overflow: hidden;
padding: 40px;
box-shadow: 0 10px 30px rgba(0, 0, 0, 0.1);
}
.background-shape {
position: absolute;
border-radius: 50%;
opacity: 0.1;
z-index: 0;
}
.shape-1 {
width: 500px;
height: 500px;
background: var(--primary);
top: -200px;
right: -200px;
}
.shape-2 {
width: 400px;
height: 400px;
background: var(--secondary);
bottom: -150px;
left: -150px;
}
.header {
text-align: center;
margin-bottom: 30px;
position: relative;
z-index: 1;
}
.title {
font-size: 36px;
font-weight: 700;
color: var(--primary-dark);
margin-bottom: 15px;
line-height: 1.2;
}
.authors {
font-size: 16px;
color: var(--text-secondary);
margin-bottom: 20px;
max-width: 90%;
margin-left: auto;
margin-right: auto;
}
.logo {
width: 120px;
margin: 0 auto;
}
.content-grid {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 25px;
position: relative;
z-index: 1;
}
.section {
background: var(--card-bg);
border-radius: 16px;
padding: 25px;
box-shadow: 0 4px 15px rgba(0, 0, 0, 0.05);
transition: transform 0.3s ease, box-shadow 0.3s ease;
}
.section:hover {
transform: translateY(-5px);
box-shadow: 0 8px 25px rgba(0, 0, 0, 0.1);
}
.section-title {
font-size: 22px;
font-weight: 500;
color: var(--primary);
margin-bottom: 15px;
display: flex;
align-items: center;
}
.section-title .material-icons {
margin-right: 8px;
font-size: 24px;
}
.section-content {
font-size: 16px;
color: var(--text-secondary);
}
.full-width {
grid-column: span 2;
}
.image-container {
margin: 15px 0;
text-align: center;
}
.image-container img {
max-width: 100%;
border-radius: 12px;
box-shadow: 0 4px 10px rgba(0, 0, 0, 0.1);
}
.highlight {
background: linear-gradient(transparent 50%, rgba(41, 182, 246, 0.2) 50%);
padding: 0 4px;
}
.key-point {
background-color: rgba(41, 182, 246, 0.1);
border-left: 4px solid var(--accent);
padding: 10px 15px;
margin: 15px 0;
border-radius: 0 8px 8px 0;
}
.footer {
text-align: center;
margin-top: 30px;
padding-top: 20px;
border-top: 1px solid rgba(0, 0, 0, 0.1);
color: var(--text-secondary);
font-size: 14px;
position: relative;
z-index: 1;
}
</style>
</head>
<body>
<div class="poster-container">
<!-- Background shapes -->
<div class="background-shape shape-1"></div>
<div class="background-shape shape-2"></div>
<!-- Header section -->
<header class="header">
<h1 class="title">Stabilizing Reinforcement Learning with LLMs: Formulation and Practices</h1>
<p class="authors">Chujie Zheng, Junrong Lin, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Feng Hu, Yuyong Liu, Hao Lin, An Yang, Jingren Zhou, Chen Wu, Junyang Lin, Qwen Team, Alibaba Inc.</p>
<div class="logo">
<img src="https://sfile.chatglm.cn/moeSlide/image/02/02c26323.jpg" alt="Qwen Team Logo" style="width: 100%; border-radius: 8px;">
</div>
</header>
<!-- Content grid -->
<div class="content-grid">
<!-- Abstract section -->
<section class="section full-width">
<h2 class="section-title">
<i class="material-icons">description</i>
Abstract
</h2>
<div class="section-content">
This paper proposes a <span class="highlight">novel formulation</span> for reinforcement learning (RL) with large language models (LLMs), explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. The formulation addresses the <span class="highlight">instability issues</span> commonly encountered in RL training of LLMs, providing both theoretical grounding and practical guidance.
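<p style="margin-top: 12px;">As a rough illustration (not the paper's exact objective), the standard REINFORCE surrogate scales the sum of token log-probabilities by a sequence-level reward, so the gradient of this token-level loss matches the policy-gradient estimate of the sequence-level objective. A minimal PyTorch sketch, with toy numbers for illustration only:</p>
<pre style="background: #eef2f8; border-radius: 8px; padding: 12px; font-size: 13px; overflow-x: auto;"><code># Hedged sketch of the generic REINFORCE surrogate for sequence
# generation; not the paper's exact formulation.
import torch

def reinforce_surrogate_loss(token_logps: torch.Tensor, reward: float) -> torch.Tensor:
    # token_logps: log pi(y_t | prefix, x) for each generated token, shape [T].
    # Minimizing -R(y) * sum_t log pi(y_t | ...) ascends the expected reward.
    return -reward * token_logps.sum()

# Toy usage with made-up token probabilities.
probs = torch.tensor([0.6, 0.3, 0.8], requires_grad=True)
loss = reinforce_surrogate_loss(torch.log(probs), reward=1.0)
loss.backward()  # gradient flows to every token's probability
</code></pre>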
</div>
</section>
<!-- Introduction section -->
<section class="section">
<h2 class="section-title">
<i class="material-icons">lightbulb</i>
Introduction
</h2>
<div class="section-content">
Reinforcement learning (RL) has become a key paradigm for enhancing the ability of large language models (LLMs) to tackle complex problem-solving tasks. However, RL training of LLMs often suffers from instability, which leads to suboptimal performance and makes training difficult.
<div class="key-point">
This paper addresses these challenges by proposing a novel formulation that stabilizes the RL process.
</div>
</div>
<div class="image-container">
<img src="https://sfile.chatglm.cn/moeSlide/image/c6/c6fe3060.jpg" alt="LLM Concept">
</div>
</section>
<!-- Methodology section -->
<section class="section">
<h2 class="section-title">
<i class="material-icons">science</i>
Methodology
</h2>
<div class="section-content">
Our approach introduces a <span class="highlight">first-order approximation perspective</span> to understand and solve the instability puzzle in RL training of LLMs. We derive theoretical conditions under which token-level objectives can effectively optimize sequence-level rewards.
<div class="key-point">
The formulation provides practical guidelines for implementing stable RL training in LLMs.
</div>
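<p style="margin-top: 12px;">To make the idea concrete, the sketch below (a generic illustration under our own assumptions, not the paper's derivation) compares a token-level importance-weighted surrogate with its first-order Taylor expansion around a ratio of 1; when the updated policy stays close to the sampling policy, the two agree and the plain REINFORCE term is recovered:</p>
<pre style="background: #eef2f8; border-radius: 8px; padding: 12px; font-size: 13px; overflow-x: auto;"><code># Hedged illustration: exact importance-weighted surrogate vs. its
# first-order approximation around ratio = 1.
import torch

def surrogate_exact(logp_new, logp_old, advantage):
    ratio = torch.exp(logp_new - logp_old)  # pi_theta / pi_old per token
    return -(ratio * advantage).sum()

def surrogate_first_order(logp_new, logp_old, advantage):
    # exp(x) is approximately 1 + x near x = 0
    return -((1.0 + logp_new - logp_old) * advantage).sum()

# Toy numbers: a small policy update keeps the two surrogates close.
logp_old = torch.tensor([-1.20, -0.70, -2.10])
logp_new = logp_old + torch.tensor([0.010, -0.020, 0.015])
adv = torch.full((3,), 0.5)  # sequence-level advantage broadcast to tokens
print(surrogate_exact(logp_new, logp_old, adv).item())
print(surrogate_first_order(logp_new, logp_old, adv).item())
</code></pre>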
</div>
<div class="image-container">
<img src="https://sfile.chatglm.cn/moeSlide/image/ad/ad1b9a1c.jpg" alt="Methodology Diagram">
</div>
</section>
<!-- Results section -->
<section class="section">
<h2 class="section-title">
<i class="material-icons">analytics</i>
Results
</h2>
<div class="section-content">
Experimental results demonstrate that the proposed method significantly improves training stability over conventional approaches, achieving better convergence properties and higher final performance across a range of benchmark tasks.
<div class="key-point">
The first-order approximation effectively captures the essential dynamics of the RL process while maintaining computational efficiency.
</div>
</div>
<div class="image-container">
<img src="https://sfile.chatglm.cn/moeSlide/image/14/14c69370.jpg" alt="Results Visualization">
</div>
</section>
<!-- Conclusion section -->
<section class="section">
<h2 class="section-title">
<i class="material-icons">check_circle</i>
Conclusion
</h2>
<div class="section-content">
This paper presents a novel formulation for stabilizing RL with LLMs, addressing the instability issues from a first-order approximation perspective. Both theoretical analysis and experimental results validate the effectiveness of our approach.
<div class="key-point">
This work provides valuable insights for the development of more stable and efficient RL training methods for LLMs.
</div>
</div>
</section>
</div>
<!-- Footer -->
<footer class="footer">
© 2025 Qwen Team, Alibaba Inc. All rights reserved.
</footer>
</div>
</body>
</html>