<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>The Sparsity Puzzle of RLVR: The Three-Gate Theory and the Ridge-versus-Valley Analogy</title>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet">
<link href="https://fonts.googleapis.com/css2?family=Noto+Sans+SC:wght@400;500;700;900&display=swap" rel="stylesheet">
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: 'Noto Sans SC', sans-serif;
background-color: #f5f7fa;
color: #1a237e;
line-height: 1.6;
}
.poster-container {
width: 920px;
min-height: 960px;
margin: 0 auto;
background: linear-gradient(135deg, #e3f2fd, #bbdefb);
padding: 40px;
position: relative;
overflow: hidden;
}
.background-shape {
position: absolute;
border-radius: 50%;
opacity: 0.1;
z-index: 0;
}
.shape1 {
width: 500px;
height: 500px;
background: linear-gradient(45deg, #1976d2, #64b5f6);
top: -200px;
right: -200px;
}
.shape2 {
width: 400px;
height: 400px;
background: linear-gradient(45deg, #43a047, #81c784);
bottom: -150px;
left: -150px;
}
.header {
position: relative;
z-index: 1;
text-align: center;
margin-bottom: 40px;
}
.title {
font-size: 42px;
font-weight: 900;
color: #0d47a1;
margin-bottom: 15px;
line-height: 1.2;
}
.subtitle {
font-size: 22px;
color: #1565c0;
font-weight: 500;
}
.content {
position: relative;
z-index: 1;
display: flex;
flex-direction: column;
gap: 30px;
}
.section {
background: rgba(255, 255, 255, 0.85);
border-radius: 16px;
padding: 25px;
box-shadow: 0 8px 16px rgba(0, 0, 0, 0.1);
backdrop-filter: blur(10px);
border: 1px solid rgba(255, 255, 255, 0.3);
}
.section-title {
display: flex;
align-items: center;
font-size: 28px;
font-weight: 700;
color: #0d47a1;
margin-bottom: 15px;
}
.section-title .material-icons {
margin-right: 10px;
font-size: 32px;
}
.section-content {
font-size: 18px;
}
.highlight {
background: linear-gradient(transparent 40%, rgba(77, 182, 172, 0.3) 40%, rgba(77, 182, 172, 0.3) 85%, transparent 85%);
padding: 0 4px;
}
.two-column {
display: flex;
gap: 20px;
margin-top: 15px;
}
.column {
flex: 1;
}
.gate {
background: #e3f2fd;
border-radius: 12px;
padding: 15px;
margin-bottom: 15px;
border-left: 5px solid #1976d2;
}
.gate-title {
font-weight: 700;
color: #0d47a1;
margin-bottom: 8px;
font-size: 20px;
}
.gate-description {
font-size: 16px;
}
.comparison {
display: flex;
gap: 20px;
margin-top: 15px;
}
.comparison-item {
flex: 1;
padding: 15px;
border-radius: 12px;
}
.ridge {
background: linear-gradient(135deg, #ffebee, #ffcdd2);
border-left: 5px solid #f44336;
}
.valley {
background: linear-gradient(135deg, #e8f5e9, #c8e6c9);
border-left: 5px solid #4caf50;
}
.comparison-title {
font-weight: 700;
font-size: 20px;
margin-bottom: 8px;
}
.ridge .comparison-title {
color: #c62828;
}
.valley .comparison-title {
color: #2e7d32;
}
.method {
background: #f5f5f5;
border-radius: 12px;
padding: 15px;
margin-bottom: 15px;
}
.method-title {
font-weight: 700;
font-size: 20px;
margin-bottom: 8px;
display: flex;
align-items: center;
}
.method-title .material-icons {
margin-right: 8px;
}
.lora {
border-left: 5px solid #4caf50;
}
.lora .method-title {
color: #2e7d32;
}
.pissa {
border-left: 5px solid #f44336;
}
.pissa .method-title {
color: #c62828;
}
.mountain-visual {
width: 100%;
height: 200px;
background: linear-gradient(to bottom, #bbdefb, #e3f2fd);
border-radius: 12px;
margin: 15px 0;
position: relative;
overflow: hidden;
}
.ridge-path {
position: absolute;
top: 50px;
left: 50px;
width: 200px;
height: 100px;
border-top: 4px solid #f44336;
border-radius: 50% 50% 0 0;
}
.valley-path {
position: absolute;
bottom: 50px;
left: 100px;
width: 400px;
height: 50px;
border-bottom: 4px solid #4caf50;
}
.mountain {
position: absolute;
bottom: 0;
width: 150px;
height: 150px;
background: #90a4ae;
clip-path: polygon(50% 0%, 0% 100%, 100% 100%);
}
.mountain1 {
left: 50px;
height: 180px;
}
.mountain2 {
left: 200px;
height: 120px;
}
.mountain3 {
right: 50px;
height: 160px;
}
.sparsity-visual {
display: flex;
justify-content: space-between;
margin: 15px 0;
}
.matrix {
width: 150px;
height: 150px;
display: grid;
grid-template-columns: repeat(10, 1fr);
grid-template-rows: repeat(10, 1fr);
gap: 2px;
}
.cell {
background-color: #e0e0e0;
border-radius: 2px;
}
.cell.active {
background-color: #1976d2;
}
.sparsity-label {
text-align: center;
font-weight: 500;
margin-top: 5px;
}
</style>
</head>
<body>
<div class="poster-container">
<!-- Background Shapes -->
<div class="background-shape shape1"></div>
<div class="background-shape shape2"></div>
<!-- Header -->
<header class="header">
<h1 class="title">The Sparsity Puzzle of RLVR</h1>
<p class="subtitle">The Three-Gate Theory and the Ridge-versus-Valley Analogy</p>
</header>
<!-- Content -->
<div class="content">
<!-- Section 1: The Basics of RLVR Sparsity -->
<section class="section">
<h2 class="section-title">
<i class="material-icons">psychology</i>
The Basics of RLVR Sparsity
</h2>
<div class="section-content">
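<p>When reinforcement learning improves reasoning and coding ability, the parameter updates turn out to be <span class="highlight">extremely sparse</span>. It is like a pianist performing a masterpiece by moving only the little finger. What mechanism lets so small a change achieve so much?</p>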
<div class="sparsity-visual">
<div>
<div class="matrix" id="sft-matrix"></div>
<div class="sparsity-label">SFT update (dense)</div>
</div>
<div>
<div class="matrix" id="rlvr-matrix"></div>
<div class="sparsity-label">RLVR update (sparse)</div>
</div>
</div>
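<p>RLVR (Reinforcement Learning with Verifiable Rewards) presents a paradox: a costly, high-payoff training process changes only a tiny fraction of the parameters. This sparsity is not random; it is determined by the model's intrinsic geometric structure.</p>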
</div>
</section>
<!-- Section 2: The Three-Gate Theory -->
<section class="section">
<h2 class="section-title">
<i class="material-icons">filter_frames</i>
The Three-Gate Theory
</h2>
<div class="section-content">
<p>The sparsity of RLVR can be explained by the "Three-Gate Theory". Each gate places its own constraint on the parameter update:</p>
<div class="gate">
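<div class="gate-title">Gate 1: KL Anchor</div>
<div class="gate-description">Each RL step induces a one-step policy-KL constraint that keeps the updated policy close to the base policy and so limits the magnitude of the parameter update.</div>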
</div>
<div class="gate">
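<div class="gate-title">Gate 2: Model Geometry</div>
<div class="gate-description">The model's geometry steers updates toward low-curvature directions that preserve its spectral structure. This bias is data-invariant: it comes from the model rather than the dataset, and it pushes updates away from the "principal directions".</div>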
</div>
<div class="gate">
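<div class="gate-title">Gate 3: Precision</div>
<div class="gate-description">The bfloat16 format acts as a lens: it hides micro-updates that are too small to be represented, which amplifies this bias and makes the underlying pattern show up as pronounced sparsity.</div>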
</div>
</div>
</section>
<!-- Section 3: The Ridge-versus-Valley Geometric Analogy -->
<section class="section">
<h2 class="section-title">
<i class="material-icons">terrain</i>
Ridge vs. Valley
</h2>
<div class="section-content">
<p>A geometric analogy makes the contrast concrete. Supervised fine-tuning (SFT) and RLVR take entirely different paths through parameter space:</p>
<div class="mountain-visual">
<div class="mountain mountain1"></div>
<div class="mountain mountain2"></div>
<div class="mountain mountain3"></div>
<div class="ridge-path"></div>
<div class="valley-path"></div>
</div>
<div class="comparison">
<div class="comparison-item ridge">
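<div class="comparison-title">Ridge (the SFT path)</div>
<p>Climbs steep peaks along the high-curvature "principal directions", causing severe spectral drift and altering the model's core knowledge structure.</p>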
</div>
<div class="comparison-item valley">
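<div class="comparison-title">Valley (the RLVR path)</div>
<p>Walks through gentle "off-principal" valleys, preserving the model's core knowledge structure and learning efficiently and safely.</p>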
</div>
</div>
</div>
</section>
<!-- Section 4: Practical Lessons from LoRA and PiSSA -->
<section class="section">
<h2 class="section-title">
<i class="material-icons">compare_arrows</i>
Practical Lessons from LoRA and PiSSA
</h2>
<div class="section-content">
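<p>Why is low-rank adaptation (LoRA) a natural fit for reinforcement learning? And why does PiSSA, which was designed for SFT, cause training to collapse on RL tasks?</p>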
<div class="method lora">
<div class="method-title">
<i class="material-icons">check_circle</i>
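LoRA: a natural fit for RL
</div>
<p>LoRA naturally updates non-principal directions, which matches RLVR's "valley path". It learns in a low-rank subspace without disrupting the model's core geometric structure, so it can improve reasoning ability stably.</p>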
</div>
<div class="method pissa">
<div class="method-title">
<i class="material-icons">error</i>
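PiSSA: the "mountain climber" in RL
</div>
<p>PiSSA concentrates its updates on the "principal directions" associated with the largest singular values, which amounts to forcing the model to climb along the "ridge". On RL tasks this strategy leads to training collapse, because it runs against the way RLVR optimizes.</p>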
</div>
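<p>Experiments indicate that PiSSA in RLVR is no better than plain LoRA and is in fact more prone to training collapse, because it forces the model onto the "high-mountain" path. This suggests that RL and SFT call for different parameter-efficient fine-tuning strategies.</p>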
</div>
</section>
</div>
</div>
<script>
// Create sparsity visualization
function createMatrix(matrixId, density) {
const matrix = document.getElementById(matrixId);
const cells = [];
for (let i = 0; i < 100; i++) {
const cell = document.createElement('div');
cell.className = 'cell';
if (Math.random() < density) {
cell.classList.add('active');
}
cells.push(cell);
matrix.appendChild(cell);
}
return cells;
}
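// Illustrative sketch for Gate 1 (KL anchor); it is not used by the poster's rendering.
// It measures how far an updated policy drifts from the base policy on a single
// next-token distribution. The names softmax and klDivergence, the example logits,
// and the choice of KL(updated || base) are assumptions made for illustration only.
function softmax(logits) {
  // Subtract the max logit before exponentiating for numerical stability.
  const maxLogit = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - maxLogit));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}
function klDivergence(p, q) {
  // KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute nothing.
  // Assumes q_i > 0 wherever p_i > 0 (otherwise the result is Infinity).
  let kl = 0;
  for (let i = 0; i < p.length; i++) {
    if (p[i] > 0) {
      kl += p[i] * Math.log(p[i] / q[i]);
    }
  }
  return kl;
}
// Usage idea: with base = softmax([2.0, 1.0, 0.1]), nudging the first logit to 2.1
// gives a much smaller klDivergence(updated, base) than pushing it to 4.0.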
// SFT matrix: dense pattern (0.7 is an illustrative density, not a measured value)
createMatrix('sft-matrix', 0.7);
// RLVR matrix: sparse pattern (0.15 is likewise illustrative)
createMatrix('rlvr-matrix', 0.15);
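// Illustrative sketch for Gate 3 (precision); it is not used by the poster's rendering.
// bfloat16 keeps the float32 sign and exponent but only 7 stored mantissa bits,
// so near 1.0 adjacent representable values are 2^-7 = 0.0078125 apart.
// An update much smaller than that spacing is rounded away when the weight is stored.
// toBfloat16 and survivesBfloat16 are hypothetical helper names; the sketch assumes
// the update is applied in higher precision and the result is then stored as bfloat16.
function toBfloat16(x) {
  const f32 = new Float32Array(1);
  const u32 = new Uint32Array(f32.buffer);
  f32[0] = x;
  const bits = u32[0];
  // Round to nearest, ties to even, on the 16 low bits that bfloat16 drops.
  // (NaN inputs are not handled specially in this sketch.)
  const roundingBias = 0x7fff + ((bits >>> 16) & 1);
  u32[0] = ((bits + roundingBias) & 0xffff0000) >>> 0;
  return f32[0];
}
function survivesBfloat16(weight, delta) {
  // True if the stored bfloat16 value actually changes after adding delta.
  return toBfloat16(weight + delta) !== toBfloat16(weight);
}
// Usage idea: survivesBfloat16(1.0, 1e-3) is false (the update is hidden),
// while survivesBfloat16(1.0, 1e-2) is true (the weight moves to 1.0078125).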
</script>
</body>
</html>