<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>FlashPrefill Poster</title>
<style>
:root {
--primary-color: #003366; /* Deep Academic Blue */
--secondary-color: #0066cc; /* Lighter Blue */
--accent-color: #e67e22; /* Orange for highlights */
--success-color: #27ae60; /* Green for results */
--bg-color: #f0f3f5;
--card-bg: #ffffff;
--text-main: #2c3e50;
--text-light: #576574;
}
* {
box-sizing: border-box;
margin: 0;
padding: 0;
}
body {
width: 1200px;
height: 3000px;
font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;
background-color: var(--bg-color);
color: var(--text-main);
line-height: 1.5;
overflow: hidden;
}
.poster-container {
display: grid;
grid-template-columns: 1fr;
grid-template-rows: auto auto auto auto auto;
gap: 40px;
padding: 60px;
}
/* Header */
header {
background: linear-gradient(135deg, var(--primary-color), var(--secondary-color));
color: white;
padding: 60px;
border-radius: 20px;
box-shadow: 0 10px 20px rgba(0,0,0,0.15);
text-align: center;
}
h1 {
font-size: 72px;
font-weight: 800;
margin-bottom: 20px;
letter-spacing: -1px;
}
.authors {
font-size: 32px;
opacity: 0.9;
margin-bottom: 15px;
}
.links {
font-size: 24px;
font-family: 'Courier New', monospace;
background: rgba(255,255,255,0.2);
display: inline-block;
padding: 10px 20px;
border-radius: 10px;
}
/* Highlight Box */
.highlight-box {
background: linear-gradient(to right, #fff3cd, #ffeeba);
border-left: 15px solid var(--accent-color);
padding: 40px;
border-radius: 15px;
font-size: 36px;
font-weight: bold;
color: #856404;
box-shadow: 0 5px 15px rgba(0,0,0,0.05);
text-align: center;
}
/* Section Styles */
.section-title {
font-size: 48px;
font-weight: 700;
color: var(--primary-color);
border-bottom: 4px solid var(--secondary-color);
padding-bottom: 10px;
margin-bottom: 30px;
display: flex;
align-items: center;
}
.section-title::before {
content: '';
display: inline-block;
width: 20px;
height: 50px;
background: var(--secondary-color);
margin-right: 20px;
border-radius: 5px;
}
.card {
background: var(--card-bg);
border-radius: 20px;
padding: 40px;
box-shadow: 0 8px 16px rgba(0,0,0,0.05);
margin-bottom: 20px;
}
/* Layout Grid */
.content-grid {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 40px;
}
/* Motivation & Pain Points */
.pain-point {
background: #fff5f5;
border-left: 8px solid #e74c3c;
padding: 20px;
margin-bottom: 20px;
}
.pain-point h4 {
color: #c0392b;
font-size: 28px;
margin-bottom: 10px;
}
/* Methodology Section */
.method-container {
display: grid;
grid-template-columns: 1fr 1fr 1fr;
gap: 30px;
}
.method-card {
background: white;
border-top: 10px solid var(--secondary-color);
border-radius: 15px;
padding: 30px;
box-shadow: 0 5px 15px rgba(0,0,0,0.08);
}
.method-card h3 {
font-size: 32px;
color: var(--secondary-color);
margin-bottom: 20px;
min-height: 80px; /* Align titles */
}
.method-step {
font-size: 24px;
margin-bottom: 15px;
padding-left: 20px;
border-left: 4px solid #bdc3c7;
}
/* Results Section */
.results-grid {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 40px;
}
.stat-box {
background: var(--primary-color);
color: white;
padding: 40px;
border-radius: 15px;
text-align: center;
}
.stat-number {
font-size: 96px;
font-weight: 800;
color: var(--accent-color);
line-height: 1.1;
}
.stat-label {
font-size: 28px;
margin-top: 10px;
}
.chart-placeholder {
background: #ecf0f1;
height: 300px;
border-radius: 10px;
display: flex;
align-items: center;
justify-content: center;
color: #7f8c8d;
font-size: 24px;
margin-top: 20px;
border: 2px dashed #bdc3c7;
}
/* Footer */
footer {
text-align: center;
font-size: 24px;
color: var(--text-light);
margin-top: 20px;
padding-top: 40px;
border-top: 2px solid #dcdcdc;
}
ul {
list-style-position: inside;
font-size: 26px;
}
li {
margin-bottom: 10px;
}
.tag {
display: inline-block;
padding: 5px 15px;
border-radius: 5px;
font-size: 20px;
font-weight: bold;
color: white;
margin-right: 10px;
}
.tag-blue { background: var(--secondary-color); }
.tag-orange { background: var(--accent-color); }
.formula {
background: #f8f9fa;
padding: 15px;
border-radius: 10px;
font-family: 'Times New Roman', serif;
font-style: italic;
font-size: 28px;
text-align: center;
margin: 15px 0;
border: 1px solid #dee2e6;
}
</style>
</head>
<body>
<div class="poster-container">
<!-- 1. Header -->
<header>
<h1>FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling</h1>
<div class="authors">WeChat · Institute of Automation, Chinese Academy of Sciences | arXiv:2603.06199 | GitHub: qhfan/FlashPrefill</div>
</header>
<!-- 2. Key Insight Highlight -->
<div class="highlight-box">
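💡 <span style="font-size: 40px; color: #d35400;">Key breakthrough:</span> near-zero-cost discovery and selection of important blocks + a dynamic threshold that prunes the long tail of attention + index compaction that enables physical jumps
<br>🚀 <strong>27.78×</strong> operator speedup at 256K context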
</div>
<!-- 3. Motivation & Pain Points -->
<div class="content-grid">
<div class="card">
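<div class="section-title">Background and Motivation</div>
<p style="font-size: 26px; margin-bottom: 20px;"><strong>The prefill-stage bottleneck:</strong></p>
<ul>
<li>Prefill processes the entire prompt to build the KV cache, with attention cost $O(L^2)$ in the prompt length $L$.</li>
<li>Long contexts make TTFT (time to first token) grow sharply.</li>
<li>The user interface stalls and the experience suffers.</li>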
</ul>
<div class="chart-placeholder" style="height: 200px;">
<img src="https://img.icons8.com/ios-filled/100/000000/time.png" style="width: 60px; margin-right: 15px;">
<span>Prefill time grows quadratically with prompt length</span>
</div>
</div>
<div class="card">
<div class="section-title">Pain Points of Existing Methods</div>
<div class="pain-point">
<h4>❌ High pattern-discovery latency</h4>
<p style="font-size: 24px;">Estimating block importance is itself computationally expensive.</p>
</div>
<div class="pain-point">
<h4>❌ Expensive selection strategies</h4>
<p style="font-size: 24px;">Top-k sorting and Top-p cumulative sums are hard to parallelize.</p>
</div>
<div class="pain-point">
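<h4>❌ Incomplete sparsity</h4>
<p style="font-size: 24px;">Under a long-tailed distribution, filling a quota of K blocks or a cumulative probability P pulls in many useless blocks.</p>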
</div>
</div>
</div>
<!-- 4. Methodology -->
<div class="card">
<div class="section-title">The FlashPrefill Method</div>
<div class="method-container">
<!-- Method 1 -->
<div class="method-card">
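<h3>1. Instantaneous Pattern Discovery</h3>
<div class="tag tag-blue">Approximation</div>
<div class="method-step">Block-level approximation: block means serve as a proxy for ranking.</div>
<div class="method-step">Low-variance assumption: tokens within a block are semantically similar.</div>
<div class="tag tag-blue" style="margin-top: 20px;">GPU memory optimization</div>
<div class="method-step">Key/Query pooling + global re-weighting.</div>
<div class="method-step">Produces a global "attention map".</div>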
<div style="margin-top: 15px; font-size: 22px; color: #666;">
📉 Data movement drops from $L \times L/B$ to $(L/B)^2$, where $B$ is the block size.
</div>
</div>
<!-- Method 2 -->
<div class="method-card">
<h3>2. Dynamic Threshold Pruning</h3>
<div class="formula">
Threshold = α × max(Score)
</div>
<ul style="font-size: 22px;">
<li><strong>Very cheap to compute:</strong> a single max-reduction suffices.</li>
<li><strong>Prunes the long tail:</strong> no quota to fill; low-scoring blocks are dropped outright.</li>
</ul>
<div style="background: #eef2f5; padding: 15px; border-radius: 10px; margin-top: 15px;">
<p style="font-size: 22px; margin-bottom: 10px;">🛡️ Safety net:</p>
<ul style="font-size: 20px; color: #555;">
<li>Keep attention sinks (first 256)</li>
<li>Keep the local window (most recent 512)</li>
</ul>
</div>
</div>
<!-- Method 3 -->
<div class="method-card">
<h3>3. Index Compaction for Physical Jumps</h3>
<p style="font-size: 24px;"><strong>Problem:</strong> logical skipping (if mask=0) still iterates over every block and incurs heavy branching overhead.</p>
<hr style="margin: 20px 0; border-style: dashed;">
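<p style="font-size: 24px;"><strong>Solution:</strong></p>
<ul style="font-size: 22px;">
<li>Compact the indices of the valid blocks into a contiguous list.</li>
<li>The kernel's inner loop iterates only over the valid indices.</li>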
</ul>
<div style="margin-top: 20px; background: #d5f5e3; padding: 15px; border-radius: 10px; text-align: center;">
<span style="font-size: 24px; color: #27ae60; font-weight: bold;">Logical skipping → physical jump</span>
<br>
<span style="font-size: 20px;">More concentrated memory access, far fewer loop iterations</span>
</div>
</div>
</div>
</div>
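<div class="card">
<p style="font-size: 24px; margin-bottom: 20px;">The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released kernel: the function names <code>mean_pool</code>, <code>block_scores</code>, and <code>select_blocks</code> are hypothetical, block scores are assumed to be non-negative (for example, softmax-normalized), the sink and local-window sizes are counted in blocks rather than tokens, causal masking is applied at block granularity, and the global re-weighting step is omitted.</p>
<pre style="font-size: 20px;">
```python
def mean_pool(vectors, block_size):
    """Average each run of block_size rows into one representative vector."""
    pooled = []
    for start in range(0, len(vectors), block_size):
        block = vectors[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(row[d] for row in block) / len(block) for d in range(dim)])
    return pooled


def block_scores(pooled_q, pooled_k):
    """Score every query block against every key block: (L/B)^2 dot products."""
    return [[sum(a * b for a, b in zip(q, k)) for k in pooled_k] for q in pooled_q]


def select_blocks(scores, alpha, sink_blocks, local_blocks):
    """Return, per query block, a compact sorted list of key-block indices."""
    selected = []
    for qi, row in enumerate(scores):
        causal = row[:qi + 1]               # key blocks at or before this query block
        threshold = alpha * max(causal)     # single max-reduction, no sorting
        keep = {ki for ki, s in enumerate(causal) if s >= threshold}
        keep.update(range(min(sink_blocks, qi + 1)))                # attention sinks
        keep.update(range(max(0, qi + 1 - local_blocks), qi + 1))   # local window
        selected.append(sorted(keep))       # compact list: the inner loop visits only these
    return selected


# Hypothetical, already-normalized block scores for four query blocks.
scores = [
    [0.90, 0.00, 0.00, 0.00],
    [0.20, 0.80, 0.00, 0.00],
    [0.05, 0.10, 0.85, 0.00],
    [0.60, 0.02, 0.03, 0.35],
]
compact = select_blocks(scores, alpha=0.5, sink_blocks=1, local_blocks=1)
print(compact)  # [[0], [0, 1], [0, 2], [0, 3]]
```
</pre>
<p style="font-size: 22px; margin-top: 20px;">An attention kernel written against this layout would loop over each row of <code>compact</code> directly instead of testing a mask for every block, which is the "physical jump" idea in step 3.</p>
</div>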
<!-- 5. Results -->
<div class="card" style="background: #f8fbfd;">
<div class="section-title">Experimental Results</div>
<div class="results-grid">
<div>
<h3 style="font-size: 32px; margin-bottom: 20px;">⚡ Speedup</h3>
<div class="stat-box">
<div class="stat-number">27.78x</div>
<div class="stat-label">Prefill operator speedup (Qwen3-30B @ 256K)</div>
</div>
<div style="margin-top: 20px; font-size: 24px;">
<p><strong>Sparsity comparison (256K), as fraction of blocks retained:</strong></p>
<ul>
<li>FlashPrefill: <strong>3.5%</strong></li>
<li>FlexPrefill: 8.4%</li>
<li>XAttention: 18.5%</li>
</ul>
</div>
</div>
<div>
<h3 style="font-size: 32px; margin-bottom: 20px;">🎯 Accuracy Preserved</h3>
<ul style="font-size: 26px; line-height: 1.8;">
<li><strong>RULER (Qwen3-30B):</strong> 92.68 vs Full 93.28 (slight drop)</li>
<li><strong>InfiniteBench (Qwen2.5-7B):</strong> 24.93 vs Full 23.87 (slightly better)</li>
<li><strong>Needle-in-Haystack:</strong> retrieval is nearly lossless from 2K to 256K.</li>
</ul>
<div class="chart-placeholder" style="height: 150px;">
<p>End-to-end TTFT significantly reduced (integrated with vLLM)</p>
</div>
</div>
</div>
</div>
<!-- Footer -->
<footer>
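Summary: through approximate computation, dynamic thresholding, and physical jumps, FlashPrefill makes sparsification of long-context prefill nearly free and substantially improves inference efficiency.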
</footer>
</div>
</body>
</html>