
FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

✨步子哥 (steper) March 13, 2026, 15:30
<!DOCTYPE html> <html lang="zh-CN"> <head> <meta charset="UTF-8"> <title>FlashPrefill Poster</title> <style> :root { --primary-color: #003366; /* Deep Academic Blue */ --secondary-color: #0066cc; /* Lighter Blue */ --accent-color: #e67e22; /* Orange for highlights */ --success-color: #27ae60; /* Green for results */ --bg-color: #f0f3f5; --card-bg: #ffffff; --text-main: #2c3e50; --text-light: #576574; } * { box-sizing: border-box; margin: 0; padding: 0; } body { width: 1200px; height: 3000px; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; background-color: var(--bg-color); color: var(--text-main); line-height: 1.5; overflow: hidden; } .poster-container { display: grid; grid-template-columns: 1fr; grid-template-rows: auto auto auto auto auto; gap: 40px; padding: 60px; } /* Header */ header { background: linear-gradient(135deg, var(--primary-color), var(--secondary-color)); color: white; padding: 60px; border-radius: 20px; box-shadow: 0 10px 20px rgba(0,0,0,0.15); text-align: center; } h1 { font-size: 72px; font-weight: 800; margin-bottom: 20px; letter-spacing: -1px; } .authors { font-size: 32px; opacity: 0.9; margin-bottom: 15px; } .links { font-size: 24px; font-family: 'Courier New', monospace; background: rgba(255,255,255,0.2); display: inline-block; padding: 10px 20px; border-radius: 10px; } /* Highlight Box */ .highlight-box { background: linear-gradient(to right, #fff3cd, #ffeeba); border-left: 15px solid var(--accent-color); padding: 40px; border-radius: 15px; font-size: 36px; font-weight: bold; color: #856404; box-shadow: 0 5px 15px rgba(0,0,0,0.05); text-align: center; } /* Section Styles */ .section-title { font-size: 48px; font-weight: 700; color: var(--primary-color); border-bottom: 4px solid var(--secondary-color); padding-bottom: 10px; margin-bottom: 30px; display: flex; align-items: center; } .section-title::before { content: ''; display: inline-block; width: 20px; height: 50px; background: var(--secondary-color); margin-right: 20px; 
border-radius: 5px; } .card { background: var(--card-bg); border-radius: 20px; padding: 40px; box-shadow: 0 8px 16px rgba(0,0,0,0.05); margin-bottom: 20px; } /* Layout Grid */ .content-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 40px; } /* Motivation & Pain Points */ .pain-point { background: #fff5f5; border-left: 8px solid #e74c3c; padding: 20px; margin-bottom: 20px; } .pain-point h4 { color: #c0392b; font-size: 28px; margin-bottom: 10px; } /* Methodology Section */ .method-container { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 30px; } .method-card { background: white; border-top: 10px solid var(--secondary-color); border-radius: 15px; padding: 30px; box-shadow: 0 5px 15px rgba(0,0,0,0.08); } .method-card h3 { font-size: 32px; color: var(--secondary-color); margin-bottom: 20px; min-height: 80px; /* Align titles */ } .method-step { font-size: 24px; margin-bottom: 15px; padding-left: 20px; border-left: 4px solid #bdc3c7; } /* Results Section */ .results-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 40px; } .stat-box { background: var(--primary-color); color: white; padding: 40px; border-radius: 15px; text-align: center; } .stat-number { font-size: 96px; font-weight: 800; color: var(--accent-color); line-height: 1.1; } .stat-label { font-size: 28px; margin-top: 10px; } .chart-placeholder { background: #ecf0f1; height: 300px; border-radius: 10px; display: flex; align-items: center; justify-content: center; color: #7f8c8d; font-size: 24px; margin-top: 20px; border: 2px dashed #bdc3c7; } /* Footer */ footer { text-align: center; font-size: 24px; color: var(--text-light); margin-top: 20px; padding-top: 40px; border-top: 2px solid #dcdcdc; } ul { list-style-position: inside; font-size: 26px; } li { margin-bottom: 10px; } .tag { display: inline-block; padding: 5px 15px; border-radius: 5px; font-size: 20px; font-weight: bold; color: white; margin-right: 10px; } .tag-blue { background: var(--secondary-color); } .tag-orange { 
background: var(--accent-color); } .formula { background: #f8f9fa; padding: 15px; border-radius: 10px; font-family: 'Times New Roman', serif; font-style: italic; font-size: 28px; text-align: center; margin: 15px 0; border: 1px solid #dee2e6; } </style> </head> <body> <div class="poster-container"> <!-- 1. Header --> <header> <h1>FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling</h1> <div class="authors">WeChat · Institute of Automation, Chinese Academy of Sciences | arXiv:2603.06199 | GitHub: qhfan/FlashPrefill</div> </header> <!-- 2. Key Insight Highlight --> <div class="highlight-box"> 💡 <span style="font-size: 40px; color: #d35400;">Key breakthrough:</span> near-zero-cost "important-block discovery" and "selection strategy" + a dynamic threshold that prunes the attention long tail + index compaction that enables physical jumps <br>🚀 <strong>27.78×</strong> operator speedup at 256K context </div> <!-- 3. Motivation & Pain Points --> <div class="content-grid"> <div class="card"> <div class="section-title">Background and Motivation</div> <p style="font-size: 26px; margin-bottom: 20px;"><strong>Bottleneck in the prefill stage:</strong></p> <ul> <li>Prefill processes the entire prompt to build the KV cache, with complexity $O(L^2)$.</li> <li>Long contexts make TTFT (time to first token) grow sharply.</li> <li>The user interface stalls and the experience degrades badly.</li> </ul> <div class="chart-placeholder" style="height: 200px;"> <img src="https://img.icons8.com/ios-filled/100/000000/time.png" style="width: 60px; margin-right: 15px;"> <span>Prefill time grows quadratically with length</span> </div> </div> <div class="card"> <div class="section-title">Pain Points of Existing Methods</div> <div class="pain-point"> <h4>❌ High pattern-discovery latency</h4> <p style="font-size: 24px;">Estimating block importance is itself computationally expensive.</p> </div> <div class="pain-point"> <h4>❌ Expensive selection strategies</h4> <p style="font-size: 24px;">Top-k sorting and top-p cumulative sums are hard to parallelize.</p> </div> <div class="pain-point"> <h4>❌ Incomplete sparsity</h4> <p style="font-size: 24px;">Under a long-tailed distribution, filling a quota of K blocks or a cumulative probability P pulls in many useless blocks.</p> </div> </div> </div> <!-- 4. Methodology --> <div class="card"> <div class="section-title">The FlashPrefill Method</div> <div class="method-container"> <!-- Method 1 --> <div class="method-card"> <h3>1. 
Instantaneous Pattern Discovery</h3> <div class="tag tag-blue">Approximation</div> <div class="method-step">Block-level approximation: block means serve as proxies for ranking.</div> <div class="method-step">Low-variance assumption: tokens within a block are semantically similar.</div> <div class="tag tag-blue" style="margin-top: 20px;">Memory optimization</div> <div class="method-step">Key/Query pooling + global reweighting.</div> <div class="method-step">Produces a global "attention map".</div> <div style="margin-top: 15px; font-size: 22px; color: #666;"> 📉 Data movement drops from $L \times L/B$ to $(L/B)^2$ ($B$: block size). </div> </div> <!-- Method 2 --> <div class="method-card"> <h3>2. Dynamic Threshold Pruning</h3> <div class="formula"> Threshold = α × max(Score) </div> <ul style="font-size: 22px;"> <li><strong>Very cheap:</strong> only a single max-reduction is needed.</li> <li><strong>Prunes the long tail:</strong> no quota to fill; low-scoring blocks are dropped outright.</li> </ul> <div style="background: #eef2f5; padding: 15px; border-radius: 10px; margin-top: 15px;"> <p style="font-size: 22px; margin-bottom: 10px;">🛡️ Safety net:</p> <ul style="font-size: 20px; color: #555;"> <li>Keep attention sinks (first 256)</li> <li>Keep the local window (most recent 512)</li> </ul> </div> </div> <!-- Method 3 --> <div class="method-card"> <h3>3. Index Compaction for Physical Jumps</h3> <p style="font-size: 24px;"><strong>Problem:</strong> logical skipping (if mask=0) still iterates over every block, and the branching overhead is high.</p> <hr style="margin: 20px 0; border-style: dashed;"> <p style="font-size: 24px;"><strong>Solution:</strong></p> <ul style="font-size: 22px;"> <li>Compact the indices of the kept blocks into a contiguous list.</li> <li>The kernel's inner loop iterates only over the kept indices.</li> </ul> <div style="margin-top: 20px; background: #d5f5e3; padding: 15px; border-radius: 10px; text-align: center;"> <span style="font-size: 24px; color: #27ae60; font-weight: bold;">Logical skipping → physical jumps</span> <br> <span style="font-size: 20px;">More concentrated memory access, far fewer loop iterations</span> </div> </div> </div> </div> <!-- 5. 
Results --> <div class="card" style="background: #f8fbfd;"> <div class="section-title">Experimental Results</div> <div class="results-grid"> <div> <h3 style="font-size: 32px; margin-bottom: 20px;">⚡ Speedup</h3> <div class="stat-box"> <div class="stat-number">27.78x</div> <div class="stat-label">Prefill operator speedup (Qwen3-30B @ 256K)</div> </div> <div style="margin-top: 20px; font-size: 24px;"> <p><strong>Sparsity comparison (256K):</strong></p> <ul> <li>FlashPrefill: <strong>3.5%</strong> (blocks kept)</li> <li>FlexPrefill: 8.4%</li> <li>XAttention: 18.5%</li> </ul> </div> </div> <div> <h3 style="font-size: 32px; margin-bottom: 20px;">🎯 Accuracy Preserved</h3> <ul style="font-size: 26px; line-height: 1.8;"> <li><strong>RULER (Qwen3-30B):</strong> 92.68 vs. 93.28 for full attention (slight drop)</li> <li><strong>InfiniteBench (Qwen2.5-7B):</strong> 24.93 vs. 23.87 for full attention (slightly better)</li> <li><strong>Needle-in-Haystack:</strong> retrieval is nearly lossless from 2K to 256K.</li> </ul> <div class="chart-placeholder" style="height: 150px;"> <p>End-to-end TTFT drops significantly (integrated with vLLM)</p> </div> </div> </div> </div> <!-- Footer --> <footer> Summary: through approximate computation, dynamic thresholding, and physical jumps, FlashPrefill makes sparsification of long-context prefill nearly free and substantially improves inference efficiency. </footer> </div> </body> </html>
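The three steps on the poster (pooled block scoring, the `α × max(Score)` threshold with a sink/local safety net, and index compaction) can be illustrated with a small NumPy sketch. This is not the authors' kernel. The function names, the use of a softmax over mean-pooled block logits as the "Score", and the `sink_blocks` / `local_blocks` parameters (counted in blocks here) are all assumptions made for illustration, and the poster's "global reweighting" step is omitted because it is not specified.

```python
import numpy as np

def select_blocks(q, k, block=4, alpha=0.1, sink_blocks=1, local_blocks=1):
    """Hypothetical helper: per query block, a compacted list of key blocks to visit."""
    L, d = q.shape
    nb = L // block  # sketch assumes L is a multiple of block
    # 1) Pattern discovery: mean-pool queries and keys per block, score block pairs.
    q_pool = q.reshape(nb, block, d).mean(axis=1)
    k_pool = k.reshape(nb, block, d).mean(axis=1)
    logits = q_pool @ k_pool.T / np.sqrt(d)
    causal = np.tril(np.ones((nb, nb), dtype=bool))
    logits = np.where(causal, logits, -np.inf)
    scores = np.exp(logits - logits.max(axis=1, keepdims=True))
    scores /= scores.sum(axis=1, keepdims=True)  # assumed score: block-level softmax
    # 2) Dynamic threshold: one max-reduction per row, no sorting or cumulative sum.
    keep = scores >= alpha * scores.max(axis=1, keepdims=True)
    # Safety net: always keep the first blocks (sinks) and the most recent blocks.
    keep[:, :sink_blocks] = True
    idx = np.arange(nb)
    keep |= (idx[:, None] - idx[None, :]) < local_blocks
    keep &= causal
    # 3) Index compaction: turn each mask row into a contiguous index list.
    return [np.flatnonzero(row) for row in keep]

def sparse_attention(q, k, v, block_lists, block=4):
    """Causal attention that loops only over the compacted block indices.

    Assumes every list contains its own diagonal block (local_blocks >= 1).
    """
    d = q.shape[1]
    out = np.zeros_like(v)
    for qb, kbs in enumerate(block_lists):
        rows = np.arange(qb * block, (qb + 1) * block)
        cols = np.concatenate([np.arange(b * block, (b + 1) * block) for b in kbs])
        logits = q[rows] @ k[cols].T / np.sqrt(d)
        # Token-level causal mask inside the visited blocks.
        logits = np.where(cols[None, :] <= rows[:, None], logits, -np.inf)
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        out[rows] = (w / w.sum(axis=1, keepdims=True)) @ v[cols]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((32, 8)) for _ in range(3))
block_lists = select_blocks(q, k, alpha=0.5)
out = sparse_attention(q, k, v, block_lists)
print([len(b) for b in block_lists])  # number of key blocks visited per query block
```

With `alpha=0.0` every causal block passes the threshold, so the sketch reduces to ordinary dense causal attention. Raising `alpha` shrinks each index list and with it the inner loop, which is the "logical skipping → physical jumps" idea in miniature.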
