
Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

✨步子哥 (steper) December 11, 2025 08:27
<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation</title> <link href="https://fonts.googleapis.com/css2?family=Roboto:wght@300;400;500;700&family=Roboto+Slab:wght@400;700&display=swap" rel="stylesheet"> <link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"> <style> * { margin: 0; padding: 0; box-sizing: border-box; } body { font-family: 'Roboto', sans-serif; background-color: #f0f4f8; color: #333; line-height: 1.6; } .poster-container { width: 720px; min-height: 960px; margin: 0 auto; background: linear-gradient(135deg, #e6f0ff 0%, #f5f9ff 100%); padding: 40px 30px; position: relative; overflow: hidden; } .poster-container::before { content: ""; position: absolute; top: 0; left: 0; width: 100%; height: 100%; background-image: radial-gradient(circle at 10% 20%, rgba(100, 149, 237, 0.1) 0%, transparent 20%), radial-gradient(circle at 90% 80%, rgba(65, 105, 225, 0.1) 0%, transparent 20%), linear-gradient(45deg, rgba(100, 149, 237, 0.05) 0%, transparent 70%); z-index: 0; } .grid-texture { position: absolute; top: 0; left: 0; width: 100%; height: 100%; background-image: linear-gradient(rgba(255, 255, 255, 0.1) 1px, transparent 1px), linear-gradient(90deg, rgba(255, 255, 255, 0.1) 1px, transparent 1px); background-size: 20px 20px; z-index: 0; } .content { position: relative; z-index: 1; } .header { text-align: center; margin-bottom: 30px; padding-bottom: 20px; border-bottom: 2px solid #4169e1; } .title { font-family: 'Roboto Slab', serif; font-size: 36px; font-weight: 700; color: #1a3a8f; margin-bottom: 15px; line-height: 1.2; } .authors { font-size: 16px; color: #4169e1; margin-bottom: 10px; } .affiliations { font-size: 14px; color: #555; margin-bottom: 10px; } .publication { font-size: 14px; color: #666; font-style: italic; } .section { background-color: 
rgba(255, 255, 255, 0.85); border-radius: 12px; padding: 20px; margin-bottom: 25px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.05); backdrop-filter: blur(5px); } .section-title { font-family: 'Roboto Slab', serif; font-size: 24px; font-weight: 700; color: #1a3a8f; margin-bottom: 15px; display: flex; align-items: center; } .section-title .material-icons { margin-right: 10px; color: #4169e1; } .section-content { font-size: 16px; } .highlight { background-color: rgba(65, 105, 225, 0.1); padding: 2px 5px; border-radius: 4px; font-weight: 500; } .bullet-list { padding-left: 25px; margin-bottom: 15px; } .bullet-list li { margin-bottom: 8px; } .two-column { display: flex; gap: 20px; margin-bottom: 15px; } .column { flex: 1; } .image-container { text-align: center; margin: 15px 0; } .image-container img { max-width: 100%; border-radius: 8px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); } .image-caption { font-size: 14px; color: #666; margin-top: 8px; text-align: center; } .finding-card { background-color: rgba(65, 105, 225, 0.05); border-left: 4px solid #4169e1; padding: 12px 15px; margin-bottom: 12px; border-radius: 0 8px 8px 0; } .code-link { display: inline-flex; align-items: center; background-color: #4169e1; color: white; padding: 8px 15px; border-radius: 20px; text-decoration: none; font-weight: 500; margin-top: 10px; } .code-link .material-icons { margin-right: 5px; font-size: 18px; } .footer { text-align: center; margin-top: 30px; color: #666; font-size: 14px; } </style> </head> <body> <div class="poster-container"> <div class="grid-texture"></div> <div class="content"> <!-- Header Section --> <div class="header"> <h1 class="title">Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation</h1> <p class="authors">Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li</p> <p class="affiliations">Georgia Institute of Technology, Meta 
AI, University of Illinois Urbana-Champaign, National University of Singapore</p> <p class="publication">arXiv:2510.07414 (October 2025)</p> </div> <!-- Introduction Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">lightbulb</i> Introduction </h2> <div class="section-content"> <div class="two-column"> <div class="column"> <ul class="bullet-list"> <li>Modern long-context LLMs perform well on synthetic <span class="highlight">"needle-in-a-haystack" (NIAH)</span> benchmarks</li> <li>These tests overlook how noisy contexts arise from biased retrieval and agentic workflows</li> <li>Need for more realistic evaluation that captures real-world factors</li> </ul> </div> <div class="column"> <div class="image-container"> <img src="https://sfile.chatglm.cn/moeSlide/image/75/752c3cec.jpg" alt="Needle in a haystack visualization" width="300"> <p class="image-caption">Traditional needle-in-a-haystack evaluation</p> </div> </div> </div> </div> </div> <!-- Haystack Engineering Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">architecture</i> Haystack Engineering </h2> <div class="section-content"> <ul class="bullet-list"> <li>New paradigm to construct realistic noisy long contexts</li> <li>Captures key real-world factors: <ul class="bullet-list"> <li>Distraction from heterogeneous biased retrievers</li> <li>Cascading errors in agentic workflows</li> </ul> </li> <li>Contrast with "context engineering" (optimizing inputs for best performance)</li> </ul> </div> </div> <!-- HaystackCraft Benchmark Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">assessment</i> HaystackCraft Benchmark </h2> <div class="section-content"> <ul class="bullet-list"> <li>Built on full English Wikipedia hyperlink network</li> <li>Features multi-hop questions</li> <li>Extends traditional NIAH evaluations in two ways: <ul class="bullet-list"> <li>Heterogeneous Retrieval-Dependent Haystacks</li> 
<li>Dynamic, LLM-Dependent Agentic Context Engineering</li> </ul> </li> </ul> </div> </div> <!-- Heterogeneous Retrieval Strategies Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">compare_arrows</i> Heterogeneous Retrieval Strategies </h2> <div class="section-content"> <div class="two-column"> <div class="column"> <p>Evaluates how different retrieval strategies affect:</p> <ul class="bullet-list"> <li>Distractor composition</li> <li>Haystack ordering</li> <li>LLM performance</li> </ul> <p>Strategies compared:</p> <ul class="bullet-list"> <li>Sparse Retrieval (BM25)</li> <li>Dense Retrieval (Qwen3-Embedding-0.6B)</li> <li>Hybrid Retrieval (BM25 + Qwen3-Embedding-0.6B)</li> <li>Graph-Based Reranking (Personalized PageRank - PPR)</li> </ul> </div> <div class="column"> <div class="image-container"> <img src="https://sfile.chatglm.cn/moeSlide/image/9f/9f0f5ca8.jpg" alt="Comparison of retrieval strategies" width="300"> <p class="image-caption">Comparison of different retrieval methods</p> </div> </div> </div> </div> </div> <!-- Agentic Context Engineering Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">psychology</i> Agentic Context Engineering </h2> <div class="section-content"> <div class="two-column"> <div class="column"> <p>Extends NIAH to dynamic, LLM-dependent settings</p> <p>Simulates agentic operations where models:</p> <ul class="bullet-list"> <li>Refine queries</li> <li>Reflect on past reasonings</li> <li>Decide when to stop</li> </ul> <p>Two dynamic settings:</p> <ul class="bullet-list"> <li>Enforced Multi-Round</li> <li>Variable-Round</li> </ul> </div> <div class="column"> <div class="image-container"> <img src="https://sfile.chatglm.cn/moeSlide/image/47/47288779.jpg" alt="Agentic workflow visualization" width="300"> <p class="image-caption">Agentic workflow with cascading errors</p> </div> </div> </div> </div> </div> <!-- Key Findings Section --> <div class="section"> <h2 
class="section-title"> <i class="material-icons">insights</i> Key Findings </h2> <div class="section-content"> <div class="finding-card"> <p>Dense retrievers introduce more challenging distractors than sparse ones</p> </div> <div class="finding-card"> <p>Graph-based reranking with PPR significantly improves retrieval effectiveness</p> </div> <div class="finding-card"> <p>Document ordering effects are model-dependent</p> </div> <div class="finding-card"> <p>Even advanced models (Gemini 2.5 Pro, GPT-5) suffer from cascading self-distraction</p> </div> <div class="finding-card"> <p>Models are more robust to noisy long contexts ("width") than to noisy reasoning iterations ("depth")</p> </div> <div class="finding-card"> <p>Most models struggle to stop early at the right point in variable-round settings</p> </div> </div> </div> <!-- Conclusion Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">flag</i> Conclusion </h2> <div class="section-content"> <ul class="bullet-list"> <li>Robust agentic long-context reasoning remains an unsolved challenge</li> <li>HaystackCraft serves as a valuable testbed for future progress</li> </ul> <a href="https://github.com/Graph-COM/HaystackCraft" class="code-link" target="_blank"> <i class="material-icons">code</i> Code available on GitHub </a> </div> </div> <div class="footer"> © 2025 Haystack Engineering Research Team </div> </div> </div> </body> </html>
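The poster lists graph-based reranking with Personalized PageRank (PPR) but does not say how the scores are computed. Below is a minimal sketch of one way it could work, not the HaystackCraft implementation. It assumes that the restart distribution is proportional to each candidate's first-stage retriever score and that candidates are re-sorted by their PPR mass. The function names, the toy graph, and the scores are all hypothetical.

```python
def personalized_pagerank(links, seeds, alpha=0.85, iters=50):
    """Power iteration for Personalized PageRank on a directed graph.

    links: dict node -> list of out-neighbours (a toy hyperlink graph here)
    seeds: dict node -> non-negative restart weight (assumed: retriever score)
    alpha: probability of following a link; 1 - alpha is the restart probability
    """
    nodes = set(links) | {v for outs in links.values() for v in outs} | set(seeds)
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # Every node first receives its share of the restart mass.
        nxt = {n: (1.0 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            outs = links.get(n, [])
            if outs:
                share = alpha * rank[n] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # Dangling node: send its mass back through the restart distribution.
                for m in nodes:
                    nxt[m] += alpha * rank[n] * restart[m]
        rank = nxt
    return rank


def rerank_with_ppr(candidates, links, alpha=0.85):
    """candidates: list of (doc_id, retriever_score). Returns doc ids by PPR score."""
    rank = personalized_pagerank(links, dict(candidates), alpha)
    return sorted((doc for doc, _ in candidates), key=lambda d: rank[d], reverse=True)


# Hypothetical toy data: two mutually linked supporting pages and one isolated
# page that the first-stage retriever scored highest.
toy_links = {"needle_a": ["needle_b"], "needle_b": ["needle_a"], "distractor": []}
toy_candidates = [("distractor", 2.0), ("needle_a", 1.0), ("needle_b", 1.0)]
reranked = rerank_with_ppr(toy_candidates, toy_links)
print(reranked)  # the isolated distractor ends up last
```

In this toy case the isolated page has the highest retriever score but receives no link mass, so the two mutually linked pages move ahead of it. That is the intuition behind using hyperlink structure to demote distractors.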

Discussion Replies

1 reply
✨步子哥 (steper) #1
12-11 08:38
<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation</title> <link href="https://fonts.googleapis.com/css2?family=Noto+Sans+SC:wght@300;400;500;700&family=Noto+Serif+SC:wght@400;700&display=swap" rel="stylesheet"> <link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"> <style> * { margin: 0; padding: 0; box-sizing: border-box; } body { font-family: 'Noto Sans SC', sans-serif; background-color: #f0f4f8; color: #333; line-height: 1.6; } .poster-container { width: 720px; min-height: 960px; margin: 0 auto; background: linear-gradient(135deg, #e6f0ff 0%, #f5f9ff 100%); padding: 40px 30px; position: relative; overflow: hidden; } .poster-container::before { content: ""; position: absolute; top: 0; left: 0; width: 100%; height: 100%; background-image: radial-gradient(circle at 10% 20%, rgba(100, 149, 237, 0.1) 0%, transparent 20%), radial-gradient(circle at 90% 80%, rgba(65, 105, 225, 0.1) 0%, transparent 20%), linear-gradient(45deg, rgba(100, 149, 237, 0.05) 0%, transparent 70%); z-index: 0; } .grid-texture { position: absolute; top: 0; left: 0; width: 100%; height: 100%; background-image: linear-gradient(rgba(255, 255, 255, 0.1) 1px, transparent 1px), linear-gradient(90deg, rgba(255, 255, 255, 0.1) 1px, transparent 1px); background-size: 20px 20px; z-index: 0; } .content { position: relative; z-index: 1; } .header { text-align: center; margin-bottom: 30px; padding-bottom: 20px; border-bottom: 2px solid #4169e1; } .title { font-family: 'Noto Serif SC', serif; font-size: 36px; font-weight: 700; color: #1a3a8f; margin-bottom: 15px; line-height: 1.2; } .authors { font-size: 16px; color: #4169e1; margin-bottom: 10px; } .affiliations { font-size: 14px; color: #555; margin-bottom: 10px; } .publication { font-size: 14px; color: #666; font-style: italic; } .section { background-color: rgba(255, 255, 255, 0.85); border-radius: 12px; 
padding: 20px; margin-bottom: 25px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.05); backdrop-filter: blur(5px); } .section-title { font-family: 'Noto Serif SC', serif; font-size: 24px; font-weight: 700; color: #1a3a8f; margin-bottom: 15px; display: flex; align-items: center; } .section-title .material-icons { margin-right: 10px; color: #4169e1; } .section-content { font-size: 16px; } .highlight { background-color: rgba(65, 105, 225, 0.1); padding: 2px 5px; border-radius: 4px; font-weight: 500; } .bullet-list { padding-left: 25px; margin-bottom: 15px; } .bullet-list li { margin-bottom: 8px; } .two-column { display: flex; gap: 20px; margin-bottom: 15px; } .column { flex: 1; } .image-container { text-align: center; margin: 15px 0; } .image-container img { max-width: 100%; border-radius: 8px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); } .image-caption { font-size: 14px; color: #666; margin-top: 8px; text-align: center; } .finding-card { background-color: rgba(65, 105, 225, 0.05); border-left: 4px solid #4169e1; padding: 12px 15px; margin-bottom: 12px; border-radius: 0 8px 8px 0; } .code-link { display: inline-flex; align-items: center; background-color: #4169e1; color: white; padding: 8px 15px; border-radius: 20px; text-decoration: none; font-weight: 500; margin-top: 10px; } .code-link .material-icons { margin-right: 5px; font-size: 18px; } .footer { text-align: center; margin-top: 30px; color: #666; font-size: 14px; } </style> </head> <body> <div class="poster-container"> <div class="grid-texture"></div> <div class="content"> <!-- Header Section --> <div class="header"> <h1 class="title">Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation</h1> <p class="authors">Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li</p> <p class="affiliations">Georgia Institute of Technology, Meta AI, University of Illinois Urbana-Champaign, National University of Singapore</p> <p class="publication">arXiv:2510.07414 (October 2025)</p> </div> <!-- Introduction Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">lightbulb</i> Introduction </h2> <div class="section-content"> <div 
class="two-column"> <div class="column"> <ul class="bullet-list"> <li>Modern long-context large language models (LLMs) perform well on synthetic <span class="highlight">"needle-in-a-haystack" (NIAH)</span> benchmarks</li> <li>These tests overlook how biased retrieval and agentic workflows produce noisy contexts</li> <li>More realistic evaluation is needed to capture real-world factors</li> </ul> </div> <div class="column"> <div class="image-container"> <img src="https://sfile.chatglm.cn/moeSlide/image/75/752c3cec.jpg" alt="Needle-in-a-haystack concept illustration" width="300"> <p class="image-caption">Traditional needle-in-a-haystack evaluation</p> </div> </div> </div> </div> </div> <!-- Haystack Engineering Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">architecture</i> Haystack Engineering </h2> <div class="section-content"> <ul class="bullet-list"> <li>A new paradigm for constructing realistic, noisy long contexts</li> <li>Captures key real-world factors: <ul class="bullet-list"> <li>Distraction from heterogeneous, biased retrievers</li> <li>Cascading errors in agentic workflows</li> </ul> </li> <li>Contrasts with "context engineering" (optimizing inputs for best performance)</li> </ul> </div> </div> <!-- HaystackCraft Benchmark Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">assessment</i> HaystackCraft Benchmark </h2> <div class="section-content"> <ul class="bullet-list"> <li>Built on the full English Wikipedia hyperlink network</li> <li>Features multi-hop questions</li> <li>Extends traditional NIAH evaluations in two ways: <ul class="bullet-list"> <li>Heterogeneous, retrieval-dependent haystacks</li> <li>Dynamic, LLM-dependent agentic context engineering</li> </ul> </li> </ul> </div> </div> <!-- Heterogeneous Retrieval Strategies Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">compare_arrows</i> Heterogeneous Retrieval Strategies </h2> <div class="section-content"> <div class="two-column"> <div class="column"> <p>Evaluates how different retrieval strategies affect:</p> <ul class="bullet-list"> <li>Distractor composition</li> <li>Haystack ordering</li> <li>LLM performance</li> </ul> <p>Strategies compared:</p> <ul class="bullet-list"> <li>Sparse retrieval (BM25)</li> <li>Dense retrieval (Qwen3-Embedding-0.6B)</li> <li>Hybrid retrieval (BM25 + Qwen3-Embedding-0.6B)</li> <li>Graph-based reranking (Personalized PageRank, PPR)</li> </ul> </div> <div class="column"> <div class="image-container"> <img src="https://sfile.chatglm.cn/moeSlide/image/9f/9f0f5ca8.jpg" alt="Comparison of retrieval strategies" width="300"> <p class="image-caption">Comparison of different retrieval methods</p> </div> </div> </div> </div> </div> <!-- Agentic Context Engineering Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">psychology</i> Agentic Context Engineering </h2> <div 
class="section-content"> <div class="two-column"> <div class="column"> <p>Extends NIAH to dynamic, LLM-dependent settings</p> <p>Simulates agentic operations in which models:</p> <ul class="bullet-list"> <li>Refine queries</li> <li>Reflect on past reasoning</li> <li>Decide when to stop</li> </ul> <p>Two dynamic settings:</p> <ul class="bullet-list"> <li>Enforced multi-round</li> <li>Variable-round</li> </ul> </div> <div class="column"> <div class="image-container"> <img src="https://sfile.chatglm.cn/moeSlide/image/47/47288779.jpg" alt="Agentic workflow visualization" width="300"> <p class="image-caption">Agentic workflow with cascading errors</p> </div> </div> </div> </div> </div> <!-- Key Findings Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">insights</i> Key Findings </h2> <div class="section-content"> <div class="finding-card"> <p>Dense retrievers introduce more challenging distractors than sparse retrievers</p> </div> <div class="finding-card"> <p>Graph-based reranking with PPR significantly improves retrieval effectiveness</p> </div> <div class="finding-card"> <p>Document ordering effects are highly model-dependent</p> </div> <div class="finding-card"> <p>Even advanced models (Gemini 2.5 Pro, GPT-5) suffer from cascading self-distraction</p> </div> <div class="finding-card"> <p>Models are more robust to noisy long contexts ("width") than to noisy reasoning iterations ("depth")</p> </div> <div class="finding-card"> <p>Most models struggle to stop early at the right point in variable-round settings</p> </div> </div> </div> <!-- Conclusion Section --> <div class="section"> <h2 class="section-title"> <i class="material-icons">flag</i> Conclusion </h2> <div class="section-content"> <ul class="bullet-list"> <li>Robust agentic long-context reasoning remains an unsolved challenge</li> <li>HaystackCraft serves as a valuable testbed for future progress</li> </ul> <a href="https://github.com/Graph-COM/HaystackCraft" class="code-link" target="_blank"> <i class="material-icons">code</i> Code available on GitHub </a> </div> </div> <div class="footer"> © 2025 Haystack Engineering Research Team </div> </div> </div> </body> </html>
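The poster names two dynamic settings, enforced multi-round and variable-round, without describing the protocol. The sketch below is one possible reading of the difference and is not the benchmark's actual harness. It assumes that each round retrieves a fresh haystack for the current query, that the model returns a refined query, a reflection note, and optionally an answer, and that only the variable-round setting lets the model stop as soon as it answers. `RoundOutput`, `run_agentic_eval`, and the toy stubs are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RoundOutput:
    refined_query: str            # query the model wants to issue next
    notes: str                    # reflection carried into later rounds
    answer: Optional[str] = None  # set when the model believes it can answer


def run_agentic_eval(
    question: str,
    retrieve: Callable[[str], List[str]],
    model_step: Callable[[str, List[str], List[str]], RoundOutput],
    max_rounds: int,
    allow_early_stop: bool,
) -> Tuple[Optional[str], int]:
    """Return (final answer, number of rounds actually run)."""
    query = question
    history: List[str] = []
    answer: Optional[str] = None
    for _ in range(max_rounds):
        haystack = retrieve(query)  # the haystack depends on the current query
        out = model_step(question, haystack, history)
        history.append(out.notes)   # an earlier mistake stays visible in later rounds
        answer = out.answer
        if allow_early_stop and answer is not None:
            break                   # variable-round: the model decides when to stop
        query = out.refined_query   # enforced multi-round: always continue
    return answer, len(history)


# Hypothetical stubs standing in for a real retriever and a real LLM call.
def toy_retrieve(query: str) -> List[str]:
    return [f"document about {query}"]


def toy_model_step(question: str, haystack: List[str], history: List[str]) -> RoundOutput:
    if len(history) >= 1:
        return RoundOutput(refined_query=question, notes="enough evidence", answer="42")
    return RoundOutput(refined_query=question + " follow-up", notes="need more evidence")


variable = run_agentic_eval("q", toy_retrieve, toy_model_step, max_rounds=4, allow_early_stop=True)
enforced = run_agentic_eval("q", toy_retrieve, toy_model_step, max_rounds=4, allow_early_stop=False)
```

With these stubs the variable-round run stops after the second round, while the enforced run uses all four. Because every round's notes feed into the next, a wrong intermediate note can compound across rounds, which is the "depth" effect the findings describe.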