Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

✨步子哥 (steper) • December 11, 2025, 08:27

                        <!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation</title>
    <link href="https://fonts.googleapis.com/css2?family=Roboto:wght@300;400;500;700&family=Roboto+Slab:wght@400;700&display=swap" rel="stylesheet">
    <link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet">
    <style>
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }
        
        body {
            font-family: 'Roboto', sans-serif;
            background-color: #f0f4f8;
            color: #333;
            line-height: 1.6;
        }
        
        .poster-container {
            width: 720px;
            min-height: 960px;
            margin: 0 auto;
            background: linear-gradient(135deg, #e6f0ff 0%, #f5f9ff 100%);
            padding: 40px 30px;
            position: relative;
            overflow: hidden;
        }
        
        .poster-container::before {
            content: "";
            position: absolute;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            background-image: 
                radial-gradient(circle at 10% 20%, rgba(100, 149, 237, 0.1) 0%, transparent 20%),
                radial-gradient(circle at 90% 80%, rgba(65, 105, 225, 0.1) 0%, transparent 20%),
                linear-gradient(45deg, rgba(100, 149, 237, 0.05) 0%, transparent 70%);
            z-index: 0;
        }
        
        .grid-texture {
            position: absolute;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            background-image: 
                linear-gradient(rgba(255, 255, 255, 0.1) 1px, transparent 1px),
                linear-gradient(90deg, rgba(255, 255, 255, 0.1) 1px, transparent 1px);
            background-size: 20px 20px;
            z-index: 0;
        }
        
        .content {
            position: relative;
            z-index: 1;
        }
        
        .header {
            text-align: center;
            margin-bottom: 30px;
            padding-bottom: 20px;
            border-bottom: 2px solid #4169e1;
        }
        
        .title {
            font-family: 'Roboto Slab', serif;
            font-size: 36px;
            font-weight: 700;
            color: #1a3a8f;
            margin-bottom: 15px;
            line-height: 1.2;
        }
        
        .authors {
            font-size: 16px;
            color: #4169e1;
            margin-bottom: 10px;
        }
        
        .affiliations {
            font-size: 14px;
            color: #555;
            margin-bottom: 10px;
        }
        
        .publication {
            font-size: 14px;
            color: #666;
            font-style: italic;
        }
        
        .section {
            background-color: rgba(255, 255, 255, 0.85);
            border-radius: 12px;
            padding: 20px;
            margin-bottom: 25px;
            box-shadow: 0 4px 12px rgba(0, 0, 0, 0.05);
            backdrop-filter: blur(5px);
        }
        
        .section-title {
            font-family: 'Roboto Slab', serif;
            font-size: 24px;
            font-weight: 700;
            color: #1a3a8f;
            margin-bottom: 15px;
            display: flex;
            align-items: center;
        }
        
        .section-title .material-icons {
            margin-right: 10px;
            color: #4169e1;
        }
        
        .section-content {
            font-size: 16px;
        }
        
        .highlight {
            background-color: rgba(65, 105, 225, 0.1);
            padding: 2px 5px;
            border-radius: 4px;
            font-weight: 500;
        }
        
        .bullet-list {
            padding-left: 25px;
            margin-bottom: 15px;
        }
        
        .bullet-list li {
            margin-bottom: 8px;
        }
        
        .two-column {
            display: flex;
            gap: 20px;
            margin-bottom: 15px;
        }
        
        .column {
            flex: 1;
        }
        
        .image-container {
            text-align: center;
            margin: 15px 0;
        }
        
        .image-container img {
            max-width: 100%;
            border-radius: 8px;
            box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
        }
        
        .image-caption {
            font-size: 14px;
            color: #666;
            margin-top: 8px;
            text-align: center;
        }
        
        .finding-card {
            background-color: rgba(65, 105, 225, 0.05);
            border-left: 4px solid #4169e1;
            padding: 12px 15px;
            margin-bottom: 12px;
            border-radius: 0 8px 8px 0;
        }
        
        .code-link {
            display: inline-flex;
            align-items: center;
            background-color: #4169e1;
            color: white;
            padding: 8px 15px;
            border-radius: 20px;
            text-decoration: none;
            font-weight: 500;
            margin-top: 10px;
        }
        
        .code-link .material-icons {
            margin-right: 5px;
            font-size: 18px;
        }
        
        .footer {
            text-align: center;
            margin-top: 30px;
            color: #666;
            font-size: 14px;
        }
    </style>
</head>
<body>
    <div class="poster-container">
        <div class="grid-texture"></div>
        <div class="content">
            <!-- Header Section -->
            <div class="header">
                <h1 class="title">Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation</h1>
                <p class="authors">Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li</p>
                <p class="affiliations">Georgia Institute of Technology, Meta AI, University of Illinois Urbana-Champaign, National University of Singapore</p>
                <p class="publication">arXiv:2510.07414 (October 2025)</p>
            </div>
            
            <!-- Introduction Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">lightbulb</i>
                    Introduction
                </h2>
                <div class="section-content">
                    <div class="two-column">
                        <div class="column">
                            <ul class="bullet-list">
                                <li>Modern long-context LLMs perform well on synthetic <span class="highlight">"needle-in-a-haystack" (NIAH)</span> benchmarks</li>
                                <li>These tests overlook how noisy contexts arise from biased retrieval and agentic workflows</li>
                                <li>A more realistic evaluation is needed, one that captures these real-world factors</li>
                            </ul>
                        </div>
                        <div class="column">
                            <div class="image-container">
                                <img src="https://sfile.chatglm.cn/moeSlide/image/75/752c3cec.jpg" alt="Needle in a haystack visualization" width="300">
                                <p class="image-caption">Traditional needle-in-a-haystack evaluation</p>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            
            <!-- Haystack Engineering Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">architecture</i>
                    Haystack Engineering
                </h2>
                <div class="section-content">
                    <ul class="bullet-list">
                        <li>A new paradigm for constructing realistic, noisy long contexts</li>
                        <li>Captures key real-world factors:
                            <ul class="bullet-list">
                                <li>Distraction from heterogeneous, biased retrievers</li>
                                <li>Cascading errors in agentic workflows</li>
                            </ul>
                        </li>
                        <li>Contrasts with "context engineering", which optimizes inputs for best performance</li>
                    </ul>
                </div>
            </div>
            
            <!-- HaystackCraft Benchmark Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">assessment</i>
                    HaystackCraft Benchmark
                </h2>
                <div class="section-content">
                    <ul class="bullet-list">
                        <li>Built on the full English Wikipedia hyperlink network</li>
                        <li>Features multi-hop questions</li>
                        <li>Extends traditional NIAH evaluations in two ways:
                            <ul class="bullet-list">
                                <li>Heterogeneous Retrieval-Dependent Haystacks</li>
                                <li>Dynamic, LLM-Dependent Agentic Context Engineering</li>
                            </ul>
                        </li>
                    </ul>
                </div>
            </div>
            
            <!-- Heterogeneous Retrieval Strategies Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">compare_arrows</i>
                    Heterogeneous Retrieval Strategies
                </h2>
                <div class="section-content">
                    <div class="two-column">
                        <div class="column">
                            <p>Evaluates how different retrieval strategies affect:</p>
                            <ul class="bullet-list">
                                <li>Distractor composition</li>
                                <li>Haystack ordering</li>
                                <li>LLM performance</li>
                            </ul>
                            <p>Strategies compared:</p>
                            <ul class="bullet-list">
                                <li>Sparse Retrieval (BM25)</li>
                                <li>Dense Retrieval (Qwen3-Embedding-0.6B)</li>
                                <li>Hybrid Retrieval (BM25 + Qwen3-Embedding-0.6B)</li>
                                <li>Graph-Based Reranking (Personalized PageRank - PPR)</li>
                            </ul>
                        </div>
                        <div class="column">
                            <div class="image-container">
                                <img src="https://sfile.chatglm.cn/moeSlide/image/9f/9f0f5ca8.jpg" alt="Comparison of retrieval strategies" width="300">
                                <p class="image-caption">Comparison of different retrieval methods</p>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            
            <!-- Agentic Context Engineering Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">psychology</i>
                    Agentic Context Engineering
                </h2>
                <div class="section-content">
                    <div class="two-column">
                        <div class="column">
                            <p>Extends NIAH to dynamic, LLM-dependent settings</p>
                            <p>Simulates agentic operations where models:</p>
                            <ul class="bullet-list">
                                <li>Refine queries</li>
                                <li>Reflect on past reasoning</li>
                                <li>Decide when to stop</li>
                            </ul>
                            <p>Two dynamic settings:</p>
                            <ul class="bullet-list">
                                <li>Enforced Multi-Round</li>
                                <li>Variable-Round</li>
                            </ul>
                        </div>
                        <div class="column">
                            <div class="image-container">
                                <img src="https://sfile.chatglm.cn/moeSlide/image/47/47288779.jpg" alt="Agentic workflow visualization" width="300">
                                <p class="image-caption">Agentic workflow with cascading errors</p>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            
            <!-- Key Findings Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">insights</i>
                    Key Findings
                </h2>
                <div class="section-content">
                    <div class="finding-card">
                        <p>Dense retrievers introduce more challenging distractors than sparse ones</p>
                    </div>
                    <div class="finding-card">
                        <p>Graph-based reranking with PPR significantly improves retrieval effectiveness</p>
                    </div>
                    <div class="finding-card">
                        <p>Document ordering effects are model-dependent</p>
                    </div>
                    <div class="finding-card">
                        <p>Even advanced models (Gemini 2.5 Pro, GPT-5) suffer from cascading self-distraction</p>
                    </div>
                    <div class="finding-card">
                        <p>Models are more robust to noisy long contexts ("width") than to noisy reasoning iterations ("depth")</p>
                    </div>
                    <div class="finding-card">
                        <p>Most models struggle with appropriate early stopping in variable-round settings</p>
                    </div>
                </div>
            </div>
            
            <!-- Conclusion Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">flag</i>
                    Conclusion
                </h2>
                <div class="section-content">
                    <ul class="bullet-list">
                        <li>Robust agentic long-context reasoning remains an unsolved challenge</li>
                        <li>HaystackCraft serves as a valuable testbed for future progress</li>
                    </ul>
                    <a href="https://github.com/Graph-COM/HaystackCraft" class="code-link" target="_blank">
                        <i class="material-icons">code</i>
                        Code available at GitHub
                    </a>
                </div>
            </div>
            
            <div class="footer">
                © 2025 Haystack Engineering Research Team
            </div>
        </div>
    </div>
</body>
</html>                    
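
The poster lists graph-based reranking with Personalized PageRank (PPR) among the compared strategies but gives no implementation detail. As a rough illustration only, the Python sketch below reranks retrieved candidates by PPR over a hyperlink graph. The function names, the restart probability `alpha`, the use of retrieval scores as seed weights, and the toy graph are all assumptions made for this example, not the HaystackCraft implementation.

```python
def personalized_pagerank(graph, seeds, alpha=0.15, iters=50):
    """Power-iteration PPR (illustrative sketch, not the paper's code).

    graph: dict mapping node -> list of out-neighbors (hyperlinks).
    seeds: dict mapping node -> positive weight; normalized into the
           restart distribution. Here the weights are assumed to be
           retrieval scores.
    alpha: restart probability (assumed value).
    """
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs} | set(seeds)
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # Every node first receives its share of the restart mass.
        nxt = {n: alpha * restart[n] for n in nodes}
        for u in nodes:
            out = graph.get(u, [])
            if out:
                # Spread the remaining mass evenly over outgoing links.
                share = (1 - alpha) * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Dangling node: send its mass back to the restart distribution.
                for n in nodes:
                    nxt[n] += (1 - alpha) * rank[u] * restart[n]
        rank = nxt
    return rank


def rerank_with_ppr(candidates, graph, retrieval_scores, alpha=0.15):
    """Reorder retrieved candidates by their PPR score, highest first."""
    seeds = {doc: retrieval_scores[doc] for doc in candidates}
    ppr = personalized_pagerank(graph, seeds, alpha)
    return sorted(candidates, key=lambda doc: ppr[doc], reverse=True)


# Toy example: page C has the lowest retrieval score, but both A and B link to it.
toy_graph = {"A": ["C"], "B": ["C"], "C": []}
toy_scores = {"A": 1.0, "B": 0.9, "C": 0.5}
print(rerank_with_ppr(["A", "B", "C"], toy_graph, toy_scores))
```

In this toy graph the page that the other candidates link to moves to the front, which is the intuition behind using hyperlink structure to reorder a haystack. How the benchmark seeds and parameterizes PPR is not stated in the poster.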

Replies

1 reply

✨步子哥 (steper) #1

12-11 08:38

                                        <!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation</title>
    <link href="https://fonts.googleapis.com/css2?family=Noto+Sans+SC:wght@300;400;500;700&family=Noto+Serif+SC:wght@400;700&display=swap" rel="stylesheet">
    <link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet">
    <style>
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }
        
        body {
            font-family: 'Noto Sans SC', sans-serif;
            background-color: #f0f4f8;
            color: #333;
            line-height: 1.6;
        }
        
        .poster-container {
            width: 720px;
            min-height: 960px;
            margin: 0 auto;
            background: linear-gradient(135deg, #e6f0ff 0%, #f5f9ff 100%);
            padding: 40px 30px;
            position: relative;
            overflow: hidden;
        }
        
        .poster-container::before {
            content: "";
            position: absolute;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            background-image: 
                radial-gradient(circle at 10% 20%, rgba(100, 149, 237, 0.1) 0%, transparent 20%),
                radial-gradient(circle at 90% 80%, rgba(65, 105, 225, 0.1) 0%, transparent 20%),
                linear-gradient(45deg, rgba(100, 149, 237, 0.05) 0%, transparent 70%);
            z-index: 0;
        }
        
        .grid-texture {
            position: absolute;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            background-image: 
                linear-gradient(rgba(255, 255, 255, 0.1) 1px, transparent 1px),
                linear-gradient(90deg, rgba(255, 255, 255, 0.1) 1px, transparent 1px);
            background-size: 20px 20px;
            z-index: 0;
        }
        
        .content {
            position: relative;
            z-index: 1;
        }
        
        .header {
            text-align: center;
            margin-bottom: 30px;
            padding-bottom: 20px;
            border-bottom: 2px solid #4169e1;
        }
        
        .title {
            font-family: 'Noto Serif SC', serif;
            font-size: 36px;
            font-weight: 700;
            color: #1a3a8f;
            margin-bottom: 15px;
            line-height: 1.2;
        }
        
        .authors {
            font-size: 16px;
            color: #4169e1;
            margin-bottom: 10px;
        }
        
        .affiliations {
            font-size: 14px;
            color: #555;
            margin-bottom: 10px;
        }
        
        .publication {
            font-size: 14px;
            color: #666;
            font-style: italic;
        }
        
        .section {
            background-color: rgba(255, 255, 255, 0.85);
            border-radius: 12px;
            padding: 20px;
            margin-bottom: 25px;
            box-shadow: 0 4px 12px rgba(0, 0, 0, 0.05);
            backdrop-filter: blur(5px);
        }
        
        .section-title {
            font-family: 'Noto Serif SC', serif;
            font-size: 24px;
            font-weight: 700;
            color: #1a3a8f;
            margin-bottom: 15px;
            display: flex;
            align-items: center;
        }
        
        .section-title .material-icons {
            margin-right: 10px;
            color: #4169e1;
        }
        
        .section-content {
            font-size: 16px;
        }
        
        .highlight {
            background-color: rgba(65, 105, 225, 0.1);
            padding: 2px 5px;
            border-radius: 4px;
            font-weight: 500;
        }
        
        .bullet-list {
            padding-left: 25px;
            margin-bottom: 15px;
        }
        
        .bullet-list li {
            margin-bottom: 8px;
        }
        
        .two-column {
            display: flex;
            gap: 20px;
            margin-bottom: 15px;
        }
        
        .column {
            flex: 1;
        }
        
        .image-container {
            text-align: center;
            margin: 15px 0;
        }
        
        .image-container img {
            max-width: 100%;
            border-radius: 8px;
            box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
        }
        
        .image-caption {
            font-size: 14px;
            color: #666;
            margin-top: 8px;
            text-align: center;
        }
        
        .finding-card {
            background-color: rgba(65, 105, 225, 0.05);
            border-left: 4px solid #4169e1;
            padding: 12px 15px;
            margin-bottom: 12px;
            border-radius: 0 8px 8px 0;
        }
        
        .code-link {
            display: inline-flex;
            align-items: center;
            background-color: #4169e1;
            color: white;
            padding: 8px 15px;
            border-radius: 20px;
            text-decoration: none;
            font-weight: 500;
            margin-top: 10px;
        }
        
        .code-link .material-icons {
            margin-right: 5px;
            font-size: 18px;
        }
        
        .footer {
            text-align: center;
            margin-top: 30px;
            color: #666;
            font-size: 14px;
        }
    </style>
</head>
<body>
    <div class="poster-container">
        <div class="grid-texture"></div>
        <div class="content">
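            <!-- Header Section -->
            <div class="header">
                <h1 class="title">Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation</h1>
                <p class="authors">Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li</p>
                <p class="affiliations">Georgia Institute of Technology, Meta AI, University of Illinois Urbana-Champaign, National University of Singapore</p>
                <p class="publication">arXiv:2510.07414 (October 2025)</p>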
            </div>
            
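            <!-- Introduction Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">lightbulb</i>
                    Introduction
                </h2>
                <div class="section-content">
                    <div class="two-column">
                        <div class="column">
                            <ul class="bullet-list">
                                <li>Modern long-context large language models (LLMs) perform well on synthetic <span class="highlight">"needle-in-a-haystack" (NIAH)</span> benchmarks</li>
                                <li>These tests overlook how biased retrieval and agentic workflows produce noisy contexts</li>
                                <li>A more realistic evaluation is needed, one that captures real-world factors</li>
                            </ul>
                        </div>
                        <div class="column">
                            <div class="image-container">
                                <img src="https://sfile.chatglm.cn/moeSlide/image/75/752c3cec.jpg" alt="Needle-in-a-haystack concept illustration" width="300">
                                <p class="image-caption">Traditional needle-in-a-haystack evaluation</p>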
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            
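            <!-- Haystack Engineering Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">architecture</i>
                    Haystack Engineering
                </h2>
                <div class="section-content">
                    <ul class="bullet-list">
                        <li>A new paradigm for constructing realistic, noisy long contexts</li>
                        <li>Captures key real-world factors:
                            <ul class="bullet-list">
                                <li>Distraction from heterogeneous, biased retrievers</li>
                                <li>Cascading errors in agentic workflows</li>
                            </ul>
                        </li>
                        <li>Contrasts with "context engineering", which optimizes inputs for best performance</li>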
                    </ul>
                </div>
            </div>
            
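            <!-- HaystackCraft Benchmark Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">assessment</i>
                    HaystackCraft Benchmark
                </h2>
                <div class="section-content">
                    <ul class="bullet-list">
                        <li>Built on the full English Wikipedia hyperlink network</li>
                        <li>Features multi-hop questions</li>
                        <li>Extends traditional NIAH evaluations in two ways:
                            <ul class="bullet-list">
                                <li>Heterogeneous, retrieval-dependent haystacks</li>
                                <li>Dynamic, LLM-dependent agentic context engineering</li>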
                            </ul>
                        </li>
                    </ul>
                </div>
            </div>
            
            <!-- Heterogeneous Retrieval Strategies Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">compare_arrows</i>
                    Heterogeneous Retrieval Strategies
                </h2>
                <div class="section-content">
                    <div class="two-column">
                        <div class="column">
                            <p>Evaluates how different retrieval strategies affect:</p>
                            <ul class="bullet-list">
                                <li>Distractor composition</li>
                                <li>Haystack ordering</li>
                                <li>LLM performance</li>
                            </ul>
                            <p>Strategies compared:</p>
                            <ul class="bullet-list">
                                <li>Sparse retrieval (BM25)</li>
                                <li>Dense retrieval (Qwen3-Embedding-0.6B)</li>
                                <li>Hybrid retrieval (BM25 + Qwen3-Embedding-0.6B)</li>
                                <li>Graph-based reranking (Personalized PageRank, PPR)</li>
                            </ul>
                        </div>
                        <div class="column">
                            <div class="image-container">
                                <img src="https://sfile.chatglm.cn/moeSlide/image/9f/9f0f5ca8.jpg" alt="Comparison of retrieval strategies" width="300">
                                <p class="image-caption">Comparison of different retrieval methods</p>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            
            <!-- Agentic Context Engineering Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">psychology</i>
                    Agentic Context Engineering
                </h2>
                <div class="section-content">
                    <div class="two-column">
                        <div class="column">
                            <p>Extends NIAH to dynamic, LLM-dependent settings</p>
                            <p>Simulates agentic operations where models:</p>
                            <ul class="bullet-list">
                                <li>Refine queries</li>
                                <li>Reflect on past reasoning</li>
                                <li>Decide when to stop</li>
                            </ul>
                            <p>Two dynamic settings:</p>
                            <ul class="bullet-list">
                                <li>Enforced multi-round</li>
                                <li>Variable-round</li>
                            </ul>
                        </div>
                        <div class="column">
                            <div class="image-container">
                                <img src="https://sfile.chatglm.cn/moeSlide/image/47/47288779.jpg" alt="Agentic workflow visualization" width="300">
                                <p class="image-caption">Agentic workflow with cascading errors</p>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            
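            <!-- Key Findings Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">insights</i>
                    Key Findings
                </h2>
                <div class="section-content">
                    <div class="finding-card">
                        <p>Dense retrievers introduce more challenging distractors than sparse ones</p>
                    </div>
                    <div class="finding-card">
                        <p>Graph-based reranking with PPR significantly improves retrieval effectiveness</p>
                    </div>
                    <div class="finding-card">
                        <p>Document ordering effects are model-dependent</p>
                    </div>
                    <div class="finding-card">
                        <p>Even advanced models (Gemini 2.5 Pro, GPT-5) suffer from cascading self-distraction</p>
                    </div>
                    <div class="finding-card">
                        <p>Models are more robust to noisy long contexts ("width") than to noisy reasoning iterations ("depth")</p>
                    </div>
                    <div class="finding-card">
                        <p>Most models struggle with appropriate early stopping in variable-round settings</p>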
                    </div>
                </div>
            </div>
            
            <!-- Conclusion Section -->
            <div class="section">
                <h2 class="section-title">
                    <i class="material-icons">flag</i>
                    Conclusion
                </h2>
                <div class="section-content">
                    <ul class="bullet-list">
                        <li>Robust agentic long-context reasoning remains an unsolved challenge</li>
                        <li>HaystackCraft serves as a valuable testbed for future progress</li>
                    </ul>
                    <a href="https://github.com/Graph-COM/HaystackCraft" class="code-link" target="_blank">
                        <i class="material-icons">code</i>
                        Code available at GitHub
                    </a>
                </div>
            </div>
            
            <div class="footer">
                © 2025 Haystack Engineering Research Team
            </div>
        </div>
    </div>
</body>
</html>                                    
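
The agentic part of the poster names two dynamic settings, enforced multi-round and variable-round, in which the model refines queries, reflects on past reasoning, and decides when to stop. The Python sketch below is one hedged reading of that loop, not the benchmark's actual code: `run_agentic_rounds`, the `retrieve` and `llm_step` callables, the dictionary keys they exchange, and the toy stand-ins are all hypothetical names introduced for illustration.

```python
def run_agentic_rounds(question, retrieve, llm_step, max_rounds=3,
                       enforce_all_rounds=True):
    """Run a retrieve-then-reason loop (illustrative sketch).

    retrieve(query) -> list of documents (the haystack for this round).
    llm_step(question, haystack, notes) -> dict with the assumed keys
        "reasoning", "answer", "refined_query", and "stop".
    enforce_all_rounds=True  mimics an enforced multi-round setting.
    enforce_all_rounds=False mimics a variable-round setting in which
        the model's own "stop" signal ends the loop early.
    """
    query = question
    notes = []          # past reasoning, fed back into later rounds
    answer = None
    rounds_used = 0
    for round_idx in range(1, max_rounds + 1):
        rounds_used = round_idx
        haystack = retrieve(query)
        step = llm_step(question, haystack, notes)
        # Earlier reasoning stays in context, so a mistake made here can
        # distract every later round (the cascading-error risk).
        notes.append(step["reasoning"])
        answer = step["answer"]
        query = step["refined_query"]
        if not enforce_all_rounds and step["stop"]:
            break
    return answer, rounds_used, notes


# Toy stand-ins for a retriever and a model, for demonstration only.
def toy_retrieve(query):
    return [f"doc about {query}"]


def toy_llm_step(question, haystack, notes):
    confident = len(notes) >= 1  # becomes "confident" from round 2 onward
    return {
        "reasoning": f"round {len(notes) + 1}: read {haystack[0]}",
        "answer": "Paris" if confident else None,
        "refined_query": question + " capital",
        "stop": confident,
    }


print(run_agentic_rounds("France", toy_retrieve, toy_llm_step,
                         max_rounds=4, enforce_all_rounds=False))
```

Keeping `notes` in the loop state makes the "depth" dimension from the key findings concrete: each extra round adds model-generated text to the context, which is where self-distraction can accumulate.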
