
Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

步子哥 (steper) · December 11, 2025, 08:27

Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li

Georgia Institute of Technology, Meta AI, University of Illinois Urbana-Champaign, National University of Singapore

arXiv:2510.07414 (October 2025)

Introduction

  • Modern long-context LLMs perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks
  • These tests overlook how noisy contexts arise from biased retrieval and agentic workflows
  • Need for more realistic evaluation that captures real-world factors
Figure: traditional needle-in-a-haystack evaluation

Haystack Engineering

  • New paradigm to construct realistic noisy long contexts
  • Captures key real-world factors:
    • Distraction from heterogeneous biased retrievers
    • Cascading errors in agentic workflows
  • Contrasts with "context engineering," which optimizes inputs for best performance

HaystackCraft Benchmark

  • Built on full English Wikipedia hyperlink network
  • Features multi-hop questions
  • Extends traditional NIAH evaluations in two ways:
    • Heterogeneous Retrieval-Dependent Haystacks
    • Dynamic, LLM-Dependent Agentic Context Engineering
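The basic NIAH construction underlying the benchmark can be sketched minimally: a "needle" passage is embedded at a controlled depth inside a long context assembled from distractor documents. The function below is an illustrative sketch, not the paper's actual code.

```python
def build_haystack(needle, distractors, needle_depth=0.5, sep="\n\n"):
    """Assemble a long context with `needle` inserted at a relative depth.

    needle_depth=0.0 places the needle first, 1.0 places it last.
    """
    pos = round(needle_depth * len(distractors))
    docs = distractors[:pos] + [needle] + distractors[pos:]
    return sep.join(docs)

haystack = build_haystack(
    "The needle fact lives here.",
    [f"Distractor document {i}." for i in range(4)],
    needle_depth=0.5,
)
print(haystack.count("\n\n"))  # → 4 (four separators join five documents)
```

HaystackCraft's departure from this template is that the distractors are not random filler: they are drawn from retrieval over the Wikipedia hyperlink network, so their composition and ordering depend on the retriever.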

Heterogeneous Retrieval Strategies

Evaluates how different retrieval strategies affect:

  • Distractor composition
  • Haystack ordering
  • LLM performance

Strategies compared:

  • Sparse Retrieval (BM25)
  • Dense Retrieval (Qwen3-Embedding-0.6B)
  • Hybrid Retrieval (BM25 + Qwen3-Embedding-0.6B)
  • Graph-Based Reranking (Personalized PageRank - PPR)
Figure: comparison of different retrieval methods
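One common way to hybridize sparse and dense retrieval is to merge the two ranked lists with reciprocal rank fusion; the exact fusion used in the paper is not detailed here, so RRF is an illustrative choice:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc3", "doc1", "doc7"]   # sparse (BM25) order
dense_ranking = ["doc1", "doc7", "doc3"]  # dense-embedding order
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# → ['doc1', 'doc3', 'doc7']
```

The constant `k` damps the influence of any single list's top ranks; documents ranked well by both retrievers rise to the top of the fused list.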

Agentic Context Engineering

Extends NIAH to dynamic, LLM-dependent settings

Simulates agentic operations where models:

  • Refine queries
  • Reflect on past reasoning
  • Decide when to stop

Two dynamic settings:

  • Enforced multi-round: the model must complete a fixed number of refinement rounds
  • Variable-round: the model decides for itself when to stop
Figure: agentic workflow with cascading errors
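The agentic settings can be sketched as a loop in which the model refines its query, accumulates retrieved context, and, in the variable-round setting, decides when to stop. `model_step` below is a hypothetical stand-in for an LLM call, not an interface from the paper.

```python
def agentic_loop(question, model_step, max_rounds=5):
    """Run retrieve-reflect rounds until the model signals it can answer.

    `model_step(question, context)` is assumed to return a dict with a
    refined `query`, newly retrieved `passages`, and a `done` flag.
    """
    context = []
    for _ in range(max_rounds):        # enforced upper bound on rounds
        step = model_step(question, context)
        context.extend(step["passages"])  # noise can accumulate here
        if step["done"]:                  # variable-round early stopping
            break
    return context

# Toy model: retrieves one passage per round, stops once context is non-empty.
def toy_step(question, context):
    return {"query": question,
            "passages": [f"passage-{len(context)}"],
            "done": len(context) >= 1}

print(len(agentic_loop("who wrote X?", toy_step)))  # → 2
```

The loop makes the cascading-error risk concrete: every round appends to `context`, so a bad query in an early round pollutes everything the model sees afterwards.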

Key Findings

Dense retrievers introduce more challenging distractors than sparse ones

Graph-based reranking with PPR significantly improves retrieval effectiveness
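Personalized PageRank over a hyperlink graph can be sketched with a short power iteration; the graph, damping factor, and seed choice below are illustrative, not the benchmark's configuration.

```python
def personalized_pagerank(adj, seeds, alpha=0.85, iters=50):
    """Power iteration for PPR on an adjacency dict {node: [neighbors]}.

    Restart mass is concentrated on `seeds` (e.g. query-relevant pages),
    so scores favor nodes near the seeds in the hyperlink graph.
    """
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n, neighbors in adj.items():
            if neighbors:
                share = alpha * rank[n] / len(neighbors)
                for m in neighbors:
                    nxt[m] += share
            else:  # dangling node: return its mass to the seeds
                for s in seeds:
                    nxt[s] += alpha * rank[n] / len(seeds)
        rank = nxt
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = personalized_pagerank(graph, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Unlike global PageRank, the restart vector is biased toward the seed pages, which is what lets the reranker prefer documents that are well connected to the query's neighborhood.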

Document ordering effects are model-dependent

Even advanced models (Gemini 2.5 Pro, GPT-5) suffer from cascading self-distraction

Models are more robust to noisy long contexts ("width") than to noisy reasoning iterations ("depth")

Most models struggle with appropriate early stopping in variable-round settings

Conclusion

  • Robust agentic long-context reasoning remains an unsolved challenge
  • HaystackCraft serves as a valuable testbed for future progress
Code available on GitHub
