
Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

步子哥 (steper) · December 11, 2025, 08:27

Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li

Georgia Institute of Technology, Meta AI, University of Illinois Urbana-Champaign, National University of Singapore

arXiv:2510.07414 (October 2025)

Introduction

  • Modern long-context LLMs perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks
  • These tests overlook how noisy contexts arise from biased retrieval and agentic workflows
  • Need for more realistic evaluation that captures real-world factors
Figure: traditional needle-in-a-haystack evaluation

Haystack Engineering

  • New paradigm to construct realistic noisy long contexts
  • Captures key real-world factors:
    • Distraction from heterogeneous biased retrievers
    • Cascading errors in agentic workflows
  • Contrasts with "context engineering," which optimizes inputs for best performance

HaystackCraft Benchmark

  • Built on full English Wikipedia hyperlink network
  • Features multi-hop questions
  • Extends traditional NIAH evaluations in two ways:
    • Heterogeneous Retrieval-Dependent Haystacks
    • Dynamic, LLM-Dependent Agentic Context Engineering
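The basic NIAH construction underlying the benchmark can be sketched minimally: a "needle" passage is embedded at a controlled depth inside a long context assembled from distractor documents. The function below is an illustrative sketch, not the paper's actual code.

```python
def build_haystack(needle, distractors, needle_depth=0.5, sep="\n\n"):
    """Assemble a long context with `needle` inserted at a relative depth.

    needle_depth=0.0 places the needle first, 1.0 places it last.
    """
    pos = round(needle_depth * len(distractors))
    docs = distractors[:pos] + [needle] + distractors[pos:]
    return sep.join(docs)

haystack = build_haystack(
    "The needle fact lives here.",
    [f"Distractor document {i}." for i in range(4)],
    needle_depth=0.5,
)
print(haystack.count("\n\n"))  # → 4 (four separators join five documents)
```

HaystackCraft's departure from this template is that the distractors are not random filler: they are drawn from retrieval over the Wikipedia hyperlink network, so their composition and ordering depend on the retriever.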

Heterogeneous Retrieval Strategies

Evaluates how different retrieval strategies affect:

  • Distractor composition
  • Haystack ordering
  • LLM performance

Strategies compared:

  • Sparse Retrieval (BM25)
  • Dense Retrieval (Qwen3-Embedding-0.6B)
  • Hybrid Retrieval (BM25 + Qwen3-Embedding-0.6B)
  • Graph-Based Reranking (Personalized PageRank - PPR)
Figure: comparison of different retrieval methods
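One common way to hybridize sparse and dense retrieval is to merge the two ranked lists with reciprocal rank fusion; the exact fusion used in the paper is not detailed here, so RRF is an illustrative choice:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc3", "doc1", "doc7"]   # sparse (BM25) order
dense_ranking = ["doc1", "doc7", "doc3"]  # dense-embedding order
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# → ['doc1', 'doc3', 'doc7']
```

The constant `k` damps the influence of any single list's top ranks; documents ranked well by both retrievers rise to the top of the fused list.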

Agentic Context Engineering

Extends NIAH to dynamic, LLM-dependent settings

Simulates agentic operations where models:

  • Refine queries
  • Reflect on past reasoning
  • Decide when to stop

Two dynamic settings:

  • Enforced multi-round: the model must complete a fixed number of refinement rounds
  • Variable-round: the model decides for itself when to stop
Figure: agentic workflow with cascading errors
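The agentic settings can be sketched as a loop in which the model refines its query, accumulates retrieved context, and, in the variable-round setting, decides when to stop. `model_step` below is a hypothetical stand-in for an LLM call, not an interface from the paper.

```python
def agentic_loop(question, model_step, max_rounds=5):
    """Run retrieve-reflect rounds until the model signals it can answer.

    `model_step(question, context)` is assumed to return a dict with a
    refined `query`, newly retrieved `passages`, and a `done` flag.
    """
    context = []
    for _ in range(max_rounds):        # enforced upper bound on rounds
        step = model_step(question, context)
        context.extend(step["passages"])  # noise can accumulate here
        if step["done"]:                  # variable-round early stopping
            break
    return context

# Toy model: retrieves one passage per round, stops once context is non-empty.
def toy_step(question, context):
    return {"query": question,
            "passages": [f"passage-{len(context)}"],
            "done": len(context) >= 1}

print(len(agentic_loop("who wrote X?", toy_step)))  # → 2
```

The loop makes the cascading-error risk concrete: every round appends to `context`, so a bad query in an early round pollutes everything the model sees afterwards.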

Key Findings

Dense retrievers introduce more challenging distractors than sparse ones

Graph-based reranking with PPR significantly improves retrieval effectiveness
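Personalized PageRank over a hyperlink graph can be sketched with a short power iteration; the graph, damping factor, and seed choice below are illustrative, not the benchmark's configuration.

```python
def personalized_pagerank(adj, seeds, alpha=0.85, iters=50):
    """Power iteration for PPR on an adjacency dict {node: [neighbors]}.

    Restart mass is concentrated on `seeds` (e.g. query-relevant pages),
    so scores favor nodes near the seeds in the hyperlink graph.
    """
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n, neighbors in adj.items():
            if neighbors:
                share = alpha * rank[n] / len(neighbors)
                for m in neighbors:
                    nxt[m] += share
            else:  # dangling node: return its mass to the seeds
                for s in seeds:
                    nxt[s] += alpha * rank[n] / len(seeds)
        rank = nxt
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = personalized_pagerank(graph, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Unlike global PageRank, the restart vector is biased toward the seed pages, which is what lets the reranker prefer documents that are well connected to the query's neighborhood.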

Document ordering effects are model-dependent

Even advanced models (Gemini 2.5 Pro, GPT-5) suffer from cascading self-distraction

Models are more robust to noisy long contexts ("width") than to noisy reasoning iterations ("depth")

Most models struggle with appropriate early stopping in variable-round settings

Conclusion

  • Robust agentic long-context reasoning remains an unsolved challenge
  • HaystackCraft serves as a valuable testbed for future progress
Code available on GitHub
