[Paper] DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Sourc...

Paper Overview

Research area: ML Authors: Sixiong Xie, Zhuofan Shi, Haiyang Shen Published: 2025-05-20 arXiv: 2505.15982

Abstract (translated from the Chinese summary)

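Deep research, in which an agent searches the open web, gathers evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score highly on existing benchmarks, which makes it hard to tell their capabilities apart using current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder for the current frontier. The difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results by family. Each reference answer comes with a source provenance record at four disclosure levels and, where available, cross-source checks, which makes scores easier to audit against the underlying evidence. We evaluate nine frontier models on DeepWeb-Bench and report three findings: (1) retrieval is not the bottleneck: retrieval failures account for only 12-14% of errors, while derivation and calibration failures account for more than 70%; (2) strong and weak models fail in qualitatively different ways: the errors of strong models are dominated by incomplete derivation, whereas those of weak models are dominated by hallucinated precision; (3) models show genuine specialization across domains, with cross-model agreement of only ρ=0.61 and case-by-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, scoring rubrics, and evaluation code.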

Original Abstract

Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family....
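To make "results sliced by family" concrete, here is a minimal sketch of per-family aggregation. It is not the benchmark's released evaluation code: the record layout `(model, family, score)`, the function name `scores_by_family`, and the sample values are all assumptions made for illustration. Only the four family names come from the abstract.

```python
from collections import defaultdict

# The four capability families named in the abstract.
FAMILIES = ("Retrieval", "Derivation", "Reasoning", "Calibration")


def scores_by_family(records):
    """Return the mean score for each (model, family) pair.

    `records` is an iterable of (model, family, score) tuples. This layout
    is hypothetical; the real benchmark data format may differ.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (model, family) -> [sum, count]
    for model, family, score in records:
        if family not in FAMILIES:
            raise ValueError(f"unknown capability family: {family}")
        bucket = totals[(model, family)]
        bucket[0] += score
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up rubric scores for a hypothetical model, for illustration only.
sample = [
    ("model_a", "Retrieval", 1.0),
    ("model_a", "Retrieval", 0.0),
    ("model_a", "Derivation", 0.5),
]
per_family = scores_by_family(sample)
print(per_family[("model_a", "Retrieval")])  # mean of 1.0 and 0.0
```

Keeping the family as an explicit key, instead of averaging everything into one number, is what lets a report separate, say, retrieval failures from derivation failures.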

--- *Automatically collected on 2026-05-22*

#paper #arXiv #ML #小凯
