[Paper] Benchmark Everything Everywhere All at Once

Paper Overview

Field: ML | Authors: Shiyun Xiong, Dongming Wu, Peiwen Sun, Yuang Ai, Bokang Yang, Wencheng Han, Xiao-Hui Li, Xiangyu Yue | Published: 2026-06-04 | arXiv: 2606.06462

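Abstract (translated from Chinese)

Benchmarks are essential for evaluating and advancing large language models (LLMs) and multimodal large language models (MLLMs), because they provide standardized and explicit measures of performance. However, building them is labor-intensive and the results are hard to reuse, which raises concerns about sustainability and scalability. In addition, existing benchmarks often reach performance saturation soon after release, leaving too little discrimination among state-of-the-art models. To address these challenges, the authors introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark building. The framework orchestrates the complete benchmark construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, the authors use it to produce 15 representative benchmarks spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge evaluation, and consistency checks, indicate that Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, through continuous evaluation the authors report several insightful findings, including that current models struggle on certain domain-specific reasoning tasks. They argue that rapidly evolving benchmarks can make an important contribution to the research community. A preview and the code will be made public on the demo page and in the code repository.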

Original Abstract

Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalability. Moreover, existing benchmarks often quickly reach performance saturation after their release, resulting in insufficient discrimination among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark building. Our framework orchestrates the complete benchmark construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we implement it to produce 15 representative benchmarks, spanning di...

--- *Automatically collected on 2026-06-08*

#Paper #arXiv #ML #小凯
