Paper Summary
Research area: NLP
Authors: Ziyang Huang, Yi Cao, Ali K. Shargh, Jing Luo, Ruidong Mei, Mohd Zaki, Zhan Liu, Wyatt Bunstine, William Jurayj, Somdatta Goswami, Tyrel McQueen, Michael Shields, Jaafar El-Awady, Paulette Clancy, Benjamin Van Durme, Nicholas Andrews, William Walden, Daniel Khashabi
Published: 2026-05-01
arXiv: 2605.00803
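Summary
Large language models are increasingly deployed as autonomous coding agents and perform strongly on software engineering benchmarks, but it remains unclear whether this success transfers to computational science workflows, where tasks require not only strong coding ability but also navigating complex domain-specific procedures and interpreting results in the context of scientific claims.
This paper presents AutoMat, a benchmark that evaluates the ability of LLM-based agents to reproduce claims from computational materials science. AutoMat poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim. Working closely with domain experts, the authors curate claims from real materials science papers to test whether coding agents can recover and execute the end-to-end workflows needed to support (or refute) those claims.
The authors evaluate multiple representative coding agent settings and foundation models. Current LLM-based agents obtain low overall success rates on AutoMat, with the best-performing setting reaching only 54.1%. Error analysis shows that agents perform worst when workflows must be reconstructed from the paper text alone, and that they fail primarily because of incomplete procedures, methodological deviations, and execution fragility. These findings position AutoMat as both a benchmark for computational scientific reproducibility and a tool for diagnosing the limitations of agentic systems in AI-for-science.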
Original Abstract
Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims. To address this question, we present AutoMat, a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. AutoMat poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim. By working closely with subject matter experts, we curate a set of claims from real materials science papers to test whether coding agents can recover and execute the end-to-end workflow needed to support (or undermine) such claims. We then evaluate multiple representative coding agent settings across several foundation models. Our results show that current LLM-based agents obtain low overall success rates on AutoMat, with the best-performing setting achieving a success rate of only 54.1%. Error analysis further reveals that agents perform worst when workflows must be reconstructed from paper text alone and that they fail primarily due to incomplete procedures, methodological deviations, and execution fragility. Taken together, these findings position AutoMat as both a benchmark for computational scientific reproducibility and a tool for diagnosing the current limitations of agentic systems in AI-for-science settings.
--- *Automatically collected on 2026-05-05*
#Paper #arXiv #NLP #小凯