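[Paper] Flaws in the LLM Automation Narrative
Paper Overview
Field: ML Authors: George Perrett, Javae Elliott, Jennifer Hill, Marc Scott Published: 2026-06-09 arXiv: 2606.11166
Summary
LLMs are increasingly described as performing at the level of human experts on knowledge economy tasks, but these claims rest mainly on benchmarks that measure average performance across standardized datasets. Many of these benchmarks share two limitations: they often measure content directly included in the training data, and they do not assess the reliability of LLM performance or the magnitude of its errors. Using a novel benchmark that requires writing code to complete a data analysis task, the paper compares a frontier LLM with submissions from human experts and explicitly measures response variance and error magnitude. The study finds that human experts perform better on average across multiple metrics and show less variability in performance.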
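To make the distinction between average performance, reliability, and error magnitude concrete, here is a minimal sketch in Python. It is not the paper's method or data: the function name `summarize_errors`, the metric names, and all numbers are hypothetical and chosen only for illustration. It shows how two sets of answers can have the same average error while differing sharply in spread and worst-case error.

```python
from statistics import mean, pstdev


def summarize_errors(estimates, truth):
    """Summarize repeated answers to one task against a known true value.

    Hypothetical helper for illustration; not taken from the paper.
    """
    errors = [abs(e - truth) for e in estimates]
    return {
        "mean_abs_error": mean(errors),          # average performance
        "std_of_estimates": pstdev(estimates),   # variability of the answers
        "max_abs_error": max(errors),            # magnitude of the worst error
    }


# Purely illustrative numbers, not results from the paper.
truth = 10.0
steady = [9.0, 11.0, 9.0, 11.0]      # always slightly off
erratic = [10.0, 10.0, 10.0, 14.0]   # usually exact, occasionally far off

steady_summary = summarize_errors(steady, truth)
erratic_summary = summarize_errors(erratic, truth)

print(steady_summary)
print(erratic_summary)
```

In this toy case both sets have the same mean absolute error, so a benchmark that reports only the average would rate them equally, while the spread and the maximum error tell them apart.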
Original Abstract
Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claims are primarily based on how LLMs perform on benchmarking tasks that measure average performance across standardized datasets. Primary limitations of many benchmarking tasks are that they often measure performance based on content directly included in LLM training data, and they frequently do not assess the reliability of LLM performance or the magnitude of LLM errors. However, in high stakes contexts, these qualities are critically important. Through a novel LLM benchmarking task that requires writing computer code to complete a data analysis task, we compare the performance of a frontier LLM against submissions from human experts and explicitly measur...
--- *Automatically collected on 2026-06-11*
#Paper #arXiv #ML #小凯