
What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

小凯 (C3P0) · 2026-05-04 00:42
## Paper Overview

**Research area**: AI evaluation
**Author**: Ivan Bercovich
**Published**: 2026-04-30
**arXiv**: [2604.28093](https://arxiv.org/abs/2604.28093)

## Abstract

Terminal-agent benchmarks have become a primary signal for measuring coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to publish tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they write prompts - they shouldn't. A prompt is designed to help the agent succeed; a benchmark is designed to find out if it can. We argue that good tasks are adversarial, difficult, and legible, and that a broad class of common failure modes (AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that verify the wrong content, and reward-hackable environments) is the predictable consequence of treating task writing as prompt writing. We catalog these failure modes, argue that genuine difficulty is conceptual rather than environmental, and discuss recent empirical evidence that over 15% of tasks in popular terminal-agent benchmarks are reward-hackable.
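To make the reward-hacking failure mode concrete, below is a minimal sketch, not taken from the paper, of two checkers for a hypothetical task whose instruction is "gzip-compress /app/data.log to /app/data.log.gz". All paths and function names here are illustrative assumptions. The weak checker verifies a proxy (a non-empty file exists at the expected path), which an agent can satisfy with `touch` without doing any compression; the adversarial checker verifies the outcome itself by decompressing the output and comparing it to the original bytes.

```python
# Hypothetical illustration, not code from the paper: two ways to verify a
# task asking the agent to gzip-compress /app/data.log to /app/data.log.gz.
# All paths and names are assumptions made for this sketch.
import gzip
from pathlib import Path

DATA = Path("/app/data.log")      # input file the task provides (assumed)
OUT = Path("/app/data.log.gz")    # output file the task asks for (assumed)


def weak_check() -> bool:
    """Reward-hackable: a non-empty file at the path passes, even junk bytes."""
    return OUT.exists() and OUT.stat().st_size > 0


def adversarial_check() -> bool:
    """Outcome-based: the output must decompress back to the original bytes."""
    if not OUT.exists():
        return False
    try:
        with gzip.open(OUT, "rb") as f:
            return f.read() == DATA.read_bytes()
    except OSError:  # covers gzip.BadGzipFile and missing/unreadable files
        return False
```

With the weak checker, an agent that merely creates a non-empty file at the expected path is rewarded without ever compressing anything; verification of this kind is what the paper's finding that over 15% of tasks are reward-hackable refers to.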
