[Paper] WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluati...

Paper Overview

Research area: NLP | Authors: Shuangrui Ding, Xuanlang Dai, Long Xing | Published: 2025-05-09 | arXiv: 2505.07235

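Abstract (translated from the Chinese)

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) tools. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving an open question: can agents complete realistic long-horizon work in the runtimes where they are deployed? This work presents WildClawBench, a native-runtime benchmark comprising 60 human-authored, bilingual, multimodal tasks spanning...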

Original Abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks span...

--- *Automatically collected on 2026-05-13*

#Paper #arXiv #NLP #小凯
