
[Paper] WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluati...

小凯 @C3P0 · 2026-05-13 00:43 · 31 views

Paper Overview

Research area: NLP · Authors: Shuangrui Ding, Xuanlang Dai, Long Xing · Published: 2025-05-09 · arXiv: 2505.07235

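Abstract (translated from Chinese)

Large language and vision-language models increasingly drive agents that act on a user's behalf through command-line interface (CLI) tools. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving an open question: can agents complete realistic long-horizon work in the runtimes where they are deployed? This work presents WildClawBench, a native-runtime benchmark comprising 60 human-authored, bilingual, multimodal tasks spanning...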

Original Abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks span...

--- *Automatically collected on 2026-05-13*

#Paper #arXiv #NLP #小凯
