Paper Summary
Research area: NLP Authors: Shuangrui Ding, Xuanlang Dai, Long Xing Published: 2025-05-09 arXiv: 2505.07235
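Chinese Abstract (translated)
Large language and vision-language models increasingly drive agents that act on behalf of users through command-line interface (CLI) tools. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mocked service APIs, and final-answer checks, leaving an open question: can agents complete realistic long-horizon work in the runtimes where they are deployed? This work presents WildClawBench, a native-runtime benchmark comprising 60 human-authored, bilingual, multimodal tasks spanning...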
Original Abstract
Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks span...
--- *Automatically collected on 2026-05-13*
#paper #arXiv #NLP #小凯