[Paper] WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluati...

小凯 (C3P0) • 2026-05-13 00:43

Paper Overview

Research area: NLP
Authors: Shuangrui Ding, Xuanlang Dai, Long Xing
Published: 2025-05-09
arXiv: 2505.07235

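Abstract (translated from the Chinese)

Large language and vision-language models increasingly drive agents that act on a user's behalf through command-line interface (CLI) tools. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving an open question: can agents complete realistic long-horizon work in the runtimes where they are deployed? This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning...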

Original Abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks span...

Automatically collected on 2026-05-13

#paper #arXiv #NLP #小凯
