[Paper] Covering Human Action Space for Computer Use: Data Synthesis and Bench...

Paper Overview

Research area: CV
Authors: Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo
Published: 2026-05-12
arXiv: 2605.12501

Abstract (translated from the Chinese summary)

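Computer-use agents (CUAs) automate on-screen operations, as shown by GPT-5.4 and Claude. However, their reliability on complex, low-frequency interactions remains poor, which limits user trust. Our analysis of failure cases from advanced models indicates a long-tail pattern in GUI operations: a small number of complex and diverse interactions cause a disproportionate share of task failures. We hypothesize that this stems mainly from the scarcity of data for complex interactions. To address this, we propose a new benchmark, CUActSpot, which evaluates models' ability to handle complex interactions across five modalities (GUI, text, table, canvas, and natural image) and covers a variety of actions such as click, drag, and draw, with coverage well beyond earlier click-centric benchmarks. We also design a renderer-based synthetic data pipeline: it automatically generates scenes for each modality, records screenshots and element coordinates, and has an LLM generate matching instructions and action trajectories. After training on this corpus, our Phi-Ground-Any-4B performs best among open-source models with fewer than 32B parameters.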
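The renderer-based pipeline in the summary above can be illustrated with a toy sketch. This is not the authors' code: the `Cell` class, `render_table`, `synthesize_drag_sample`, the pixel sizes, and the output dictionary layout are all hypothetical names and choices made for illustration. A grid layout function stands in for the renderer, and a fixed string template stands in for the LLM that the paper uses to write instructions. The sketch only shows the general idea that a scene with known element coordinates can yield an instruction paired with a ground-truth drag action.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Cell:
    """One table cell with its pixel bounding box (hypothetical schema)."""
    row: int
    col: int
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels

    def center(self) -> tuple[int, int]:
        return (self.x + self.w // 2, self.y + self.h // 2)


def render_table(rows: int, cols: int, cell_w: int = 80, cell_h: int = 24) -> list[Cell]:
    """Stand-in for a renderer: lay out a grid and return every cell's box."""
    return [
        Cell(r, c, c * cell_w, r * cell_h, cell_w, cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


def synthesize_drag_sample(cells: list[Cell], rng: random.Random) -> dict:
    """Pick two distinct cells and emit an instruction plus a drag action."""
    start, end = rng.sample(cells, 2)
    # Assumption: a template replaces the LLM-written instruction here.
    instruction = (
        f"Drag from the cell at row {start.row}, column {start.col} "
        f"to the cell at row {end.row}, column {end.col}."
    )
    return {
        "modality": "table",
        "instruction": instruction,
        "action": {"type": "drag", "start": start.center(), "end": end.center()},
    }


cells = render_table(rows=3, cols=4)
sample = synthesize_drag_sample(cells, random.Random(0))
print(sample["instruction"])
print(sample["action"])
```

Because the coordinates come from the layout itself, the action label needs no manual annotation; a real pipeline would additionally save a screenshot of the rendered scene alongside each sample.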

Original Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction t...

--- *Automatically collected on 2026-05-14*

#Paper #arXiv #CV #小凯
