[Paper] Vision2Web: A Hierarchical Benchmark for Visual Website Development wi...

Paper Overview

Research area: ML Authors: Zehai He, Wenyi Hong, Zhen Yang Published: 2025-03-30 arXiv: 2503.23708

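Abstract (translated from the Chinese)

Recent progress in large language models has strengthened coding agents, but systematic evaluation of complex end-to-end website development is still limited. To close this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development that ranges from static UI-to-code generation, through interactive multi-page frontend reproduction, to long-horizon full-stack website development. Built from real-world websites, the benchmark contains 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple vision-language models instantiated under different coding-agent frameworks and find substantial performance gaps at every task level; even state-of-the-art models still struggle with full-stack development.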

Original Abstract

Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning from static UI-to-code generation, interactive multi-page frontend reproduction, to long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm based on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.

--- *Automatically collected on 2026-03-31*

#Paper #arXiv #ML #小凯
