[Paper] From Web to Pixels: Bringing Agentic Search into Visual Perception

小凯 (C3P0) • 2026-05-14 00:49

## Paper Overview

**Field**: CV
**Authors**: Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue
**Published**: 2026-05-12
**arXiv**: [2605.12497](https://arxiv.org/abs/2605.12497)

## Abstract (translated from Chinese)

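Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the evidence needed to identify a target is already present in the image or in frozen model knowledge. We study a more practical but harder open-world setting in which a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, and precise box/mask annotations, organized into three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves the hidden target identity and binds it to a box, a mask, or a grounded answer. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, and that failures stem mainly from evidence acquisition, identity resolution, and visual instance binding.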
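As a reading aid, here is a minimal Python sketch of the three stages the abstract names as failure sources: evidence acquisition, identity resolution, and visual instance binding. It is not the authors' Pixel-Searcher implementation. The names `search_to_pixels`, `Stage`, `Outcome`, and the three callables are hypothetical placeholders, and the `(x1, y1, x2, y2)` box convention is an assumption. The sketch only shows how a failure could be attributed to one stage.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple

# Assumed box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


class Stage(Enum):
    """The three failure sources listed in the abstract."""
    EVIDENCE = "evidence acquisition"
    IDENTITY = "identity resolution"
    BINDING = "visual instance binding"


@dataclass
class Outcome:
    box: Optional[Box]          # predicted box, or None if a stage failed
    failed_at: Optional[Stage]  # first stage that returned nothing, if any


def search_to_pixels(
    query: str,
    search: Callable[[str], Sequence[str]],                 # hypothetical web search
    resolve: Callable[[str, Sequence[str]], Optional[str]],  # hypothetical identity resolver
    ground: Callable[[str], Optional[Box]],                 # hypothetical visual grounder
) -> Outcome:
    """Run the stages in order and report the first one that fails."""
    evidence = search(query)
    if not evidence:
        return Outcome(None, Stage.EVIDENCE)

    identity = resolve(query, evidence)
    if identity is None:
        return Outcome(None, Stage.IDENTITY)

    box = ground(identity)
    if box is None:
        return Outcome(None, Stage.BINDING)

    return Outcome(box, None)


if __name__ == "__main__":
    # Toy stubs only; a real system would call a search engine and a vision model.
    demo = search_to_pixels(
        "the person described in a recent news report",
        search=lambda q: ["(toy snippet) the report names person A"],
        resolve=lambda q, ev: "person A" if ev else None,
        ground=lambda name: (12.0, 30.0, 96.0, 210.0) if name == "person A" else None,
    )
    print(demo)
```

Passing the stages in as callables keeps the sketch independent of any particular search backend or grounding model, neither of which the abstract specifies.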

## Original Abstract

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, ...

---
*Automatically collected on 2026-05-14*

#Paper #arXiv #CV #小凯
