[Paper] WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark...

Paper Overview

Research area: CV Authors: Basel Shbita, Pengyuan Li, Anna Lisa Gentile Published: 2025-05-20 arXiv: 2505.15981

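Translated Abstract

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence to be answered correctly. WikiVQABench comprises a large number of Wikipedia images and curated multiple-choice questions, designed to benchmark knowledge-aware vision-language models (VLMs). Evaluations of 15 VLMs (256M to 90B parameters) show a wide performance range (24.7% to 75.6% accuracy), demonstrating that the benchmark effectively differentiates models' capabilities in knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.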
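
The abstract reports results as multiple-choice accuracy (24.7% to 75.6% across 15 VLMs). As a rough illustration of what that metric measures, here is a minimal sketch. The `Item` record layout, the function names, and the toy data are all hypothetical; the abstract does not describe the benchmark's actual schema, the number of answer options per question, or its evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical record layout, not the benchmark's real schema.
    question: str
    choices: list[str]
    answer_index: int  # index of the correct option in `choices`


def accuracy(items: list[Item], predictions: list[int]) -> float:
    """Fraction of items whose predicted option index equals the gold index."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return correct / len(items)


def chance_accuracy(items: list[Item]) -> float:
    """Expected accuracy of guessing uniformly at random among each item's options."""
    if not items:
        return 0.0
    return sum(1.0 / len(item.choices) for item in items) / len(items)


# Toy placeholder data, invented purely for illustration.
toy_items = [
    Item("placeholder question 1", ["A", "B", "C", "D"], 0),
    Item("placeholder question 2", ["A", "B", "C", "D"], 2),
    Item("placeholder question 3", ["A", "B", "C", "D"], 1),
    Item("placeholder question 4", ["A", "B", "C", "D"], 3),
]
toy_predictions = [0, 2, 3, 3]  # one wrong answer (item 3)

print(accuracy(toy_items, toy_predictions))
print(chance_accuracy(toy_items))
```

Comparing a model's accuracy with the random-guess baseline is one way to judge whether a low score reflects any knowledge at all; the baseline depends on how many options each question has, which the abstract does not state.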

Original Abstract

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual...

--- *Automatically collected on 2026-05-22*

#Paper #arXiv #CV #小凯
