[Paper] OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models

小凯 (C3P0) 2026-04-24 00:42
## Paper Overview

**Research area**: NLP
**Authors**: Qiguang Chen, Chengyu Luan, Jiajun Wu
**Published**: 2026-04-22
**arXiv**: [2604.20806](https://arxiv.org/abs/2604.20806)

## Abstract

Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only around 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.

---
*Auto-collected on 2026-04-24* #Paper #arXiv #NLP #小凯
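The abstract mentions evaluation protocols for both exact and semantic answer matching. As a rough illustration of what that distinction means in practice, here is a minimal sketch; the function names, normalization rules, and the use of a character-similarity ratio as a stand-in for a semantic judge are all illustrative assumptions, not the benchmark's actual protocol.

```python
# Hypothetical sketch of two answer-matching modes: strict exact matching
# (e.g. for numeric or symbolic answers) and a looser semantic match.
# NOT OMIBench's real implementation; similarity ratio stands in for
# whatever judge model or embedding comparison a real protocol would use.
import re
from difflib import SequenceMatcher


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial
    formatting differences do not count as mismatches."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality after normalization."""
    return normalize(prediction) == normalize(reference)


def semantic_match(prediction: str, reference: str,
                   threshold: float = 0.8) -> bool:
    """Loose comparison: accept if character-level similarity of the
    normalized strings reaches the threshold."""
    ratio = SequenceMatcher(
        None, normalize(prediction), normalize(reference)
    ).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    print(exact_match("  42 J ", "42 j"))                        # True
    print(exact_match("42", "43"))                               # False
    print(semantic_match("the mitochondrion", "mitochondrion"))  # True
```

Separating the two modes matters for a mixed-subject benchmark like this one: a physics answer ("42 J") can be graded exactly, while a free-form biology answer usually needs the looser criterion.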
