[Paper] Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Po...

Paper Overview

Research area: NLP | Authors: Qianhao Yuan, Jie Lou, Xing Yu | Published: 2026-05-19 | arXiv: 2505.14302

Abstract (translated from the Chinese)

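Multimodal large language models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on the relevant evidence rather than from insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes the token-level divergence between the teacher's and the student's next-token distributions on those rollouts. This lets the model internalize the benefits of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve performance comparable to or better than larger open-source, closed-source, and 'Thinking-with-Images' agentic models.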
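
The token-level objective described in the abstract can be illustrated with a small, framework-free sketch. This is not the authors' implementation. The abstract does not say which divergence is used, so the sketch assumes KL(student || teacher), averaged over the rollout positions. The function names and the toy logits are hypothetical. In a real setup, both sets of logits would come from the same MLLM scoring the same student-generated rollout, once with the crop as input (teacher) and once with the full image (student).

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert one position's logits into a next-token distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def token_level_distillation_loss(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
) -> float:
    """Mean per-token KL(student || teacher) over one rollout.

    Assumption: reverse KL is used here for illustration only;
    the source only says "token-level divergence".
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must score the same rollout positions")
    if not student_logits:
        raise ValueError("rollout must contain at least one token")
    per_token = [
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy rollout of two tokens over a three-word vocabulary (made-up numbers).
student = [[2.0, 0.5, 0.1], [0.2, 0.2, 0.2]]  # full-image-conditioned logits
teacher = [[2.0, 0.5, 0.1], [3.0, 0.1, 0.1]]  # crop-conditioned logits
loss = token_level_distillation_loss(student, teacher)
```

In this toy case the first position contributes zero loss because the two distributions match, while the second position is penalized because the student is uncertain where the crop-conditioned teacher is confident. In training, gradients would flow only through the student's logits, with the teacher's treated as fixed targets.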

Original Abstract

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty to focus on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-con...

--- *Automatically collected on 2026-05-20*

#Paper #arXiv #NLP #小凯
