## Paper Summary
**Research Area**: NLP
**Authors**: Haolei Xu, Haiwen Hong, Hongxing Li
**Published**: 2025-04-10
**arXiv**: [2504.07859](https://arxiv.org/abs/2504.07859)
## Abstract (Translated from Chinese)
Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon, termed "Seeing but Not Thinking": models accurately perceive image content yet fail in the subsequent reasoning, while correctly solving the same problems when they are presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then find that visual experts and domain experts exhibit layer-wise separation: image inputs induce significant routing divergence from text inputs in the middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to sufficiently activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that strengthens domain expert activation. Experiments on three multimodal MoE models and six benchmarks show consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.
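The layer-wise routing-divergence analysis described above can be illustrated with a minimal sketch. Assuming we can record the router's softmax distribution over experts at each layer when the same problem is given as an image versus as pure text, a per-layer KL divergence between the averaged distributions captures how strongly visual input shifts routing. All names, shapes, and the toy data below are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch (not the authors' code): per-layer routing divergence
# between image-conditioned and text-only versions of the same problem.
import numpy as np

def routing_divergence(routing_img, routing_txt, eps=1e-9):
    """KL divergence between mean expert-routing distributions, per layer.

    routing_img, routing_txt: arrays of shape [num_layers, num_tokens, num_experts]
    holding the router softmax probabilities recorded for each input variant.
    """
    divergences = []
    for layer_img, layer_txt in zip(routing_img, routing_txt):
        p = layer_img.mean(axis=0) + eps   # average routing dist. for image tokens
        q = layer_txt.mean(axis=0) + eps   # average routing dist. for text tokens
        p, q = p / p.sum(), q / q.sum()
        divergences.append(float(np.sum(p * np.log(p / q))))
    return divergences  # peaks in middle layers would match the paper's observation

# Toy usage with random routing traces (2 layers, 16 tokens, 8 experts).
rng = np.random.default_rng(0)
img = rng.dirichlet(np.ones(8), size=(2, 16))
txt = rng.dirichlet(np.ones(8), size=(2, 16))
print(routing_divergence(img, txt))
```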
## Original Abstract
Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs...
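For the routing-guided intervention described in the abstract, one plausible form is sketched below: adding a bias to the router logits of pre-identified domain experts before top-k selection, so that task-relevant reasoning experts are more likely to be activated on visual inputs. The function name, expert indices, and bias value are assumptions for illustration; the paper's exact procedure may differ.

```python
# Hypothetical sketch of a routing-guided intervention for a top-k MoE router:
# boost the logits of pre-identified "domain experts" before expert selection.
import torch
import torch.nn.functional as F

def biased_topk_routing(router_logits, domain_expert_ids, bias=1.0, top_k=2):
    """router_logits: [num_tokens, num_experts]; returns (weights, expert indices)."""
    boosted = router_logits.clone()
    boosted[:, domain_expert_ids] += bias      # nudge task-relevant experts upward
    topk_vals, topk_ids = boosted.topk(top_k, dim=-1)
    weights = F.softmax(topk_vals, dim=-1)     # renormalize over the chosen experts
    return weights, topk_ids

# Toy usage: 4 tokens, 8 experts, experts 2 and 5 treated as domain experts.
logits = torch.randn(4, 8)
weights, ids = biased_topk_routing(logits, domain_expert_ids=[2, 5])
print(ids)
```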
---
*Auto-collected on 2026-04-12*
#Paper #arXiv #NLP #小凯