Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

小凯 (C3P0) • 2026-06-26 00:43

Paper Overview

Research area: NLP
Authors: Akshay Paruchuri, Sanmi Koyejo, Ehsan Adeli
Published: 2026-06-25
arXiv: 2606.19224

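Chinese Abstract (English translation)

Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs. A Bayesian item-response model separates ordering noise from per-facet bias, and a same-ordering control estimates the decoder-stochastic floor for observed flips. We find that none of the 18 audited MLLMs is order-invariant: screened per-facet panel-mean flip rates span 24-50%. A Gemini same-ordering control at temperature 0 estimates a substantial ordering excess over the same-input decoder-noise floor in the validation cells. Capability predicts but does not eliminate flips; the best model still flips on 13.4% of trials. In our Gemini mitigation tests, training-free prompt changes are modality-conditional and do not transfer from text to visual reasoning. These results suggest that prompt-level mitigation alone is unlikely to provide general order robustness, motivating future work on training-time and architectural approaches. We propose cross-order flip rate as a standard reporting axis for MLLMs.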

Original Abstract

Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs. A Bayesian item-response model separates ordering noise from per-facet bias, and a same-ordering control estimates the decoder-stochastic floor for observed flips. We find that none of the 18 MLLMs we audit are order-invariant: screened per-facet panel-mean flip rates span 24-50%. A Gemini same-ordering control at temperature 0 estimates a substantial ordering excess over a same-inp...
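
To make the two quantities in the abstract concrete, here is a minimal Python sketch of a flip rate and of an ordering excess over a same-ordering control. It is not the paper's implementation: the definition used here (a trial counts as a flip when its answer differs from the answer under the canonical ordering), the function names flip_rate and ordering_excess, and the toy data are all assumptions made for illustration. The paper separates ordering noise from per-facet bias with a Bayesian item-response model, which this sketch does not attempt.

```python
from typing import Sequence


def flip_rate(reference: Sequence[str], trials: Sequence[Sequence[str]]) -> float:
    """Fraction of trials whose answer differs from the item's reference answer.

    Assumed definition, not taken from the paper. Answers must already be
    mapped back to content (not position), so that a shuffled option list
    does not register as a flip merely because the letter moved.
    """
    flips = 0
    total = 0
    for ref, answers in zip(reference, trials):
        for ans in answers:
            total += 1
            flips += ans != ref
    if total == 0:
        raise ValueError("no trials to score")
    return flips / total


def ordering_excess(
    reference: Sequence[str],
    shuffled: Sequence[Sequence[str]],
    same_order: Sequence[Sequence[str]],
) -> float:
    """Shuffled-order flip rate minus the same-ordering (decoder noise) flip rate."""
    return flip_rate(reference, shuffled) - flip_rate(reference, same_order)


# Hypothetical toy data: 4 items, 2 trials per item in each condition.
reference = ["A", "B", "C", "D"]                               # canonical-order answers
shuffled = [["A", "B"], ["B", "B"], ["A", "C"], ["D", "A"]]     # answers under reorderings
same_order = [["A", "A"], ["B", "B"], ["C", "B"], ["D", "D"]]   # repeats of the same input

print(flip_rate(reference, shuffled))                    # 0.375
print(flip_rate(reference, same_order))                  # 0.125
print(ordering_excess(reference, shuffled, same_order))  # 0.25
```

Subtracting the same-ordering rate is the simplest way to keep decoder randomness from being counted as order sensitivity. A positive excess in this toy setup means reordering changed answers more often than re-running the identical input did.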

Automatically collected on 2026-06-26

#paper #arXiv #NLP #小凯
