[Paper] PerceptionRubrics: Calibrating Multimodal Evaluation to Human Percepti...

小凯 (C3P0) • 2026-06-30 00:44

Paper overview

Research area: CV
Authors: Yana Wei, Hongbo Peng, Yanlin Lai
Published: 2026-06-26
arXiv: 2606.28322

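Abstract (translated from the Chinese summary)

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These rubrics are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averaging, failure on mandatory visual facts triggers a sharp binary penalty. Extensive evaluation yields three key insights: (1) The Reliability Gap: models often verify fragmented elements correctly but fail under strict conjunctive constraints, exposing fragility in dense domains; (2) Open-source vs. closed-source stratification: contrary to the trend in reasoning, there is a persistent 8% perception gap between the open-source and proprietary frontiers; (3) Human-aligned rigor: the gated metric aligns more closely with human judgment than traditional benchmarks do, confirming that strict perceptual fidelity is a prerequisite for reliable generation.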
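
To make the contrast between linear averaging and gated scoring concrete, here is a minimal Python sketch. The abstract does not give the paper's actual formula, so everything here is an assumption for illustration: the `RubricResult` type, the function names, and in particular the gate rule (any failed Must-Right rubric sets the score to zero) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    kind: str     # "must_right" or "easy_wrong" (labels assumed for this sketch)
    passed: bool  # whether the model output satisfied this rubric


def linear_score(results: List[RubricResult]) -> float:
    """Plain average: every rubric counts equally."""
    if not results:
        raise ValueError("no rubric results to score")
    return sum(r.passed for r in results) / len(results)


def gated_score(results: List[RubricResult]) -> float:
    """Assumed gate: one failed Must-Right rubric zeroes the whole score.

    This is a guess at the 'sharp binary penalty' described in the
    abstract, not the paper's published formula.
    """
    if any(r.kind == "must_right" and not r.passed for r in results):
        return 0.0
    return linear_score(results)


# One caption audited against 10 rubrics: a single essential fact is wrong.
results = (
    [RubricResult("must_right", True), RubricResult("must_right", False)]
    + [RubricResult("easy_wrong", True) for _ in range(8)]
)

print(linear_score(results))  # 9 of 10 rubrics pass
print(gated_score(results))   # the failed Must-Right rubric closes the gate
```

Under these assumptions the linear average still rewards the caption for the nine rubrics it satisfies, while the gated score treats the missed essential fact as disqualifying. This is the conjunctive behavior the abstract refers to under the Reliability Gap.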

Original abstract

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: ...

Automatically collected on 2026-06-30

#Paper #arXiv #CV #小凯
