[Paper] PerceptionRubrics: Calibrating Multimodal Evaluation to Human Per...

Paper Overview

Field: CV | Authors: Yana Wei, Hongbo Peng, Yanlin Lai | Published: 2026-06-26 | arXiv: 2606.28322

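Abstract (translated from Chinese)

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These rubrics are derived from golden captions built through a novel Circular Peer-Review consensus pipeline, then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averaging, failure on mandatory visual facts triggers a sharp binary penalty. Extensive evaluation yields three key insights: (1) Reliability gap: models often verify individual elements correctly but fail under strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-source versus closed-source stratification: contrary to the trend in reasoning, there is a persistent 8% perception gap between the open-source and proprietary frontiers; (3) Human-aligned rigor: the gated metric aligns with human judgment more closely than traditional benchmarks, confirming that strict perceptual fidelity is a prerequisite for reliable generation.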

Original Abstract

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: ...
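To make the contrast between Gated Scoring and a linear average concrete, here is a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption for illustration: the `RubricResult` schema, the rule that any failed Must-Right rubric zeroes the score, and the use of the Easy-Wrong pass rate once the gate is cleared are all hypothetical, not the paper's definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    """Outcome of checking one rubric against a model output (hypothetical schema)."""
    kind: str      # "must_right" or "easy_wrong"
    passed: bool


def linear_score(results: List[RubricResult]) -> float:
    """Plain average over all rubrics: the baseline that gated scoring is contrasted with."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def gated_score(results: List[RubricResult]) -> float:
    """Assumed gate: a single failed Must-Right rubric yields a score of 0.

    If every Must-Right rubric passes, the score is the pass rate over the
    Easy-Wrong rubrics (again an assumption, not the paper's formula).
    """
    must = [r for r in results if r.kind == "must_right"]
    easy = [r for r in results if r.kind == "easy_wrong"]
    if not all(r.passed for r in must):
        return 0.0  # binary penalty: a mandatory fact was missed
    if not easy:
        return 1.0  # gate cleared and nothing else to check
    return sum(r.passed for r in easy) / len(easy)


if __name__ == "__main__":
    # One mandatory fact missed, everything else correct.
    results = [
        RubricResult("must_right", True),
        RubricResult("must_right", False),
        RubricResult("easy_wrong", True),
        RubricResult("easy_wrong", True),
    ]
    print("linear:", linear_score(results))
    print("gated: ", gated_score(results))
```

Under these assumptions, an output that gets three of four rubrics right still looks strong under a linear average but is zeroed by the gate, which is the kind of divergence the abstract describes as a sharp binary penalty.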

--- *Automatically collected on 2026-06-30*

#Paper #arXiv #CV #小凯
