[Paper] PerceptionRubrics: Calibrating Multimodal Evaluation to Human Per...
Paper Overview
Research area: CV | Authors: Yana Wei, Hongbo Peng, Yanlin Lai | Published: 2026-06-26 | arXiv: 2606.28322
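Abstract (translated from Chinese)
We propose PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These rubrics are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and are then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averaging, failure on a mandatory visual fact triggers a sharp binary penalty. Extensive evaluation yields three key insights: (1) Reliability gap: models often verify fragmented elements correctly but fail under strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-closed stratification: contrary to the trend in reasoning, there is a persistent 8% perception gap between the open-source and proprietary frontiers; (3) Human-aligned rigor: the gated metric aligns more closely with human judgment than traditional benchmarks do, confirming that strict perceptual fidelity is a prerequisite for reliable generation.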
Original Abstract
We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: ...
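The contrast between Gated Scoring and a linear average can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the `RubricResult` type, the `"must_right"` / `"easy_wrong"` labels, and the rule that a single failed Must-Right rubric zeroes the score while a passed gate falls back to the plain pass rate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    """Outcome of checking one rubric against a model output (hypothetical type)."""
    kind: str     # "must_right" or "easy_wrong" (labels assumed for this sketch)
    passed: bool


def linear_score(results: List[RubricResult]) -> float:
    """Plain average: every rubric contributes equally."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def gated_score(results: List[RubricResult]) -> float:
    """Assumed gate: any failed Must-Right rubric zeroes the whole score.

    If the gate is passed, this sketch falls back to the plain pass rate.
    The paper's actual penalty may be defined differently.
    """
    if any(r.kind == "must_right" and not r.passed for r in results):
        return 0.0
    return linear_score(results)


# One essential fact is wrong, all fine-grained details are right.
missed_fact = [
    RubricResult("must_right", True),
    RubricResult("must_right", True),
    RubricResult("must_right", False),
    RubricResult("easy_wrong", True),
    RubricResult("easy_wrong", True),
]

# All essential facts are right, one fine-grained detail is wrong.
missed_detail = [
    RubricResult("must_right", True),
    RubricResult("must_right", True),
    RubricResult("must_right", True),
    RubricResult("easy_wrong", True),
    RubricResult("easy_wrong", False),
]

for name, results in [("missed_fact", missed_fact), ("missed_detail", missed_detail)]:
    print(name, linear_score(results), gated_score(results))
```

Both samples pass four of five rubrics, so a linear average cannot tell them apart. Under the assumed gate, only the sample that misses an essential fact drops to zero, which is the kind of conjunctive failure the Reliability Gap finding describes.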
--- *Automatically collected on 2026-06-30*
#Paper #arXiv #CV #小凯