Paper Overview
Research area: LLM
Authors: Hankyeol Kim, Pilsung Kang
Published: 2026-05-28
arXiv: 2605.27752
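Summary
LLM confidence calibration is usually evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but comparing them depends on several protocol choices that are rarely examined. The paper shows that calibration conclusions are highly sensitive to how questions are asked, how answers are elicited, how confidences are scored, and how instances are aggregated. Across eight models and two tasks, varying these protocol dimensions changes which signal appears better calibrated and can even reverse the direction of the observed gap. For example, switching from temperature-scaled probabilities to raw token probabilities can flip which model is judged superior. The study introduces a systematic decomposition of protocol sensitivity and identifies the most influential dimensions. The results suggest that current calibration evaluations may be reporting protocol artifacts rather than intrinsic model properties, and that fair comparison requires either standardized protocols or sensitivity-aware reporting.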
Original Abstract
LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on multiple protocol choices that are rarely examined. We show that calibration conclusions are highly sensitive to how questions are asked, how answers are elicited, how confidences are scored, and how instances are aggregated. Across eight models and two tasks, varying these protocol dimensions changes which signal appears better calibrated and even reverses the direction of observed gaps. For example, switching from temperature-scaled to raw token probabilities can flip which model is considered superior. We introduce a systematic decomposition of protocol sensitivity and identify the most consequential dimensions. Our results suggest that current calibration...
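To make the sensitivity claim concrete, here is a minimal sketch of how one scoring choice can reverse a calibration comparison. It uses expected calibration error (ECE), a common calibration metric; the abstract does not say which metrics the paper uses, so treat ECE, the bin counts, and the names `token_prob` and `verbalized` as assumptions for illustration. The numbers are synthetic and are not results from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one instance")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Synthetic toy data (hypothetical, not from the paper): four answers,
# each with two confidence signals attached to the same correctness label.
correct = [True, True, False, False]
token_prob = [0.1, 0.1, 0.9, 0.9]   # hypothetical token-probability scores
verbalized = [0.8, 0.8, 0.4, 0.4]   # hypothetical verbalized confidences

# Protocol A: a single bin, i.e. compare overall accuracy to mean confidence.
coarse_token = expected_calibration_error(token_prob, correct, n_bins=1)
coarse_verbal = expected_calibration_error(verbalized, correct, n_bins=1)

# Protocol B: ten equal-width bins.
fine_token = expected_calibration_error(token_prob, correct, n_bins=10)
fine_verbal = expected_calibration_error(verbalized, correct, n_bins=10)

print(f"1 bin:   token={coarse_token:.2f}  verbalized={coarse_verbal:.2f}")
print(f"10 bins: token={fine_token:.2f}  verbalized={fine_verbal:.2f}")
```

With a single bin the token-probability signal looks better calibrated, because its mean confidence happens to match overall accuracy. With ten bins the verbalized signal looks better, because the token-probability signal is confidently wrong on individual instances. Reporting the binning scheme alongside the score is one simple form of the sensitivity-aware reporting the summary calls for.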
Automatically collected on 2026-05-29
#paper #arXiv #LLM #小凯