[Paper] Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliabilit...

小凯 (C3P0) • 2026-05-29 00:48

Paper Overview

Research area: LLM
Authors: Zhenghan Song, Yunyi Li, Yulong Liu
Published: 2026-05-28
arXiv: 2605.27712

Abstract (translated from the Chinese summary)

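Long reasoning traces need reliability estimates before the final answer is known. This paper studies prefix-conditioned eventual-success estimation, P(y=1 | o_{1:t}), using prefix-safe observations. Sequential Bayesian Belief Tracking (SBBT) calibrates observation likelihoods and recursively updates a two-state belief, providing a common tracker for scalar scores, text and self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Experiments on generated open-weight traces from MATH-500, GSM8K, AIME 2025, and RIMO-N show that probability quality and ranking ability are separable: score-only SBBT often improves the Brier score, while AUROC gains require structure-aware evidence beyond strong prefix-safe baselines. In the strongest hard math setting, structure-aware observations reach +0.110 AUROC against standard prefix-safe baselines. Under a same-prefix classifier audit, MATH-500 text markers and RIMO-N self-verification signals remain positive. These findings support SBBT as a calibration-aware online inference framework and reveal an evidence regime: scalar scores mainly support probability quality, whereas structure-aware prefix signals support ranking only when strong prefix-safe baselines have not already absorbed the ranking evidence.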

Original Abstract

Long reasoning traces need reliability estimates before final answers are known. We study prefix-conditioned eventual-success estimation, P(y=1 | o_{1:t}), using prefix-safe observations. Sequential Bayesian Belief Tracking (SBBT) calibrates observation likelihoods and recursively updates a two-state belief, providing a common tracker for scalar scores, text and self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Across generated open-weight traces on MATH-500, GSM8K, AIME 2025, and RIMO-N, probability quality and ranking separate: score-only SBBT often improves Brier, while AUROC gains require structure-aware evidence beyond strong prefix-safe baselines. In the strongest hard math setting, structure-aware observations reach +0.110 AUROC against standard prefix-safe baselines. Under a same-prefix classifier audit, MATH-500 text markers and RI...
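
To make the "recursively updates a two-state belief" step concrete, here is a minimal Python sketch of a two-state Bayes update over eventual success (y=1) versus failure (y=0). It is an illustration under stated assumptions, not the paper's implementation: it assumes observations are conditionally independent given y, that the calibrated likelihoods p(o_t | y=1) and p(o_t | y=0) are already available, and the function names update_belief and track_belief are hypothetical.

```python
from typing import Iterable, List, Tuple


def update_belief(prior: float, lik_success: float, lik_failure: float) -> float:
    """One Bayes step for the belief P(y=1 | observations so far).

    prior        -- belief in eventual success before the new observation
    lik_success  -- assumed calibrated likelihood p(o_t | y=1)
    lik_failure  -- assumed calibrated likelihood p(o_t | y=0)
    """
    numerator = prior * lik_success
    evidence = numerator + (1.0 - prior) * lik_failure
    if evidence <= 0.0:
        # Observation has zero likelihood under both states; keep the belief.
        return prior
    return numerator / evidence


def track_belief(
    prior: float, likelihoods: Iterable[Tuple[float, float]]
) -> List[float]:
    """Run the update over a prefix of observations.

    Each element of `likelihoods` is (p(o_t | y=1), p(o_t | y=0)).
    Returns the belief after each observation, so every entry depends
    only on the prefix seen up to that point.
    """
    beliefs: List[float] = []
    belief = prior
    for lik_success, lik_failure in likelihoods:
        belief = update_belief(belief, lik_success, lik_failure)
        beliefs.append(belief)
    return beliefs
```

In this sketch an observation with equal likelihood under both states leaves the belief unchanged, while one that is more likely under success moves the belief toward 1. Because each entry uses only earlier observations, the output can be read as a running estimate of P(y=1 | o_{1:t}).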

Automatically collected on 2026-05-29

#Paper #arXiv #LLM #小凯

