[Paper] When No Benchmark Exists: Validating Comparative LLM Safety Scoring Wi...

Paper Overview

Field: NLP
Authors: Sushant Gautam, Finn Schwall, Annika Willoch Olstad
Published: 2025-05-09
arXiv: 2505.03478

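Abstract (translated from the Chinese)

Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values of 0.89 to 1.00, target identity is the dominant variance component (η² ≈ 0.52), and severity profiles stabilize after ten reruns. Applying the same chain to Petri shows that it accommodates both tools. The substantive differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 illustrates the evidence produced in practice: which model is safer depends on the scenario category and the risk measure. Scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must therefore be reported together rather than collapsed into a single ranking.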
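As a rough illustration of two quantities named in the abstract, the sketch below computes an AUROC for a safe-versus-abliterated contrast and a one-way η² for target identity. It is a simplified reading, not SimpleAudit's code: the function names, the toy scores, and the convention that a higher severity score means less safe are all assumptions, and the paper's η² presumably comes from a decomposition that also includes auditor and judge factors.

```python
from itertools import product
from statistics import mean


def auroc(safe_scores, abliterated_scores):
    """Probability that a random abliterated run scores higher (more severe)
    than a random safe run; ties count as half. Equivalent to the
    Mann-Whitney U statistic divided by the number of pairs."""
    wins = 0.0
    for a, s in product(abliterated_scores, safe_scores):
        if a > s:
            wins += 1.0
        elif a == s:
            wins += 0.5
    return wins / (len(abliterated_scores) * len(safe_scores))


def eta_squared(groups):
    """One-way eta squared: between-group sum of squares over total sum of
    squares. `groups` maps a (hypothetical) target name to its scores."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand = mean(all_scores)
    ss_total = sum((x - grand) ** 2 for x in all_scores)
    ss_between = sum(
        len(scores) * (mean(scores) - grand) ** 2 for scores in groups.values()
    )
    return ss_between / ss_total


# Made-up severity scores for illustration only; NOT data from the paper.
safe = [1, 2, 1, 3]
abliterated = [3, 4, 5, 4]

print("AUROC:", auroc(safe, abliterated))
print("eta^2:", eta_squared({"safe": safe, "abliterated": abliterated}))
```

An AUROC near 1.0 would mean the instrument reliably ranks abliterated targets as more severe than safe ones, and a large η² for target identity would mean most score variance tracks the model under test rather than the auditor or judge.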

Original Abstract

Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and valid...

--- *Automatically collected on 2026-05-09*

#Paper #arXiv #NLP #小凯
