[Paper] Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers

Paper Overview

Research area: ML
Authors: Nikita Kezins, Urbas Ekka, Pascal Berrang
Published: 2025-05-09
arXiv: 2505.07229

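Abstract (translated from Chinese)

Guardrail classifiers protect production language models from harmful behavior, but although test results look promising, they provide no formal guarantees. Providing formal guarantees for such models is difficult because 'harmful behavior' has no natural specification in a discrete input space: the standard epsilon-ball properties used in other domains carry no semantic meaning. We close this gap by shifting verification from the discrete input space to the classifier's pre-activations...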

Original Abstract

Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because 'harmful behavior' has no natural specification in a discrete input space: the standard epsilon-ball properties used in other domains do not carry semantic meaning. We close this gap by shifting verification from the discrete input space to the classifier's pre-activatio...

--- *Automatically collected on 2026-05-13*

#Paper #arXiv #ML #小凯
