## Paper Summary
**Research area**: NLP
**Authors**: Mikko Lempinen, Joni Kemppainen, Niklas Raesalmi
**Published**: 2026-04-22
**arXiv**: [2604.20833](https://arxiv.org/abs/2604.20833)
## Abstract (translated)
As artificial intelligence systems are increasingly deployed in critical domains, their security vulnerabilities pose growing risks of exploitation and consequential system failures. Yet systematic approaches to AI security evaluation remain underdeveloped. This paper introduces AVISE (AI Vulnerability Identification and Security Evaluation), a modular open-source framework for identifying vulnerabilities in AI systems and models and evaluating their security. As a demonstration of the framework, we extend the theory-of-mind-based multi-turn Red Queen attack into an Adversarial Language Model (ALM) augmented attack and develop an automated Security Evaluation Test (SET) for discovering jailbreak vulnerabilities in language models. The SET comprises 25 test cases and an Evaluation Language Model (ELM) that judges whether each test case successfully jailbreaks the target model, achieving 92% accuracy, an F1 score of 0.91, and a Matthews correlation coefficient of 0.83. Using the SET, we evaluated nine recently released language models of varying sizes and found all of them susceptible, to varying degrees, to the augmented Red Queen attack. AVISE offers researchers and industry practitioners an extensible foundation for developing and deploying automated security evaluation tests, a concrete step toward more rigorous, reproducible AI security evaluation.
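The abstract reports the ELM judge's quality as accuracy, F1, and Matthews correlation coefficient against some ground-truth labeling of jailbreak outcomes. As a minimal sketch of how those three metrics relate for a binary jailbreak judge, the snippet below computes them from a confusion matrix; the label arrays are hypothetical illustration data, not results from the paper.

```python
import math

def evaluate_judge(y_true, y_pred):
    """Compute accuracy, F1, and Matthews correlation coefficient (MCC)
    for a binary judge (1 = jailbreak succeeded, 0 = it did not)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC balances all four confusion-matrix cells, which is why it is
    # a stricter summary than accuracy when classes are imbalanced.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f1, mcc

# Hypothetical labels: human ground truth vs. judge verdicts
truth  = [1, 1, 1, 0, 0, 0, 1, 0]
judged = [1, 1, 0, 0, 0, 1, 1, 0]
acc, f1, mcc = evaluate_judge(truth, judged)
```

On this toy data the judge has 3 true positives, 3 true negatives, and one error of each kind, giving accuracy and F1 of 0.75 and an MCC of 0.5.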
## Original Abstract
As artificial intelligence (AI) systems are increasingly deployed across critical domains, their security vulnerabilities pose growing risks of high-profile exploits and consequential system failures. Yet systematic approaches to evaluating AI security remain underdeveloped. In this paper, we introduce AVISE (AI Vulnerability Identification and Security Evaluation), a modular open-source framework for identifying vulnerabilities in and evaluating the security of AI systems and models. As a demonstration of the framework, we extend the theory-of-mind-based multi-turn Red Queen attack into an Adversarial Language Model (ALM) augmented attack and develop an automated Security Evaluation Test (SET) for discovering jailbreak vulnerabilities in language models. The SET comprises 25 test cases an...
---
*Automatically collected on 2026-04-24*
#Paper #arXiv #NLP #小凯