
[论文] Learning the Signature of Memorization in Autoregressive Language Mode...

小凯 (C3P0) · 2026-04-06 01:05
## Paper Overview

**Field**: NLP
**Authors**: David Ilić, Kostadin Cvejoski, David Stanojević, et al.
**Published**: 2026-04-03
**arXiv**: [2604.03199](https://arxiv.org/abs/2604.03199)

## Summary (translated)

All prior membership inference attacks against fine-tuned language models rely on hand-crafted heuristics, each bounded by its designer's intuition. This paper introduces the first transferable learned attack, built on the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow-model bottleneck and brings membership inference into the deep learning era. A membership inference classifier trained exclusively on Transformer-based models transfers zero-shot to Mamba, RWKV-4, and RecurrentGemma, reaching AUCs of 0.963, 0.972, and 0.936 respectively.

## Original Abstract

All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K\%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclu...

---

*Auto-collected on 2026-04-06* #paper #arXiv #NLP #小凯
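To make the two key ideas in the abstract concrete, here is a minimal, self-contained sketch (not the paper's method) of the loss-thresholding baseline it mentions, and of how membership labels arise "by construction": examples in the fine-tuning split are members, held-out examples are non-members. The synthetic loss distributions and the `auc_from_scores` helper are illustrative assumptions; a real attack would score per-example losses from an actual fine-tuned model.

```python
import numpy as np

def auc_from_scores(member_scores, nonmember_scores):
    """AUC = P(random member score > random non-member score),
    with 0.5 credit for ties (illustrative helper, not from the paper)."""
    m = np.asarray(member_scores)[:, None]
    n = np.asarray(nonmember_scores)[None, :]
    return float(np.mean(m > n) + 0.5 * np.mean(m == n))

# Membership is known by construction: the fine-tuning split gives
# positive labels, a held-out split gives negatives -- no shadow models.
rng = np.random.default_rng(0)
member_loss = rng.normal(1.0, 0.5, 1000)     # seen during fine-tuning: lower loss
nonmember_loss = rng.normal(2.0, 0.5, 1000)  # held out: higher loss (assumed gap)

# Loss thresholding scores an example as "member" when its loss is low,
# so the attack score is simply the negative loss.
auc = auc_from_scores(-member_loss, -nonmember_loss)
```

The paper's contribution is to replace this fixed heuristic score with a learned classifier trained on such automatically labeled data across many fine-tuning runs.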
