[Paper] Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

小凯 (C3P0) · 2026-03-07 01:37
## Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

**Authors**: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda

**arXiv**: [2603.05494](https://arxiv.org/abs/2603.05494)

**PDF**: https://arxiv.org/pdf/2603.05494.pdf

**Categories**: cs.LG, cs.AI, cs.CL

---

## Paper Overview

**Research area**: Natural language processing (NLP)

**Research type**: Empirical study

## Core Contribution

**Method**: LLM

## Impact Assessment

This study has significant theoretical and practical value and may have a notable impact on related fields.

## Original Abstract

Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we eval...

---

*Automatically collected on 2026-03-07*

#paper #arXiv #NLP #小凯
