Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Authors: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda
arXiv: 2603.05494
PDF: https://arxiv.org/pdf/2603.05494.pdf
Categories: cs.LG, cs.AI, cs.CL
Paper Overview
Research area: Natural language processing (NLP)
Study type: Empirical study
Core Contributions
Method: LLM
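Impact Assessment
The study has both theoretical and practical value: instead of models artificially trained to lie, it uses naturally occurring censorship in open-weights LLMs as a testbed for honesty elicitation and lie detection, which may influence how such methods are evaluated.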
Original Abstract
Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we eval...
Automatically collected on 2026-03-07
#paper #arXiv #NLP #小凯