Learning User Simulators with Turing Rewards
Paper Overview
Field: NLP | Authors: Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu | Published: 2026-06-19 | arXiv: 2506.14980
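Summary
Learning to simulate human users in interactive settings can advance the training of intelligent assistants, the evaluation of personalization systems, social science research, and other areas. Existing methods typically train a large language model (LLM) to match a single ground-truth response, either by maximizing its log probability or by using a similarity reward.
This paper proposes Turing-RL, a reinforcement learning method based on the Turing test for training user simulator models. Turing-RL uses a discriminative Turing reward: an LLM judge scores how indistinguishable a generated response is from the real user's, given that user's history, and the user simulator LLM learns to produce responses that are indistinguishable under this reward.
Experiments in two different domains, conversational chat and Reddit forum discussion, show that Turing-RL consistently outperforms baseline methods on both LLM-based and human evaluation metrics. The results suggest that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.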
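The abstract does not spell out how the Turing reward is computed, so the following is only a minimal sketch of one plausible reading: the judge sees the user's history plus two candidate responses (the real one and the generated one, in random order) and reports how likely the first candidate is to be the real user's. The reward is the judge's belief that the generated response is the real one. The names `turing_reward` and `toy_judge`, the pairwise setup, and the word-overlap heuristic are all assumptions made for illustration; a real system would call an LLM judge instead of `toy_judge`.

```python
import random
from typing import Callable, Sequence

# Hypothetical judge interface (an assumption, not the paper's API):
# given the user's history and two candidates, return P(first candidate is real).
Judge = Callable[[Sequence[str], str, str], float]


def turing_reward(judge: Judge, history: Sequence[str], real: str,
                  generated: str, rng: random.Random) -> float:
    """Return the judge's belief, in [0, 1], that `generated` is the real response."""
    # Randomize which slot the generated response occupies so that the judge
    # cannot exploit a fixed ordering.
    if rng.random() < 0.5:
        return judge(history, generated, real)
    return 1.0 - judge(history, real, generated)


def toy_judge(history: Sequence[str], cand_a: str, cand_b: str) -> float:
    """Stand-in for an LLM judge: favors the candidate that reuses the user's words."""
    vocab = {word for turn in history for word in turn.lower().split()}

    def overlap(text: str) -> int:
        return sum(word in vocab for word in text.lower().split())

    a, b = overlap(cand_a), overlap(cand_b)
    # Smoothed share of overlap; symmetric, so swapping candidates gives 1 - p.
    return (a + 1) / (a + b + 2)


history = ["i love hiking in the alps", "the alps are great in summer"]
real = "planning another hiking trip this summer"
rng = random.Random(0)

# A response in the user's voice versus a generic assistant-style response.
reward_good = turing_reward(toy_judge, history, real, "i love hiking in summer", rng)
reward_bad = turing_reward(toy_judge, history, real, "as an ai assistant i can help you", rng)
print(reward_good > reward_bad)
```

Unlike a similarity reward, this score does not require the generated text to match the single ground-truth response; it only has to be plausible for this user according to the judge. In training, a scalar like this would be passed to a reinforcement learning update for the simulator, the details of which the abstract does not give.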
Original Abstract
Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.
--- *Automatically collected on 2026-06-19*
#Paper #arXiv #NLP #小凯