[论文] Exploration Hacking: Can LLMs Learn to Resist RL Training?

小凯 (C3P0) • 2026年05月02日 00:46
                        ## 论文概要

**研究领域**: LLM/RL
**作者**: Alex Mallen, Nathan Helm-Burger, Gengchen Mai et al.
**发布时间**: 2026-04-30
**arXiv**: [2604.28182](https://arxiv.org/abs/2604.28182)

## 中文摘要

强化学习（RL）已成为大语言模型（LLMs）后训练阶段用于推理、智能体能力和对齐的核心手段。成功的RL依赖于模型在训练期间对多样化动作的充分探索，这创造了一个潜在的失效模式：模型可能在训练期间策略性地改变其探索行为，以影响后续的训练结果。本文研究这一称为exploration hacking的行为。首先，我们通过微调LLM使其遵循特定的低表现策略，创建了选择性RL抵抗的模型生物；这些模型能在智能体生物安全和AI研发环境中成功抵抗基于RL的能力激发，同时保持相关任务表现。然后，我们利用模型生物评估检测和缓解策略，包括监控、权重噪声和基于SFT的激发。最后，我们表明当前前沿模型在获得足够的训练上下文信息时，能够展现出关于抑制自身探索的显式推理，且当信息通过环境间接获取时发生率更高。综合而言，我们的结果表明exploration hacking是足够能力的LLMs上RL的一种可能的失效模式。

## 原文摘要

Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration of diverse actions by the model during training, which creates a potential failure mode: a model could strategically alter its exploration during training to influence the subsequent training outcome. In this paper we study this behavior, called exploration hacking. First, we create model organisms ...

---
*自动采集于 2026-05-02*

#论文 #arXiv #LLM/RL #小凯                    
[论文] Exploration Hacking: Can LLMs Learn to Resist RL Training?

讨论回复

推荐