## Paper Summary
**Research area**: ML
**Authors**: Haomiaomiao Wang, Tomás E Ward, Lili Zhang
## Abstract
Non-stationary environments require agents to revise previously learned action values when contingencies change. We treat large language models (LLMs) as sequential decision policies in a two-option probabilistic reversal-learning task with three latent states and switch events triggered by either a performance criterion or timeout. We compare a deterministic fixed transition cycle to a stochastic random schedule that increases volatility, and evaluate DeepSeek-V3.2, Gemini-3, and GPT-5.2, with human data as a behavioural reference. Across models, win-stay was near ceiling while lose-shift was markedly attenuated, revealing asymmetric use of positive versus negative evidence. DeepSeek-V3.2 showed extreme perseveration after reversals and weak acquisition, whereas Gemini-3 and GPT-5.2 adapted more rapidly but still remained less loss-sensitive than humans. Random transitions amplified reversal-specific persistence across LLMs yet did not uniformly reduce total wins, demonstrating that high aggregate payoff can coexist with rigid adaptation. Hierarchical reinforcement-learning (RL) fits indicate dissociable mechanisms: rigidity can arise from weak loss learning, inflated policy determinism, or value polarisation via counterfactual suppression. These results motivate reversal-sensitive diagnostics and volatility-aware models for evaluating LLMs under non-stationary uncertainty.
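The abstract attributes rigidity to three dissociable RL mechanisms: weak loss learning, inflated policy determinism, and value polarisation via counterfactual suppression. The minimal sketch below shows how those mechanisms map onto an asymmetric Q-learning simulation of a simplified version of the task (two arms, deterministic reversals only): a low `alpha_loss` weakens loss learning, a high softmax `beta` inflates determinism, and `kappa = 0` suppresses counterfactual updates to the unchosen arm. All parameter names and values here are illustrative assumptions, not the paper's actual model or settings; the win-stay/lose-shift helper mirrors the behavioural metrics the abstract reports.

```python
# Illustrative sketch (not the authors' code) of an asymmetric Q-learning
# account of reversal learning. Parameters and the reversal rule are
# assumptions for demonstration, not the paper's fitted values.
import math
import random

def softmax_choice(q, beta):
    """Pick arm 0 or 1 with probability proportional to exp(beta * Q)."""
    w = [math.exp(beta * v) for v in q]
    r = random.random() * (w[0] + w[1])
    return 0 if r < w[0] else 1

def simulate(n_trials=200, alpha_win=0.4, alpha_loss=0.1, beta=5.0,
             kappa=0.0, p_good=0.8, reversal_every=40, seed=0):
    """Simulate a two-armed reversal task; return (choice, reward) pairs.

    alpha_win / alpha_loss: asymmetric learning rates (alpha_loss <<
    alpha_win models weak loss learning). beta: softmax inverse
    temperature (high beta models inflated policy determinism). kappa:
    counterfactual update weight for the unchosen arm (0 = suppressed).
    """
    random.seed(seed)
    q = [0.5, 0.5]
    good = 0
    history = []
    for t in range(n_trials):
        if t > 0 and t % reversal_every == 0:
            good = 1 - good                          # deterministic reversal
        a = softmax_choice(q, beta)
        p = p_good if a == good else 1.0 - p_good
        r = 1.0 if random.random() < p else 0.0
        alpha = alpha_win if r == 1.0 else alpha_loss
        q[a] += alpha * (r - q[a])                   # factual update
        q[1 - a] += kappa * ((1.0 - r) - q[1 - a])   # counterfactual update
        history.append((a, r))
    return history

def win_stay_lose_shift(history):
    """Compute win-stay and lose-shift rates from (choice, reward) pairs."""
    ws = [0, 0]  # [stays after win, wins]
    ls = [0, 0]  # [shifts after loss, losses]
    for (a0, r0), (a1, _) in zip(history, history[1:]):
        if r0 == 1.0:
            ws[1] += 1
            ws[0] += int(a1 == a0)
        else:
            ls[1] += 1
            ls[0] += int(a1 != a0)
    return ws[0] / max(ws[1], 1), ls[0] / max(ls[1], 1)

if __name__ == "__main__":
    h = simulate()
    print("win-stay %.2f, lose-shift %.2f" % win_stay_lose_shift(h))
```

With these defaults the simulated agent reproduces the qualitative pattern the abstract describes for LLMs: win-stay near ceiling, lose-shift attenuated, and perseveration after each reversal.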
---
*Auto-collected on 2026-04-07*
#Paper #arXiv #AI #小凯 #AutoCollected