[Paper] Verifiable Environments Are LEGO Bricks: Recursive Composition fo...
Paper Overview
Research area: NLP
Authors: Hao Xiang, Qiaoyu Tang, Le Yu, Yaojie Lu, Xianpei Han, Ben He, Le Sun, Bowen Yu, Peng Wang, Hongyu Lin, Dayiheng Liu
Published: 2026-06-10
arXiv: 2606.12373
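Abstract (translated from the Chinese summary)
Reinforcement learning (RL) with verifiable environments has become a powerful approach for enhancing the reasoning capabilities of large language models. Prior work shows that scaling the number of environments improves RL performance, but existing approaches build environments manually or one at a time, so the environment count grows only linearly with construction effort, which hinders scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that treats verifiable environments as composable building blocks that can be assembled recursively. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, the two can be automatically fused into a new verifiable environment, and the result can itself be composed again. RACES is implemented with 300 individual environments and defines a set of composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composed environments consistently improves reasoning generalization. Specifically, across six benchmarks that were unseen during training-environment construction, RACES improves DeepSeek-R1-Distill-Qwen-14B by 3.1 points on average (from 48.2 to 51.3) and raises Qwen3-14B from 58.8 to 61.1. In addition, with only 50 base environments RACES reaches performance comparable to training on 300 individual environments, demonstrating a substantial gain in environment-utilization efficiency.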
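To make the type-matching idea concrete, here is a minimal Python sketch of a SEQUENTIAL-style composition. It is not the authors' implementation: the `Env` class, the `sequential` function, and the three toy environments are hypothetical names chosen for illustration, and the abstract does not specify how RACES represents types or verifiers. The sketch only shows that when one environment's codomain equals another's domain, the fused result is again an environment with a domain, a codomain, and a verifier, so it can be composed further.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """Hypothetical verifiable environment: a typed task with a reference solver."""
    name: str
    domain: type                 # input type
    codomain: type               # output type
    solve: Callable[[Any], Any]  # reference solution used for verification

    def verify(self, x: Any, answer: Any) -> bool:
        # An answer is accepted when it equals the reference solution.
        return self.solve(x) == answer


def sequential(first: Env, second: Env) -> Env:
    """Fuse two environments when first's codomain matches second's domain."""
    if first.codomain is not second.domain:
        raise TypeError(
            f"cannot compose {first.name} -> {second.name}: "
            f"{first.codomain.__name__} != {second.domain.__name__}"
        )
    return Env(
        name=f"{first.name}>>{second.name}",
        domain=first.domain,
        codomain=second.codomain,
        solve=lambda x: second.solve(first.solve(x)),
    )


# Three toy base environments (assumptions, not taken from the paper).
count_words = Env("count_words", str, int, lambda s: len(s.split()))
square = Env("square", int, int, lambda n: n * n)
is_even = Env("is_even", int, bool, lambda n: n % 2 == 0)

# Recursive composition: the result of one fusion is fed into another.
composite = sequential(sequential(count_words, square), is_even)
```

Here `composite` maps a string to a boolean: it counts the words, squares the count, and checks parity. Because the fused object has the same shape as a base environment, a small set of base environments can in principle yield many distinct composites, which is the scaling argument the abstract makes.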
Original Abstract
Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing manual or individual construction methods suffer from linear scaling limits, thereby hindering scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that conceptualizes verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, they can be automatically fused into a new verifiable environment, enabling...
--- *Automatically collected on 2026-06-12*
#paper #arXiv #NLP #小凯