## Paper Overview
**Field**: ML
**Authors**: Perry Dong, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn
**Published**: 2026-04-21
**arXiv**: [2604.19730](https://arxiv.org/abs/2604.19730)
## Abstract (translated from the Chinese summary)
Some of today's best-performing reinforcement learning algorithms can be prohibitively expensive because they rely on test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work we propose FASTER, a method that obtains the benefits of sampling-based test-time scaling without the computational cost, by tracing the performance gain of action samples back to earlier stages of the denoising process. Our key insight is that denoising multiple action candidates and selecting the best one can be modeled as a Markov Decision Process (MDP) whose goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and a value function in the denoising space that predict the downstream value of action candidates during denoising and filter them to maximize return. The result is a lightweight method that plugs into existing generative RL algorithms. On challenging long-horizon manipulation tasks in online and batched-online RL, FASTER consistently improves the underlying policy and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER matches its performance while substantially reducing both training and inference compute. Code: https://github.com/alexanderswerdlow/faster.
## Original Abstract
Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstr...
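The core mechanism described above can be illustrated with a toy sketch: instead of fully denoising every action candidate and picking the best at the end, a value function over partially denoised actions prunes low-value candidates mid-way. Everything here (`denoise_step`, `value_fn`, the pull-toward-target dynamics, `keep_frac`) is a hypothetical stand-in, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = np.ones(4)  # hypothetical "good" action; stands in for high-return behavior


def denoise_step(actions: np.ndarray) -> np.ndarray:
    # Stand-in for one reverse-diffusion step of a diffusion policy's action
    # head: nudge noisy candidates toward the target and inject a little noise.
    return actions + 0.5 * (TARGET - actions) + 0.05 * rng.normal(size=actions.shape)


def value_fn(actions: np.ndarray) -> np.ndarray:
    # Stand-in for a learned critic in the *denoising space*: it scores
    # partially denoised candidates. Negative distance to the target mimics
    # a prediction of downstream return.
    return -np.linalg.norm(actions - TARGET, axis=1)


def filter_while_denoising(num_candidates: int = 16,
                           num_steps: int = 8,
                           keep_frac: float = 0.5) -> np.ndarray:
    """Progressively filter action candidates before denoising completes,
    rather than running best-of-N selection on fully denoised samples."""
    actions = rng.normal(size=(num_candidates, TARGET.shape[0]))  # pure noise
    for _ in range(num_steps):
        actions = denoise_step(actions)
        if actions.shape[0] > 1:
            keep = max(1, int(actions.shape[0] * keep_frac))
            # Drop the low-value half; surviving candidates keep denoising.
            actions = actions[np.argsort(value_fn(actions))[-keep:]]
    return actions[np.argmax(value_fn(actions))]


best_action = filter_while_denoising()
```

Because the candidate pool halves at each step, the total number of denoising-step evaluations is roughly 2N rather than N times the step count, which is the source of the compute savings the abstract claims.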
---
*Automatically collected on 2026-04-23*
#paper #arXiv #ML #小凯