## Paper Summary
**Field**: CV
**Authors**: Zewei Zhou, Ruining Yang, Xuewei Qi, Yiluan Guo, Sherry X. Chen, Tao Feng, Kateryna Pistunova, Yishan Shen, Lili Su, Jiaqi Ma
**Published**: 2026-04-21
**arXiv**: [2604.19710](https://arxiv.org/abs/2604.19710)
## Abstract (translated from Chinese)
Vision-Language-Action (VLA) models offer a promising paradigm for autonomous driving that leverages world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often suffer from high latency when generating actions with an autoregressive framework, and exhibit limited robustness. This paper proposes SpanVLA, a novel end-to-end autonomous driving framework that integrates autoregressive reasoning with a flow-matching action expert. First, SpanVLA introduces an efficient bridge that exploits the VLM's visual and reasoning guidance to plan future trajectories efficiently with a flow-matching policy conditioned on a historical-trajectory initialization, significantly reducing inference time. Second, to further improve performance and robustness, we propose a GRPO-based post-training method that lets the VLA model learn not only from positive samples but also how to avoid typical negative behaviors and how to recover from them. We further introduce mReasoning, a new real-world driving reasoning dataset focused on complex, reasoning-intensive scenarios and negative-recovery samples. Extensive experiments on NAVSIM (v1 and v2) demonstrate the competitiveness of SpanVLA, and qualitative results across diverse scenarios further highlight its planning performance and robustness.
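To make the flow-matching step above concrete: instead of denoising from pure noise, the policy integrates a learned velocity field from an initial trajectory extrapolated from history, conditioned on VLM guidance. Below is a minimal toy sketch of that idea; the velocity field, the `target` conditioning, and all names here are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

def flow_matching_rollout(v_field, x_init, cond, steps=20):
    """Euler-integrate a velocity field dx/dt = v(x, t, cond) from t=0 to t=1.

    x_init : (T, 2) initial trajectory guess; here extrapolated from history
             rather than sampled as noise, which shortens the effective flow.
    cond   : conditioning dict (a stand-in for VLM reasoning features).
    """
    x = x_init.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * v_field(x, t, cond)
    return x

def toy_v_field(x, t, cond):
    # Toy learned field: velocity points from the current trajectory
    # toward a desired future plan carried in the conditioning.
    return cond["target"] - x

# History-based initialization: continue the last observed heading.
history = np.array([[0.0, 0.0], [1.0, 0.0]])
step = history[-1] - history[-2]
x_init = history[-1] + step * np.arange(1, 5)[:, None]   # (4, 2) straight-line guess

target = np.array([[1.5, 0.5], [2.0, 1.0], [2.5, 1.5], [3.0, 2.0]])
plan = flow_matching_rollout(toy_v_field, x_init, {"target": target})
```

Because the Euler steps contract the gap toward `target`, the resulting `plan` lies strictly closer to the target trajectory than the straight-line initialization; a real action expert would learn `v_field` from data rather than hand-code it.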
## Original Abstract
Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often struggle with the high latency in action generation using an autoregressive generation framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework, integrating an autoregressive reasoning and a flow-matching action expert. First, SpanVLA introduces an efficient bridge to leverage the vision and reasoning guidance of VLM to efficiently plan future trajectories using a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the perfo...
---
*Automatically collected on 2026-04-23*
#Paper #arXiv #CV #小凯