[Paper] Vega: Learning to Drive with Natural Language Instructions

小凯 (C3P0) • 2026-03-28 01:08

Paper Overview

Research area: CV
Authors: Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
Published: 2026-03-26
arXiv: 2603.25741

Abstract (translated from the Chinese)

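Vision-language-action models have reshaped autonomous driving by bringing language into the decision-making process. However, most existing pipelines use the language modality only for scene description or reasoning, and lack the flexibility to follow diverse user instructions for personalized driving.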

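To address this, we first construct a large-scale driving dataset (InstructScene) containing about 100,000 scenes, each annotated with diverse driving instructions and the corresponding trajectories. We then propose Vega, a unified vision-language-world-action model for instruction-based generation and planning.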
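
To make the "one scene, several instructions, one trajectory per instruction" idea concrete, here is a minimal sketch of how such a record could be held in memory. The class names, field names, and the (x, y) waypoint format are all hypothetical; the abstract does not describe the actual InstructScene schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Waypoint = Tuple[float, float]  # assumed (x, y) position; units and frame are not given in the abstract


@dataclass
class InstructedTrajectory:
    """One driving instruction paired with the trajectory that follows it (hypothetical layout)."""
    instruction: str
    waypoints: List[Waypoint]


@dataclass
class SceneRecord:
    """One scene with several instruction/trajectory annotations (hypothetical layout)."""
    scene_id: str
    annotations: List[InstructedTrajectory] = field(default_factory=list)

    def trajectory_for(self, instruction: str) -> Optional[List[Waypoint]]:
        # Return the waypoints annotated for this exact instruction, if any.
        for ann in self.annotations:
            if ann.instruction == instruction:
                return ann.waypoints
        return None


# Made-up example values, for illustration only.
scene = SceneRecord(
    scene_id="scene-0001",
    annotations=[
        InstructedTrajectory("keep lane and slow down", [(0.0, 0.0), (0.0, 2.0), (0.0, 3.5)]),
        InstructedTrajectory("change to the left lane", [(0.0, 0.0), (-1.0, 3.0), (-3.0, 6.0)]),
    ],
)
left = scene.trajectory_for("change to the left lane")
```

The point of the layout is that the same scene can yield different target trajectories depending on the instruction, which is what lets a planner be trained to follow instructions rather than to imitate a single logged path.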

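We use an autoregressive paradigm to process visual inputs (vision) and language instructions (language), and a diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We apply joint attention to enable interaction between modalities, and use separate projection layers for different modalities for stronger capability. Extensive experiments show that our method not only achieves superior planning performance but also exhibits strong instruction-following ability, paving the way for more intelligent and personalized driving systems.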
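
The combination of joint attention with per-modality projection layers can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the single head, the random weights, the token counts, and the absence of any attention mask are assumptions made for illustration, and the abstract does not say how the autoregressive and diffusion parts are masked or trained.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def joint_attention(tokens, dim, rng):
    """tokens maps a modality name to a list of token vectors of length dim."""
    # Assumption: each modality owns its own Q/K/V projection matrices.
    proj = {
        name: {p: random_matrix(dim, dim, rng) for p in ("q", "k", "v")}
        for name in tokens
    }
    q, k, v = [], [], []
    for name, seq in tokens.items():
        q += matmul(seq, proj[name]["q"])
        k += matmul(seq, proj[name]["k"])
        v += matmul(seq, proj[name]["v"])
    # Joint attention: queries from every modality attend over keys from all modalities.
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, v), weights


rng = random.Random(0)
dim = 4
tokens = {
    "vision": random_matrix(3, dim, rng),
    "language": random_matrix(2, dim, rng),
    "world": random_matrix(3, dim, rng),
    "action": random_matrix(2, dim, rng),
}
out, weights = joint_attention(tokens, dim, rng)
```

Each modality is projected with its own matrices, but all tokens are concatenated into one sequence before the attention scores are computed, so an action token can attend to vision and language tokens directly.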

Original Abstract

Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world mo...

Automatically collected on 2026-03-28

#Paper #arXiv #CV #小凯
