[Paper] VLGA: Vision-Language-Geometry-Action Models for Autonomous Drivi...

Paper Overview

Research area: CV Authors: Jin Yao, Dhruva Dixith Kurra, Tom Lampo, Zezhou Cheng, Danhua Guo, Burhan Yaman Published: 2026-06-10 arXiv: 2606.12396

Abstract (translated from Chinese)

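Vision-language-action (VLA) models can describe scenes and reason about them in language, but they still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments on the challenging nuScenes and Bench2Drive datasets, for open-loop and closed-loop evaluation respectively, show VLGA's superiority over its VLA counterparts. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods that do not use ego status, with the lowest L2 error (0.50 m on average) and the lowest 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA achieves a state-of-the-art driving score of 79.08, which is 0.71 higher than the previous strongest VLA, with comparable efficiency and comfort.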
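To make "per-pixel pointmap regression loss against LiDAR" more concrete, here is a minimal NumPy sketch of one way such supervision could be set up. It is not the authors' implementation: the abstract does not specify the loss form, the projection details, or any names. The function names `lidar_to_pointmap` and `masked_pointmap_l1`, the pinhole intrinsics `K`, and the choice of an L1 penalty are all assumptions made for illustration.

```python
import numpy as np

def lidar_to_pointmap(points_cam, K, height, width):
    """Build a sparse per-pixel pointmap target from LiDAR points.

    Assumption: points_cam is an (N, 3) array already expressed in the
    camera frame, and K is a 3x3 pinhole intrinsics matrix.
    Returns a (H, W, 3) target and a (H, W) boolean validity mask.
    """
    target = np.zeros((height, width, 3), dtype=np.float32)
    mask = np.zeros((height, width), dtype=bool)

    # Keep only points in front of the camera.
    pts = points_cam[points_cam[:, 2] > 0]
    # Order far-to-near so that nearer points are written later when
    # several points fall on the same pixel.
    pts = pts[np.argsort(-pts[:, 2])]

    # Pinhole projection: [u*z, v*z, z]^T = K @ [x, y, z]^T.
    uvw = pts @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Discard projections that land outside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, pts = u[inside], v[inside], pts[inside]

    target[v, u] = pts
    mask[v, u] = True
    return target, mask

def masked_pointmap_l1(pred, target, mask):
    """Mean absolute error over pixels that have a LiDAR return.

    The L1 form is an illustrative choice, not taken from the paper.
    """
    if not mask.any():
        return 0.0
    return float(np.abs(pred[mask] - target[mask]).mean())
```

In this sketch the mask restricts the loss to pixels that actually receive a LiDAR return, so the predicted pointmap is dense while the supervision signal is only as dense as the projected point cloud.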

Original Abstract

Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments conducted on challenging nuScenes and Bench2Drive datasets for open-loop and closed-l...

--- *Automatically collected on 2026-06-12*

#Paper #arXiv #CV #小凯
