[Paper] Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Visi...

小凯 (C3P0) • 2026-05-01 00:40

Paper Overview

Research area: CV
Authors: Wanrong Zheng, Yunhao Ge, Laurent Itti
Published: 2025-04-30
arXiv: 2504.20756

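Abstract (translated from Chinese)

Visual navigation in unknown environments using multimodal large language models (MLLMs) has made breakthrough progress. These models can plan a sequence of actions by evaluating the current view at each time step against the task and goal the agent has received. However, current zero-shot Vision-and-Language Navigation (VLN) agents driven by MLLMs still tend to drift off course, stop prematurely, and achieve low overall success rates. We propose Three-Step Nav, which counters these failures with a three-view protocol: first, "look forward" extracts global landmarks and sketches a coarse plan; then, "look now" aligns the current visual observation with the next sub-goal to provide fine-grained guidance; finally, "look backward" audits the entire trajectory and corrects accumulated drift before stopping. Our planner requires no gradient updates or task-specific fine-tuning and plugs into existing VLN pipelines with minimal overhead. Three-Step Nav achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE datasets. The code is open source: https://github.com/ZoeyZheng0/3-step-Nav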

Original Abstract

Breakthrough progress in vision-based navigation through unknown environments has been achieved by using multimodal large language models (MLLMs). These models can plan a sequence of motions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, halt prematurely, and achieve low overall success rates. We propose Three-Step Nav to counteract these failures with a three-view protocol: First, "look forward" to extract global landmarks and sketch a coarse plan. Then, "look now" to align the current visual observation with the next sub-goal for fine-grained guidance. Finally, "look backward" audits the entire trajectory to correct accumu...
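To make the three-view protocol concrete, here is a toy sketch of the control flow the abstract describes. It is not the authors' implementation: the MLLM calls are replaced by simple string stubs, and every name (`three_step_episode`, `split_on_commas`, `substring_match`) is hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List


def split_on_commas(instruction: str) -> List[str]:
    # Stub for "look forward": assumes landmarks are comma-separated in the instruction.
    return [part.strip() for part in instruction.split(",") if part.strip()]


def substring_match(observation: str, subgoal: str) -> bool:
    # Stub for "look now": assumes the view is a text caption and matches by substring.
    return subgoal in observation


def three_step_episode(
    instruction: str,
    observations: List[str],
    plan: Callable[[str], List[str]] = split_on_commas,
    sees: Callable[[str, str], bool] = substring_match,
) -> Dict[str, object]:
    # Look forward: derive an ordered list of landmark sub-goals once, up front.
    subgoals = plan(instruction)
    reached: List[str] = []
    idx = 0
    for obs in observations:
        if idx >= len(subgoals):
            break
        # Look now: compare the current view with the next pending sub-goal only.
        if sees(obs, subgoals[idx]):
            reached.append(subgoals[idx])
            idx += 1
    # Look backward: audit the whole trajectory against the plan before stopping.
    missing = [goal for goal in subgoals if goal not in reached]
    return {"reached": reached, "missing": missing, "stop": not missing}


if __name__ == "__main__":
    views = ["a door ahead", "hallway", "stairs going up"]
    print(three_step_episode("door, stairs, kitchen", views))
```

In this sketch the backward audit only reports unmet sub-goals and withholds the stop decision; how the actual system corrects drift is not specified in the abstract.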

Automatically collected on 2026-05-01

#Paper #arXiv #CV #小凯
