## Paper Summary
**Field**: CV
**Authors**: Wanrong Zheng, Yunhao Ge, Laurent Itti
**Published**: 2025-04-30
**arXiv**: [2504.20756](https://arxiv.org/abs/2504.20756)
## Abstract (Translated from Chinese)
Vision-based navigation in unknown environments using multimodal large language models (MLLMs) has achieved breakthrough progress. These models can plan a sequence of actions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, stop prematurely, and achieve low overall success rates. We propose Three-Step Nav, which counteracts these failures with a three-view protocol: first, "look forward" to extract global landmarks and sketch a coarse plan; then, "look now" to align the current visual observation with the next sub-goal for fine-grained guidance; finally, "look backward" to audit the entire trajectory and correct accumulated drift before stopping. Our planner requires no gradient updates or task-specific fine-tuning, and plugs into existing VLN pipelines with minimal overhead. Three-Step Nav achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE datasets. Code is available at: https://github.com/ZoeyZheng0/3-step-Nav
## Original Abstract
Breakthrough progress in vision-based navigation through unknown environments has been achieved by using multimodal large language models (MLLMs). These models can plan a sequence of motions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, halt prematurely, and achieve low overall success rates. We propose Three-Step Nav to counteract these failures with a three-view protocol: First, "look forward" to extract global landmarks and sketch a coarse plan. Then, "look now" to align the current visual observation with the next sub-goal for fine-grained guidance. Finally, "look backward" audits the entire trajectory to correct accumu...
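The three-view protocol from the abstract can be sketched as a simple control loop around an MLLM. This is a minimal illustration only, not the paper's implementation: all names here (`look_forward`, `look_now`, `look_backward`, `navigate`, the prompt strings, and the `mllm` callable interface) are hypothetical, and the real system operates on visual observations rather than text stand-ins.

```python
# Hedged sketch of a three-view navigation loop. The `mllm` callable
# (prompt -> text reply) is an assumed interface, not the paper's API.
from typing import Callable, List

MLLM = Callable[[str], str]

def look_forward(instruction: str, mllm: MLLM) -> List[str]:
    """Step 1: extract global landmarks and sketch a coarse sub-goal plan."""
    reply = mllm(f"Plan sub-goals for: {instruction}")
    return [g.strip() for g in reply.split(";") if g.strip()]

def look_now(view: str, subgoal: str, mllm: MLLM) -> str:
    """Step 2: align the current observation with the next sub-goal."""
    return mllm(f"View: {view}. Sub-goal: {subgoal}. Action?")

def look_backward(trajectory: List[str], instruction: str, mllm: MLLM) -> bool:
    """Step 3: audit the trajectory so far before committing to STOP."""
    verdict = mllm(f"Trajectory {trajectory} for '{instruction}': complete?")
    return verdict.strip().lower().startswith("yes")

def navigate(instruction: str, views: List[str], mllm: MLLM) -> List[str]:
    plan = look_forward(instruction, mllm)      # coarse global plan
    trajectory: List[str] = []
    for step, view in enumerate(views):
        subgoal = plan[min(step, len(plan) - 1)]
        action = look_now(view, subgoal, mllm)  # fine-grained local choice
        if action == "STOP" and not look_backward(trajectory, instruction, mllm):
            continue  # the audit rejected a premature stop; keep navigating
        trajectory.append(action)
        if action == "STOP":
            break
    return trajectory
```

The point of the structure is that stopping is gated twice: the local step may propose `STOP`, but only the trajectory-level audit can confirm it, which is how the protocol targets the premature-halting failure mode described above.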
---
*Automatically collected on 2026-05-01*
#Paper #arXiv #CV #小凯