AgentVLN is an efficient embodied agent framework for long-horizon vision-and-language navigation (VLN) in unseen environments. It formulates VLN as a POSMDP and follows a VLM-as-Brain paradigm that decouples high-level semantic reasoning from low-level perception and planning through a plug-and-play skill library.
Real-world experiments show that AgentVLN can execute instruction-following navigation in both indoor and outdoor scenes while maintaining robust planning and efficient deployment. We will release real-world video demos soon.
Please see our project page for the HD demos.
- VLM-as-Brain Navigation: AgentVLN decomposes long-horizon navigation into high-level reasoning and modular skill execution under a unified agentic framework.
- Cross-space Representation Mapping: 3D topological waypoints are projected into the image plane as pixel-aligned visual prompts, bridging the gap between 3D planning and 2D VLM perception.
- Context-aware Self-correction: Fine-grained active exploration helps the agent recover from occlusions, blind spots, and trajectory drift during long-horizon navigation.
- QD-PCoT for Spatial Ambiguity: The Query-Driven Perceptual Chain-of-Thought mechanism enables the agent to actively query missing geometric cues for more precise target grounding.
- Lightweight Edge Deployment: AgentVLN achieves a strong accuracy-efficiency trade-off and supports real-time local inference on embedded edge platforms.
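The cross-space representation mapping above amounts to projecting 3D waypoints into the 2D image so the VLM can reason over pixel-aligned prompts. The following is a minimal sketch of that projection using a standard pinhole camera model; the function name, the intrinsics/extrinsics values, and the visibility filtering are illustrative assumptions, not AgentVLN's actual API.

```python
# Hedged sketch: project 3D world-frame waypoints into the image plane as
# pixel coordinates, so they can be overlaid as visual prompts for a VLM.
# Assumes a pinhole camera model; all names and values are illustrative.
import numpy as np

def project_waypoints(waypoints_world, K, T_world_to_cam, img_size):
    """Project (N, 3) world-frame waypoints to (N, 2) pixel coordinates.

    Returns the pixel coords and a boolean mask marking waypoints that are
    in front of the camera and inside the image bounds.
    """
    pts = np.asarray(waypoints_world, dtype=float)            # (N, 3)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])          # homogeneous (N, 4)
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]                 # camera frame (N, 3)
    in_front = cam[:, 2] > 1e-6                               # positive depth only
    uv_h = (K @ cam.T).T                                      # apply intrinsics (N, 3)
    uv = uv_h[:, :2] / np.clip(uv_h[:, 2:3], 1e-6, None)      # perspective divide
    w, h = img_size
    in_bounds = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, in_front & in_bounds

# Example: identity extrinsics, simple intrinsics, one waypoint 2 m ahead
# and one behind the camera (which is filtered out by the mask).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
T = np.eye(4)
uv, visible = project_waypoints([[0.0, 0.0, 2.0], [0.0, 0.0, -1.0]], K, T, (640, 480))
# The on-axis waypoint lands at the principal point (320, 240).
```

In a system like this, the surviving pixel coordinates would typically be drawn as numbered markers on the RGB frame before it is passed to the VLM, turning 3D planning candidates into 2D choices the model can directly reference.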
Compared with prior VLN systems that rely on larger models or remote cloud execution, AgentVLN is designed for efficient local deployment. The framework delivers a better accuracy-efficiency balance on long-horizon VLN benchmarks while remaining lightweight enough for real-time on-device inference.
AgentVLN consistently outperforms prior state-of-the-art methods on the Val-Unseen splits of R2R-CE and RxR-CE, demonstrating strong generalization in complex unseen environments.
- Release the project page and paper PDF
- Release AgentVLN-Instruct
- Open-source training and inference code
- Release pretrained model checkpoints
- Add installation and environment setup instructions
@misc{xin2026agentvln,
title={AgentVLN: Towards Agentic Vision-and-Language Navigation},
author={Zihao Xin and Wentong Li and Yixuan Jiang and Ziyuan Huang and Bin Wang and Piji Li and Jianke Zhu and Jie Qin and Sheng-Jun Huang},
year={2026},
eprint={2603.17670},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2603.17670},
}