# Spatial-X: Zero-Shot Vision-and-Language Navigation with Global Scene Priors

Fudan University · Adelaide University · Shanghai Innovation Institute · University of Southern California

This repository contains the code and data for our series of works on zero-shot Vision-and-Language Navigation (VLN) with global spatial scene priors. To our knowledge, we are the first to close the loop from pre-exploration to physically grounded 3D scene reconstructions (i.e., point clouds) for VLN agents, and we investigate how pre-explored 3D scene priors can serve as a robust reasoning basis for MLLM-based agents in multiple ways.

The overall framework is summarized below:

## 🍻 News & TODOs

- **2026-04-05**: Released the raw code of the SpatialNav agent. (Dependencies and instructions coming soon...)
- [ ] Release the data of predicted spatial scene graphs on perfect, human-crafted point clouds.
- [ ] Release the raw code for environment exploration and scene reconstruction.
- [ ] Release the data of agent-reconstructed noisy scene point clouds.
- [ ] Release the raw code of the SpatialAnt agent.

## 📚 Our Series of Works

### SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation [arXiv]


We propose a zero-shot VLN setting that allows agents to pre-explore the environment and construct a Spatial Scene Graph (SSG) that captures global spatial structure and semantics. Building on the SSG, SpatialNav integrates an agent-centric spatial map, a compass-aligned visual representation, and remote object localization for efficient navigation. SpatialNav significantly outperforms existing zero-shot agents and narrows the gap to state-of-the-art learning-based methods.
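To make the idea concrete, here is a minimal sketch of what a Spatial Scene Graph could look like as a data structure: nodes carry a 3D position and a semantic label, edges encode spatial adjacency, and remote object localization reduces to a labeled nearest-node query. All names here (`SceneNode`, `SpatialSceneGraph`, `locate_object`) are illustrative assumptions, not the released API.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """A region or object node with a semantic label and a 3D position."""
    node_id: int
    label: str       # e.g. "kitchen", "sofa"
    position: tuple  # (x, y, z) in the scene frame

@dataclass
class SpatialSceneGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> SceneNode
    edges: dict = field(default_factory=dict)  # node_id -> set of neighbor ids

    def add_node(self, node: SceneNode) -> None:
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())

    def connect(self, a: int, b: int) -> None:
        """Undirected spatial adjacency (e.g. traversability) between two nodes."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def locate_object(self, label: str, agent_pos: tuple):
        """Remote object localization: nearest node matching a semantic label."""
        matches = [n for n in self.nodes.values() if n.label == label]
        if not matches:
            return None
        return min(matches, key=lambda n: math.dist(agent_pos, n.position))

# Usage: build a tiny graph and localize a remote object from the agent's pose.
ssg = SpatialSceneGraph()
ssg.add_node(SceneNode(0, "kitchen", (0.0, 0.0, 0.0)))
ssg.add_node(SceneNode(1, "sofa", (4.0, 1.0, 0.0)))
ssg.connect(0, 1)
print(ssg.locate_object("sofa", (0.0, 0.0, 0.0)).node_id)  # -> 1
```

Because the graph is global, a query like "find the sofa" can be answered without the object being in the agent's current view, which is what makes the pre-explored prior useful for remote goals.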

### SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation [arXiv]


Building on SpatialNav, SpatialAnt addresses the reality gap that arises when deploying pre-exploration-based agents on real robots. We introduce a physical grounding strategy that recovers metric scale from scene point clouds reconstructed from monocular RGB, and we design a visual anticipation mechanism that renders future observations from the noisy point clouds to support counterfactual reasoning. SpatialAnt achieves state-of-the-art zero-shot performance both in simulation and in real-world deployment on the Hello Robot.
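Monocular reconstructions are only defined up to an unknown global scale, so some known physical quantity must anchor them to meters. The sketch below illustrates one simple grounding strategy (an assumption for illustration, not necessarily the paper's exact method): use the robot's known camera height above the floor, estimate the floor from the lowest points in the cloud, and rescale accordingly.

```python
import random

def recover_metric_scale(points, camera_height_m, floor_fraction=0.02):
    """Hypothetical physical grounding for an up-to-scale point cloud.

    points: list of (x, y, z) in arbitrary units, camera at the origin, z up.
    Returns the cloud rescaled so the camera-to-floor distance equals the
    robot's known camera height in meters.
    """
    zs = sorted(p[2] for p in points)
    # Robust floor estimate: average of the lowest few percent of z values.
    k = max(1, int(len(zs) * floor_fraction))
    floor_z = sum(zs[:k]) / k
    scale = camera_height_m / -floor_z  # meters per arbitrary unit
    return [(x * scale, y * scale, z * scale) for x, y, z in points]

# Usage: a synthetic cloud with a flat floor at z = -2.0 arbitrary units;
# a known camera height of 1.0 m implies a 0.5x metric scale factor.
random.seed(0)
cloud = [(random.uniform(-3, 3), random.uniform(-3, 3), -2.0) for _ in range(200)]
cloud += [(random.uniform(-3, 3), random.uniform(-3, 3), random.uniform(-1, 1))
          for _ in range(200)]
metric = recover_metric_scale(cloud, camera_height_m=1.0)
print(min(p[2] for p in metric))  # floor now at -1.0 m below the camera
```

Averaging over a low quantile rather than taking the single lowest point keeps the estimate robust to the stray outliers that noisy monocular reconstructions typically contain.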

## Installation

(Coming soon ...)

## Performance

### Results in Discrete Environments

- The best and the second-best results within each group are denoted by **bold** and underline.

| Methods | Pre-Exp | R2R TL(↓) | R2R NE(↓) | R2R OSR(↑) | R2R SR(↑) | R2R SPL(↑) | REVERIE OSR(↑) | REVERIE SR(↑) | REVERIE SPL(↑) |
|---|---|---|---|---|---|---|---|---|---|
| *Supervised Learning:* | | | | | | | | | |
| NavCoT | -- | 9.95 | 6.36 | 48 | 40 | 37 | 14.2 | 9.2 | 7.2 |
| PREVALENT | -- | 10.19 | 4.71 | -- | 58 | 53 | -- | -- | -- |
| VLN-BERT | -- | 12.01 | 3.93 | 69 | 63 | 57 | 27.7 | 25.5 | 21.1 |
| HAMT | -- | 11.46 | 2.29 | 73 | 66 | 61 | 36.8 | 33.0 | 30.2 |
| DUET | -- | 13.94 | 3.31 | 81 | 72 | 60 | 51.1 | 47.0 | 33.7 |
| DUET+ScaleVLN | -- | 14.09 | 2.09 | 88 | 81 | 70 | 63.9 | 57.0 | 41.8 |
| *Zero-Shot:* | | | | | | | | | |
| NavGPT | ✗ | 11.45 | 6.46 | 42 | 34 | 29 | 28.3 | 19.2 | 14.6 |
| MapGPT | ✗ | -- | 5.63 | 57.6 | 43.7 | 34.8 | 36.8 | 31.6 | 20.3 |
| MC-GPT | ✗ | -- | 5.42 | 68.8 | 32.1 | -- | 30.3 | 19.4 | 9.7 |
| SpatialGPT | ✗ | -- | 5.56 | 70.8 | 48.4 | 36.1 | -- | -- | -- |
| SpatialNav (Ours) | ✓ | 13.8 | 4.54 | 68.2 | 57.7 | 47.8 | 58.1 | 49.6 | 34.6 |

### Results in Continuous Environments

- The best supervised results are highlighted in **bold**, while the best zero-shot results are underlined.
- "Pre-Exp" denotes whether the zero-shot agent adopts the pre-exploration-based navigation setting.

| # | Methods | Pre-Exp | R2R-CE NE(↓) | R2R-CE OSR(↑) | R2R-CE SR(↑) | R2R-CE SPL(↑) | R2R-CE nDTW(↑) | RxR-CE NE(↓) | RxR-CE SR(↑) | RxR-CE SPL(↑) | RxR-CE nDTW(↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | *Supervised Learning:* | | | | | | | | | | |
| 1 | NavFoM | -- | 4.61 | 72.1 | 61.7 | 55.3 | -- | 4.74 | 64.4 | 56.2 | 65.8 |
| 2 | Efficient-VLN | -- | 4.18 | 73.7 | 64.2 | 55.9 | -- | 3.88 | 67.0 | 54.3 | 68.4 |
| | *Zero-Shot:* | | | | | | | | | | |
| 3 | Open-Nav | ✗ | 6.70 | 23.0 | 19.0 | 16.1 | 45.8 | -- | -- | -- | -- |
| 4 | Smartway | ✗ | 7.01 | 51.0 | 29.0 | 22.5 | -- | -- | -- | -- | -- |
| 5 | STRIDER | ✗ | 6.91 | 39.0 | 35.0 | 30.3 | 51.8 | 11.19 | 21.2 | 9.6 | 30.1 |
| 6 | VLN-Zero | ✗ | 5.97 | 51.6 | 42.4 | 26.3 | -- | 9.13 | 30.8 | 19.0 | -- |
| 7 | SpatialNav (Ours) | ✓ | 5.15 | 66.0 | 64.0 | 51.1 | 65.4 | 7.64 | 32.4 | 24.6 | 55.0 |
| 8 | SpatialAnt (Ours) | ✓ | 4.42 | 76.0 | 66.0 | 54.4 | 69.5 | 5.28 | 50.8 | 35.6 | 65.4 |

## Citation

If you find our work useful, please consider citing:

```bibtex
@article{zhang2026spatialnav,
  title={SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation},
  author={Zhang, Jiwen and Li, Zejun and Wang, Siyuan and Shi, Xiangyu and Wei, Zhongyu and Wu, Qi},
  journal={arXiv preprint arXiv:2601.06806},
  year={2026}
}

@article{zhang2026spatialant,
  title={SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation},
  author={Zhang, Jiwen and Shi, Xiangyu and Wang, Siyuan and Li, Zerui and Wei, Zhongyu and Wu, Qi},
  journal={arXiv preprint arXiv:2603.26837},
  year={2026}
}
```

## Website License

The website code is borrowed from the Nerfies website, and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
