# Spatial-X: Zero-Shot Vision-and-Language Navigation with Global Scene Priors

Fudan University · Adelaide University · Shanghai Innovation Institute · University of Southern California

This repository contains the code and data for our series of works on zero-shot Vision-and-Language Navigation (VLN) with global spatial scene priors. To our knowledge, we are the first to close the loop from pre-exploration to physically grounded 3D scene reconstructions (i.e., point clouds) for VLN agents, and we investigate how pre-explored 3D scene priors can serve as a robust reasoning basis for MLLM-based agents in multiple ways.

The overall framework is summarized below:

## 🍻 News & TODOs

- **2026-04-05**: Released the raw code of the SpatialNav agent. (Dependencies and instructions coming soon...)
- [ ] Release the data of predicted spatial scene graphs on perfect, human-crafted point clouds.
- [ ] Release the raw code for environment exploration and scene reconstruction.
- [ ] Release the data of agent-reconstructed noisy scene point clouds.
- [ ] Release the raw code of the SpatialAnt agent.

## 📚 Our Series of Works

### SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation [arXiv]


We propose a zero-shot VLN setting that allows agents to pre-explore the environment and construct a Spatial Scene Graph (SSG) that captures global spatial structure and semantics. Building on the SSG, SpatialNav integrates an agent-centric spatial map, a compass-aligned visual representation, and remote object localization for efficient navigation. SpatialNav significantly outperforms existing zero-shot agents and narrows the gap to state-of-the-art learning-based methods.
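To make the idea concrete, here is a minimal sketch of what a Spatial Scene Graph could look like as a data structure: nodes carry a 3D position and a semantic label, edges encode spatial adjacency, and remote object localization reduces to a labeled nearest-node query. All names here (`SceneNode`, `SpatialSceneGraph`, `locate_object`) are illustrative assumptions, not the released API.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """A region or object node with a semantic label and a 3D position."""
    node_id: int
    label: str       # e.g. "kitchen", "sofa"
    position: tuple  # (x, y, z) in the scene frame

@dataclass
class SpatialSceneGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> SceneNode
    edges: dict = field(default_factory=dict)  # node_id -> set of neighbor ids

    def add_node(self, node: SceneNode) -> None:
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())

    def connect(self, a: int, b: int) -> None:
        """Undirected spatial adjacency (e.g. traversability) between two nodes."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def locate_object(self, label: str, agent_pos: tuple):
        """Remote object localization: nearest node matching a semantic label."""
        matches = [n for n in self.nodes.values() if n.label == label]
        if not matches:
            return None
        return min(matches, key=lambda n: math.dist(agent_pos, n.position))

# Usage: build a tiny graph and localize a remote object from the agent's pose.
ssg = SpatialSceneGraph()
ssg.add_node(SceneNode(0, "kitchen", (0.0, 0.0, 0.0)))
ssg.add_node(SceneNode(1, "sofa", (4.0, 1.0, 0.0)))
ssg.connect(0, 1)
print(ssg.locate_object("sofa", (0.0, 0.0, 0.0)).node_id)  # -> 1
```

Because the graph is global, a query like "find the sofa" can be answered without the object being in the agent's current view, which is what makes the pre-explored prior useful for remote goals.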

### SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation [arXiv]


Building on SpatialNav, SpatialAnt addresses the reality gap that arises when deploying pre-exploration-based agents on real robots. We introduce a physical grounding strategy that recovers metric scale from scene point clouds reconstructed from monocular RGB, and we design a visual anticipation mechanism that renders future observations from the noisy point clouds to support counterfactual reasoning. SpatialAnt achieves state-of-the-art zero-shot performance both in simulation and in real-world deployment on the Hello Robot.
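Monocular reconstructions are only defined up to an unknown global scale, so some known physical quantity must anchor them to meters. The sketch below illustrates one simple grounding strategy (an assumption for illustration, not necessarily the paper's exact method): use the robot's known camera height above the floor, estimate the floor from the lowest points in the cloud, and rescale accordingly.

```python
import random

def recover_metric_scale(points, camera_height_m, floor_fraction=0.02):
    """Hypothetical physical grounding for an up-to-scale point cloud.

    points: list of (x, y, z) in arbitrary units, camera at the origin, z up.
    Returns the cloud rescaled so the camera-to-floor distance equals the
    robot's known camera height in meters.
    """
    zs = sorted(p[2] for p in points)
    # Robust floor estimate: average of the lowest few percent of z values.
    k = max(1, int(len(zs) * floor_fraction))
    floor_z = sum(zs[:k]) / k
    scale = camera_height_m / -floor_z  # meters per arbitrary unit
    return [(x * scale, y * scale, z * scale) for x, y, z in points]

# Usage: a synthetic cloud with a flat floor at z = -2.0 arbitrary units;
# a known camera height of 1.0 m implies a 0.5x metric scale factor.
random.seed(0)
cloud = [(random.uniform(-3, 3), random.uniform(-3, 3), -2.0) for _ in range(200)]
cloud += [(random.uniform(-3, 3), random.uniform(-3, 3), random.uniform(-1, 1))
          for _ in range(200)]
metric = recover_metric_scale(cloud, camera_height_m=1.0)
print(min(p[2] for p in metric))  # floor now at -1.0 m below the camera
```

Averaging over a low quantile rather than taking the single lowest point keeps the estimate robust to the stray outliers that noisy monocular reconstructions typically contain.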

## Installation

(Coming soon ...)

## Performance

### Results in Discrete Environments

- The best and the second-best results within each group are denoted by **bold** and underline.

| Methods | Pre-Exp | R2R TL(↓) | R2R NE(↓) | R2R OSR(↑) | R2R SR(↑) | R2R SPL(↑) | REVERIE OSR(↑) | REVERIE SR(↑) | REVERIE SPL(↑) |
|---|---|---|---|---|---|---|---|---|---|
| *Supervised Learning:* | | | | | | | | | |
| NavCoT | -- | 9.95 | 6.36 | 48 | 40 | 37 | 14.2 | 9.2 | 7.2 |
| PREVALENT | -- | 10.19 | 4.71 | -- | 58 | 53 | -- | -- | -- |
| VLN-BERT | -- | 12.01 | 3.93 | 69 | 63 | 57 | 27.7 | 25.5 | 21.1 |
| HAMT | -- | 11.46 | 2.29 | 73 | 66 | 61 | 36.8 | 33.0 | 30.2 |
| DUET | -- | 13.94 | 3.31 | 81 | 72 | 60 | 51.1 | 47.0 | 33.7 |
| DUET+ScaleVLN | -- | 14.09 | 2.09 | 88 | 81 | 70 | 63.9 | 57.0 | 41.8 |
| *Zero-Shot:* | | | | | | | | | |
| NavGPT | ✗ | 11.45 | 6.46 | 42 | 34 | 29 | 28.3 | 19.2 | 14.6 |
| MapGPT | ✗ | -- | 5.63 | 57.6 | 43.7 | 34.8 | 36.8 | 31.6 | 20.3 |
| MC-GPT | ✗ | -- | 5.42 | 68.8 | 32.1 | -- | 30.3 | 19.4 | 9.7 |
| SpatialGPT | ✗ | -- | 5.56 | 70.8 | 48.4 | 36.1 | -- | -- | -- |
| SpatialNav (Ours) | ✓ | 13.8 | 4.54 | 68.2 | 57.7 | 47.8 | 58.1 | 49.6 | 34.6 |

### Results in Continuous Environments

- The best supervised results are highlighted in **bold**, while the best zero-shot results are underlined.
- "Pre-Exp" denotes whether the zero-shot agent adopts the pre-exploration-based navigation setting.

| # | Methods | Pre-Exp | R2R-CE NE(↓) | R2R-CE OSR(↑) | R2R-CE SR(↑) | R2R-CE SPL(↑) | R2R-CE nDTW(↑) | RxR-CE NE(↓) | RxR-CE SR(↑) | RxR-CE SPL(↑) | RxR-CE nDTW(↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | *Supervised Learning:* | | | | | | | | | | |
| 1 | NavFoM | -- | 4.61 | 72.1 | 61.7 | 55.3 | -- | 4.74 | 64.4 | 56.2 | 65.8 |
| 2 | Efficient-VLN | -- | 4.18 | 73.7 | 64.2 | 55.9 | -- | 3.88 | 67.0 | 54.3 | 68.4 |
| | *Zero-Shot:* | | | | | | | | | | |
| 3 | Open-Nav | ✗ | 6.70 | 23.0 | 19.0 | 16.1 | 45.8 | -- | -- | -- | -- |
| 4 | Smartway | ✗ | 7.01 | 51.0 | 29.0 | 22.5 | -- | -- | -- | -- | -- |
| 5 | STRIDER | ✗ | 6.91 | 39.0 | 35.0 | 30.3 | 51.8 | 11.19 | 21.2 | 9.6 | 30.1 |
| 6 | VLN-Zero | ✗ | 5.97 | 51.6 | 42.4 | 26.3 | -- | 9.13 | 30.8 | 19.0 | -- |
| 7 | SpatialNav (Ours) | ✓ | 5.15 | 66.0 | 64.0 | 51.1 | 65.4 | 7.64 | 32.4 | 24.6 | 55.0 |
| 8 | SpatialAnt (Ours) | ✓ | 4.42 | 76.0 | 66.0 | 54.4 | 69.5 | 5.28 | 50.8 | 35.6 | 65.4 |

## Citation

If you find our work useful, please consider citing:

```bibtex
@article{zhang2026spatialnav,
  title={SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation},
  author={Zhang, Jiwen and Li, Zejun and Wang, Siyuan and Shi, Xiangyu and Wei, Zhongyu and Wu, Qi},
  journal={arXiv preprint arXiv:2601.06806},
  year={2026}
}

@article{zhang2026spatialant,
  title={SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation},
  author={Zhang, Jiwen and Shi, Xiangyu and Wang, Siyuan and Li, Zerui and Wei, Zhongyu and Wu, Qi},
  journal={arXiv preprint arXiv:2603.26837},
  year={2026}
}
```

## Website License

The website code is borrowed from the Nerfies website, and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
