VLM-Instruct-FastGS enhances 3D Gaussian Splatting by incorporating semantic guidance from Vision-Language Models (VLMs) into the densification process. Under the same sparse initialization and within the same number of iterations, our method achieves more complete scene reconstruction via a semantic guidance strategy:
- Phase 0: Early Main Region Reconstruction – Quickly establishes the primary scene region from multi-view cues at the initial stage.
- Phase 1: Ambient Initialization and Background Completion – Wraps the main subject with an oblique hollow elliptical tube for environment initialization and subsequent background training, covering ceilings/sky and ground while keeping the subject visible in partial views.
- Phase 2: VLM-Guided Targeted Optimization – Leverages VLMs to identify underperforming regions in rendered images, enabling targeted refinement for enhanced scene quality.
By progressively conducting subject-centric reconstruction, ambient initialization, and semantic-aware refinement, our framework effectively improves the completeness and quality of 3D scene reconstruction, especially under sparse inputs and in early training phases.
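The three phases above can be sketched as a simple iteration-to-phase schedule. This is a minimal illustration, not the repository's exact logic; the boundary values (4000 iterations for the end of Phase 0, 20,000 for the end of Phase 1) follow the training-curve description later in this document, and the function name is our own.

```python
def training_phase(iteration: int,
                   phase0_end: int = 4000,
                   phase1_end: int = 20000) -> int:
    """Return the active pipeline phase for a given training iteration.

    Phase 0: early main-region reconstruction.
    Phase 1: ambient initialization and background completion.
    Phase 2: VLM-guided targeted optimization.
    """
    if iteration < phase0_end:
        return 0
    if iteration < phase1_end:
        return 1
    return 2
```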
The following table illustrates the progressive reconstruction process: starting from the output of Phase 0, we apply Hollow Elliptical Tube Initialization to initialize the surrounding environment, followed by the evolution of Phase 1 reconstruction across different iterations. It highlights the effectiveness of our proposed initialization strategy:
| Scene | Result from Phase 0 | Hollow Elliptical Tube Initialization | Phase 1 (1000 iters) | Phase 1 (2000 iters) | Phase 1 (3000 iters) |
| --- | --- | --- | --- | --- | --- |
| Mip-NeRF360/garden | ![]() | ![]() | ![]() | ![]() | ![]() |
| Mip-NeRF360/counter | ![]() | ![]() | ![]() | ![]() | ![]() |
We propose to wrap the main subject with a hollow elliptical tube to initialize the surrounding environment. Compared with full box-shaped enclosing strategies, this design keeps the subject visible in partial views.
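A minimal sketch of how such a tube could be sampled as an initial point set, assuming the subject sits near the origin; the function name, semi-axes, tube radius, and tilt angle are illustrative assumptions, not the repository's parameters.

```python
import numpy as np

def hollow_elliptical_tube_points(n: int = 10000,
                                  a: float = 8.0,      # ellipse semi-axis along x (assumed)
                                  b: float = 5.0,      # ellipse semi-axis along y (assumed)
                                  r: float = 2.0,      # tube cross-section radius (assumed)
                                  tilt_deg: float = 30.0,
                                  seed: int = 0) -> np.ndarray:
    """Sample n points on the surface of an oblique hollow elliptical tube.

    The tube follows an elliptical centerline around the subject and is
    tilted so its upper part reaches ceiling/sky regions and its lower
    part reaches the ground, while the hollow interior leaves the central
    subject visible from views that look across the tube.
    """
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 2 * np.pi, n)  # angle along the elliptical centerline
    v = rng.uniform(0.0, 2 * np.pi, n)  # angle around the tube cross-section

    # Elliptical centerline in the ground plane, offset by the tube surface.
    px = (a + r * np.cos(v)) * np.cos(u)
    py = (b + r * np.cos(v)) * np.sin(u)
    pz = r * np.sin(v)
    pts = np.stack([px, py, pz], axis=1)

    # Tilt the whole tube about the x-axis to make it oblique.
    t = np.deg2rad(tilt_deg)
    rot = np.array([[1.0, 0.0, 0.0],
                    [0.0, np.cos(t), -np.sin(t)],
                    [0.0, np.sin(t), np.cos(t)]])
    return pts @ rot.T
```

Each sampled point would then seed one environment Gaussian before Phase 1 training begins.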
Phase 2 leverages Vision-Language Models (VLMs) to detect underperforming regions in rendered images, and then performs targeted optimization on these regions to further improve the overall scene reconstruction quality.
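One way to turn such VLM feedback into targeted optimization is to convert the flagged regions into a per-pixel loss weight map. The sketch below assumes the VLM's response has already been parsed into normalized bounding boxes; the function name and `boost` factor are hypothetical, and the actual repository may consume the VLM output differently.

```python
import numpy as np

def region_weight_map(h: int, w: int, boxes, boost: float = 2.0) -> np.ndarray:
    """Build a per-pixel loss weight map from VLM-flagged regions.

    `boxes` are (x0, y0, x1, y1) in normalized [0, 1] image coordinates,
    assumed to be parsed from the VLM's response. Flagged pixels get their
    loss weight multiplied by `boost`, so densification and optimization
    concentrate on the underperforming regions.
    """
    weights = np.ones((h, w), dtype=np.float32)
    for x0, y0, x1, y1 in boxes:
        c0, r0 = int(x0 * w), int(y0 * h)
        c1, r1 = int(np.ceil(x1 * w)), int(np.ceil(y1 * h))
        weights[r0:r1, c0:c1] *= boost
    return weights
```

The resulting map could multiply the per-pixel photometric loss during Phase 2 refinement.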
Starting from only 100 random points, our method, powered by the Qwen3-VL-2B-Instruct vision-language model, achieves significantly more complete scene reconstruction after 20,000 iterations.
We evaluate our method on the Mip-NeRF 360 dataset, comparing Gaussian count and training loss convergence against vanilla FastGS under the same sparse initialization (100 random points, 20,000 iterations).
![]()
![]()
![]()
![]()
Comparison of Gaussian count (upper subplot) and training loss (lower subplot) on Mip-NeRF 360 dataset. Orange: Ours; Blue: FastGS.
The curves visualize the first two stages of our pipeline:
- The first stage (0–4000 iterations) corresponds to Phase 0, which focuses on fast reconstruction of the central main region.
- The second stage (4000–20,000 iterations) corresponds to Phase 1, where we employ an oblique hollow elliptical tube to enclose the main subject for effective ambient initialization and full-scene completion.
As observed from the comparisons, under sparse input conditions, our method achieves comparable training loss with fewer Gaussian primitives than vanilla FastGS, demonstrating higher efficiency in 3D scene representation.
Note: The above curves only reflect the reconstruction process of Phase 0 and Phase 1, without involving the VLM-guided targeted optimization in Phase 2.
Download the Qwen3-VL-2B-Instruct vision-language model and place it in the appropriate directory.
Organize your single-scene dataset as follows:

```
your_project/
├── images/
└── sparse/
    └── 0/
        ├── cameras.bin
        └── images.bin
```

Then start training with:

```shell
python train.py --source /path/to/your_project --model_path /path/to/output --qwen_model_path /path/to/Qwen3-VL-2B-Instruct
```

This project is built upon 3DGS, FastGS, and Qwen3-VL-2B-Instruct. We extend our gratitude to all the authors for their outstanding contributions and excellent repositories!