ARKitScenes Downloader for ARKitSceneRefer

Scripts that download ARKitScenes data for the scene IDs listed in the ARKitSceneRefer dataset. Downloads run in chunks, so a large dataset can be fetched incrementally.

Overview

ARKitSceneRefer contains 1,605 unique scenes with 3D object referring expressions. These scripts extract the scene IDs and download the corresponding data from ARKitScenes (either 3DOD or RAW format) in configurable chunks.
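
The extract-then-chunk idea described above can be sketched as follows. This is a minimal illustration, not the code in download_3dod_from_refer.py: the record layout (a `scene_id` key), the helper names `unique_scene_ids` and `get_chunk`, and the sample records are all assumptions made for the example. The chunk size of 100 matches the size mentioned under "Download more scenes".

```python
def unique_scene_ids(records):
    """Return the sorted, de-duplicated scene IDs found in annotation records.

    Assumption: each record is a dict with a "scene_id" key. The real
    ARKitSceneRefer annotation layout may differ.
    """
    return sorted({str(record["scene_id"]) for record in records})


def get_chunk(scene_ids, chunk_index, chunk_size=100):
    """Return the slice of scene IDs that belongs to one chunk."""
    start = chunk_index * chunk_size
    return scene_ids[start:start + chunk_size]


# Placeholder records for illustration only. "40777060" is the scene used
# elsewhere in this README; "00000001" and the expressions are made up.
records = [
    {"scene_id": "40777060", "expression": "placeholder expression A"},
    {"scene_id": "40777060", "expression": "placeholder expression B"},
    {"scene_id": "00000001", "expression": "placeholder expression C"},
]

ids = unique_scene_ids(records)
print(get_chunk(ids, chunk_index=0))
```

Sorting before slicing keeps chunk membership stable between runs, so chunk 3 always refers to the same scenes.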

Prerequisites

Python 3.12
PyTorch 2.8
curl (for downloading)
OpenRouter API Key

#create workspace
mkdir workspace
cd workspace
#copy project.zip into workspace
cp <path>/project.zip .
#unzip project and enter it
unzip project.zip
cd project
# Install dependencies
pip install pandas
pip install pycolmap-cuda12
pip install torchvision
#install segment-anything
pip install git+https://github.com/facebookresearch/segment-anything.git
#download sam model
curl https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth --output sam_vit_b_01ec64.pth
#install opencv
pip install opencv-python
#install numpy
pip install numpy
#install scipy
pip install scipy
#install open3d
pip install open3d
#install gdown
pip install gdown
#download the sample data (data.zip)
gdown 1EkrdJhtHZuG1ch9UBjnEkzzuTgMjJI_o
unzip data.zip
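
After the setup steps, a quick check can confirm that the pieces listed under Prerequisites are in place before a long run starts. The script below is a sketch, not part of the project. The import names are assumptions based on how these packages are normally imported (`cv2` for opencv-python, `segment_anything` for segment-anything, `pycolmap` for pycolmap-cuda12), and it assumes the SAM checkpoint sits in the current directory, as in the curl command above.

```python
import importlib.util
import os
import shutil
import sys

# Import names assumed for the pip packages installed above.
REQUIRED_MODULES = [
    "pandas", "torch", "torchvision", "pycolmap", "segment_anything",
    "cv2", "numpy", "scipy", "open3d", "gdown",
]


def check_environment(checkpoint="sam_vit_b_01ec64.pth"):
    """Return a dict mapping each prerequisite to True (found) or False."""
    status = {
        "python 3.12": sys.version_info[:2] == (3, 12),
        "curl": shutil.which("curl") is not None,
        "OPENROUTER_API_KEY": bool(os.environ.get("OPENROUTER_API_KEY")),
        checkpoint: os.path.isfile(checkpoint),
    }
    for name in REQUIRED_MODULES:
        # find_spec returns None when a top-level module is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    return status


for item, ok in check_environment().items():
    print(f"{'ok' if ok else 'MISSING':8s}{item}")
```

The API key is only needed for the evaluation step, so a MISSING entry there does not block COLMAP, SAM, or the 3D lift.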

Run COLMAP

python run_colmap.py --scene_path data/3dod/Training/40777060

When it finishes, you will see something like:

2025-12-16 03:14:38,071 - INFO - Reconstruction done: 209 images, 2911 points
2025-12-16 03:14:38,073 - INFO - Exporting reconstruction
2025-12-16 03:14:38,087 - INFO - exported point cloud: /workspace/project/data/3dod/Training/40777060/colmap_output/points3D.ply
2025-12-16 03:14:38,099 - INFO - Exported json: /workspace/project/data/3dod/Training/40777060/colmap_output/reconstruction.json
Reconstruction summary
Registered images: 209
3D points: 2911
Observations: 46429
Mean reprojection error: 0.3967 px
2025-12-16 03:14:38,100 - INFO - Finished successfully. output: /workspace/project/data/3dod/Training/40777060/colmap_output
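
The summary lines are easy to turn into derived statistics. The sketch below parses the four summary fields shown above and computes the mean track length (observations per 3D point) and the mean number of observations per registered image. The parser is an illustration written against this sample output only; it is not part of run_colmap.py, and the field names are taken from the log as printed here.

```python
import re

# Summary block copied from the sample output above.
SUMMARY = """\
Registered images: 209
3D points: 2911
Observations: 46429
Mean reprojection error: 0.3967 px
"""

FIELD = re.compile(
    r"^(Registered images|3D points|Observations|Mean reprojection error):"
    r"\s*([0-9.]+)"
)


def parse_summary(text):
    """Collect the numeric summary fields into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            stats[match.group(1)] = float(match.group(2))
    return stats


stats = parse_summary(SUMMARY)
track_length = stats["Observations"] / stats["3D points"]
obs_per_image = stats["Observations"] / stats["Registered images"]
print(f"Mean track length: {track_length:.2f} observations per point")
print(f"Mean observations per image: {obs_per_image:.1f}")
```

For this scene each 3D point is seen in about 16 images on average, and each image contributes about 222 observations.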

Run SAM

python sam_segmenter.py --scene_path data/3dod/Training/40777060 --keyframe_interval 1

When it finishes, you will see something like:

2025-12-16 03:34:24,373 - INFO - Segmenting 40777060_148.061...
2025-12-16 03:34:41,759 - INFO - Segmenting 40777060_151.060...
2025-12-16 03:34:59,257 - INFO - Segmenting 40777060_154.059...
2025-12-16 03:35:28,878 - INFO - Segmenting 40777060_157.057...
2025-12-16 03:35:46,948 - INFO - Segmenting 40777060_160.056...
2025-12-16 03:36:01,151 - INFO - Segmenting 40777060_163.055...
2025-12-16 03:36:17,655 - INFO - Segmented 22 frames
2025-12-16 03:36:17,655 - INFO - Frame 40777060_100.047: 102 objects detected
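
Segmentation is the slow step, so it can help to estimate the per-frame cost from the log before processing more scenes. The sketch below reads the timestamps of the sample lines above and treats the gap between consecutive lines as the time spent on one frame. That interpretation is an assumption about how sam_segmenter.py logs its progress, and the timing depends on the hardware used for this particular run.

```python
from datetime import datetime

# Log lines copied from the sample output above.
LOG = """\
2025-12-16 03:34:24,373 - INFO - Segmenting 40777060_148.061...
2025-12-16 03:34:41,759 - INFO - Segmenting 40777060_151.060...
2025-12-16 03:34:59,257 - INFO - Segmenting 40777060_154.059...
2025-12-16 03:35:28,878 - INFO - Segmenting 40777060_157.057...
2025-12-16 03:35:46,948 - INFO - Segmenting 40777060_160.056...
2025-12-16 03:36:01,151 - INFO - Segmenting 40777060_163.055...
2025-12-16 03:36:17,655 - INFO - Segmented 22 frames
"""


def timestamps(log_text):
    """Parse the leading timestamp of every log line."""
    stamps = []
    for line in log_text.splitlines():
        stamp = line.split(" - ", 1)[0]
        stamps.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"))
    return stamps


stamps = timestamps(LOG)
gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
mean_gap = sum(gaps) / len(gaps)
print(f"Frames timed: {len(gaps)}")
print(f"Mean time per frame: {mean_gap:.1f} s")
print(f"Estimate for 22 frames: {22 * mean_gap / 60:.1f} min")
```

On this sample the six timed frames average just under 19 seconds each, which puts a 22-frame scene at roughly seven minutes.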

Run 3D Lift

python lift_masks_to_3d.py --scene_path data/3dod/Training/40777060 --min_mask_area 200 --min_points 75 --fuse_radius 0.2 --min_views 2 --max_depth 5.0

When it finishes, you will see something like:

2025-12-16 03:44:15,340 - INFO - Processing 22 SAM mask files
2025-12-16 03:44:18,142 - INFO - Saved 258 fragments
2025-12-16 03:44:20,197 - INFO - Saved 37 fused objects
2025-12-16 03:44:20,218 - INFO - Summary written to /workspace/project/data/3dod/Training/40777060/objects3d/summary.json
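
The log shows 258 per-frame fragments being reduced to 37 fused objects. One way such a fusion step could work is sketched below: fragments whose centroids lie within `fuse_radius` of an existing cluster are merged into it, and clusters seen in fewer than `min_views` distinct frames are dropped. This is a hypothetical greedy scheme chosen for illustration. The actual logic in lift_masks_to_3d.py may differ, the fragment layout (`frame` and `centroid` keys) is invented for the example, and treating the radius as meters is an assumption.

```python
import math


def fuse_fragments(fragments, fuse_radius=0.2, min_views=2):
    """Greedily merge fragments by centroid distance, then filter by views.

    Each fragment is a dict with a "frame" name and a 3D "centroid".
    """
    clusters = []
    for frag in fragments:
        best, best_dist = None, fuse_radius
        for cluster in clusters:
            dist = math.dist(frag["centroid"], cluster["centroid"])
            if dist <= best_dist:
                best, best_dist = cluster, dist
        if best is None:
            # No cluster is close enough, so this fragment starts a new one.
            clusters.append({"centroid": tuple(frag["centroid"]),
                             "members": [frag]})
            continue
        best["members"].append(frag)
        count = len(best["members"])
        # Recompute the cluster centroid as the mean of member centroids.
        best["centroid"] = tuple(
            sum(m["centroid"][axis] for m in best["members"]) / count
            for axis in range(3)
        )

    fused = []
    for cluster in clusters:
        views = sorted({m["frame"] for m in cluster["members"]})
        if len(views) >= min_views:
            fused.append({"centroid": cluster["centroid"], "views": views})
    return fused


# Made-up centroids; the frame names come from the SAM log above.
fragments = [
    {"frame": "40777060_148.061", "centroid": (0.00, 0.0, 1.00)},
    {"frame": "40777060_151.060", "centroid": (0.05, 0.0, 1.02)},
    {"frame": "40777060_148.061", "centroid": (2.00, 0.0, 1.00)},
]
objects = fuse_fragments(fragments, fuse_radius=0.2, min_views=2)
print(len(objects), objects[0]["views"])
```

In this toy input the first two fragments merge into one object seen from two frames, while the third is observed only once and is discarded by the `min_views` filter.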

Run Evaluation

export OPENROUTER_API_KEY=<YOUR_OPENROUTER_API_KEY>
python run_evaluation.py --data_dir data/3dod/ --max_scenes 1

When it finishes, you will see the report:

CAPTIONING METHOD COMPARISON
Metric                          baseline      multiview  context_aware
Objects                               37             37             37
Avg Word Count                      86.4          179.7          107.2
Avg Unique Words                    63.3          114.6           79.4
Avg Attributes                       7.7           12.1            7.6
Detail Score                       1.000          1.000          1.000
Hallucination Rate                 56.8%          48.6%          54.1%
Label Accuracy                       N/A            N/A            N/A
CLIP Score                           N/A            N/A            N/A
Consistency Score                    N/A            N/A            N/A
SUMMARY:
Multi-view vs Baseline detail score: +0.000
Context-aware vs Baseline detail score: +0.000
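
In this run the detail score is 1.000 for all three methods, which is why both SUMMARY deltas are zero. The other rows still separate the methods, and the sketch below computes those differences against the baseline. The numbers are copied from the report above; the dictionary layout and the helper name `compare_to_baseline` are illustrative and are not produced by run_evaluation.py.

```python
# Values copied from the sample report above (hallucination rate in percent).
REPORT = {
    "baseline": {"avg_word_count": 86.4, "avg_attributes": 7.7,
                 "hallucination_rate": 56.8},
    "multiview": {"avg_word_count": 179.7, "avg_attributes": 12.1,
                  "hallucination_rate": 48.6},
    "context_aware": {"avg_word_count": 107.2, "avg_attributes": 7.6,
                      "hallucination_rate": 54.1},
}


def compare_to_baseline(report, baseline="baseline"):
    """Return per-method differences relative to the baseline method."""
    base = report[baseline]
    deltas = {}
    for method, metrics in report.items():
        if method == baseline:
            continue
        deltas[method] = {
            "word_count_ratio":
                metrics["avg_word_count"] / base["avg_word_count"],
            "attributes_delta":
                metrics["avg_attributes"] - base["avg_attributes"],
            "hallucination_delta_pts":
                metrics["hallucination_rate"] - base["hallucination_rate"],
        }
    return deltas


for method, delta in compare_to_baseline(REPORT).items():
    print(
        f"{method}: {delta['word_count_ratio']:.2f}x words, "
        f"{delta['attributes_delta']:+.1f} attributes, "
        f"{delta['hallucination_delta_pts']:+.1f} pts hallucination"
    )
```

Read this way, the multiview captions are about twice as long as the baseline with a hallucination rate 8.2 points lower, while the context-aware captions change more modestly. With a single scene and 37 objects, these differences should be treated as indicative only.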

Download more scenes

#from project, go up one level to workspace
cd ..
#clone both repositories, ARKitScenes and ARKitSceneRefer
git clone https://github.com/apple/ARKitScenes.git
git clone https://github.com/ku-nlp/ARKitSceneRefer.git
#download more scenes
cd project
python download_3dod_from_refer.py --chunk_index 0
#this will download a chunk of 100 scenes
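
To fetch everything, the same command can be repeated with increasing chunk indices. The sketch below only builds the command strings so they can be reviewed or pasted into a shell. It assumes chunk indices start at 0 and that 100-scene chunks cover all 1,605 scenes mentioned in the Overview, which gives 17 chunks with 5 scenes in the last one. Whether the script accepts every index in that range, and whether every scene is available in 3DOD format, is not confirmed here.

```python
import math
import shlex


def chunk_commands(total_scenes, chunk_size=100,
                   script="download_3dod_from_refer.py"):
    """Build one download command per chunk index, starting at 0."""
    n_chunks = math.ceil(total_scenes / chunk_size)
    return [
        shlex.join(["python", script, "--chunk_index", str(index)])
        for index in range(n_chunks)
    ]


commands = chunk_commands(total_scenes=1605)
print(f"{len(commands)} chunks")
print(commands[0])
print(commands[-1])
```

Running the chunks one at a time makes it possible to stop when disk space runs low and resume later from the next index.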
