Xiaoyu Zhan* · Wenxuan Huang* · Hao Sun* · Xinyu Fu · Changfeng Ma · Shaosheng Cao✝ · Bohan Jia · Shaohui Lin · Zhenfei Yin · Lei Bai · Wanli Ouyang · Yuanqi Li · Jie Guo · Yanwen Guo✝
This is the official implementation of the paper "Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models"
We adopt a two-stage training process: the training data for the first (SFT) stage is hosted at SFT-data, and the data for the second (RL) stage at RL-data.
You can also construct your own datasets. The raw data for our SFT stage is sourced from MVImgNet, and the data for our RL stage is derived from the SAT dataset.
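If you build your own SFT data, each sample should follow LLaMA-Factory's sharegpt-style multi-image schema (a `messages` list plus an `images` list, with one image path per `<image>` tag). The field names below come from that schema; the question, answer, and image paths are purely illustrative placeholders, not entries from our dataset.

```python
import json

# Illustrative SFT sample in LLaMA-Factory's sharegpt-style format.
# The question/answer text and image paths are placeholders.
sample = {
    "messages": [
        {
            "role": "user",
            "content": "<image><image>These are two views of the same object. "
                       "Viewed from the second image, is the handle on the left or the right?",
        },
        {"role": "assistant", "content": "The handle is on the left."},
    ],
    # One path per <image> tag, in order of appearance.
    "images": ["mvimgnet/obj_0001/view_00.jpg", "mvimgnet/obj_0001/view_12.jpg"],
}

# The dataset file is a JSON list of such samples.
print(json.dumps([sample], ensure_ascii=False, indent=2))
```

Remember to register the resulting file in LLaMA-Factory's `data/dataset_info.json` before training.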
SFT Data: SFT-data
Base model on Hugging Face: Qwen-2.5-VL Model
Also on ModelScope: ModelScope
We use LLaMA Factory, one of the most popular instruction-tuning frameworks, for model fine-tuning. First, complete the installation by following the project repository at LLaMA-Factory, then follow the official tutorials to run the training.
During the Supervised Fine-Tuning (SFT) phase, we configured the training with 2 epochs, a learning rate of 5e-6, a batch size of 128, and 50 warm-up steps.
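With LLaMA-Factory installed, the hyper-parameters above map onto a training config roughly like the following sketch. The field names follow LLaMA-Factory's standard YAML schema, but the model, dataset, and output paths are placeholders, and the batch-size split across devices is an assumption — adjust it so the global batch size comes out to 128 on your hardware.

```yaml
### model (placeholder: point at your local Qwen2.5-VL checkpoint)
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full

### dataset (register your SFT data in data/dataset_info.json first)
dataset: actial_sft        # placeholder dataset name
template: qwen2_vl
cutoff_len: 4096

### train (matches the settings reported above)
num_train_epochs: 2.0
learning_rate: 5.0e-6
per_device_train_batch_size: 1
gradient_accumulation_steps: 128   # adjust so global batch size = 128
warmup_steps: 50
bf16: true

### output
output_dir: saves/qwen2_5vl-7b/actial_sft   # placeholder
```

A config like this is launched with `llamafactory-cli train <config>.yaml`.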
We employ the Verl framework for the second-stage reinforcement learning training. You can either use the latest version of Verl or install the copy of Verl included in this repository.
Important
We adopt a custom reward function, detailed under the path verl/utils/reward_score/reward_custom. Pay particular attention to multi-image input and Verl's image-processing mechanism.
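Our actual reward lives in verl/utils/reward_score/reward_custom; the sketch below only illustrates the shape such a function takes. The signature follows Verl's custom-reward convention, while the scoring logic is a simplified exact-match stand-in, not our real reward.

```python
import re

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Simplified stand-in reward: full credit for a correct boxed answer,
    small format credit otherwise. The real reward in reward_custom is richer."""
    # Extract the final answer from \boxed{...}; fall back to the raw string.
    match = re.search(r"\\boxed\{([^}]*)\}", solution_str)
    answer = match.group(1).strip() if match else solution_str.strip()

    if answer.lower() == str(ground_truth).strip().lower():
        return 1.0   # correct answer
    if match is not None:
        return 0.1   # well-formatted but wrong
    return 0.0       # no parsable answer
```

Point Verl's custom-reward setting at the module and function you actually use.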
# Modify first: supplement or adjust the necessary parameters.
sh run/run_qwen2_5vl-7b_mix_500step.sh
We evaluated the model using VLMEvalKit. First, you need to install it.
cd VLMEvalKit
pip install -e .
Next, launch the model service with vLLM, referring to the script run_actial.sh.
Finally, modify the actial_test.json file and then launch the evaluation.
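actial_test.json follows VLMEvalKit's config schema: a "model" section and a "data" section, each mapping a name to a class plus keyword arguments. Everything below other than those two top-level keys is a placeholder — substitute the model class, endpoint, and benchmarks you actually use, per the VLMEvalKit docs.

```json
{
  "model": {
    "Actial-7B": {
      "class": "YourModelClass",
      "api_base": "http://localhost:8000/v1/chat/completions"
    }
  },
  "data": {
    "YourBenchmark": {
      "class": "YourDatasetClass"
    }
  }
}
```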
python run.py --config actial_test.json
Check the VLMEvalKit docs for more information.
This work was supported by the National Natural Science Foundation of China (62032011) and the Natural Science Foundation of Jiangsu Province (BK20211147).
Many excellent resources also greatly benefited our work:
If you find this work helpful, please consider citing our paper.
@inproceedings{zhan2025actial,
  title={Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models},
  author={Xiaoyu Zhan and Wenxuan Huang and Hao Sun and Xinyu Fu and Changfeng Ma and Shaosheng Cao and Bohan Jia and Shaohui Lin and Zhenfei Yin and Lei Bai and Wanli Ouyang and Yuanqi Li and Jie Guo and Yanwen Guo},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=jquTBzt3Av}
}

Contact: Zhan, Xiaoyu (zhanxy@smail.nju.edu.cn), Sun, Hao (warm_snows@163.com), and Fu, Xinyu (xinyu.fu@smail.nju.edu.cn).
