
Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models [NeurIPS 2025]

Xiaoyu Zhan* · Wenxuan Huang* · Hao Sun* · Xinyu Fu · Changfeng Ma · Shaosheng Cao✝ · Bohan Jia · Shaohui Lin · Zhenfei Yin · Lei Bai · Wanli Ouyang · Yuanqi Li · Jie Guo · Yanwen Guo✝

arXiv Link

This is the official implementation of the paper "Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models".


teaser

Overview

📍 Data

1. Training Data

We adopt a two-stage training process: the training data for the first (SFT) stage is hosted at SFT-data, and the data for the second (RL) stage at RL-data.

You can also construct your own datasets. The raw data for our SFT stage is sourced from MVImgNet, and the data for our RL stage is derived from the SAT dataset.

2. Model

Actial-7B

🚀 SFT Training

1. Download Preprocessed Data and Model

SFT Data: SFT-data

Base model on Hugging Face: Qwen-2.5-VL Model

Also on ModelScope: modelscope

2. Train Model

We use LLaMA-Factory, one of the most widely used instruction-tuning frameworks, for model fine-tuning. First, complete the installation by following the instructions in the LLaMA-Factory repository, then follow its official tutorials to run the training process.

During the Supervised Fine-Tuning (SFT) phase, we configured the training with 2 epochs, a learning rate of 5e-6, a batch size of 128, and 50 warm-up steps.
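With those settings, a LLaMA-Factory training config might look like the sketch below. Only the hyperparameters named above (2 epochs, learning rate 5e-6, batch size 128, 50 warm-up steps) come from this README; the model path, dataset name, output directory, and all other fields are illustrative placeholders — see the LLaMA-Factory examples for the full schema.

```yaml
### Hypothetical LLaMA-Factory SFT config sketch, not the official one
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct
stage: sft
do_train: true
finetuning_type: full
dataset: actial_sft              # placeholder: register the SFT-data split here
template: qwen2_vl
output_dir: saves/actial-sft     # placeholder
num_train_epochs: 2.0
learning_rate: 5.0e-6
warmup_steps: 50
per_device_train_batch_size: 8
gradient_accumulation_steps: 16  # 8 x 16 = effective batch size 128 on one GPU;
                                 # rescale if training on multiple GPUs
bf16: true
```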

🎯 RL Training

1. Environment

We employ the Verl framework for the two-stage reinforcement learning training. You can either use the latest version of Verl or install the Verl package included in this repository.

Important

We adopt a custom reward function, detailed under the path verl/utils/reward_score/reward_custom. Note that multi-image input and Verl's image-processing mechanism may also require attention.
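The actual reward lives at verl/utils/reward_score/reward_custom; purely as a rough illustration of the shape such a verifiable reward takes, here is a toy exact-match scorer. The function name, signature, and matching logic are assumptions for illustration, not the repository's implementation.

```python
import re


def compute_score(solution_str: str, ground_truth: str) -> float:
    """Toy verifiable reward: 1.0 if the model's final answer matches the
    ground truth (case-insensitive), else 0.0.

    Hypothetical sketch only -- see verl/utils/reward_score/reward_custom
    for the reward actually used in this project.
    """
    # Prefer an answer wrapped in \boxed{...}; otherwise take the last line.
    match = re.search(r"\\boxed\{([^}]*)\}", solution_str)
    answer = match.group(1) if match else solution_str.strip().splitlines()[-1]
    return 1.0 if answer.strip().lower() == ground_truth.strip().lower() else 0.0
```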

2. Training

# Modify first: supplement or adjust the necessary parameters.
sh run/run_qwen2_5vl-7b_mix_500step.sh

🎨 Eval

We evaluated the model using VLMEvalKit. First, you need to install it.

cd VLMEvalKit
pip install -e .

Next, launch the model service with vLLM, referring to the script run_actial.sh.

Finally, modify the actial_test.json file and then launch the evaluation.

python run.py --config actial_test.json
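For reference, a VLMEvalKit --config file pairs model entries with benchmark entries. The sketch below is hypothetical: every name, class, and field is a placeholder, and the authoritative schema (plus actial_test.json itself) should be taken from the repository and the VLMEvalKit documentation.

```json
{
    "model": {
        "Actial-7B": {
            "class": "PLACEHOLDER_MODEL_CLASS",
            "model": "actial"
        }
    },
    "data": {
        "PLACEHOLDER_BENCHMARK": {
            "class": "PLACEHOLDER_DATASET_CLASS"
        }
    }
}
```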

Check the VLMEvalKit docs for more information.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (62032011) and the Natural Science Foundation of Jiangsu Province (BK20211147).

Many powerful open-source resources also greatly benefited our work.

Citation

If you find this work helpful, please consider citing our paper.

@inproceedings{zhan2025actial,
  title={Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models},
  author={Xiaoyu Zhan and Wenxuan Huang and Hao Sun and Xinyu Fu and Changfeng Ma and Shaosheng Cao and Bohan Jia and Shaohui Lin and Zhenfei Yin and Lei Bai and Wanli Ouyang and Yuanqi Li and Jie Guo and Yanwen Guo},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=jquTBzt3Av}
}

Contact

Zhan, Xiaoyu (zhanxy@smail.nju.edu.cn) and Sun, Hao (warm_snows@163.com) and Fu, Xinyu (xinyu.fu@smail.nju.edu.cn).
