Xiaoyu Zhan* · Wenxuan Huang* · Hao Sun* · Xinyu Fu · Changfeng Ma · Shaosheng Cao✝ · Bohan Jia · Shaohui Lin · Zhenfei Yin · Lei Bai · Wanli Ouyang · Yuanqi Li · Jie Guo · Yanwen Guo✝
This is the official implementation of the paper "Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models"
We adopt a two-stage training process: the training data for the first (SFT) stage is hosted at SFT-data, and the data for the second (RL) stage at RL-data.
You can also construct your own datasets. The raw data for our SFT stage is sourced from MVImgNet, and the data for our RL stage is derived from the SAT dataset.
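If you build your own SFT data, each sample should follow LLaMA-Factory's sharegpt-style multi-image schema (a `messages` list plus an `images` list, with one image path per `<image>` tag). The field names below come from that schema; the question, answer, and image paths are purely illustrative placeholders, not entries from our dataset.

```python
import json

# Illustrative SFT sample in LLaMA-Factory's sharegpt-style format.
# The question/answer text and image paths are placeholders.
sample = {
    "messages": [
        {
            "role": "user",
            "content": "<image><image>These are two views of the same object. "
                       "Viewed from the second image, is the handle on the left or the right?",
        },
        {"role": "assistant", "content": "The handle is on the left."},
    ],
    # One path per <image> tag, in order of appearance.
    "images": ["mvimgnet/obj_0001/view_00.jpg", "mvimgnet/obj_0001/view_12.jpg"],
}

# The dataset file is a JSON list of such samples.
print(json.dumps([sample], ensure_ascii=False, indent=2))
```

Remember to register the resulting file in LLaMA-Factory's `data/dataset_info.json` before training.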
SFT Data: SFT-data
Base model on Hugging Face: Qwen-2.5-VL Model
Also on ModelScope: ModelScope
We use LLaMA Factory, one of the most popular instruction-tuning frameworks, for model fine-tuning. First, complete the installation by following the project repository at LLaMA-Factory, then follow the official tutorials to run the training.
During the Supervised Fine-Tuning (SFT) phase, we configured the training with 2 epochs, a learning rate of 5e-6, a batch size of 128, and 50 warm-up steps.
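With LLaMA-Factory installed, the hyper-parameters above map onto a training config roughly like the following sketch. The field names follow LLaMA-Factory's standard YAML schema, but the model, dataset, and output paths are placeholders, and the batch-size split across devices is an assumption — adjust it so the global batch size comes out to 128 on your hardware.

```yaml
### model (placeholder: point at your local Qwen2.5-VL checkpoint)
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full

### dataset (register your SFT data in data/dataset_info.json first)
dataset: actial_sft        # placeholder dataset name
template: qwen2_vl
cutoff_len: 4096

### train (matches the settings reported above)
num_train_epochs: 2.0
learning_rate: 5.0e-6
per_device_train_batch_size: 1
gradient_accumulation_steps: 128   # adjust so global batch size = 128
warmup_steps: 50
bf16: true

### output
output_dir: saves/qwen2_5vl-7b/actial_sft   # placeholder
```

A config like this is launched with `llamafactory-cli train <config>.yaml`.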
We employ the Verl framework for the second-stage reinforcement learning training. You can either use the latest version of Verl or install the copy of Verl included in this repository.
Important
We adopt a custom reward function, detailed under the path verl/utils/reward_score/reward_custom. Pay particular attention to multi-image input and Verl's image-processing mechanism.
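Our actual reward lives in verl/utils/reward_score/reward_custom; the sketch below only illustrates the shape such a function takes. The signature follows Verl's custom-reward convention, while the scoring logic is a simplified exact-match stand-in, not our real reward.

```python
import re

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Simplified stand-in reward: full credit for a correct boxed answer,
    small format credit otherwise. The real reward in reward_custom is richer."""
    # Extract the final answer from \boxed{...}; fall back to the raw string.
    match = re.search(r"\\boxed\{([^}]*)\}", solution_str)
    answer = match.group(1).strip() if match else solution_str.strip()

    if answer.lower() == str(ground_truth).strip().lower():
        return 1.0   # correct answer
    if match is not None:
        return 0.1   # well-formatted but wrong
    return 0.0       # no parsable answer
```

Point Verl's custom-reward setting at the module and function you actually use.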
# Modify first: supplement or adjust the necessary parameters.
sh run/run_qwen2_5vl-7b_mix_500step.sh
We evaluated the model using VLMEvalKit. First, you need to install it.
cd VLMEvalKit
pip install -e .
Next, launch the model service with vLLM, referring to the script run_actial.sh.
Finally, modify the actial_test.json file and then launch the evaluation.
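actial_test.json follows VLMEvalKit's config schema: a "model" section and a "data" section, each mapping a name to a class plus keyword arguments. Everything below other than those two top-level keys is a placeholder — substitute the model class, endpoint, and benchmarks you actually use, per the VLMEvalKit docs.

```json
{
  "model": {
    "Actial-7B": {
      "class": "YourModelClass",
      "api_base": "http://localhost:8000/v1/chat/completions"
    }
  },
  "data": {
    "YourBenchmark": {
      "class": "YourDatasetClass"
    }
  }
}
```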
python run.py --config actial_test.json
Check the VLMEvalKit docs for more information.
This work was supported by the National Natural Science Foundation of China (62032011) and the Natural Science Foundation of Jiangsu Province (BK20211147).
Many excellent resources also greatly benefited our work:
If you find this work helpful, please consider citing our paper.
@inproceedings{zhan2025actial,
  title={Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models},
  author={Xiaoyu Zhan and Wenxuan Huang and Hao Sun and Xinyu Fu and Changfeng Ma and Shaosheng Cao and Bohan Jia and Shaohui Lin and Zhenfei Yin and Lei Bai and Wanli Ouyang and Yuanqi Li and Jie Guo and Yanwen Guo},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=jquTBzt3Av}
}

Contact: Zhan, Xiaoyu (zhanxy@smail.nju.edu.cn), Sun, Hao (warm_snows@163.com), and Fu, Xinyu (xinyu.fu@smail.nju.edu.cn).
