This repository is the official implementation for "AeroReformer: Aerial Referring Transformer for UAV-Based Referring Image Segmentation" (paper).
🚀 AeroReformer is a novel vision-language framework for UAV-based referring image segmentation (UAV-RIS), designed to tackle the unique challenges of aerial imagery, such as complex spatial scales, occlusions, and diverse object orientations.
Our approach integrates a Vision-Language Cross-Attention Module (VLCAM) for enhanced multimodal understanding and a Rotation-Aware Multi-Scale Fusion (RAMSF) decoder to improve segmentation accuracy in aerial scenes.
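The actual module implementations live in this repository's model code. Purely as an illustration of the cross-attention idea behind VLCAM (a minimal sketch built on `torch.nn.MultiheadAttention`, not the module used in the paper), the visual features can act as queries attending over the language tokens:

```python
import torch
import torch.nn as nn

class VisionLanguageCrossAttentionSketch(nn.Module):
    """Illustrative only: flattened visual features (queries) attend to language tokens (keys/values)."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis:  (B, H*W, C) flattened visual feature map
        # lang: (B, L, C)   language token embeddings
        fused, _ = self.attn(query=vis, key=lang, value=lang)
        return self.norm(vis + fused)  # residual connection keeps the visual stream

# Toy shapes: batch of 2, a 30x40 feature map with C=256, and 20 language tokens.
out = VisionLanguageCrossAttentionSketch()(torch.randn(2, 1200, 256), torch.randn(2, 20, 256))
print(out.shape)  # torch.Size([2, 1200, 256])
```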
The code has been verified only on Ubuntu, with PyTorch v2.3.1 and Python 3.10. Please adapt and test it on your own platform as needed.
- Clone this repository.
- Change the directory to the root of this repository.
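  For example (the URL below is a placeholder; substitute this repository's actual address):

  ```bash
  git clone https://github.com/<user>/AeroReformer.git   # placeholder URL
  cd AeroReformer
  ```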
- Create a new Conda environment with Python 3.10 and then activate it:

  ```bash
  conda create -n AeroReformer python=3.10
  conda activate AeroReformer
  ```
- Install PyTorch with CUDA 12.4 support (ensure your NVIDIA driver is compatible):

  ```bash
  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
  ```
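  To confirm that PyTorch sees your GPU (a quick sanity check; not part of the original setup steps):

  ```bash
  python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
  ```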
- Install the packages listed in `requirements.txt` using `pip`:

  ```bash
  pip install -r requirements.txt
  ```
- Create the `./pretrained_weights` directory to store the weights:

  ```bash
  mkdir ./pretrained_weights
  ```
- Download the pre-trained classification weights of the Swin Transformer from this link.
- Place the downloaded `.pth` file into the `./pretrained_weights` directory. These weights are necessary for initializing the model during training.
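  A quick way to confirm the file is readable (the filename below is an example; use the name of the checkpoint you actually downloaded):

  ```bash
  python -c "import torch; ckpt = torch.load('./pretrained_weights/swin_base_patch4_window12_384_22k.pth', map_location='cpu'); print(type(ckpt))"
  ```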
Warning: Experiments are conducted on the UAVid-RIS and VDD-RIS datasets. The text expressions in these datasets were generated by the Qwen and LLaMA models and may contain errors or inconsistencies. We welcome any collaboration to help improve the quality of the data.
To ensure full reproducibility, we provide the necessary preprocessing code in this GitHub repository. This allows you to generate the exact image data used in our experiments from the original datasets. The text labels, which were generated by our team, are directly available for download from Hugging Face.
Follow these steps to prepare the datasets for training and testing:
Download the raw datasets and save them under `./data/UAVid_RIS` and `./data/VDD_RIS` (or similar paths you prefer).
- UAVid: UAVid Official Website
- VDD: Hugging Face
Download the generated text expressions and save them in the corresponding dataset folders. Extract them if necessary.
- UAVid-RIS Texts: Hugging Face link
- VDD-RIS Texts: Hugging Face link
After downloading and extracting, the dataset folders should look like this:
UAVid-RIS:

```
$DATA_PATH_UAVID  # ./data/UAVid_RIS
├── uavid_ris
│   ├── refs(uow).p
│   ├── refs_llama(uow).p
│   └── instances.json
└── images
    └── uavid_ris
        ├── PNGImages
        ├── ann_split
        ├── ann_split_llama
        └── annotations
```
VDD-RIS:

```
$DATA_PATH_VDD  # ./data/VDD_RIS
├── vdd_ris
│   ├── refs(uow).p
│   ├── refs_llama(uow).p
│   └── instances.json
└── images
    └── vdd_ris
        ├── PNGImages
        ├── ann_split
        ├── ann_split_llama
        └── annotations
```
Use the preprocessing scripts provided in this repository to split the images for training, validation, and testing:
```bash
# UAVid-RIS
python tool/image_split_uavid.py train
python tool/image_split_uavid.py val
python tool/image_split_uavid.py test

# VDD-RIS
python tool/image_split_vdd.py train
python tool/image_split_vdd.py val
python tool/image_split_vdd.py test
```

You can also access the UAV images we captured at the University of Warwick in the `demo` subfolder of this repository.
We use DistributedDataParallel from PyTorch for training. To run on a single GPU (ID 0), use the following commands.
Train on UAVid-RIS:

```bash
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=12345 train.py --dataset uavid_ris --model_id AeroReformer --epochs 40 --img_size 480 --refer_data_root ./data/UAVid_RIS/ --mha 4-4-4-4 --output-dir ./checkpoints/UAVid_RIS
```

Train on VDD-RIS:
```bash
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=12345 train.py --dataset vdd_ris --model_id AeroReformer --epochs 10 --img_size 480 --refer_data_root ./data/VDD_RIS/ --mha 4-4-4-4 --output-dir ./checkpoints/VDD_RIS
```
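To train on multiple GPUs, raise `--nproc_per_node` and expose the extra devices. A sketch assuming two GPUs with IDs 0 and 1 (this variant is not part of the commands above):

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=12345 train.py --dataset uavid_ris --model_id AeroReformer --epochs 40 --img_size 480 --refer_data_root ./data/UAVid_RIS/ --mha 4-4-4-4 --output-dir ./checkpoints/UAVid_RIS
```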
Test on UAVid-RIS:

```bash
python test.py --swin_type base --dataset uavid_ris --resume ./checkpoints/UAVid_RIS/model_best_AeroReformer.pth --model_id AeroReformer --split test --workers 4 --window12 --img_size 480 --refer_data_root ./data/UAVid_RIS/ --mha 4-4-4-4
```

Test on VDD-RIS:
```bash
python test.py --swin_type base --dataset vdd_ris --resume ./checkpoints/VDD_RIS/model_best_AeroReformer.pth --model_id AeroReformer --split test --workers 4 --window12 --img_size 480 --refer_data_root ./data/VDD_RIS/ --mha 4-4-4-4
```

The code in this repository is built upon the work of LAVT and RMSIN. We would like to thank the authors for making their projects open source.