MV-MAE is a hierarchical video architecture designed specifically for top-down UAV (Unmanned Aerial Vehicle) perspectives. It leverages Motion Vectors (MV) and I-frames from compressed bitstreams to achieve efficient and accurate action recognition.
- Compressed Domain Processing: Directly uses Motion Vectors from the compressed bitstream, avoiding expensive optical flow computation and heavy 3D convolutions (see the extraction sketch after this list).
- 2-Phase Supervised Fine-Tuning (SFT):
  - Phase 1 (Warmup): Freezes the backbone encoders to stabilize the randomly initialized Temporal Transformer and Head.
  - Phase 2 (End-to-End): Unfreezes the top layers of the backbone for task-specific adaptation.
- Precision Evaluation: Logs Accuracy, Macro-F1, Precision, and Recall. Macro metrics provide a balanced view across the 155 UAV action classes.
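As an illustration of what compressed-domain input looks like, below is a minimal sketch of reading Motion Vectors with PyAV (the `av` dependency). The file path is a placeholder, and this is not the repository's actual loader (`mv_mae/data/video_loader.py`); it only shows that MVs come essentially for free from the decoder, with no optical flow pass.

```python
import av

def iter_motion_vectors(path):
    """Yield (pts, structured ndarray of MVs) per frame.

    Sketch only: relies on FFmpeg's +export_mvs flag attaching motion
    vectors as frame side data; I-frames yield no motion vectors.
    """
    container = av.open(path)
    stream = container.streams.video[0]
    # Ask the decoder to export motion vectors as side data.
    stream.codec_context.options = {"flags2": "+export_mvs"}

    for frame in container.decode(stream):
        mvs = frame.side_data.get("MOTION_VECTORS")
        if mvs is not None:
            # Fields include src_x/src_y, dst_x/dst_y, motion_x/motion_y.
            yield frame.pts, mvs.to_ndarray()

for pts, mv in iter_motion_vectors("clip.mp4"):  # placeholder path
    print(pts, mv.shape)
```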
The MV-MAE workflow consists of three major stages:
1. Codebook generation: Before pretraining, a discrete codebook must be generated from the Motion Vector distributions to enable the masked prediction task.

```bash
python pretraining/codebook/build_codebook.py
```
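Conceptually, this step clusters MV patches into a discrete vocabulary whose indices later serve as targets for masked prediction. A minimal sketch, assuming k-means over flattened patches; the function names, patch layout, and codebook size are illustrative, not what `build_codebook.py` necessarily does:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_mv_codebook(mv_patches: np.ndarray, codebook_size: int = 512) -> np.ndarray:
    """Cluster flattened MV patches of shape (N, patch_dim) into codes.

    Illustrative only: the real build_codebook.py may use a different
    clustering scheme, patch layout, or codebook size.
    """
    kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=0)
    kmeans.fit(mv_patches)
    return kmeans.cluster_centers_  # (codebook_size, patch_dim)

def tokenize(mv_patches: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each patch the index of its nearest code; these indices
    become the prediction targets during masked pretraining."""
    dists = ((mv_patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)
```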
2. Motion Encoder pretraining: Pretrain the Motion Encoder using the generated codebook. This stage teaches the model to reconstruct masked motion patches.

```bash
python pretraining/pretrain.py
```
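The masked-prediction objective itself is typically a cross-entropy over codebook indices at the masked positions (BEiT-style). A sketch assuming that formulation; `pretrain.py` may differ in detail:

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(logits: torch.Tensor, targets: torch.Tensor,
                           mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy restricted to masked patch positions.

    logits:  (B, N, codebook_size) predictions for every patch slot
    targets: (B, N) ground-truth codebook indices of the MV patches
    mask:    (B, N) bool tensor, True where the patch was masked
    """
    return F.cross_entropy(logits[mask], targets[mask])
```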
3. Supervised fine-tuning: The final stage, in which the Context Encoder (backbone) and the pretrained Motion Encoder are fused and fine-tuned for action recognition using the 2-phase logic.

```bash
python train.py
```

Repository layout:

```
MV_MAE/
├── mv_mae/
│   ├── models/
│   │   ├── mv_mae.py       # Main architecture (Context + Motion + Temporal)
│   │   ├── encoders.py     # I-frame and MV backbones
│   │   └── temporal.py     # Temporal Transformer aggregator
│   ├── data/
│   │   ├── dataset.py      # UAVHuman Dataset loader
│   │   └── video_loader.py # GOP-based video decoding
│   └── utils/
│       ├── checkpoint.py   # Phase-aware checkpointing
│       └── optimizer.py    # Multi-LR parameter group management
├── pretraining/            # Pretraining scripts and model definitions
├── model_zoo/              # Pretrained weights (CLIP, DMVMAE)
└── train.py                # 2-Phase SFT training pipeline
```
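Since `video_loader.py` decodes at GOP granularity, the loading strategy can be pictured as grouping demuxed packets by keyframe. A rough sketch with PyAV; the actual loader may seek, subsample, or decode differently:

```python
import av

def iter_gops(path):
    """Yield lists of packets, one list per GOP (I-frame + dependents)."""
    container = av.open(path)
    stream = container.streams.video[0]
    gop = []
    for packet in container.demux(stream):
        if packet.dts is None:
            continue  # skip the flush packet at end of stream
        if packet.is_keyframe and gop:
            yield gop  # previous GOP is complete
            gop = []
        gop.append(packet)
    if gop:
        yield gop
```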
Install dependencies:

```bash
pip install torch timm scikit-learn tqdm av
```

Ensure your dataset is organized under `datasets/UAVHuman_480p_mp4/`. The `train.py` script expects the split file at `datasets/UAVHuman_480p_mp4/train_split.txt`.
The `train.py` script handles the 2-phase logic automatically (see the sketch after this list):
- Phase 1 (Warmup): Encoders are frozen (epochs 0-5) to stabilize the randomly initialized Temporal Transformer.
- Phase 2 (End-to-End): The top 2 encoder layers are unfrozen (epochs 6-50) and trained with a 10x lower learning rate.
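A minimal sketch of what the two phases amount to, assuming PyTorch and hypothetical module names (`context_encoder`, `motion_encoder`, `temporal`, `head`, `blocks`) rather than the actual attributes in `train.py`; the optimizer is rebuilt when the phase boundary is crossed:

```python
import torch

def configure_phase(model, epoch, base_lr=1e-4, warmup_epochs=6):
    """Illustrative 2-phase setup; module names are hypothetical."""
    encoders = (model.context_encoder, model.motion_encoder)

    # Default: both backbones frozen (Phase 1).
    for enc in encoders:
        for p in enc.parameters():
            p.requires_grad = False

    groups = [
        {"params": model.temporal.parameters(), "lr": base_lr},
        {"params": model.head.parameters(), "lr": base_lr},
    ]

    if epoch >= warmup_epochs:
        # Phase 2: unfreeze the top-2 blocks of each backbone and
        # train them with a 10x lower learning rate.
        top_params = []
        for enc in encoders:
            for block in enc.blocks[-2:]:
                for p in block.parameters():
                    p.requires_grad = True
                    top_params.append(p)
        groups.append({"params": top_params, "lr": base_lr * 0.1})

    return torch.optim.AdamW(groups, weight_decay=0.05)
```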
The model logs Macro-F1 scores after every validation epoch to ensure performance is not skewed by majority classes in the UAVHuman dataset.
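With scikit-learn already a dependency, the logged metrics can be reproduced along these lines (a sketch; `y_true` and `y_pred` stand for the validation labels and argmax predictions):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    """Macro averaging weights all 155 classes equally, so rare
    actions count as much as frequent ones."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    }
```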
Developed for Advanced Action Recognition from UAV Perspectives.