Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin
LUMIA Lab, Shanghai Jiao Tong University
We present Fourier Compressor, a simple and parameter-free visual token compression module for VLMs. Our method removes redundancy in the frequency domain via two-dimensional Discrete Cosine Transform (DCT), preserving semantic fidelity with negligible overhead. Fourier Compressor generalizes effectively across multiple VLM architectures, including LLaVA, Qwen2-VL, and Qwen2.5-VL, for both image and video tasks.
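To make the frequency-domain idea concrete, here is a minimal, dependency-free sketch of low-pass compression on a single feature channel. It assumes the compression amounts to a 2D DCT, truncation to the top-left (low-frequency) block, and an inverse DCT at the smaller size. The helper names (`dct_matrix`, `compress_channel`) and the mean-preserving rescale are illustrative choices, not the API or exact behavior of `fourier_compressor/dct.py`.

```python
import math


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II matrix as nested lists."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def compress_channel(grid, reserve):
    """Low-pass one n x n feature channel down to reserve x reserve.

    Hypothetical helper: the real module works on batched torch tensors.
    """
    n = len(grid)
    d_n = dct_matrix(n)
    # Forward 2D DCT: transform columns, then rows.
    coeffs = matmul(matmul(d_n, grid), transpose(d_n))
    # Keep only the low-frequency top-left block.
    low = [row[:reserve] for row in coeffs[:reserve]]
    # Inverse 2D DCT at the reduced size.
    d_m = dct_matrix(reserve)
    small = matmul(matmul(transpose(d_m), low), d_m)
    # Assumption: rescale so a constant input keeps its value.
    gain = reserve / n
    return [[gain * v for v in row] for row in small]


if __name__ == "__main__":
    grid = [[float(r + c) for c in range(8)] for r in range(8)]
    small = compress_channel(grid, reserve=4)
    print(len(small), "x", len(small[0]))
```

In a VLM the same operation would be applied independently to every hidden dimension of the visual feature grid, for example a 24x24 grid reduced to 12x12.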
Fourier-Compressor/
├── fourier_compressor/
│   ├── compress.py          # model-agnostic compression
│   ├── dct.py               # torch DCT / IDCT implementation
│   └── integrations/
│       ├── llava/           # LLaVA patch and source-edit notes
│       └── qwen_vl/         # Qwen2/2.5-VL patch and helpers
├── examples/
│   ├── infer_llava.py
│   ├── infer_qwen2_vl.py
│   └── infer_qwen2_5_vl.py
└── README.md
LLaVA and Qwen2/2.5-VL use different transformers stacks, so we recommend separate environments.
# LLaVA environment
conda create -n fourier-llava python=3.10 -y
conda activate fourier-llava
# Install LLaVA following the official LLaVA repository.

# Qwen2-VL / Qwen2.5-VL environment
conda create -n fourier-qwen python=3.12 -y
conda activate fourier-qwen
pip install "transformers==4.51.3" qwen-vl-utils

Install Fourier Compressor after the model-specific dependencies:
git clone https://github.com/whyisverysmart/Fourier-Compressor
cd Fourier-Compressor
pip install -e .

Patch LLaVA before loading the model:
from fourier_compressor.integrations.llava import apply_to_llava
apply_to_llava(reserve=12) # 24x24 -> 12x12 visual tokens
# Load and run LLaVA as usual.

Example:
python examples/infer_llava.py --image your.jpg --reserve 12

Qwen2-VL requires two synchronized changes: compressing the visual output and updating the number of visual placeholder tokens emitted by the processor.
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from fourier_compressor.integrations.qwen_vl import apply_to_qwen2_vl
model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    model_id,
    min_pixels=256*28*28,
    max_pixels=2304*28*28,
)
apply_to_qwen2_vl(model, processor, ratio=2/3)
# Build inputs and generate as in the official Qwen-VL examples.

Example:
python examples/infer_qwen2_vl.py --image your.jpg --ratio 0.6666667

Qwen2.5-VL uses the same integration logic under transformers==4.51.3.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from fourier_compressor.integrations.qwen_vl import apply_to_qwen2_5_vl
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    model_id,
    min_pixels=256*28*28,
    max_pixels=2304*28*28,
)
apply_to_qwen2_5_vl(model, processor, ratio=2/3)

Example:
python examples/infer_qwen2_5_vl.py --image your.jpg --ratio 0.6666667

For benchmark evaluation, follow the original evaluation scripts of the corresponding model repository, or evaluate through lmms-eval.
| Model | Base Model | Visual Tokens | Compression | Weights |
|---|---|---|---|---|
| Fourier-LLaVA-v1.5-7B-256 | LLaVA-v1.5-7B | 256 | 55.6% | 🤗 HF |
| Fourier-LLaVA-v1.5-7B-144 | LLaVA-v1.5-7B | 144 | 75.0% | 🤗 HF |
| Fourier-LLaVA-v1.5-7B-64 | LLaVA-v1.5-7B | 64 | 88.9% | 🤗 HF |
| Fourier-LLaVA-v1.5-7B-36 | LLaVA-v1.5-7B | 36 | 93.8% | 🤗 HF |
| Fourier-LLaVA-v1.5-13B-144 | LLaVA-v1.5-13B | 144 | 75.0% | 🤗 HF |
| Fourier-Qwen2-VL-2B-0.67 | Qwen2-VL-2B-Instruct | Dynamic | 55.6% | 🤗 HF |
| Fourier-Qwen2.5-VL-3B-0.67 | Qwen2.5-VL-3B-Instruct | Dynamic | 55.6% | 🤗 HF |
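
The Compression column can be reproduced with a few lines of arithmetic. The sketch below assumes that LLaVA-v1.5 starts from a 24x24 grid (576 visual tokens), that `reserve` is the side length of the kept grid, and that `ratio` scales each spatial side for the Qwen models. Only `reserve=12` is confirmed above; the other `reserve` values are inferred from the token counts in the table, and `compression_percent` is a helper introduced here for illustration.

```python
def compression_percent(kept_tokens, original_tokens):
    """Share of visual tokens removed, in percent."""
    return 100.0 * (1.0 - kept_tokens / original_tokens)


LLAVA_GRID = 24  # 24x24 = 576 visual tokens before compression

# Assumption: reserve values 16, 8 and 6 are inferred from the 256/64/36 rows.
for reserve in (16, 12, 8, 6):
    kept = reserve * reserve
    pct = compression_percent(kept, LLAVA_GRID ** 2)
    print(f"reserve={reserve:2d} -> {kept:3d} tokens, {pct:.1f}% compression")

# Assumption: ratio is applied per side, so the kept fraction is ratio squared.
ratio = 2 / 3
qwen_pct = 100.0 * (1.0 - ratio ** 2)
print(f"ratio={ratio:.2f} -> {qwen_pct:.1f}% compression")
```

Under these assumptions, `ratio=2/3` keeps 4/9 of the dynamic token count, which matches the 55.6% listed for both Qwen checkpoints.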
@article{wang2025fourier,
title={Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models},
author={Wang, Huanyu and Kai, Jushi and Bai, Haoli and Hou, Lu and Jiang, Bo and He, Ziwei and Lin, Zhouhan},
journal={arXiv preprint arXiv:2508.06038},
year={2025}
}