This package includes the PAROAttention code with quantization and sparsity implementations. We build PAROAttention on top of the SageAttention codebase and further integrate tailored designs for PAROAttention's block-wise sparsity.
Requirements:

- `python >= 3.9`
- `torch >= 2.3.0`
- `CUDA >= 11.8`
- `flash-attn` for benchmarking

Compile from source with

```shell
python setup.py develop
```

or

```shell
pip install -e .
```

Note that this version is only verified on Ampere (sm80) GPUs, such as A100 and A800.
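Before compiling, you can confirm that the visible GPU is in fact sm80. A minimal sanity check using standard PyTorch calls:

```python
import torch

# The kernels in this package are only verified on Ampere sm80 GPUs
# (e.g., A100, A800), so check the compute capability before building.
major, minor = torch.cuda.get_device_capability()
print(f"Detected compute capability: sm{major}{minor}")
assert (major, minor) == (8, 0), (
    f"sm{major}{minor} is untested; this release is only verified on sm80."
)
```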
We provide the following benchmarks:

- Attention acceleration under varying sparsity (0.2, 0.3, 0.5 density):

  ```shell
  cd bench
  python bench.py
  ```

- RoPE kernel with and without permutation:

  ```shell
  cd bench
  python overhead.py
  ```

- Baseline implementations such as FlashAttention V2:

  ```shell
  cd bench
  python bench_baseline.py --method fa2
  ```

- Speed comparison between FlashAttention V2, SageAttention, SpargeAttention, SparseVideoGen, and PAROAttention (ours), together with an analysis of overhead. Calibration data is needed for profiling, since SpargeAttention is data-dependent; see the sketch after this list for preparing it:

  ```shell
  cd bench
  python bench_all.py --q_path your/path/to/q --k_path your/path/to/k --v_path your/path/to/v --permute_plan_path your/path/to/permute_plan
  ```
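As a starting point for the calibration data, here is a minimal sketch that saves q/k/v tensors in `.pth` form. The tensor shape and the one-tensor-per-file layout are illustrative assumptions, not a documented format; in practice you would capture the tensors from a real model forward pass, and the permute plan comes from the PAROAttention calibration rather than from this snippet:

```python
import os
import torch

# Illustrative only: the (batch, heads, seq_len, head_dim) shape and the
# one-tensor-per-file layout are assumptions. Replace the random tensors
# with q/k/v captured from your own model's attention layers.
os.makedirs("calib", exist_ok=True)
q = torch.randn(1, 24, 4096, 64, dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
torch.save(q, "calib/q.pth")
torch.save(k, "calib/k.pth")
torch.save(v, "calib/v.pth")
# Then: python bench_all.py --q_path calib/q.pth --k_path calib/k.pth --v_path calib/v.pth ...
```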
We provide an example of the PAROAttention CogVideoX pipeline in ./example/pipeline_cogvideox.py, which adopts the PARO_CogVideoXAttnProcessor from paroattention/cogvideox.py. You need to specify the paths to permute_plan.pth and sparse_plan.pth.
For comparison with FA2 to measure the speedup, you can uncomment the F.scaled_dot_product_attention call in the code to adopt the FA2 implementation, and compare the Attention Time of the two approaches.
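For orientation, a condensed sketch of how such a processor can be plugged into the diffusers CogVideoX pipeline. The PARO_CogVideoXAttnProcessor constructor arguments and the model id shown here are assumptions based on the description above; refer to ./example/pipeline_cogvideox.py for the exact usage:

```python
import torch
from diffusers import CogVideoXPipeline

from paroattention.cogvideox import PARO_CogVideoXAttnProcessor

# Example model id; the example script may target a different checkpoint.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
).to("cuda")

# Hypothetical constructor arguments: the exact signature lives in
# paroattention/cogvideox.py. permute_plan.pth and sparse_plan.pth are the
# plan files you must provide, as noted above.
processor = PARO_CogVideoXAttnProcessor(
    permute_plan_path="your/path/to/permute_plan.pth",
    sparse_plan_path="your/path/to/sparse_plan.pth",
)
pipe.transformer.set_attn_processor(processor)

video = pipe(prompt="a panda playing guitar", num_inference_steps=50).frames[0]
```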
This project is licensed under the Apache License 2.0.
See the LICENSE file for more information.