SAFusion: Efficient Tensor Fusion with Sparsification Ahead for High-Performance Distributed DNN Training
SAFusion is an efficient tensor fusion mechanism for high-performance distributed DNN training. It first introduces a sparsification-ahead tensor fusion scheme, which sparsifies each gradient tensor before merging it into the fusion buffer, rather than sparsifying after fusion (sparsification-behind tensor fusion). This avoids missing gradient tensors and thus improves convergence. SAFusion then adds an inter-worker gradient alignment fusion scheme, which merges the same number of sparsified gradients on every worker to avoid long waits during gradient synchronization, and an intra-worker adaptive buffer sizing scheme, which maximizes the overlap between backpropagation and communication to reduce repeated waiting periods. This repository contains SAFusion's source code and a set of benchmarking scripts for several popular open-source distributed DNN training systems with state-of-the-art tensor fusion schemes.
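The difference between sparsification-behind and sparsification-ahead fusion can be illustrated with a small, framework-free sketch. It is not taken from the SAFusion source: the function names (`top_k`, `sparsify_behind`, `sparsify_ahead`), the toy gradients, and the use of plain top-k selection are assumptions made for illustration only.

```python
def top_k(values, k):
    """Return (index, value) pairs for the k largest-magnitude entries, ordered by index."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return sorted((i, values[i]) for i in order[:k])


def sparsify_behind(tensors, density):
    """Fuse all tensors first, then run one top-k over the merged buffer."""
    fused, bounds, offset = [], [], 0
    for name, t in tensors.items():
        fused.extend(t)
        bounds.append((name, offset, offset + len(t)))
        offset += len(t)
    kept = top_k(fused, max(1, int(len(fused) * density)))
    # Map the surviving entries back to the tensor they came from.
    survivors = {name: [] for name in tensors}
    for i, v in kept:
        for name, lo, hi in bounds:
            if lo <= i < hi:
                survivors[name].append((i - lo, v))
    return survivors


def sparsify_ahead(tensors, density):
    """Run top-k on every tensor separately; the results are fused afterwards."""
    return {name: top_k(t, max(1, int(len(t) * density))) for name, t in tensors.items()}


# Toy gradients (hypothetical): "conv1" has much larger magnitudes than "fc".
grads = {
    "conv1": [0.9, -0.8, 0.7, 0.6],
    "fc": [0.01, -0.02, 0.03, 0.005],
}
behind = sparsify_behind(grads, density=0.5)
ahead = sparsify_ahead(grads, density=0.5)
print("behind:", behind)  # "fc" keeps no entries: the whole tensor is missing
print("ahead: ", ahead)   # every tensor keeps its own largest entries
```

In this toy case the single top-k over the fused buffer selects only `conv1` entries, so the `fc` tensor contributes nothing to the update, whereas per-tensor sparsification keeps entries from both tensors.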
This code repository covers:
- SAF(Naive): Sparsification-ahead tensor fusion
- SAFusion-Inter: Aligned inter-worker gradient tensor fusion
- SAFusion-(Inter+Intra): Adaptive intra-worker buffer sizing
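The motivation for aligned inter-worker fusion can be sketched as follows. This is a simplified illustration, not SAFusion's algorithm: it assumes a greedy, capacity-based packing of layers into fusion buffers (`pack_layers`), hypothetical per-worker element counts, and a fixed keep ratio (`keep_one_in`) as the way to make the counts identical on every worker.

```python
def pack_layers(counts, capacity):
    """Greedily pack consecutive layers into buffers of at most `capacity` elements."""
    buffers, current, used = [], [], 0
    for name, n in counts:
        if current and used + n > capacity:
            buffers.append(current)
            current, used = [], 0
        current.append(name)
        used += n
    if current:
        buffers.append(current)
    return buffers


def aligned_counts(layer_sizes, keep_one_in):
    """Per-layer element counts that depend only on layer shapes, not on gradient values."""
    return [(name, max(1, size // keep_one_in)) for name, size in layer_sizes]


# Layers in backward order with their dense sizes (hypothetical).
layer_sizes = [("layer4", 1000), ("layer3", 800), ("layer2", 600), ("layer1", 400)]

# Hypothetical data-dependent counts: each worker selects a different number of
# elements per layer, so the buffers close at different layer boundaries.
worker0 = [("layer4", 12), ("layer3", 7), ("layer2", 9), ("layer1", 3)]
worker1 = [("layer4", 8), ("layer3", 6), ("layer2", 4), ("layer1", 5)]
unaligned = [pack_layers(w, capacity=20) for w in (worker0, worker1)]

# Aligned counts: every worker derives the same plan from the layer shapes alone.
aligned = pack_layers(aligned_counts(layer_sizes, keep_one_in=100), capacity=20)

print("worker 0:", unaligned[0])
print("worker 1:", unaligned[1])
print("aligned: ", aligned)
```

When the two workers pack different layers into their first buffer, one of them has to wait before the collective operation can start. With identical counts, all workers close each buffer after the same layer.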
SAFusion is built on PyTorch. We implemented the prototype system on top of the Horovod distributed training framework, using NCCL as the communication library.
In SAFusion, each worker contains three modules: a Generator module that builds an efficient sparsification-ahead fusion buffer and provides the inter-worker aligned fusion and intra-worker adaptive fusion operations; a Controller module that controls operations on the fusion buffer, such as pushing, pulling, and communicating sparsified gradients; and a Sparsification Compression module that performs layer-wise gradient sparsification during backward propagation.
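To give an intuition for why the fusion buffer size affects the overlap of backpropagation and communication, the sketch below uses a toy cost model. It is an assumption-laden illustration rather than the Generator's actual logic: the linear communication cost `alpha + beta * n`, the function names, and all numbers are hypothetical.

```python
def iteration_time(layers, buffer_size, alpha, beta):
    """Finish time of the last communication for one backward pass.

    layers: (backward_time, num_sparsified_elements) pairs in backward order.
    A buffer is sent once it holds at least `buffer_size` elements, or after the
    last layer. Sending n elements costs alpha + beta * n, and sends are serialized.
    """
    clock = 0.0      # time at which the current layer's gradient is ready
    comm_free = 0.0  # time at which the network becomes idle again
    pending = 0      # elements waiting in the current buffer
    for i, (bp_time, n) in enumerate(layers):
        clock += bp_time
        pending += n
        is_last = i == len(layers) - 1
        if pending >= buffer_size or is_last:
            start = max(clock, comm_free)
            comm_free = start + alpha + beta * pending
            pending = 0
    return comm_free


def best_buffer_size(layers, candidates, alpha, beta):
    """Pick the candidate buffer size with the smallest modeled iteration time."""
    return min(candidates, key=lambda s: iteration_time(layers, s, alpha, beta))


# Four layers, each taking 2.0 time units of backprop and yielding 10 elements.
layers = [(2.0, 10)] * 4
alpha, beta = 1.5, 0.125  # hypothetical startup cost and per-element cost

for size in (10, 20, 40):
    print(size, iteration_time(layers, size, alpha, beta))
best = best_buffer_size(layers, [10, 20, 40], alpha, beta)
print("best buffer size:", best)
```

In this model, a small buffer pays the startup cost too often, while a large buffer delays communication until backpropagation has finished; an intermediate size gives the best overlap.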
[Figure: The workflow of the SAFusion Generator module]
Prerequisites:
- CUDA-12.0
- Python >= 3.9
- NCCL-2.8.3
- PyTorch-1.3.+
- OpenMPI-4.0.+
- Horovod-0.28.1+
- Numpy
- TensorboardX
- Tqdm
git clone https://github.com/zqming-cs/SAFusion.git
cd SAFusion
pip install -r requirements.txt
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod==0.28.0
If the pip installation fails, try upgrading pip with pip install --upgrade pip. If the Horovod installation with NCCL fails, check the installation guide.
The primary benchmarks are provided in the example directory.
For example, the following commands run the benchmarks on 8 GPUs, with dgc as the compression algorithm, allgather as the communication primitive, and residual as the memory.
To run BERT-large training job:
cd safusion/example/nlp/bert/scripts
bash run_squad_bert.sh
To run GPT2-large training job:
cd safusion/example/nlp/gpt
bash run_clm_no_trainer_hvd_103.sh
To run ViT-large training job:
cd safusion/example/cv/vit
bash run_imagenet_no_trainer.sh
To run ResNet-152 training job:
cd safusion/example/cv/resnet
bash run_imagenet_resnet152.sh
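The benchmark settings mentioned above (a top-k style compressor such as dgc, allgather communication, and residual memory) can be sketched without any framework. This is a simplified illustration and not the repository's implementation: `ResidualTopK` and `allgather_average` are hypothetical names, the allgather is simulated with a Python list, and DGC's additional techniques (for example momentum correction) are omitted.

```python
class ResidualTopK:
    """Top-k sparsifier with residual memory (error feedback) for one flat gradient."""

    def __init__(self, k):
        self.k = k
        self.residual = None

    def compress(self, grad):
        if self.residual is None:
            self.residual = [0.0] * len(grad)
        # Add back the values that were not sent in earlier iterations.
        corrected = [g + r for g, r in zip(grad, self.residual)]
        order = sorted(range(len(corrected)), key=lambda i: abs(corrected[i]), reverse=True)
        keep = sorted(order[: self.k])
        kept = set(keep)
        # Everything that is not sent now is remembered for the next iteration.
        self.residual = [0.0 if i in kept else v for i, v in enumerate(corrected)]
        return keep, [corrected[i] for i in keep]


def allgather_average(messages, size):
    """Simulated allgather: every worker receives all (indices, values) pairs,
    scatters them into a dense vector, and averages over the workers."""
    dense = [0.0] * size
    for indices, values in messages:
        for i, v in zip(indices, values):
            dense[i] += v
    return [v / len(messages) for v in dense]


workers = [ResidualTopK(k=1), ResidualTopK(k=1)]

# Iteration 1: each worker sends only its largest entry.
step1 = allgather_average(
    [workers[0].compress([4.0, 1.0, 0.0]), workers[1].compress([0.0, 1.0, 2.0])], size=3
)
# Iteration 2: the unsent entry at index 1 has accumulated and is now selected.
step2 = allgather_average(
    [workers[0].compress([0.5, 1.0, 0.0]), workers[1].compress([0.0, 1.0, 0.5])], size=3
)
print(step1)
print(step2)
```

Allgather is used here instead of allreduce because each worker may select different indices, so the sparse messages cannot simply be summed element by element.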
SAFusion: Efficient Tensor Fusion with Sparsification Ahead for High-Performance Distributed DNN Training
If you use this repository in your paper, please cite our work:
@inproceedings{ming2025safusion,
title={SAFusion: Efficient Tensor Fusion with Sparsification Ahead for High-Performance Distributed DNN Training},
author={Zhangqiang Ming and Yuchong Hu and Xinjue Zheng and Wenxiang Zhou and Dan Feng},
booktitle={Proceedings of the 34th ACM International Symposium on High-Performance Parallel and Distributed Computing},
url={https://doi.org/10.1145/3731545.3731581},
year={2025}
}
- WikiText-2/103: https://huggingface.co/datasets/wikitext
- SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
- CIFAR-100: https://www.cs.utoronto.ca/~kriz/cifar.html
- ImageNet: https://www.image-net.org/
See LICENSE.
