SSFusion is a selective sparsification tensor fusion mechanism for data-parallel distributed DNN training. It addresses the high compression overhead or low convergence accuracy of existing tensor fusion schemes. Instead of per-tensor or multi-tensor sparsification, SSFusion performs selective sparsification before tensor fusion, which reduces compression overhead while maintaining convergence accuracy. SSFusion also provides an efficient sparsification offloading scheme to further speed up compression, and an interleaved communication scheme to improve sparse communication efficiency. This repository contains SSFusion's source code and a set of benchmarking scripts for popular open-source distributed DNN training systems with state-of-the-art tensor fusion and sparsification schemes.
This code repository covers:
- SSFusion (Naive): Tensor fusion with selective sparsification
- SSFusion-O: Efficient sparsification offloading in SSFusion
- SSFusion-I: Interleaved communication in SSFusion
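The core idea, sparsifying selected tensors before placing them in one fusion buffer, can be illustrated with a minimal, framework-free sketch. The selection rule used here (sparsify only tensors with at least `min_size` elements) and the names `topk_sparsify` and `fuse_with_selective_sparsification` are hypothetical stand-ins for illustration; they are not SSFusion's actual criterion or API.

```python
import heapq


def topk_sparsify(grad, ratio):
    """Keep the k largest-magnitude entries of a flat gradient.

    Returns (indices, values) with indices in ascending order.
    """
    k = max(1, int(len(grad) * ratio))
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    idx.sort()
    return idx, [grad[i] for i in idx]


def fuse_with_selective_sparsification(grads, ratio=0.25, min_size=4):
    """Build one fusion buffer from per-tensor gradients.

    Assumed selection rule (illustration only): tensors with at least
    `min_size` elements are top-k sparsified, smaller ones are copied densely.
    The layout records (name, offset, kept indices or None, length).
    """
    buffer, layout = [], []
    for name, grad in grads.items():
        offset = len(buffer)
        if len(grad) >= min_size:
            idx, vals = topk_sparsify(grad, ratio)
            buffer.extend(vals)
            layout.append((name, offset, idx, len(vals)))
        else:
            buffer.extend(grad)
            layout.append((name, offset, None, len(grad)))
    return buffer, layout


grads = {
    "conv.weight": [0.1, -0.9, 0.05, 0.4, -0.02, 0.7, 0.0, -0.3],
    "conv.bias": [0.2, -0.1],
}
buffer, layout = fuse_with_selective_sparsification(grads)
print(buffer)  # the fused buffer: sparse values followed by the dense bias
print(layout)
```

Because compression runs once per selected tensor and the results share a single buffer, only one communication call per buffer is needed; how SSFusion actually chooses which tensors to sparsify is described in the paper, not in this sketch.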
We use the PyTorch framework and implement the prototype system of SSFusion based on the Horovod distributed training framework using NCCL as the communication library.
In SSFusion, each worker contains four modules:
- a Generator module, which generates the fusion buffer with selective sparsification and provides the sparsification offloading and interleaved communication operations for efficient data-parallel distributed DNN training;
- a Controller module, which controls operations on the fusion buffer such as sparsified gradient selection, offloading, pushing, pulling, and interleaved communication;
- a Selective Sparsification module, which performs selective sparsification during backpropagation;
- a Tensor Fusion module, which performs tensor fusion after backpropagation.
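Sparsified gradients from different workers usually keep different indices, so they are commonly exchanged as (indices, values) pairs with All-Gather and then scatter-added locally, rather than summed element-wise as in a dense All-Reduce. The following single-process simulation illustrates that aggregation step. The functions `all_gather` and `aggregate_sparse` are hypothetical helpers written for this example; they do not call NCCL or Horovod and are not SSFusion's implementation.

```python
def all_gather(per_worker_payloads):
    """Simulated All-Gather: every worker receives all workers' payloads."""
    return [list(per_worker_payloads) for _ in per_worker_payloads]


def aggregate_sparse(gathered, size):
    """Scatter-add gathered (indices, values) pairs into a dense average."""
    dense = [0.0] * size
    for indices, values in gathered:
        for i, v in zip(indices, values):
            dense[i] += v
    n = len(gathered)
    return [x / n for x in dense]


# Two workers, each holding a sparsified view of a 5-element gradient.
payloads = [
    ([0, 3], [1.0, -2.0]),  # worker 0 kept indices 0 and 3
    ([3, 4], [4.0, 2.0]),   # worker 1 kept indices 3 and 4
]
gathered = all_gather(payloads)
results = [aggregate_sparse(g, size=5) for g in gathered]
print(results[0])
```

Every simulated worker ends up with the same averaged dense gradient, which is the property a real sparse collective has to preserve.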
[Figure: the workflow of the SSFusion system]
SSFusion requires the following software:
- CUDA-12.0
- Python >= 3.12
- NCCL-2.8.3
- PyTorch-2.3.+
- OpenMPI-4.0.+
- Horovod-0.28.1+
- Numpy
- TensorboardX
- Tqdm
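The Python-side prerequisites above can be checked with a short script before running the benchmarks. This is a minimal sketch: the distribution names in `REQUIRED` are assumed to be the usual PyPI names, and the script does not verify CUDA, NCCL, or OpenMPI, which are installed outside pip.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the Python packages listed above.
REQUIRED = ["torch", "horovod", "numpy", "tensorboardX", "tqdm"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```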
git clone https://github.com/ICDE26-SSFusion/SSFusion.git
cd SSFusion
pip install -r requirements.txt
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod==0.28.0
If the pip installation fails, try upgrading pip with pip install --upgrade pip. If the Horovod installation with NCCL fails, please check the Horovod installation guide.
The primary benchmarks are provided in the example directory.
For example, the following commands run the benchmarks on 8 GPUs, with dgc as the compression algorithm, All-Reduce and All-Gather as the communication primitives, and residual as the memory scheme.
To run the BERT-large training job:
cd SSFusion/example/nlp/bert/scripts
bash run_squad_bert.sh
To run the GPT2-large training job:
cd SSFusion/example/nlp/gpt
bash run_clm_no_trainer_hvd_103.sh
To run the ViT-large training job:
cd SSFusion/example/cv/vit
bash run_imagenet_no_trainer.sh
To run the ResNet-152 training job:
cd SSFusion/example/cv/resnet
bash run_imagenet_resnet152.sh
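The benchmark configuration above uses residual memory together with dgc compression. As a rough illustration of what residual memory does, the sketch below keeps the gradient entries dropped by top-k sparsification and adds them back in the next iteration. The class `ResidualMemory` and the helper `topk_indices` are hypothetical names for this example, and the sketch omits other parts of DGC such as momentum correction; it is not the code the scripts run.

```python
import heapq


class ResidualMemory:
    """Accumulates the gradient entries dropped by sparsification."""

    def __init__(self):
        self.residuals = {}

    def compensate(self, name, grad):
        """Add the stored residual to the fresh gradient."""
        res = self.residuals.get(name, [0.0] * len(grad))
        return [g + r for g, r in zip(grad, res)]

    def update(self, name, compensated, kept_indices):
        """Store everything that was not sent as the new residual."""
        kept = set(kept_indices)
        self.residuals[name] = [
            0.0 if i in kept else v for i, v in enumerate(compensated)
        ]


def topk_indices(values, k):
    """Indices of the k largest-magnitude entries, in ascending order."""
    return sorted(heapq.nlargest(k, range(len(values)), key=lambda i: abs(values[i])))


memory = ResidualMemory()
sent = []
for grad in ([0.5, 0.25, -1.0], [0.5, 0.25, 0.0]):
    acc = memory.compensate("fc.weight", grad)
    idx = topk_indices(acc, k=1)
    memory.update("fc.weight", acc, idx)
    sent.append((idx, [acc[i] for i in idx]))
print(sent)
```

In the second iteration the entry at index 0 is sent because its accumulated value has grown past the others, which shows how small gradients are delayed rather than discarded.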
SSFusion: Tensor Fusion with Selective Sparsification for Efficient Distributed DNN Training
If you use this repository in your paper, please cite our work:
@inproceedings{ming2026ssfusion,
title={SSFusion: Tensor Fusion with Selective Sparsification for Efficient Distributed DNN Training},
author={Zhangqiang Ming and Rui Wang and Yuchong Hu and Yuanhao Shu and Wenxiang Zhou and Xinjue Zheng and Dan Feng},
booktitle={Proceedings of the 42nd IEEE International Conference on Data Engineering (ICDE2026)},
url={https://doi.org/10.1145/xxxx.xxxx},
year={2026}
}
- WikiText-2/103: https://huggingface.co/datasets/wikitext
- SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
- ImageNet: https://www.image-net.org/
- CIFAR-100: https://www.cs.utoronto.ca/~kriz/cifar.html
See LICENSE.
