SAFusion: Efficient Tensor Fusion with Sparsification Ahead for High-Performance Distributed DNN Training

SAFusion is an efficient tensor fusion mechanism for high-performance distributed DNN training. It first introduces a sparsification-ahead tensor fusion scheme, which sparsifies each gradient tensor before merging it into the fusion buffer, rather than sparsifying the merged buffer afterwards (sparsification-behind). This prevents gradient tensors from going missing after sparsification and thus improves convergence. SAFusion then adds an inter-worker gradient alignment fusion scheme, which merges the same amount of sparsified gradients on every worker to avoid long waits during gradient synchronization, and an intra-worker adaptive buffer sizing scheme, which maximizes the overlap between backpropagation and communication to reduce repeated waiting periods. This repository contains SAFusion's source code, along with benchmarking scripts for several popular open-source distributed DNN training systems that use state-of-the-art tensor fusion schemes.
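The difference between the two fusion orders can be illustrated with a small, self-contained sketch. It is not SAFusion's implementation: it uses plain Python lists instead of PyTorch tensors, a simple magnitude-based top-k selector, and hypothetical function names (`top_k`, `sparsify_ahead`, `sparsify_behind`). It only shows how a single top-k over a merged buffer can leave one tensor with no selected gradients, while per-tensor sparsification keeps some from each.

```python
from typing import Dict, List, Tuple

Sparse = List[Tuple[int, float]]  # (index, value) pairs


def top_k(values: List[float], k: int) -> Sparse:
    """Return the k entries with the largest magnitude, ordered by index."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return [(i, values[i]) for i in sorted(order[:k])]


def sparsify_ahead(tensors: Dict[str, List[float]], density: float) -> Dict[str, Sparse]:
    """Sparsify every tensor on its own, then place the results in the buffer."""
    return {
        name: top_k(grad, max(1, int(len(grad) * density)))
        for name, grad in tensors.items()
    }


def sparsify_behind(tensors: Dict[str, List[float]], density: float) -> Dict[str, Sparse]:
    """Merge the dense tensors first, then run one top-k over the merged buffer."""
    flat: List[float] = []
    owner: List[Tuple[str, int]] = []
    for name, grad in tensors.items():
        for i, g in enumerate(grad):
            flat.append(g)
            owner.append((name, i))
    kept: Dict[str, Sparse] = {name: [] for name in tensors}
    for j, value in top_k(flat, max(1, int(len(flat) * density))):
        name, i = owner[j]
        kept[name].append((i, value))
    return kept


# Toy gradients: layer2 has much smaller magnitudes than layer1.
grads = {
    "layer1": [0.9, -0.8, 0.7, 0.6],
    "layer2": [0.01, -0.02, 0.03, 0.005],
}
ahead = sparsify_ahead(grads, density=0.5)
behind = sparsify_behind(grads, density=0.5)
print("ahead :", ahead)
print("behind:", behind)
```

In this toy input, the sparsification-behind variant selects all four values from `layer1` and none from `layer2`, whereas the sparsification-ahead variant keeps two values from each tensor.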

Introduction

This code repository covers:

SAFusion Framework

SAF(Naive): Sparsification-ahead tensor fusion
SAFusion-Inter: Aligned inter-worker gradient tensor fusion
SAFusion-(Inter+Intra): Adaptive intra-worker buffer sizing

State-of-the-art tensor fusion schemes

State-of-the-art sparsification algorithms

Implementation

SAFusion System Architecture

We implemented the SAFusion prototype in PyTorch on top of the Horovod distributed training framework, using NCCL as the communication library.

In SAFusion, each worker contains three modules: a Generator module, which builds an efficient sparsification-ahead fusion buffer and provides the inter-worker aligned fusion and intra-worker adaptive fusion operations; a Controller module, which controls operations on the fusion buffer such as pushing, pulling, and communicating sparsified gradients; and a Sparsification Compression module, which performs layer-wise gradient sparsification during backward propagation.
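As a rough illustration of inter-worker aligned fusion, the sketch below shows one possible reading of "merging the same amount of sparsified gradients across workers": fusion-buffer boundaries are derived only from layer sizes and the shared density, so every worker computes the same grouping. The function name `aligned_buffers`, the `capacity` parameter, and the closing rule are assumptions made for this example and are not taken from the SAFusion source.

```python
from typing import List


def aligned_buffers(layer_sizes: List[int], density: float, capacity: int) -> List[List[int]]:
    """Group layer indices into fusion buffers by sparsified element count.

    Layers are listed in the order their gradients become ready. A buffer is
    closed once it holds at least `capacity` sparsified elements. The result
    depends only on the arguments, so all workers obtain identical buffers.
    """
    buffers: List[List[int]] = []
    current: List[int] = []
    filled = 0
    for layer, size in enumerate(layer_sizes):
        kept = max(1, int(size * density))  # elements left after sparsification
        current.append(layer)
        filled += kept
        if filled >= capacity:
            buffers.append(current)
            current, filled = [], 0
    if current:  # flush the last, partially filled buffer
        buffers.append(current)
    return buffers


# Hypothetical layer sizes (number of gradient elements per layer).
layer_sizes = [1000, 4000, 500, 500, 8000, 200]
plan = aligned_buffers(layer_sizes, density=0.01, capacity=50)
print(plan)
```

Because the plan is a pure function of the model shape and the density, workers do not need to exchange timing information to agree on which sparsified gradients go into which buffer.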

SAFusion Generator Workflow

The workflow of the SAFusion Generator module is shown in Generator_.png.

Installation

Prerequisites

Get the code

git clone https://github.com/zqming-cs/SAFusion.git
cd SAFusion
pip install -r requirements.txt
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod==0.28.0

If the pip installation fails, try upgrading pip with pip install --upgrade pip. If the Horovod installation with NCCL fails, check the Horovod installation guide.
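After installation, a quick check along the following lines can confirm that Horovod's PyTorch bindings import and were built with NCCL support. This is a minimal sketch rather than part of the repository; the helper name `horovod_nccl_status` is made up for this example, and it relies on `horovod.torch.nccl_built()` from Horovod's Python API.

```python
def horovod_nccl_status() -> str:
    """Report whether horovod.torch imports and was built with NCCL."""
    try:
        import horovod.torch as hvd
    except ImportError as err:
        # Raised when Horovod, PyTorch, or the PyTorch extension is missing.
        return f"horovod.torch not importable: {err}"
    return "built with NCCL" if hvd.nccl_built() else "built without NCCL"


print(horovod_nccl_status())
```

Horovod's `horovodrun --check-build` command prints a similar summary of the frameworks and communication backends that were compiled in.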

Quick start

The primary benchmarks are provided in the example directory. For example, the following commands run the benchmarks on 8 GPUs, with dgc as the compression algorithm, allgather as the communication primitive, and residual as the memory.

To run BERT-large training job:

cd safusion/example/nlp/bert/scripts
bash run_squad_bert.sh

To run GPT2-large training job:

cd safusion/example/nlp/gpt
bash run_clm_no_trainer_hvd_103.sh

To run ViT-large training job:

cd safusion/example/cv/vit
bash run_imagenet_no_trainer.sh

To run ResNet-152 training job:

cd safusion/example/cv/resnet
bash run_imagenet_resnet152.sh
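The configuration named in the quick start combines a top-k style compressor, residual memory, and allgather communication. The sketch below illustrates how these three pieces interact in general. It is a simplification under stated assumptions: it runs in a single process with plain Python lists, emulates allgather with a local loop, omits the momentum correction that DGC adds on top of top-k selection, and uses hypothetical names (`ResidualTopK`, `allgather_sum`) that do not come from this repository.

```python
from typing import List, Tuple

Sparse = List[Tuple[int, float]]  # (index, value) pairs


class ResidualTopK:
    """Top-k sparsifier that keeps unsent values as a local residual."""

    def __init__(self, size: int, k: int) -> None:
        self.residual = [0.0] * size
        self.k = k

    def compress(self, grad: List[float]) -> Sparse:
        # Add back what was not sent in earlier steps.
        acc = [g + r for g, r in zip(grad, self.residual)]
        order = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)
        keep = sorted(order[: self.k])
        sent = set(keep)
        # Everything that is not sent now stays in the residual.
        self.residual = [0.0 if i in sent else v for i, v in enumerate(acc)]
        return [(i, acc[i]) for i in keep]


def allgather_sum(parts: List[Sparse], size: int) -> List[float]:
    """Emulate allgather of sparse parts followed by dense accumulation."""
    dense = [0.0] * size
    for part in parts:
        for i, v in part:
            dense[i] += v
    return dense


workers = [ResidualTopK(size=4, k=1) for _ in range(2)]
grads = [[0.5, 0.125, 0.0, 0.25], [0.125, 0.0, 0.75, 0.25]]
step1 = allgather_sum([w.compress(g) for w, g in zip(workers, grads)], size=4)

# With zero new gradients, the second step sends values held in the residual.
zeros = [[0.0] * 4, [0.0] * 4]
step2 = allgather_sum([w.compress(g) for w, g in zip(workers, zeros)], size=4)
print(step1, step2)
```

Allgather fits this setting because each worker may select different indices, so the sparse parts cannot be summed element-wise in transit and are instead collected and accumulated locally.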

Papers

SAFusion: Efficient Tensor Fusion with Sparsification Ahead for High-Performance Distributed DNN Training

If you use this repository in your paper, please cite our work:

@inproceedings{ming2025safusion,
  title={SAFusion: Efficient Tensor Fusion with Sparsification Ahead for High-Performance Distributed DNN Training},
  author={Zhangqiang Ming and Yuchong Hu and Xinjue Zheng and Wenxiang Zhou and Dan Feng},
  booktitle={Proceedings of the 34th ACM International Symposium on High-Performance Parallel and Distributed Computing},
  url={https://doi.org/10.1145/3731545.3731581},
  year={2025}
}

Referred Datasets

WikiText-2/103: https://huggingface.co/datasets/wikitext
SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
CIFAR-100: https://www.cs.utoronto.ca/~kriz/cifar.html
ImageNet: https://www.image-net.org/

License

See LICENSE.
