zqming-cs/ADTopk

Enabling Efficient All-Dimension Top-k Sparsification for High-Performance Distributed DNN Training Systems

ADTopk is an all-dimension top-𝑘 sparsification scheme that selects the 𝑘 largest elements across all dimensions of the per-layer gradient tensor. It ensures that every dimension contributes at least one element, which eliminates the dimension-missing problem (dimensions from which no element is selected). ADTopk also sorts locally and independently within each dimension, so the dimensions can be processed in parallel to improve GPU utilization. Furthermore, we enhance ADTopk with system-level optimizations: (i) interleaved sparsification to accelerate convergence, (ii) partial sparsification to reduce sparsification overhead, and (iii) hybrid collective communication to improve sparse communication efficiency. We implement a high-performance distributed DNN training framework with a modular sparsification compression library that supports ADTopk and state-of-the-art gradient sparsification baselines. We also develop collective communication libraries to support different sparsification communication primitives.

Introduction

This code repository covers:

ADTopk Framework

  • ADTopk avoids dimension missing via matrix-based sparsification, which improves convergence accuracy, and increases GPU core parallelism via multiple local sorts, which improves sparsification efficiency.
  • ADTopk employs an interleaved sparsification scheme that combines ADTopk with traditional Top-𝑘 to speed up convergence.
  • ADTopk employs a partial sparsification scheme that minimizes communication idle periods to reduce the sparsification overhead.
  • ADTopk employs a hybrid collective communication scheme that combines All-Reduce and All-Gather to improve sparse communication efficiency.
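To make the dimension-missing problem concrete, the following is a minimal sketch in plain Python, not the repository's implementation. It assumes the gradient is viewed as a matrix whose rows are the "dimensions", that elements are ranked by absolute value, and that the budget is split evenly across rows. The function names `global_topk` and `all_dimension_topk` are hypothetical.

```python
import heapq

def global_topk(grad, k):
    """Conventional Top-k: rank all entries together and keep the k largest by magnitude."""
    flat = [(abs(v), r, c) for r, row in enumerate(grad) for c, v in enumerate(row)]
    keep = heapq.nlargest(k, flat)
    return sorted((r, c) for _, r, c in keep)

def all_dimension_topk(grad, k_per_row):
    """Keep the k_per_row largest-magnitude entries of every row.

    Each row is ranked independently, so the rows could be processed in parallel.
    The even split of the budget across rows is an assumption of this sketch.
    """
    selected = []
    for r, row in enumerate(grad):
        cols = heapq.nlargest(k_per_row, range(len(row)), key=lambda c: abs(row[c]))
        selected.extend((r, c) for c in sorted(cols))
    return selected

# Toy 3x4 gradient: row 0 has much larger magnitudes than rows 1 and 2.
grad = [
    [0.9, -0.8, 0.7, 0.6],
    [0.05, -0.02, 0.01, 0.03],
    [0.5, 0.04, -0.4, 0.02],
]

global_sel = global_topk(grad, 3)         # same total budget of 3 elements
adtopk_sel = all_dimension_topk(grad, 1)  # 1 element per row

print("global Top-k rows covered:", sorted({r for r, _ in global_sel}))
print("all-dimension rows covered:", sorted({r for r, _ in adtopk_sel}))
```

With the same budget of three elements, the global selection takes everything from row 0, while the per-row selection keeps one element from every row.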

State-of-the-art gradient sparsification methods.

Implementation

ADTopk System Architecture

We implement ADTopk, which consists of five main modules: an all-dimension Top-𝑘 sparsification module, an interleaved sparsification module, a partial sparsification module, a hybrid collective communication module, and a decompression and averaging module. We also implement a proofs module to demonstrate the stable convergence of distributed SGD with ADTopk on multiple training tasks.
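As a rough illustration of what decompression and averaging involve after a sparse All-Gather, the sketch below simulates the collective in a single process with plain Python lists. It is not the repository's code and does not call Horovod or NCCL. The payload layout (a pair of index and value lists per worker) and the function names are assumptions made for this example.

```python
def all_gather(per_worker_payloads):
    """Simulated All-Gather: every worker receives the payloads of all workers."""
    return [list(per_worker_payloads) for _ in per_worker_payloads]

def decompress_and_average(payloads, numel):
    """Scatter each worker's sparse (indices, values) into a dense buffer and average."""
    dense = [0.0] * numel
    for indices, values in payloads:
        for i, v in zip(indices, values):
            dense[i] += v  # workers may select different indices, so accumulate
    num_workers = len(payloads)
    return [x / num_workers for x in dense]

# Two workers, each sending two selected elements of a 4-element gradient.
payloads = [
    ([0, 3], [2.0, -4.0]),  # worker 0
    ([0, 2], [4.0, 6.0]),   # worker 1
]

gathered = all_gather(payloads)
averaged = [decompress_and_average(p, numel=4) for p in gathered]
print(averaged[0])
```

All-Gather is the natural fit here because workers generally select different indices, so their sparse payloads cannot be summed element by element the way dense tensors are in All-Reduce.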

We use the PyTorch framework and implement the ADTopk prototype system on top of Horovod, with NCCL as the communication library. An overview of our system is as follows.

ADTopk System Overview

The workflow of the ADTopk System:

Installation

Prerequisites

Install ADTopk

git clone https://github.com/zqming-cs/ADTopk.git
cd ADTopk
pip install -r requirements.txt
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod==0.28.1
pip install -e .

Quick start

The primary benchmarks are provided in examples. For example, the following commands run the benchmarks on 8 GPUs, with the sparsification algorithm chosen from adtopk, gaussiank, sidco, and dgc, the communication primitive chosen from All-Reduce and All-Gather, and residual memory.

To run BERT-large training job:

cd ADTopk/examples/nlp_examples/bert/scripts
bash run_squad_bert.sh

To run GPT2-large training job:

cd ADTopk/examples/nlp_examples/gpt
bash run_clm_no_trainer_hvd_103.sh

To run ViT-large training job:

cd ADTopk/examples/cv_examples/vit
bash run_imagenet_no_trainer.sh

To run ResNet-152 training job:

cd ADTopk/examples/cv_examples/resnet
bash run_imagenet_resnet152.sh
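The residual memory setting mentioned above usually refers to error feedback in gradient sparsification: entries that are not transmitted in one step are kept locally and added to the next gradient. The sketch below assumes that meaning. The class `ResidualMemory` and the helper `topk_indices` are hypothetical names for illustration, not the repository's API.

```python
import heapq

class ResidualMemory:
    """Keeps the gradient entries that were not sent and re-adds them next step."""

    def __init__(self, numel):
        self.residual = [0.0] * numel

    def compensate(self, grad):
        # Add the leftover from previous steps to the fresh gradient.
        return [g + r for g, r in zip(grad, self.residual)]

    def update(self, compensated, sent_indices):
        # Sent entries are cleared; everything else is carried over.
        sent = set(sent_indices)
        self.residual = [0.0 if i in sent else v for i, v in enumerate(compensated)]

def topk_indices(values, k):
    """Indices of the k largest-magnitude entries (stand-in for any sparsifier)."""
    return sorted(heapq.nlargest(k, range(len(values)), key=lambda i: abs(values[i])))

mem = ResidualMemory(numel=4)

# Step 1: index 0 is the largest entry and is sent; the rest is stored.
step1 = mem.compensate([4.0, 3.0, -1.0, 2.0])
sent1 = topk_indices(step1, k=1)
mem.update(step1, sent1)

# Step 2: the stored 3.0 at index 1 adds to the new 3.0, so index 1 wins.
step2 = mem.compensate([1.0, 3.0, 0.0, 0.0])
sent2 = topk_indices(step2, k=1)
mem.update(step2, sent2)

print(sent1, sent2, mem.residual)
```

The second step shows the point of the residual: an entry that was too small to be sent on its own is eventually transmitted once its accumulated value grows large enough.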

Papers

  • Enabling Efficient All-Dimension Top-k Sparsification for High-Performance Distributed DNN Training Systems

An earlier version of this paper appeared in the Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2024), June 2024. In this extended version, we enhance ADTopk with new system-level optimizations, including partial sparsification and hybrid collective communication. We also include new evaluation results, which show that the enhanced ADTopk increases training throughput by up to 268.0%.

Citation

If you use this repository in your paper, please cite our previous work:

@inproceedings{ming2024adtopk,
  title={ADTopk: All-Dimension Top-k Compression for High-Performance Data-Parallel DNN Training},
  author={Zhangqiang Ming and Yuchong Hu and Wenxiang Zhou and Xinjue Zheng and Dan Feng},
  booktitle={Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing},
  url={https://doi.org/10.1145/3625549.3658678},
  year={2024}
}

Referred Datasets

License

See LICENSE.
