converged-computing/performance-study

Performance Study

This study tested HPC application performance across three clouds and on-premises HPC. The repository is organized as follows:

  • docker: includes container builds for different environments. Containers are shared between environments when possible to reduce redundancy.

    • google: includes Google builds for each of CPU and GPU.
    • aws: includes AWS builds for each of CPU and GPU. The distinguishing feature is building with libfabric for EFA.
    • azure: includes Microsoft Azure builds for each of CPU and GPU, targeting InfiniBand.
  • experiments: organized first by cloud, and then by the underlying environment. Each includes a README with the full experiment protocol (and usually the commands to run).

    • Google Cloud includes HPC Toolkit (Compute Engine), and GKE (Kubernetes) for each of CPU and GPU
    • Amazon Web Services includes Parallel Cluster (EC2), and EKS (Kubernetes) for each of CPU and GPU
    • Microsoft Azure includes CycleCloud (VMs), and AKS (Kubernetes) for each of CPU and GPU.
  • analysis: includes preliminary plots for data exploration. Note that not all are finalized.

  • paper: includes a subset of plots that have been cleaned up and refined for use in publications.

Experiments

"Bare Metal"

  • Microsoft Azure CycleCloud CPU (date)
    • size 32 (abhik done 6 apps 8/28/2024, done milroy 8/30/2024)
    • size 64 (abhik done 6 apps 8/28/2024, done milroy 8/30/2024)
    • size 128 (done milroy 8/30/2024)
    • size 256 (done milroy 8/31/2024)
  • Microsoft Azure CycleCloud GPU (date)
    • size 4 (milroy and ani 8/31/2024)
    • size 8 (milroy and ani 8/31/2024)
    • size 16 (milroy and ani 8/31/2024)
    • size 32 (milroy and ani 8/31/2024)
  • AWS GPU Parallel Cluster
    • size 32 (not going to do, could not build image)
    • size 64 (not going to do, could not build image)
    • size 128 (not going to do, could not build image)
    • size 256 (not going to do, could not build image)
  • AWS CPU Parallel Cluster
    • size 32 (done milroy 8/29/2024-8/30/2024)
    • size 64 (done ani 8/29/2024-8/30/2024)
    • size 128 (done ani 8/29/2024-8/30/2024)
    • size 256 (done ani 8/29/2024-8/30/2024)
  • Google Cloud Compute Engine CPU (redone several times due to app configurations)
    • size 32 (vsoch done 8/26/2024)
    • size 64 (vsoch done 8/26/2024)
    • size 128 (vsoch done 8/27/2024)
    • size 256 (vsoch done 8/27/2024)
  • Google Compute Engine GPU
    • done on llnl-flux
    • New VM and automation needed with Terraform (vsoch, early 9/2024)
    • size 4 (vsoch 9/6/2024)
    • size 8 (vsoch 9/7/2024)
    • size 16 (vsoch 9/8/2024)
    • size 32 (vsoch 9/8/2024)
    • quicksilver and osu all reduce need runs at all sizes (vsoch 9/9/2024)

Kubernetes

  • Microsoft Azure AKS CPU
    • size 32 (vsoch done 8/24/2024), redone with placement (vsoch 8/28/2024)
    • size 64 (vsoch done 8/24/2024), redone with placement (vsoch 8/28/2024)
    • size 128 (vsoch done 8/28/2024)
    • size 256 (vsoch TBA 8/29/2024)
  • Google Cloud GKE CPU
    • size 32 (vsoch done 8/21/2024)
    • size 64 (vsoch done 8/22/2024)
    • size 128 (vsoch done 8/23/2024)
    • size 256 (vsoch done 8/23/2024)
  • AWS CPU EKS
    • size 32 (vsoch done 8/21/2024-8/22/2024)
    • size 64 (vsoch done 8/22/2024)
    • size 128 (vsoch done 8/22/2024)
    • size 256 (vsoch done on 8/31/2024)
  • AWS GPU EKS
    • size 4 (done vsoch 8/26/2024, milroy lammps/osu 8/27/2024)
    • size 8 (done vsoch 8/26/2024, milroy lammps/osu 8/27/2024)
    • size 16 (done vsoch, milroy lammps/osu 8/27/2024)
    • size 32 (not possible, could not get more than 16 nodes from AWS)
  • Google Cloud GKE GPU
    • size 4 (done vsoch 8/29/2024)
    • size 8 (done vsoch TBA 8/29/2024)
    • size 16 (done vsoch 8/30/2024)
    • size 32 (done vsoch 8/30/2024)
    • milroy figured out installing latest drivers - key to success here!
  • Microsoft Azure AKS GPU
    • size 4 (done vsoch 8/31/2024)
    • size 8 (done vsoch 8/31/2024)
    • size 16 (done vsoch 8/31/2024)
    • size 32 (done vsoch 8/31/2024)
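
The two lists above amount to a matrix of cloud, environment, architecture, and size. As a rough illustration only (this is not code from the repository), the matrix can be sketched in Python as below. The short labels such as `parallelcluster` and `compute-engine`, and the function name `listed_configs`, are hypothetical and do not reflect the repository's directory names. The sketch only excludes entries the lists explicitly mark as not run; it makes no claim about the status of any other entry.

```python
from itertools import product

# Sizes as listed above: CPU studies use 32-256, GPU studies use 4-32.
SIZES = {"cpu": (32, 64, 128, 256), "gpu": (4, 8, 16, 32)}

# Hypothetical short labels for the two environment types per cloud
# ("bare metal" style first, Kubernetes second).
ENVIRONMENTS = {
    "azure": ("cyclecloud", "aks"),
    "aws": ("parallelcluster", "eks"),
    "google": ("compute-engine", "gke"),
}

# Whole environments marked as not run, with the reason given in the list.
SKIPPED_ENVIRONMENTS = {
    ("aws", "parallelcluster", "gpu"): "could not build image",
}

# Individual sizes marked as not run.
SKIPPED_SIZES = {
    ("aws", "eks", "gpu", 32): "could not get more than 16 nodes from AWS",
}


def listed_configs():
    """Yield (cloud, environment, arch, size) tuples not marked as skipped."""
    for cloud, envs in ENVIRONMENTS.items():
        # Iterating SIZES yields its keys, "cpu" and "gpu".
        for env, arch in product(envs, SIZES):
            if (cloud, env, arch) in SKIPPED_ENVIRONMENTS:
                continue
            for size in SIZES[arch]:
                if (cloud, env, arch, size) not in SKIPPED_SIZES:
                    yield cloud, env, arch, size


if __name__ == "__main__":
    configs = list(listed_configs())
    print(f"{len(configs)} configurations not marked as skipped")
    for key, reason in {**SKIPPED_ENVIRONMENTS, **SKIPPED_SIZES}.items():
        print("skipped:", *key, "-", reason)
```

Keeping the skip reasons next to the matrix makes gaps in any later comparison across clouds easy to trace back to their cause.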

License

HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.

See LICENSE, COPYRIGHT, and NOTICE for details.

SPDX-License-Identifier: (MIT)

LLNL-CODE- 842614
