Genomic Workflow Deployment on AWS ParallelCluster Using Nextflow and SLURM

Overview

This documentation provides a comprehensive design guide for building a scalable, secure, and reproducible High-Performance Computing (HPC) environment on Amazon Web Services (AWS), tailored specifically to genomic Nextflow workflows.

The core of the infrastructure is built using AWS ParallelCluster, managed as Infrastructure as Code (IaC) via Terraform.

Published documentation: https://naratech-platforms.gitbook.io/genomics-nf-hpc-on-aws-parallelcluster/

  • Production-grade genomic variant discovery pipeline
  • AWS ParallelCluster with SLURM scheduler
  • CPU and GPU partitions for optimized workload execution
  • Nextflow workflow orchestration (see the shell sketch after this list)
  • Spack + Lmod for software management
  • FSx for Lustre and EFS for high-performance storage
  • Wazuh security monitoring on ECS Fargate
  • Prometheus/Grafana observability stack
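
As a rough illustration of how these pieces meet at the command line, the sketch below shows a hypothetical session on the head node. The partition names (`cpu`, `gpu`), the Lmod module name, the profile name, and the scratch path are assumptions for illustration and are not taken from this repository.

```bash
# Hypothetical head-node session; partition, module, profile, and path
# names below are assumptions, not taken from this repository.
sinfo                          # confirm the cpu and gpu partitions are up

module load nextflow           # Lmod module built via Spack (name assumed)

# Nextflow submits each process as a SLURM job; ParallelCluster then
# autoscales the matching partition up and back down.
nextflow run main.nf \
  -profile slurm \
  -work-dir /fsx/scratch/work  # FSx for Lustre scratch (path assumed)
```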

Architecture Diagram

```mermaid
flowchart TD
    subgraph "Public Internet"
        User((User))
    end

    subgraph "AWS Cloud - VPC"
        subgraph "Public Subnet"
            Bastion[Bastion Host<br/>or SSM / VPN Gateway]
        end

        subgraph "Private Subnet"
            subgraph "Management Layer"
                HeadNode[Head Node<br/>SLURM Controller<br/>Nextflow Engine<br/>Spack/Lmod]
                WazuhMgr[Wazuh Manager<br/>Security Monitoring]
                PromGraf[Prometheus & Grafana<br/>Observability Stack]
            end

            subgraph "Compute Layer (Autoscaling)"
                subgraph "CPU Partition"
                    CPUNodes[c6a/c7i Instances<br/>Alignment, QC, Pre-processing]
                end
                subgraph "GPU Partition"
                    GPUNodes[g5/g6 Instances<br/>DeepVariant, GPU Acceleration]
                end
            end

            subgraph "Storage Layer"
                EFS[(Amazon EFS<br/>Home Dirs, Scripts)]
                FSx[(FSx for Lustre<br/>Scratch, High-perf I/O)]
                S3[(Amazon S3<br/>Raw Data, Long-term Storage)]
            end
        end
    end

    User --> Bastion
    
    Bastion --> HeadNode
    HeadNode --> CPUNodes
    HeadNode --> GPUNodes
    CPUNodes --- FSx
    GPUNodes --- FSx
    HeadNode --- EFS
    CPUNodes --- EFS
    GPUNodes --- EFS
    FSx <--> S3
    
    WazuhMgr --> HeadNode
    WazuhMgr -.-> CPUNodes
    WazuhMgr -.-> GPUNodes
    PromGraf -.-> HeadNode
    PromGraf -.-> CPUNodes
    PromGraf -.-> GPUNodes

    linkStyle default stroke-width:2px,fill:none,stroke:#F472B6,stroke-dasharray: 5 5,animation:flow
    
    classDef default fill:#1F2937,stroke:#22D3EE,color:#E5E7EB,stroke-width:2px;
    classDef storage fill:#1F2937,stroke:#22D3EE,color:#E5E7EB,stroke-width:2px,stroke-dasharray: 5 5;
    class EFS,FSx,S3 storage;
```

Quick Navigation

| Section | Description |
| --- | --- |
| Project Overview | Objectives, design principles, and target audience |
| System Architecture | Component breakdown and deployment model |
| Technology Stack | Compute, storage, and software decisions |
| Terraform Provisioning | Infrastructure as Code setup |
| Workflow Design | Nextflow pipeline execution flow |
| Security & Observability | Wazuh, Prometheus, and Grafana integration |
| Cost Optimization | Strategies for minimizing TCO |
| Conclusion | Summary and future enhancements |
| References | External resources and citations |
| Developer Guidance | SSM access, SLURM, and GPU management |
| Troubleshooting | GPU and CUDA issue resolution |
| Validation Checklist | Post-deployment verification steps |
| Ansible Post-Provisioning | Post-provision configuration and Lmod setup |

Getting Started

  1. Review the Project Overview to understand the goals
  2. Study the System Architecture for component understanding
  3. Follow the Terraform Provisioning guide to deploy infrastructure (a deployment sketch follows this list)
  4. Use the Validation Checklist to verify deployment
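
As a concrete but unverified illustration of steps 3 and 4, a deployment might look like the sketch below. It assumes a conventional Terraform layout under `terraform/` and a reachable head node; the flags, paths, and partition name are assumptions, not commands confirmed by this repository.

```bash
# Hypothetical deployment and smoke test; flags, paths, and partition
# names are assumptions.
cd terraform/
terraform init
terraform plan -out=tfplan
terraform apply tfplan

# On the head node, verify that SLURM sees both partitions:
sinfo
srun --partition=cpu hostname   # partition name assumed
```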

Repository Structure

```
hpc-genomics-nf/
├── README.md            # Project overview (this file)
├── docs/                # GitBook documentation
├── ansible/             # Post-provision configuration
├── terraform/           # Infrastructure as Code
├── nextflow/            # Pipeline definitions
└── modules/             # Reusable components
```
