aws-samples · nithiyn · Apr 28, 2025
diff --git a/1.infrastructure/1_setup_cluster/nxd-inference/Create_EKS_Cluster.md b/1.infrastructure/1_setup_cluster/nxd-inference/Create_EKS_Cluster.md
@@ -0,0 +1,181 @@
+# Steps to create EKS cluster with EFS
+
+In this example we create an EKS cluster consisting of one `trn1.32xlarge` compute node. 
+### a. Configure AWS CLI
+
+```
+aws configure
+```
+
+### b. Create a config file for EKS cluster creation
+
+We have provided an example file here: [trn1-nxd-cluster-config.yaml](./trn1-nxd-cluster-config.yaml)
+
+```
+apiVersion: eksctl.io/v1alpha5
+kind: ClusterConfig
+
+metadata:
+  name: nxdi-inference-cluster
+  region: $REGION
+  version: "1.30"
+
+vpc:
+  id: $PLACEHOLDER_VPC_ID
+  subnets:
+    private:
+      $AVAILABILITY_ZONE_1:
+        id: $PLACEHOLDER_SUBNET_PRIVATE_1
+    public:
+      $AVAILABILITY_ZONE_1:
+        id: $PLACEHOLDER_SUBNET_PUBLIC_1
+      $AVAILABILITY_ZONE_2:
+        id: $PLACEHOLDER_SUBNET_PUBLIC_2
+
+  clusterEndpoints:
+    privateAccess: true
+    publicAccess: true
+
+cloudwatch:
+  clusterLogging:
+    enableTypes: ["*"]  
+
+iam:
+  withOIDC: true
+  # Service accounts for any controllers you may want to install. Uncomment as required.
+  serviceAccounts:
+    - metadata:
+        name: efs-csi-controller-sa
+        namespace: kube-system
+      wellKnownPolicies:
+        # ebsCSIController: true    # Adds policies for using the ebs-csi-controller
+        efsCSIController: true      # Adds policies for using the efs-csi-controller
+
+addons:
+  - name: vpc-cni
+    version: v1.18.1-eksbuild.1
+    configurationValues: '{"env":{"ENABLE_PREFIX_DELEGATION":"true", "ENABLE_POD_ENI":"true", "POD_SECURITY_GROUP_ENFORCING_MODE":"standard"},"enableNetworkPolicy": "true"}'
+    resolveConflicts: overwrite      
+  - name: amazon-cloudwatch-observability
+    version: v1.16.4-eksbuild.1
+  # - name: aws-ebs-csi-driver
+  #   version: v1.26.0-eksbuild.1
+  - name: aws-efs-csi-driver     
+    version: v2.1.9-eksbuild.1  
+
+managedNodeGroups:
+  - name: trn-compute-node-group
+    instanceType: trn1.32xlarge
+    instancePrefix: trtllm-compute-node 
+    subnets:
+      - $PLACEHOLDER_SUBNET_PRIVATE_1
+    privateNetworking: true
+    efaEnabled: true
+    minSize: 0
+    desiredCapacity: 1
+    maxSize: 2
+    volumeSize: 500
+    # comment out capacityReservation if you do not need ODCR
+    #capacityReservation:
+      #capacityReservationTarget:
+        #capacityReservationID: "$CR_ID"
+    ami: ami-07c8bc6b0bb890e9e
+    amiFamily: AmazonLinux2
+    ssh:
+      publicKeyName: $PUBLIC_KEYPAIR_NAME
+      sourceSecurityGroupIds: [$SECURITY_GROUP_IDS]
+    updateConfig:
+      maxUnavailablePercentage: 50
+    iam:
+      withAddonPolicies:
+        externalDNS: true
+        certManager: true
+        autoScaler: true
+        cloudWatch: true
+        ebs: true
+        efs: true
+        fsx: true
+        imageBuilder: true
+        xRay: true
+        awsLoadBalancerController: true
+        albIngress: true
+
+```
+
+
+### c. Create the EKS cluster
+
+```
+eksctl create cluster -f trn1-nxd-cluster-config.yaml
+```
+
+## 3. (Optional) Capacity Blocks
+
+If you have Capacity Blocks for P5 or P4 instances, you can follow the [steps here](https://github.com/aws-samples/awsome-inference/tree/main/1.infrastructure#capacity-blocks) to create a self-managed nodegroup and add it to your existing EKS cluster.
+
+
+## 4. Create an EFS file system
+
+To let multiple pods on multiple nodes load shards of the same model, so that they can work in coordination to serve inference requests too large to be loaded by a single accelerator, we need a common, shared storage location. In Kubernetes, such shared storage locations are called persistent volumes. A persistent volume can be mounted into any number of pods and accessed by processes running inside those pods as if it were part of the pod's file system. We will use EFS as the persistent volume.
+
+Additionally, we will need to create a persistent volume claim, which we can use to assign the persistent volume to a pod.
+
+### a. Create an IAM role
+
+Follow the steps to create an IAM role for your EFS file system: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-create-iam-resources. This role will be used later when you install the EFS CSI Driver.
+
+### b. Install EFS CSI driver
+
+Install the EFS CSI Driver as an Amazon EKS add-on through the AWS console: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-install-driver. Once it is installed, check the Add-ons section in the EKS console; the driver should show `Active` under Status.
+
+### c. Create EFS file system
+
+Create an encrypted file system with elastic throughput, then wait until it becomes available:
+```bash
+# Create EFS filesystem (Elastic, encrypted)
+EFS_JSON=$(aws efs create-file-system \
+  --region us-west-2 \
+  --creation-token "neuron-models-$(date +%s)" \
+  --performance-mode generalPurpose \
+  --throughput-mode elastic \
+  --encrypted \
+  --tags Key=Name,Value=neuron-inference)
+
+EFS_ID=$(echo "$EFS_JSON" | jq -r '.FileSystemId')
+
+# Wait until available (avoids race)
+until [ "$(aws efs describe-file-systems \
+            --region us-west-2 \
+            --file-system-id "$EFS_ID" \
+            --query 'FileSystems[0].LifeCycleState' \
+            --output text)" = "available" ]; do
+  echo "waiting for EFS $EFS_ID ..."
+  sleep 3
+done
+echo "EFS is available."
+```
+### d. Create mount targets (one per AZ your nodes use)
+```bash
+for subnet in subnet-xxx subnet-yyy subnet-zzz; do
+  aws efs create-mount-target \
+    --region us-west-2 \
+    --file-system-id "$EFS_ID" \
+    --subnet-id "$subnet" \
+    --security-groups sg-your-efs-security-group
+done
+```
+### e. Create StorageClass
+```bash
+cat <<EOF | kubectl apply -f -
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+  name: efs-sc
+provisioner: efs.csi.aws.com
+parameters:
+  provisioningMode: efs-ap
+  fileSystemId: ${EFS_ID}
+  directoryPerms: "0755"
+reclaimPolicy: Retain
+volumeBindingMode: Immediate
+allowVolumeExpansion: true
+EOF
+
+```
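The guide above mentions a persistent volume claim but stops at the StorageClass. A minimal sketch of a claim against the `efs-sc` class could look like the following. The claim name, the `default` namespace, and the `100Gi` request are hypothetical; Kubernetes requires a storage request, although EFS itself is elastic and does not enforce it.

```bash
#!/usr/bin/env bash
# Sketch only: the claim name, namespace, and size are illustrative
# assumptions, not values taken from the guide.

# Print a PersistentVolumeClaim manifest that uses the efs-sc StorageClass.
# Usage: render_efs_pvc <claim-name> [namespace] [size]
render_efs_pvc() {
  local claim_name="$1"
  local namespace="${2:-default}"
  local size="${3:-100Gi}"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${claim_name}
  namespace: ${namespace}
spec:
  accessModes:
    - ReadWriteMany   # EFS can be mounted by pods on several nodes at once
  storageClassName: efs-sc
  resources:
    requests:
      storage: ${size}   # required field; EFS does not enforce the size
EOF
}
```

Piping the output to `kubectl apply -f -`, for example `render_efs_pvc neuron-models-pvc | kubectl apply -f -`, would create the claim, and `kubectl get pvc neuron-models-pvc` should then report its binding status.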
diff --git a/1.infrastructure/1_setup_cluster/nxd-inference/trn1-nxd-cluster-config.yaml b/1.infrastructure/1_setup_cluster/nxd-inference/trn1-nxd-cluster-config.yaml
@@ -0,0 +1,85 @@
+apiVersion: eksctl.io/v1alpha5
+kind: ClusterConfig
+
+metadata:
+  name: $CLUSTER_NAME
+  region: $REGION
+  version: "1.30"
+
+vpc:
+  id: $VPC_ID
+  subnets:
+    private:
+      us-west-2c:
+        id: $PRIVATE_SUBNET_1
+      us-west-2d:
+        id: $PRIVATE_SUBNET_2
+    public:
+      us-west-2c:
+        id: $PUBLIC_SUBNET_1
+      us-west-2d:
+        id: $PUBLIC_SUBNET_2
+  clusterEndpoints:
+    privateAccess: true
+    publicAccess: true
+
+cloudwatch:
+  clusterLogging:
+    enableTypes: ["*"]
+
+iam:
+  withOIDC: true
+  serviceAccounts:
+    - metadata:
+        name: efs-csi-controller-sa
+        namespace: kube-system
+      wellKnownPolicies:
+        efsCSIController: true
+
+addons:
+  - name: vpc-cni
+    version: v1.18.1-eksbuild.1
+    resolveConflicts: overwrite
+    configurationValues: |
+      env:
+        ENABLE_PREFIX_DELEGATION: "true"
+        ENABLE_POD_ENI: "true"
+        POD_SECURITY_GROUP_ENFORCING_MODE: "standard"
+      enableNetworkPolicy: "true"
+
+  - name: amazon-cloudwatch-observability
+    version: v1.16.4-eksbuild.1
+
+  - name: aws-efs-csi-driver
+    version: v2.1.9-eksbuild.1
+
+managedNodeGroups:
+  - name: trn-compute-node-group
+    instanceType: trn1.32xlarge
+    amiFamily: AmazonLinux2023
+    subnets:
+      - $PRIVATE_SUBNET_1
+    privateNetworking: true
+    efaEnabled: true
+    minSize: 0
+    desiredCapacity: 1
+    maxSize: 2
+    volumeSize: 500
+    capacityReservation:
+      capacityReservationTarget:
+        capacityReservationId: $CAPACITY_RESERVATION_ID
+    updateConfig:
+      maxUnavailablePercentage: 50
+    iam:
+      withAddonPolicies:
+        externalDNS: true
+        certManager: true
+        autoScaler: true
+        cloudWatch: true
+        ebs: true
+        efs: true
+        fsx: true
+        imageBuilder: true
+        xRay: true
+        awsLoadBalancerController: true
+        albIngress: true
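The config file above uses shell-style placeholders such as `$CLUSTER_NAME` and `$VPC_ID`, which need real values before the file is passed to `eksctl`. One possible way to fill them in, sketched below, is `envsubst` from GNU gettext, preceded by a check that every placeholder has a matching exported variable. The choice of `envsubst` and all function names are assumptions; the PR does not say which tool to use.

```bash
#!/usr/bin/env bash
# Sketch only: envsubst (GNU gettext) is an assumed tool choice, and the
# function names are hypothetical.

# Print each distinct $NAME placeholder found in a file, one per line.
list_placeholders() {
  grep -o '\$[A-Z_][A-Z0-9_]*' "$1" | sort -u | sed 's/^\$//'
}

# Print placeholders that have no non-empty exported environment variable.
# Returns non-zero if any are missing.
missing_placeholders() {
  local name missing=0
  for name in $(list_placeholders "$1"); do
    if [ -z "$(printenv "$name")" ]; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Render the template into a new file once every placeholder is set.
# envsubst silently replaces unset variables with nothing, hence the check.
render_config() {
  local template="$1" output="$2" unset_names
  unset_names=$(missing_placeholders "$template") || {
    echo "unset placeholders: $unset_names" >&2
    return 1
  }
  envsubst < "$template" > "$output"
}
```

After exporting the variables, `render_config trn1-nxd-cluster-config.yaml cluster.rendered.yaml` would write a file that can be passed to `eksctl create cluster -f`. Note that the availability-zone keys `us-west-2c` and `us-west-2d` in the file are literal values rather than placeholders, so they still need a manual edit for other regions or zones.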
diff --git a/2.projects/README.md b/2.projects/README.md
@@ -10,6 +10,8 @@ The major components of this directory are:
 |-- ray-service/                
 |-- multinode-triton-trtllm-inference/
 |-- mixture-of-agents/
+|-- neuronx-distributed/
+|-- mig-gpu-partitioning/
 `-- ...
 // Other directories
 ```
@@ -80,6 +82,27 @@ Recent advances in large language models (LLMs) have shown substantial capabilit
 2. [2.projects/mixture-of-agents/outputs/](/2.projects/mixture-of-agents/outputs/): This directory consists of output of 2-layers MoA.
 3. [2.projects/mixture-of-agents/alpaca_eval](/2.projects/mixture-of-agents/alpaca_eval/): This directory is from AlpacaEval GitHub repository and consists of results of Anthropic Claude 3.5 Sonnet. These results are used during evaluation.
 
+## NEURONX DISTRIBUTED INFERENCE - FUSED SPECULATIVE DECODING on EKS
+
+This project demonstrates deploying Large Language Models using **NeuronX Distributed Inference (NxDI)** with **vLLM-Neuron** on Amazon EKS with AWS Trainium instances (trn1.32xlarge). The solution showcases advanced inference optimization techniques including **fused draft speculative decoding** for enhanced performance.
+
+### Key Features 
+
+**Inference Optimization:**
+- **Fused Speculative Decoding**: Combines target model (e.g., Qwen3-32B) with draft model (e.g., Qwen3-0.6B) for accelerated token generation
+
+**Configuration:**
+- **Toggle Speculation**: Easy switching between speculative and standard inference modes
+- **Configurable Parameters**: Speculation length, batch size, sequence length, and context window
+- **Model Agnostic**: Supports various model architectures with proper configuration
+
+### Files & Directories
+1. `nxd-inference-eks/`: Complete EKS deployment guide with Kubernetes manifests
+2. `fused-SD/manifests/`: Kubernetes YAML files for download, compilation, deployment, and monitoring
+3. Infrastructure setup guides for EKS cluster creation with Trainium node groups and EFS storage
+
+See [neuronx-distributed/nxd-inference-eks](https://github.com/aws-samples/awsome-inference/tree/main/2.projects/neuronx-distributed/nxd-inference-eks) for detailed implementation guide.
+
 ## MIG
 These days, the challenge with ML Inference workloads, is that not all workloads require the same amount of compute resources. With accelerated instances like the Amazon EC2 P5 (p5.48xlarge / p5e.48xlarge), or the Amazon EC2 P4 (p4d.24xlarge / p4de.24xlarge), customers would need to pay for the full instance of 8 GPUs. Additionally, some workloads may be too small to even run on a single GPU! To learn more about the specifics of GPU EC2 instances, check out this developer guide.
 

diff --git a/2.projects/neuronx-distributed/README.md b/2.projects/neuronx-distributed/README.md
@@ -0,0 +1,83 @@
+# NeuronX Distributed Inference on AWS
+
+This directory contains examples for deploying Large Language Models using **NeuronX Distributed Inference (NxDI)** on AWS Trainium instances. NxDI is an open-source PyTorch-based inference library that simplifies deep learning model deployment on AWS Inferentia and Trainium instances, offering advanced inference capabilities including continuous batching and speculative decoding.
+
+## What is NeuronX Distributed Inference?
+
+[NeuronX Distributed Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html) (NxDI) was introduced in Neuron SDK 2.21. It provides high-performance inference on AWS Trainium and Inferentia instances, with features such as continuous batching, speculative decoding, and vLLM integration for production deployments.
+
+## Examples
+
+This directory contains two deployment approaches for different use cases:
+
+### 1. EKS Deployment (`nxd-inference-eks/`)
+
+**Kubernetes deployment**
+
+- **Models**: Qwen3, Llama3, and other large language models
+- **Instance Types**: `trn1.32xlarge`, `trn2.48xlarge`
+- **Key Features**:
+  - Kubernetes-native deployment 
+  - EFS shared storage for model artifacts
+  - Neuron monitoring and observability
+  - Support for standard and speculative decoding modes
+  - Load balancing with Application Load Balancer
+
+**[→ See EKS Deployment Guide](nxd-inference-eks/README.md)**
+
+### 2. EC2 Deployment (`nxdi-ec2-vllm/`)
+
+**Direct EC2 deployment for development and testing**
+
+- **Use Case**: Development, experimentation, and simpler production setups
+- **Models**: Mistral Small 24B (primary example) and other NxDI-compatible models
+- **Instance Types**: `trn1.32xlarge`
+- **Key Features**:
+  - Jupyter notebook-based setup and experimentation
+  - Direct vLLM integration
+  - Performance benchmarking tools (`llmperf`, `lm_eval`)
+  - Profiling and optimization capabilities
+
+**[→ See EC2 Deployment Guide](nxdi-ec2-vllm/README.md)**
+
+## Choosing the Right Deployment
+
+| Requirement | EKS Deployment | EC2 Deployment |
+|-------------|----------------|----------------|
+| **Production workloads** | ✅ Recommended | ⚠️ Limited scalability |
+| **Development/Testing** | ⚠️ Complex setup | ✅ Recommended |
+| **High availability** | ✅ Multi-AZ support | ❌ Single instance |
+| **Setup complexity** | ⚠️ Moderate | ✅ Simple |
+| **Cost optimization** | ✅ Scale to zero | ❌ Always running |
+
+## Prerequisites
+
+Before using either example, ensure you have:
+
+1. **AWS Account** with appropriate permissions for Trainium instances
+2. **VPC Setup**: Use the provided template in [`1.infrastructure/0_setup_vpc/trn-vpc-example.yaml`](../../1.infrastructure/0_setup_vpc/trn-vpc-example.yaml)
+3. **Instance Quotas**: Sufficient quota for Trainium instances in your target region
+4. **HuggingFace Token**: For downloading gated models (Llama, etc.)
+
+## Getting Started
+
+1. **Choose your deployment approach** based on your use case
+2. **Set up the prerequisite infrastructure** (VPC, quotas)
+3. **Follow the specific README** for your chosen deployment method
+4. **Configure your models and parameters** according to your requirements
+
+## Performance Characteristics
+
+Both examples demonstrate high-performance inference capabilities:
+
+- **Throughput**: Up to 100+ tokens/second for 70B models on `trn1.32xlarge`
+- **Latency**: Sub-second time-to-first-token with speculative decoding
+- **Efficiency**: Optimized memory usage with tensor parallelism
+- **Scalability**: Horizontal scaling (EKS) or vertical optimization (EC2)
+
+## Support and Resources
+
+- [AWS Neuron Documentation](https://awsdocs-neuron.readthedocs-hosted.com/)
+- [NeuronX Distributed Inference Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/)
+- [vLLM Documentation](https://docs.vllm.ai/)
+- [AWS Trainium Developer Guide](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia.html)
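Both deployments described in this README serve models through vLLM, which exposes an OpenAI-compatible HTTP API. As a rough smoke test once a server is up, a request along the following lines can confirm that it responds. The base URL `http://localhost:8000` (vLLM's default port, reachable for example through a `kubectl port-forward`), the model name, and the helper names are all placeholders and assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: base URL, model name, and helper names are assumptions.

# Build a JSON body for the OpenAI-compatible /v1/completions endpoint.
# Note: the prompt is inserted verbatim, so prompts containing double
# quotes or backslashes would need escaping (for example with jq).
# Usage: completion_payload <model> <prompt> [max_tokens]
completion_payload() {
  local model="$1" prompt="$2" max_tokens="${3:-64}"
  printf '{"model":"%s","prompt":"%s","max_tokens":%s,"temperature":0}\n' \
    "$model" "$prompt" "$max_tokens"
}

# Send one completion request and print the raw JSON response.
# Usage: smoke_test <base-url> <model>
smoke_test() {
  local base_url="$1" model="$2"
  curl -sS --fail "$base_url/v1/completions" \
    -H 'Content-Type: application/json' \
    -d "$(completion_payload "$model" 'Hello, my name is' 32)"
}
```

The model name must match the one the server was started with; `curl -sS http://localhost:8000/v1/models` lists the names the server accepts. A call such as `smoke_test http://localhost:8000 <model-name>` should then return a JSON document with a `choices` array.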