29 commits
962b70c
Create nxdi.txt
nithiyn Apr 28, 2025
7a818ca
commit mistral small nxdi-vllm deployment guide
nithiyn Apr 28, 2025
7e06335
Create imgs sub directory
nithiyn Apr 28, 2025
c7187d3
upload profiler image
nithiyn Apr 28, 2025
7a830a4
feat: 1p1d manifest and markdowns
nithiyn Jul 6, 2025
8638a20
commit tp32 updates trn1
nithiyn Jul 8, 2025
40f9be8
feat: updated FSD artifacts
nithiyn Jul 9, 2025
98857fb
commit updates
nithiyn Jul 15, 2025
fbf35f0
feat- docs and md
nithiyn Jul 29, 2025
54670bd
commit chnages cfn
nithiyn Aug 10, 2025
709a2b1
commit deletions
nithiyn Aug 10, 2025
f5227ac
Add NXD cluster configuration
nithiyn Aug 28, 2025
5a5ab63
Merge NXDI benchmarking profiling content
nithiyn Aug 28, 2025
2d0a0f8
Add NXD inference manifests and documentation (clean)
nithiyn Sep 5, 2025
6fbfd8f
cleanup template files
nithiyn Sep 6, 2025
5d819a0
fix: remove prefix caching, compiling working, revert to qwen, remove…
nithiyn Sep 9, 2025
6d952b6
commit final updates
nithiyn Sep 12, 2025
f79b745
Update README.md
nithiyn Sep 12, 2025
9f2e03b
Update README.md
nithiyn Sep 12, 2025
81b4863
Update README.md
nithiyn Sep 12, 2025
3c286e1
Update README.md
nithiyn Sep 12, 2025
6798235
Update README.md
nithiyn Sep 12, 2025
84533b6
Delete 2.projects/neuronx-distributed/nxd-inference-eks/.env
nithiyn Sep 13, 2025
c3e0f06
Update Create_EKS_Cluster.md
nithiyn Sep 13, 2025
dad2233
Update Create_EKS_Cluster.md
nithiyn Sep 13, 2025
e107819
Update Create_EKS_Cluster.md
nithiyn Sep 13, 2025
a20e568
Update Create_EKS_Cluster.md
nithiyn Sep 13, 2025
c12fb84
Update README.md
nithiyn Sep 13, 2025
9e8b1be
fix: env var mismatch on target dir name
nithiyn Sep 15, 2025
181 changes: 181 additions & 0 deletions 1.infrastructure/1_setup_cluster/nxd-inference/Create_EKS_Cluster.md
@@ -0,0 +1,181 @@
# Steps to create EKS cluster with EFS

In this example we create an EKS cluster consisting of one `trn1.32xlarge` compute node.
### a. Configure AWS CLI

```
aws configure
```

### b. Create a config file for EKS cluster creation

We provide an example file here: [trn1-nxd-cluster-config.yaml](./trn1-nxd-cluster-config.yaml). Replace the `$`-prefixed placeholders with values for your environment.

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: nxdi-inference-cluster
  region: $REGION
  version: "1.30"

vpc:
  id: $PLACEHOLDER_VPC_ID
  subnets:
    private:
      $AVAILABILITY_ZONE_1:
        id: $PLACEHOLDER_SUBNET_PRIVATE_1
    public:
      $AVAILABILITY_ZONE_1:
        id: $PLACEHOLDER_SUBNET_PUBLIC_1
      $AVAILABILITY_ZONE_2:
        id: $PLACEHOLDER_SUBNET_PUBLIC_2

  clusterEndpoints:
    privateAccess: true
    publicAccess: true

cloudwatch:
  clusterLogging:
    enableTypes: ["*"]

iam:
  withOIDC: true
  # Service accounts for any controllers you may want to install. Uncomment as required.
  serviceAccounts:
    - metadata:
        name: efs-csi-controller-sa
        namespace: kube-system
      wellKnownPolicies:
        # ebsCSIController: true  # Adds policies for using the ebs-csi-controller
        efsCSIController: true    # Adds policies for using the efs-csi-controller

addons:
  - name: vpc-cni
    version: v1.18.1-eksbuild.1
    configurationValues: '{"env":{"ENABLE_PREFIX_DELEGATION":"true", "ENABLE_POD_ENI":"true", "POD_SECURITY_GROUP_ENFORCING_MODE":"standard"},"enableNetworkPolicy": "true"}'
    resolveConflicts: overwrite
  - name: amazon-cloudwatch-observability
    version: v1.16.4-eksbuild.1
  # - name: aws-ebs-csi-driver
  #   version: v1.26.0-eksbuild.1
  - name: aws-efs-csi-driver
    version: v2.1.9-eksbuild.1

managedNodeGroups:
  - name: trn-compute-node-group
    instanceType: trn1.32xlarge
    instancePrefix: trtllm-compute-node
    subnets:
      - $PLACEHOLDER_SUBNET_PRIVATE_1
    privateNetworking: true
    efaEnabled: true
    minSize: 0
    desiredCapacity: 1
    maxSize: 2
    volumeSize: 500
    # uncomment capacityReservation if you use an On-Demand Capacity Reservation (ODCR)
    #capacityReservation:
    #  capacityReservationTarget:
    #    capacityReservationID: "$CR_ID"
    ami: ami-07c8bc6b0bb890e9e
    amiFamily: AmazonLinux2
    ssh:
      publicKeyName: $PUBLIC_KEYPAIR_NAME
      sourceSecurityGroupIds: [$SECURITY_GROUP_IDS]
    updateConfig:
      maxUnavailablePercentage: 50
    iam:
      withAddonPolicies:
        externalDNS: true
        certManager: true
        autoScaler: true
        cloudWatch: true
        ebs: true
        efs: true
        fsx: true
        imageBuilder: true
        xRay: true
        awsLoadBalancerController: true
        albIngress: true
```
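
The config above uses shell-style placeholders such as `$REGION`. As far as we are aware, eksctl reads the file as written and does not expand them, so they need to be substituted before the cluster is created. The following is a minimal sketch of one way to do that, assuming `envsubst` (from GNU gettext) is installed; the helper names `missing_vars` and `render_config` and the rendered file name are hypothetical and not part of this repository.

```bash
# List placeholder names used in a template that are not set (exported) in the
# current environment. Both $NAME and ${NAME} forms are recognized.
missing_vars() {
  local name
  grep -oE '\$\{?[A-Z_][A-Z0-9_]*\}?' "$1" | tr -d '${}' | sort -u |
    while read -r name; do
      [ -n "$(printenv "$name")" ] || echo "$name"
    done
}

# Render the template with envsubst, refusing to continue while any
# placeholder is unset (envsubst would silently replace it with nothing).
render_config() {
  local template="$1" output="$2" missing
  missing="$(missing_vars "$template")"
  if [ -n "$missing" ]; then
    echo "unset placeholders: $missing" >&2
    return 1
  fi
  envsubst < "$template" > "$output"
}

# Example (file names are assumptions):
#   export REGION=us-west-2 PLACEHOLDER_VPC_ID=vpc-...
#   render_config trn1-nxd-cluster-config.yaml cluster.rendered.yaml
```

Checking for unset variables first matters because `envsubst` replaces an unset variable with an empty string, which would produce a config that fails later with a less obvious error.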


### c. Create the EKS cluster

```
eksctl create cluster -f trn1-nxd-cluster-config.yaml
```
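
Once `eksctl` returns, it can be useful to confirm that the Trainium node has joined the cluster. The sketch below is one way to count ready nodes; it assumes `kubectl` is pointed at the new cluster, and `count_ready_nodes` is a hypothetical helper name. Nodes reported as `Ready,SchedulingDisabled` are deliberately not counted.

```bash
# Count nodes whose STATUS column is exactly "Ready".
# Expects the output of `kubectl get nodes --no-headers` on stdin
# (columns: NAME STATUS ROLES AGE VERSION).
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Example (requires cluster access):
#   kubectl get nodes --no-headers | count_ready_nodes
```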

## 3. (Optional) Capacity Blocks

If you have Capacity Blocks, you can follow the [steps here](https://github.com/aws-samples/awsome-inference/tree/main/1.infrastructure#capacity-blocks) to create a self-managed nodegroup and add it to your existing EKS cluster. Note that the linked steps are written for P5 and P4 instances.


## 4. Create an EFS file system

To serve a model that is too large to be loaded by a single accelerator, multiple pods deployed to multiple nodes must each load shards of the same model and coordinate to serve inference requests. That requires a common, shared storage location. In Kubernetes, such shared storage is exposed through persistent volumes. A persistent volume can be mounted into any number of pods and then accessed by processes running inside those pods as if it were part of the pod's file system. We will use EFS as the persistent volume.

Additionally, we will need to create a persistent volume claim, which we can use to assign the persistent volume to a pod.

### a. Create an IAM role

Follow the steps to create an IAM role for your EFS file system: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-create-iam-resources. This role will be used later when you install the EFS CSI Driver.

### b. Install EFS CSI driver

Install the EFS CSI Driver through the Amazon EKS add-on in the AWS console: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-install-driver. Once installation completes, check the Add-ons section in the EKS console; the driver should show `Active` under Status.

### c. Create EFS file system

```bash
# Create EFS filesystem (Elastic, encrypted)
EFS_JSON=$(aws efs create-file-system \
  --region us-west-2 \
  --creation-token "neuron-models-$(date +%s)" \
  --performance-mode generalPurpose \
  --throughput-mode elastic \
  --encrypted \
  --tags Key=Name,Value=neuron-inference)

EFS_ID=$(echo "$EFS_JSON" | jq -r '.FileSystemId')

# Wait until available (avoids race)
until [ "$(aws efs describe-file-systems \
  --region us-west-2 \
  --file-system-id "$EFS_ID" \
  --query 'FileSystems[0].LifeCycleState' \
  --output text)" = "available" ]; do
  echo "waiting for EFS $EFS_ID ..."
  sleep 3
done
echo "EFS is available."
```
### d. Create mount targets (one per AZ your nodes use)

```bash
for subnet in subnet-xxx subnet-yyy subnet-zzz; do
  aws efs create-mount-target \
    --region us-west-2 \
    --file-system-id "$EFS_ID" \
    --subnet-id "$subnet" \
    --security-groups sg-your-efs-security-group
done
```
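
The security group attached to the mount targets must allow inbound NFS traffic (TCP port 2049) from the nodes, otherwise pods will hang when mounting the volume. The sketch below shows one way to add such a rule with the AWS CLI; `allow_nfs_from_cidr` is a hypothetical helper, and allowing the whole VPC CIDR is an assumption that you may want to narrow to the node security group instead.

```bash
# Allow inbound NFS (TCP 2049) to the EFS security group from a CIDR range.
# $1 = region, $2 = EFS security group ID, $3 = source CIDR (e.g. the VPC CIDR)
allow_nfs_from_cidr() {
  local region="$1" sg_id="$2" cidr="$3"
  aws ec2 authorize-security-group-ingress \
    --region "$region" \
    --group-id "$sg_id" \
    --protocol tcp \
    --port 2049 \
    --cidr "$cidr"
}

# Example (IDs and CIDR are placeholders):
#   allow_nfs_from_cidr us-west-2 sg-your-efs-security-group 10.0.0.0/16
```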
### e. Create StorageClass

```bash
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: ${EFS_ID}
  directoryPerms: "0755"
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true
EOF
```
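
The persistent volume claim mentioned at the start of this section can then reference the `efs-sc` StorageClass. The following is a minimal sketch rather than the manifest used by this project: the helper name `pvc_manifest`, the claim name, and the namespace are assumptions. EFS is elastic and does not enforce the requested size, but Kubernetes requires the `storage` field to be present.

```bash
# Print a PersistentVolumeClaim manifest that uses the efs-sc StorageClass.
# $1 = claim name (hypothetical), $2 = namespace
pvc_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
  namespace: $2
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi
EOF
}

# Example (requires cluster access):
#   pvc_manifest neuron-models default | kubectl apply -f -
```

`ReadWriteMany` is what lets pods on different nodes mount the same model artifacts at once.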
@@ -0,0 +1,85 @@
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $CLUSTER_NAME
  region: $REGION
  version: "1.30"

vpc:
  id: $VPC_ID
  subnets:
    private:
      us-west-2c:
        id: $PRIVATE_SUBNET_1
      us-west-2d:
        id: $PRIVATE_SUBNET_2
    public:
      us-west-2c:
        id: $PUBLIC_SUBNET_1
      us-west-2d:
        id: $PUBLIC_SUBNET_2
  clusterEndpoints:
    privateAccess: true
    publicAccess: true

cloudwatch:
  clusterLogging:
    enableTypes: ["*"]

iam:
  withOIDC: true
  serviceAccounts:
    - metadata:
        name: efs-csi-controller-sa
        namespace: kube-system
      wellKnownPolicies:
        efsCSIController: true

addons:
  - name: vpc-cni
    version: v1.18.1-eksbuild.1
    resolveConflicts: overwrite
    configurationValues: |
      env:
        ENABLE_PREFIX_DELEGATION: "true"
        ENABLE_POD_ENI: "true"
        POD_SECURITY_GROUP_ENFORCING_MODE: "standard"
      enableNetworkPolicy: "true"

  - name: amazon-cloudwatch-observability
    version: v1.16.4-eksbuild.1

  - name: aws-efs-csi-driver
    version: v2.1.9-eksbuild.1

managedNodeGroups:
  - name: trn-compute-node-group
    instanceType: trn1.32xlarge
    amiFamily: AmazonLinux2023
    subnets:
      - $PRIVATE_SUBNET_1
    privateNetworking: true
    efaEnabled: true
    minSize: 0
    desiredCapacity: 1
    maxSize: 2
    volumeSize: 500
    capacityReservation:
      capacityReservationTarget:
        capacityReservationId: $CAPACITY_RESERVATION_ID
    updateConfig:
      maxUnavailablePercentage: 50
    iam:
      withAddonPolicies:
        externalDNS: true
        certManager: true
        autoScaler: true
        cloudWatch: true
        ebs: true
        efs: true
        fsx: true
        imageBuilder: true
        xRay: true
        awsLoadBalancerController: true
        albIngress: true
23 changes: 23 additions & 0 deletions 2.projects/README.md
@@ -10,6 +10,8 @@ The major components of this directory are:
|-- ray-service/
|-- multinode-triton-trtllm-inference/
|-- mixture-of-agents/
|-- neuronx-distributed/
|-- mig-gpu-partitioning/
`-- ...
// Other directories
```
@@ -80,6 +82,27 @@ Recent advances in large language models (LLMs) have shown substantial capabilit
2. [2.projects/mixture-of-agents/outputs/](/2.projects/mixture-of-agents/outputs/): This directory consists of output of 2-layers MoA.
3. [2.projects/mixture-of-agents/alpaca_eval](/2.projects/mixture-of-agents/alpaca_eval/): This directory is from AlpacaEval GitHub repository and consists of results of Anthropic Claude 3.5 Sonnet. These results are used during evaluation.

## NEURONX DISTRIBUTED INFERENCE - FUSED SPECULATIVE DECODING on EKS

This project demonstrates deploying Large Language Models using **NeuronX Distributed Inference (NxDI)** with **vLLM-Neuron** on Amazon EKS with AWS Trainium instances (trn1.32xlarge). The solution showcases advanced inference optimization techniques including **fused draft speculative decoding** for enhanced performance.

### Key Features

**Inference Optimization:**
- **Fused Speculative Decoding**: Combines target model (e.g., Qwen3-32B) with draft model (e.g., Qwen3-0.6B) for accelerated token generation

**Configuration:**
- **Toggle Speculation**: Easy switching between speculative and standard inference modes
- **Configurable Parameters**: Speculation length, batch size, sequence length, and context window
- **Model Agnostic**: Supports various model architectures with proper configuration

### Files & Directories
1. `nxd-inference-eks/`: Complete EKS deployment guide with Kubernetes manifests
2. `fused-SD/manifests/`: Kubernetes YAML files for download, compilation, deployment, and monitoring
3. Infrastructure setup guides for EKS cluster creation with Trainium node groups and EFS storage

See [neuronx-distributed/nxd-inference-eks](https://github.com/aws-samples/awsome-inference/tree/main/2.projects/neuronx-distributed/nxd-inference-eks) for detailed implementation guide.

## MIG
A challenge with ML inference workloads is that not all of them require the same amount of compute resources. With accelerated instances like the Amazon EC2 P5 (p5.48xlarge / p5e.48xlarge) or the Amazon EC2 P4 (p4d.24xlarge / p4de.24xlarge), customers need to pay for the full instance of 8 GPUs. Additionally, some workloads may be too small to fully use even a single GPU. To learn more about the specifics of GPU EC2 instances, check out this developer guide.

83 changes: 83 additions & 0 deletions 2.projects/neuronx-distributed/README.md
@@ -0,0 +1,83 @@
# NeuronX Distributed Inference on AWS

This directory contains examples for deploying Large Language Models using **NeuronX Distributed Inference (NxDI)** on AWS Trainium instances. NxDI is an open-source PyTorch-based inference library that simplifies deep learning model deployment on AWS Inferentia and Trainium instances, offering advanced inference capabilities including continuous batching and speculative decoding.

## What is NeuronX Distributed Inference?

[NeuronX Distributed Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html) (NxDI) was introduced in Neuron SDK 2.21. It provides high-performance inference on AWS Trainium and Inferentia instances, with features such as continuous batching, speculative decoding, and vLLM integration for production deployments.

## Examples

This directory contains two deployment approaches for different use cases:

### 1. EKS Deployment (`nxd-inference-eks/`)

**Kubernetes deployment**

- **Models**: Qwen3, Llama3, and other large language models
- **Instance Types**: `trn1.32xlarge`, `trn2.48xlarge`
- **Key Features**:
  - Kubernetes-native deployment
  - EFS shared storage for model artifacts
  - Neuron monitoring and observability
  - Support for standard and speculative decoding modes
  - Load balancing with Application Load Balancer

**[→ See EKS Deployment Guide](nxd-inference-eks/README.md)**

### 2. EC2 Deployment (`nxdi-ec2-vllm/`)

**Direct EC2 deployment for development and testing**

- **Use Case**: Development, experimentation, and simpler production setups
- **Models**: Mistral Small 24B (primary example) and other NxDI-compatible models
- **Instance Types**: `trn1.32xlarge`
- **Key Features**:
  - Jupyter notebook-based setup and experimentation
  - Direct vLLM integration
  - Performance benchmarking tools (`llmperf`, `lm_eval`)
  - Profiling and optimization capabilities

**[→ See EC2 Deployment Guide](nxdi-ec2-vllm/README.md)**
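
With direct vLLM integration, the server exposes an OpenAI-compatible HTTP API. As an illustration only, the sketch below builds a request body for the `/v1/completions` endpoint. The helper name `completion_payload`, the model name, and the `localhost:8000` address (vLLM's default port) are assumptions; the helper does no JSON escaping, so it is only suitable for simple prompts.

```bash
# Build a JSON body for an OpenAI-compatible /v1/completions request.
# $1 = model name as served by vLLM, $2 = prompt, $3 = max tokens
# Note: no JSON escaping is done; avoid quotes and backslashes in the prompt.
completion_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' "$1" "$2" "$3"
}

# Example (requires a running vLLM server; address and model are assumptions):
#   curl -s http://localhost:8000/v1/completions \
#     -H 'Content-Type: application/json' \
#     -d "$(completion_payload my-model 'Hello from Trainium' 32)"
```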

## Choosing the Right Deployment

| Requirement | EKS Deployment | EC2 Deployment |
|-------------|----------------|----------------|
| **Production workloads** | ✅ Recommended | ⚠️ Limited scalability |
| **Development/Testing** | ⚠️ Complex setup | ✅ Recommended |
| **High availability** | ✅ Multi-AZ support | ❌ Single instance |
| **Setup complexity** | ⚠️ Moderate | ✅ Simple |
| **Cost optimization** | ✅ Scale to zero | ❌ Always running |

## Prerequisites

Before using either example, ensure you have:

1. **AWS Account** with appropriate permissions for Trainium instances
2. **VPC Setup**: Use the provided template in [`1.infrastructure/0_setup_vpc/trn-vpc-example.yaml`](../../1.infrastructure/0_setup_vpc/trn-vpc-example.yaml)
3. **Instance Quotas**: Sufficient quota for Trainium instances in your target region
4. **HuggingFace Token**: For downloading gated models (Llama, etc.)
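
Trainium instances are not offered in every Availability Zone, so it helps to check where `trn1.32xlarge` is available before choosing subnets for the VPC and node group. The sketch below lists the zones in a region with the AWS CLI; `trn1_zones` is a hypothetical helper name, and it assumes your credentials allow `ec2:DescribeInstanceTypeOfferings`.

```bash
# List the Availability Zones in a region that offer trn1.32xlarge, one per line.
# $1 = region
trn1_zones() {
  local region="$1"
  aws ec2 describe-instance-type-offerings \
    --region "$region" \
    --location-type availability-zone \
    --filters Name=instance-type,Values=trn1.32xlarge \
    --query 'InstanceTypeOfferings[].Location' \
    --output text | tr '\t' '\n' | sort
}

# Example:
#   trn1_zones us-west-2
```

An offering in a zone does not guarantee capacity or quota; it only shows where the instance type can be requested.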

## Getting Started

1. **Choose your deployment approach** based on your use case
2. **Set up the prerequisite infrastructure** (VPC, quotas)
3. **Follow the specific README** for your chosen deployment method
4. **Configure your models and parameters** according to your requirements

## Performance Characteristics

Both examples demonstrate high-performance inference capabilities:

- **Throughput**: 100+ tokens/second for 70B models on `trn1.32xlarge`, depending on batch size, sequence length, and model configuration
- **Latency**: Sub-second time-to-first-token with speculative decoding
- **Efficiency**: Optimized memory usage with tensor parallelism
- **Scalability**: Horizontal scaling (EKS) or vertical optimization (EC2)

## Support and Resources

- [AWS Neuron Documentation](https://awsdocs-neuron.readthedocs-hosted.com/)
- [NeuronX Distributed Inference Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/)
- [vLLM Documentation](https://docs.vllm.ai/)
- [AWS Deep Learning AMIs Developer Guide: AWS Inferentia with DLAMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia.html)