diff --git a/1.infrastructure/1_setup_cluster/nxd-inference/Create_EKS_Cluster.md b/1.infrastructure/1_setup_cluster/nxd-inference/Create_EKS_Cluster.md
new file mode 100644
index 0000000..9de121b
--- /dev/null
+++ b/1.infrastructure/1_setup_cluster/nxd-inference/Create_EKS_Cluster.md
@@ -0,0 +1,181 @@
+# Steps to create EKS cluster with EFS
+
+In this example we create an EKS cluster consisting of one `trn1.32xlarge` compute node.
+
+### a. Configure AWS CLI
+
+```
+aws configure
+```
+
+### b. Create a config file for EKS cluster creation
+
+We have provided an example file here: [trn1-nxd-cluster-config.yaml](./trn1-nxd-cluster-config.yaml)
+
+```
+apiVersion: eksctl.io/v1alpha5
+kind: ClusterConfig
+
+metadata:
+  name: nxdi-inference-cluster
+  region: $REGION
+  version: "1.30"
+
+vpc:
+  id: $PLACEHOLDER_VPC_ID
+  subnets:
+    private:
+      $AVAILABILITY_ZONE_1:
+        id: $PLACEHOLDER_SUBNET_PRIVATE_1
+    public:
+      $AVAILABILITY_ZONE_1:
+        id: $PLACEHOLDER_SUBNET_PUBLIC_1
+      $AVAILABILITY_ZONE_2:
+        id: $PLACEHOLDER_SUBNET_PUBLIC_2
+        
+  clusterEndpoints:
+    privateAccess: true
+    publicAccess: true
+      
+cloudwatch:
+  clusterLogging:
+    enableTypes: ["*"]  
+
+iam:
+  withOIDC: true
+  # IAM roles for service accounts used by any controllers you may want to install. Uncomment as required.
+  serviceAccounts:
+    - metadata:
+        name: efs-csi-controller-sa
+        namespace: kube-system
+      wellKnownPolicies:
+        # ebsCSIController: true    # Adds policies for using the ebs-csi-controller
+        efsCSIController: true      # Adds policies for using the efs-csi-controller
+
+addons:
+  - name: vpc-cni
+    version: v1.18.1-eksbuild.1
+    configurationValues: '{"env":{"ENABLE_PREFIX_DELEGATION":"true", "ENABLE_POD_ENI":"true", "POD_SECURITY_GROUP_ENFORCING_MODE":"standard"},"enableNetworkPolicy": "true"}'
+    resolveConflicts: overwrite      
+  - name: amazon-cloudwatch-observability
+    version: v1.16.4-eksbuild.1
+  # - name: aws-ebs-csi-driver
+  #   version: v1.26.0-eksbuild.1
+  - name: aws-efs-csi-driver     
+    version: v2.1.9-eksbuild.1  
+
+managedNodeGroups:
+  - name: trn-compute-node-group
+    instanceType: trn1.32xlarge
+    instancePrefix: trn-compute-node
+    subnets:
+      - $PLACEHOLDER_SUBNET_PRIVATE_1
+    privateNetworking: true
+    efaEnabled: true
+    minSize: 0
+    desiredCapacity: 1
+    maxSize: 2
+    volumeSize: 500
+    # comment out capacityReservation if you do not need ODCR
+    #capacityReservation:
+      #capacityReservationTarget:
+        #capacityReservationID: "$CR_ID"
+    ami: ami-07c8bc6b0bb890e9e
+    amiFamily: AmazonLinux2
+    ssh:
+      publicKeyName: $PUBLIC_KEYPAIR_NAME
+      sourceSecurityGroupIds: [$SECURITY_GROUP_IDS]
+    updateConfig:
+      maxUnavailablePercentage: 50
+    iam:
+      withAddonPolicies:
+        externalDNS: true
+        certManager: true
+        autoScaler: true
+        cloudWatch: true
+        ebs: true
+        efs: true
+        fsx: true
+        imageBuilder: true
+        xRay: true
+        awsLoadBalancerController: true
+        albIngress: true
+    
+```
+
+
+### c. Create the EKS cluster
+
+```
+eksctl create cluster -f trn1-nxd-cluster-config.yaml
+```
+
+## 3. (Optional) Capacity Blocks
+
+If you have Capacity Blocks for P5 or P4 instances, you can follow the [steps here](https://github.com/aws-samples/awsome-inference/tree/main/1.infrastructure#capacity-blocks) to create a self-managed nodegroup and add to your existing EKS cluster.
+
+
+## 4. Create an EFS file system
+
+Serving a model that is too large to be loaded by a single accelerator requires pods on multiple nodes to load shards of the same model and work in coordination, so they need a common, shared storage location. In Kubernetes, these shared storage locations are referred to as persistent volumes. A persistent volume can be mounted into any number of pods and then accessed by processes running inside those pods as if it were part of the pod's file system. We will use EFS as the persistent volume.
+
+Additionally, we will need to create a persistent volume claim, which we can use to assign the persistent volume to a pod.
+
+### a. Create an IAM role
+
+Follow the steps to create an IAM role for your EFS file system: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-create-iam-resources. This role will be used later when you install the EFS CSI Driver.
+
+### b. Install EFS CSI driver
+
+Install the EFS CSI Driver through the Amazon EKS add-on in the AWS console: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-install-driver. Once it's done, check the Add-ons section in the EKS console. The driver should show `Active` under Status.
+
+### c. Create EFS file system
+
+```bash
+# Create EFS filesystem (Elastic, encrypted)
+EFS_JSON=$(aws efs create-file-system \
+  --region us-west-2 \
+  --creation-token "neuron-models-$(date +%s)" \
+  --performance-mode generalPurpose \
+  --throughput-mode elastic \
+  --encrypted \
+  --tags Key=Name,Value=neuron-inference)
+
+EFS_ID=$(echo "$EFS_JSON" | jq -r '.FileSystemId')
+
+# Wait until available (avoids race)
+until [ "$(aws efs describe-file-systems \
+            --region us-west-2 \
+            --file-system-id "$EFS_ID" \
+            --query 'FileSystems[0].LifeCycleState' \
+            --output text)" = "available" ]; do
+  echo "waiting for EFS $EFS_ID ..."
+  sleep 3
+done
+echo "EFS is available."
+```
+
+### d. Create mount targets (one per AZ your nodes use)
+
+```bash
+for subnet in subnet-xxx subnet-yyy subnet-zzz; do
+  aws efs create-mount-target \
+    --region us-west-2 \
+    --file-system-id "$EFS_ID" \
+    --subnet-id "$subnet" \
+    --security-groups sg-your-efs-security-group
+done
+```
+
+### e. Create StorageClass
+
+```bash
+cat <<EOF | kubectl apply -f -
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+  name: efs-sc
+provisioner: efs.csi.aws.com
+parameters:
+  provisioningMode: efs-ap
+  fileSystemId: ${EFS_ID}
+  directoryPerms: "0755"
+reclaimPolicy: Retain
+volumeBindingMode: Immediate
+allowVolumeExpansion: true
+EOF
+
+```
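+
+The persistent volume claim mentioned at the start of this section can then request storage from this class. The manifest below is a minimal sketch, not the repository's own `storage.yaml`: the claim name `efs-models-pvc`, the `neuron-inference` namespace, and the `100Gi` size are illustrative assumptions. Kubernetes requires a storage request on every claim, but EFS is elastic and does not enforce it.
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: efs-models-pvc          # hypothetical name
+  namespace: neuron-inference   # assumed namespace
+spec:
+  accessModes:
+    - ReadWriteMany             # lets pods on several nodes mount the same volume
+  storageClassName: efs-sc      # the StorageClass created above
+  resources:
+    requests:
+      storage: 100Gi            # required field; illustrative value
+```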
diff --git a/1.infrastructure/1_setup_cluster/nxd-inference/trn1-nxd-cluster-config.yaml b/1.infrastructure/1_setup_cluster/nxd-inference/trn1-nxd-cluster-config.yaml
new file mode 100644
index 0000000..2552a20
--- /dev/null
+++ b/1.infrastructure/1_setup_cluster/nxd-inference/trn1-nxd-cluster-config.yaml
@@ -0,0 +1,85 @@
+apiVersion: eksctl.io/v1alpha5
+kind: ClusterConfig
+
+metadata:
+  name: $CLUSTER_NAME
+  region: $REGION
+  version: "1.30"
+
+vpc:
+  id: $VPC_ID
+  subnets:
+    private:
+      us-west-2c:
+        id: $PRIVATE_SUBNET_1
+      us-west-2d:
+        id: $PRIVATE_SUBNET_2
+    public:
+      us-west-2c:
+        id: $PUBLIC_SUBNET_1
+      us-west-2d:
+        id: $PUBLIC_SUBNET_2
+  clusterEndpoints:
+    privateAccess: true
+    publicAccess: true
+
+cloudwatch:
+  clusterLogging:
+    enableTypes: ["*"]
+
+iam:
+  withOIDC: true
+  serviceAccounts:
+    - metadata:
+        name: efs-csi-controller-sa
+        namespace: kube-system
+      wellKnownPolicies:
+        efsCSIController: true
+
+addons:
+  - name: vpc-cni
+    version: v1.18.1-eksbuild.1
+    resolveConflicts: overwrite
+    configurationValues: |
+      env:
+        ENABLE_PREFIX_DELEGATION: "true"
+        ENABLE_POD_ENI: "true"
+        POD_SECURITY_GROUP_ENFORCING_MODE: "standard"
+      enableNetworkPolicy: "true"
+
+  - name: amazon-cloudwatch-observability
+    version: v1.16.4-eksbuild.1
+
+  - name: aws-efs-csi-driver
+    version: v2.1.9-eksbuild.1
+
+managedNodeGroups:
+  - name: trn-compute-node-group
+    instanceType: trn1.32xlarge
+    amiFamily: AmazonLinux2023
+    subnets:
+      - $PRIVATE_SUBNET_1
+    privateNetworking: true
+    efaEnabled: true
+    minSize: 0
+    desiredCapacity: 1
+    maxSize: 2
+    volumeSize: 500
+    capacityReservation:
+      capacityReservationTarget:
+        capacityReservationId: $CAPACITY_RESERVATION_ID
+    updateConfig:
+      maxUnavailablePercentage: 50
+    iam:
+      withAddonPolicies:
+        externalDNS: true
+        certManager: true
+        autoScaler: true
+        cloudWatch: true
+        ebs: true
+        efs: true
+        fsx: true
+        imageBuilder: true
+        xRay: true
+        awsLoadBalancerController: true
+        albIngress: true
diff --git a/2.projects/README.md b/2.projects/README.md
index 15a5e41..54702cc 100644
--- a/2.projects/README.md
+++ b/2.projects/README.md
@@ -10,6 +10,8 @@ The major components of this directory are:
 |-- ray-service/                
 |-- multinode-triton-trtllm-inference/
 |-- mixture-of-agents/
+|-- neuronx-distributed/
+|-- mig-gpu-partitioning/
 `-- ...
 // Other directories
 ```
@@ -80,6 +82,27 @@ Recent advances in large language models (LLMs) have shown substantial capabilit
 2. [2.projects/mixture-of-agents/outputs/](/2.projects/mixture-of-agents/outputs/): This directory consists of output of 2-layers MoA.
 3. [2.projects/mixture-of-agents/alpaca_eval](/2.projects/mixture-of-agents/alpaca_eval/): This directory is from AlpacaEval GitHub repository and consists of results of Anthropic Claude 3.5 Sonnet. These results are used during evaluation.
 
+## NEURONX DISTRIBUTED INFERENCE - FUSED SPECULATIVE DECODING on EKS
+
+This project demonstrates deploying Large Language Models using **NeuronX Distributed Inference (NxDI)** with **vLLM-Neuron** on Amazon EKS with AWS Trainium instances (trn1.32xlarge). The solution showcases advanced inference optimization techniques including **fused draft speculative decoding** for enhanced performance.
+
+### Key Features 
+
+**Inference Optimization:**
+- **Fused Speculative Decoding**: Combines target model (e.g., Qwen3-32B) with draft model (e.g., Qwen3-0.6B) for accelerated token generation
+
+**Configuration:**
+- **Toggle Speculation**: Easy switching between speculative and standard inference modes
+- **Configurable Parameters**: Speculation length, batch size, sequence length, and context window
+- **Model Agnostic**: Supports various model architectures with proper configuration
+
+### Files & Directories
+1. `nxd-inference-eks/`: Complete EKS deployment guide with Kubernetes manifests
+2. `fused-SD/manifests/`: Kubernetes YAML files for download, compilation, deployment, and monitoring
+3. Infrastructure setup guides for EKS cluster creation with Trainium node groups and EFS storage
+
+See [neuronx-distributed/nxd-inference-eks](https://github.com/aws-samples/awsome-inference/tree/main/2.projects/neuronx-distributed/nxd-inference-eks) for detailed implementation guide.
+
 ## MIG
 These days, the challenge with ML Inference workloads, is that not all workloads require the same amount of compute resources. With accelerated instances like the Amazon EC2 P5 (p5.48xlarge / p5e.48xlarge), or the Amazon EC2 P4 (p4d.24xlarge / p4de.24xlarge), customers would need to pay for the full instance of 8 GPUs. Additionally, some workloads may be too small to even run on a single GPU! To learn more about the specifics of GPU EC2 instances, check out this developer guide.
 
diff --git a/2.projects/neuronx-distributed/README.md b/2.projects/neuronx-distributed/README.md
new file mode 100644
index 0000000..ab15c64
--- /dev/null
+++ b/2.projects/neuronx-distributed/README.md
@@ -0,0 +1,83 @@
+# NeuronX Distributed Inference on AWS
+
+This directory contains examples for deploying Large Language Models using **NeuronX Distributed Inference (NxDI)** on AWS Trainium instances. NxDI is an open-source PyTorch-based inference library that simplifies deep learning model deployment on AWS Inferentia and Trainium instances, offering advanced inference capabilities including continuous batching and speculative decoding.
+
+## What is NeuronX Distributed Inference?
+
+[NeuronX Distributed Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html) (NxDI) was introduced in Neuron SDK 2.21. It provides high-performance inference on AWS Trainium and Inferentia instances, with features such as continuous batching, speculative decoding, and vLLM integration for production deployments.
+
+## Examples
+
+This directory contains two deployment approaches for different use cases:
+
+### 1. EKS Deployment (`nxd-inference-eks/`)
+
+**Kubernetes deployment**
+
+- **Models**: Qwen3, Llama3, and other large language models
+- **Instance Types**: `trn1.32xlarge`, `trn2.48xlarge`
+- **Key Features**:
+  - Kubernetes-native deployment 
+  - EFS shared storage for model artifacts
+  - Neuron monitoring and observability
+  - Support for standard and speculative decoding modes
+  - Load balancing with Application Load Balancer
+
+**[→ See EKS Deployment Guide](nxd-inference-eks/README.md)**
+
+### 2. EC2 Deployment (`nxdi-ec2-vllm/`)
+
+**Direct EC2 deployment for development and testing**
+
+- **Use Case**: Development, experimentation, and simpler production setups
+- **Models**: Mistral Small 24B (primary example) and other NxDI-compatible models
+- **Instance Types**: `trn1.32xlarge`
+- **Key Features**:
+  - Jupyter notebook-based setup and experimentation
+  - Direct vLLM integration
+  - Performance benchmarking tools (`llmperf`, `lm_eval`)
+  - Profiling and optimization capabilities
+
+**[→ See EC2 Deployment Guide](nxdi-ec2-vllm/README.md)**
+
+## Choosing the Right Deployment
+
+| Requirement | EKS Deployment | EC2 Deployment |
+|-------------|----------------|----------------|
+| **Production workloads** | ✅ Recommended | ⚠️ Limited scalability |
+| **Development/Testing** | ⚠️ Complex setup | ✅ Recommended |
+| **High availability** | ✅ Multi-AZ support | ❌ Single instance |
+| **Setup complexity** | ⚠️ Moderate | ✅ Simple |
+| **Cost optimization** | ✅ Scale to zero | ❌ Always running |
+
+## Prerequisites
+
+Before using either example, ensure you have:
+
+1. **AWS Account** with appropriate permissions for Trainium instances
+2. **VPC Setup**: Use the provided template in [`1.infrastructure/0_setup_vpc/trn-vpc-example.yaml`](../../1.infrastructure/0_setup_vpc/trn-vpc-example.yaml)
+3. **Instance Quotas**: Sufficient quota for Trainium instances in your target region
+4. **HuggingFace Token**: For downloading gated models (Llama, etc.)
+
+## Getting Started
+
+1. **Choose your deployment approach** based on your use case
+2. **Set up the prerequisite infrastructure** (VPC, quotas)
+3. **Follow the specific README** for your chosen deployment method
+4. **Configure your models and parameters** according to your requirements
+
+## Performance Characteristics
+
+Both examples demonstrate high-performance inference capabilities:
+
+- **Throughput**: Up to 100+ tokens/second for 70B models on `trn1.32xlarge`
+- **Latency**: Sub-second time-to-first-token with speculative decoding
+- **Efficiency**: Optimized memory usage with tensor parallelism
+- **Scalability**: Horizontal scaling (EKS) or vertical optimization (EC2)
+
+## Support and Resources
+
+- [AWS Neuron Documentation](https://awsdocs-neuron.readthedocs-hosted.com/)
+- [NeuronX Distributed Inference Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/)
+- [vLLM Documentation](https://docs.vllm.ai/)
+- [AWS Deep Learning AMIs Developer Guide: Inferentia tutorial](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia.html)
diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/.gitignore b/2.projects/neuronx-distributed/nxd-inference-eks/.gitignore
new file mode 100644
index 0000000..a7741a7
--- /dev/null
+++ b/2.projects/neuronx-distributed/nxd-inference-eks/.gitignore
@@ -0,0 +1,28 @@
+# Environment variables
+.env
+.env.local
+.env.*.local
+
+# OS generated files
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+
+# IDE files
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# Logs
+*.log
+logs/
+
+# Temporary files
+*.tmp
+*.temp
\ No newline at end of file
diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/README.md b/2.projects/neuronx-distributed/nxd-inference-eks/README.md
new file mode 100644
index 0000000..0a2f1b8
--- /dev/null
+++ b/2.projects/neuronx-distributed/nxd-inference-eks/README.md
@@ -0,0 +1,646 @@
+# NeuronX Distributed Inference on EKS
+
+This example demonstrates deploying Large Language Models using **NeuronX Distributed Inference (NxDI)** on Amazon EKS with AWS Trainium instances. The deployment supports both standard inference and speculative decoding. We use the Qwen3 family of models here, but other popular model architectures supported by Neuron, along with their supported kernels, can also be used with this example.
+
+## Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    Amazon EKS Cluster                       │
+│                                                             │
+│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────┐ │
+│  │  Download Job   │  │   Compilation   │  │   Monitoring │ │
+│  │  (HF → EFS)     │  │      Job        │  │   DaemonSet  │ │
+│  │  • Target model │  │  • NxDI compile │  │  • Neuron    │ │
+│  │  • Draft model  │  │  • Spec / NoSpec│  │    Monitor   │ │
+│  └─────────────────┘  └─────────────────┘  └──────────────┘ │
+│                                                             │
+│  ┌─────────────────┐                                        │
+│  │  Inference      │                                        │
+│  │  Deployment     │                                        │
+│  │  • vLLM Server  │                                        │
+│  │  • LoadBalancer │                                        │
+│  └─────────────────┘                                        │
+│                                                             │
+│  ┌─────────────────────────────────────────────────────────┐│
+│  │              Shared EFS Storage                         ││
+│  │  • /shared/model_hub/* (downloads)                      ││
+│  │  • /shared/compiled_models/Qwen3/* (neffs)              ││
+│  │  • Logs                                                 ││
+│  └─────────────────────────────────────────────────────────┘│
+└─────────────────────────────────────────────────────────────┘
+```
+
+## Features
+
+- **Kubernetes-native** jobs for **download** and **compile**
+- **Speculative decoding** toggle via `ENABLE_SPECULATIVE`
+- **Separate compiled outputs** for spec / non-spec (no overwrites)
+- **Shared EFS storage** for models, artifacts, and logs
+- **Load balancing** and **monitoring** with Neuron Monitor
+
+## Prerequisites
+
+- Follow the steps in `1.infrastructure/0_setup_vpc/vpc-cf-example.yaml` and `1.infrastructure/1_setup_cluster/nxd-inference/Create_EKS_Cluster.md` first to set up your infrastructure.
+- `kubectl`, `aws` CLI, and `helm`
+- EKS cluster with Trainium nodes and EFS PVC
+- A Hugging Face token with access to the target repos
+
+Install the Neuron device plugin and, optionally, the scheduler extension:
+
+```bash
+# Install the Neuron device plugin (with npd disabled) and check its DaemonSet
+helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart --set "npd.enabled=false"
+kubectl get ds neuron-device-plugin -n kube-system
+
+# Optional: re-run the upgrade with the scheduler extension enabled
+helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
+  --set "scheduler.enabled=true" \
+  --set "npd.enabled=false"
+```
+
+## Setup
+
+### 1) Clone & Navigate
+
+```bash
+git clone https://github.com/aws-samples/awsome-inference.git
+cd awsome-inference/2.projects/neuronx-distributed/nxd-inference-eks/
+```
+
+### 2) Label and Taint Trainium Nodes
+
+```bash
+kubectl label nodes -l node.kubernetes.io/instance-type=trn1.32xlarge workload-type=neuron-inference
+kubectl taint nodes -l node.kubernetes.io/instance-type=trn1.32xlarge aws.amazon.com/neuron=:NoSchedule
+kubectl get nodes -L workload-type,node.kubernetes.io/instance-type
+```
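+
+Pods that should run on these nodes need a matching node selector and a toleration for the taint. The fragment below is a sketch of the relevant part of a pod spec: the container name and image are placeholders, and the limit of 16 `aws.amazon.com/neuron` devices assumes a full `trn1.32xlarge` node with the Neuron device plugin installed.
+
+```yaml
+spec:
+  nodeSelector:
+    workload-type: neuron-inference      # label applied above
+  tolerations:
+    - key: aws.amazon.com/neuron         # taint applied above
+      operator: Exists
+      effect: NoSchedule
+  containers:
+    - name: inference                    # placeholder name
+      image: YOUR_INFERENCE_IMAGE        # placeholder image
+      resources:
+        limits:
+          aws.amazon.com/neuron: 16      # assumes all 16 Neuron devices of one trn1.32xlarge
+```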
+
+### 3) Namespace, Storage, and Secrets
+
+```bash
+kubectl create namespace neuron-inference
+
+# Apply EFS storage configuration (uses the EFS file system and StorageClass you created previously)
+kubectl apply -f fused-SD/manifests/storage.yaml -n neuron-inference
+
+# Create HF token secret once
+# replace YOUR_HF_TOKEN with your actual token (starts with hf_)
+kubectl -n neuron-inference create secret generic hf-token \
+  --from-literal=HF_TOKEN='YOUR_HF_TOKEN' \
+  --dry-run=client -o yaml | kubectl apply -f -
+
+```
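+
+A container can then read the token as an environment variable. This fragment is a sketch of the usual `secretKeyRef` pattern. Whether the manifests in `fused-SD/manifests/` wire it exactly this way is an assumption, and the container name and image are placeholders.
+
+```yaml
+containers:
+  - name: model-download              # placeholder name
+    image: YOUR_IMAGE                 # placeholder image
+    env:
+      - name: HF_TOKEN
+        valueFrom:
+          secretKeyRef:
+            name: hf-token            # secret created above
+            key: HF_TOKEN
+```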
+
+## Workflow
+
+> **Two preparation steps:** (1) **Download** both models to EFS, (2) **Compile** with or without speculation. The remaining steps deploy, expose, and monitor the service.  
+> Compiles write to **separate directories** so you can keep both.
+
+### Hugging Face token
+
+The jobs below use the `hf-token` secret created in Setup step 3. You only need to create it once. Running a plain `kubectl create secret` again fails if the secret already exists.
+
+
+### Step 1 — Download both models (target + draft)
+
+Apply the **download job** manifest:
+
+```bash
+kubectl apply -n neuron-inference -f fused-SD/manifests/model_download.yaml
+kubectl -n neuron-inference wait --for=condition=complete job/neuron-model-download --timeout=3600s
+kubectl -n neuron-inference logs job/neuron-model-download --tail=200
+```
+
+Expected locations after success:
+
+```
+/shared/model_hub/${MODEL_DIRNAME}/config.json
+/shared/model_hub/${DRAFT_DIRNAME}/config.json
+```
+
+Quick verify:
+
+```bash
+kubectl -n neuron-inference exec -it <any-running-pod> -- ls -l /shared/model_hub/${MODEL_DIRNAME} | head
+```
+
+### Step 2 — Compile (separate outputs for spec vs non-spec)
+
+Apply the **compile job** manifest. Control speculation by editing the `ENABLE_SPECULATIVE` environment variable in the manifest, along with any other environment variables you would like to set or toggle. The script's remaining defaults are set in the ConfigMap. Run the job as follows:
+
+```bash
+# Non-spec compile (kept in /shared/compiled_models/Qwen3-32B/nospec_tp32)
+kubectl apply -n neuron-inference -f fused-SD/manifests/compile.yaml
+kubectl -n neuron-inference wait --for=condition=complete job/neuron-model-compilation --timeout=3600s
+kubectl -n neuron-inference logs job/neuron-model-compilation --tail=200
+
+# For speculative compile, edit the manifest to set ENABLE_SPECULATIVE=true, then:
+kubectl -n neuron-inference delete job neuron-model-compilation --ignore-not-found
+kubectl apply -n neuron-inference -f fused-SD/manifests/compile.yaml
+kubectl -n neuron-inference wait --for=condition=complete job/neuron-model-compilation --timeout=3600s
+kubectl -n neuron-inference logs job/neuron-model-compilation --tail=200
+```
+
+**Output layout (no overwrites):**
+```
+/shared/compiled_models/Qwen3-32B/
+  ├─ nospec_tp32/
+  └─ spec_slen7_tp32/
+```
+
+### Step 3 — Deploy Inference
+
+**Important:** Before deploying, you must update the compiled model paths in `fused-SD/manifests/fsd-deploy.yaml` to match your compilation parameters. Please make sure that the env vars being set and used by vLLM are consistent with your compile-time input shapes and configs.
+
+The deployment manifest has hardcoded paths that need to match your compile job settings:
+
+```yaml
+# In fsd-deploy.yaml, update these env vars to match your compilation:
+- name: COMPILED_MODEL_PATH_STD
+  value: "/shared/compiled_models/Qwen3-32B/nospec_tp32"      # for non-speculative
+- name: COMPILED_MODEL_PATH_SPEC
+  value: "/shared/compiled_models/Qwen3-32B/spec_slen7_tp32"  # for speculative
+```
+
+**Path format:** `/shared/compiled_models/{MODEL_NAME}/{mode}_{params}`
+
+Where:
+- `{MODEL_NAME}` = your `COMPILED_ROOT` basename (e.g., `Qwen3-32B`)
+- `{mode}` = `spec` or `nospec` 
+- `{params}` = `slen{SPECULATION_LENGTH}_tp{TP_DEGREE}` for spec, or just `tp{TP_DEGREE}` for nospec
+
+**Examples:**
+- TP=32, no speculation: `nospec_tp32`
+- TP=32, speculation length 7: `spec_slen7_tp32` 
+- TP=16, speculation length 5: `spec_slen5_tp16`
+
+**Quick update command:**
+```bash
+# Set these to the values you compiled with (the manifest defaults are TP=32, SPECULATION_LENGTH=7)
+TP=32
+SLEN=7
+sed -i "s|/shared/compiled_models/Qwen3-32B/spec_slen7_tp32|/shared/compiled_models/Qwen3-32B/spec_slen${SLEN}_tp${TP}|g" fused-SD/manifests/fsd-deploy.yaml
+sed -i "s|/shared/compiled_models/Qwen3-32B/nospec_tp32|/shared/compiled_models/Qwen3-32B/nospec_tp${TP}|g" fused-SD/manifests/fsd-deploy.yaml
+```
+
+Then apply your inference deployment:
+
+```bash
+kubectl apply -n neuron-inference -f fused-SD/manifests/fsd-deploy.yaml
+kubectl -n neuron-inference wait --for=condition=available deployment/neuron-llama-inference --timeout=600s
+```
+
+### Step 4 — Load Balancing with Application Load Balancer
+
+To expose your inference service externally and distribute traffic across multiple pods, you'll set up an Application Load Balancer (ALB) using the AWS Load Balancer Controller.
+
+#### 4.1 Install AWS Load Balancer Controller
+
+**Prerequisites:**
+- Your EKS cluster must have an IAM OIDC identity provider
+- The AWS Load Balancer Controller requires specific IAM permissions
+
+**Option A: Using eksctl and Helm (Recommended)**
+
+1. Create the IAM policy and service account:
+```bash
+# Download the IAM policy document
+curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.7.2/docs/install/iam_policy.json
+
+# Create the IAM policy
+aws iam create-policy \
+    --policy-name AWSLoadBalancerControllerIAMPolicy \
+    --policy-document file://iam_policy.json
+
+# Create service account with IAM role
+eksctl create iamserviceaccount \
+  --cluster=your-cluster-name \
+  --namespace=kube-system \
+  --name=aws-load-balancer-controller \
+  --role-name AmazonEKSLoadBalancerControllerRole \
+  --attach-policy-arn=arn:aws:iam::ACCOUNT-ID:policy/AWSLoadBalancerControllerIAMPolicy \
+  --approve
+```
+
+2. Install the controller:
+```bash
+# Add the EKS chart repo
+helm repo add eks https://aws.github.io/eks-charts
+helm repo update
+
+# Install AWS Load Balancer Controller
+helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
+  -n kube-system \
+  --set clusterName=your-cluster-name \
+  --set serviceAccount.create=false \
+  --set serviceAccount.name=aws-load-balancer-controller
+```
+
+3. Verify installation:
+```bash
+kubectl get deployment -n kube-system aws-load-balancer-controller
+kubectl logs -n kube-system deployment/aws-load-balancer-controller
+```
+
+**Option B: Following the AWS Documentation (Alternative)**
+Follow the [AWS documentation for Helm installation](https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html).
+
+#### 4.2 Deploy the Ingress
+
+Once the AWS Load Balancer Controller is installed and running:
+
+```bash
+# Apply the ingress configuration
+kubectl -n neuron-inference apply -f fused-SD/manifests/neuron-ingress.yaml
+
+# Monitor ingress creation (wait for ADDRESS to appear)
+kubectl -n neuron-inference get ingress neuron-qwen-ingress -w
+```
+
+The ingress will create an Application Load Balancer that:
+- Routes traffic to your inference service pods
+- Provides health checks on the `/health` endpoint
+- Supports both HTTP and HTTPS traffic
+- Automatically scales with your deployment
+
+#### 4.3 Test Your Deployment
+
+Once the ALB is provisioned (this can take 2-3 minutes):
+
+```bash
+# Get the ALB hostname
+ALB=$(kubectl -n neuron-inference get ing neuron-qwen-ingress -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
+echo "ALB Endpoint: http://$ALB"
+
+# Test the health endpoint
+curl -i "http://$ALB/health"
+
+# List available models
+curl -i "http://$ALB/v1/models"
+
+# Test inference with a simple completion
+curl -s "http://$ALB/v1/completions" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "/shared/model_hub/Qwen3-32B",
+    "prompt": "Say hi from vLLM on Neuron.",
+    "max_tokens": 64,
+    "temperature": 0.7
+  }'
+
+# Test with chat completions API
+curl -s "http://$ALB/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "/shared/model_hub/Qwen3-32B",
+    "messages": [{"role": "user", "content": "Hello! How are you?"}],
+    "max_tokens": 100
+  }'
+```
+
+Congratulations! Your inference service is now reachable through the ALB.
+
+#### 4.4 Production Considerations
+
+For production deployments, consider:
+
+- **HTTPS/TLS**: Configure SSL certificates using AWS Certificate Manager
+- **Custom Domain**: Set up Route 53 records pointing to your ALB
+- **WAF Integration**: Add AWS WAF for additional security
+- **Access Logging**: Enable ALB access logs for monitoring and debugging
+- **Target Group Settings**: Tune health check intervals and thresholds based on your model's startup time
+
+Example ingress with HTTPS:
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: neuron-qwen-ingress-https
+  annotations:
+    kubernetes.io/ingress.class: alb
+    alb.ingress.kubernetes.io/scheme: internet-facing
+    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:region:account:certificate/cert-id
+    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
+    alb.ingress.kubernetes.io/ssl-redirect: '443'
+spec:
+  rules:
+  - host: your-domain.com
+    http:
+      paths:
+      - path: /
+        pathType: Prefix
+        backend:
+          service:
+            name: neuron-llama-service
+            port:
+              number: 8000
+```
+
+### Step 5 — Deploy Neuron Monitor for Observability
+
+AWS Neuron Monitor provides comprehensive monitoring and observability for your Neuron workloads, including hardware utilization, model performance metrics, and system health indicators.
+
+#### 5.1 Understanding Neuron Monitor
+
+Neuron Monitor offers:
+- **Hardware Metrics**: NeuronCore utilization, memory usage, temperature
+- **Model Performance**: Inference latency, throughput, queue depth
+- **System Health**: Device status, error rates, compilation metrics
+- **Integration**: Works with Prometheus, Grafana, CloudWatch, and other monitoring systems
+
+#### 5.2 Deploy Neuron Monitor DaemonSet
+
+The Neuron Monitor runs as a DaemonSet to collect metrics from all Neuron devices across your cluster:
+
+```bash
+# Create the Neuron Monitor DaemonSet
+cat <<EOF | kubectl apply -f -
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: neuron-monitor
+  namespace: neuron-inference
+  labels:
+    app: neuron-monitor
+spec:
+  selector:
+    matchLabels:
+      app: neuron-monitor
+  template:
+    metadata:
+      labels:
+        app: neuron-monitor
+    spec:
+      serviceAccountName: neuron-monitor
+      hostNetwork: true
+      hostPID: true
+      containers:
+      - name: neuron-monitor
+        image: "REPLACE_WITH_NEURON_MONITOR_IMAGE"  # set to the latest Neuron Monitor image before applying
+        securityContext:
+          privileged: true
+        env:
+        - name: NEURON_MONITOR_CW_REGION
+          value: "us-west-2"  # Change to your region
+        - name: NEURON_MONITOR_CW_LOG_GROUP
+          value: "/aws/eks/neuron-monitor"
+        ports:
+        - containerPort: 8080
+          name: http-metrics
+        - containerPort: 8082
+          name: http-health
+        volumeMounts:
+        - name: proc
+          mountPath: /host/proc
+          readOnly: true
+        - name: sys
+          mountPath: /host/sys
+          readOnly: true
+        - name: neuron-devices
+          mountPath: /dev/neuron0
+        - name: tmp
+          mountPath: /tmp
+        resources:
+          requests:
+            memory: "128Mi"
+            cpu: "100m"
+          limits:
+            memory: "512Mi"
+            cpu: "500m"
+        livenessProbe:
+          httpGet:
+            path: /health
+            port: 8082
+          initialDelaySeconds: 30
+          periodSeconds: 30
+        readinessProbe:
+          httpGet:
+            path: /health
+            port: 8082
+          initialDelaySeconds: 5
+          periodSeconds: 10
+      volumes:
+      - name: proc
+        hostPath:
+          path: /proc
+      - name: sys
+        hostPath:
+          path: /sys
+      - name: neuron-devices
+        hostPath:
+          path: /dev/neuron0
+      - name: tmp
+        hostPath:
+          path: /tmp
+      nodeSelector:
+        workload-type: neuron-inference
+      tolerations:
+      - key: aws.amazon.com/neuron
+        operator: Exists
+        effect: NoSchedule
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: neuron-monitor
+  namespace: neuron-inference
+  annotations:
+    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT-ID:role/NeuronMonitorRole
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: neuron-monitor-service
+  namespace: neuron-inference
+  labels:
+    app: neuron-monitor
+spec:
+  selector:
+    app: neuron-monitor
+  ports:
+  - name: http-metrics
+    port: 8080
+    targetPort: 8080
+  - name: http-health
+    port: 8082
+    targetPort: 8082
+  type: ClusterIP
+EOF
+```
+
+#### 5.3 Create IAM Role for CloudWatch Integration
+
+If you want to send metrics to CloudWatch, create an IAM role:
+
+```bash
+# Create IAM policy for CloudWatch access
+cat <<EOF > neuron-monitor-policy.json
+{
+    "Version": "2012-10-17",
+    "Statement": [
+        {
+            "Effect": "Allow",
+            "Action": [
+                "cloudwatch:PutMetricData",
+                "logs:CreateLogGroup",
+                "logs:CreateLogStream",
+                "logs:PutLogEvents",
+                "logs:DescribeLogStreams"
+            ],
+            "Resource": "*"
+        }
+    ]
+}
+EOF
+
+# Create the policy
+aws iam create-policy \
+    --policy-name NeuronMonitorCloudWatchPolicy \
+    --policy-document file://neuron-monitor-policy.json
+
+# Create service account with IAM role (replace ACCOUNT-ID and CLUSTER-NAME)
+eksctl create iamserviceaccount \
+  --cluster=CLUSTER-NAME \
+  --namespace=neuron-inference \
+  --name=neuron-monitor \
+  --role-name=NeuronMonitorRole \
+  --attach-policy-arn=arn:aws:iam::ACCOUNT-ID:policy/NeuronMonitorCloudWatchPolicy \
+  --approve
+```
+
+#### 5.4 Verify Neuron Monitor Deployment
+
+```bash
+# Check DaemonSet status
+kubectl -n neuron-inference get daemonset neuron-monitor
+kubectl -n neuron-inference get pods -l app=neuron-monitor
+
+# View logs
+kubectl -n neuron-inference logs -l app=neuron-monitor --tail=50
+```
+
+#### 5.5 Configure Prometheus Integration (Optional)
+
+To scrape metrics with Prometheus, add the following ServiceMonitor:
+
+```bash
+cat <<EOF | kubectl apply -f -
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: neuron-monitor
+  namespace: neuron-inference
+  labels:
+    app: neuron-monitor
+spec:
+  selector:
+    matchLabels:
+      app: neuron-monitor
+  endpoints:
+  - port: http-metrics
+    interval: 30s
+    path: /metrics
+EOF
+```
+
+#### 5.6 Key Metrics to Monitor
+
+Neuron Monitor exposes several important metrics:
+
+**Hardware Metrics:**
+- `neuron_hardware_ecc_events_total`: ECC error events
+- `neuron_hardware_memory_used_bytes`: Memory utilization per NeuronCore
+- `neuron_hardware_utilization_ratio`: NeuronCore utilization percentage
+
+**Runtime Metrics:**
+- `neuron_runtime_inference_latency_seconds`: End-to-end inference latency
+- `neuron_runtime_queue_size`: Number of pending inference requests
+- `neuron_runtime_throughput_inferences_per_second`: Inference throughput
+
+**Model Metrics:**
+- `neuron_model_loaded`: Whether model is successfully loaded
+- `neuron_model_inference_errors_total`: Inference error count
+- `neuron_execution_latency_seconds`: Model execution time
+
+#### 5.7 Grafana Dashboard
+
+You can import pre-built Grafana dashboards for Neuron monitoring:
+
+```bash
+# Download the official Neuron dashboard
+curl -o neuron-dashboard.json https://raw.githubusercontent.com/aws-neuron/aws-neuron-samples/master/src/examples/pytorch/neuron_monitor/grafana-dashboard.json
+
+# Import into your Grafana instance via the UI or API
+```
+
+#### 5.8 CloudWatch Integration
+
+If using CloudWatch, metrics will appear under the `AWS/Neuron` namespace. You can create CloudWatch alarms for:
+
+- High NeuronCore utilization
+- Inference latency spikes  
+- Error rate thresholds
+- Memory usage alerts
+
+Example CloudWatch alarm:
+```bash
+aws cloudwatch put-metric-alarm \
+  --alarm-name "High-Neuron-Utilization" \
+  --alarm-description "Alert when NeuronCore utilization exceeds 90%" \
+  --metric-name neuron_hardware_utilization_ratio \
+  --namespace AWS/Neuron \
+  --statistic Average \
+  --period 300 \
+  --threshold 0.9 \
+  --comparison-operator GreaterThanThreshold \
+  --evaluation-periods 2
+```
+
+## Troubleshooting
+
+- Check the model download job logs:
+  ```bash
+  kubectl -n neuron-inference logs job/neuron-model-download --tail=200
+  ```
+- Ensure the HF token secret exists and is referenced:
+  ```bash
+  kubectl -n neuron-inference get secret hf-token
+  ```
+- Verify the EFS PVC is bound and writable:
+  ```bash
+  kubectl -n neuron-inference get pvc
+  ```
+
+**Compile fails immediately saying “Unrecognized model … config.json”**
+- The download likely didn’t complete or the path is wrong. Verify:
+  ```bash
+  kubectl -n neuron-inference exec -it <pod> -- test -f /shared/model_hub/${MODEL_DIRNAME}/config.json && echo OK
+  ```
+
+**Spec compile overwrote non-spec?**
+- With the provided manifests, outputs are separated per mode (`nospec_*` vs `spec_*`). If you see overwrites, confirm your `COMPILED_ROOT` and job env vars.
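+
+As a sketch of how the separation works, the `env` entries below are the ones in the compilation Job from `compile.yaml` that select the output directory. With these values (the manifest defaults, except `ENABLE_SPECULATIVE` flipped to `"true"`), `compile_model.sh` writes to `/shared/compiled_models/Qwen3-32B/spec_slen7_tp32`, while a non-speculative run writes to `.../nospec_tp32`. Because a Job's pod template cannot be edited in place, delete the previous Job before re-applying with different values.
+
+```yaml
+# Job env in compile.yaml: speculative compile
+# Output directory: ${COMPILED_ROOT}/spec_slen${SPECULATION_LENGTH}_tp${TP_DEGREE}
+- name: ENABLE_SPECULATIVE
+  value: "true"
+- name: SPECULATION_LENGTH
+  value: "7"
+- name: TP_DEGREE
+  value: "32"
+- name: COMPILED_ROOT
+  value: "/shared/compiled_models/Qwen3-32B"
+```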
+
+**Inference deployment can't find compiled artifacts**
+- Check that the paths in `fsd-deploy.yaml` match your actual compilation output:
+  ```bash
+  kubectl -n neuron-inference exec -it <pod> -- ls -la /shared/compiled_models/Qwen3-32B/
+  ```
+- Update the `COMPILED_MODEL_PATH_STD` and `COMPILED_MODEL_PATH_SPEC` env vars to match your TP degree and speculation length.
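+
+For illustration, assume you compiled with `TP_DEGREE=16` and `SPECULATION_LENGTH=5` (hypothetical values, not the defaults in the manifests). Following the directory naming used by `compile_model.sh`, the matching `env` entries in `fsd-deploy.yaml` would look roughly like this:
+
+```yaml
+# Illustrative values only: TP degree 16 and speculation length 5 are assumptions
+- name: TP_DEGREE
+  value: "16"
+- name: SPECULATION_LENGTH
+  value: "5"
+- name: COMPILED_MODEL_PATH_STD
+  value: "/shared/compiled_models/Qwen3-32B/nospec_tp16"
+- name: COMPILED_MODEL_PATH_SPEC
+  value: "/shared/compiled_models/Qwen3-32B/spec_slen5_tp16"
+```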
+
+**Neuron compiler errors**
+- These are model/hardware/SDK specific. Re-run with smaller `TP_DEGREE`, confirm SDK image version, or inspect `/shared/compile*.log`. Consider filing an issue with logs.
+
+## Cleanup
+
+```bash
+kubectl -n neuron-inference delete job neuron-model-download --ignore-not-found
+kubectl -n neuron-inference delete job neuron-model-compilation --ignore-not-found
+kubectl -n neuron-inference delete deployment/neuron-qwen-inference service/neuron-qwen-svc --ignore-not-found
+kubectl -n neuron-inference delete pvc efs-models-pvc --ignore-not-found
+kubectl delete namespace neuron-inference
+```
+
+## References
+
+- [AWS Neuron Documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) – NxDI & compiler guidance
+- [NeuronX Distributed Inference Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index.html)
+- [vLLM Documentation](https://docs.vllm.ai/) – Server flags and deployment considerations
+- [Kubernetes Jobs Documentation](https://kubernetes.io/docs/concepts/workloads/controllers/job/)
+- [Kubernetes Persistent Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/)
+- [AWS Neuron Helm Charts](https://github.com/aws-neuron/aws-neuron-helm-charts)
+- [AWS Trainium Instance Types](https://aws.amazon.com/ec2/instance-types/trn1/)
diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/.env.example b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/.env.example
new file mode 100644
index 0000000..bfc3d43
--- /dev/null
+++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/.env.example
@@ -0,0 +1,24 @@
+# Hugging Face Configuration
+HF_TOKEN=your_huggingface_token_here
+HF_MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+HF_DRAFT_MODEL_ID=meta-llama/Llama-3.2-1B-Instruct
+MODEL_NAME=llama-3-70B-inst
+
+# Inference Configuration
+MAX_MODEL_LEN=12800
+SEQ_LEN=12800
+MAX_CONTEXT_LEN=12288
+
+# Neuron Configuration
+TENSOR_PARALLEL_SIZE=32
+# Must match TENSOR_PARALLEL_SIZE
+TP_DEGREE=32
+NAMESPACE=neuron-inference
+BATCH_SIZE=1
+MAX_NUM_SEQS=1
+# Start with speculative decoding disabled
+ENABLE_SPECULATIVE=false
+SPECULATION_LENGTH=7
+
+# Paths
+MODEL_PATH=/shared/models/Llama-3.3-70B-Instruct
+COMPILED_MODEL_PATH=/shared/traced_model/Llama-3.3-70B-Instruct
+DRAFT_MODEL_PATH=/shared/models/Llama-3.2-1B-Instruct
\ No newline at end of file
diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/compile.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/compile.yaml
new file mode 100644
index 0000000..2081e41
--- /dev/null
+++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/compile.yaml
@@ -0,0 +1,140 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: neuron-compile-scripts
+  namespace: neuron-inference
+data:
+  compile_model.sh: |
+    #!/usr/bin/env bash
+    set -euo pipefail
+
+    # ---- Inputs / defaults ----
+    : "${ENABLE_SPECULATIVE:=false}"
+    : "${SPECULATION_LENGTH:=7}"
+    : "${TP_DEGREE:=32}"
+    : "${BATCH_SIZE:=1}"
+    : "${MAX_CONTEXT_LEN:=8192}"
+    : "${SEQ_LEN:=8192}"
+
+    # Where the downloader put things
+    : "${MODEL_DIRNAME:=Qwen3-32B}"
+    : "${DRAFT_DIRNAME:=Qwen3-0.6B}"
+    : "${MODEL_ROOT:=/shared/model_hub}"
+
+    MODEL_PATH="${MODEL_ROOT}/${MODEL_DIRNAME}"
+    DRAFT_MODEL_PATH="${MODEL_ROOT}/${DRAFT_DIRNAME}"
+
+    # Option 1: separate compiled outputs per configuration
+    : "${COMPILED_ROOT:=/shared/compiled_models/Qwen3-32B}"
+    if [[ "${ENABLE_SPECULATIVE}" == "true" ]]; then
+      COMPILED_MODEL_PATH="${COMPILED_ROOT}/spec_slen${SPECULATION_LENGTH}_tp${TP_DEGREE}"
+    else
+      COMPILED_MODEL_PATH="${COMPILED_ROOT}/nospec_tp${TP_DEGREE}"
+    fi
+    mkdir -p "${COMPILED_MODEL_PATH}"
+
+    # Sanity checks (fail fast if downloads missing)
+    [[ -f "${MODEL_PATH}/config.json" ]] || { echo "Missing target model at ${MODEL_PATH}"; exit 2; }
+    if [[ "${ENABLE_SPECULATIVE}" == "true" ]]; then
+      [[ -f "${DRAFT_MODEL_PATH}/config.json" ]] || { echo "Missing draft model at ${DRAFT_MODEL_PATH}"; exit 2; }
+    fi
+
+    echo "Speculative=${ENABLE_SPECULATIVE} SLEN=${SPECULATION_LENGTH} TP=${TP_DEGREE} BS=${BATCH_SIZE} SEQ=${SEQ_LEN}"
+    echo "MODEL_PATH=${MODEL_PATH}"
+    echo "DRAFT_MODEL_PATH=${DRAFT_MODEL_PATH}"
+    echo "COMPILED_MODEL_PATH=${COMPILED_MODEL_PATH}"
+
+    BASE_ARGS=(
+      --model-type qwen3
+      --task-type causal-lm
+      run
+      --model-path "${MODEL_PATH}"
+      --compiled-model-path "${COMPILED_MODEL_PATH}"
+      --torch-dtype bfloat16
+      --start_rank_id 0
+      --local_ranks_size "${TP_DEGREE}"
+      --tp-degree "${TP_DEGREE}"
+      --batch-size "${BATCH_SIZE}"
+      --max-context-length "${MAX_CONTEXT_LEN}"
+      --seq-len "${SEQ_LEN}" 
+      --on-device-sampling
+      --top-k 1
+      --do-sample
+      --sequence-parallel-enabled
+      --cc-pipeline-tiling-factor 1
+      --enable-bucketing
+      --context-encoding-buckets 512 1024 2048 4096 8192 
+      --token-generation-buckets 512 1024 2048 4096 8192 
+      --compile-only
+      --prompt "What is annapurna labs?"
+    )
+
+    LOG=/shared/compile.log
+    if [[ "${ENABLE_SPECULATIVE}" == "true" ]]; then
+      BASE_ARGS+=( --draft-model-path "${DRAFT_MODEL_PATH}" --enable-fused-speculation --speculation-length "${SPECULATION_LENGTH}" )
+      LOG=/shared/compile_speculative.log
+    fi
+
+    echo "inference_demo ${BASE_ARGS[*]}"
+    inference_demo "${BASE_ARGS[@]}" 2>&1 | tee "${LOG}"
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: neuron-model-compilation
+  namespace: neuron-inference
+spec:
+  template:
+    spec:
+      restartPolicy: OnFailure
+      nodeSelector:
+        workload-type: "neuron-inference"
+      tolerations:
+      - key: aws.amazon.com/neuron
+        operator: Exists
+        effect: NoSchedule
+      containers:
+      - name: model-compiler
+        image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py310-sdk2.25.0-ubuntu22.04
+        command: ["/bin/bash","-lc"]
+        args:
+        - |
+          set -euxo pipefail
+          df -h /shared || true
+          bash /scripts/compile_model.sh
+        env:
+        # tune/override as needed:
+        - name: ENABLE_SPECULATIVE
+          value: "false"    # set "true" for SD compile
+        - name: SPECULATION_LENGTH
+          value: "7"
+        - name: TP_DEGREE
+          value: "32"
+        - name: MODEL_DIRNAME
+          value: "Qwen3-32B"
+        - name: DRAFT_DIRNAME
+          value: "Qwen3-0.6B"
+        - name: COMPILED_ROOT
+          value: "/shared/compiled_models/Qwen3-32B"
+        resources:
+          limits:
+            aws.amazon.com/neuroncore: "32"
+            memory: "300Gi"
+            cpu: "64"
+          requests:
+            aws.amazon.com/neuroncore: "32"
+            memory: "300Gi"
+            cpu: "64"
+        volumeMounts:
+        - name: shared-storage
+          mountPath: /shared
+        - name: scripts
+          mountPath: /scripts
+      volumes:
+      - name: shared-storage
+        persistentVolumeClaim:
+          claimName: efs-models-pvc
+      - name: scripts
+        configMap:
+          name: neuron-compile-scripts
+          defaultMode: 0755
diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/fsd-deploy.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/fsd-deploy.yaml
new file mode 100644
index 0000000..e2ad23d
--- /dev/null
+++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/fsd-deploy.yaml
@@ -0,0 +1,170 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: neuron-vllm-scripts
+  namespace: neuron-inference
+data:
+  start_vllm.sh: |
+    #!/bin/bash
+    set -euo pipefail
+
+    : "${ENABLE_SPECULATIVE:=false}"
+    : "${SPECULATION_LENGTH:=7}"
+    : "${TP_DEGREE:=32}"
+    : "${BATCH_SIZE:=1}"
+    : "${MAX_CONTEXT_LEN:=8192}"
+    : "${SEQ_LEN:=8192}"
+    : "${MAX_NUM_SEQS:=1}"
+
+    : "${MODEL_DIRNAME:=Qwen3-32B}"
+    : "${DRAFT_DIRNAME:=Qwen3-0.6B}"
+    : "${MODEL_ROOT:=/shared/model_hub}"
+
+    MODEL_PATH="${MODEL_ROOT}/${MODEL_DIRNAME}"
+    DRAFT_MODEL_PATH="${MODEL_ROOT}/${DRAFT_DIRNAME}"
+
+    : "${COMPILED_MODEL_PATH_STD:=/shared/compiled_models/Qwen3-32B/nospec_tp32}"
+    : "${COMPILED_MODEL_PATH_SPEC:=/shared/compiled_models/Qwen3-32B/spec_slen7_tp32}"
+
+    # -------- Select compiled dir based on mode --------
+    if [[ "${ENABLE_SPECULATIVE}" == "true" ]]; then
+      export NEURON_COMPILED_ARTIFACTS="${COMPILED_MODEL_PATH_SPEC}"
+    else
+      export NEURON_COMPILED_ARTIFACTS="${COMPILED_MODEL_PATH_STD}"
+    fi
+
+    echo "Mode: ${ENABLE_SPECULATIVE} | TP=${TP_DEGREE} | MAX_NUM_SEQS=${MAX_NUM_SEQS} | MAX_LEN=${SEQ_LEN}"
+    echo "Using NEURON_COMPILED_ARTIFACTS=${NEURON_COMPILED_ARTIFACTS}"
+    export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference"
+
+    if [[ "${ENABLE_SPECULATIVE}" == "true" ]]; then
+      echo "Starting vLLM (speculative decoding)…"
+      VLLM_RPC_TIMEOUT=100000 python -m vllm.entrypoints.openai.api_server \
+        --model "${MODEL_PATH}" \
+        --speculative-model "${DRAFT_MODEL_PATH}" \
+        --num-speculative-tokens "${SPECULATION_LENGTH}" \
+        --max-model-len "${SEQ_LEN}" \
+        --use-v2-block-manager \
+        --max-num-seqs "${MAX_NUM_SEQS}" \
+        --tensor-parallel-size "${TP_DEGREE}" \
+        --device neuron \
+        --override-neuron-config "{\"enable_fused_speculation\": true}" \
+        --host 0.0.0.0 --port 8000
+    else
+      echo "Starting vLLM (standard)…"
+      VLLM_RPC_TIMEOUT=100000 python -m vllm.entrypoints.openai.api_server \
+        --model "${MODEL_PATH}" \
+        --max-model-len "${SEQ_LEN}" \
+        --max-num-seqs "${MAX_NUM_SEQS}" \
+        --tensor-parallel-size "${TP_DEGREE}" \
+        --device neuron \
+        --use-v2-block-manager \
+        --host 0.0.0.0 --port 8000
+    fi
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: neuron-qwen-inference
+  namespace: neuron-inference
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: neuron-qwen-inference
+  template:
+    metadata:
+      labels:
+        app: neuron-qwen-inference
+    spec:
+      nodeSelector:
+        workload-type: "neuron-inference"
+        node.kubernetes.io/instance-type: "trn1.32xlarge"
+      tolerations:
+      - key: aws.amazon.com/neuron
+        operator: Exists
+        effect: NoSchedule
+      containers:
+      - name: vllm-server
+        image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py310-sdk2.25.0-ubuntu22.04
+        command: ["/bin/bash","-lc"]
+        args:
+        - |
+          set -euo pipefail
+          exec /scripts/start_vllm.sh
+        ports:
+        - containerPort: 8000
+          name: http
+        resources:
+          limits:
+            aws.amazon.com/neuroncore: "32"
+            memory: "400Gi"
+            cpu: "120"
+          requests:
+            aws.amazon.com/neuroncore: "32"
+            memory: "400Gi"
+            cpu: "120"
+        env:
+        - name: ENABLE_SPECULATIVE
+          value: "false"
+        - name: SPECULATION_LENGTH
+          value: "7"
+        - name: TP_DEGREE
+          value: "32"
+        - name: BATCH_SIZE
+          value: "1"
+        - name: MAX_CONTEXT_LEN
+          value: "8192"
+        - name: SEQ_LEN
+          value: "8192"
+        - name: MAX_NUM_SEQS
+          value: "1"
+        - name: MODEL_DIRNAME
+          value: "Qwen3-32B"
+        - name: DRAFT_DIRNAME
+          value: "Qwen3-0.6B"
+        - name: MODEL_ROOT
+          value: "/shared/model_hub"
+        - name: COMPILED_MODEL_PATH_STD
+          value: "/shared/compiled_models/Qwen3-32B/nospec_tp32"
+        - name: COMPILED_MODEL_PATH_SPEC
+          value: "/shared/compiled_models/Qwen3-32B/spec_slen7_tp32"
+        volumeMounts:
+        - name: shared-storage
+          mountPath: /shared
+        - name: vllm-scripts
+          mountPath: /scripts
+        readinessProbe:
+          httpGet:
+            path: /health      
+            port: 8000
+          initialDelaySeconds: 30
+          periodSeconds: 5
+        livenessProbe:
+          httpGet:
+            path: /health
+            port: 8000
+          initialDelaySeconds: 60
+          periodSeconds: 10
+      volumes:
+      - name: shared-storage
+        persistentVolumeClaim:
+          claimName: efs-models-pvc
+      - name: vllm-scripts
+        configMap:
+          name: neuron-vllm-scripts
+          defaultMode: 0755
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: neuron-qwen-svc
+  namespace: neuron-inference
+spec:
+  type: ClusterIP
+  selector:
+    app: neuron-qwen-inference
+  ports:
+  - name: http
+    port: 8000
+    targetPort: 8000
diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/model_download.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/model_download.yaml
new file mode 100644
index 0000000..5a44c92
--- /dev/null
+++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/model_download.yaml
@@ -0,0 +1,96 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: neuron-download-scripts
+  namespace: neuron-inference
+data:
+  download_models.sh: |
+    #!/usr/bin/env bash
+    set -euo pipefail
+
+    : "${HF_MODEL_ID:=Qwen/Qwen3-32B}"
+    : "${HF_DRAFT_MODEL_ID:=Qwen/Qwen3-0.6B}"
+
+    : "${MODEL_DIRNAME:=Qwen3-32B}"
+    : "${DRAFT_DIRNAME:=Qwen3-0.6B}"
+
+    : "${MODEL_ROOT:=/shared/model_hub}"
+    MODEL_PATH="${MODEL_ROOT}/${MODEL_DIRNAME}"
+    DRAFT_MODEL_PATH="${MODEL_ROOT}/${DRAFT_DIRNAME}"
+
+    mkdir -p "$MODEL_PATH" "$DRAFT_MODEL_PATH"
+
+    if ! command -v hf >/dev/null 2>&1; then
+      python -m pip install -q --no-cache-dir 'huggingface_hub[cli]'
+    fi
+
+    if [[ -z "${HF_TOKEN:-}" ]]; then
+      echo "HF_TOKEN not set; cannot access gated repos." >&2
+      exit 1
+    fi
+    export HUGGINGFACE_HUB_TOKEN="$HF_TOKEN"
+
+    echo "Downloading target: ${HF_MODEL_ID} -> ${MODEL_PATH}"
+    hf download "${HF_MODEL_ID}" --local-dir "${MODEL_PATH}" --exclude "*/.git/*"
+    echo "OK: ${MODEL_PATH}"
+
+    echo "Downloading draft: ${HF_DRAFT_MODEL_ID} -> ${DRAFT_MODEL_PATH}"
+    hf download "${HF_DRAFT_MODEL_ID}" --local-dir "${DRAFT_MODEL_PATH}" --exclude "*/.git/*"
+    echo "OK: ${DRAFT_MODEL_PATH}"
+
+    echo "Listing:"
+    find "${MODEL_PATH}" -maxdepth 1 -type f | head -n 20
+    find "${DRAFT_MODEL_PATH}" -maxdepth 1 -type f | head -n 20
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: neuron-model-download
+  namespace: neuron-inference
+spec:
+  template:
+    spec:
+      restartPolicy: OnFailure
+      nodeSelector:
+        workload-type: "neuron-inference"
+      tolerations:
+      - key: aws.amazon.com/neuron
+        operator: Exists
+        effect: NoSchedule
+      containers:
+      - name: downloader
+        image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py310-sdk2.25.0-ubuntu22.04
+        command: ["/bin/bash","-lc"]
+        args:
+        - |
+          set -euxo pipefail
+          df -h /shared || true
+          bash /scripts/download_models.sh 2>&1 | tee /shared/download.log
+        env:
+        # Set/override these as needed:
+        - name: HF_MODEL_ID
+          value: "Qwen/Qwen3-32B"
+        - name: HF_DRAFT_MODEL_ID
+          value: "Qwen/Qwen3-0.6B"
+        - name: MODEL_DIRNAME
+          value: "Qwen3-32B"
+        - name: DRAFT_DIRNAME
+          value: "Qwen3-0.6B"
+        - name: HF_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token
+              key: HF_TOKEN
+        volumeMounts:
+        - name: shared-storage
+          mountPath: /shared
+        - name: scripts
+          mountPath: /scripts
+      volumes:
+      - name: shared-storage
+        persistentVolumeClaim:
+          claimName: efs-models-pvc
+      - name: scripts
+        configMap:
+          name: neuron-download-scripts
+          defaultMode: 0755
diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/neuron-ingress.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/neuron-ingress.yaml
new file mode 100644
index 0000000..76a2c0c
--- /dev/null
+++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/neuron-ingress.yaml
@@ -0,0 +1,29 @@
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: neuron-qwen-ingress
+  namespace: neuron-inference
+  annotations:
+    kubernetes.io/ingress.class: alb
+    alb.ingress.kubernetes.io/scheme: internet-facing
+    alb.ingress.kubernetes.io/target-type: ip
+    # Health checks
+    alb.ingress.kubernetes.io/healthcheck-path: /health
+    alb.ingress.kubernetes.io/healthcheck-port: "8000"
+    alb.ingress.kubernetes.io/success-codes: "200"
+    # Ensure ALB is created in the AZs with your pods
+    alb.ingress.kubernetes.io/subnets: subnet-0186a8f2614f8c522, subnet-0ad9fdda1d6f7dc98  # Replace with your own subnet IDs (at least two)
+    # (Optional) Cross-zone LB
+    alb.ingress.kubernetes.io/load-balancer-attributes: load_balancing.cross_zone.enabled=true
+spec:
+  ingressClassName: alb
+  rules:
+    - http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: neuron-qwen-svc
+                port:
+                  number: 8000
diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/storage.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/storage.yaml
new file mode 100644
index 0000000..8881a9e
--- /dev/null
+++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/storage.yaml
@@ -0,0 +1,28 @@
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: efs-pv
+spec:
+  capacity:
+    storage: 500Gi
+  volumeMode: Filesystem
+  accessModes:
+    - ReadWriteMany
+  persistentVolumeReclaimPolicy: Retain
+  storageClassName: efs-sc
+  csi:
+    driver: efs.csi.aws.com
+    volumeHandle: fs-0f239c95d507d9517 # Change to your own ID
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: efs-models-pvc   # Must match claimName in the Job and Deployment manifests
+  namespace: neuron-inference
+spec:
+  accessModes:
+    - ReadWriteMany
+  storageClassName: efs-sc
+  resources:
+    requests:
+      storage: 500Gi
\ No newline at end of file
diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/neuron-monitor/neuron-container.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/neuron-monitor/neuron-container.yaml
new file mode 100644
index 0000000..9e77bf0
--- /dev/null
+++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/neuron-monitor/neuron-container.yaml
@@ -0,0 +1,62 @@
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: neuron-monitor
+  namespace: neuron-monitor
+  labels:
+    app: neuron-monitor
+    version: v1
+spec:
+  selector:
+    matchLabels:
+      app: neuron-monitor
+  template:
+    metadata:
+      labels:
+        app: neuron-monitor
+        version: v1
+    spec:
+      affinity:
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+              - matchExpressions:
+                  - key: kubernetes.io/os
+                    operator: In
+                    values:
+                      - linux
+                  - key: node.kubernetes.io/instance-type
+                    operator: In
+                    values:
+                      - trn1.2xlarge
+                      - trn1.32xlarge
+                      - trn1n.32xlarge
+                      - inf1.xlarge
+                      - inf1.2xlarge
+                      - inf1.6xlarge
+                      - inf2.xlarge
+                      - inf2.8xlarge
+                      - inf2.24xlarge
+                      - inf2.48xlarge
+      containers:
+        - name: neuron-monitor
+          image: public.ecr.aws/neuron/neuron-monitor:1.0.1
+          ports:
+            - containerPort: 8000
+          command:
+            - "/opt/bin/entrypoint.sh"
+          args:
+            - "--port"
+            - "8000"
+          resources:
+            limits:
+              cpu: 500m
+              memory: 256Mi
+            requests:
+              cpu: 256m
+              memory: 128Mi
+          env:
+          - name: GOMEMLIMIT
+            value: 160MiB
+          securityContext:
+            privileged: true
diff --git a/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img-neff.png b/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img-neff.png
new file mode 100644
index 0000000..7b5d5a4
Binary files /dev/null and b/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img-neff.png differ
diff --git a/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img.txt b/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img.txt
@@ -0,0 +1 @@
+
diff --git a/2.projects/neuronx-distributed/nxdi-ec2-vllm/mistral-small-3-0-nxdi-guide.ipynb b/2.projects/neuronx-distributed/nxdi-ec2-vllm/mistral-small-3-0-nxdi-guide.ipynb
new file mode 100644
index 0000000..e2a00e6
--- /dev/null
+++ b/2.projects/neuronx-distributed/nxdi-ec2-vllm/mistral-small-3-0-nxdi-guide.ipynb
@@ -0,0 +1,1173 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "5a972332",
+   "metadata": {},
+   "source": [
+    "# Guide to deploy, benchmark, and profile Mistral Small 2501 with NXDI and VLLM on Trn1\n",
+    "\n",
+    "This notebook provides a step-by-step guide for serving, profiling, and running benchmarks on Mistral Small 24B model on a **Trn1** instance. \n",
+    "\n",
+    "## Mistral Small 2501\n",
+    "\n",
+    "[Mistral Small 3.0](https://mistral.ai/news/mistral-small-3) is a 24B-parameter language model from Mistral AI optimized for low-latency performance across common AI tasks. Released under the Apache 2.0 license, it features both pre-trained and instruction-tuned versions designed for efficient local deployment. The model achieves 81% accuracy on the MMLU benchmark and performs competitively with larger models like Llama 3.3 70B and Qwen 32B, while operating at three times the speed on equivalent hardware.\n",
+    "\n",
+    "## Neuronx-Distributed-Inference (NxDI)\n",
+    "\n",
+    "[NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html#nxdi-overview) (where NxD stands for NeuronX Distributed) is an open-source PyTorch-based inference library that simplifies deep learning model deployment on AWS Inferentia and Trainium instances. Introduced with Neuron SDK 2.21 release, it offers advanced inference capabilities, including features such as continuous batching and speculative decoding for high performance inference. Additionally, it supports inference engine for vLLM for seamless integration with the majority of customers’ production deployment systems. ML developers can use NxD Inference library at different levels of abstraction that fits their inference use case.\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "1. **Check/Install Dependencies** for AWS Neuron (NXDI, vLLM fork, etc.).\n",
+    "2. **Optional**: Install additional utilities (`inference-benchmarking` (lm_eval), InfluxDB, `llmperf` for performance benchmarking, etc.).\n",
+    "3. **Download** Mistral Small 24B base model.\n",
+    "4. **Spin Up** a VLLM server, benchmark and pull a profile.\n",
+    "   \n",
+    "### Prerequisites\n",
+    "\n",
+    "- **Amazon EC2 Trn1.32xlarge instance** with AWS Neuron drivers and recommended PyTorch environment.\n",
+    "- **NXDI virtual environment** (e.g., `aws_neuronx_venv_pytorch_2_5_nxd_inference`) is required.\n",
+    "\n",
+    "- To request a quota increase for `trn1.32xlarge` on EC2, follow these steps:\n",
+    "\n",
+    "1. Navigate to the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).\n",
+    "2. Choose Amazon EC2.\n",
+    "3. Review your default quota for the following resources:\n",
+    "   - `trn1.32xlarge` for ec2 on-demand use\n",
+    "4. If needed, request a quota increase for these resources.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "355c31c9-b15d-4879-84d4-92d62fd82d4d",
+   "metadata": {},
+   "source": [
+    "<div class=\"alert alert-block alert-warning\"> \n",
+    "\n",
+    "<b>NOTE:</b> To make sure that you have enough quotas to support your usage requirements, it's a best practice to monitor and manage your service quotas. Requests for Amazon EC2 service quota increases are subject to review by AWS engineering teams. Also, service quota increase requests aren't immediately processed when you submit a request. After your request is processed, you receive an email notification.\n",
+    "</div>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8934c846-b3f5-458a-b11b-77b813904e0c",
+   "metadata": {},
+   "source": [
+    "### Create Your EC2 instance\n",
+    "\n",
+    "Follow the steps below to set up your EC2 instance:\n",
+    "\n",
+    "#### Steps:\n",
+    "- Navigate to the EC2 dashboard from the AWS Management Console and launch your instance.\n",
+    "- Search for the Ubuntu 22.04 Neuron DLAMI.\n",
+    "- Choose `trn1.32xlarge` as the instance type, or any other Neuron-based instance large enough to fit the model.\n",
+    "- Set the inbound SSH rule to your local machine's IP address. Allowing traffic from any IPv4 address is not recommended; if you do so for testing, secure the port once you are done.\n",
+    "- Create and specify your SSH key pair in the instance configuration step. You will need the `.pem` file.\n",
+    "- Create your instance.\n",
+    "- Once your instance has launched, open a terminal or VS Code and follow the steps below:\n",
+    "\n",
+    "#### ssh for powershell:\n",
+    "\n",
+    "`$PUBLIC_DNS=\"paste your public ipv4 dns here\" # public ipv4 DNS, e.g. ec2-3-80-.... from ec2 console`\n",
+    "`$KEY_PATH=\"paste ssh key path here\" # local path to key, e.g. ssh/trn.pem`\n",
+    "\n",
+    "`ssh -i $KEY_PATH -L 8888:127.0.0.1:8888 -L 8000:127.0.0.1:8000 -L 8086:127.0.0.1:8086 -L 3001:127.0.0.1:3001 ubuntu@$PUBLIC_DNS`\n",
+    "\n",
+    "#### ssh for linux/macOS:\n",
+    "\n",
+    "`export PUBLIC_DNS=\"paste your public ipv4 dns here\" # public ipv4 DNS, e.g. ec2-3-80-.... from ec2 console`\n",
+    "`export KEY_PATH=\"paste ssh key path here\" # local path to key, e.g. ssh/trn.pem`\n",
+    "\n",
+    "`ssh -i $KEY_PATH -L 8888:127.0.0.1:8888 -L 8000:127.0.0.1:8000 -L 8086:127.0.0.1:8086 -L 3001:127.0.0.1:3001 ubuntu@$PUBLIC_DNS`\n",
+    "\n",
+    "You should now be connected to your EC2 instance over SSH.\n",
+    "\n",
+    "- Activate your NXDI venv:\n",
+    "\n",
+    "`source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate`\n",
+    "\n",
+    "- Start the Jupyter server:\n",
+    "\n",
+    "`jupyter lab --no-browser --port 8888 --ip 0.0.0.0`\n",
+    "\n",
+    "You should see the familiar Jupyter output with a URL to the notebook:\n",
+    "\n",
+    "`http://localhost:8888/....`\n",
+    "\n",
+    "Click the URL to open the Jupyter environment in your local browser. Upload this notebook to your Jupyter environment and run the steps in the cells below."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c8d12827-cdd6-49ad-a5b9-fae27f1e7cda",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "511f6786-8ed7-4036-9070-3d6669fcb92b",
+   "metadata": {},
+   "source": [
+    "## Install and Set up Dependencies"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3652fc5a",
+   "metadata": {},
+   "source": [
+    "### 1. Validate / Activate Python Environment\n",
+    "\n",
+    "Inside a Jupyter notebook, running `source myenv/bin/activate` in a cell does not persist the environment to subsequent cells, because each shell cell runs in its own subshell. Activate the venv in the terminal before starting the Jupyter server."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1fa75aff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "# (Optional) Uncomment or modify the following line to activate a custom environment.\n",
+    "#source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate\n",
+    "\n",
+    "echo 'Python environment check:'\n",
+    "which python\n",
+    "python --version"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e657dec1-61fc-4cf7-bec8-48ecd1213c75",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%writefile requirements.txt\n",
+    "torch==2.5.1\n",
+    "transformers==4.45.2\n",
+    "huggingface_hub\n",
+    "git-lfs"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3f4844ef-f16c-4bda-bb05-62ca83a7d65f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -U -r requirements.txt --quiet"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "88dc63be-90d4-4145-b42b-ef3fcf4c1f6c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! pip list | grep neuron"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e9200f12-356c-43f9-afad-68cb1b486e4b",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "18ea6ed2",
+   "metadata": {},
+   "source": [
+    "### 2. Install Neuron vLLM Fork\n",
+    "\n",
+    "If you would like to serve your model via [vLLM](https://vllm.readthedocs.io/en/latest/) specialized for Neuron-based inference, you can install AWS Neuron's vLLM fork. NxD Inference integrates into vLLM by extending the model execution components responsible for loading and invoking models used in vLLM’s LLMEngine (see [link](https://docs.vllm.ai/en/latest/design/arch_overview.html#llm-engine) for more details on vLLM architecture). This means input processing, scheduling and output processing follow the default vLLM behavior.\n",
+    "\n",
+    "You enable the Neuron integration in vLLM by setting the device type used by `vLLM` to `neuron`.\n",
+    "\n",
+    "Currently, we support continuous batching and streaming generation in the NxD Inference vLLM integration. We are working with the vLLM community to enable support for other vLLM features like PagedAttention and Chunked Prefill on Neuron instances through NxD Inference in upcoming releases.\n",
+    "\n",
+    "\n",
+    "Skip this step if you do not need the vLLM server. Cloning and installing vLLM takes 8-10 minutes to complete.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cb9c80f8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "set -euxo pipefail\n",
+    "\n",
+    "if [ -d \"/home/ubuntu/upstreaming-to-vllm\" ]; then\n",
+    "    echo \"Neuron vLLM fork already cloned. Skipping.\"\n",
+    "else\n",
+    "    echo \"Cloning and installing AWS Neuron vLLM fork...\"\n",
+    "    cd /home/ubuntu/\n",
+    "    git clone -b neuron-2.22-vllm-v0.7.2 https://github.com/aws-neuron/upstreaming-to-vllm.git #neuron 2.22 vllm version\n",
+    "    cd upstreaming-to-vllm\n",
+    "    pip install -r requirements-neuron.txt --quiet\n",
+    "\n",
+    "    # Install in editable mode with device set to neuron\n",
+    "    VLLM_TARGET_DEVICE=\"neuron\" pip install -e . --quiet\n",
+    "fi"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b2c245e5-21ee-4002-b85e-01127d3e4340",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "adb91034",
+   "metadata": {},
+   "source": [
+    "### 3. (Optional) Install benchmarking and profiling tools\n",
+    "\n",
+    "#### 3.1 Install llmperf\n",
+    "\n",
+    "If you'd like to run benchmarks or load tests, you can install [llmperf](https://github.com/ray-project/llmperf). Skip if not needed.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cba57d72",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "if pip show llmperf > /dev/null 2>&1; then\n",
+    "    echo \"llmperf is already installed. Skipping.\"\n",
+    "else\n",
+    "    echo \"Installing llmperf...\"\n",
+    "    cd /home/ubuntu/\n",
+    "    git clone --quiet https://github.com/ray-project/llmperf.git > /dev/null 2>&1\n",
+    "    cd llmperf\n",
+    "    pip install -e . --quiet\n",
+    "fi"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e1b95852-a69f-4c6a-bdcd-915bdaf6d46a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip list| grep neuron"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7ef8c80e-20dd-4d40-8d38-9973f03e08eb",
+   "metadata": {},
+   "source": [
+    "#### 3.2 Install AWS Neuron Tools (If Needed)\n",
+    "\n",
+    "This cell installs the Neuron packages for profiling and other tooling. If already installed, the script checks and skips. For more information, see [Installing Neuron Tools](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/index.html).\n",
+    "\n",
+    "> **Note**: If you have your apt sources already configured and have installed the Neuron packages, you can skip this step.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2855f8f0-7451-454d-a25f-2ac3574334de",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "set -euxo pipefail\n",
+    "\n",
+    "# Check if aws-neuronx-tools is installed\n",
+    "if dpkg -s aws-neuronx-tools > /dev/null 2>&1; then\n",
+    "    echo \"aws-neuronx-tools is already installed. Skipping.\"\n",
+    "else\n",
+    "    echo \"Installing aws-neuronx-tools...\"\n",
+    "    . /etc/os-release\n",
+    "\n",
+    "    sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF\n",
+    "deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main\n",
+    "EOF\n",
+    "\n",
+    "    wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -\n",
+    "    sudo apt-get update -y\n",
+    "    sudo apt-get install -y aws-neuronx-runtime-lib aws-neuronx-dkms aws-neuronx-tools\n",
+    "fi\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7322c6a7-89b9-4188-a0e1-270097e45639",
+   "metadata": {},
+   "source": [
+    "#### 3.3 (Optional) Install InfluxDB 2.x\n",
+    "\n",
+    "Install InfluxDB if you plan to use the Neuron Profiler."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b047c8ba-864f-4191-bf20-9e7239ecf5c4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "if dpkg -s influxdb2 > /dev/null 2>&1; then\n",
+    "    echo \"InfluxDB2 is already installed, skipping.\"\n",
+    "    if systemctl is-active --quiet influxdb; then\n",
+    "        echo \"InfluxDB is already running.\"\n",
+    "    else\n",
+    "        sudo systemctl start influxdb\n",
+    "        echo \"Setting up InfluxDB ...\"\n",
+    "        # influx setup\n",
+    "    fi\n",
+    "else\n",
+    "    # Install InfluxDB\n",
+    "    wget -q https://repos.influxdata.com/influxdata-archive_compat.key\n",
+    "    echo '393e8779c89ac8d958f81f942f9ad7fb82a25e133faddaf92e15b16e6ac9ce4c influxdata-archive_compat.key' | sha256sum -c && \\\n",
+    "      cat influxdata-archive_compat.key | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg > /dev/null\n",
+    "    echo 'deb [signed-by=/etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg] https://repos.influxdata.com/debian stable main' | sudo tee /etc/apt/sources.list.d/influxdata.list\n",
+    "    \n",
+    "    sudo apt-get update && sudo apt-get install influxdb2 influxdb2-cli -y\n",
+    "    sudo systemctl start influxdb\n",
+    "    \n",
+    "    # Run a non-interactive influx setup with all necessary flags.\n",
+    "    # Replace the placeholder values below with your own credentials.\n",
+    "    influx setup \\\n",
+    "      --username admin \\\n",
+    "      --password testpassowrd \\\n",
+    "      --org yourorg \\\n",
+    "      --bucket yourbucket \\\n",
+    "      --token yoursupersecrettoken \\\n",
+    "      --force\n",
+    "\n",
+    "fi"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "16eb98a5-dcc6-4d82-8283-949c6b3bf266",
+   "metadata": {},
+   "source": [
+    "#### 3.4 Accuracy-benchmarking with lm_eval\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d614ddb2-5284-488c-b377-77598508e001",
+   "metadata": {},
+   "source": [
+    "Copy the [inference-benchmarking](https://github.com/aws-neuron/aws-neuron-samples/tree/master/inference-benchmarking/) directory to some location on your instance. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8702b601-0a7e-4d87-8004-fa7bb7e0389a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! git clone https://github.com/aws-neuron/aws-neuron-samples.git"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12f5e818-6488-4270-975a-81845b3d3cd4",
+   "metadata": {},
+   "source": [
+    "Change directory to your copy of `inference-benchmarking` and install the other required dependencies in the same Python environment (e.g. `aws_neuron_venv_pytorch` if you followed the manual NxD Inference install):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "07dddce8-101c-4d2d-b01c-fb93c6a07b72",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "cd /home/ubuntu/aws-neuron-samples/inference-benchmarking/\n",
+    "pip install -r requirements.txt --quiet"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4abe3940-45bb-4bae-a12f-d7d6959a2927",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0d1970fc",
+   "metadata": {},
+   "source": [
+    "## 4. Download or Provide Your Model\n",
+    "\n",
+    "Below is a template for downloading the model. You can skip or adjust if you already have a local model.\n",
+    "\n",
+    "For more information on model checkpoint usage, see the [NxDI inference with Hugging Face-based models](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ccff01a8-94f7-4d10-bdf7-71229ec19cb9",
+   "metadata": {},
+   "source": [
+    "Log in to Hugging Face before downloading the model. Get your access token from https://huggingface.co/settings/tokens and paste it into the login prompt displayed by the cell below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d7186278-332e-4e07-87b8-fdf1a30f3335",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!git config --global credential.helper store\n",
+    "from huggingface_hub import notebook_login\n",
+    "notebook_login()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "860b9863-fc7e-4752-ae55-8d4b89312d4b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#run the following code in the terminal to install git-lfs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2500bcdc-6f0a-4484-b64d-dc23ef94c445",
+   "metadata": {},
+   "source": [
+    "`sudo apt-get update`\n",
+    "\n",
+    "`sudo apt-get install git-lfs`\n",
+    "\n",
+    "`git lfs install`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "01d2bdbf-1af1-4a20-a255-7642b8da704d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#check that git lfs is installed on path"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c72f91e7-89ee-4b90-99f4-91239197ed72",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!git lfs version"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a8774eaf-d975-49ab-a1dc-2f6062513aab",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#start a tmux session and run the following code in the terminal:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5117214b-99fd-4ecf-92fa-e084ee4d95cc",
+   "metadata": {},
+   "source": [
+    "`sudo apt-get update`\n",
+    "\n",
+    "`sudo apt-get install tmux`\n",
+    "\n",
+    "`tmux new -s mysession`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bcae2706-cd59-4e38-9353-fe9d3b2283ff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# run the following code to download the model in a tmux session since this may take a while - run in terminal"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fb753745-5fd9-4e17-984f-e1647be82dda",
+   "metadata": {},
+   "source": [
+    "`git clone https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "baae5abf-e4e8-4418-8747-d5bc05e7f3fe",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!du -sh /home/ubuntu/Mistral-Small-24B-Instruct-2501/ #check if the full model was copied in"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f481bc0-8312-4d24-8dd3-ad834518d1dc",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "083589fe-ee9b-4908-bbe0-b66fff1703a9",
+   "metadata": {},
+   "source": [
+    "## 5. Compile and save the model, then run generation with HuggingFaceGenerationAdapter: `inference_demo.py`\n",
+    "\n",
+    "NxD Inference supports running inference through the Hugging Face `generate` interface. To use Hugging Face-style generation, create a `HuggingFaceGenerationAdapter` that wraps a Neuron application model, then call `generate` on the adapted model. In the cell below, we use the `inference_demo` script that NxDI provides to compile and save our Mistral Small 24B model and run a prompt through it. For more information on the flags we set, refer to the [NxDI API reference guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/api-guides/api-guide.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a76a78e0-4ed2-4057-b702-d472d24f515f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "# Replace MODEL_PATH with the path where you downloaded and saved the model files.\n",
+    "# COMPILED_MODEL_PATH is where the compiled model is saved; reuse the same path when launching vLLM.\n",
+    "MODEL_PATH=\"/home/ubuntu/Mistral-Small-24B-Instruct-2501/\"\n",
+    "COMPILED_MODEL_PATH=\"/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/\"\n",
+    "TP_DEGREE=32\n",
+    "\n",
+    "inference_demo \\\n",
+    "    --model-type llama \\\n",
+    "    --task-type causal-lm \\\n",
+    "        run \\\n",
+    "        --model-path $MODEL_PATH \\\n",
+    "        --compiled-model-path $COMPILED_MODEL_PATH \\\n",
+    "        --torch-dtype bfloat16 \\\n",
+    "        --start_rank_id 0 \\\n",
+    "        --tp-degree $TP_DEGREE \\\n",
+    "        --batch-size 1 \\\n",
+    "        --max-context-length 12288 \\\n",
+    "        --seq-len 12800 \\\n",
+    "        --on-device-sampling \\\n",
+    "        --top-k 1 \\\n",
+    "        --do-sample \\\n",
+    "        --fused-qkv \\\n",
+    "        --sequence-parallel-enabled \\\n",
+    "        --pad-token-id 2 \\\n",
+    "        --enable-bucketing \\\n",
+    "        --context-encoding-buckets 2048 4096 8192 12288 \\\n",
+    "        --token-generation-buckets 2048 4096 8192 12800 \\\n",
+    "        --prompt \"What is annapurna labs?\" 2>&1 | tee log"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7df76059-45ba-4726-9bcc-7a545a6d1575",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a8ba793-3901-4620-b4b0-5150325a005a",
+   "metadata": {},
+   "source": [
+    "## 6. vLLM demo and perf benchmarking - standalone model "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cc9425cb-9354-4f0a-8375-900961ff54af",
+   "metadata": {},
+   "source": [
+    "#### 6.1 Run Mistral Small 2501 on Trainium"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1d58b2a2-8475-4599-8750-24253c3c9c29",
+   "metadata": {},
+   "source": [
+    "Here is an example of running online inference with Mistral Small 2501 and collecting some performance results. We first compile the model and run generation on a sample prompt using the `inference_demo` command installed by neuronx-distributed-inference, as in section 5 above. Note the path used to save the compiled model: use the same path when launching the vLLM server so that the compiled model is loaded without recompilation. Please refer to the [NxD Inference API Reference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/api-guides/api-guide.html) and the [vLLM user guide for NxDI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html) for more information on these `inference_demo` flags."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "71c866cb-9ace-4fae-a940-8776c4e41ad1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip list | grep neuron"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8fd75bb0-6fb8-4f89-8203-73b8247fc879",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# RUN THE FOLLOWING CELL IN A TERMINAL - spin up the vllm server"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ca565569-45f9-4c0c-b81d-e0c7eaa98249",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# These should be the same paths used when compiling the model. - command for terminal\n",
+    "MODEL_PATH=\"/home/ubuntu/Mistral-Small-24B-Instruct-2501/\"\n",
+    "COMPILED_MODEL_PATH=\"/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/\"\n",
+    "\n",
+    "export VLLM_NEURON_FRAMEWORK=\"neuronx-distributed-inference\"\n",
+    "export NEURON_COMPILED_ARTIFACTS=$COMPILED_MODEL_PATH\n",
+    "VLLM_RPC_TIMEOUT=100000 python -m vllm.entrypoints.openai.api_server \\\n",
+    "    --model $MODEL_PATH \\\n",
+    "    --max-num-seqs 1 \\\n",
+    "    --max-model-len 12800 \\\n",
+    "    --tensor-parallel-size 32 \\\n",
+    "    --device neuron \\\n",
+    "    --use-v2-block-manager \\\n",
+    "    --port 8000 &\n",
+    "PID=$!\n",
+    "echo \"vLLM server started with PID $PID\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2d24058f-67a6-4f94-9a39-0ce4ab64d801",
+   "metadata": {},
+   "source": [
+    "Let's send a quick request with a python client to the server:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "124db78f-f910-4bcd-b631-4a2fb4166a54",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from openai import OpenAI\n",
+    "\n",
+    "# Client Setup\n",
+    "openai_api_key = \"EMPTY\"\n",
+    "openai_api_base = \"http://localhost:8000/v1\"\n",
+    "\n",
+    "client = OpenAI(\n",
+    "    api_key=openai_api_key,\n",
+    "    base_url=openai_api_base,\n",
+    ")\n",
+    "\n",
+    "models = client.models.list()\n",
+    "model_name = models.data[0].id\n",
+    "\n",
+    "# Sampling Parameters\n",
+    "max_tokens = 1024\n",
+    "temperature = 1.0\n",
+    "top_p = 1.0\n",
+    "top_k = 50\n",
+    "stream = False\n",
+    "\n",
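+    "# Chat Completion Request (passes the sampling parameters defined above)\n",
+    "response = client.chat.completions.create(\n",
+    "    model=model_name,\n",
+    "    messages=[\n",
+    "       {\"role\": \"system\", \"content\": \"You are a helpful AI assistant.\"},\n",
+    "       {\"role\": \"user\", \"content\": \"What is AWS Neuron?\"}\n",
+    "    ],\n",
+    "    max_tokens=max_tokens,\n",
+    "    temperature=temperature,\n",
+    "    top_p=top_p,\n",
+    "    stream=stream,\n",
+    "    # top_k is a vLLM extension, not a standard OpenAI parameter\n",
+    "    extra_body={\"top_k\": top_k},\n",
+    ")\n",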
+    "\n",
+    "# Parse the response\n",
+    "generated_text = \"\"\n",
+    "generated_text = response.choices[0].message.content\n",
+    "\n",
+    "print(generated_text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d9c4bad-6cb2-43d6-86db-d2011f1bdaf3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!neuron-ls # show running processes - vllm server is still running"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d8f6afb-6f64-4387-ad65-a7fe705d9fe9",
+   "metadata": {},
+   "source": [
+    "----"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "053e81d2-e2cf-4ef7-9a55-1fb495bc416f",
+   "metadata": {},
+   "source": [
+    "#### 6.2 llmperf: run some quick benchmarks"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c45478cd-64a5-4a9a-8622-294848e2f852",
+   "metadata": {},
+   "source": [
+    "After the above steps, the vLLM server should be running. You can now measure its performance using LLMPerf. Before we can use the llmperf package, we need to make a few changes to its code. Follow the *Benchmarking with LLMPerf* guide to apply the code changes.\n",
+    "\n",
+    "Below is a sample shell script to run LLMPerf. To provide the model with 10000 tokens as input and generate 1500 tokens as output on average, we use the following parameters from LLMPerf:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "aa9ab418-c069-40cb-8041-5c0574c3377f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "cd /home/ubuntu/llmperf/\n",
+    "\n",
+    "MODEL_PATH=\"/home/ubuntu/Mistral-Small-24B-Instruct-2501/\"\n",
+    "COMPILED_MODEL_PATH=\"/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/\"\n",
+    "OUTPUT_PATH=llmperf-results-sonnets\n",
+    "\n",
+    "export OPENAI_API_BASE=\"http://localhost:8000/v1\"\n",
+    "export OPENAI_API_KEY=\"mock_key\"\n",
+    "\n",
+    "python token_benchmark_ray.py \\\n",
+    "    --model $MODEL_PATH \\\n",
+    "    --mean-input-tokens 10000 \\\n",
+    "    --stddev-input-tokens 0 \\\n",
+    "    --mean-output-tokens 1500 \\\n",
+    "    --stddev-output-tokens 0 \\\n",
+    "    --num-concurrent-requests 1 \\\n",
+    "    --timeout 3600 \\\n",
+    "    --max-num-completed-requests 50 \\\n",
+    "    --additional-sampling-params '{}' \\\n",
+    "    --results-dir $OUTPUT_PATH \\\n",
+    "    --llm-api \"openai\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c38d5e8d-0c16-4061-b22f-c558ba1d5050",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!sudo kill 55509 # stop the server; replace 55509 with the PID printed when you started vLLM"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dc00332e-d7e0-4607-b91e-e08503a2319a",
+   "metadata": {},
+   "source": [
+    "Summarized results:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3943d616-6474-40a7-b9cd-61faac231c15",
+   "metadata": {},
+   "source": [
+    "| Scenario                                                                  | TTFT (p50 ms) | TPOT (p50 ms) | Output-token Throughput (tokens/s, p50) |\n",
+    "|---------------------------------------------------------------------------|---------------|---------------|-----------------------------------------|\n",
+    "| Mistral-Small-24B-Instruct-2501 on Trainium (OpenAI-style API)            | 347           | 10.55         | 107.35                                  |\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2c99fb0e-5712-43f1-be25-2892342bbce4",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f6ec3558-b527-44a2-b107-72a6c89d4722",
+   "metadata": {},
+   "source": [
+    "#### 6.3 Running Evaluations\n",
+    "\n",
+    "There are two ways to use the evaluation scripts to run your evaluation. For more information, check out the [inference-benchmarking](https://github.com/aws-neuron/aws-neuron-samples/tree/master/inference-benchmarking/) directory and the [tutorials](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn1-llama3.1-70b-instruct-accuracy-eval-tutorial.html) in NxDI.\n",
+    "\n",
+    "1. Using a `yaml` configuration file and `accuracy.py` script\n",
+    "\n",
+    "2. Writing your own python script that uses several components provided in `accuracy.py` and `server_config.py`\n",
+    "\n",
+    "In this notebook we only demonstrate running an eval with the `yaml` config file.\n",
+    "\n",
+    "With this method, all you need is a YAML config file that specifies the server configuration and the testing scenario you want to run. Create `mistral_config.yaml` with the following content."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2c81c85a-6e50-47f5-9baa-f22e465f7297",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%writefile mistral_config.yaml\n",
+    "\n",
+    "server:\n",
+    "  name: \"Mistral-Small-24B-Instruct\"\n",
+    "  model_path: \"/home/ubuntu/Mistral-Small-24B-Instruct-2501/\"\n",
+    "  model_s3_path: null\n",
+    "  compiled_model_path: \"/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/\"\n",
+    "  max_seq_len: 12800\n",
+    "  context_encoding_len: 12288\n",
+    "  tp_degree: 32\n",
+    "  n_vllm_threads: 32\n",
+    "  server_port: 8888\n",
+    "  continuous_batch_size: 1\n",
+    "\n",
+    "test:\n",
+    "  accuracy:\n",
+    "    mytest:\n",
+    "      client: \"lm_eval\"\n",
+    "      datasets: [\"gsm8k_cot\"]\n",
+    "      max_concurrent_requests: 1\n",
+    "      timeout: 3600\n",
+    "      client_params:\n",
+    "        limit: 200\n",
+    "        use_chat: True"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b4c171e0-a6ef-4782-bb67-2927d393e4b7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "if test -f \"/home/ubuntu/aws-neuron-samples/inference-benchmarking/mistral_config.yaml\"; then\n",
+    "   echo \"config file exists.\"\n",
+    "else \n",
+    "   echo \"Moving config file.\"\n",
+    "   mv /home/ubuntu/mistral_config.yaml /home/ubuntu/aws-neuron-samples/inference-benchmarking/\n",
+    "fi"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1efea9af-ed19-476e-9841-764ea6f1b101",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "cd /home/ubuntu/aws-neuron-samples/inference-benchmarking/\n",
+    "python accuracy.py --config mistral_config.yaml"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6ee55692-a5f9-41f5-a955-ce36c52f5b96",
+   "metadata": {},
+   "source": [
+    "Results Summary:\n",
+    "\n",
+    "Accuracy_mytest_gsm8k_cot:\n",
+    "    Saved at  results/accuracy/mytest/gsm8k_cot/__home__ubuntu__Mistral-Small-24B-Instruct-2501__/results_2025-04-26T20-02-47.843052.json:\n",
+    "    \n",
+    "    Metrics: {'gsm8k_cot': {'AccuracyExactMatchStrictMatch': 39.5, 'AccuracyExactMatchStrictMatchStderr': 3.46537, 'AccuracyExactMatchFlexibleExtract': 78.5, 'AccuracyExactMatchFlexibleExtractStderr': 2.91224}}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c0b0f6af-8bb3-4e62-b98e-a6bb779f85a4",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b7b7fa83-aa5c-44c1-986b-c472a1243ed4",
+   "metadata": {},
+   "source": [
+    "#### 6.4 Profiling with `neuron-profile`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a3376804-9c21-4e68-ba32-e299ed75fb9d",
+   "metadata": {},
+   "source": [
+    "`neuron-profile` helps developers identify performance bottlenecks and optimize their workloads for NeuronDevices. `neuron-profile` provides insights into NeuronDevice activity including the instructions executed on each compute engine (ex. Tensor engine, Vector engine, etc.), DMA data movement activity, and performance metrics such as engine utilization, DMA throughput, memory usage, and more. NeuronDevice activity is collected by the `neuron-profile` capture command which runs the model with tracing enabled. Profiling typically has near zero overhead because NeuronDevices have dedicated on-chip hardware profiling."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c2be3b7c-a7e0-4295-b306-e31bbbac8da9",
+   "metadata": {},
+   "source": [
+    "Let's `cd` into `/tmp/nxd_model`, the compiler working directory. It holds the context encoding and token generation directories, which contain the NEFFs for each context encoding and token generation bucket we configured. The `neuron-profile` tool can both capture and post-process profiling information: it takes a compiled model (a NEFF), executes it, and saves the profile results to an NTFF (`profile.ntff` by default)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8d6ce0b5-f029-41a3-9971-a087341584fc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash \n",
+    "cd /tmp/nxd_model/\n",
+    "ls #list directories\n",
+    "cd context_encoding_model\n",
+    "ls "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ea3c44a0-7c75-46f8-adb6-e413344e51bf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#_tp0_bk0  _tp0_bk1  _tp0_bk2  _tp0_bk3 - are the context encoding buckets"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f9398648-ae6c-4069-baee-338e1205a1e0",
+   "metadata": {},
+   "source": [
+    "##### Capturing profiles for multi-worker jobs\n",
+    "`neuron-profile` can capture profiles for collectives-enabled NEFFs running across multiple NeuronCores, NeuronDevices, or even nodes. This is useful for understanding performance and communication overheads when deploying larger distributed models.\n",
+    "\n",
+    "The following example performs a distributed run across all NeuronDevices and NeuronCores on our trn1.32xlarge instance, capturing profiles for all 32 workers (one for each NeuronCore)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c746ce34-fc4a-48bb-9558-c1babbcd7c91",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "# 1. Make sure the directory exists and is writable\n",
+    "mkdir -p /tmp/output/          \n",
+    "\n",
+    "cd /tmp/nxd_model/context_encoding_model/_tp0_bk1/\n",
+    "# 2. Run the capture; -s names the output NTFF file inside that directory\n",
+    "neuron-profile capture \\\n",
+    "  -n graph.neff \\\n",
+    "  --collectives-workers-per-node 32 \\\n",
+    "  -s /tmp/output/profile.ntff        \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1c5250d4-f3a7-494b-9c42-d14fbbb07a26",
+   "metadata": {},
+   "source": [
+    "Now if we check the output directory, we see that a profile has been saved for each worker."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6b40fe3a-f01f-4f62-96f5-2df5db0d05dd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash \n",
+    "cd /tmp/output/\n",
+    "ls"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ee5fbaba-ecfb-4f5e-8af8-31bd6df05162",
+   "metadata": {},
+   "source": [
+    "##### Viewing profiles for multi-worker jobs\n",
+    "Profiles from multi-worker jobs (i.e. more than one NeuronCore) can either be viewed individually or in a combined collectives view. Since profile data is often similar between workers and processing profile data for all workers can be time-consuming, it is recommended to first explore the profile for a single worker or small subset of workers. Viewing the profile for a specific worker is the same as for single-worker profiles."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a5b39974-8a7a-456b-b035-c841adecb8d1",
+   "metadata": {},
+   "source": [
+    "At the beginning, we forwarded ports 3001 and 8086. Because `neuron-profile view` runs on a remote instance, we need port forwarding to access the profiles from a local browser.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "898c5311-fc27-4dc3-b9e1-ec3da5654089",
+   "metadata": {},
+   "source": [
+    "To view the profile for a specific worker, pass that worker's NTFF file with `-s`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "504f6ec8-ee60-49b2-912f-b2a9be8c1f03",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "cd /tmp/nxd_model/context_encoding_model/_tp0_bk1/\n",
+    "neuron-profile view -n graph.neff -s /tmp/output/profile_rank_2.ntff"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e3308394-08a3-413d-a38b-d3266ae2c58f",
+   "metadata": {},
+   "source": [
+    "You will see output like: `View profile at http://localhost:3001/profile/n_a1143c514431fb4c23b7aae9208fd1a89cad42f6`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f756c277-bc7b-462a-91ed-eaaf3fa6dd8d",
+   "metadata": {},
+   "source": [
+    "![image-profile](imgs/img-neff.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3e323e2f-3980-4ac7-8c32-ed395c937f0e",
+   "metadata": {},
+   "source": [
+    "To view the profile for multiple workers, pass the directory containing all worker profiles to `neuron-profile` with `-d`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0b082bd6-ed2c-4d70-b092-1e97fe9df1ce",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "cd /tmp/nxd_model/context_encoding_model/_tp0_bk1/\n",
+    "neuron-profile view -n graph.neff -d /tmp/output"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eb861db0-99e5-42ce-b9c9-bb7cc185af6a",
+   "metadata": {},
+   "source": [
+    "For more on profiling with Neuron and understanding profiles, see the [`neuron-profile` user guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profile-user-guide.html)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "40aca927-7701-4f4e-9320-cc2faf9d05bd",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f071be6d-3a5c-4fa0-b02f-4cf3802ff7c2",
+   "metadata": {},
+   "source": [
+    "## Conclusion"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e70149be-ba89-4320-a17c-082ba6a72b6d",
+   "metadata": {},
+   "source": [
+    "In this notebook, we walked through deploying, benchmarking, and generating NEFF profiles on Trn1 using Mistral Small 2501."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a281311e-0c58-41c1-9f61-93735c768f1f",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d98b2877-bb28-4762-8ef2-9165ef505046",
+   "metadata": {},
+   "source": [
+    "#### Distributors\n",
+    "- AWS\n",
+    "- Mistral"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "conda_python3",
+   "language": "python",
+   "name": "conda_python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/README.md b/README.md
index a27e950..43ba121 100644
--- a/README.md
+++ b/README.md
@@ -13,14 +13,15 @@ README                                # Project Summaries
 |-- README                            # Setup for infrastructure (VPC, EKS cluster etc)
 |-- 0_setup_vpc/                      # CloudFormation templates for reference VPC
 |-- 1_setup_cluster/                  # Scripts to create your cluster using EKS
-2.project/
+2.projects/
 |-- nims-inference/
 |-- trtllm-inference/
 |-- sglang-inference/
 |-- ray-service/ 
 |-- multinode-triton-trtllm-inference/
 |-- mixture-of-agents/
-|-- mig/
+|-- neuronx-distributed/
+|-- mig-gpu-partitioning/
 3.use-cases/
 |-- nims-inference/
 `-- ...
@@ -64,6 +65,14 @@ Recent advances in large language models (LLMs) have shown substantial capabilit
 
 These examples shows how to deploy LLMs like T5, Mistral using NVIDIA Triton TRT-LLM on Amazon SageMaker. See [triton-trtllm-sagemaker](2.projects/triton-trtllm-sagemaker) for more information.
 
+### NEURONX-DISTRIBUTED-INFERENCE EKS
+
+This project demonstrates deploying Large Language Models using **NeuronX Distributed Inference (NxDI)** with **vLLM-Neuron** on Amazon EKS with AWS Trainium instances (trn1.32xlarge). The solution showcases inference optimization techniques such as **fused draft speculative decoding**, along with performance monitoring of NeuronCore utilization, inference latency, and throughput. See [neuronx-distributed/nxd-inference-eks](2.projects/neuronx-distributed/nxd-inference-eks) for more information.
+
+### NEURONX DISTRIBUTED INFERENCE - Profiling and Benchmarking Guide on EC2
+
+This example walks through deploying, profiling, and benchmarking LLMs on EC2 with NxDI. See [neuronx-distributed/nxd-inference-ec2](2.projects/neuronx-distributed/nxdi-ec2-vllm) for more information.
+
 ### MIG 
 
 This directory contains a README on how you can leverage Multi-Instance GPUs (MIGs) to partition your GPUs based on your workload. For more detailed documentation, check out the [MIG user guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/).