diff --git a/1.infrastructure/1_setup_cluster/nxd-inference/Create_EKS_Cluster.md b/1.infrastructure/1_setup_cluster/nxd-inference/Create_EKS_Cluster.md new file mode 100644 index 0000000..9de121b --- /dev/null +++ b/1.infrastructure/1_setup_cluster/nxd-inference/Create_EKS_Cluster.md @@ -0,0 +1,181 @@ +# Steps to create EKS cluster with EFS + +In this example we create an EKS cluster with a managed node group of `trn1.32xlarge` compute nodes (the config below requests two). +### a. Configure AWS CLI + +``` +aws configure +``` + +### b. Create a config file for EKS cluster creation + +We have provided an example file here: [trn1-nxd-cluster-config.yaml](./trn1-nxd-cluster-config.yaml) + +``` +apiVersion: eksctl.io/v1alpha5 +kind: ClusterConfig + +metadata: + name: nxdi-inference-cluster + region: $REGION + version: "1.00" + +vpc: + id: $PLACEHOLDER_VPC_ID + subnets: + private: + $AVAILABILITY_ZONE_1: + id: $PLACEHOLDER_SUBNET_PRIVATE_1 + public: + $AVAILABILITY_ZONE_1: + id: $PLACEHOLDER_SUBNET_PUBLIC_1 + $AVAILABILITY_ZONE_2: + id: $PLACEHOLDER_SUBNET_PUBLIC_2 + + clusterEndpoints: + privateAccess: true + publicAccess: true + +cloudwatch: + clusterLogging: + enableTypes: ["*"] + +iam: + withOIDC: true + +# Adding additional section to Cluster (eksctl) for any controllers you may want to install. Uncomment as required. 
+wellKnownPolicies: +# ebsCSIController: true # Adds policies for using the ebs-csi-controller + efsCSIController: true # Adds policies for using the efs-csi-controller + +addons: + - name: vpc-cni + version: 1.18.1-eksbuild.1 + configurationValues: '{"env":{"ENABLE_PREFIX_DELEGATION":"true", "ENABLE_POD_ENI":"true", "POD_SECURITY_GROUP_ENFORCING_MODE":"standard"},"enableNetworkPolicy": "true"}' + resolveConflicts: overwrite + - name: amazon-cloudwatch-observability + version: v1.16.4-eksbuild.1 + # - name: aws-ebs-csi-driver + # version: v1.26.0-eksbuild.1 + - name: aws-efs-csi-driver + version: v2.1.9-eksbuild.1 + +managedNodeGroups: + - name: trn-compute-node-group + instanceType: trn1.32xlarge + instancePrefix: trtllm-compute-node + subnets: + - $PLACEHOLDER_SUBNET_PRIVATE_1 + privateNetworking: true + efaEnabled: true + minSize: 0 + desiredCapacity: 2 + maxSize: 2 + volumeSize: 500 + # comment out capacityReservation if you do not need ODCR + #capacityReservation: + #capacityReservationTarget: + #capacityReservationID: "$CR_ID" + ami: ami-07c8bc6b0bb890e9e + amiFamily: AmazonLinux2 + ssh: + publicKeyName: $PUBLIC_KEYPAIR_NAME + sourceSecurityGroupIds: [$SECURITY_GROUP_IDS] + updateConfig: + maxUnavailablePercentage: 50 + iam: + withAddonPolicies: + externalDNS: true + certManager: true + autoScaler: true + cloudWatch: true + ebs: true + efs: true + fsx: true + imageBuilder: true + xRay: true + awsLoadBalancerController: true + albIngress: true + +``` + + +### c. Create the EKS cluster + +``` +eksctl create cluster -f trn1-nxdi-cluster-config.yaml +``` + +## 3. (Optional) Capacity Blocks + +If you have Capacity Blocks for P5 or P4 instances, you can follow the [steps here](https://github.com/aws-samples/awsome-inference/tree/main/1.infrastructure#capacity-blocks) to create a self-managed nodegroup and add to your existing EKS cluster. + + +## 4. 
Create an EFS file system + +To let multiple pods on multiple nodes load shards of the same model and jointly serve inference requests too large for a single accelerator, we need a common, shared storage location. In Kubernetes, such shared storage locations are called persistent volumes. A persistent volume can be mounted into any number of pods and then accessed by processes running inside those pods as if it were part of the pod's file system. We will use EFS as the persistent volume. + +Additionally, we need to create a persistent volume claim, which is used to assign the persistent volume to a pod. + +### a. Create an IAM role + +Follow the steps to create an IAM role for your EFS file system: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-create-iam-resources. This role will be used later when you install the EFS CSI Driver. + +### b. Install EFS CSI driver + +Install the EFS CSI Driver through the Amazon EKS add-on in the AWS console: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-install-driver. Once that is done, check the Add-ons section in the EKS console; the driver should show `Active` under Status. + +### c. Create EFS file system + +```bash +# Create EFS filesystem (Elastic, encrypted) +EFS_JSON=$(aws efs create-file-system \ + --region us-west-2 \ + --creation-token "neuron-models-$(date +%s)" \ + --performance-mode generalPurpose \ + --throughput-mode elastic \ + --encrypted \ + --tags Key=Name,Value=neuron-inference) + +EFS_ID=$(echo "$EFS_JSON" | jq -r '.FileSystemId') + +# Wait until available (avoids race) +until [ "$(aws efs describe-file-systems \ + --region us-west-2 \ + --file-system-id "$EFS_ID" \ + --query 'FileSystems[0].LifeCycleState' \ + --output text)" = "available" ]; do + echo "waiting for EFS $EFS_ID ..." + sleep 3 +done +echo "EFS is available." 
+``` +# Create mount targets (one per AZ your nodes use) +``` +for subnet in subnet-xxx subnet-yyy subnet-zzz; do + aws efs create-mount-target \ + --region us-west-2 \ + --file-system-id "$EFS_ID" \ + --subnet-id "$subnet" \ + --security-groups sg-your-efs-security-group +done +``` +# Create StorageClass +```bash +cat < **Two steps:** (1) **Download** both models to EFS, (2) **Compile** with or without speculation. +> Compiles write to **separate directories** so you can keep both. + +### Create a secret once for your HF token: + +kubectl -n neuron-inference create secret generic hf-token \ + --from-literal=HF_TOKEN='YOUR_HF_TOKEN' + + +### Step 1 — Download both models (target + draft) + +Apply the **download job** manifest: + +```bash +kubectl apply -n neuron-inference -f fused-SD/manifests/model_download.yaml +kubectl -n neuron-inference wait --for=condition=complete job/neuron-model-download --timeout=3600s +kubectl -n neuron-inference logs job/neuron-model-download --tail=200 +``` + +Expected locations after success: + +``` +/shared/model_hub/${MODEL_DIRNAME}/config.json +/shared/model_hub/${DRAFT_DIRNAME}/config.json +``` + +Quick verify: + +```bash +kubectl -n neuron-inference exec -it -- ls -l /shared/model_hub/${MODEL_DIRNAME} | head +``` + +### Step 2 — Compile (separate outputs for spec vs non-spec) + +Apply the **compile job** manifest. Control speculation by editing the `ENABLE_SPECULATIVE` environment variable in the manifest as well as other env vars you would like to set or toggle. 
The script sets a number of defaults in the ConfigMap: + +```bash +# Non-spec compile (kept in /shared/compiled_models/Qwen3-32B/nospec_tp32) +kubectl apply -n neuron-inference -f fused-SD/manifests/compile.yaml +kubectl -n neuron-inference wait --for=condition=complete job/neuron-model-compilation --timeout=3600s +kubectl -n neuron-inference logs job/neuron-model-compilation --tail=200 + +# For speculative compile, edit the manifest to set ENABLE_SPECULATIVE=true, then: +kubectl -n neuron-inference delete job neuron-model-compilation --ignore-not-found +kubectl apply -n neuron-inference -f fused-SD/manifests/compile.yaml +kubectl -n neuron-inference wait --for=condition=complete job/neuron-model-compilation --timeout=3600s +kubectl -n neuron-inference logs job/neuron-model-compilation --tail=200 +``` + +**Output layout (no overwrites):** +``` +/shared/compiled_models/Qwen3-32B/ + ├─ nospec_tp32/ + └─ spec_slen7_tp32/ +``` + +### Step 3 — Deploy Inference + +**Important:** Before deploying, you must update the compiled model paths in `fused-SD/manifests/fsd-deploy.yaml` to match your compilation parameters. Also make sure that the env vars set for and used by vLLM are consistent with your compile-time input shapes and configs. 
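The compile job picks its output directory from the mode, the speculation length, and the TP degree, so the paths used at deploy time have to be derived the same way. The naming rule can be sketched as below, mirroring the branch in `compile_model.sh` from `compile.yaml`. The function name `compiled_model_path` is hypothetical and not part of the repository; it is only meant for checking which directory a given set of parameters maps to.

```bash
#!/usr/bin/env bash
# Sketch only: compiled_model_path is a hypothetical helper, not a repo script.
# It mirrors how compile_model.sh names its output directory.
#   $1 = COMPILED_ROOT, $2 = ENABLE_SPECULATIVE (true/false),
#   $3 = TP_DEGREE,     $4 = SPECULATION_LENGTH (defaults to 7, as in the manifest)
compiled_model_path() {
  local root="$1" speculative="$2" tp="$3" slen="${4:-7}"
  if [[ "${speculative}" == "true" ]]; then
    printf '%s/spec_slen%s_tp%s\n' "${root}" "${slen}" "${tp}"
  else
    printf '%s/nospec_tp%s\n' "${root}" "${tp}"
  fi
}

# Defaults from compile.yaml:
compiled_model_path /shared/compiled_models/Qwen3-32B false 32
# prints /shared/compiled_models/Qwen3-32B/nospec_tp32
compiled_model_path /shared/compiled_models/Qwen3-32B true 32 7
# prints /shared/compiled_models/Qwen3-32B/spec_slen7_tp32
```

If the printed directory does not exist on the shared volume, the compile job was most likely run with different parameters than the ones passed here.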
+ +The deployment manifest has hardcoded paths that need to match your compile job settings: + +```yaml +# In fsd-deploy.yaml, update these env vars to match your compilation: +- name: COMPILED_MODEL_PATH_STD + value: "/shared/compiled_models/Qwen3-32B/nospec_tp32" # for non-speculative +- name: COMPILED_MODEL_PATH_SPEC + value: "/shared/compiled_models/Qwen3-32B/spec_slen7_tp32" # for speculative +``` + +**Path format:** `/shared/compiled_models/{MODEL_NAME}/{mode}_{params}` + +Where: +- `{MODEL_NAME}` = your `COMPILED_ROOT` basename (e.g., `Qwen3-32B`) +- `{mode}` = `spec` or `nospec` +- `{params}` = `slen{SPECULATION_LENGTH}_tp{TP_DEGREE}` for spec, or just `tp{TP_DEGREE}` for nospec + +**Examples:** +- TP=32, no speculation: `nospec_tp32` +- TP=32, speculation length 7: `spec_slen7_tp32` +- TP=16, speculation length 5: `spec_slen5_tp16` + +**Quick update command:** +```bash +# Example: switch from the defaults (TP=32, SPECULATION_LENGTH=7) to TP=16, SPECULATION_LENGTH=5 (adjust as needed) +sed -i 's|/shared/compiled_models/Qwen3-32B/spec_slen7_tp32|/shared/compiled_models/Qwen3-32B/spec_slen5_tp16|g' fused-SD/manifests/fsd-deploy.yaml +sed -i 's|/shared/compiled_models/Qwen3-32B/nospec_tp32|/shared/compiled_models/Qwen3-32B/nospec_tp16|g' fused-SD/manifests/fsd-deploy.yaml +``` + +Then apply your inference deployment: + +```bash +kubectl apply -n neuron-inference -f fused-SD/manifests/fsd-deploy.yaml +kubectl -n neuron-inference wait --for=condition=available deployment/neuron-qwen-inference --timeout=600s +``` + +### Step 4 — Load Balancing with Application Load Balancer + +To expose your inference service externally and distribute traffic across multiple pods, you'll set up an Application Load Balancer (ALB) using the AWS Load Balancer Controller. + +#### 4.1 Install AWS Load Balancer Controller + +**Prerequisites:** +- Your EKS cluster must have an IAM OIDC identity provider +- The AWS Load Balancer Controller requires specific IAM permissions + +**Option A: Using eksctl and Helm (Recommended)** + +1. 
Create the IAM policy and service account: +```bash +# Download the IAM policy document +curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.7.2/docs/install/iam_policy.json + +# Create the IAM policy +aws iam create-policy \ + --policy-name AWSLoadBalancerControllerIAMPolicy \ + --policy-document file://iam_policy.json + +# Create service account with IAM role +eksctl create iamserviceaccount \ + --cluster=your-cluster-name \ + --namespace=kube-system \ + --name=aws-load-balancer-controller \ + --role-name AmazonEKSLoadBalancerControllerRole \ + --attach-policy-arn=arn:aws:iam::ACCOUNT-ID:policy/AWSLoadBalancerControllerIAMPolicy \ + --approve +``` + +2. Install the controller: +```bash +# Add the EKS chart repo +helm repo add eks https://aws.github.io/eks-charts +helm repo update + +# Install AWS Load Balancer Controller +helm install aws-load-balancer-controller eks/aws-load-balancer-controller \ + -n kube-system \ + --set clusterName=your-cluster-name \ + --set serviceAccount.create=false \ + --set serviceAccount.name=aws-load-balancer-controller +``` + +3. Verify installation: +```bash +kubectl get deployment -n kube-system aws-load-balancer-controller +kubectl logs -n kube-system deployment/aws-load-balancer-controller +``` + +**Option B: Using Helm (Alternative)** +Follow the [AWS documentation for Helm installation](https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html). 
+ +#### 4.2 Deploy the Ingress + +Once the AWS Load Balancer Controller is installed and running: + +```bash +# Apply the ingress configuration +kubectl -n neuron-inference apply -f fused-SD/manifests/neuron-ingress.yaml + +# Monitor ingress creation (wait for ADDRESS to appear) +kubectl -n neuron-inference get ingress neuron-qwen-ingress -w +``` + +The ingress will create an Application Load Balancer that: +- Routes traffic to your inference service pods +- Provides health checks on the `/health` endpoint +- Serves HTTP by default and can serve HTTPS once a certificate is configured (see 4.4) +- Automatically scales with your deployment + +#### 4.3 Test Your Deployment + +Once the ALB is provisioned (this can take 2-3 minutes): + +```bash +# Get the ALB hostname +ALB=$(kubectl -n neuron-inference get ing neuron-qwen-ingress -o jsonpath='{.status.loadBalancer.ingress[0].hostname}') +echo "ALB Endpoint: http://$ALB" + +# Test the health endpoint +curl -i "http://$ALB/health" + +# List available models +curl -i "http://$ALB/v1/models" + +# Test inference with a simple completion +curl -s "http://$ALB/v1/completions" \ + -H "Content-Type: application/json" \ + -d '{ + "model": "/shared/model_hub/Qwen3-32B", + "prompt": "Say hi from vLLM on Neuron.", + "max_tokens": 64, + "temperature": 0.7 + }' + +# Test with chat completions API +curl -s "http://$ALB/v1/chat/completions" \ + -H "Content-Type: application/json" \ + -d '{ + "model": "/shared/model_hub/Qwen3-32B", + "messages": [{"role": "user", "content": "Hello! How are you?"}], + "max_tokens": 100 + }' +``` +Congratulations! 
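The first request can fail while the ALB is still provisioning or the model is still loading, so polling the health endpoint is often easier than retrying by hand. The helper below is a sketch under assumptions: `wait_until` is a hypothetical function, not part of the repository, and the attempt count and delay are placeholder values to tune for your model's startup time.

```bash
#!/usr/bin/env bash
# Sketch only: wait_until is a hypothetical helper, not a repo script.
# Runs a command up to <attempts> times, sleeping <delay> seconds between
# failed tries; returns 0 on the first success and 1 if every try fails.
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if ((i < attempts)); then
      sleep "${delay}"
    fi
  done
  return 1
}

# Poll the health endpoint only when ALB has been set as in the commands above.
# 30 attempts with a 10 second delay is an assumed budget of about five minutes.
if [[ -n "${ALB:-}" ]]; then
  wait_until 30 10 curl -sf -o /dev/null "http://${ALB}/health" \
    && echo "endpoint is healthy" \
    || echo "endpoint did not become healthy in time" >&2
fi
```

The same helper can wrap any of the `curl` checks from this step, since it only looks at the exit status of the command it is given.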
+ +#### 4.4 Production Considerations + +For production deployments, consider: + +- **HTTPS/TLS**: Configure SSL certificates using AWS Certificate Manager +- **Custom Domain**: Set up Route 53 records pointing to your ALB +- **WAF Integration**: Add AWS WAF for additional security +- **Access Logging**: Enable ALB access logs for monitoring and debugging +- **Target Group Settings**: Tune health check intervals and thresholds based on your model's startup time + +Example ingress with HTTPS: +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: neuron-qwen-ingress-https + annotations: + kubernetes.io/ingress.class: alb + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:region:account:certificate/cert-id + alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]' + alb.ingress.kubernetes.io/ssl-redirect: '443' +spec: + rules: + - host: your-domain.com + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: neuron-qwen-svc + port: + number: 8000 +``` + +### Step 5 — Deploy Neuron Monitor for Observability + +AWS Neuron Monitor provides comprehensive monitoring and observability for your Neuron workloads, including hardware utilization, model performance metrics, and system health indicators. 
+ +#### 5.1 Understanding Neuron Monitor + +Neuron Monitor offers: +- **Hardware Metrics**: NeuronCore utilization, memory usage, temperature +- **Model Performance**: Inference latency, throughput, queue depth +- **System Health**: Device status, error rates, compilation metrics +- **Integration**: Works with Prometheus, Grafana, CloudWatch, and other monitoring systems + +#### 5.2 Deploy Neuron Monitor DaemonSet + +The Neuron Monitor runs as a DaemonSet to collect metrics from all Neuron devices across your cluster: + +```bash +# Create the Neuron Monitor DaemonSet +cat < neuron-monitor-policy.json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "cloudwatch:PutMetricData", + "logs:CreateLogGroup", + "logs:CreateLogStream", + "logs:PutLogEvents", + "logs:DescribeLogStreams" + ], + "Resource": "*" + } + ] +} +EOF + +# Create the policy +aws iam create-policy \ + --policy-name NeuronMonitorCloudWatchPolicy \ + --policy-document file://neuron-monitor-policy.json + +# Create service account with IAM role (replace ACCOUNT-ID and CLUSTER-NAME) +eksctl create iamserviceaccount \ + --cluster=CLUSTER-NAME \ + --namespace=neuron-inference \ + --name=neuron-monitor \ + --role-name=NeuronMonitorRole \ + --attach-policy-arn=arn:aws:iam::ACCOUNT-ID:policy/NeuronMonito +--approve +``` +**Verify Neuron Monitor Deployment** + +```bash +# Check DaemonSet status +kubectl -n neuron-inference get daemonset neuron-monitor +kubectl -n neuron-inference get pods -l app=neuron-monitor + +# View logs +kubectl -n neuron-inference logs -l app=neuron-monitor --tail=50 +``` + +#### 5.5 Configure Prometheus Integration (Optional) + +To scrape metrics with Prometheus, add the following ServiceMonitor: + +```bash +cat < -- test -f /shared/model_hub/${MODEL_DIRNAME}/config.json && echo OK + ``` + +**Spec compile overwrote non-spec?** +- With the provided manifests, outputs are separated per mode (`nospec_*` vs `spec_*`). 
If you see overwrites, confirm your `COMPILED_ROOT` and job env vars. + +**Inference deployment can't find compiled artifacts** +- Check that the paths in `fsd-deploy.yaml` match your actual compilation output: + ```bash + kubectl -n neuron-inference exec -it -- ls -la /shared/compiled_models/Qwen3-32B/ + ``` +- Update the `COMPILED_MODEL_PATH_STD` and `COMPILED_MODEL_PATH_SPEC` env vars to match your TP degree and speculation length. + +**Neuron compiler errors** +- These are model/hardware/SDK specific. Re-run with smaller `TP_DEGREE`, confirm SDK image version, or inspect `/shared/compile*.log`. Consider filing an issue with logs. + +## Cleanup + +```bash +kubectl -n neuron-inference delete job neuron-model-download --ignore-not-found +kubectl -n neuron-inference delete job neuron-model-compilation --ignore-not-found +kubectl -n neuron-inference delete deployment/neuron-qwen-inference service/neuron-qwen-svc --ignore-not-found +kubectl -n neuron-inference delete pvc efs-models-pvc --ignore-not-found +kubectl delete namespace neuron-inference +``` + +## References + +- [AWS Neuron Documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) – NxDI & compiler guidance +- [NeuronX Distributed Inference Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index.html) +- [vLLM Documentation](https://docs.vllm.ai/) – Server flags and deployment considerations +- [Kubernetes Jobs Documentation](https://kubernetes.io/docs/concepts/workloads/controllers/job/) +- [Kubernetes Persistent Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) +- [AWS Neuron Helm Charts](https://github.com/aws-neuron/aws-neuron-helm-charts) +- [AWS Trainium Instance Types](https://aws.amazon.com/ec2/instance-types/trn1/) diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/.env.example b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/.env.example new file mode 100644 index 
0000000..bfc3d43 --- /dev/null +++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/.env.example @@ -0,0 +1,24 @@ +# Hugging Face Configuration +HF_TOKEN=your_huggingface_token_here +HF_MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B +HF_DRAFT_MODEL_ID=meta-llama/Llama-3.2-1B-Instruct +MODEL_NAME=llama-3-70B-inst + +# Inference Configuration +MAX_MODEL_LEN=12800 +SEQ_LEN=12800 +MAX_CONTEXT_LEN=12288 + +# Neuron Configuration +TENSOR_PARALLEL_SIZE=32 +TP_DEGREE=32 # Add this - same as TENSOR_PARALLEL_SIZE +NAMESPACE=neuron-inference +BATCH_SIZE=1 +MAX_NUM_SEQS=1 +ENABLE_SPECULATIVE=false # Start with false +SPECULATION_LENGTH=7 # Add this + +# Paths +MODEL_PATH=/shared/models/Llama-3.3-70B-Instruct +COMPILED_MODEL_PATH=/shared/traced_model/Llama-3.3-70B-Instruct +DRAFT_MODEL_PATH=/shared/models/Llama-3.2-1B-Instruct \ No newline at end of file diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/compile.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/compile.yaml new file mode 100644 index 0000000..2081e41 --- /dev/null +++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/compile.yaml @@ -0,0 +1,140 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: neuron-compile-scripts + namespace: neuron-inference +data: + compile_model.sh: | + #!/usr/bin/env bash + set -euo pipefail + + # ---- Inputs / defaults ---- + : "${ENABLE_SPECULATIVE:=false}" + : "${SPECULATION_LENGTH:=7}" + : "${TP_DEGREE:=32}" + : "${BATCH_SIZE:=1}" + : "${MAX_CONTEXT_LEN:=8192}" + : "${SEQ_LEN:=8192}" + + # Where the downloader put things + : "${MODEL_DIRNAME:=Qwen3-32B}" + : "${DRAFT_DIRNAME:=Qwen3-0.6B}" + : "${MODEL_ROOT:=/shared/model_hub}" + + MODEL_PATH="${MODEL_ROOT}/${MODEL_DIRNAME}" + DRAFT_MODEL_PATH="${MODEL_ROOT}/${DRAFT_DIRNAME}" + + # Option 1: separate compiled outputs per configuration + : "${COMPILED_ROOT:=/shared/compiled_models/Qwen3-32B}" + if [[ "${ENABLE_SPECULATIVE}" == "true" ]]; then 
+ COMPILED_MODEL_PATH="${COMPILED_ROOT}/spec_slen${SPECULATION_LENGTH}_tp${TP_DEGREE}" + else + COMPILED_MODEL_PATH="${COMPILED_ROOT}/nospec_tp${TP_DEGREE}" + fi + mkdir -p "${COMPILED_MODEL_PATH}" + + # Sanity checks (fail fast if downloads missing) + [[ -f "${MODEL_PATH}/config.json" ]] || { echo "Missing target model at ${MODEL_PATH}"; exit 2; } + if [[ "${ENABLE_SPECULATIVE}" == "true" ]]; then + [[ -f "${DRAFT_MODEL_PATH}/config.json" ]] || { echo "Missing draft model at ${DRAFT_MODEL_PATH}"; exit 2; } + fi + + echo "Speculative=${ENABLE_SPECULATIVE} SLEN=${SPECULATION_LENGTH} TP=${TP_DEGREE} BS=${BATCH_SIZE} SEQ=${SEQ_LEN}" + echo "MODEL_PATH=${MODEL_PATH}" + echo "DRAFT_MODEL_PATH=${DRAFT_MODEL_PATH}" + echo "COMPILED_MODEL_PATH=${COMPILED_MODEL_PATH}" + + BASE_ARGS=( + --model-type qwen3 + --task-type causal-lm + run + --model-path "${MODEL_PATH}" + --compiled-model-path "${COMPILED_MODEL_PATH}" + --torch-dtype bfloat16 + --start_rank_id 0 + --local_ranks_size "${TP_DEGREE}" + --tp-degree "${TP_DEGREE}" + --batch-size "${BATCH_SIZE}" + --max-context-length "${MAX_CONTEXT_LEN}" + --seq-len "${SEQ_LEN}" + --on-device-sampling + --top-k 1 + --do-sample + --sequence-parallel-enabled + --cc-pipeline-tiling-factor 1 + --enable-bucketing + --context-encoding-buckets 512 1024 2048 4096 8192 + --token-generation-buckets 512 1024 2048 4096 8192 + --compile-only + --prompt "What is annapurna labs?" 
+ ) + + LOG=/shared/compile.log + if [[ "${ENABLE_SPECULATIVE}" == "true" ]]; then + BASE_ARGS+=( --draft-model-path "${DRAFT_MODEL_PATH}" --enable-fused-speculation --speculation-length "${SPECULATION_LENGTH}" ) + LOG=/shared/compile_speculative.log + fi + + echo "inference_demo ${BASE_ARGS[*]}" + inference_demo "${BASE_ARGS[@]}" 2>&1 | tee "${LOG}" +--- +apiVersion: batch/v1 +kind: Job +metadata: + name: neuron-model-compilation + namespace: neuron-inference +spec: + template: + spec: + restartPolicy: OnFailure + nodeSelector: + workload-type: "neuron-inference" + tolerations: + - key: aws.amazon.com/neuron + operator: Exists + effect: NoSchedule + containers: + - name: model-compiler + image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py310-sdk2.25.0-ubuntu22.04 + command: ["/bin/bash","-lc"] + args: + - | + set -euxo pipefail + df -h /shared || true + bash /scripts/compile_model.sh + env: + # tune/override as needed: + - name: ENABLE_SPECULATIVE + value: "false" # set "true" for SD compile + - name: SPECULATION_LENGTH + value: "7" + - name: TP_DEGREE + value: "32" + - name: MODEL_DIRNAME + value: "Qwen3-32B" + - name: DRAFT_DIRNAME + value: "Qwen3-0.6B" + - name: COMPILED_ROOT + value: "/shared/compiled_models/Qwen3-32B" + resources: + limits: + aws.amazon.com/neuroncore: "32" + memory: "300Gi" + cpu: "64" + requests: + aws.amazon.com/neuroncore: "32" + memory: "300Gi" + cpu: "64" + volumeMounts: + - name: shared-storage + mountPath: /shared + - name: scripts + mountPath: /scripts + volumes: + - name: shared-storage + persistentVolumeClaim: + claimName: efs-models-pvc + - name: scripts + configMap: + name: neuron-compile-scripts + defaultMode: 0755 diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/fsd-deploy.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/fsd-deploy.yaml new file mode 100644 index 0000000..e2ad23d --- /dev/null +++ 
b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/fsd-deploy.yaml @@ -0,0 +1,170 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: neuron-vllm-scripts + namespace: neuron-inference +data: + start_vllm.sh: | + #!/bin/bash + set -euo pipefail + + : "${ENABLE_SPECULATIVE:=false}" + : "${SPECULATION_LENGTH:=7}" + : "${TP_DEGREE:=32}" + : "${BATCH_SIZE:=1}" + : "${MAX_CONTEXT_LEN:=8192}" + : "${SEQ_LEN:=8192}" + : "${MAX_NUM_SEQS:=1}" + + : "${MODEL_DIRNAME:=Qwen3-32B}" + : "${DRAFT_DIRNAME:=Qwen3-0.6B}" + : "${MODEL_ROOT:=/shared/model_hub}" + + MODEL_PATH="${MODEL_ROOT}/${MODEL_DIRNAME}" + DRAFT_MODEL_PATH="${MODEL_ROOT}/${DRAFT_DIRNAME}" + + : "${COMPILED_MODEL_PATH_STD:=/shared/compiled_models/Qwen3-32B/nospec_tp32}" + : "${COMPILED_MODEL_PATH_SPEC:=/shared/compiled_models/Qwen3-32B/spec_slen7_tp32}" + + # -------- Select compiled dir based on mode -------- + if [[ "${ENABLE_SPECULATIVE}" == "true" ]]; then + export NEURON_COMPILED_ARTIFACTS="${COMPILED_MODEL_PATH_SPEC}" + else + export NEURON_COMPILED_ARTIFACTS="${COMPILED_MODEL_PATH_STD}" + fi + + echo "Mode: ${ENABLE_SPECULATIVE} | TP=${TP_DEGREE} | MAX_NUM_SEQS=${MAX_NUM_SEQS} | MAX_LEN=${SEQ_LEN}" + echo "Using NEURON_COMPILED_ARTIFACTS=${NEURON_COMPILED_ARTIFACTS}" + export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference" + + if [[ "${ENABLE_SPECULATIVE}" == "true" ]]; then + echo "Starting vLLM (speculative decoding)…" + VLLM_RPC_TIMEOUT=100000 python -m vllm.entrypoints.openai.api_server \ + --model "${MODEL_PATH}" \ + --speculative-model "${DRAFT_MODEL_PATH}" \ + --num-speculative-tokens "${SPECULATION_LENGTH}" \ + --max-model-len "${SEQ_LEN}" \ + --use-v2-block-manager \ + --max-num-seqs "${MAX_NUM_SEQS}" \ + --tensor-parallel-size "${TP_DEGREE}" \ + --device neuron \ + --override-neuron-config "{\"enable_fused_speculation\": true}" \ + --host 0.0.0.0 --port 8000 + else + echo "Starting vLLM (standard)…" + VLLM_RPC_TIMEOUT=100000 python -m vllm.entrypoints.openai.api_server \ 
+ --model "${MODEL_PATH}" \ + --max-model-len "${SEQ_LEN}" \ + --max-num-seqs "${MAX_NUM_SEQS}" \ + --tensor-parallel-size "${TP_DEGREE}" \ + --device neuron \ + --use-v2-block-manager \ + --host 0.0.0.0 --port 8000 + fi +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: neuron-qwen-inference + namespace: neuron-inference +spec: + replicas: 1 + selector: + matchLabels: + app: neuron-qwen-inference + template: + metadata: + labels: + app: neuron-qwen-inference + spec: + nodeSelector: + workload-type: "neuron-inference" + node.kubernetes.io/instance-type: "trn1.32xlarge" + tolerations: + - key: aws.amazon.com/neuron + operator: Exists + effect: NoSchedule + containers: + - name: vllm-server + image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py310-sdk2.25.0-ubuntu22.04 + command: ["/bin/bash","-lc"] + args: + - | + set -euo pipefail + exec /scripts/start_vllm.sh + ports: + - containerPort: 8000 + name: http + resources: + limits: + aws.amazon.com/neuroncore: "32" + memory: "400Gi" + cpu: "120" + requests: + aws.amazon.com/neuroncore: "32" + memory: "400Gi" + cpu: "120" + env: + - name: ENABLE_SPECULATIVE + value: "false" + - name: SPECULATION_LENGTH + value: "7" + - name: TP_DEGREE + value: "32" + - name: BATCH_SIZE + value: "1" + - name: MAX_CONTEXT_LEN + value: "8192" + - name: SEQ_LEN + value: "8192" + - name: MAX_NUM_SEQS + value: "1" + - name: MODEL_DIRNAME + value: "Qwen3-32B" + - name: DRAFT_DIRNAME + value: "Qwen3-0.6B" + - name: MODEL_ROOT + value: "/shared/model_hub" + - name: COMPILED_MODEL_PATH_STD + value: "/shared/compiled_models/Qwen3-32B/nospec_tp32" + - name: COMPILED_MODEL_PATH_SPEC + value: "/shared/compiled_models/Qwen3-32B/spec_slen7_tp32" + volumeMounts: + - name: shared-storage + mountPath: /shared + - name: vllm-scripts + mountPath: /scripts + readinessProbe: + httpGet: + path: /health + port: 8000 + initialDelaySeconds: 30 + periodSeconds: 5 + livenessProbe: + httpGet: + path: /health + port: 8000 + 
initialDelaySeconds: 60 + periodSeconds: 10 + volumes: + - name: shared-storage + persistentVolumeClaim: + claimName: efs-models-pvc + - name: vllm-scripts + configMap: + name: neuron-vllm-scripts + defaultMode: 0755 +--- +apiVersion: v1 +kind: Service +metadata: + name: neuron-qwen-svc + namespace: neuron-inference +spec: + type: ClusterIP + selector: + app: neuron-qwen-inference + ports: + - name: http + port: 8000 + targetPort: 8000 diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/model_download.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/model_download.yaml new file mode 100644 index 0000000..5a44c92 --- /dev/null +++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/model_download.yaml @@ -0,0 +1,96 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: neuron-download-scripts + namespace: neuron-inference +data: + download_models.sh: | + #!/usr/bin/env bash + set -euo pipefail + + : "${HF_MODEL_ID:=Qwen/Qwen3-32B}" + : "${HF_DRAFT_MODEL_ID:=Qwen/Qwen3-0.6B}" + + : "${MODEL_DIRNAME:=Qwen3-32B}" + : "${DRAFT_DIRNAME:=Qwen3-0.6B}" + + : "${MODEL_ROOT:=/shared/model_hub}" + MODEL_PATH="${MODEL_ROOT}/${MODEL_DIRNAME}" + DRAFT_MODEL_PATH="${MODEL_ROOT}/${DRAFT_DIRNAME}" + + mkdir -p "$MODEL_PATH" "$DRAFT_MODEL_PATH" + + if ! command -v hf >/dev/null 2>&1; then + python -m pip install -q --no-cache-dir 'huggingface_hub[cli]' + fi + + if [[ -z "${HF_TOKEN:-}" ]]; then + echo "HF_TOKEN not set; cannot access gated repos." 
>&2 + exit 1 + fi + export HUGGINGFACE_HUB_TOKEN="$HF_TOKEN" + + echo "Downloading target: ${HF_MODEL_ID} -> ${MODEL_PATH}" + hf download "${HF_MODEL_ID}" --local-dir "${MODEL_PATH}" --exclude "*/.git/*" + echo "OK: ${MODEL_PATH}" + + echo "Downloading draft: ${HF_DRAFT_MODEL_ID} -> ${DRAFT_MODEL_PATH}" + hf download "${HF_DRAFT_MODEL_ID}" --local-dir "${DRAFT_MODEL_PATH}" --exclude "*/.git/*" + echo "OK: ${DRAFT_MODEL_PATH}" + + echo "Listing:" + find "${MODEL_PATH}" -maxdepth 1 -type f | head -n 20 + find "${DRAFT_MODEL_PATH}" -maxdepth 1 -type f | head -n 20 +--- +apiVersion: batch/v1 +kind: Job +metadata: + name: neuron-model-download + namespace: neuron-inference +spec: + template: + spec: + restartPolicy: OnFailure + nodeSelector: + workload-type: "neuron-inference" + tolerations: + - key: aws.amazon.com/neuron + operator: Exists + effect: NoSchedule + containers: + - name: downloader + image: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py310-sdk2.25.0-ubuntu22.04 + command: ["/bin/bash","-lc"] + args: + - | + set -euxo pipefail + df -h /shared || true + bash /scripts/download_models.sh 2>&1 | tee /shared/download.log + env: + # Set/override these as needed: + - name: HF_MODEL_ID + value: "Qwen/Qwen3-32B" + - name: HF_DRAFT_MODEL_ID + value: "Qwen/Qwen3-0.6B" + - name: MODEL_DIRNAME + value: "Qwen3-32B" + - name: DRAFT_DIRNAME + value: "Qwen3-0.6B" + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: hf-token + key: HF_TOKEN + volumeMounts: + - name: shared-storage + mountPath: /shared + - name: scripts + mountPath: /scripts + volumes: + - name: shared-storage + persistentVolumeClaim: + claimName: efs-models-pvc + - name: scripts + configMap: + name: neuron-download-scripts + defaultMode: 0755 diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/neuron-ingress.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/neuron-ingress.yaml new file mode 100644 index 0000000..76a2c0c --- 
/dev/null +++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/neuron-ingress.yaml @@ -0,0 +1,29 @@ +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: neuron-qwen-ingress + namespace: neuron-inference + annotations: + kubernetes.io/ingress.class: alb + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/target-type: ip + # Health checks + alb.ingress.kubernetes.io/healthcheck-path: /health + alb.ingress.kubernetes.io/healthcheck-port: "8000" + alb.ingress.kubernetes.io/success-codes: "200" + # Ensure ALB is created in the AZs with your pods + alb.ingress.kubernetes.io/subnets: subnet-0186a8f2614f8c522, subnet-0ad9fdda1d6f7dc98 #replace with yours- min 2 + # (Optional) Cross-zone LB + alb.ingress.kubernetes.io/load-balancer-attributes: load_balancing.cross_zone.enabled=true +spec: + ingressClassName: alb + rules: + - http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: neuron-qwen-svc + port: + number: 8000 diff --git a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/storage.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/storage.yaml new file mode 100644 index 0000000..8881a9e --- /dev/null +++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/manifests/storage.yaml @@ -0,0 +1,28 @@ +apiVersion: v1 +kind: PersistentVolume +metadata: + name: efs-pv +spec: + capacity: + storage: 500Gi + volumeMode: Filesystem + accessModes: + - ReadWriteMany + persistentVolumeReclaimPolicy: Retain + storageClassName: efs-sc + csi: + driver: efs.csi.aws.com + volumeHandle: fs-0f239c95d507d9517 # Change to your own ID +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: efs-models-pvc # aligned + namespace: neuron-inference +spec: + accessModes: + - ReadWriteMany + storageClassName: efs-sc + resources: + requests: + storage: 500Gi \ No newline at end of file diff --git 
a/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/neuron-monitor/neuron-container.yaml b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/neuron-monitor/neuron-container.yaml new file mode 100644 index 0000000..9e77bf0 --- /dev/null +++ b/2.projects/neuronx-distributed/nxd-inference-eks/fused-SD/neuron-monitor/neuron-container.yaml @@ -0,0 +1,62 @@ +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: neuron-monitor + namespace: neuron-monitor + labels: + app: neuron-monitor + version: v1 +spec: + selector: + matchLabels: + app: neuron-monitor + template: + metadata: + labels: + app: neuron-monitor + version: v1 + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/os + operator: In + values: + - linux + - key: node.kubernetes.io/instance-type + operator: In + values: + - trn1.2xlarge + - trn1.32xlarge + - trn1n.32xlarge + - inf1.xlarge + - inf1.2xlarge + - inf1.6xlarge + - inf2.xlarge + - inf2.8xlarge + - inf2.24xlarge + - inf2.48xlarge + containers: + - name: neuron-monitor + image: public.ecr.aws/neuron/neuron-monitor:1.0.1 + ports: + - containerPort: 8000 + command: + - "/opt/bin/entrypoint.sh" + args: + - "--port" + - "8000" + resources: + limits: + cpu: 500m + memory: 256Mi + requests: + cpu: 256m + memory: 128Mi + env: + - name: GOMEMLIMIT + value: 160MiB + securityContext: + privileged: true diff --git a/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img-neff.png b/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img-neff.png new file mode 100644 index 0000000..7b5d5a4 Binary files /dev/null and b/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img-neff.png differ diff --git a/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img.txt b/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img.txt new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/2.projects/neuronx-distributed/nxdi-ec2-vllm/imgs/img.txt @@ -0,0 +1 @@ + diff 
--git a/2.projects/neuronx-distributed/nxdi-ec2-vllm/mistral-small-3-0-nxdi-guide.ipynb b/2.projects/neuronx-distributed/nxdi-ec2-vllm/mistral-small-3-0-nxdi-guide.ipynb new file mode 100644 index 0000000..e2a00e6 --- /dev/null +++ b/2.projects/neuronx-distributed/nxdi-ec2-vllm/mistral-small-3-0-nxdi-guide.ipynb @@ -0,0 +1,1173 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "5a972332", + "metadata": {}, + "source": [ + "# Guide to deploy, benchmark, and profile Mistral Small 2501 with NXDI and VLLM on Trn1\n", + "\n", + "This notebook provides a step-by-step guide for serving, profiling, and running benchmarks on Mistral Small 24B model on a **Trn1** instance. \n", + "\n", + "## Mistral Small 2501\n", + "\n", + "[Mistral Small 3.0](https://mistral.ai/news/mistral-small-3) is a 24B-parameter language model from Mistral AI optimized for low-latency performance across common AI tasks. Released under the Apache 2.0 license, it features both pre-trained and instruction-tuned versions designed for efficient local deployment. The model achieves 81% accuracy on the MMLU benchmark and performs competitively with larger models like Llama 3.3 70B and Qwen 32B, while operating at three times the speed on equivalent hardware.\n", + "\n", + "## Neuronx-Distributed-Inference (NxDI)\n", + "\n", + "[NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html#nxdi-overview) (where NxD stands for NeuronX Distributed) is an open-source PyTorch-based inference library that simplifies deep learning model deployment on AWS Inferentia and Trainium instances. Introduced with Neuron SDK 2.21 release, it offers advanced inference capabilities, including features such as continuous batching and speculative decoding for high performance inference. Additionally, it supports inference engine for vLLM for seamless integration with the majority of customers’ production deployment systems. 
ML developers can use NxD Inference library at different levels of abstraction that fits their inference use case.\n", + "\n", + "## Overview\n", + "\n", + "1. **Check/Install Dependencies** for AWS Neuron (NXDI, vLLM fork, etc.).\n", + "2. **Optional**: Install additional utilities (`inference-benchmarking` (lm_eval), InfluxDB, `llmperf` for performance benchmarking, etc.).\n", + "3. **Download** Mistral Small 24B base model.\n", + "4. **Spin Up** a VLLM server, benchmark and pull a profile.\n", + " \n", + "### Prerequisites\n", + "\n", + "- **Amazon EC2 Trn1.32xlarge instance** with AWS Neuron drivers and recommended PyTorch environment.\n", + "- **NXDI virtual environment** (e.g., `aws_neuronx_venv_pytorch_2_5_nxd_inference`) is required.\n", + "\n", + "- To request a quota increase for `trn1.32xlarge` on EC2, follow these steps:\n", + "\n", + "1. Navigate to the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).\n", + "2. Choose Amazon EC2.\n", + "3. Review your default quota for the following resources:\n", + " - `trn1.32xlarge` for ec2 on-demand use\n", + "4. If needed, request a quota increase for these resources.\n" + ] + }, + { + "cell_type": "markdown", + "id": "355c31c9-b15d-4879-84d4-92d62fd82d4d", + "metadata": {}, + "source": [ + "
\n", + "\n", + "NOTE: To make sure that you have enough quotas to support your usage requirements, it's a best practice to monitor and manage your service quotas. Requests for Amazon EC2 service quota increases are subject to review by AWS engineering teams. Also, service quota increase requests aren't immediately processed when you submit a request. After your request is processed, you receive an email notification.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "8934c846-b3f5-458a-b11b-77b813904e0c", + "metadata": {}, + "source": [ + "### Create Your EC2 instance\n", + "\n", + "Follow the steps below to set up your EC2 instance:\n", + "\n", + "#### Steps:\n", + "- Navigate to the EC2 dashboard from the AWS Management Console and launch your instance.\n", + "- Search for the Ubuntu 22.04 Neuron DLAMI.\n", + "- Choose the instance size as Trn1.32xlarge or any other Neuron-based instance that can fit the model.\n", + "- Set the inbound rule for SSH to your local machine's IP address or to anywhere (note that allowing traffic from any IPv4 address is not a recommended security practice; please secure these ports once you are done testing).\n", + "- Create and specify your SSH key in the instance configuration step. You will need your .pem file.\n", + "- Create your instance.\n", + "- Once you have launched your instance, open either your terminal or VS Code and follow the steps below:\n", + "\n", + "#### ssh for powershell:\n", + "\n", + "`$PUBLIC_DNS=\"paste your public ipv4 dns here\" # public ipv4 DNS, e.g. ec2-3-80-.... from ec2 console`\n", + "`$KEY_PATH=\"paste ssh key path here\" # local path to key, e.g. ssh/trn.pem`\n", + "\n", + "`ssh -i $KEY_PATH -L 8888:127.0.0.1:8888 -L 8000:127.0.0.1:8000 -L 8086:127.0.0.1:8086 -L 3001:127.0.0.1:3001 ubuntu@$PUBLIC_DNS`\n", + "\n", + "#### ssh for linux/macOS:\n", + "\n", + "`export PUBLIC_DNS=\"paste your public ipv4 dns here\" # public ipv4 DNS, e.g. ec2-3-80-.... from ec2 console`\n", + "`export KEY_PATH=\"paste ssh key path here\" # local path to key, e.g. ssh/trn.pem`\n", + "\n", + "`ssh -i $KEY_PATH -L 8888:127.0.0.1:8888 -L 8000:127.0.0.1:8000 -L 8086:127.0.0.1:8086 -L 3001:127.0.0.1:3001 ubuntu@$PUBLIC_DNS`\n", + "\n", + "You should now be connected to your EC2 instance over SSH. 
\n", + "\n", + "- Activate your NXDI venv:\n", + "\n", + "`source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate`\n", + "\n", + "- Start the Jupyter server:\n", + "\n", + "`jupyter lab --no-browser --port 8888 --ip 0.0.0.0`\n", + "\n", + "You should see the familiar Jupyter output with a URL to the notebook.\n", + "\n", + "`http://localhost:8888/....`\n", + "\n", + "Open that URL and a Jupyter environment opens in your local browser. Upload this notebook to your Jupyter environment and run the steps in the cells below." + ] + }, + { + "cell_type": "markdown", + "id": "c8d12827-cdd6-49ad-a5b9-fae27f1e7cda", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "511f6786-8ed7-4036-9070-3d6669fcb92b", + "metadata": {}, + "source": [ + "## Install and Set up Dependencies" + ] + }, + { + "cell_type": "markdown", + "id": "3652fc5a", + "metadata": {}, + "source": [ + "### 1. Validate / Activate Python Environment\n", + "\n", + "Inside a Jupyter notebook, using `source myenv/bin/activate` directly will not persist the environment in subsequent cells, because source runs in a subshell. 
Please run the command to activate the venv in the terminal, or activate it prior to spinning up the server." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1fa75aff", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "# (Optional) Uncomment or modify the following line to activate a custom environment.\n", + "#source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate\n", + "\n", + "echo 'Python environment check:'\n", + "which python\n", + "python --version" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e657dec1-61fc-4cf7-bec8-48ecd1213c75", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile requirements.txt\n", + "torch==2.5.1\n", + "transformers==4.45.2\n", + "huggingface_hub\n", + "git-lfs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3f4844ef-f16c-4bda-bb05-62ca83a7d65f", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -U -r requirements.txt --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "88dc63be-90d4-4145-b42b-ef3fcf4c1f6c", + "metadata": {}, + "outputs": [], + "source": [ + "! pip list | grep neuron" + ] + }, + { + "cell_type": "markdown", + "id": "e9200f12-356c-43f9-afad-68cb1b486e4b", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "18ea6ed2", + "metadata": {}, + "source": [ + "### 2. Install Neuron vLLM Fork\n", + "\n", + "If you would like to serve your model via [vLLM](https://vllm.readthedocs.io/en/latest/) specialized for Neuron-based inference, you can install AWS Neuron's vLLM fork. NxD Inference integrates into vLLM by extending the model execution components responsible for loading and invoking models used in vLLM’s LLMEngine (see [link](https://docs.vllm.ai/en/latest/design/arch_overview.html#llm-engine) for more details on vLLM architecture). 
This means input processing, scheduling and output processing follow the default vLLM behavior.\n", + "\n", + "You enable the Neuron integration in vLLM by setting the device type used by `vLLM` to `neuron`.\n", + "\n", + "Currently, we support continuous batching and streaming generation in the NxD Inference vLLM integration. We are working with the vLLM community to enable support for other vLLM features like PagedAttention and Chunked Prefill on Neuron instances through NxD Inference in upcoming releases.\n", + "\n", + "\n", + "Skip this step if you do not need the vLLM server. Cloning and installing vLLM takes 8-10 minutes to complete\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb9c80f8", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "set -euxo pipefail\n", + "\n", + "if [ -d \"/home/ubuntu/upstreaming-to-vllm\" ]; then\n", + " echo \"Neuron vLLM fork already cloned. Skipping.\"\n", + "else\n", + " echo \"Cloning and installing AWS Neuron vLLM fork...\"\n", + " cd /home/ubuntu/\n", + " git clone -b neuron-2.22-vllm-v0.7.2 https://github.com/aws-neuron/upstreaming-to-vllm.git #neuron 2.22 vllm version\n", + " cd upstreaming-to-vllm\n", + " pip install -r requirements-neuron.txt --quiet\n", + "\n", + " # Install in editable mode with device set to neuron\n", + " VLLM_TARGET_DEVICE=\"neuron\" pip install -e . --quiet\n", + "fi" + ] + }, + { + "cell_type": "markdown", + "id": "b2c245e5-21ee-4002-b85e-01127d3e4340", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "adb91034", + "metadata": {}, + "source": [ + "### 3. (Optional) Install benchmarking and profiling tools\n", + "\n", + "#### 3.1 Install llmperf\n", + "\n", + "If you'd like to run benchmarks or load tests, you can install [llmperf](https://github.com/ray-project/llmperf). 
Skip if not needed.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cba57d72", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "if pip show llmperf > /dev/null 2>&1; then\n", + " echo \"llmperf is already installed. Skipping.\"\n", + "else\n", + " echo \"Installing llmperf...\"\n", + " cd /home/ubuntu/\n", + " git clone https://github.com/ray-project/llmperf.git > /dev/null 2>&1 --quiet\n", + " cd llmperf\n", + " pip install -e . --quiet\n", + "fi" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1b95852-a69f-4c6a-bdcd-915bdaf6d46a", + "metadata": {}, + "outputs": [], + "source": [ + "!pip list| grep neuron" + ] + }, + { + "cell_type": "markdown", + "id": "7ef8c80e-20dd-4d40-8d38-9973f03e08eb", + "metadata": {}, + "source": [ + "#### 3.2 Install AWS Neuron Tools (If Needed)\n", + "\n", + "This cell installs the Neuron packages for profiling and other tooling. If already installed, the script checks and skips. For more information, see [Installing Neuron Tools](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/index.html).\n", + "\n", + "> **Note**: If you have your apt sources already configured and have installed the Neuron packages, you can skip this step.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2855f8f0-7451-454d-a25f-2ac3574334de", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "set -euxo pipefail\n", + "\n", + "# Check if aws-neuronx-tools is installed\n", + "if dpkg -s aws-neuronx-tools > /dev/null 2>&1; then\n", + " echo \"aws-neuronx-tools is already installed. Skipping.\"\n", + "else\n", + " echo \"Installing aws-neuronx-tools...\"\n", + " . 
/etc/os-release\n", + "\n", + " sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null < /dev/null 2>&1; then\n", + " echo \"InfluxDB2 is already installed, skipping.\"\n", + " if systemctl is-active --quiet influxdb; then\n", + " echo \"InfluxDB is already running.\"\n", + " else\n", + " sudo systemctl start influxdb\n", + " echo \"Setting up InfluxDB ...\"\n", + " # influx setup\n", + " fi\n", + "else\n", + " # Install InfluxDB\n", + " wget -q https://repos.influxdata.com/influxdata-archive_compat.key\n", + " echo '393e8779c89ac8d958f81f942f9ad7fb82a25e133faddaf92e15b16e6ac9ce4c influxdata-archive_compat.key' | sha256sum -c && \\\n", + " cat influxdata-archive_compat.key | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg > /dev/null\n", + " echo 'deb [signed-by=/etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg] https://repos.influxdata.com/debian stable main' | sudo tee /etc/apt/sources.list.d/influxdata.list\n", + " \n", + " sudo apt-get update && sudo apt-get install influxdb2 influxdb2-cli -y\n", + " sudo systemctl start influxdb\n", + " \n", + " # Run non-interactive influx setup with all necessary flags\n", + " # replace the following flags below with the necessary credentials\n", + " influx setup \\\n", + " --username admin \\\n", + " --password testpassowrd \\\n", + " --org yourorg \\\n", + " --bucket yourbucket \\\n", + " --token yoursupersecrettoken \\\n", + " --force\n", + "\n", + "fi" + ] + }, + { + "cell_type": "markdown", + "id": "16eb98a5-dcc6-4d82-8283-949c6b3bf266", + "metadata": {}, + "source": [ + "#### 3.4 Accuracy-benchmarking with lm_eval\n" + ] + }, + { + "cell_type": "markdown", + "id": "d614ddb2-5284-488c-b377-77598508e001", + "metadata": {}, + "source": [ + "Copy the [inference-benchmarking](https://github.com/aws-neuron/aws-neuron-samples/tree/master/inference-benchmarking/) directory to some location on your instance. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8702b601-0a7e-4d87-8004-fa7bb7e0389a", + "metadata": {}, + "outputs": [], + "source": [ + "! git clone https://github.com/aws-neuron/aws-neuron-samples.git" + ] + }, + { + "cell_type": "markdown", + "id": "12f5e818-6488-4270-975a-81845b3d3cd4", + "metadata": {}, + "source": [ + "Change directory to your copy of inference-benchmarking. Install the other required dependencies in the same Python env (e.g., aws_neuron_venv_pytorch if you followed the manual NxD Inference install) by running:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07dddce8-101c-4d2d-b01c-fb93c6a07b72", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd /home/ubuntu/aws-neuron-samples/inference-benchmarking/\n", + "pip install -r requirements.txt --quiet" + ] + }, + { + "cell_type": "markdown", + "id": "4abe3940-45bb-4bae-a12f-d7d6959a2927", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "0d1970fc", + "metadata": {}, + "source": [ + "## 4. Download or Provide Your Model\n", + "\n", + "Below is a template for downloading the model. You can skip or adjust if you already have a local model.\n", + "\n", + "For more information on model checkpoint usage, see the [NxDI inference with Hugging Face-based models](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html)." + ] + }, + { + "cell_type": "markdown", + "id": "ccff01a8-94f7-4d10-bdf7-71229ec19cb9", + "metadata": {}, + "source": [ + "You will need to log in to Hugging Face. Get your token from https://huggingface.co/settings/tokens and paste it into the login prompt shown by the cell below. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d7186278-332e-4e07-87b8-fdf1a30f3335", + "metadata": {}, + "outputs": [], + "source": [ + "!git config --global credential.helper store\n", + "from huggingface_hub import notebook_login\n", + "notebook_login()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "860b9863-fc7e-4752-ae55-8d4b89312d4b", + "metadata": {}, + "outputs": [], + "source": [ + "#run the following code in the terminal to install git-lfs" + ] + }, + { + "cell_type": "markdown", + "id": "2500bcdc-6f0a-4484-b64d-dc23ef94c445", + "metadata": {}, + "source": [ + "`sudo apt-get update`\n", + "\n", + "`sudo apt-get install git-lfs`\n", + "\n", + "`git lfs install`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01d2bdbf-1af1-4a20-a255-7642b8da704d", + "metadata": {}, + "outputs": [], + "source": [ + "#check that git lfs is installed on path" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c72f91e7-89ee-4b90-99f4-91239197ed72", + "metadata": {}, + "outputs": [], + "source": [ + "!git lfs version" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8774eaf-d975-49ab-a1dc-2f6062513aab", + "metadata": {}, + "outputs": [], + "source": [ + "#start a tmux session and run the following code in the terminal:" + ] + }, + { + "cell_type": "markdown", + "id": "5117214b-99fd-4ecf-92fa-e084ee4d95cc", + "metadata": {}, + "source": [ + "`sudo apt-get update`\n", + "\n", + "`sudo apt-get install tmux`\n", + "\n", + "`tmux new -s mysession`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bcae2706-cd59-4e38-9353-fe9d3b2283ff", + "metadata": {}, + "outputs": [], + "source": [ + "# run the following code to download the model in a tmux session since this may take a while - run in terminal" + ] + }, + { + "cell_type": "markdown", + "id": "fb753745-5fd9-4e17-984f-e1647be82dda", + "metadata": {}, + "source": [ + "`git clone 
https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "baae5abf-e4e8-4418-8747-d5bc05e7f3fe", + "metadata": {}, + "outputs": [], + "source": [ + "!du -sh /home/ubuntu/Mistral-Small-24B-Instruct-2501/ #check if the full model was copied in" + ] + }, + { + "cell_type": "markdown", + "id": "6f481bc0-8312-4d24-8dd3-ad834518d1dc", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "083589fe-ee9b-4908-bbe0-b66fff1703a9", + "metadata": {}, + "source": [ + "## 5. Compile and save the model and run generation with HuggingFaceGenerationAdapter: `inference_demo.py`\n", + "\n", + "NxD Inference supports running inference with the HuggingFace generate interface. To use HuggingFace-style generation, create a HuggingFaceGenerationAdapter that wraps a Neuron application model. Then, you can call generate on the adapted model. In the cell below, we use the `inference_demo` script that NXDI provides to compile, save, and run some prompts with our Mistral Small 24B model. For more information on the flags we set, refer to the [NxDI API reference guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/api-guides/api-guide.html)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a76a78e0-4ed2-4057-b702-d472d24f515f", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "# Replace this with the path where you downloaded and saved the model files.\n", + "# These should be the same paths used when compiling the model.\n", + "MODEL_PATH=\"/home/ubuntu/Mistral-Small-24B-Instruct-2501/\"\n", + "COMPILED_MODEL_PATH=\"/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/\"\n", + "TP_DEGREE=32\n", + "\n", + "inference_demo \\\n", + " --model-type llama \\\n", + " --task-type causal-lm \\\n", + " run \\\n", + " --model-path $MODEL_PATH \\\n", + " --compiled-model-path $COMPILED_MODEL_PATH \\\n", + " --torch-dtype bfloat16 \\\n", + " --start_rank_id 0 \\\n", + " --tp-degree $TP_DEGREE \\\n", + " --batch-size 1 \\\n", + " --max-context-length 12288 \\\n", + " --seq-len 12800 \\\n", + " --on-device-sampling \\\n", + " --top-k 1 \\\n", + " --do-sample \\\n", + " --fused-qkv \\\n", + " --sequence-parallel-enabled \\\n", + " --pad-token-id 2 \\\n", + " --enable-bucketing \\\n", + " --context-encoding-buckets 2048 4096 8192 12288 \\\n", + " --token-generation-buckets 2048 4096 8192 12800 \\\n", + " --prompt \"What is annapurna labs?\" 2>&1 | tee log" + ] + }, + { + "cell_type": "markdown", + "id": "7df76059-45ba-4726-9bcc-7a545a6d1575", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "9a8ba793-3901-4620-b4b0-5150325a005a", + "metadata": {}, + "source": [ + "## 6. vLLM demo and perf benchmarking - standalone model " + ] + }, + { + "cell_type": "markdown", + "id": "cc9425cb-9354-4f0a-8375-900961ff54af", + "metadata": {}, + "source": [ + "#### 6.1 Run Mistral Small 2501 on Trainium" + ] + }, + { + "cell_type": "markdown", + "id": "1d58b2a2-8475-4599-8750-24253c3c9c29", + "metadata": {}, + "source": [ + "Here is an example for running online inference with Mistral Small 2501 and let's get some perf results. 
We will first compile and run generation on a sample prompt using a command installed by neuronx-distributed-inference. The script compiles the model and runs generation on the given input prompt. Note the path we used to save the compiled model. This path should be used when launching the vLLM server for inference so that the compiled model can be loaded without recompilation. Please refer to the [NxD Inference API Reference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/api-guides/api-guide.html) and the [vLLM user guide for NxDI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html) for more information on these `inference_demo` flags." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "71c866cb-9ace-4fae-a940-8776c4e41ad1", + "metadata": {}, + "outputs": [], + "source": [ + "!pip list | grep neuron" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8fd75bb0-6fb8-4f89-8203-73b8247fc879", + "metadata": {}, + "outputs": [], + "source": [ + "# RUN THE FOLLOWING CELL IN A TERMINAL - spin up the vllm server" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ca565569-45f9-4c0c-b81d-e0c7eaa98249", + "metadata": {}, + "outputs": [], + "source": [ + "# These should be the same paths used when compiling the model. 
- command for terminal\n", + "MODEL_PATH=\"/home/ubuntu/Mistral-Small-24B-Instruct-2501/\"\n", + "COMPILED_MODEL_PATH=\"/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/\"\n", + "\n", + "export VLLM_NEURON_FRAMEWORK=\"neuronx-distributed-inference\"\n", + "export NEURON_COMPILED_ARTIFACTS=$COMPILED_MODEL_PATH\n", + "VLLM_RPC_TIMEOUT=100000 python -m vllm.entrypoints.openai.api_server \\\n", + " --model $MODEL_PATH \\\n", + " --max-num-seqs 1 \\\n", + " --max-model-len 12800 \\\n", + " --tensor-parallel-size 32 \\\n", + " --device neuron \\\n", + " --use-v2-block-manager \\\n", + " --port 8000 &\n", + "PID=$!\n", + "echo \"vLLM server started with PID $PID\"" + ] + }, + { + "cell_type": "markdown", + "id": "2d24058f-67a6-4f94-9a39-0ce4ab64d801", + "metadata": {}, + "source": [ + "Let's send a quick request with a python client to the server:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "124db78f-f910-4bcd-b631-4a2fb4166a54", + "metadata": {}, + "outputs": [], + "source": [ + "from openai import OpenAI\n", + "\n", + "# Client Setup\n", + "openai_api_key = \"EMPTY\"\n", + "openai_api_base = \"http://localhost:8000/v1\"\n", + "\n", + "client = OpenAI(\n", + " api_key=openai_api_key,\n", + " base_url=openai_api_base,\n", + ")\n", + "\n", + "models = client.models.list()\n", + "model_name = models.data[0].id\n", + "\n", + "# Sampling Parameters\n", + "max_tokens = 1024\n", + "temperature = 1.0\n", + "top_p = 1.0\n", + "top_k = 50\n", + "stream = False\n", + "\n", + "# Chat Completion Request\n", + "response = client.chat.completions.create(\n", + " model=model_name,\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": \"You are a helpful AI assistant.\"},\n", + " {\"role\": \"user\", \"content\": \"What is AWS Neuron?\"}\n", + " ],\n", + ")\n", + "\n", + "# Parse the response\n", + "generated_text = \"\"\n", + "generated_text = response.choices[0].message.content\n", + "\n", + "print(generated_text)" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "id": "6d9c4bad-6cb2-43d6-86db-d2011f1bdaf3", + "metadata": {}, + "outputs": [], + "source": [ + "!neuron-ls # show running processes - vllm server is still running" + ] + }, + { + "cell_type": "markdown", + "id": "8d8f6afb-6f64-4387-ad65-a7fe705d9fe9", + "metadata": {}, + "source": [ + "----" + ] + }, + { + "cell_type": "markdown", + "id": "053e81d2-e2cf-4ef7-9a55-1fb495bc416f", + "metadata": {}, + "source": [ + "#### 6.2 llmperf- let's run some quick benchmarks " + ] + }, + { + "cell_type": "markdown", + "id": "c45478cd-64a5-4a9a-8622-294848e2f852", + "metadata": {}, + "source": [ + "After the above steps, the vllm server should be running. You can now measure the performance using LLMPerf. Before we can use the llmperf package, we need to make a few changes to its code. Follow benchmarking with LLMPerf guide to apply the code changes.\n", + "\n", + "Below is a sample shell script to run LLMPerf. To provide the model with 10000 tokens as input and generate 1500 tokens as output on average, we use the following parameters from LLMPerf:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aa9ab418-c069-40cb-8041-5c0574c3377f", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd /home/ubuntu/llmperf/\n", + "\n", + "MODEL_PATH=\"/home/ubuntu/Mistral-Small-24B-Instruct-2501/\"\n", + "COMPILED_MODEL_PATH=\"/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/\"\n", + "OUTPUT_PATH=llmperf-results-sonnets\n", + "\n", + "export OPENAI_API_BASE=\"http://localhost:8000/v1\"\n", + "export OPENAI_API_KEY=\"mock_key\"\n", + "\n", + "python token_benchmark_ray.py \\\n", + " --model $MODEL_PATH \\\n", + " --mean-input-tokens 10000 \\\n", + " --stddev-input-tokens 0 \\\n", + " --mean-output-tokens 1500 \\\n", + " --stddev-output-tokens 0 \\\n", + " --num-concurrent-requests 1\\\n", + " --timeout 3600 \\\n", + " --max-num-completed-requests 50 \\\n", + " --additional-sampling-params '{}' 
\\\n", + " --results-dir $OUTPUT_PATH \\\n", + " --llm-api \"openai\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c38d5e8d-0c16-4061-b22f-c558ba1d5050", + "metadata": {}, + "outputs": [], + "source": [ + "!sudo kill 55509 # stop the server; replace 55509 with the PID printed when the vLLM server started" + ] + }, + { + "cell_type": "markdown", + "id": "dc00332e-d7e0-4607-b91e-e08503a2319a", + "metadata": {}, + "source": [ + "Summarized results:" + ] + }, + { + "cell_type": "markdown", + "id": "3943d616-6474-40a7-b9cd-61faac231c15", + "metadata": {}, + "source": [ + "| Scenario | TTFT (p50 ms) | TPOT (p50 ms) | Output-token Throughput (tokens/s, p50) |\n", + "|---------------------------------------------------------------------------|---------------|---------------|-----------------------------------------|\n", + "| Mistral-Small-24B-Instruct-2501 on Trainium (OpenAI-style API) | 347 | 10.55 | 107.35 |\n" + ] + }, + { + "cell_type": "markdown", + "id": "2c99fb0e-5712-43f1-be25-2892342bbce4", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "f6ec3558-b527-44a2-b107-72a6c89d4722", + "metadata": {}, + "source": [ + "#### 6.3 Running Evaluations\n", + "\n", + "There are two ways to use the evaluation scripts to run your evaluation. For more information, check out the [inference-benchmarking](https://github.com/aws-neuron/aws-neuron-samples/tree/master/inference-benchmarking/) directory and the [tutorials](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn1-llama3.1-70b-instruct-accuracy-eval-tutorial.html) in NXDI.\n", + "\n", + "1. Using a `yaml` configuration file and `accuracy.py` script\n", + "\n", + "2. 
Writing your own Python script that uses several components provided in `accuracy.py` and `server_config.py`\n", + "\n", + "In this notebook we only demonstrate running an eval with the `yaml` config file.\n", + "\n", + "With this method, all you need to do is create a yaml config file that specifies the server configuration and the testing scenario you want to run. Create `mistral_config.yaml` with the following content." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2c81c85a-6e50-47f5-9baa-f22e465f7297", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile mistral_config.yaml\n", + "\n", + "server:\n", + " name: \"Mistral-Small-24B-Instruct\"\n", + " model_path: \"/home/ubuntu/Mistral-Small-24B-Instruct-2501/\"\n", + " model_s3_path: null\n", + " compiled_model_path: \"/home/ubuntu/traced_model/Mistral-Small-24B-Instruct-2501/\"\n", + " max_seq_len: 12800\n", + " context_encoding_len: 12288\n", + " tp_degree: 32\n", + " n_vllm_threads: 32\n", + " server_port: 8888\n", + " continuous_batch_size: 1\n", + "\n", + "test:\n", + " accuracy:\n", + " mytest:\n", + " client: \"lm_eval\"\n", + " datasets: [\"gsm8k_cot\"]\n", + " max_concurrent_requests: 1\n", + " timeout: 3600\n", + " client_params:\n", + " limit: 200\n", + " use_chat: True" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b4c171e0-a6ef-4782-bb67-2927d393e4b7", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "if test -f \"/home/ubuntu/aws-neuron-samples/inference-benchmarking/mistral_config.yaml\"; then\n", + " echo \"config file exists.\"\n", + "else \n", + " echo \"Copying config file.\"\n", + " mv /home/ubuntu/mistral_config.yaml /home/ubuntu/aws-neuron-samples/inference-benchmarking/\n", + "fi" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1efea9af-ed19-476e-9841-764ea6f1b101", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd /home/ubuntu/aws-neuron-samples/inference-benchmarking/\n", + "python 
accuracy.py --config mistral_config.yaml" + ] + }, + { + "cell_type": "markdown", + "id": "6ee55692-a5f9-41f5-a955-ce36c52f5b96", + "metadata": {}, + "source": [ + "Results Summary:\n", + "\n", + "Accuracy_mytest_gsm8k_cot:\n", + " Saved at results/accuracy/mytest/gsm8k_cot/__home__ubuntu__Mistral-Small-24B-Instruct-2501__/results_2025-04-26T20-02-47.843052.json:\n", + " \n", + " Metrics: {'gsm8k_cot': {'AccuracyExactMatchStrictMatch': 39.5, 'AccuracyExactMatchStrictMatchStderr': 3.46537, 'AccuracyExactMatchFlexibleExtract': 78.5, 'AccuracyExactMatchFlexibleExtractStderr': 2.91224}}" + ] + }, + { + "cell_type": "markdown", + "id": "c0b0f6af-8bb3-4e62-b98e-a6bb779f85a4", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "b7b7fa83-aa5c-44c1-986b-c472a1243ed4", + "metadata": {}, + "source": [ + "#### 6.4 Profiling with `neuron-profile`" + ] + }, + { + "cell_type": "markdown", + "id": "a3376804-9c21-4e68-ba32-e299ed75fb9d", + "metadata": {}, + "source": [ + "`neuron-profile` helps developers identify performance bottlenecks and optimize their workloads for NeuronDevices. It provides insights into NeuronDevice activity, including the instructions executed on each compute engine (e.g., Tensor engine, Vector engine), DMA data movement activity, and performance metrics such as engine utilization, DMA throughput, and memory usage. NeuronDevice activity is collected by the `neuron-profile capture` command, which runs the model with tracing enabled. Profiling typically has near-zero overhead because NeuronDevices have dedicated on-chip profiling hardware." + ] + }, + { + "cell_type": "markdown", + "id": "c2be3b7c-a7e0-4295-b306-e31bbbac8da9", + "metadata": {}, + "source": [ + "Let's `cd` into `/tmp/nxd_model`, the compiler working directory. It contains the `context_encoding` and `token_generation` directories, which hold the NEFFs for the context encoding and token generation buckets that we configured.
The `neuron-profile` tool can both capture and post-process profiling information. It takes a compiled model (a NEFF), executes it, and saves the profile results to an NTFF file (`profile.ntff` by default)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d6ce0b5-f029-41a3-9971-a087341584fc", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash \n", + "cd /tmp/nxd_model/\n", + "ls #list directories\n", + "cd context_encoding_model\n", + "ls " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea3c44a0-7c75-46f8-adb6-e413344e51bf", + "metadata": {}, + "outputs": [], + "source": [ + "#_tp0_bk0 _tp0_bk1 _tp0_bk2 _tp0_bk3 - are the context encoding buckets" + ] + }, + { + "cell_type": "markdown", + "id": "f9398648-ae6c-4069-baee-338e1205a1e0", + "metadata": {}, + "source": [ + "##### Capturing profiles for multi-worker jobs\n", + "`neuron-profile` can capture profiles for collectives-enabled NEFFs running across multiple NeuronCores, NeuronDevices, or even nodes. This is useful for understanding performance and communication overheads when deploying larger distributed models.\n", + "\n", + "The following example performs a distributed run across all NeuronDevices and NeuronCores on our trn1.32xlarge instance, capturing profiles for all 32 workers (one for each NeuronCore)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c746ce34-fc4a-48bb-9558-c1babbcd7c91", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "# 1. Make sure the directory exists and is writable\n", + "mkdir -p /tmp/output/ \n", + "\n", + "cd /tmp/nxd_model/context_encoding_model/_tp0_bk1/\n", + "# 2.
Run the capture, pointing -s at a file in that directory\n", + "neuron-profile capture \\\n", + " -n graph.neff \\\n", + " --collectives-workers-per-node 32 \\\n", + " -s /tmp/output/profile.ntff \n" + ] + }, + { + "cell_type": "markdown", + "id": "1c5250d4-f3a7-494b-9c42-d14fbbb07a26", + "metadata": {}, + "source": [ + "If we now check the output directory, we see that a profile is saved for each worker." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b40fe3a-f01f-4f62-96f5-2df5db0d05dd", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash \n", + "cd /tmp/output/\n", + "ls" + ] + }, + { + "cell_type": "markdown", + "id": "ee5fbaba-ecfb-4f5e-8af8-31bd6df05162", + "metadata": {}, + "source": [ + "##### Viewing profiles for multi-worker jobs\n", + "Profiles from multi-worker jobs (i.e. more than one NeuronCore) can either be viewed individually or in a combined collectives view. Since profile data is often similar between workers and processing profile data for all workers can be time-consuming, it is recommended to first explore the profile for a single worker or small subset of workers. Viewing the profile for a specific worker is the same as for single-worker profiles." + ] + }, + { + "cell_type": "markdown", + "id": "a5b39974-8a7a-456b-b035-c841adecb8d1", + "metadata": {}, + "source": [ + "At the beginning, we forwarded ports 3001 and 8086. Because `neuron-profile view` runs on a remote instance, we need port forwarding to access the profiles from a local browser.\n" + ] + }, + { + "cell_type": "markdown", + "id": "898c5311-fc27-4dc3-b9e1-ec3da5654089", + "metadata": {}, + "source": [ + "To view the profile for a specific worker, run the following."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "504f6ec8-ee60-49b2-912f-b2a9be8c1f03", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd /tmp/nxd_model/context_encoding_model/_tp0_bk1/\n", + "neuron-profile view -n graph.neff -s /tmp/output/profile_rank_2.ntff" + ] + }, + { + "cell_type": "markdown", + "id": "e3308394-08a3-413d-a38b-d3266ae2c58f", + "metadata": {}, + "source": [ + "You will see output like: `View profile at http://localhost:3001/profile/n_a1143c514431fb4c23b7aae9208fd1a89cad42f6`" + ] + }, + { + "cell_type": "markdown", + "id": "f756c277-bc7b-462a-91ed-eaaf3fa6dd8d", + "metadata": {}, + "source": [ + "![image-profile](imgs/img-neff.png)" + ] + }, + { + "cell_type": "markdown", + "id": "3e323e2f-3980-4ac7-8c32-ed395c937f0e", + "metadata": {}, + "source": [ + "To view the profile for multiple workers, pass the directory containing all worker profiles to `neuron-profile`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b082bd6-ed2c-4d70-b092-1e97fe9df1ce", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd /tmp/nxd_model/context_encoding_model/_tp0_bk1/\n", + "neuron-profile view -n graph.neff -d /tmp/output" + ] + }, + { + "cell_type": "markdown", + "id": "eb861db0-99e5-42ce-b9c9-bb7cc185af6a", + "metadata": {}, + "source": [ + "For more on profiling with Neuron and understanding profiles, check out the [`neuron-profile` user guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profile-user-guide.html)." + ] + }, + { + "cell_type": "markdown", + "id": "40aca927-7701-4f4e-9320-cc2faf9d05bd", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "f071be6d-3a5c-4fa0-b02f-4cf3802ff7c2", + "metadata": {}, + "source": [ + "## Conclusion" + ] + }, + { + "cell_type": "markdown", + "id": "e70149be-ba89-4320-a17c-082ba6a72b6d", + "metadata": {}, + "source": [ + "In this notebook, we successfully
walked through deploying, benchmarking, and generating profiles for NEFFs on TRN1 using Mistral Small 2501." + ] + }, + { + "cell_type": "markdown", + "id": "a281311e-0c58-41c1-9f61-93735c768f1f", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "d98b2877-bb28-4762-8ef2-9165ef505046", + "metadata": {}, + "source": [ + "#### Distributors\n", + "- AWS\n", + "- Mistral" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "conda_python3", + "language": "python", + "name": "conda_python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.16" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/README.md b/README.md index a27e950..43ba121 100644 --- a/README.md +++ b/README.md @@ -13,14 +13,15 @@ README # Project Summaries |-- README # Setup for infrastructure (VPC, EKS cluster etc) |-- 0_setup_vpc/ # CloudFormation templates for reference VPC |-- 1_setup_cluster/ # Scripts to create your cluster using EKS -2.project/ +2.projects/ |-- nims-inference/ |-- trtllm-inference/ |-- sglang-inference/ |-- ray-service/ |-- multinode-triton-trtllm-inference/ |-- mixture-of-agents/ -|-- mig/ +|-- neuronx-distributed/ +|-- mig-gpu-partitioning/ 3.use-cases/ |-- nims-inference/ `-- ... @@ -64,6 +65,14 @@ Recent advances in large language models (LLMs) have shown substantial capabilit These examples shows how to deploy LLMs like T5, Mistral using NVIDIA Triton TRT-LLM on Amazon SageMaker. See [triton-trtllm-sagemaker](2.projects/triton-trtllm-sagemaker) for more information. +### NEURONX-DISTRIBUTED-INFERENCE EKS + +This project demonstrates deploying Large Language Models using **NeuronX Distributed Inference (NxDI)** with **vLLM-Neuron** on Amazon EKS with AWS Trainium instances (trn1.32xlarge).
The solution showcases advanced inference optimization techniques such as **fused draft speculative decoding**, along with comprehensive performance monitoring of NeuronCore utilization, inference latency, and throughput. See [neuronx-distributed/nxd-inference-eks](2.projects/neuronx-distributed/nxd-inference-eks) for more information. + +### NEURONX DISTRIBUTED INFERENCE - Profiling and Benchmarking Guide on EC2 + +This example walks through deploying, profiling, and benchmarking LLMs on EC2 with NxDI. See [neuronx-distributed/nxd-inference-ec2](2.projects/neuronx-distributed/nxdi-ec2-vllm) for more information. + ### MIG This directory contains a README on how you can leverage Multi-Instance GPUs (MIGs) to partition your GPUs based on your workload. For more detailed documentation, check out the [MIG user guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/).