End-to-end ML training and serving platform on AWS EKS using Terraform, Kubernetes, and HuggingFace models.
```
┌───────────────────────────────────────────────────────────────────┐
│                              AWS VPC                              │
│                                                                   │
│   Public Subnets               Private Subnets                    │
│  ┌─────────────┐              ┌──────────────────────────┐        │
│  │ NAT Gateway │<──────────── │  EKS Node Group (SPOT)   │        │
│  │ (1x cost    │              │                          │        │
│  │  optimized) │              │  ┌────────────────────┐  │        │
│  └─────────────┘              │  │  aiforge-api Pod   │  │        │
│                               │  │  (FastAPI + K8s    │  │        │
│  ┌──────────────┐             │  │   client)          │  │        │
│  │   AWS LB     │ ──────────> │  └────────────────────┘  │        │
│  │  (port 80)   │             │                          │        │
│  └──────────────┘             │  ┌────────────────────┐  │        │
│        │                      │  │ Training Job Pods  │  │        │
│  ┌──────────────┐             │  │ (spawned on-demand)│  │        │
│  │  Internet    │             │  └────────────────────┘  │        │
│  │  Gateway     │             └──────────────────────────┘        │
│  └──────────────┘                                                 │
│        │                                                          │
│    UI / curl                                                      │
└───────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│  S3 Bucket (model-artifacts)                               │
│    models/<model_name>/<run_id>/checkpoint/   (weights)    │
│    models/<model_name>/<run_id>/metadata.json              │
└────────────────────────────────────────────────────────────┘

┌──────────┐   ┌──────────────────┐
│   ECR    │   │    IAM / IRSA    │
│ training │   │ (least-priv S3)  │
│ serving  │   └──────────────────┘
└──────────┘
```
Component interaction:
- User hits the UI or calls `POST /train` on the LoadBalancer endpoint
- The API pod creates a Kubernetes `Job` in the same namespace
- The Job pod downloads the HuggingFace model, runs mock training, and uploads artifacts + `metadata.json` to S3
- User calls `POST /generate/text` or `POST /generate/image`; the API pod loads the model from HuggingFace (production: from S3) and returns the result
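The last two steps hinge on `metadata.json` marking a completed run in S3. A minimal sketch of that discovery step, assuming keys follow the `models/<model_name>/<run_id>/metadata.json` layout shown above; `parse_run_keys` and the sample keys are illustrative, not the actual API code:

```python
def parse_run_keys(keys):
    """Group S3 object keys of the form
    models/<model_name>/<run_id>/metadata.json into completed runs."""
    runs = []
    for key in keys:
        parts = key.split("/")
        # Only a metadata.json object marks a run as complete;
        # checkpoint objects under the same prefix are skipped.
        if len(parts) == 4 and parts[0] == "models" and parts[3] == "metadata.json":
            runs.append({"model_name": parts[1], "run_id": parts[2]})
    return runs

if __name__ == "__main__":
    keys = [
        "models/smol_llama/run-123/metadata.json",
        "models/smol_llama/run-123/checkpoint/weights.bin",  # not a run marker
    ]
    print(parse_run_keys(keys))
```

In the API pod the `keys` list would come from an S3 `list_objects_v2` call scoped to the `models/` prefix.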
```
ai-forge/
├── .github/           GitHub Actions workflow (CI/CD)
│   └── workflows/
│       └── deploy.yaml
├── iac/               Terraform (VPC, EKS, S3, ECR, IAM)
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   ├── vpc.tf
│   ├── eks.tf
│   ├── s3.tf
│   ├── ecr.tf
│   └── iam.tf
├── training_logic/    Mock training container
│   ├── Dockerfile
│   ├── train.py
│   └── requirements.txt
├── backend_api/       FastAPI serving + trigger API
│   ├── Dockerfile
│   ├── main.py
│   ├── models.py
│   └── requirements.txt
├── k8s/               Kubernetes manifests
│   ├── api-deployment.yaml
│   ├── configmap.yaml
│   └── rbac.yaml
├── ui/                Single-file web UI
│   └── index.html
├── .gitignore
├── Makefile
└── README.md
```
| Model | Type | HuggingFace ID |
|---|---|---|
| Smol LLAMA | Text generation | BEE-spoke-data/smol_llama-101M-GQA |
| Sana Sprint | Image generation | Efficient-Large-Model/Sana_Sprint_1.6B_1024px |
| Tool | Version |
|---|---|
| Terraform | >= 1.3.0 |
| AWS provider (Terraform) | ~> 5.0 |
| kubectl | >= 1.28 |
| Docker | >= 24.x |
| Python | >= 3.11 |
```shell
aws configure
# or: export AWS_PROFILE=my-profile
```

```shell
cd iac/
terraform init
terraform plan -var="aws_region=us-west-2"
terraform apply -var="aws_region=us-west-2" -auto-approve
```

```shell
$(terraform output -raw configure_kubectl)
kubectl get nodes   # should show 2 SPOT nodes as Ready
```

CI/CD is handled by GitHub Actions, which automates the container image lifecycle from build to deployment.
Use the following command to build the images, push them to ECR, and deploy:

```shell
make deploy
```

Wait for the pod and LoadBalancer:
```shell
kubectl rollout status deployment/aiforge-api
export API_URL=$(kubectl get svc aiforge-api -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "API available at: http://$API_URL"
```

Health check:

```shell
make health-check
```

List models:

```shell
curl http://$API_URL/models | python3 -m json.tool
```

Trigger a training run:

```shell
curl -X POST http://$API_URL/train \
  -H "Content-Type: application/json" \
  -d '{"model_name": "BEE-spoke-data/smol_llama-101M-GQA", "epochs": 3}'
```

Watch the training job:

```shell
make get-job-logs
```

Watch the API logs:

```shell
make get-logs
```

Generate text:
```shell
curl -X POST http://$API_URL/generate/text \
  -H "Content-Type: application/json" \
  -d '{"model_name": "BEE-spoke-data/smol_llama-101M-GQA", "prompt": "Once upon a time", "max_new_tokens": 80}'
```

Generate an image:

```shell
curl -X POST http://$API_URL/generate/image \
  -H "Content-Type: application/json" \
  -d '{"model_name": "Efficient-Large-Model/Sana_Sprint_1.6B_1024px", "prompt": "A sunset over mountains"}' \
  | python3 -c "import sys,json,base64; d=json.load(sys.stdin); open('out.png','wb').write(base64.b64decode(d['result']))"
open out.png
```

Run locally:

```shell
make serve
```

Tear down:

```shell
# Delete K8s resources first to release the LoadBalancer EIP
kubectl delete -f k8s/

# Destroy all Terraform-managed AWS resources
cd iac/
terraform destroy -auto-approve
```

Troubleshooting:

```shell
# Inspect deployment status and events
kubectl describe deployment aiforge-api

# Check pod status and logs
kubectl get pods -l app=aiforge-api
kubectl describe pod <pod-name>
kubectl logs <pod-name>

# Check recent events
kubectl get events
```

| Resource | Unit Details | Count | Est. $/hr | Est. $/mo |
|---|---|---|---|---|
| EKS Control Plane | Cluster management fee | 1 | $0.10 | $73.00 |
| EKS Worker Nodes | m6i.2xlarge Spot Instances | 2 (Desired) | ~$0.244 | ~$178.12 |
| NAT Gateway | Connectivity for private subnets | 1 | $0.045 | $32.85 |
| ECR Storage | Training & Serving Repos (5 image limit) | 2 | < $0.01 | ~$1.00 |
| S3 Storage | Model Artifacts bucket | 1 | < $0.01 | ~$1.00 |
| VPC & Data | Public IP and data transfer | β | ~$0.01 | ~$7.30 |
| Total | | | ~$0.41 | ~$293.27 |
A simple VPC with 2 public + 2 private subnets across AZs was chosen. EKS nodes live in private subnets and reach the internet (for HuggingFace downloads) through a single NAT Gateway β cost-optimized but not HA. The LoadBalancer is provisioned in public subnets by the AWS LB controller.
Instead of embedding AWS credentials in the pod environment, the API pod assumes an IAM role via OIDC federation. This follows the least-privilege principle: the role grants only s3:Get/Put/List/Delete on the specific artifact bucket.
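The grant can be sketched as the policy document attached to the IRSA role; the bucket name and statement ID below are placeholders, not the values from `iam.tf`:

```python
import json

def artifact_bucket_policy(bucket: str) -> dict:
    """Build a least-privilege S3 policy scoped to a single artifact
    bucket: only Get/Put/List/Delete, only on that bucket's ARNs."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "ArtifactBucketAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:DeleteObject",
            ],
            "Resource": [
                f"arn:aws:s3:::{bucket}",      # bucket itself (ListBucket)
                f"arn:aws:s3:::{bucket}/*",    # objects (Get/Put/Delete)
            ],
        }],
    }

if __name__ == "__main__":
    print(json.dumps(artifact_bucket_policy("model-artifacts"), indent=2))
```

Inside the pod, `boto3` picks up the role's temporary credentials automatically via the projected service-account token, so no keys ever appear in the pod spec.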
The training Job downloads the real model weights, simulates epoch iterations with decaying loss metrics, then uploads the checkpoint + metadata.json to S3. This makes the artifact lifecycle realistic without requiring expensive GPU fine-tuning. The API uses the metadata.json to discover completed runs.
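A minimal sketch of that loop, assuming an exponentially decaying loss and metadata fields matching the S3 layout above; `mock_train` and its field names are illustrative, not the actual `train.py`:

```python
import json
import time

def mock_train(model_name: str, run_id: str, epochs: int) -> dict:
    """Simulate epoch iterations with a decaying loss and build the
    metadata.json document uploaded alongside the checkpoint."""
    losses = []
    loss = 2.5
    for _ in range(epochs):
        loss *= 0.6                # decay the loss each epoch
        losses.append(round(loss, 4))
    return {
        "model_name": model_name,
        "run_id": run_id,
        "epochs": epochs,
        "final_loss": losses[-1],
        "loss_history": losses,
        "artifact_prefix": f"models/{model_name}/{run_id}/checkpoint/",
        "completed_at": int(time.time()),
    }

if __name__ == "__main__":
    meta = mock_train("smol_llama", "run-abc123", epochs=3)
    # In the real Job this document would be uploaded with boto3, e.g.
    # s3.put_object(Bucket=..., Key=".../metadata.json", Body=json.dumps(meta))
    print(json.dumps(meta, indent=2))
```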
Training is implemented as a Kubernetes Job (not a Deployment) because training is a one-shot, finite workload. The API dynamically creates Jobs with a unique run_id per request. ttlSecondsAfterFinished: 3600 auto-cleans completed Job objects.
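The Job-per-request pattern can be sketched as a manifest builder; in the API pod this dict would be submitted with the official `kubernetes` client's `BatchV1Api.create_namespaced_job`, but the image name, labels, and env vars below are placeholders, not the actual `main.py`:

```python
import uuid

def build_training_job(model_name: str, epochs: int,
                       image: str = "training:latest") -> dict:
    """Build a one-shot Kubernetes Job manifest with a unique run_id
    and auto-cleanup one hour after completion."""
    run_id = f"run-{uuid.uuid4().hex[:8]}"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"training-{run_id}"},
        "spec": {
            "ttlSecondsAfterFinished": 3600,   # auto-clean finished Jobs
            "backoffLimit": 0,                 # a failed run is not retried
            "template": {
                "spec": {
                    "restartPolicy": "Never",  # finite workload, no restarts
                    "containers": [{
                        "name": "train",
                        "image": image,
                        "env": [
                            {"name": "MODEL_NAME", "value": model_name},
                            {"name": "RUN_ID", "value": run_id},
                            {"name": "EPOCHS", "value": str(epochs)},
                        ],
                    }],
                }
            },
        },
    }
```

`restartPolicy: Never` plus `backoffLimit: 0` makes failure visible in the Job status instead of silently retrying a broken training image.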