End-to-end ML training and serving platform on AWS EKS using Terraform, Kubernetes, and HuggingFace models.
```
┌───────────────────────────────────────────────────────────────────┐
│                              AWS VPC                              │
│                                                                   │
│   Public Subnets               Private Subnets                    │
│  ┌─────────────┐              ┌──────────────────────────┐        │
│  │ NAT Gateway │<──────────── │  EKS Node Group (SPOT)   │        │
│  │ (1x cost    │              │                          │        │
│  │  optimized) │              │  ┌────────────────────┐  │        │
│  └─────────────┘              │  │  aiforge-api Pod   │  │        │
│                               │  │  (FastAPI + K8s    │  │        │
│  ┌──────────────┐             │  │   client)          │  │        │
│  │   AWS LB     │ ──────────> │  └────────────────────┘  │        │
│  │  (port 80)   │             │                          │        │
│  └──────────────┘             │  ┌────────────────────┐  │        │
│        │                      │  │ Training Job Pods  │  │        │
│  ┌──────────────┐             │  │ (spawned on-demand)│  │        │
│  │  Internet    │             │  └────────────────────┘  │        │
│  │  Gateway     │             └──────────────────────────┘        │
│  └──────────────┘                                                 │
│        │                                                          │
│    UI / curl                                                      │
└───────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│  S3 Bucket (model-artifacts)                               │
│    models/<model_name>/<run_id>/checkpoint/   (weights)    │
│    models/<model_name>/<run_id>/metadata.json              │
└────────────────────────────────────────────────────────────┘

┌──────────┐   ┌──────────────────┐
│   ECR    │   │    IAM / IRSA    │
│ training │   │ (least-priv S3)  │
│ serving  │   └──────────────────┘
└──────────┘
```
Component interaction:
- User hits the UI or calls `POST /train` on the LoadBalancer endpoint
- The API pod creates a Kubernetes `Job` in the same namespace
- The Job pod downloads the HuggingFace model, runs mock training, and uploads artifacts + `metadata.json` to S3
- User calls `POST /generate/text` or `POST /generate/image`; the API pod loads the model from HuggingFace (production: from S3) and returns the result
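The last two steps hinge on `metadata.json` marking a completed run in S3. A minimal sketch of that discovery step, assuming keys follow the `models/<model_name>/<run_id>/metadata.json` layout shown above; `parse_run_keys` and the sample keys are illustrative, not the actual API code:

```python
def parse_run_keys(keys):
    """Group S3 object keys of the form
    models/<model_name>/<run_id>/metadata.json into completed runs."""
    runs = []
    for key in keys:
        parts = key.split("/")
        # Only a metadata.json object marks a run as complete;
        # checkpoint objects under the same prefix are skipped.
        if len(parts) == 4 and parts[0] == "models" and parts[3] == "metadata.json":
            runs.append({"model_name": parts[1], "run_id": parts[2]})
    return runs

if __name__ == "__main__":
    keys = [
        "models/smol_llama/run-123/metadata.json",
        "models/smol_llama/run-123/checkpoint/weights.bin",  # not a run marker
    ]
    print(parse_run_keys(keys))
```

In the API pod the `keys` list would come from an S3 `list_objects_v2` call scoped to the `models/` prefix.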
```
ai-forge/
├── .github/           GitHub Actions workflow (CI/CD)
│   └── workflows/
│       └── deploy.yaml
├── iac/               Terraform (VPC, EKS, S3, ECR, IAM)
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   ├── vpc.tf
│   ├── eks.tf
│   ├── s3.tf
│   ├── ecr.tf
│   └── iam.tf
├── training_logic/    Mock training container
│   ├── Dockerfile
│   ├── train.py
│   └── requirements.txt
├── backend_api/       FastAPI serving + trigger API
│   ├── Dockerfile
│   ├── main.py
│   ├── models.py
│   └── requirements.txt
├── k8s/               Kubernetes manifests
│   ├── api-deployment.yaml
│   ├── configmap.yaml
│   └── rbac.yaml
├── ui/                Single-file web UI
│   └── index.html
├── .gitignore
├── Makefile
└── README.md
```
| Model | Type | HuggingFace ID |
|---|---|---|
| Smol LLAMA | Text generation | BEE-spoke-data/smol_llama-101M-GQA |
| Sana Sprint | Image generation | Efficient-Large-Model/Sana_Sprint_1.6B_1024px |
| Tool | Version |
|---|---|
| Terraform | >= 1.3.0 |
| AWS provider (Terraform) | ~> 5.0 |
| kubectl | >= 1.28 |
| Docker | >= 24.x |
| Python | >= 3.11 |
```shell
aws configure
# or: export AWS_PROFILE=my-profile
```

```shell
cd iac/
terraform init
terraform plan -var="aws_region=us-west-2"
terraform apply -var="aws_region=us-west-2" -auto-approve
```

```shell
$(terraform output -raw configure_kubectl)
kubectl get nodes   # should show 2 SPOT nodes as Ready
```

CI/CD is handled by GitHub Actions, which automates the container image lifecycle from build to deployment.
Use the following command to build the images, push them to ECR, and deploy:

```shell
make deploy
```

Wait for the pod and LoadBalancer:
```shell
kubectl rollout status deployment/aiforge-api
export API_URL=$(kubectl get svc aiforge-api -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "API available at: http://$API_URL"
```

Health check:

```shell
make health-check
```

List models:

```shell
curl http://$API_URL/models | python3 -m json.tool
```

Trigger a training run:

```shell
curl -X POST http://$API_URL/train \
  -H "Content-Type: application/json" \
  -d '{"model_name": "BEE-spoke-data/smol_llama-101M-GQA", "epochs": 3}'
```

Watch the training job:

```shell
make get-job-logs
```

Watch the API logs:

```shell
make get-logs
```

Generate text:
```shell
curl -X POST http://$API_URL/generate/text \
  -H "Content-Type: application/json" \
  -d '{"model_name": "BEE-spoke-data/smol_llama-101M-GQA", "prompt": "Once upon a time", "max_new_tokens": 80}'
```

Generate an image:

```shell
curl -X POST http://$API_URL/generate/image \
  -H "Content-Type: application/json" \
  -d '{"model_name": "Efficient-Large-Model/Sana_Sprint_1.6B_1024px", "prompt": "A sunset over mountains"}' \
  | python3 -c "import sys,json,base64; d=json.load(sys.stdin); open('out.png','wb').write(base64.b64decode(d['result']))"
open out.png
```

Run locally:

```shell
make serve
```

Tear down:

```shell
# Delete K8s resources first to release the LoadBalancer EIP
kubectl delete -f k8s/

# Destroy all Terraform-managed AWS resources
cd iac/
terraform destroy -auto-approve
```

Troubleshooting:

```shell
# Inspect deployment status and events
kubectl describe deployment aiforge-api

# Check pod status and logs
kubectl get pods -l app=aiforge-api
kubectl describe pod <pod-name>
kubectl logs <pod-name>

# Check recent events
kubectl get events
```

| Resource | Unit Details | Count | Est. $/hr | Est. $/mo |
|---|---|---|---|---|
| EKS Control Plane | Cluster management fee | 1 | $0.10 | $73.00 |
| EKS Worker Nodes | m6i.2xlarge Spot Instances | 2 (Desired) | ~$0.244 | ~$178.12 |
| NAT Gateway | Connectivity for private subnets | 1 | $0.045 | $32.85 |
| ECR Storage | Training & Serving Repos (5 image limit) | 2 | < $0.01 | ~$1.00 |
| S3 Storage | Model Artifacts bucket | 1 | < $0.01 | ~$1.00 |
| VPC & Data | Public IP and data transfer | β | ~$0.01 | ~$7.30 |
| Total | | | ~$0.41 | ~$293.27 |
A simple VPC with 2 public + 2 private subnets across AZs was chosen. EKS nodes live in private subnets and reach the internet (for HuggingFace downloads) through a single NAT Gateway β cost-optimized but not HA. The LoadBalancer is provisioned in public subnets by the AWS LB controller.
Instead of embedding AWS credentials in the pod environment, the API pod assumes an IAM role via OIDC federation. This follows the least-privilege principle: the role grants only s3:Get/Put/List/Delete on the specific artifact bucket.
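The grant can be sketched as the policy document attached to the IRSA role; the bucket name and statement ID below are placeholders, not the values from `iam.tf`:

```python
import json

def artifact_bucket_policy(bucket: str) -> dict:
    """Build a least-privilege S3 policy scoped to a single artifact
    bucket: only Get/Put/List/Delete, only on that bucket's ARNs."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "ArtifactBucketAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:DeleteObject",
            ],
            "Resource": [
                f"arn:aws:s3:::{bucket}",      # bucket itself (ListBucket)
                f"arn:aws:s3:::{bucket}/*",    # objects (Get/Put/Delete)
            ],
        }],
    }

if __name__ == "__main__":
    print(json.dumps(artifact_bucket_policy("model-artifacts"), indent=2))
```

Inside the pod, `boto3` picks up the role's temporary credentials automatically via the projected service-account token, so no keys ever appear in the pod spec.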
The training Job downloads the real model weights, simulates epoch iterations with decaying loss metrics, then uploads the checkpoint + metadata.json to S3. This makes the artifact lifecycle realistic without requiring expensive GPU fine-tuning. The API uses the metadata.json to discover completed runs.
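A minimal sketch of that loop, assuming an exponentially decaying loss and metadata fields matching the S3 layout above; `mock_train` and its field names are illustrative, not the actual `train.py`:

```python
import json
import time

def mock_train(model_name: str, run_id: str, epochs: int) -> dict:
    """Simulate epoch iterations with a decaying loss and build the
    metadata.json document uploaded alongside the checkpoint."""
    losses = []
    loss = 2.5
    for _ in range(epochs):
        loss *= 0.6                # decay the loss each epoch
        losses.append(round(loss, 4))
    return {
        "model_name": model_name,
        "run_id": run_id,
        "epochs": epochs,
        "final_loss": losses[-1],
        "loss_history": losses,
        "artifact_prefix": f"models/{model_name}/{run_id}/checkpoint/",
        "completed_at": int(time.time()),
    }

if __name__ == "__main__":
    meta = mock_train("smol_llama", "run-abc123", epochs=3)
    # In the real Job this document would be uploaded with boto3, e.g.
    # s3.put_object(Bucket=..., Key=".../metadata.json", Body=json.dumps(meta))
    print(json.dumps(meta, indent=2))
```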
Training is implemented as a Kubernetes Job (not a Deployment) because training is a one-shot, finite workload. The API dynamically creates Jobs with a unique run_id per request. ttlSecondsAfterFinished: 3600 auto-cleans completed Job objects.
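The Job-per-request pattern can be sketched as a manifest builder; in the API pod this dict would be submitted with the official `kubernetes` client's `BatchV1Api.create_namespaced_job`, but the image name, labels, and env vars below are placeholders, not the actual `main.py`:

```python
import uuid

def build_training_job(model_name: str, epochs: int,
                       image: str = "training:latest") -> dict:
    """Build a one-shot Kubernetes Job manifest with a unique run_id
    and auto-cleanup one hour after completion."""
    run_id = f"run-{uuid.uuid4().hex[:8]}"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"training-{run_id}"},
        "spec": {
            "ttlSecondsAfterFinished": 3600,   # auto-clean finished Jobs
            "backoffLimit": 0,                 # a failed run is not retried
            "template": {
                "spec": {
                    "restartPolicy": "Never",  # finite workload, no restarts
                    "containers": [{
                        "name": "train",
                        "image": image,
                        "env": [
                            {"name": "MODEL_NAME", "value": model_name},
                            {"name": "RUN_ID", "value": run_id},
                            {"name": "EPOCHS", "value": str(epochs)},
                        ],
                    }],
                }
            },
        },
    }
```

`restartPolicy: Never` plus `backoffLimit: 0` makes failure visible in the Job status instead of silently retrying a broken training image.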