avaibh/ai-forge

AI Forge

End-to-end ML training and serving platform on AWS EKS using Terraform, Kubernetes, and HuggingFace models.

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                          AWS VPC                                │
│                                                                 │
│  Public Subnets                  Private Subnets                │
│  ┌─────────────┐                ┌──────────────────────────┐    │
│  │ NAT Gateway │<────────────── │  EKS Node Group (SPOT)   │    │
│  │  (1x cost   │                │                          │    │
│  │  optimized) │                │  ┌────────────────────┐  │    │
│  └─────────────┘                │  │  aiforge-api Pod   │  │    │
│                                 │  │  (FastAPI + K8s    │  │    │
│  ┌──────────────┐               │  │   client)          │  │    │
│  │  AWS LB      │ ────────────> │  └────────────────────┘  │    │
│  │  (port 80)   │               │                          │    │
│  └──────────────┘               │  ┌────────────────────┐  │    │
│        ↑                        │  │ Training Job Pods  │  │    │
│  ┌──────────────┐               │  │ (spawned on-demand)│  │    │
│  │  Internet    │               │  └────────────────────┘  │    │
│  │  Gateway     │               └──────────────────────────┘    │
│  └──────────────┘                                               │
│        ↑                                                        │
│     UI / curl                                                   │
└─────────────────────────────────────────────────────────────────┘

  ┌──────────────────────────────────────────────────────────┐
  │  S3 Bucket (model-artifacts)                             │
  │  models/<model_name>/<run_id>/checkpoint/  (weights)     │
  │  models/<model_name>/<run_id>/metadata.json              │
  └──────────────────────────────────────────────────────────┘

  ┌──────────┐   ┌──────────────────┐
  │  ECR     │   │  IAM / IRSA      │
  │ training │   │  (least-priv S3) │
  │ serving  │   └──────────────────┘
  └──────────┘

Component interaction:

  1. User hits the UI or calls POST /train on the LoadBalancer endpoint
  2. The API pod creates a Kubernetes Job in the same namespace
  3. The Job pod downloads the HuggingFace model, runs mock training, and uploads artifacts + metadata.json to S3
  4. User calls POST /generate/text or POST /generate/image; the API pod loads the model from HuggingFace (production: from S3) and returns the result
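The artifact layout from the architecture diagram can be sketched as a small helper. Only the `models/<model_name>/<run_id>/...` structure comes from this README; the names `ArtifactKeys` and `artifact_keys` are hypothetical, and the sketch assumes the model name is used in the key as-is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactKeys:
    """S3 object keys for a single training run (hypothetical helper type)."""
    checkpoint_prefix: str
    metadata_key: str


def artifact_keys(model_name: str, run_id: str) -> ArtifactKeys:
    # Assumption: model_name is inserted unchanged. A HuggingFace ID such as
    # "org/model" contains a slash and would add one more prefix level.
    base = f"models/{model_name}/{run_id}"
    return ArtifactKeys(
        checkpoint_prefix=f"{base}/checkpoint/",
        metadata_key=f"{base}/metadata.json",
    )


if __name__ == "__main__":
    keys = artifact_keys("smol_llama", "run-001")
    print(keys.checkpoint_prefix)
    print(keys.metadata_key)
```

Keeping the key construction in one place means the training job and the API would agree on where a run's weights and metadata live.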

Repository Structure

ai-forge/
├── .github/               GitHub Actions workflow (CI/CD)
│   └── workflows/
│       └── deploy.yaml
├── iac/                   Terraform (VPC, EKS, S3, ECR, IAM)
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   ├── vpc.tf
│   ├── eks.tf
│   ├── s3.tf
│   ├── ecr.tf
│   └── iam.tf
├── training_logic/        Mock training container
│   ├── Dockerfile
│   ├── train.py
│   └── requirements.txt
├── backend_api/           FastAPI serving + trigger API
│   ├── Dockerfile
│   ├── main.py
│   ├── models.py
│   └── requirements.txt
├── k8s/                   Kubernetes manifests
│   ├── api-deployment.yaml
│   ├── configmap.yaml
│   └── rbac.yaml
├── ui/                    Single-file web UI
│   └── index.html
├── .gitignore
├── Makefile
└── README.md

Models

| Model | Type | HuggingFace ID |
| --- | --- | --- |
| Smol LLAMA | Text generation | BEE-spoke-data/smol_llama-101M-GQA |
| Sana Sprint | Image generation | Efficient-Large-Model/Sana_Sprint_1.6B_1024px |

Prerequisites

| Tool | Version |
| --- | --- |
| Terraform | >= 1.3.0 |
| AWS provider (Terraform) | ~> 5.0 |
| kubectl | >= 1.28 |
| Docker | >= 24.x |
| Python | >= 3.11 |

Deployment Guide

1. Configure AWS credentials

aws configure
# or export AWS_PROFILE=my-profile

2. Provision infrastructure with Terraform

cd iac/
terraform init
terraform plan -var="aws_region=us-west-2"
terraform apply -var="aws_region=us-west-2" -auto-approve

3. Configure kubectl

$(terraform output -raw configure_kubectl)
kubectl get nodes   # should show 2 SPOT nodes as Ready

4. Build and push Docker images

CI/CD runs on GitHub Actions (.github/workflows/deploy.yaml), which automates the container image lifecycle from build to deployment.

Manual Deployment (Optional)

To build the images and push them to ECR manually, run:

make deploy

5. Get the API URL

Wait for the pod and LoadBalancer:

kubectl rollout status deployment/aiforge-api
export API_URL=$(kubectl get svc aiforge-api -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "API available at: http://$API_URL"

6. Test the APIs

Health check:

make health-check

List models:

curl http://$API_URL/models | python3 -m json.tool

Trigger training run:

curl -X POST http://$API_URL/train \
  -H "Content-Type: application/json" \
  -d '{"model_name": "BEE-spoke-data/smol_llama-101M-GQA", "epochs": 3}'

Watch the training job:

make get-job-logs

Watch API logs:

make get-logs

Generate text:

curl -X POST http://$API_URL/generate/text \
  -H "Content-Type: application/json" \
  -d '{"model_name": "BEE-spoke-data/smol_llama-101M-GQA", "prompt": "Once upon a time", "max_new_tokens": 80}'

Generate image:

curl -X POST http://$API_URL/generate/image \
  -H "Content-Type: application/json" \
  -d '{"model_name": "Efficient-Large-Model/Sana_Sprint_1.6B_1024px", "prompt": "A sunset over mountains"}' \
  | python3 -c "import sys,json,base64; d=json.load(sys.stdin); open('out.png','wb').write(base64.b64decode(d['result']))"
open out.png   # macOS; on Linux use: xdg-open out.png
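The inline Python in the curl pipeline above is dense, so here is a more readable sketch of the same decoding step. It assumes, as that one-liner does, that the response is JSON with a base64-encoded image under the `result` key; the function names are hypothetical.

```python
import base64
import json


def decode_image_response(body: bytes) -> bytes:
    """Return the raw image bytes from a /generate/image JSON response body."""
    payload = json.loads(body)
    # Assumption: the image is base64-encoded under "result", as in the
    # curl one-liner.
    return base64.b64decode(payload["result"])


def save_image(body: bytes, path: str = "out.png") -> int:
    """Write the decoded image to disk and return the number of bytes written."""
    data = decode_image_response(body)
    with open(path, "wb") as fh:
        return fh.write(data)
```

The response body could be saved first with `curl ... -o response.json` and then passed to `save_image(open("response.json", "rb").read())`.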

7. Open the UI

make serve

8. Teardown

# Delete K8s resources first so the AWS load balancer created by the Service is released;
# otherwise terraform destroy can fail while removing the VPC
kubectl delete -f k8s/
# Destroy all Terraform-managed AWS resources
cd iac/
terraform destroy -auto-approve

Troubleshooting

# Inspect deployment status and events
kubectl describe deployment aiforge-api
# Check pod status and logs
kubectl get pods -l app=aiforge-api
kubectl describe pod <pod-name>
kubectl logs <pod-name>
# Check recent events
kubectl get events

Cost Estimate

| Resource | Unit Details | Count | Est. $/hr | Est. $/mo |
| --- | --- | --- | --- | --- |
| EKS Control Plane | Cluster management fee | 1 | $0.10 | $73.00 |
| EKS Worker Nodes | m6i.2xlarge Spot Instances | 2 (Desired) | ~$0.244 | ~$178.12 |
| NAT Gateway | Connectivity for private subnets | 1 | $0.045 | $32.85 |
| ECR Storage | Training & Serving Repos (5 image limit) | 2 | < $0.01 | ~$1.00 |
| S3 Storage | Model Artifacts bucket | 1 | < $0.01 | ~$1.00 |
| VPC & Data | Public IP and data transfer | n/a | ~$0.01 | ~$7.30 |
| Total | | | ~$0.41 | ~$293.27 |
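The monthly column follows from the hourly rates at roughly 730 hours per month. The sketch below reproduces that arithmetic from the estimate's figures; the 730-hour month and treating ECR and S3 as flat ~$1.00 monthly charges are assumptions inferred from the table, not AWS pricing data.

```python
from decimal import Decimal

HOURS_PER_MONTH = 730  # assumption: 365 days * 24 h / 12 months

# Hourly estimates copied from the cost table.
HOURLY_RATES = {
    "EKS control plane": Decimal("0.10"),
    "EKS worker nodes (2x m6i.2xlarge Spot)": Decimal("0.244"),
    "NAT Gateway": Decimal("0.045"),
    "VPC & data": Decimal("0.01"),
}

# Storage rows are listed as "< $0.01/hr"; the table books them at ~$1.00/mo.
FLAT_MONTHLY = {
    "ECR storage": Decimal("1.00"),
    "S3 storage": Decimal("1.00"),
}


def monthly_costs() -> dict:
    """Convert hourly estimates to monthly ones and add the flat charges."""
    costs = {name: rate * HOURS_PER_MONTH for name, rate in HOURLY_RATES.items()}
    costs.update(FLAT_MONTHLY)
    return costs


def monthly_total() -> Decimal:
    return sum(monthly_costs().values(), Decimal("0"))


if __name__ == "__main__":
    for name, cost in monthly_costs().items():
        print(f"{name:45s} ${cost:.2f}")
    print(f"{'Total':45s} ${monthly_total():.2f}")
```

Using `Decimal` avoids floating-point rounding noise when summing currency values.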

Technical Justification

Network Design

A simple VPC with two public and two private subnets spread across Availability Zones was chosen. EKS nodes live in the private subnets and reach the internet (for HuggingFace downloads) through a single NAT Gateway, which is cost-optimized but not highly available. The LoadBalancer is provisioned in the public subnets by the AWS LB controller.
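To make the two-public, two-private layout concrete, the sketch below carves four subnets out of a VPC range with Python's standard `ipaddress` module. The `10.0.0.0/16` range, the `/20` subnet size, and the AZ names are illustrative assumptions; the actual values live in `iac/vpc.tf` and may differ.

```python
import ipaddress


def plan_subnets(vpc_cidr: str = "10.0.0.0/16",
                 new_prefix: int = 20,
                 azs: tuple = ("us-west-2a", "us-west-2b")) -> dict:
    """Assign one public and one private subnet per AZ (illustrative only)."""
    blocks = list(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix))
    n = len(azs)
    if len(blocks) < 2 * n:
        raise ValueError("VPC range too small for the requested subnets")
    return {
        # The first n blocks become public subnets, the next n private.
        "public": {az: str(block) for az, block in zip(azs, blocks[:n])},
        "private": {az: str(block) for az, block in zip(azs, blocks[n:2 * n])},
    }


if __name__ == "__main__":
    for tier, subnets in plan_subnets().items():
        for az, cidr in subnets.items():
            print(f"{tier:8s} {az}  {cidr}")
```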

IRSA (IAM Roles for Service Accounts)

Instead of embedding AWS credentials in the pod environment, the API pod assumes an IAM role via OIDC federation. This follows the least-privilege principle: the role grants only s3:Get/Put/List/Delete on the specific artifact bucket.
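A least-privilege policy of the kind described might look like the sketch below, written as a Python dict that serializes to an IAM policy document. The bucket name and the exact action list are assumptions; the policy actually attached to the role is defined in `iac/iam.tf`.

```python
import json


def artifact_bucket_policy(bucket: str) -> dict:
    """Build an IAM policy limited to one artifact bucket (illustrative sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level actions apply to keys inside the bucket.
                "Sid": "ObjectAccess",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


if __name__ == "__main__":
    # "model-artifacts" is a placeholder bucket name.
    print(json.dumps(artifact_bucket_policy("model-artifacts"), indent=2))
```

Splitting object-level and bucket-level statements keeps each action paired with the resource type it applies to.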

Mock Training Design

The training Job downloads the real model weights, simulates epoch iterations with decaying loss metrics, then uploads the checkpoint + metadata.json to S3. This makes the artifact lifecycle realistic without requiring expensive GPU fine-tuning. The API uses the metadata.json to discover completed runs.
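The simulated epochs could be sketched as follows. The exponential decay curve, its constants, and the metadata field names are all assumptions for illustration; the real logic and schema are in `training_logic/train.py`.

```python
import math
from datetime import datetime, timezone


def simulate_losses(epochs: int, initial_loss: float = 2.5, decay: float = 0.35) -> list:
    """Return one mock loss value per epoch, decaying exponentially."""
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    return [round(initial_loss * math.exp(-decay * epoch), 4)
            for epoch in range(1, epochs + 1)]


def build_metadata(model_name: str, run_id: str, epochs: int) -> dict:
    """Assemble a metadata.json payload (hypothetical field names)."""
    losses = simulate_losses(epochs)
    return {
        "model_name": model_name,
        "run_id": run_id,
        "epochs": epochs,
        "losses": losses,
        "final_loss": losses[-1],
        "status": "completed",
        "finished_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    print(build_metadata("BEE-spoke-data/smol_llama-101M-GQA", "run-001", epochs=3))
```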

Kubernetes Job Pattern

Training is implemented as a Kubernetes Job (not a Deployment) because training is a one-shot, finite workload. The API dynamically creates Jobs with a unique run_id per request. Setting ttlSecondsAfterFinished: 3600 lets Kubernetes delete finished Job objects one hour after they complete.
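A Job manifest along these lines could be built per request as a plain dict. This is a minimal sketch: the Job name pattern, label, container name, environment variable names, and image are hypothetical, and only the unique `run_id` and `ttlSecondsAfterFinished: 3600` come from the description above.

```python
import uuid


def build_training_job(model_name: str, epochs: int, image: str,
                       namespace: str = "default") -> dict:
    """Return a batch/v1 Job manifest for one training run (illustrative)."""
    run_id = uuid.uuid4().hex[:8]  # lowercase hex keeps the Job name DNS-safe
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"train-{run_id}",
            "namespace": namespace,
            "labels": {"run-id": run_id},
        },
        "spec": {
            "ttlSecondsAfterFinished": 3600,  # clean up an hour after finishing
            "backoffLimit": 0,                # assumption: no automatic retries
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "env": [
                            {"name": "MODEL_NAME", "value": model_name},
                            {"name": "EPOCHS", "value": str(epochs)},
                            {"name": "RUN_ID", "value": run_id},
                        ],
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    job = build_training_job("BEE-spoke-data/smol_llama-101M-GQA", 3,
                             image="example.invalid/training:latest")
    print(job["metadata"]["name"])
```

With the official `kubernetes` Python client, a dict of this shape can be submitted through `BatchV1Api().create_namespaced_job(namespace=..., body=job)`; whether `backend_api/main.py` does exactly that is not confirmed here.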


About

Automated training system for open-source models, enabling users to trigger new training runs via a simple API or web interface
