Deployed, stress-tested, and broken on purpose. This is what a real Kubernetes project looks like.
This is not a "kubectl apply and call it done" project.
Stan's Robot Shop was deployed on a multi-node AWS EKS cluster and deliberately pushed beyond a standard deployment: real traffic, real autoscaling, real failures, and real debugging. The goal was to understand how a distributed system behaves under load, where it breaks, and how to fix it.
The focus throughout: system behavior, not just configuration.
| Metric | Result |
|---|---|
| 🚀 Peak throughput | ~4,850 requests/sec |
| ✅ Failure rate at 1,000 concurrent users | ~0% |
| ⚡ p95 latency | ~195 ms |
| 🔍 Bottleneck identified | Catalogue service at >170% CPU |
| 🔧 Critical bug fixed | Istio routing misconfiguration blocking traffic distribution |
| 📈 HPA validated | Live autoscaling under sustained load |
```
Users
  │
  ▼
AWS Elastic Load Balancer
  │
  ▼
Istio Ingress Gateway
  │
  ├──► Web (NodeJS Frontend)
  ├──► Catalogue ──► MongoDB
  ├──► Cart ───────► Redis
  ├──► User ───────► MongoDB + Redis
  ├──► Shipping ───► MySQL
  ├──► Ratings ────► MySQL
  ├──► Payment ────► RabbitMQ
  └──► Dispatch ───► RabbitMQ
          │
          ▼
Prometheus ──► Grafana Dashboards
```
8-node cluster topology:
- 5 nodes → stateless application workloads
- 3 nodes → dedicated database workloads (isolated with taints)
| Service | Language | Database |
|---|---|---|
| Web | NodeJS + AngularJS | — |
| Catalogue | NodeJS (Express) | MongoDB |
| Cart | — | Redis |
| User | — | MongoDB + Redis |
| Shipping | Java (Spring Boot) | MySQL |
| Ratings | PHP (Apache) | MySQL |
| Payment | Python (Flask) | RabbitMQ |
| Dispatch | Golang | RabbitMQ |
| Component | Tool |
|---|---|
| Container Orchestration | AWS EKS (Kubernetes) |
| Service Mesh | Istio |
| Package Management | Helm + Kustomize |
| GitOps | ArgoCD |
| Monitoring | Prometheus + Grafana |
| Load Testing | k6 |
| Messaging | RabbitMQ |
The cluster separates stateful and stateless workloads, a pattern that matters in production:

Application nodes (×5) handle all microservice pods, spread using pod anti-affinity to avoid co-locating critical replicas.

Database nodes (×3) are isolated with taints and tolerations. Only database pods can schedule there, preventing compute workloads from starving DB I/O.
```yaml
# Example: DB node taint
taints:
  - key: "workload"
    value: "database"
    effect: "NoSchedule"
```

```yaml
# DB pod toleration
tolerations:
  - key: "workload"
    operator: "Equal"
    value: "database"
    effect: "NoSchedule"
```

Why this matters: Without node isolation, a CPU-hungry microservice pod can be co-located with MongoDB, and suddenly your DB performance collapses under load. Many projects skip this and then misdiagnose the root cause.
Three layers of scheduling control are applied:
- Node affinity → forces pods onto specific node types
- Pod affinity → co-locates related services for lower latency
- Pod anti-affinity → spreads replicas across nodes for availability
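The anti-affinity layer can be sketched roughly like this (the `app` label and service name are illustrative, not copied from the repo's manifests):

```yaml
# Sketch: prefer spreading catalogue replicas across different nodes
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: catalogue          # assumes pods carry an app=catalogue label
          topologyKey: kubernetes.io/hostname   # one replica per node, when possible
```

Using the `preferred` (soft) form keeps pods schedulable even when there are more replicas than nodes; the `required` form would block scheduling instead.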
Both Helm and Kustomize are implemented so you can compare real-world approaches.

Kustomize is best for declarative environment variations without templating complexity.
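A production overlay typically patches replica counts or images over the shared base; a minimal sketch following the repo layout (patch contents are illustrative):

```yaml
# kustomize/overlays/production/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base            # shared base manifests
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
    target:
      kind: Deployment
      name: web           # illustrative target; any base Deployment works
```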
```bash
kubectl apply -k kustomize/overlays/production/
```

The cluster is managed via ArgoCD: every change goes through Git, not `kubectl apply` from a laptop.
What ArgoCD manages here:
- Application grouping by service
- Automated sync on Git push
- Health status and rollout visibility
- Drift detection between desired and live state
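An ArgoCD `Application` for this setup might look roughly like the following (the repo URL is a placeholder, and the source path assumes the Helm chart directory from this repo's layout):

```yaml
# Sketch: ArgoCD Application with automated sync and self-heal
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: robot-shop
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<your-org>/<your-repo>.git   # placeholder
    targetRevision: main
    path: helm
  destination:
    server: https://kubernetes.default.svc
    namespace: robot-shop
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift back to the Git-declared state
```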
After deploying Istio, all inbound traffic was being routed only to the Web service. Under load testing, only Web pods were scaling โ everything else sat idle. The performance numbers looked acceptable, but they were measuring the wrong thing.
Root cause: The Istio VirtualService was missing path-based routing rules. All traffic defaulted to the frontend.
```yaml
# VirtualService: path-based routing
http:
  - match:
      - uri:
          prefix: /api/catalogue
    route:
      - destination:
          host: catalogue
  - match:
      - uri:
          prefix: /api/user
    route:
      - destination:
          host: user
  - route:
      - destination:
          host: web
```

| Before | After |
|---|---|
| Only Web scaled | Multiple services scaled under HPA |
| Catalogue idle | Catalogue hit >170% CPU; real bottleneck exposed |
| Performance numbers misleading | Accurate service-level load distribution |
Key lesson: Routing configuration directly affects what your autoscaler sees. If traffic never reaches a service, its HPA will never trigger, and you'll miss the actual bottleneck entirely.
Realistic user behavior was simulated, not just hitting one endpoint:
```javascript
// k6 script: simulated user journey
import http from 'k6/http';
import { sleep } from 'k6';

export default function () {
  http.get(`${BASE_URL}/`);                     // Homepage
  http.get(`${BASE_URL}/api/catalogue/`);       // Browse products
  http.get(`${BASE_URL}/api/catalogue/${id}`);  // Product detail
  sleep(1);
}
```

Test configuration: 1,000 virtual users (VUs), sustained load, ramped over time.
| Metric | Value |
|---|---|
| Throughput | ~4,850 req/sec |
| p95 latency | ~195 ms |
| p99 latency | ~410 ms |
| Failure rate | ~0% |
| Max spike observed | ~8 s (investigated; traced to DB cold start) |
CPU-based HPA was configured per service with individual thresholds:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalogue
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalogue
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```

| Service | Behavior |
|---|---|
| Catalogue | Scaled to max replicas; confirmed CPU bottleneck at >170% |
| Web | Scaled moderately; handled frontend load |
| Cart, User, Payment | Remained near idle; traffic pattern didn't trigger them |
Vertical Pod Autoscaling was also deployed alongside HPA to right-size resource requests based on observed usage. Running both is uncommon but powerful: it prevents over-provisioning while HPA handles burst scaling.
⚠️ Note: HPA + VPA together requires careful tuning. VPA changing resource requests can interfere with HPA's CPU utilization calculation if not configured correctly (VPA is set to `updateMode: "Off"` for HPA-managed deployments).
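The recommendation-only VPA described above can be sketched like this (the target name mirrors the catalogue HPA for illustration):

```yaml
# Sketch: VPA in recommendation-only mode, so it never resizes pods itself
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: catalogue
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalogue
  updatePolicy:
    updateMode: "Off"   # emit recommendations only; HPA keeps control of replica scaling
```

The recommendations then surface under the VPA's status and can be applied manually to the Deployment's resource requests.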
Real-time visibility into every layer of the system:
Prometheus scrapes metrics from all pods, the Istio control plane, and node exporters.
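Assuming annotation-based service discovery is enabled in the Prometheus scrape config, pods opt in to scraping with annotations like these (port and path are illustrative):

```yaml
# Sketch: pod-template annotations for Prometheus scrape discovery
metadata:
  annotations:
    prometheus.io/scrape: "true"     # opt this pod in to scraping
    prometheus.io/port: "9090"       # illustrative metrics port
    prometheus.io/path: "/metrics"   # illustrative metrics path
```

Note these annotations are a widely used convention, not built-in Kubernetes behavior; they only work if the Prometheus configuration has relabeling rules that honor them.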
Grafana dashboards surface:
- Per-service CPU and memory usage
- Request rate and error rate per service
- HPA replica counts over time
- p95/p99 latency per endpoint
- RabbitMQ queue depth
Istio telemetry provides:
- Service-to-service traffic maps
- Distributed tracing (via Jaeger integration)
- Latency breakdowns per hop
(Add Grafana screenshots here)
Resilience was tested deliberately, not just hoped for.
```bash
# Delete a running catalogue pod
kubectl delete pod -l app=catalogue -n robot-shop
```

Observed: Kubernetes self-healing. A new pod was scheduled and running within ~15 seconds, with no user-visible errors thanks to multiple replicas.
```bash
# Inject CPU stress inside a running pod
kubectl exec -it <pod> -- stress --cpu 4 --timeout 60s
```

Observed: HPA detected the CPU spike within ~30 seconds and began scaling. New replicas became ready and absorbed load.
Simulated MongoDB downtime by scaling the MongoDB deployment to 0 replicas.
Observed: The Catalogue service began returning 5xx errors. Other services (Cart on Redis, Shipping on MySQL) continued operating normally, confirming proper service isolation.
These aren't "best practices" from a blog post. These came from breaking things and debugging them:
1. Routing config determines what your autoscaler sees. HPA can't scale a service that never receives traffic. Getting Istio routing right was the prerequisite for everything else working.
2. CPU-based HPA is not always enough. Under bursty I/O workloads, CPU stays low while latency spikes. Custom metrics (RPS, queue depth, latency) are needed for accurate scaling signals.
3. The 8-second latency spike was a DB cold start, not a CPU problem. Observability with Grafana + Prometheus was what made this traceable. Without it, this would have looked like a random anomaly.
4. VPA + HPA together requires deliberate configuration. Setting the VPA to `updateMode: "Off"` on HPA-managed workloads prevents the two from conflicting over resource requests.
5. Node isolation prevents invisible performance degradation. Mixing stateful (DB) and stateless (app) workloads without taints causes noisy-neighbour issues that only show up under load.
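Lesson 2 above points toward custom-metrics scaling; with a Prometheus adapter in place, an HPA on request rate might look roughly like this (the metric name depends entirely on the adapter's rules, so it is an assumption here):

```yaml
# Sketch: HPA on a custom per-pod metric instead of CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalogue-rps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalogue
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed name exposed via prometheus-adapter
        target:
          type: AverageValue
          averageValue: "100"              # scale to hold ~100 RPS per pod
```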
- Custom metrics HPA using Prometheus adapter (scale on RPS and p95 latency, not just CPU)
- Full end-to-end user journey simulation in k6 (browse → cart → checkout → payment)
- Database-level observability (MongoDB exporter, MySQL exporter, Redis exporter)
- Canary deployments via Istio traffic splitting (10% → 50% → 100%)
- CI/CD pipeline with GitHub Actions triggering ArgoCD sync
- Karpenter for intelligent node autoscaling on EKS
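For the planned canary rollouts, Istio traffic splitting would look roughly like this (the `v1`/`v2` subsets assume a matching DestinationRule that does not exist in the repo yet):

```yaml
# Sketch: 90/10 canary split on the catalogue service
http:
  - route:
      - destination:
          host: catalogue
          subset: v1      # stable version (assumed DestinationRule subset)
        weight: 90
      - destination:
          host: catalogue
          subset: v2      # canary version (assumed DestinationRule subset)
        weight: 10
```

Shifting the weights through 10 → 50 → 100 promotes the canary gradually while the Grafana dashboards watch its error rate and latency.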
```
.
├── helm/                    # Helm chart for full deployment
│   ├── templates/           # Kubernetes manifests (templated)
│   └── values.yaml          # Default configuration values
├── kustomize/               # Kustomize-based deployment
│   ├── base/                # Base manifests
│   └── overlays/            # Environment-specific patches
│       ├── dev/
│       └── production/
├── k6/                      # Load testing scripts
│   └── load-test.js
├── monitoring/              # Prometheus + Grafana config
│   ├── prometheus/
│   └── grafana/dashboards/
└── docs/                    # Architecture diagrams, screenshots
```
- AWS EKS cluster (≥8 nodes recommended)
- `kubectl`, `helm`, `istioctl` installed
- ArgoCD deployed on the cluster
```bash
# 1. Install Istio (and its CRDs)
istioctl install --set profile=default -y

# 2. Create the namespace with Istio sidecar injection
kubectl create namespace robot-shop
kubectl label namespace robot-shop istio-injection=enabled

# 3. Deploy Robot Shop
helm install robot-shop ./helm/ \
  --namespace robot-shop \
  -f helm/values.yaml

# 4. Verify all pods are running
kubectl get pods -n robot-shop

# 5. Get the external load balancer URL
kubectl get svc -n istio-system istio-ingressgateway
```

Or deploy with Kustomize instead:

```bash
kubectl apply -k kustomize/overlays/production/
```

Run the load test:

```bash
k6 run k6/load-test.js \
  --vus 1000 \
  --duration 5m \
  --env BASE_URL=http://<your-elb-url>
```
This project demonstrates practical, hands-on experience with:
| Area | What Was Done |
|---|---|
| Kubernetes scheduling | Node affinity, taints, tolerations, pod anti-affinity on 8-node EKS |
| Service mesh | Istio traffic routing, debugged and fixed a real misconfiguration |
| Autoscaling | HPA + VPA deployed and validated under real load |
| Performance testing | k6 at 1,000 VUs, ~4,850 RPS, p95 < 200 ms |
| Observability | Prometheus + Grafana, service-level metrics, latency tracing |
| Chaos engineering | Pod failure, CPU stress injection, database downtime simulation |
| GitOps | ArgoCD managing full application lifecycle |
| Deployment tooling | Helm + Kustomize, both approaches implemented and compared |
The difference between this and a tutorial project: things were broken, debugged, and fixed, and that's documented here.