Module 6
By completing this module, you will deliver:
Monitoring Infrastructure:
- Prometheus Server: Time-series database scraping metrics every 15s with 7-day retention
- 8 Production Alert Rules: High error rate, latency, service downtime, resource exhaustion
- Grafana Dashboard: Real-time visualization of request rates, errors, latency percentiles, resource usage
- Kubernetes Service Discovery: Automatic detection and monitoring of ML services
Real-World Impact:
- Incident Detection: Alert fires about 2 minutes after the error rate first exceeds 5% (the rule's for: 2m hold period)
- Debugging Speed: Reduce troubleshooting time from hours to minutes with correlated metrics
- Capacity Planning: Visualize CPU/memory trends to predict when to scale infrastructure
- SLA Monitoring: Track P95/P99 latency to ensure performance SLAs are met
By the end of this module, you will:
- Configure Prometheus for metrics collection
- Set up Kubernetes service discovery
- Create alerting rules with PromQL
- Build Grafana dashboards for ML monitoring
- Understand MLOps-specific observability patterns
This module teaches you to build production monitoring for ML services using Prometheus and Grafana. Complete three progressive exercises that cover metrics collection, alerting, and visualization for your MLOps stack.
| Challenge | Without Monitoring | With Monitoring |
|---|---|---|
| ML Latency | "Why is inference slow?" | P95/P99 latency tracked |
| Error Rate | "Are predictions failing?" | 5xx errors alerted |
| Resource Usage | "Pod OOM killed" | Memory usage trends visible |
| Scaling Issues | "HPA not working?" | CPU/memory vs replicas correlated |
| Incident Response | Hours to debug | Minutes with correlated metrics |
This module uses a scaffolded learning approach with three progressive exercises:
Exercise 1: Alerting Rules
├── Alert rule structure
├── PromQL expressions for alerts
├── Severity levels and thresholds
└── Time-based alert conditions
Exercise 2: Grafana Dashboard
├── Datasource configuration
├── Dashboard panel creation
├── PromQL queries for visualizations
└── Panel types and formats
What does "scaffolded" mean?
- 80-90% of YAML is provided for you
- You fill in ~10-20% (critical configurations and queries)
- Focus on learning Prometheus/Grafana concepts
- Each TODO has inline hints showing exactly what to use
Prerequisites:
- Completed Module 3 (ML Service deployment)
- Completed Module 4 (API Gateway deployment)
- kubectl configured
- kind cluster running
Goal: Create alerting rules for high error rates, latency, and service downtime.
code prometheus-alerts.yaml
Test alerts:
# Deploy alerts
kubectl apply -f prometheus-alerts.yaml
# Restart Prometheus to load rules
kubectl rollout restart deployment/prometheus
# View in UI
kubectl port-forward svc/prometheus 9090:9090
# Navigate to: Alerts tab
Goal: Build a Grafana dashboard with panels for request rate, errors, latency, and resource usage.
code grafana-dashboard.yaml
Test dashboard:
kubectl apply -f grafana-dashboard.yaml
kubectl wait --for=condition=ready pod -l app=grafana --timeout=120s
kubectl port-forward svc/grafana 3000:3000
# Windows PowerShell: open browser
Start-Process "http://localhost:3000"
# macOS / Linux
open http://localhost:3000
Login: admin / admin → Dashboards → MLOps Workshop → MLOps Overview
# macOS / Linux / WSL: port-forward, then generate traffic
kubectl port-forward svc/api-gateway-service 8080:80 &
for i in {1..100}; do curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"request": {"text": "Go is amazing!"}}'; done
# Windows PowerShell: run port-forward in a separate terminal first, then:
1..100 | ForEach-Object { Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body '{"request": {"text": "Go is amazing!"}}' }
Watch metrics update in Grafana: request rate, latency, and resource usage panels.
- Scrape Model: Pull metrics from targets every 15s
- Service Discovery: Automatically find pods to monitor
- Relabeling: Filter and transform discovered targets
- TSDB: Time-series database for efficient storage
- PromQL: Query language for metrics
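To make the service discovery and relabeling concepts above concrete, here is a minimal Python sketch of what a relabel rule with action: keep does to discovered targets. It is a simulation, not Prometheus code: the pod names and the keep helper are hypothetical, and it relies only on two documented behaviors (source label values are joined with ";" and the regex must match the whole value).

```python
import re

# Kubernetes SD exposes pod annotations as __meta_kubernetes_pod_annotation_<name> labels.
SCRAPE_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"


def keep(targets, source_labels, regex, separator=";"):
    """Simulate a relabel rule with `action: keep` (hypothetical helper)."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # Source label values are joined with the separator; a missing label is empty.
        value = separator.join(labels.get(name, "") for name in source_labels)
        # Relabel regexes are anchored, so the whole value has to match.
        if pattern.fullmatch(value):
            kept.append(labels)
    return kept


# Hypothetical discovery result for the default namespace.
discovered = [
    {"__meta_kubernetes_pod_name": "api-gateway-abc", SCRAPE_LABEL: "true"},
    {"__meta_kubernetes_pod_name": "sentiment-api-xyz", SCRAPE_LABEL: "true"},
    {"__meta_kubernetes_pod_name": "postgres-0"},  # no annotation: dropped
    {"__meta_kubernetes_pod_name": "debug-pod", SCRAPE_LABEL: "truely"},  # partial match: dropped
]

scraped = keep(discovered, [SCRAPE_LABEL], "true")
print([t["__meta_kubernetes_pod_name"] for t in scraped])
```

Pods without the annotation are still discovered; they are simply dropped before scraping, which is why they can appear under service discovery but not under targets.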
kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
        - default
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
Pods opt in with annotations:
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
- alert: GatewayHighErrorRate
  # sum() on both sides so 5xx traffic is divided by all traffic,
  # not matched series by series (which would always yield 1)
  expr: |
    sum(rate(gateway_http_requests_total{status=~"5.."}[5m]))
      / sum(rate(gateway_http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Error rate is {{ $value }}"
# Request rate (req/sec)
rate(metric[5m])
# Error rate (ratio between 0 and 1; multiply by 100 for a percentage)
rate(errors[5m]) / rate(requests[5m])
# Latency percentiles
histogram_quantile(0.95, rate(metric_bucket[5m]))
# Service down
absent(up{job="service"} == 1)
# Resource usage
(usage / limit) > 0.9
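The for: 2m clause in the alert rule above means the expression has to stay true across consecutive evaluations before the alert fires. The following Python sketch models that pending-to-firing behavior. The 15s evaluation interval and the sample error ratios are assumptions for illustration (the module only states the scrape interval), and alert_states is a hypothetical helper, not a Prometheus API.

```python
def alert_states(ratios, threshold=0.05, for_seconds=120, interval=15):
    """Return the alert state after each rule evaluation (hypothetical helper)."""
    states = []
    active_since = None  # time the condition first became true in the current streak
    for step, ratio in enumerate(ratios):
        now = step * interval
        if ratio > threshold:
            if active_since is None:
                active_since = now
            # The alert fires once the condition has held for the full `for` duration.
            firing = now - active_since >= for_seconds
            states.append("firing" if firing else "pending")
        else:
            active_since = None  # a single false evaluation resets the timer
            states.append("inactive")
    return states


# Assumed error ratios, one per 15s evaluation: a short blip, then a sustained problem.
ratios = [0.01] + [0.08] * 4 + [0.02] + [0.08] * 9
states = alert_states(ratios)
for step, state in enumerate(states):
    print(f"t={step * 15:>3}s ratio={ratios[step]:.2f} {state}")
```

The short blip never leaves the pending state, which is the point of the hold period: it trades a couple of minutes of detection delay for fewer false alarms.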
{
  "panels": [
    {
      "title": "Request Rate",
      "targets": [
        {
          "expr": "sum(rate(gateway_http_requests_total[5m]))",
          "legendFormat": "Requests/sec"
        }
      ]
    }
  ]
}
Gateway Metrics (from Module 4):
gateway_http_requests_total{method,endpoint,status}
gateway_http_request_duration_seconds_bucket{le}
gateway_backend_requests_total{endpoint,status}
gateway_backend_request_duration_seconds_bucket{le}
ML Service Metrics (from BentoML):
bentoml_service_request_total
bentoml_service_request_duration_seconds
Kubernetes Metrics:
container_memory_usage_bytes
container_cpu_usage_seconds_total
kube_pod_status_phase
kube_horizontalpodautoscaler_status_current_replicas
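The _bucket{le} series listed above are cumulative histogram buckets, and latency percentiles are estimated from them rather than measured exactly. Below is a simplified Python sketch of the interpolation that histogram_quantile performs. The bucket bounds and counts are made-up values, and the function is an illustration of the technique, not the PromQL implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Assumed request-duration buckets in seconds: (le, cumulative count).
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]

for q in (0.5, 0.9, 0.995):
    print(f"P{q * 100:g} ~ {histogram_quantile(q, buckets):.3f}s")
```

Because the result is interpolated inside a bucket, the accuracy of P95/P99 depends on how well the bucket boundaries bracket the latencies you care about.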
# Prometheus
kubectl port-forward svc/prometheus 9090:9090
# Grafana
kubectl port-forward svc/grafana 3000:3000
# Check logs
kubectl logs -l app=prometheus
kubectl logs -l app=grafana
# Windows PowerShell: open browsers
Start-Process "http://localhost:9090" # Prometheus → Status → Targets / Alerts
Start-Process "http://localhost:3000" # Grafana → Login: admin/admin
# macOS / Linux
open http://localhost:9090
open http://localhost:3000
Symptoms:
- Prometheus UI → Status → Targets shows "0/0 up"
- Service discovery finds pods but doesn't scrape them
- Metrics not appearing in Prometheus
Root Cause: Missing pod annotations or incorrect relabel configuration
Step-by-step solution:
# 1. Check service discovery (visit http://localhost:9090/service-discovery after port-forward)
kubectl port-forward svc/prometheus 9090:9090
# 2. Verify pod annotations exist
kubectl get pods -l app=api-gateway -o yaml | grep -A 3 "prometheus.io"
# 5. Check Prometheus logs for scrape errors
kubectl logs -l app=prometheus | grep -i error
kubectl logs -l app=prometheus | grep "scrape"
# Windows PowerShell
# 2. Verify pod annotations
kubectl get pods -l app=api-gateway -o yaml | Select-String "prometheus.io" -Context 0,3
# 3. Add missing annotations to deployment (single line)
kubectl patch deployment api-gateway -p '{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"prometheus.io/scrape\":\"true\",\"prometheus.io/port\":\"8080\",\"prometheus.io/path\":\"/metrics\"}}}}}'
# 5. Check logs for scrape errors
kubectl logs -l app=prometheus | Select-String -Pattern "error","scrape" -SimpleMatch
Symptoms:
- Dashboard panels show "No Data" message
- Prometheus datasource shows green checkmark
- Time range is set correctly
Root Cause: No metrics exist yet, or wrong PromQL query
Step-by-step solution:
- Grafana UI → Configuration → Data Sources → Prometheus → Save & Test → should show "Data source is working"
- Verify metrics exist: visit http://localhost:9090/graph and query gateway_http_requests_total
- If no metrics, generate traffic (run port-forward in a separate terminal first):
# macOS / Linux / WSL
for i in {1..20}; do curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"request": {"text": "Go is amazing!"}}'; sleep 1; done
# Windows PowerShell
1..20 | ForEach-Object { Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body '{"request": {"text": "Go is amazing!"}}'; Start-Sleep 1 }
- Wait 15-30 seconds for Prometheus to scrape
- Grafana → top-right time picker → Last 15 minutes
- Verify PromQL syntax in Prometheus UI first:
rate(gateway_http_requests_total[5m])
- Panel → Edit → Query → Data source: Prometheus
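The "verify metrics exist" step can also be scripted against the Prometheus HTTP API instead of the UI. This is a rough Python sketch using only the standard library. It assumes the Prometheus port-forward on localhost:9090 is running; the helper names are hypothetical, and the canned sample response is illustrative data in the shape the instant-query endpoint returns.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumes the port-forward is running


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Build an instant-query URL for the /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_vector(payload):
    """Return (labels, value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return [(item["metric"], float(item["value"][1]))
            for item in payload["data"]["result"]]


def query(expr):
    """Run the query against a live Prometheus (needs the port-forward)."""
    with urllib.request.urlopen(build_query_url(expr), timeout=5) as resp:
        return parse_instant_vector(json.load(resp))


# Canned response so the parser can be tried without a cluster.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "gateway_http_requests_total", "status": "200"},
                "value": [1700000000.0, "42"],
            }
        ],
    },
}

print(build_query_url("gateway_http_requests_total"))
print(parse_instant_vector(sample))
```

An empty result list from query("gateway_http_requests_total") means Prometheus has not scraped the metric yet, which points at traffic or scrape configuration rather than at the Grafana panel.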
# Deploy Prometheus
kubectl apply -f prometheus-config.yaml
# Check Prometheus deployment
kubectl get deployment prometheus
kubectl get pods -l app=prometheus
kubectl describe pod -l app=prometheus
# View Prometheus logs
kubectl logs -l app=prometheus
kubectl logs -l app=prometheus -f # Follow logs
kubectl logs -l app=prometheus --previous # Previous container
# Access Prometheus UI
kubectl port-forward svc/prometheus 9090:9090
# macOS / Linux: open http://localhost:9090
# Windows PowerShell: Start-Process "http://localhost:9090"
# Restart Prometheus
kubectl rollout restart deployment/prometheus
kubectl wait --for=condition=ready pod -l app=prometheus --timeout=120s
# Check Prometheus configuration
kubectl get configmap prometheus-config -o yaml
# Update configuration
kubectl apply -f prometheus-config.yaml
kubectl rollout restart deployment/prometheus
# Check Prometheus metrics about itself
# macOS / Linux / WSL:
curl http://localhost:9090/metrics
# Windows PowerShell:
# Invoke-RestMethod http://localhost:9090/metrics
# Verify scrape targets β Prometheus UI β Status β Targets
# Or via API (macOS / Linux / WSL):
curl http://localhost:9090/api/v1/targets
# Windows PowerShell:
# Invoke-RestMethod http://localhost:9090/api/v1/targets
# Access Prometheus UI for queries
kubectl port-forward svc/prometheus 9090:9090
# macOS / Linux: open http://localhost:9090/graph
# Windows PowerShell: Start-Process "http://localhost:9090/graph"
# Common queries for ML services:
# Request rate (requests per second)
rate(gateway_http_requests_total[5m])
sum(rate(gateway_http_requests_total[5m]))
# Error rate (percentage)
sum(rate(gateway_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(gateway_http_requests_total[5m])) * 100
# Request breakdown by endpoint
sum(rate(gateway_http_requests_total[5m])) by (endpoint)
# Request breakdown by status code
sum(rate(gateway_http_requests_total[5m])) by (status)
# P95 latency
histogram_quantile(0.95,
rate(gateway_http_request_duration_seconds_bucket[5m]))
# P99 latency
histogram_quantile(0.99,
rate(gateway_http_request_duration_seconds_bucket[5m]))
# ML inference latency
histogram_quantile(0.95,
rate(gateway_backend_request_duration_seconds_bucket[5m]))
# Memory usage (bytes)
container_memory_usage_bytes{pod=~"api-gateway.*"}
container_memory_usage_bytes{pod=~"sentiment-api.*"}
# Memory usage (percentage)
(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100
# CPU usage
rate(container_cpu_usage_seconds_total{pod=~"api-gateway.*"}[5m])
# HPA replicas
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="sentiment-api-hpa"}
kube_horizontalpodautoscaler_status_desired_replicas{horizontalpodautoscaler="sentiment-api-hpa"}
# Pod status
kube_pod_status_phase{pod=~"api-gateway.*"}
kube_pod_status_phase{pod=~"sentiment-api.*"}
Run kubectl port-forward svc/api-gateway-service 8080:80 in a separate terminal first, then use the commands below.
Single request:
curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"request": {"text": "Go is amazing!"}}'
# Windows PowerShell
Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body '{"request": {"text": "Go is amazing!"}}'
Continuous traffic (light: 100 requests):
# macOS / Linux / WSL
for i in {1..100}; do curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d "{\"text\": \"Go is amazing!\", \"request_id\": \"$i\"}"; sleep 0.1; done
# Windows PowerShell (backticks escape the quotes inside a double-quoted string)
1..100 | ForEach-Object { Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body "{`"text`": `"Go is amazing!`", `"request_id`": `"$_`"}"; Start-Sleep -Milliseconds 100 }
Sustained load (heavy: loops forever, Ctrl+C to stop):
# macOS / Linux / WSL
while true; do for i in {1..10}; do curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"request": {"text": "Go is amazing!"}}'; done; sleep 1; done
# Windows PowerShell
while ($true) { 1..10 | ForEach-Object { Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body '{"request": {"text": "Go is amazing!"}}' }; Start-Sleep 1 }
Mixed traffic (success + errors):
# macOS / Linux / WSL
for i in {1..50}; do curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"request": {"text": "Go is amazing!"}}'; curl -X POST http://localhost:8080/predict -d 'invalid json'; done
# Windows PowerShell
1..50 | ForEach-Object { Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body '{"request": {"text": "Go is amazing!"}}' -ErrorAction SilentlyContinue; Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -Body 'invalid json' -ErrorAction SilentlyContinue }
Stop the port-forward:
pkill -f "port-forward.*8080:80"
# Windows PowerShell: close the terminal running port-forward, or:
Stop-Process -Id (Get-NetTCPConnection -LocalPort 8080).OwningProcess -ErrorAction SilentlyContinue
If you get stuck, reference implementations are in solution/:
Note: Try to complete exercises on your own first!
The Go API Gateway from Module 4 exposes Prometheus metrics automatically:
Gateway metrics exposed:
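// modules/module-4/main.go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Counter of all HTTP requests, labeled by method, endpoint, and status code.
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "gateway_http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)
	// Histogram of request durations in seconds, using the default buckets.
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "gateway_http_request_duration_seconds",
			Help:    "HTTP request duration",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
)
Prometheus scrapes these automatically via annotations: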
# modules/module-4/deployment.yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
Query gateway metrics in Grafana:
# Request rate by endpoint
sum(rate(gateway_http_requests_total[5m])) by (endpoint)
# Error rate
sum(rate(gateway_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(gateway_http_requests_total[5m]))
# P95 latency
histogram_quantile(0.95,
rate(gateway_http_request_duration_seconds_bucket[5m]))
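To see what the request-rate and error-rate queries above compute, here is a simplified Python sketch that applies the same steps to a few counter samples: a per-series rate, a sum by endpoint, and a 5xx ratio. The sample values are invented, and per_second_rate is a deliberately reduced version of rate() (the real function also extrapolates to the window boundaries).

```python
import re
from collections import defaultdict


def per_second_rate(first, last):
    """Simplified rate(): counter increase divided by elapsed seconds."""
    (t0, v0), (t1, v1) = first, last
    # A drop in value means the counter was reset (for example, a pod restart).
    increase = v1 - v0 if v1 >= v0 else v1
    return increase / (t1 - t0)


# Assumed samples over a 5m window: (endpoint, status) -> ((t_start, value), (t_end, value))
samples = {
    ("/predict", "200"): ((0, 1000), (300, 1570)),
    ("/predict", "500"): ((0, 10), (300, 40)),
    ("/health", "200"): ((0, 500), (300, 560)),
}

# rate(gateway_http_requests_total[5m]): one value per labeled series
rates = {labels: per_second_rate(*window) for labels, window in samples.items()}

# sum(...) by (endpoint): collapse the status label
by_endpoint = defaultdict(float)
for (endpoint, _status), value in rates.items():
    by_endpoint[endpoint] += value

# sum(5xx rates) / sum(all rates): aggregate first, then divide
errors = sum(v for (_e, status), v in rates.items() if re.fullmatch("5..", status))
error_ratio = errors / sum(rates.values())

print(dict(by_endpoint))
print(f"error ratio: {error_ratio:.4f}")
```

Aggregating before dividing matters: dividing the unaggregated series would pair each 5xx series with itself and always produce 1.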
BentoML services from Module 3 expose metrics automatically:
BentoML default metrics:
bentoml_service_request_total{endpoint, http_response_code, service_name, service_version}
bentoml_service_request_duration_seconds{endpoint, service_name, service_version}
bentoml_service_request_in_progress{endpoint, service_name, service_version}
Kubernetes resource metrics:
# Memory usage of ML service
container_memory_usage_bytes{pod=~"sentiment-api.*"}
# CPU usage
rate(container_cpu_usage_seconds_total{pod=~"sentiment-api.*"}[5m])
# HPA status
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="sentiment-api-hpa"}
Alert on ML service issues:
# prometheus-alerts.yaml
- alert: MLServiceDown
  expr: absent(up{job="ml-service"} == 1)
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "ML Service is down"
    description: "ML service has been unavailable for 1+ minutes"
- alert: MLInferenceLatencyHigh
  expr: |
    histogram_quantile(0.95,
      rate(gateway_backend_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ML inference latency high: {{ $value }}s"
Monitor Kubeflow pipeline runs and model training metrics:
Pipeline execution metrics:
# Pipeline runs by status
count(argo_workflows_status) by (status)
# Pipeline duration
histogram_quantile(0.95, argo_workflow_duration_seconds_bucket)
# Failed pipelines
count(argo_workflows_status{status="Failed"})
Model training metrics (custom):
# modules/module-1/train.py
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
training_accuracy = Gauge('model_training_accuracy',
                          'Model training accuracy',
                          registry=registry)
training_loss = Gauge('model_training_loss',
                      'Model training loss',
                      registry=registry)

# After training
training_accuracy.set(accuracy)
training_loss.set(loss)
push_to_gateway('prometheus-pushgateway:9091',
                job='model-training',
                registry=registry)
Dashboard for ML lifecycle:
# Training jobs with pushed metrics (counts series currently held in the Pushgateway, not runs per day)
count(model_training_accuracy{job="model-training"})
# Latest model accuracy
model_training_accuracy{job="model-training"}
# Model deployment count
count(kube_deployment_labels{deployment=~"sentiment-api.*"})
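The model_training_* series queried above only exist after a training job pushes them. As a rough illustration of what that push looks like on the wire, this standard-library Python sketch builds the equivalent HTTP request by hand: a PUT of text-format metrics to the Pushgateway's /metrics/job/<job> path. The build_push_request helper and the metric values are hypothetical; in the workshop code the prometheus_client library handles this for you.

```python
import urllib.request


def build_push_request(gateway, job, metrics):
    """Build the HTTP request that replaces all metrics in a job's group."""
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The text exposition format requires a trailing newline.
    body = ("\n".join(lines) + "\n").encode("utf-8")
    return urllib.request.Request(
        url=f"http://{gateway}/metrics/job/{job}",
        data=body,
        method="PUT",  # PUT replaces the whole group; POST would only replace same-named metrics
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


request = build_push_request(
    "prometheus-pushgateway:9091",
    "model-training",
    {
        "model_training_accuracy": ("Model training accuracy", 0.93),  # made-up value
        "model_training_loss": ("Model training loss", 0.21),  # made-up value
    },
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it (requires a reachable Pushgateway):
# urllib.request.urlopen(request, timeout=5)
```

Pushed values stay in the Pushgateway until they are overwritten or deleted, so a gauge such as model_training_accuracy reflects the last job that pushed, not a live measurement.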
| Component | Workshop | Production |
|---|---|---|
| Deployment | Raw manifests | Helm (kube-prometheus-stack) |
| Storage | emptyDir (ephemeral) | PersistentVolumeClaim (50Gi+) |
| Retention | 7 days | 30+ days |
| Replicas | 1 (single pod) | 2+ with HA |
| Auth | Anonymous enabled | RBAC + OAuth |
| Alerting | No AlertManager | AlertManager + PagerDuty/Slack |
| TLS | HTTP only | HTTPS with cert-manager |
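The storage and retention rows in the table above are linked: longer retention needs proportionally more disk. A back-of-the-envelope Python sketch, using the sizing rule from the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample, at roughly 1-2 bytes per sample), shows the effect. The 10,000 active series figure is an assumption for illustration, not a measurement from this workshop.

```python
def estimate_tsdb_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough disk estimate: retention seconds x samples/sec x bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


GIB = 1024 ** 3

# Assumed workload: 10,000 active series scraped every 15s.
for label, days in [("workshop (7 days)", 7), ("production (30 days)", 30)]:
    size = estimate_tsdb_bytes(active_series=10_000, scrape_interval_s=15,
                               retention_days=days)
    print(f"{label}: {size / GIB:.2f} GiB")
```

Estimates like this are a starting point for sizing a PersistentVolumeClaim; real usage also depends on label churn and should be checked against the running server.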
Once you've completed all exercises:
Extend monitoring:
- Add more alert rules (CPU throttling, disk space)
- Create custom Grafana dashboards
- Integrate with AlertManager
- Add Loki for log aggregation
Production deployment:
- Use Helm for easier management
- Configure persistent storage
- Enable authentication and TLS
- Set up alert routing (PagerDuty, Slack)
Workshop Complete! You've mastered the entire MLOps stack!
- Metrics Collection: Automatic service discovery with Prometheus
- Alerting: PromQL-based alerts for ML services
- Visualization: Production dashboards with Grafana
- MLOps Observability: Specific patterns for ML systems
- Production Ready: Scalable monitoring architecture
Congratulations! You've completed the MLOps workshop and built a full production ML platform!
From model training (Module 1) to monitoring (Module 6), you now have hands-on experience with the entire MLOps lifecycle.