Module 6

rfashwall edited this page Apr 20, 2026 · 2 revisions

Module 6: MLOps Monitoring with Prometheus & Grafana

What You'll Build

By completing this module, you will deliver:

Monitoring Infrastructure:

  • ✅ Prometheus Server: Time-series database scraping metrics every 15s with 7-day retention
  • ✅ 8 Production Alert Rules: High error rate, latency, service downtime, resource exhaustion
  • ✅ Grafana Dashboard: Real-time visualization of request rates, errors, latency percentiles, resource usage
  • ✅ Kubernetes Service Discovery: Automatic detection and monitoring of ML services

Real-World Impact:

  • Incident Detection: Alerts fire once the 5xx error rate has stayed above 5% for 2 minutes
  • Debugging Speed: Reduce troubleshooting time from hours to minutes with correlated metrics
  • Capacity Planning: Visualize CPU/memory trends to predict when to scale infrastructure
  • SLA Monitoring: Track P95/P99 latency to ensure performance SLAs are met

Learning Objectives

By the end of this module, you will:

  • ✅ Configure Prometheus for metrics collection
  • ✅ Set up Kubernetes service discovery
  • ✅ Create alerting rules with PromQL
  • ✅ Build Grafana dashboards for ML monitoring
  • ✅ Understand MLOps-specific observability patterns

Part 1: Setup & Prerequisites

This module teaches you to build production monitoring for ML services using Prometheus and Grafana. You will complete two progressive exercises that cover alerting and visualization on top of the metrics collected from your MLOps stack.

Why Monitoring Matters for MLOps

| Challenge | Without Monitoring | With Monitoring |
|---|---|---|
| ML Latency | "Why is inference slow?" | P95/P99 latency tracked |
| Error Rate | "Are predictions failing?" | 5xx errors alerted |
| Resource Usage | "Pod OOM killed" | Memory usage trends visible |
| Scaling Issues | "HPA not working?" | CPU/memory vs replicas correlated |
| Incident Response | Hours to debug | Minutes with correlated metrics |

Workshop Format

This module uses a scaffolded learning approach with two progressive exercises:

Exercise 1: Alerting Rules
├─ Alert rule structure
├─ PromQL expressions for alerts
├─ Severity levels and thresholds
└─ Time-based alert conditions

Exercise 2: Grafana Dashboard
├─ Datasource configuration
├─ Dashboard panel creation
├─ PromQL queries for visualizations
└─ Panel types and formats

What does "scaffolded" mean?

  • 80-90% of YAML is provided for you
  • You fill in ~10-20% (critical configurations and queries)
  • Focus on learning Prometheus/Grafana concepts
  • Each TODO has inline hints showing exactly what to use

Prerequisites

  • Completed Module 4 (API Gateway deployment)
  • Completed Module 3 (ML Service deployment)
  • kubectl configured
  • kind cluster running

Part 2: Exercises

1. Complete Exercises

Exercise 1: Alerting Rules

Goal: Create alerting rules for high error rates, latency, and service downtime.

code prometheus-alerts.yaml

Test alerts:

# Deploy alerts
kubectl apply -f prometheus-alerts.yaml

# Restart Prometheus to load rules
kubectl rollout restart deployment/prometheus

# View in UI
kubectl port-forward svc/prometheus 9090:9090
# Navigate to: Alerts tab

Exercise 2: Grafana Dashboard

Goal: Build a Grafana dashboard with panels for request rate, errors, latency, and resource usage.

code grafana-dashboard.yaml

Test dashboard:

kubectl apply -f grafana-dashboard.yaml
kubectl wait --for=condition=ready pod -l app=grafana --timeout=120s
kubectl port-forward svc/grafana 3000:3000
# Windows PowerShell: open browser
Start-Process "http://localhost:3000"
# macOS / Linux
open http://localhost:3000

Login: admin / admin → Dashboards → MLOps Workshop → MLOps Overview

Generate Traffic for Metrics

# macOS / Linux / WSL: port-forward, then generate traffic
kubectl port-forward svc/api-gateway-service 8080:80 &
for i in {1..100}; do curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"request": {"text": "Go is amazing!"}}'; done
# Windows PowerShell: run port-forward in a separate terminal first, then:
1..100 | ForEach-Object { Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body '{"request": {"text": "Go is amazing!"}}' }

Watch metrics update in Grafana: request rate, latency, and resource usage panels.


Part 3: Core Concepts

Key Concepts Covered

Prometheus Fundamentals

  • Scrape Model: Pull metrics from targets every 15s
  • Service Discovery: Automatically find pods to monitor
  • Relabeling: Filter and transform discovered targets
  • TSDB: Time-series database for efficient storage
  • PromQL: Query language for metrics
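
To make the scrape model more concrete, here is a minimal sketch of what Prometheus receives when it pulls a /metrics endpoint: plain text with one sample per line. The parser below is an illustration only, not the parser Prometheus uses. It ignores timestamps and escape sequences inside label values, and the sample values in SCRAPE_BODY are made up.

```python
import re

# One sample per line: name{label="value",...} value
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape body; the counter values are invented for illustration.
SCRAPE_BODY = """\
# HELP gateway_http_requests_total Total HTTP requests
# TYPE gateway_http_requests_total counter
gateway_http_requests_total{method="POST",endpoint="/predict",status="200"} 42
gateway_http_requests_total{method="POST",endpoint="/predict",status="500"} 3
"""


def parse_exposition(text: str) -> list:
    """Return (metric_name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE metadata and blank lines carry no samples
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_exposition(SCRAPE_BODY):
        print(name, labels, value)
```

Each scrape stores these values with the scrape timestamp, which is what later lets rate() compare a counter at two points in time.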

Kubernetes Service Discovery

kubernetes_sd_configs:
- role: pod
  namespaces:
    names:
    - default

relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
  action: keep
  regex: true

Pods opt-in with annotations:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
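
The opt-in mechanism can be sketched in a few lines. The example below imitates, under simplifying assumptions, how the keep action filters discovered pods: the annotation name is turned into a meta label (characters that are not valid in label names become underscores), and the target survives only if the label value fully matches the regex. The pod names and annotations in PODS are hypothetical.

```python
import re

# Hypothetical discovered pods: name -> annotations (illustrative data only).
PODS = {
    "api-gateway-7d9f": {
        "prometheus.io/scrape": "true",
        "prometheus.io/port": "8080",
        "prometheus.io/path": "/metrics",
    },
    "sentiment-api-5c2b": {"prometheus.io/scrape": "true"},
    "debug-shell": {},  # no annotations, so it should be dropped
}

SCRAPE_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"


def meta_label(annotation: str) -> str:
    """Build the meta label name for a pod annotation ('.' and '/' become '_')."""
    return "__meta_kubernetes_pod_annotation_" + re.sub(r"[^a-zA-Z0-9_]", "_", annotation)


def keep(labels: dict, source_label: str, regex: str) -> bool:
    """'keep' action: retain the target only if the whole label value matches the regex."""
    return re.fullmatch(regex, labels.get(source_label, "")) is not None


def select_targets(pods: dict) -> list:
    """Return the names of pods that survive the keep rule."""
    kept = []
    for name, annotations in pods.items():
        labels = {meta_label(key): value for key, value in annotations.items()}
        if keep(labels, SCRAPE_LABEL, "true"):
            kept.append(name)
    return kept


if __name__ == "__main__":
    print(select_targets(PODS))
```

A pod without the annotation produces an empty label value, which does not match the regex, so it is dropped before any scrape happens.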

Alert Rule Structure

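- alert: GatewayHighErrorRate
  # sum() on both sides divides all 5xx requests by all requests;
  # without it, each 5xx series is matched against itself and the ratio is always 1.
  expr: |
    sum(rate(gateway_http_requests_total{status=~"5.."}[5m]))
    / sum(rate(gateway_http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Error rate is {{ $value | humanizePercentage }}"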
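
The interaction between the threshold and the for clause is easier to see in a small simulation. The sketch below is a simplified model, not Prometheus code: simple_rate ignores counter resets and window extrapolation, and AlertState only tracks the inactive, pending, and firing states for a single rule. The class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


def simple_rate(samples: list) -> float:
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplified: the real rate() also handles counter resets and extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


@dataclass
class AlertState:
    threshold: float = 0.05      # mirrors "> 0.05" in the rule
    for_seconds: float = 120.0   # mirrors "for: 2m"
    active_since: Optional[float] = None

    def evaluate(self, now: float, error_ratio: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation tick."""
        if not error_ratio > self.threshold:  # also true for NaN (0/0 with no traffic)
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        held = now - self.active_since
        return "firing" if held >= self.for_seconds else "pending"


if __name__ == "__main__":
    # Hypothetical counters sampled 300 seconds apart: (timestamp, value).
    errors = simple_rate([(0, 10), (300, 40)])
    total = simple_rate([(0, 1000), (300, 1300)])
    print("error ratio:", errors / total)

    alert = AlertState()
    # Hypothetical error ratios seen at 30-second evaluation ticks.
    for tick, ratio in enumerate([0.01, 0.08, 0.09, 0.10, 0.12, 0.11, 0.02]):
        print(tick * 30, ratio, alert.evaluate(tick * 30.0, ratio))
```

The condition has to hold on every evaluation for the full for duration; a single evaluation below the threshold resets the timer, which is what filters out short spikes.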

PromQL Patterns

# Request rate (req/sec)
rate(metric[5m])

# Error ratio (multiply by 100 for a percentage)
sum(rate(errors[5m])) / sum(rate(requests[5m]))

# Latency percentiles
histogram_quantile(0.95, rate(metric_bucket[5m]))

# Service down
absent(up{job="service"} == 1)

# Resource usage
(usage / limit) > 0.9
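
The latency-percentile pattern relies on interpolation inside histogram buckets, which is why the result is an estimate. The following sketch mirrors the general idea of histogram_quantile for classic cumulative buckets, assuming 0 < q <= 1 and buckets that end with a +Inf bucket. It is not the Prometheus implementation, and the bucket counts in the demo are invented.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper bound,
    ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, cumulative = buckets[index]
    if math.isinf(upper):
        return buckets[-2][0]  # falls in the +Inf bucket: report the highest finite bound
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


if __name__ == "__main__":
    # Hypothetical request-duration buckets in seconds (le, cumulative count).
    demo = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    for q in (0.5, 0.95, 0.99):
        print(q, histogram_quantile(q, demo))
```

Because the estimate can never be finer than the bucket boundaries, bucket edges should sit close to the latency thresholds you alert on.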

Grafana Dashboard JSON

{
  "panels": [
    {
      "title": "Request Rate",
      "targets": [
        {
          "expr": "sum(rate(gateway_http_requests_total[5m]))",
          "legendFormat": "Requests/sec"
        }
      ]
    }
  ]
}
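
When a dashboard has many similar panels, generating the JSON can be less error-prone than editing it by hand. This is a minimal sketch that emits only the fields shown above (title, targets, expr, legendFormat). A real Grafana dashboard needs more fields, such as panel type, layout, and datasource, which are deliberately left out here. The helper names panel and dashboard are hypothetical.

```python
import json


def panel(title: str, expr: str, legend: str) -> dict:
    """Build one panel with a single PromQL target."""
    return {"title": title, "targets": [{"expr": expr, "legendFormat": legend}]}


def dashboard(panels: list) -> dict:
    """Wrap panels in the top-level structure used in this module's snippet."""
    return {"panels": panels}


if __name__ == "__main__":
    board = dashboard([
        panel("Request Rate",
              "sum(rate(gateway_http_requests_total[5m]))",
              "Requests/sec"),
        panel("P95 Latency",
              "histogram_quantile(0.95, "
              "rate(gateway_http_request_duration_seconds_bucket[5m]))",
              "P95"),
    ])
    print(json.dumps(board, indent=2))
```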

MLOps-Specific Metrics

Gateway Metrics (from Module 4):

gateway_http_requests_total{method,endpoint,status}
gateway_http_request_duration_seconds_bucket{le}
gateway_backend_requests_total{endpoint,status}
gateway_backend_request_duration_seconds_bucket{le}

ML Service Metrics (from BentoML):

bentoml_service_request_total
bentoml_service_request_duration_seconds

Kubernetes Metrics:

container_memory_usage_bytes
container_cpu_usage_seconds_total
kube_pod_status_phase
kube_horizontalpodautoscaler_status_current_replicas

Part 4: Testing & Production

Common Commands

# Prometheus
kubectl port-forward svc/prometheus 9090:9090

# Grafana
kubectl port-forward svc/grafana 3000:3000

# Check logs
kubectl logs -l app=prometheus
kubectl logs -l app=grafana
# Windows PowerShell: open browsers
Start-Process "http://localhost:9090"   # Prometheus → Status → Targets / Alerts
Start-Process "http://localhost:3000"   # Grafana login: admin/admin
# macOS / Linux
open http://localhost:9090
open http://localhost:3000

Part 5: Troubleshooting

Issue 1: Prometheus not scraping targets

Symptoms:

  • Prometheus UI → Status → Targets shows "0/0 up"
  • Service discovery finds pods but doesn't scrape them
  • Metrics not appearing in Prometheus

Root Cause: Missing pod annotations or incorrect relabel configuration

Step-by-step solution:

# macOS / Linux / WSL

# 1. Check service discovery (visit http://localhost:9090/service-discovery after port-forward)
kubectl port-forward svc/prometheus 9090:9090

# 2. Verify pod annotations exist
kubectl get pods -l app=api-gateway -o yaml | grep -A 3 "prometheus.io"

# 3. Add missing annotations to deployment (single line)
kubectl patch deployment api-gateway -p '{"spec":{"template":{"metadata":{"annotations":{"prometheus.io/scrape":"true","prometheus.io/port":"8080","prometheus.io/path":"/metrics"}}}}}'

# 4. Check Prometheus logs for scrape errors
kubectl logs -l app=prometheus | grep -i error
kubectl logs -l app=prometheus | grep "scrape"

# Windows PowerShell

# 2. Verify pod annotations
kubectl get pods -l app=api-gateway -o yaml | Select-String "prometheus.io" -Context 0,3

# 3. Add missing annotations to deployment (single line)
kubectl patch deployment api-gateway -p '{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"prometheus.io/scrape\":\"true\",\"prometheus.io/port\":\"8080\",\"prometheus.io/path\":\"/metrics\"}}}}}'

# 4. Check logs for scrape errors
kubectl logs -l app=prometheus | Select-String -Pattern "error","scrape" -SimpleMatch

Issue 2: Grafana shows "No Data"

Symptoms:

  • Dashboard panels show "No Data" message
  • Prometheus datasource shows green checkmark
  • Time range is set correctly

Root Cause: No metrics exist yet, or wrong PromQL query

Step-by-step solution:

  1. Grafana UI → Configuration → Data Sources → Prometheus → Save & Test → should show "Data source is working"
  2. Verify metrics exist: visit http://localhost:9090/graph and query gateway_http_requests_total
  3. If no metrics, generate traffic (run port-forward in a separate terminal first):
# macOS / Linux / WSL
for i in {1..20}; do curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"request": {"text": "Go is amazing!"}}'; sleep 1; done
# Windows PowerShell
1..20 | ForEach-Object { Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body '{"request": {"text": "Go is amazing!"}}'; Start-Sleep 1 }
  4. Wait 15-30 seconds for Prometheus to scrape
  5. Grafana → top-right time picker → Last 15 minutes
  6. Verify PromQL syntax in Prometheus UI first: rate(gateway_http_requests_total[5m])
  7. Panel → Edit → Query → Data source: Prometheus

Part 6: Reference

Commands Cheat Sheet

Prometheus Operations

# Deploy Prometheus
kubectl apply -f prometheus-config.yaml

# Check Prometheus deployment
kubectl get deployment prometheus
kubectl get pods -l app=prometheus
kubectl describe pod -l app=prometheus

# View Prometheus logs
kubectl logs -l app=prometheus
kubectl logs -l app=prometheus -f  # Follow logs
kubectl logs -l app=prometheus --previous  # Previous container

# Access Prometheus UI
kubectl port-forward svc/prometheus 9090:9090
# macOS / Linux: open http://localhost:9090
# Windows PowerShell: Start-Process "http://localhost:9090"

# Restart Prometheus
kubectl rollout restart deployment/prometheus
kubectl wait --for=condition=ready pod -l app=prometheus --timeout=120s

# Check Prometheus configuration
kubectl get configmap prometheus-config -o yaml

# Update configuration
kubectl apply -f prometheus-config.yaml
kubectl rollout restart deployment/prometheus

# Check Prometheus metrics about itself
# macOS / Linux / WSL:
curl http://localhost:9090/metrics
# Windows PowerShell:
# Invoke-RestMethod http://localhost:9090/metrics

# Verify scrape targets: Prometheus UI → Status → Targets
# Or via API (macOS / Linux / WSL):
curl http://localhost:9090/api/v1/targets
# Windows PowerShell:
# Invoke-RestMethod http://localhost:9090/api/v1/targets
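
The targets endpoint returns JSON, so the same check can be scripted. The sketch below groups active targets by health using the activeTargets, health, and scrapeUrl fields of the /api/v1/targets response. SAMPLE is a hand-written, abbreviated payload for illustration, and fetch_targets assumes the port-forward on localhost:9090 is running.

```python
import json
from urllib.request import urlopen

# Abbreviated, hand-written example of an /api/v1/targets response.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://10.0.0.5:8080/metrics", "health": "up"},
            {"scrapeUrl": "http://10.0.0.6:3000/metrics", "health": "down"},
        ],
        "droppedTargets": [],
    },
}


def summarize_targets(payload: dict) -> dict:
    """Group the scrape URLs of active targets by health ('up', 'down', 'unknown')."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        summary.setdefault(target["health"], []).append(target["scrapeUrl"])
    return summary


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the live target list; requires the Prometheus port-forward."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


if __name__ == "__main__":
    print(summarize_targets(SAMPLE))
    # With the port-forward running, use the live data instead:
    # print(summarize_targets(fetch_targets()))
```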

PromQL Queries

# Access Prometheus UI for queries
kubectl port-forward svc/prometheus 9090:9090
# macOS / Linux: open http://localhost:9090/graph
# Windows PowerShell: Start-Process "http://localhost:9090/graph"

# Common queries for ML services:

# Request rate (requests per second)
rate(gateway_http_requests_total[5m])
sum(rate(gateway_http_requests_total[5m]))

# Error rate (percentage)
sum(rate(gateway_http_requests_total{status=~"5.."}[5m]))
  / sum(rate(gateway_http_requests_total[5m])) * 100

# Request breakdown by endpoint
sum(rate(gateway_http_requests_total[5m])) by (endpoint)

# Request breakdown by status code
sum(rate(gateway_http_requests_total[5m])) by (status)

# P95 latency
histogram_quantile(0.95,
  rate(gateway_http_request_duration_seconds_bucket[5m]))

# P99 latency
histogram_quantile(0.99,
  rate(gateway_http_request_duration_seconds_bucket[5m]))

# ML inference latency
histogram_quantile(0.95,
  rate(gateway_backend_request_duration_seconds_bucket[5m]))

# Memory usage (bytes)
container_memory_usage_bytes{pod=~"api-gateway.*"}
container_memory_usage_bytes{pod=~"sentiment-api.*"}

# Memory usage (percentage)
(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100

# CPU usage
rate(container_cpu_usage_seconds_total{pod=~"api-gateway.*"}[5m])

# HPA replicas
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="sentiment-api-hpa"}
kube_horizontalpodautoscaler_status_desired_replicas{horizontalpodautoscaler="sentiment-api-hpa"}

# Pod status
kube_pod_status_phase{pod=~"api-gateway.*"}
kube_pod_status_phase{pod=~"sentiment-api.*"}

Generating Test Traffic

Run kubectl port-forward svc/api-gateway-service 8080:80 in a separate terminal first, then use the commands below.

Single request:

curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"request": {"text": "Go is amazing!"}}'
# Windows PowerShell
Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body '{"request": {"text": "Go is amazing!"}}'

Continuous traffic (light, 100 requests):

# macOS / Linux / WSL
for i in {1..100}; do curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d "{\"request\": {\"text\": \"Go is amazing! (request $i)\"}}"; sleep 0.1; done
# Windows PowerShell (backslash is not an escape character in PowerShell strings, so build the body by concatenation)
1..100 | ForEach-Object { Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body ('{"request": {"text": "Go is amazing! (request ' + $_ + ')"}}'); Start-Sleep -Milliseconds 100 }

Sustained load (heavy: loops forever, Ctrl+C to stop):

# macOS / Linux / WSL
while true; do for i in {1..10}; do curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"request": {"text": "Go is amazing!"}}'; done; sleep 1; done
# Windows PowerShell
while ($true) { 1..10 | ForEach-Object { Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body '{"request": {"text": "Go is amazing!"}}' }; Start-Sleep 1 }

Mixed traffic (success + errors):

# macOS / Linux / WSL
for i in {1..50}; do curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"request": {"text": "Go is amazing!"}}'; curl -X POST http://localhost:8080/predict -d 'invalid json'; done
# Windows PowerShell
1..50 | ForEach-Object { Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -ContentType "application/json" -Body '{"request": {"text": "Go is amazing!"}}' -ErrorAction SilentlyContinue; Invoke-RestMethod -Method Post -Uri "http://localhost:8080/predict" -Body 'invalid json' -ErrorAction SilentlyContinue }

Stop the port-forward:

# macOS / Linux / WSL
pkill -f "port-forward.*8080:80"
# Windows PowerShell: close the terminal running port-forward, or:
Stop-Process -Id (Get-NetTCPConnection -LocalPort 8080).OwningProcess -ErrorAction SilentlyContinue
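
If you prefer one cross-platform script over separate shell and PowerShell loops, the traffic generation above can be sketched with the Python standard library. The URL and the request body match the curl examples in this section. The function names and the status tally are illustrative additions, and a status of 0 is used here to mean the connection failed, for example because the port-forward is not running.

```python
import json
from urllib.error import HTTPError
from urllib.request import Request, urlopen

PREDICT_URL = "http://localhost:8080/predict"  # same endpoint as the curl examples


def build_request(text: str, url: str = PREDICT_URL) -> Request:
    """Build the POST request with the JSON body used throughout this module."""
    body = json.dumps({"request": {"text": text}}).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"}, method="POST")


def send(request: Request) -> int:
    """Send one request and return the HTTP status, or 0 if the connection failed."""
    try:
        with urlopen(request, timeout=5) as response:
            return response.status
    except HTTPError as err:
        return err.code  # 4xx/5xx responses still count as traffic
    except OSError:
        return 0  # refused connection, timeout, or DNS failure


def generate(count: int, sender=send) -> dict:
    """Send `count` requests and tally the responses by status code."""
    tally = {}
    for _ in range(count):
        status = sender(build_request("Go is amazing!"))
        tally[status] = tally.get(status, 0) + 1
    return tally


if __name__ == "__main__":
    print(generate(10))
```

The tally gives a quick sanity check before opening Grafana: if every request lands under 0, fix the port-forward first.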

Solution Files

If you get stuck, reference implementations are in solution/.

Note: Try to complete exercises on your own first!

Integration Examples

Integration with Module 4 (API Gateway)

The Go API Gateway from Module 4 exposes Prometheus metrics automatically:

Gateway metrics exposed:

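// modules/module-4/main.go
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Counter of all HTTP requests, labeled for per-endpoint and per-status queries.
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "gateway_http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    // Histogram of request durations; exposes the _bucket series used by histogram_quantile.
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "gateway_http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)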

Prometheus scrapes these automatically via annotations:

# modules/module-4/deployment.yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

Query gateway metrics in Grafana:

# Request rate by endpoint
sum(rate(gateway_http_requests_total[5m])) by (endpoint)

# Error rate
sum(rate(gateway_http_requests_total{status=~"5.."}[5m]))
  / sum(rate(gateway_http_requests_total[5m]))

# P95 latency
histogram_quantile(0.95,
  rate(gateway_http_request_duration_seconds_bucket[5m]))

Integration with Module 3 (ML Service)

BentoML services from Module 3 expose metrics automatically:

BentoML default metrics:

bentoml_service_request_total{endpoint, http_response_code, service_name, service_version}
bentoml_service_request_duration_seconds{endpoint, service_name, service_version}
bentoml_service_request_in_progress{endpoint, service_name, service_version}

Kubernetes resource metrics:

# Memory usage of ML service
container_memory_usage_bytes{pod=~"sentiment-api.*"}

# CPU usage
rate(container_cpu_usage_seconds_total{pod=~"sentiment-api.*"}[5m])

# HPA status
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="sentiment-api-hpa"}

Alert on ML service issues:

# prometheus-alerts.yaml
- alert: MLServiceDown
  expr: absent(up{job="ml-service"} == 1)
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "ML Service is down"
    description: "ML service has been unavailable for 1+ minutes"

- alert: MLInferenceLatencyHigh
  expr: |
    histogram_quantile(0.95,
      rate(gateway_backend_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ML inference latency high: {{ $value }}s"

Integration with Module 5 (Kubeflow Pipelines)

Monitor Kubeflow pipeline runs and model training metrics:

Pipeline execution metrics:

# Pipeline runs by status
count(argo_workflows_status) by (status)

# Pipeline duration
histogram_quantile(0.95, argo_workflow_duration_seconds_bucket)

# Failed pipelines
count(argo_workflows_status{status="Failed"})

Model training metrics (custom):

# modules/module-1/train.py
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Dedicated registry so only the training metrics are pushed
registry = CollectorRegistry()
training_accuracy = Gauge('model_training_accuracy',
                          'Model training accuracy',
                          registry=registry)
training_loss = Gauge('model_training_loss',
                      'Model training loss',
                      registry=registry)

# After training (accuracy and loss are assumed to be floats produced by the training loop)
training_accuracy.set(accuracy)
training_loss.set(loss)
# Batch jobs exit before Prometheus can scrape them, so push the final values instead
push_to_gateway('prometheus-pushgateway:9091',
                job='model-training',
                registry=registry)

Dashboard for ML lifecycle:

# Training-accuracy series held by the Pushgateway (one per grouping key, not a per-day job count)
count(model_training_accuracy{job="model-training"})

# Latest model accuracy
model_training_accuracy{job="model-training"}

# Model deployment count
count(kube_deployment_labels{deployment=~"sentiment-api.*"})

Production Considerations

Workshop vs Production

| Component | Workshop | Production |
|---|---|---|
| Deployment | Raw manifests | Helm (kube-prometheus-stack) |
| Storage | emptyDir (ephemeral) | PersistentVolumeClaim (50Gi+) |
| Retention | 7 days | 30+ days |
| Replicas | 1 (single pod) | 2+ with HA |
| Auth | Anonymous enabled | RBAC + OAuth |
| Alerting | No AlertManager | AlertManager + PagerDuty/Slack |
| TLS | HTTP only | HTTPS with cert-manager |

Next Steps

Once you've completed all exercises:

Extend monitoring:

  1. Add more alert rules (CPU throttling, disk space)
  2. Create custom Grafana dashboards
  3. Integrate with AlertManager
  4. Add Loki for log aggregation

Production deployment:

  1. Use Helm for easier management
  2. Configure persistent storage
  3. Enable authentication and TLS
  4. Set up alert routing (PagerDuty, Slack)

→ Workshop Complete! You've mastered the entire MLOps stack! 🎉

Key Takeaways

  • ✅ Metrics Collection - Automatic service discovery with Prometheus
  • ✅ Alerting - PromQL-based alerts for ML services
  • ✅ Visualization - Production dashboards with Grafana
  • ✅ MLOps Observability - Specific patterns for ML systems
  • ✅ Production Ready - Scalable monitoring architecture


Congratulations! You've completed the MLOps workshop and built a full production ML platform! 🎉

From model training (Module 1) to monitoring (Module 6), you now have hands-on experience with the entire MLOps lifecycle.

