Observability

Philosophy

"If you can't measure it, you can't improve it. If you can't debug it in 5 minutes, your logging is broken."

Three Pillars

1. Logs (What happened?)

2. Metrics (How much happened?)

3. Traces (Where did time go?)

Logs

Structure

All logs are JSON for machine parsing:

{
  "timestamp": "2026-01-26T10:30:45.123Z",
  "level": "info",
  "correlationId": "req-abc123-xyz789",
  "service": "flexgate-proxy",
  "event": "request.completed",
  "http": {
    "method": "GET",
    "path": "/api/users/123",
    "statusCode": 200,
    "clientIp": "192.168.1.100",
    "userAgent": "Mozilla/5.0..."
  },
  "upstream": {
    "host": "api.backend.com",
    "duration": 145,
    "statusCode": 200
  },
  "duration": 152,
  "metadata": {
    "userId": "user-456",
    "route": "api-users"
  }
}

Log Levels

Level	Use Case	Example
`error`	Request failed, upstream down, config invalid	`"Upstream timeout after 30s"`
`warn`	Rate limit hit, slow response, auth retry	`"Request exceeded rate limit"`
`info`	Every request/response (sampled in prod)	`"Request completed"`
`debug`	Headers, body (dev only)	`"Request headers: {...}"`

Sampling Strategy

Production: Log 100% of errors, 10% of successful requests (configurable)

logging:
  level: "info"
  sampling:
    enabled: true
    successRate: 0.1  # 10% of 2xx responses
    errorRate: 1.0    # 100% of errors

Why: At 5K req/sec, logging everything = 10GB/hour. Sampling preserves costs while keeping error visibility.

Correlation IDs

Every request gets a unique ID:

X-Correlation-ID: req-abc123-xyz789

Propagated to:

Upstream requests (forwarded header)
All log entries for that request
Error responses
Metrics tags

Debugging flow:

User reports error
Get correlation ID from response
grep correlationId=req-abc123 across all logs
See full request path

What We Log

✅ Always Log

Request start/end
Authentication attempts
Rate limit hits
Upstream failures
Config reloads
Circuit breaker state changes

⚠️ Never Log (Security)

API keys / tokens
Authorization headers
Password fields
Credit card numbers
PII (email, phone unless masked)

🔧 Debug Only

Request/response bodies
Full headers
DNS resolution details

Log Rotation

logging:
  file: "/var/log/proxy/access.log"
  maxSize: "100MB"
  maxFiles: 10
  compress: true

Metrics

Key Metrics

Request Metrics

# Request rate
http_requests_total{method="GET", route="/api/users", status="200"}

# Latency histogram
http_request_duration_ms{route="/api/users", quantile="0.95"}

# Request size
http_request_size_bytes{route="/api/users"}

Upstream Metrics

# Upstream response time
upstream_duration_ms{upstream="api.backend.com", quantile="0.99"}

# Upstream errors
upstream_errors_total{upstream="api.backend.com", error="timeout"}

# Connection pool
upstream_connections_active{upstream="api.backend.com"}

System Metrics

# Memory usage
process_memory_bytes{type="heapUsed"}

# Event loop lag
event_loop_lag_ms{quantile="0.99"}

# Active connections
tcp_connections_active

Business Metrics

# Rate limit hits
rate_limit_exceeded_total{route="/api/users", client="192.168.1.100"}

# Auth failures
auth_failures_total{reason="invalid_key"}

# Circuit breaker trips
circuit_breaker_opened_total{upstream="api.backend.com"}

Percentiles We Track

P50: Typical user experience
P95: 1 in 20 requests (bad but tolerable)
P99: 1 in 100 requests (red flag)
P99.9: Worst-case outliers

Why we care about P99:

P50/P95 can look great while P99 is terrible
Users remember the slow requests, not the fast ones

Metrics Exposition

Prometheus format at /metrics:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/users",status="200"} 12543

# HELP http_request_duration_ms Request duration in milliseconds
# TYPE http_request_duration_ms histogram
http_request_duration_ms_bucket{route="/api/users",le="10"} 8234
http_request_duration_ms_bucket{route="/api/users",le="50"} 12000
http_request_duration_ms_bucket{route="/api/users",le="100"} 12500
http_request_duration_ms_sum{route="/api/users"} 567890
http_request_duration_ms_count{route="/api/users"} 12543

Health Checks

Liveness Probe

Endpoint: GET /health/live
Purpose: Is the process running?

{
  "status": "UP",
  "timestamp": "2026-01-26T10:30:45.123Z"
}

Returns:

200 if server is alive
503 if shutting down

Use: Kubernetes liveness probe (restart if failing)

Readiness Probe

Endpoint: GET /health/ready
Purpose: Can it handle traffic?

{
  "status": "UP",
  "checks": {
    "config": "UP",
    "upstreams": "UP",
    "rateLimiter": "UP"
  },
  "timestamp": "2026-01-26T10:30:45.123Z"
}

Returns:

200 if ready to serve traffic
503 if not ready (still starting, upstreams down, etc.)

Use: Kubernetes readiness probe (remove from load balancer if failing)

Deep Health Check

Endpoint: GET /health/deep (internal only)
Purpose: Detailed diagnostics

{
  "status": "DEGRADED",
  "uptime": 86400,
  "checks": {
    "config": {
      "status": "UP",
      "lastReload": "2026-01-26T08:00:00Z"
    },
    "upstreams": {
      "status": "DEGRADED",
      "details": {
        "api.backend.com": {
          "status": "UP",
          "latency": 45
        },
        "legacy.backend.com": {
          "status": "DOWN",
          "error": "Connection timeout"
        }
      }
    },
    "memory": {
      "status": "UP",
      "heapUsed": 156782592,
      "heapTotal": 268435456,
      "percentUsed": 58.4
    },
    "eventLoop": {
      "status": "UP",
      "lag": 12
    }
  }
}

Alerts

What We Alert On

🚨 Critical (Page On-Call)

- name: "Proxy Down"
  condition: up == 0
  duration: 1m
  severity: critical
  
- name: "High Error Rate"
  condition: rate(http_requests_total{status=~"5.."}[5m]) > 0.05  # >5% errors
  duration: 2m
  severity: critical

- name: "Upstream Completely Down"
  condition: upstream_errors_total{error="connection_refused"} > 100
  duration: 1m
  severity: critical

⚠️ Warning (Investigate During Business Hours)

- name: "High Latency"
  condition: http_request_duration_ms{quantile="0.99"} > 1000  # P99 > 1s
  duration: 5m
  severity: warning

- name: "Memory Usage High"
  condition: process_memory_bytes / node_memory_total > 0.8  # >80% memory
  duration: 5m
  severity: warning

- name: "Rate Limit Hit Frequently"
  condition: rate(rate_limit_exceeded_total[5m]) > 10
  duration: 5m
  severity: warning

📊 Info (Track, Don't Alert)

Circuit breaker opened/closed
Config reloaded
Upstream degraded (but not down)
Event loop lag spikes

What We DON'T Alert On

❌ Individual Request Failures

Why: Normal noise. Alert on error rate, not individual errors.

❌ Low Traffic

Why: Might be expected (off-hours, maintenance window).

❌ Slow Upstream

Why: We track it, but alert on circuit breaker instead.

Dashboards

Overview Dashboard

RPS: Requests per second (by route, status)
Latency: P50, P95, P99 (proxy + upstream)
Error Rate: % of 5xx responses
Active Connections: Current TCP connections

Upstream Health Dashboard

Upstream Latency: P95/P99 per upstream
Error Rate: By upstream and error type
Circuit Breaker: State per upstream
Connection Pool: Active/idle connections

Security Dashboard

Auth Failures: By client IP
Rate Limit Hits: By route and client
Blocked Requests: SSRF, large payloads, etc.

Debugging Playbook

Symptom: High P99 Latency

Check upstream latency metrics → Is it the backend?
Check event loop lag → Node.js blocking?
Check GC pauses → Memory pressure?
Check request size → Large payloads?
Enable debug logging for slow requests

Symptom: 5xx Errors

Get correlation ID from logs
Check upstream status → Down? Timeout?
Check circuit breaker state → Open?
Check auth failures → Invalid keys?
Review error logs for stack traces

Symptom: Memory Leak

Check heap size over time
Enable heap dump on threshold
Analyze with Chrome DevTools
Look for connection leaks (unclosed sockets)

Log Aggregation

Recommended Stack

Fluentd/Vector: Log shipping
Elasticsearch: Storage + search
Kibana: Visualization
Grafana: Metrics + logs in one place

Query Examples

Find all errors for a user:

level:error AND metadata.userId:"user-456"

Find slow requests:

duration:>1000 AND event:"request.completed"

Find rate limit hits:

event:"rate_limit.exceeded" AND http.path:"/api/users"

Cost Optimization

Log Storage Costs

At 5K req/sec with 10% sampling:

500 requests/sec logged
~1KB per log entry
= 500 KB/sec = 1.3 TB/month

Mitigations:

Sampling (already doing)
Retention policy (7 days hot, 30 days cold, 90 days archived)
Compress old logs (gzip → 10:1 compression)

Metrics Costs

With 50 unique metrics × 10 labels each:

~500 time series
15s scrape interval
= 2.88M samples/day