Skip to content

Latest commit

 

History

History
216 lines (165 loc) · 6.21 KB

File metadata and controls

216 lines (165 loc) · 6.21 KB

AbsurderSQL Production Monitoring Stack

Complete observability setup for production deployments using AbsurderSQL with --features telemetry.

Overview

The monitoring stack provides:

  • 4 Grafana Dashboards with 28 panels for real-time visibility
  • 18 Alert Rules with severity-based routing
  • 26 Recording Rules for pre-aggregated metrics
  • Complete Runbooks for every alert type
  • Browser DevTools Extension for WASM telemetry debugging

Quick Start

1. Enable Telemetry

Build AbsurderSQL with telemetry support:

# For native applications
cargo build --features telemetry

# For WASM/browser
wasm-pack build --target web --features telemetry

2. Expose Metrics Endpoint

Add a /metrics endpoint to your application. See examples in the main README for axum and actix-web.

3. Configure Prometheus

Add this job to your prometheus.yml:

scrape_configs:
  - job_name: 'absurdersql'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s

4. Import Grafana Dashboards

Import the JSON files from grafana/ into your Grafana instance:

  • query_performance.json - Query metrics and slow query detection
  • storage_operations.json - Block I/O and cache performance
  • system_health.json - Error rates and system status
  • multi_tab_coordination.json - Multi-tab sync debugging

5. Load Alert Rules

Add prometheus/alert_rules.yml to your Prometheus configuration:

rule_files:
  - /path/to/absurdersql/monitoring/prometheus/alert_rules.yml

6. Install Browser DevTools Extension (Optional)

For WASM/browser deployments:

  1. Open Chrome: chrome://extensions/
  2. Enable "Developer mode"
  3. Click "Load unpacked"
  4. Select ../browser-extension/ directory

See ../browser-extension/INSTALLATION.md for complete instructions.

Dashboards

Query Performance Dashboard

Monitors SQL query execution and identifies performance bottlenecks:

  • Query latency percentiles (p50, p90, p99)
  • Query rate and error rate
  • Slow query detection (>100ms)
  • Query type breakdown (SELECT/INSERT/UPDATE/DELETE)

Storage Operations Dashboard

Tracks block storage performance and cache efficiency:

  • Block read/write rates
  • Cache hit ratio and effectiveness
  • Storage layer latency
  • Block allocation metrics

System Health Dashboard

Overall system health and error tracking:

  • Total error rate by type
  • Transaction success rate
  • Active connections
  • Resource utilization

Multi-Tab Coordination Dashboard

Debugging multi-tab synchronization:

  • Leader election status
  • Write queue depth
  • Broadcast channel activity
  • Sync operation latency

Alert Rules

Critical Alerts (Page Database Team)

  • HighErrorRate - >5% errors over 5 minutes
  • ExtremeQueryLatency - p99 query time >1 second
  • NoQueryThroughput - Zero queries for 5 minutes (potential deadlock)
  • StorageFailures - >3 storage failures per minute

Warning Alerts (Notify Database Team)

  • ElevatedErrorRate - >2% errors over 5 minutes
  • SlowQueryDetected - Queries consistently >100ms
  • LowCacheHitRate - Cache hit rate <60%
  • MultiTabSyncDelayed - Sync operations >500ms

Info Alerts (Log Only)

  • LeaderElected - New tab became leader
  • FirstQueryExecuted - Database initialization

See RUNBOOK.md for detailed remediation procedures for each alert.

Recording Rules

Pre-aggregated metrics for faster dashboard queries:

  • absurdersql:query_rate - Queries per second
  • absurdersql:error_rate - Errors per second
  • absurdersql:error_ratio - Error percentage
  • absurdersql:query_latency_p99 - 99th percentile latency
  • absurdersql:cache_hit_ratio - Cache effectiveness
  • And 21 more...

Browser DevTools Extension

For debugging WASM telemetry in the browser:

Features:

  • Real-time span list with search/filtering
  • Export statistics visualization
  • Buffer inspection
  • Manual flush trigger
  • OTLP endpoint configuration

Installation: See ../browser-extension/INSTALLATION.md

Architecture

Application Code
       ↓
  [AbsurderSQL]
       ↓
  Observability Layer (--features telemetry)
       ↓
   ┌───┴───┐
   ↓       ↓
Prometheus  OpenTelemetry
   ↓       ↓
Grafana  DevTools Extension
   ↓
 Alerts
   ↓
Runbook Procedures

Files

monitoring/
├── README.md                      # This file
├── RUNBOOK.md                     # Alert remediation procedures
├── grafana/                       # Grafana dashboards
│   ├── query_performance.json     # 7 panels
│   ├── storage_operations.json    # 6 panels
│   ├── system_health.json         # 8 panels
│   └── multi_tab_coordination.json # 7 panels
└── prometheus/                    # Prometheus configuration
    └── alert_rules.yml            # 18 alerts + 26 recording rules

Next Steps

  1. Test Setup - Use ../examples/devtools_demo.html to generate test telemetry
  2. Configure Alertmanager - Set up alert routing and notification channels
  3. Tune Thresholds - Adjust alert thresholds based on your workload
  4. Monitor Dashboard - Regularly review dashboards for anomalies
  5. Follow Runbooks - Use RUNBOOK.md when alerts fire

Troubleshooting

Metrics Not Appearing

  • Verify --features telemetry was enabled during build
  • Check /metrics endpoint is accessible
  • Confirm Prometheus is scraping (check Prometheus UI targets page)

Dashboards Showing No Data

  • Verify Prometheus datasource is configured in Grafana
  • Check time range selector (default: last 1 hour)
  • Ensure application is generating queries

Alerts Not Firing

  • Check Prometheus Rules page to see if rules are loaded
  • Verify alert expressions evaluate to true (test in Prometheus UI)
  • Confirm Alertmanager is configured

DevTools Extension Not Receiving Data

  • Verify extension is loaded (check chrome://extensions/)
  • Open browser console and look for [Content], [DevTools], [Panel] logs
  • Ensure demo page is using window.postMessage to send telemetry

Support

For issues or questions:

  1. Check RUNBOOK.md for common problems
  2. Review dashboard panels for debugging clues
  3. Check Prometheus logs for scraping errors
  4. Inspect DevTools extension console for message flow