console-1968-improve-most-valuable-alert-types#7935
console-1968-improve-most-valuable-alert-types#7935jonathanawesome wants to merge 109 commits intomainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request outlines a proposal to enhance the alerts and notifications system by adding email support and introducing metric-based alerts for latency, error rates, and traffic. The review identifies several critical technical considerations for the implementation: the inability to run certain PostgreSQL type alterations within transactions, potential inaccuracies in ClickHouse metric calculations due to interval snapping, the need for zero-division handling in percentage change formulas, and a consistency error regarding the proposed evaluation frequency.
|
🐋 This PR was built and pushed to the following Docker images: Targets: Platforms: Image Tag: |
…ic alert evaluator
Good idea. Follow up issue added to Linear triage. |
Follow up issue created! |
…other alert-rule enums
…-panel.tsx Co-authored-by: jdolle <1841898+jdolle@users.noreply.github.com>
This PR adds a metric-based alerting system to Console. Users can define rules that fire when traffic, reliability, or latency on a target breaches a threshold (fixed or % change) and get notified via the existing Slack/webhook/MS Teams channels. Ships behind a two-tier feature flag (cluster kill-switch + per-org enable, both default off) so we can selectively enroll our own org for validation without exposing the feature to other customers, then flip a single Pulumi config value to GA the feature for everyone.
What it ships
Backend
metric_alert_rules,metric_alert_incidents,metric_alert_state_logevaluateMetricAlertRulescron (NORMALtoPENDINGtoFIRINGtoRECOVERINGstate machine with confirmationMinutes hold),purgeExpiredAlertStateLogcron, ClickHouse query helper, Slack/webhook/MS Teams notifierFrontend
Two-tier feature flag, mirroring the existing schemaProposals pattern
FEATURE_FLAGS_METRIC_ALERT_RULES_ENABLED): defaults off; flipping on enables the feature cluster-wide.organizations.feature_flags.metricAlertRules): defaultsfalse; can be set via direct PG UPDATE to enable for specific orgs (no admin mutation in this PR; same operational pattern as how schemaProposals is enabled today).featureFlags:metricAlertRulesEnabledstack value flips both API and workflows.purgeExpiredAlertStateLogruns unconditionally so opted-in orgs' state-log tables stay bounded.Seed script
scripts/seed-insights-and-alerts/replaces the oldseed-insights.mtswith metric-first alert history (per-rule incident windows + matching ClickHouse ops + PG state-log rows)Notable decisions
SavedFilterConnection/AccessTokenConnection. Cursor encodesstarted_at|idso it's stable under concurrent inserts.expires_atis snapshotted at insert time so plan changes only affect new rows.schemaProposalsandappDeploymentsput their gates in their respective managers; alerts has no manager class, so the gate sits in the resolver alongside the existing env-var check. Functionally identical, just at a different layer.Worth a closer look in review
A few sub-features that touch shared infrastructure or have cost / blast-radius implications and deserve focused attention:
alerts/resolvers/Target.ts, the three mutation files, andworkflows/src/lib/metric-alert-evaluator.tsall have to agree on the OR-gate semantics. If a future contributor copies the resolver pattern but forgets to mirror the SQL predicate (or vice versa), you'd get a state where mutations are gated but the cron evaluates rules anyway, or vice versa. Worth confirming the four call sites are consistent.evaluateMetricAlertRulesevery minute againstoperations_minutely/operations_hourly, batched by (targetId,timeWindowMinutes,savedFilterId). Worst case: every enabled rule with a unique grouping key is its own ClickHouse round-trip. The query is light (covered by the (target, timestamp) index, returns 2 rows), but it's worth keeping an eye on aggregate ClickHouse load when we widen rollout. Consider checking once we've enabled for a handful of orgs whethersystem.query_logshows acceptable patterns.packages/services/workflows/src/lib/clickhouse-client.tsis a new dependency edge. The workflows service previously only talked to Postgres. New env vars (CLICKHOUSE_HOST, etc.) are wired through the workflows deployment indeployment/services/workflows.ts. Confirm the Pulumi stack actually has these set in non-dev environments before flipping the env var.metric_alert_state_logtable is the highest-volume new table. Each rule transition writes a row, and the table has plan-gated TTL. ThepurgeExpiredAlertStateLogcron is the only thing keeping it bounded.metric-alert-notifier.tssends to each channel in sequence. If Slack succeeds and the webhook fails, today the failure is logged but doesn't surface to the user...they just get a partial notification. Future observability work will add metrics here, but in the meantime any review feedback on whether we should retry / surface failures more prominently is welcome.