console-1968-improve-most-valuable-alert-types #7935

Open

jonathanawesome wants to merge 109 commits into main from console-1968-improve-most-valuable-alert-types

Conversation

jonathanawesome (Member) commented Mar 30, 2026

This PR adds a metric-based alerting system to Console. Users can define rules that fire when traffic, reliability, or latency on a target breaches a threshold (fixed or % change) and get notified via the existing Slack/webhook/MS Teams channels. Ships behind a two-tier feature flag (cluster kill-switch + per-org enable, both default off) so we can selectively enroll our own org for validation without exposing the feature to other customers, then flip a single Pulumi config value to GA the feature for everyone.

What it ships

Backend

  • Three new Postgres tables: metric_alert_rules, metric_alert_incidents, metric_alert_state_log
  • GraphQL: MetricAlertRule + connection-paginated incidents/state-log, three mutations (add/update/deleteMetricAlertRules), Target queries
  • Workflows: evaluateMetricAlertRules cron (NORMAL → PENDING → FIRING → RECOVERING state machine with a confirmationMinutes hold; see the sketch after this list), purgeExpiredAlertStateLog cron, ClickHouse query helper, Slack/webhook/MS Teams notifier
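
A minimal sketch of the evaluator's transition logic, with illustrative names (not the actual metric-alert-evaluator.ts code); the symmetric hold on recovery is an assumption, since this description only spells out the hold for the firing path:

```ts
// Illustrative state machine for metric alert rules (names are assumptions).
type AlertState = 'NORMAL' | 'PENDING' | 'FIRING' | 'RECOVERING';

interface EvaluationTick {
  state: AlertState;
  breaching: boolean;          // did this tick's metric breach the threshold?
  minutesInState: number;      // time spent in the current state
  confirmationMinutes: number; // hold before a transition is confirmed
}

function nextState(tick: EvaluationTick): AlertState {
  const { state, breaching, minutesInState, confirmationMinutes } = tick;
  switch (state) {
    case 'NORMAL':
      return breaching ? 'PENDING' : 'NORMAL';
    case 'PENDING':
      // The breach must persist for confirmationMinutes before firing;
      // a single clean tick sends the rule straight back to NORMAL.
      if (!breaching) return 'NORMAL';
      return minutesInState >= confirmationMinutes ? 'FIRING' : 'PENDING';
    case 'FIRING':
      return breaching ? 'FIRING' : 'RECOVERING';
    case 'RECOVERING':
      // Assumed symmetric hold on the way down, to avoid flapping.
      if (breaching) return 'FIRING';
      return minutesInState >= confirmationMinutes ? 'NORMAL' : 'RECOVERING';
  }
}
```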

Frontend

  • Alerts area under Target with subroutes: Activity, Rules, Create, Detail
  • Detail page: rule conditions panel, state-transitions timeline bar, metric over-time chart, events table
  • Rules table: activity table with severity-bucketed activity chart
  • Notification preview: users see what Slack/webhook payloads look like before saving
  • Refactor of filter components (filter-dropdown → floating/filter-menu), plus new reusable base components (button, card, accordion, data-table, form, input, page-lead, description-list, select, etc.) with stories

Two-tier feature flag, mirroring the existing schemaProposals pattern

  • Cluster-wide kill switch (FEATURE_FLAGS_METRIC_ALERT_RULES_ENABLED): defaults off; flipping on enables the feature cluster-wide.
  • Per-org enable (organizations.feature_flags.metricAlertRules): defaults false; can be set via direct PG UPDATE to enable for specific orgs (no admin mutation in this PR; same operational pattern as how schemaProposals is enabled today).
  • OR semantics: enabled = clusterFlag || orgFlag, matching every existing flag in the codebase. The cluster flag short-circuits; when it's off, resolvers fall back to checking the org flag (sketched after this list).
  • Wired through Pulumi config so a single featureFlags:metricAlertRulesEnabled stack value flips both API and workflows.
  • Workflows cron filters rules whose org isn't enrolled when the cluster flag is off; runs everything when it's on. purgeExpiredAlertStateLog runs unconditionally so opted-in orgs' state-log tables stay bounded.
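
A minimal sketch of the gate described above; the function shape is illustrative, and the jsonb merge in the comment is an assumption about how organizations.feature_flags is stored:

```ts
// Illustrative OR-gate; the real checks live in alerts/resolvers/Target.ts,
// the mutation resolvers, and the workflows evaluator's SQL predicate.
function metricAlertRulesEnabled(args: {
  clusterFlag: boolean; // FEATURE_FLAGS_METRIC_ALERT_RULES_ENABLED
  orgFeatureFlags: { metricAlertRules?: boolean }; // organizations.feature_flags
}): boolean {
  // Cluster flag short-circuits; when it's off, fall back to the per-org flag.
  return args.clusterFlag || args.orgFeatureFlags.metricAlertRules === true;
}

// Enrolling a single org today is a direct Postgres update, e.g. (assuming a
// jsonb column, same operational pattern as schemaProposals):
//   UPDATE organizations
//   SET feature_flags = feature_flags || '{"metricAlertRules": true}'
//   WHERE id = '<org-id>';
```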

Seed script

scripts/seed-insights-and-alerts/ replaces the old seed-insights.mts with metric-first alert history (per-rule incident windows + matching ClickHouse ops + PG state-log rows)

Notable decisions

  • Connection-style incident pagination, not offset/limit. Matches the existing SavedFilterConnection / AccessTokenConnection. The cursor encodes started_at|id so it's stable under concurrent inserts (sketched after this list).
  • Cross-scope validation in mutations. Channels and saved filters must belong to the same project as the target. The DB FKs allowed cross-project references; explicit checks close that gap with structured error returns.
  • Plan-gated state-log retention. Hobby/Pro 7d, Enterprise 30d. expires_at is snapshotted at insert time so plan changes only affect new rows.
  • OR semantics for the feature flag, not AND. Considered AND for "extra safety against an accidental cluster flip" but rejected: every other flag in this codebase uses OR, AND would be a one-off divergence, and OR makes the GA flip a single Pulumi config change with no code follow-up.
  • Resolver-layer gate, not a manager-layer gate. schemaProposals and appDeployments put their gates in their respective managers; alerts has no manager class, so the gate sits in the resolver alongside the existing env-var check. Functionally identical, just at a different layer.
  • No new admin mutation for the per-org flag. Direct PG UPDATE is the same operational pattern used by every other flag in the codebase. A platform-wide admin surface for toggling org feature flags is a separate piece of work.
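
Two of these decisions are easiest to see in code: the started_at|id cursor and the insert-time expires_at snapshot. A minimal sketch with hypothetical helper names:

```ts
// 1. Connection cursor: encoding `started_at|id` keeps pagination stable
//    under concurrent inserts, because ties on started_at break on id.
function encodeIncidentCursor(startedAt: Date, id: string): string {
  return Buffer.from(`${startedAt.toISOString()}|${id}`).toString('base64url');
}

function decodeIncidentCursor(cursor: string): { startedAt: Date; id: string } {
  const [startedAt, id] = Buffer.from(cursor, 'base64url').toString('utf8').split('|');
  return { startedAt: new Date(startedAt), id };
}

// 2. Plan-gated retention: expires_at is computed once, at insert time, so a
//    later plan change only affects rows written after the change.
const RETENTION_DAYS = { HOBBY: 7, PRO: 7, ENTERPRISE: 30 } as const;

function stateLogExpiresAt(plan: keyof typeof RETENTION_DAYS, insertedAt: Date): Date {
  return new Date(insertedAt.getTime() + RETENTION_DAYS[plan] * 24 * 60 * 60 * 1000);
}
```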

Worth a closer look in review

A few sub-features that touch shared infrastructure or have cost / blast-radius implications and deserve focused attention:

  • Feature flag plumbing. alerts/resolvers/Target.ts, the three mutation files, and workflows/src/lib/metric-alert-evaluator.ts all have to agree on the OR-gate semantics. If a future contributor copies the resolver pattern but forgets to mirror the SQL predicate (or the other way around), you'd get a state where mutations are gated but the cron evaluates rules anyway, or the reverse. Worth confirming the four call sites stay consistent.
  • ClickHouse query cost. The evaluator runs evaluateMetricAlertRules every minute against operations_minutely / operations_hourly, batched by (targetId, timeWindowMinutes, savedFilterId). Worst case: every enabled rule with a unique grouping key is its own ClickHouse round-trip. Each query is light (covered by the (target, timestamp) index, returns 2 rows), but aggregate ClickHouse load is worth watching as we widen rollout. Once we've enabled a handful of orgs, it's worth checking whether system.query_log shows acceptable patterns (see the sketch after this list).
  • Workflows service now reaches ClickHouse. packages/services/workflows/src/lib/clickhouse-client.ts is a new dependency edge. The workflows service previously only talked to Postgres. New env vars (CLICKHOUSE_HOST, etc.) are wired through the workflows deployment in deployment/services/workflows.ts. Confirm the Pulumi stack actually has these set in non-dev environments before flipping the env var.
  • State-log retention. The metric_alert_state_log table is the highest-volume new table. Each rule transition writes a row, and the table has plan-gated TTL. The purgeExpiredAlertStateLog cron is the only thing keeping it bounded.
  • Notification fan-out is silent on partial failure. metric-alert-notifier.ts sends to each channel in sequence. If Slack succeeds and the webhook fails, today the failure is logged but doesn't surface to the user; they just get a partial notification. Future observability work will add metrics here, but in the meantime any review feedback on whether we should retry / surface failures more prominently is welcome.
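
For the ClickHouse load check, a sketch of what the one-off query could look like, assuming the standard @clickhouse/client API; the filter on the query text is an assumption, not code in this PR:

```ts
import { createClient } from '@clickhouse/client';

// One-off load check against system.query_log once a handful of orgs are
// enabled; run it against the cluster the workflows service talks to.
async function evaluatorQueryStats() {
  const clickhouse = createClient({ url: process.env.CLICKHOUSE_HOST });
  const result = await clickhouse.query({
    query: `
      SELECT
        count() AS queries,
        avg(query_duration_ms) AS avg_ms,
        max(query_duration_ms) AS max_ms,
        sum(read_rows) AS total_read_rows
      FROM system.query_log
      WHERE event_time > now() - INTERVAL 1 HOUR
        AND type = 'QueryFinish'
        AND query LIKE '%operations_minutely%'
    `,
    format: 'JSONEachRow',
  });
  console.table(await result.json());
}
```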

@jonathanawesome jonathanawesome marked this pull request as draft March 30, 2026 19:47
gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request outlines a proposal to enhance the alerts and notifications system by adding email support and introducing metric-based alerts for latency, error rates, and traffic. The review identifies several critical technical considerations for the implementation: the inability to run certain PostgreSQL type alterations within transactions, potential inaccuracies in ClickHouse metric calculations due to interval snapping, the need for zero-division handling in percentage change formulas, and a consistency error regarding the proposed evaluation frequency.
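
On the zero-division point specifically, a minimal sketch of one possible guard (illustrative, not the formula in the plan; treating a zero baseline as a flat 100% increase is just one policy choice):

```ts
// % change between the current window and the baseline window, guarding the
// baseline === 0 case instead of dividing by zero.
function percentChange(current: number, baseline: number): number {
  if (baseline === 0) {
    return current === 0 ? 0 : 100; // no traffic -> any traffic: treat as +100%
  }
  return ((current - baseline) / baseline) * 100;
}
```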

Four outdated comment threads on the .claude/plans/alerts-and-notifications-improvements.md plan files.
github-actions (Bot) commented Mar 30, 2026

🐋 This PR was built and pushed to the following Docker images:

Targets: build

Platforms: linux/amd64

Image Tag: aab4774221e07d3e3e399d6170edf29f29620a7e

github-actions (Bot) commented Mar 30, 2026

🚀 Snapshot Release (alpha)

The latest changes of this PR are available as alpha on npm (based on the declared changesets):

| Package | Version | Info |
| ------- | ------- | ---- |
| hive | 11.1.0-alpha-20260507011536-aab4774221e07d3e3e399d6170edf29f29620a7e | npm ↗︎ unpkg ↗︎ |

jonathanawesome (Member, Author) commented

> Plan-gated state-log retention. Hobby/Pro 7d, Enterprise 30d. expires_at is snapshotted at insert time so plan changes only affect new rows.

> This is an interesting decision. If an org has 30d retention and changes to 7d, there could be a gap in the logs 7–30 days after the subscription change.
>
> Long term, I feel like we need an "on subscription change" event that we can do a variety of things with.

Good idea. Follow-up issue added to Linear triage.

jonathanawesome (Member, Author) commented

> Notification fan-out is silent on partial failure. metric-alert-notifier.ts sends to each channel in sequence. If Slack succeeds and the webhook fails, today the failure is logged but doesn't surface to the user; they just get a partial notification. Future observability work will add metrics here, but in the meantime any review feedback on whether we should retry / surface failures more prominently is welcome.

> This is the first follow-up feature I'd address. Firing exactly once is always a challenge. Many of these notification platforms have idempotency built in, so we can fire at least once. But we should definitely not partially fulfil.

Follow-up issue created!
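
For reference, one hypothetical shape for that follow-up (illustrative types; not code in this PR): fan out in parallel, attach an idempotency key per (incident, channel) so retries are at-least-once safe, and surface partial failures instead of only logging them.

```ts
interface Channel {
  id: string;
  send: (payload: unknown, idempotencyKey: string) => Promise<void>;
}

async function notifyAll(incidentId: string, payload: unknown, channels: Channel[]) {
  const results = await Promise.allSettled(
    // The idempotency key makes a retried send a no-op on platforms that
    // support it, so retrying the whole fan-out is safe.
    channels.map(ch => ch.send(payload, `${incidentId}:${ch.id}`)),
  );
  const failed = channels.filter((_, i) => results[i].status === 'rejected');
  if (failed.length > 0) {
    // Surfacing the failure lets the UI warn about partial delivery and lets
    // a retry target only the channels that failed.
    throw new AggregateError(
      results.flatMap(r => (r.status === 'rejected' ? [r.reason] : [])),
      `Delivery failed for channels: ${failed.map(c => c.id).join(', ')}`,
    );
  }
}
```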

jonathanawesome and others added 27 commits May 5, 2026 15:28
