console-1968-improve-most-valuable-alert-types #7935

Open

jonathanawesome wants to merge 109 commits into main from console-1968-improve-most-valuable-alert-types

Conversation

jonathanawesome (Member) commented Mar 30, 2026

This PR adds a metric-based alerting system to Console. Users can define rules that fire when traffic, reliability, or latency on a target breaches a threshold (fixed or % change) and get notified via the existing Slack/webhook/MS Teams channels. Ships behind a two-tier feature flag (cluster kill-switch + per-org enable, both default off) so we can selectively enroll our own org for validation without exposing the feature to other customers, then flip a single Pulumi config value to GA the feature for everyone.

What it ships

Backend

  • Three new Postgres tables: metric_alert_rules, metric_alert_incidents, metric_alert_state_log
  • GraphQL: MetricAlertRule + connection-paginated incidents/state-log, three mutations (add/update/deleteMetricAlertRules), Target queries
  • Workflows: evaluateMetricAlertRules cron (NORMAL → PENDING → FIRING → RECOVERING state machine with a confirmationMinutes hold; see the sketch after this list), purgeExpiredAlertStateLog cron, ClickHouse query helper, Slack/webhook/MS Teams notifier
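
A minimal sketch of the evaluator's transition logic, with illustrative names (not the actual metric-alert-evaluator.ts code); the symmetric hold on recovery is an assumption, since this description only spells out the hold for the firing path:

```ts
// Illustrative state machine for metric alert rules (names are assumptions).
type AlertState = 'NORMAL' | 'PENDING' | 'FIRING' | 'RECOVERING';

interface EvaluationTick {
  state: AlertState;
  breaching: boolean;          // did this tick's metric breach the threshold?
  minutesInState: number;      // time spent in the current state
  confirmationMinutes: number; // hold before a transition is confirmed
}

function nextState(tick: EvaluationTick): AlertState {
  const { state, breaching, minutesInState, confirmationMinutes } = tick;
  switch (state) {
    case 'NORMAL':
      return breaching ? 'PENDING' : 'NORMAL';
    case 'PENDING':
      // The breach must persist for confirmationMinutes before firing;
      // a single clean tick sends the rule straight back to NORMAL.
      if (!breaching) return 'NORMAL';
      return minutesInState >= confirmationMinutes ? 'FIRING' : 'PENDING';
    case 'FIRING':
      return breaching ? 'FIRING' : 'RECOVERING';
    case 'RECOVERING':
      // Assumed symmetric hold on the way down, to avoid flapping.
      if (breaching) return 'FIRING';
      return minutesInState >= confirmationMinutes ? 'NORMAL' : 'RECOVERING';
  }
}
```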

Frontend

  • Alerts area under Target with subroutes: Activity, Rules, Create, Detail
  • Detail page: rule conditions panel, state-transitions timeline bar, metric over-time chart, events table
  • Rules table: activity table with severity-bucketed activity chart
  • Notification preview: users see what Slack/webhook payloads look like before saving
  • Refactor of filter components (filter-dropdown → floating/filter-menu), plus new reusable base components (button, card, accordion, data-table, form, input, page-lead, description-list, select, etc.) with stories

Two-tier feature flag, mirroring the existing schemaProposals pattern

  • Cluster-wide kill switch (FEATURE_FLAGS_METRIC_ALERT_RULES_ENABLED): defaults off; flipping on enables the feature cluster-wide.
  • Per-org enable (organizations.feature_flags.metricAlertRules): defaults false; can be set via direct PG UPDATE to enable for specific orgs (no admin mutation in this PR; same operational pattern as how schemaProposals is enabled today).
  • OR semantics: enabled = clusterFlag || orgFlag, matching every existing flag in the codebase. The cluster flag short-circuits; when it's off, resolvers fall back to checking the org flag (sketched after this list).
  • Wired through Pulumi config so a single featureFlags:metricAlertRulesEnabled stack value flips both API and workflows.
  • Workflows cron filters rules whose org isn't enrolled when the cluster flag is off; runs everything when it's on. purgeExpiredAlertStateLog runs unconditionally so opted-in orgs' state-log tables stay bounded.
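
A minimal sketch of the gate described above; the function shape is illustrative, and the jsonb merge in the comment is an assumption about how organizations.feature_flags is stored:

```ts
// Illustrative OR-gate; the real checks live in alerts/resolvers/Target.ts,
// the mutation resolvers, and the workflows evaluator's SQL predicate.
function metricAlertRulesEnabled(args: {
  clusterFlag: boolean; // FEATURE_FLAGS_METRIC_ALERT_RULES_ENABLED
  orgFeatureFlags: { metricAlertRules?: boolean }; // organizations.feature_flags
}): boolean {
  // Cluster flag short-circuits; when it's off, fall back to the per-org flag.
  return args.clusterFlag || args.orgFeatureFlags.metricAlertRules === true;
}

// Enrolling a single org today is a direct Postgres update, e.g. (assuming a
// jsonb column, same operational pattern as schemaProposals):
//   UPDATE organizations
//   SET feature_flags = feature_flags || '{"metricAlertRules": true}'
//   WHERE id = '<org-id>';
```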

Seed script

scripts/seed-insights-and-alerts/ replaces the old seed-insights.mts with metric-first alert history (per-rule incident windows + matching ClickHouse ops + PG state-log rows)

Notable decisions

  • Connection-style incident pagination, not offset/limit. Matches the existing SavedFilterConnection / AccessTokenConnection. The cursor encodes started_at|id so it's stable under concurrent inserts (sketched after this list).
  • Cross-scope validation in mutations. Channels and saved filters must belong to the same project as the target. The DB FKs allowed cross-project references; explicit checks close that gap with structured error returns.
  • Plan-gated state-log retention. Hobby/Pro 7d, Enterprise 30d. expires_at is snapshotted at insert time so plan changes only affect new rows.
  • OR semantics for the feature flag, not AND. Considered AND for "extra safety against an accidental cluster flip" but rejected: every other flag in this codebase uses OR, AND would be a one-off divergence, and OR makes the GA flip a single Pulumi config change with no code follow-up.
  • Resolver-layer gate, not a manager-layer gate. schemaProposals and appDeployments put their gates in their respective managers; alerts has no manager class, so the gate sits in the resolver alongside the existing env-var check. Functionally identical, just at a different layer.
  • No new admin mutation for the per-org flag. Direct PG UPDATE is the same operational pattern used by every other flag in the codebase. A platform-wide admin surface for toggling org feature flags is a separate piece of work.
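
Two of these decisions are easiest to see in code: the started_at|id cursor and the insert-time expires_at snapshot. A minimal sketch with hypothetical helper names:

```ts
// 1. Connection cursor: encoding `started_at|id` keeps pagination stable
//    under concurrent inserts, because ties on started_at break on id.
function encodeIncidentCursor(startedAt: Date, id: string): string {
  return Buffer.from(`${startedAt.toISOString()}|${id}`).toString('base64url');
}

function decodeIncidentCursor(cursor: string): { startedAt: Date; id: string } {
  const [startedAt, id] = Buffer.from(cursor, 'base64url').toString('utf8').split('|');
  return { startedAt: new Date(startedAt), id };
}

// 2. Plan-gated retention: expires_at is computed once, at insert time, so a
//    later plan change only affects rows written after the change.
const RETENTION_DAYS = { HOBBY: 7, PRO: 7, ENTERPRISE: 30 } as const;

function stateLogExpiresAt(plan: keyof typeof RETENTION_DAYS, insertedAt: Date): Date {
  return new Date(insertedAt.getTime() + RETENTION_DAYS[plan] * 24 * 60 * 60 * 1000);
}
```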

Worth a closer look in review

A few sub-features that touch shared infrastructure or have cost / blast-radius implications and deserve focused attention:

  • Feature flag plumbing. alerts/resolvers/Target.ts, the three mutation files, and workflows/src/lib/metric-alert-evaluator.ts all have to agree on the OR-gate semantics. If a future contributor copies the resolver pattern but forgets to mirror the SQL predicate (or the other way around), you'd get a state where mutations are gated but the cron evaluates rules anyway, or the reverse. Worth confirming the four call sites stay consistent.
  • ClickHouse query cost. The evaluator runs evaluateMetricAlertRules every minute against operations_minutely / operations_hourly, batched by (targetId, timeWindowMinutes, savedFilterId). Worst case: every enabled rule with a unique grouping key is its own ClickHouse round-trip. Each query is light (covered by the (target, timestamp) index, returns 2 rows), but aggregate ClickHouse load is worth watching as we widen rollout. Once we've enabled a handful of orgs, it's worth checking whether system.query_log shows acceptable patterns (see the sketch after this list).
  • Workflows service now reaches ClickHouse. packages/services/workflows/src/lib/clickhouse-client.ts is a new dependency edge. The workflows service previously only talked to Postgres. New env vars (CLICKHOUSE_HOST, etc.) are wired through the workflows deployment in deployment/services/workflows.ts. Confirm the Pulumi stack actually has these set in non-dev environments before flipping the env var.
  • State-log retention. The metric_alert_state_log table is the highest-volume new table. Each rule transition writes a row, and the table has plan-gated TTL. The purgeExpiredAlertStateLog cron is the only thing keeping it bounded.
  • Notification fan-out is silent on partial failure. metric-alert-notifier.ts sends to each channel in sequence. If Slack succeeds and the webhook fails, today the failure is logged but doesn't surface to the user; they just get a partial notification. Future observability work will add metrics here, but in the meantime any review feedback on whether we should retry / surface failures more prominently is welcome.
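
For the ClickHouse load check, a sketch of what the one-off query could look like, assuming the standard @clickhouse/client API; the filter on the query text is an assumption, not code in this PR:

```ts
import { createClient } from '@clickhouse/client';

// One-off load check against system.query_log once a handful of orgs are
// enabled; run it against the cluster the workflows service talks to.
async function evaluatorQueryStats() {
  const clickhouse = createClient({ url: process.env.CLICKHOUSE_HOST });
  const result = await clickhouse.query({
    query: `
      SELECT
        count() AS queries,
        avg(query_duration_ms) AS avg_ms,
        max(query_duration_ms) AS max_ms,
        sum(read_rows) AS total_read_rows
      FROM system.query_log
      WHERE event_time > now() - INTERVAL 1 HOUR
        AND type = 'QueryFinish'
        AND query LIKE '%operations_minutely%'
    `,
    format: 'JSONEachRow',
  });
  console.table(await result.json());
}
```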

@jonathanawesome jonathanawesome marked this pull request as draft March 30, 2026 19:47
gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request outlines a proposal to enhance the alerts and notifications system by adding email support and introducing metric-based alerts for latency, error rates, and traffic. The review identifies several critical technical considerations for the implementation: the inability to run certain PostgreSQL type alterations within transactions, potential inaccuracies in ClickHouse metric calculations due to interval snapping, the need for zero-division handling in percentage change formulas, and a consistency error regarding the proposed evaluation frequency.
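
On the zero-division point specifically, a minimal sketch of one possible guard (illustrative, not the formula in the plan; treating a zero baseline as a flat 100% increase is just one policy choice):

```ts
// % change between the current window and the baseline window, guarding the
// baseline === 0 case instead of dividing by zero.
function percentChange(current: number, baseline: number): number {
  if (baseline === 0) {
    return current === 0 ? 0 : 100; // no traffic -> any traffic: treat as +100%
  }
  return ((current - baseline) / baseline) * 100;
}
```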

Four outdated comment threads on the .claude/plans/alerts-and-notifications-improvements.md plan files.
github-actions (Bot) commented Mar 30, 2026

🐋 This PR was built and pushed to the following Docker images:

Targets: build

Platforms: linux/amd64

Image Tag: aab4774221e07d3e3e399d6170edf29f29620a7e

github-actions (Bot) commented Mar 30, 2026

🚀 Snapshot Release (alpha)

The latest changes of this PR are available as alpha on npm (based on the declared changesets):

| Package | Version | Info |
| ------- | ------- | ---- |
| hive | 11.1.0-alpha-20260507011536-aab4774221e07d3e3e399d6170edf29f29620a7e | npm ↗︎ unpkg ↗︎ |

jonathanawesome (Member, Author) commented

> Plan-gated state-log retention. Hobby/Pro 7d, Enterprise 30d. expires_at is snapshotted at insert time so plan changes only affect new rows.

> This is an interesting decision. If an org has 30d retention and changes to 7d, there could be a gap in the logs 7–30 days after the subscription change.
>
> Long term, I feel like we need an "on subscription change" event that we can do a variety of things with.

Good idea. Follow-up issue added to Linear triage.

jonathanawesome (Member, Author) commented

> Notification fan-out is silent on partial failure. metric-alert-notifier.ts sends to each channel in sequence. If Slack succeeds and the webhook fails, today the failure is logged but doesn't surface to the user; they just get a partial notification. Future observability work will add metrics here, but in the meantime any review feedback on whether we should retry / surface failures more prominently is welcome.

> This is the first follow-up feature I'd address. Firing exactly once is always a challenge. Many of these notification platforms have idempotency built in, so we can fire at least once. But we should definitely not partially fulfil.

Follow-up issue created!
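
For reference, one hypothetical shape for that follow-up (illustrative types; not code in this PR): fan out in parallel, attach an idempotency key per (incident, channel) so retries are at-least-once safe, and surface partial failures instead of only logging them.

```ts
interface Channel {
  id: string;
  send: (payload: unknown, idempotencyKey: string) => Promise<void>;
}

async function notifyAll(incidentId: string, payload: unknown, channels: Channel[]) {
  const results = await Promise.allSettled(
    // The idempotency key makes a retried send a no-op on platforms that
    // support it, so retrying the whole fan-out is safe.
    channels.map(ch => ch.send(payload, `${incidentId}:${ch.id}`)),
  );
  const failed = channels.filter((_, i) => results[i].status === 'rejected');
  if (failed.length > 0) {
    // Surfacing the failure lets the UI warn about partial delivery and lets
    // a retry target only the channels that failed.
    throw new AggregateError(
      results.flatMap(r => (r.status === 'rejected' ? [r.reason] : [])),
      `Delivery failed for channels: ${failed.map(c => c.id).join(', ')}`,
    );
  }
}
```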

jonathanawesome and others added 27 commits May 5, 2026 15:28
