
Cleanup OTEL metrics #718

Open
epompeii wants to merge 3 commits into devel from u/ep/otel-metrics

Conversation

@epompeii
Member

This changeset cleans up the OTEL metrics to make them more self-consistent.

@github-actions

🤖 Claude Code Review

PR: #718
Base: devel
Head: u/ep/otel-metrics
Commit: bf176630e694bfb3d29393f3beafd21f5d2c2660


Here's my review of the three commits (priority, metrics_create, metric_names):


PR Review

Summary

This PR refactors OpenTelemetry metrics instrumentation:

  1. Moves determine_priority from a private function in job.rs to PlanKind::priority() method for reuse
  2. Adds Priority dimension to Run and MetricsCreate counters (replacing separate RunClaimed/RunUnclaimed)
  3. Adds MetricsBilled/MetricsBilledFailed counters, renames RunnerMinutesBillingFailed
  4. Removes IntervalKind from Create/CreateMax counters
  5. Adds unit() to all counters
  6. Extracts ApiHistogram into its own module
  7. Adds OTel provider graceful shutdown to flush buffered metrics
  8. Removes bencher_otel dependency from api_run crate (moves instrumentation closer to where priority is known)
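The first refactor above can be sketched as follows. This is a minimal, hypothetical stand-in: the real `PlanKind` variants and the priorities they map to live in the Bencher codebase and are not shown in this PR excerpt.

```rust
// Hypothetical stand-in for the real PlanKind enum; variant names and
// priority values are illustrative, not taken from the Bencher source.
#[derive(Debug, Clone, Copy)]
enum PlanKind {
    Metered,
    Licensed,
    None,
}

impl PlanKind {
    // Moving the former private free function `determine_priority` onto the
    // type keeps the mapping next to the data it depends on, so both job.rs
    // and the OTel instrumentation can reuse it.
    fn priority(self) -> u8 {
        match self {
            PlanKind::Metered | PlanKind::Licensed => 1,
            PlanKind::None => 0,
        }
    }
}

fn main() {
    assert_eq!(PlanKind::Metered.priority(), 1);
    assert_eq!(PlanKind::None.priority(), 0);
    println!("ok");
}
```

A method on a strong type also lets callers write `plan_kind.priority()` at the point where the plan is already in scope, which is what allows dropping the `bencher_otel` dependency from `api_run`.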

Issues

Bug: Missing #[cfg] gate on shutdown in litestream branch (services/api/src/main.rs:121)

The litestream early-return path uses #[cfg(feature = "otel")] (missing the plus gate), while otel_provider is declared under #[cfg(all(feature = "plus", feature = "otel"))]. This will fail to compile when otel is enabled without plus:

#[cfg(feature = "otel")]  // <-- should be #[cfg(all(feature = "plus", feature = "otel"))]
if let Some(provider) = otel_provider

The non-litestream path at line 139 correctly uses #[cfg(all(feature = "plus", feature = "otel"))].

OTel metric naming: Run(_) and ReportCreate now share the same name prefix pattern (run.create vs report.create). The MetricsCreate(_) counter uses the name "metrics.create" while ReportCreate uses "report.create", so there is no actual collision. However, the old MetricCreate (singular) is now MetricsCreate (plural), which is a breaking rename for any existing dashboards or alerts that reference the old metric name.

Observations

Good patterns:

  • Moving determine_priority to PlanKind::priority() follows the project's preference for methods on strong types
  • Adding is_claimed to NewRunReport to thread the claim status through properly
  • Graceful OTel shutdown ensures metrics are flushed — important for correctness
  • increment_by for batch metric recording (MetricsCreate) is more efficient than calling increment in a loop per-metric
  • Tests properly updated to use the new method API
  • Destructuring with #[cfg] attributes is used consistently
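The `increment_by` point above can be illustrated with a toy counter. This is not the real ApiCounter (which wraps an OpenTelemetry instrument); it is a stdlib-only sketch of why one batched recording beats N single increments.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for an OTel counter. For a real instrument, each
// recording also pays for an attribute-set lookup, so batching matters more
// than this atomic-only model suggests.
struct Counter(AtomicU64);

impl Counter {
    fn new() -> Self {
        Counter(AtomicU64::new(0))
    }
    fn increment(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    // One recording for the whole batch of created metrics.
    fn increment_by(&self, n: u64) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    fn value(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    let batched = Counter::new();
    batched.increment_by(100); // one call for 100 metrics

    let looped = Counter::new();
    for _ in 0..100 {
        looped.increment(); // 100 calls, same final value
    }

    assert_eq!(batched.value(), looped.value());
    println!("ok");
}
```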

Minor nits:

  • ApiHistogram methods changed from fn (private) to pub(crate) visibility, and priority_attribute is now pub(crate) — this is correct since it's now in a separate module
  • The unit() method on ApiCounter uses OTel semantic convention curly-brace units ({startup}, {run}, etc.) — good practice
  • SelfHostedServerStats(_) is grouped under "{report}" unit in the new unit() method — verify this is intentional and not a copy-paste error, as server stats aren't necessarily reports
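The `unit()` convention mentioned above can be sketched like this. The variant names and unit strings here are hypothetical; the real ApiCounter enum is in the Bencher source. OTel's convention is to wrap annotation-style "count of X" units in curly braces.

```rust
// Hypothetical subset of the ApiCounter enum; variants and units are
// illustrative stand-ins, not copied from the Bencher codebase.
enum ApiCounter {
    Startup,
    Run,
    MetricsCreate,
}

impl ApiCounter {
    // Curly-brace units follow the OTel semantic convention for counts of
    // things that have no UCUM unit (e.g. "{run}" = number of runs).
    fn unit(&self) -> &'static str {
        match self {
            ApiCounter::Startup => "{startup}",
            ApiCounter::Run => "{run}",
            ApiCounter::MetricsCreate => "{metric}",
        }
    }
}

fn main() {
    assert_eq!(ApiCounter::Run.unit(), "{run}");
    assert_eq!(ApiCounter::MetricsCreate.unit(), "{metric}");
    println!("ok");
}
```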

Recommendation

Fix the #[cfg] gate mismatch on line 121 of services/api/src/main.rs — change #[cfg(feature = "otel")] to #[cfg(all(feature = "plus", feature = "otel"))] to match the other shutdown path. Other than that, the changes look clean and well-structured.


Model: claude-opus-4-6

@github-actions

🐰 Bencher Report

Branch: u/ep/otel-metrics
Testbed: ubuntu-22.04
| Benchmark | Latency, µs (Result Δ%) | Upper Boundary, µs (Limit %) |
|---|---|---|
| Adapter::Json | 3.51 (-0.57%; baseline 3.53) | 4.67 (75.15%) |
| Adapter::Magic (JSON) | 3.33 (-5.16%; baseline 3.51) | 4.58 (72.56%) |
| Adapter::Magic (Rust) | 22.23 (-12.95%; baseline 25.54) | 31.38 (70.85%) |
| Adapter::Rust | 2.61 (-7.52%; baseline 2.82) | 3.29 (79.38%) |
| Adapter::RustBench | 2.61 (-7.38%; baseline 2.82) | 3.28 (79.70%) |
| head_version_insert/batch/10 | 95.53 (-4.86%; baseline 100.40) | 117.94 (81.00%) |
| head_version_insert/batch/100 | 244.22 (+2.67%; baseline 237.87) | 265.67 (91.93%) |
| head_version_insert/batch/255 | 475.18 (+2.69%; baseline 462.71) | 498.15 (95.39%) |
| head_version_insert/batch/50 | 164.69 (+2.16%; baseline 161.20) | 183.26 (89.87%) |
| threshold_query/join/10 | 150.36 (+3.71%; baseline 144.98) | 169.35 (88.79%) |
| threshold_query/join/20 | 165.03 (+3.42%; baseline 159.57) | 186.63 (88.43%) |
| threshold_query/join/5 | 142.28 (+3.74%; baseline 137.15) | 160.16 (88.83%) |
| threshold_query/join/50 | 209.38 (+4.23%; baseline 200.89) | 231.70 (90.37%) |
🐰 View full continuous benchmarking report in Bencher

