fix(site-explorer): widen duration histogram buckets by Susanpdl · Pull Request #2704 · NVIDIA/infra-controller

Susanpdl · 2026-06-19T15:34:34Z

Summary

Site explorer iteration latency uses the default OpenTelemetry histogram buckets, which top out at 10 seconds. On production sites most observations land in the +Inf bucket, so the metric is not useful for monitoring.

This change registers explicit millisecond buckets up to one hour for site explorer duration histograms, including iteration latency and per endpoint exploration duration.

Fixes #2352

Test plan

Added unit test that histogram views build successfully
CI cargo test -p carbide-site-explorer
After deploy, confirm carbide_site_explorer_iteration_latency_milliseconds_bucket observations spread across buckets above 10s instead of concentrating in +Inf

Default OpenTelemetry histogram buckets top out at 10 seconds, so site explorer iteration latency observations on production sites land in +Inf. Register explicit millisecond buckets up to one hour for site explorer duration histograms. Fixes NVIDIA#2352 Signed-off-by: Susan Poudel <susanpdl77@gmail.com>

copy-pr-bot · 2026-06-19T15:34:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-19T15:34:55Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f5b4be83-6aea-4146-afa3-fbc5cf5753fd

📥 Commits

Reviewing files that changed from the base of the PR and between c623bed and b6ee2bc.

📒 Files selected for processing (2)

crates/api-core/src/logging/setup.rs
crates/site-explorer/src/metrics.rs

🚧 Files skipped from review as they are similar to previous changes (2)

crates/api-core/src/logging/setup.rs
crates/site-explorer/src/metrics.rs

Summary by CodeRabbit

New Features
- Expanded site-explorer observability with additional OpenTelemetry latency histograms, including explicit millisecond bucket boundaries and min/max recording.
- Exposed a reusable helper to create latency histogram views, and re-exported the site-explorer latency histogram view at the crate level.
Tests
- Added unit tests covering histogram view creation for specific and wildcard name filters.
- Updated gauge aggregation test setup to include the new site-explorer histogram views.

Walkthrough

This PR resolves a histogram bucket misconfiguration for site-explorer latency metrics by introducing explicit millisecond bucket boundaries. It adds site_explorer_latency_histogram_view, a new public function defining those boundaries, integrates required dependencies, re-exports the function from the crate root, and registers the resulting views in the OTel MeterProvider for both production and test contexts.

Changes

Site Explorer Histogram Bucket Fix

Layer / File(s)	Summary
Histogram view function and dependencies `crates/site-explorer/Cargo.toml`, `crates/site-explorer/src/metrics.rs`	Adds `carbide-metrics-utils` and `opentelemetry_sdk` dependencies. Updates imports to include `Aggregation` and `InstrumentKind` from `opentelemetry_sdk::metrics`. Defines explicit millisecond bucket boundaries and implements `site_explorer_latency_histogram_view` with `Aggregation::ExplicitBucketHistogram` and `record_min_max: true`. Includes unit tests verifying view construction for wildcard and specific name filters.
Public re-export and MeterProvider registration `crates/site-explorer/src/lib.rs`, `crates/api-core/src/logging/setup.rs`	Re-exports `site_explorer_latency_histogram_view` at the `site-explorer` crate root. Registers both the wildcard `site_explorerlatency*` and `carbide_endpoint_exploration_duration` histogram views in the production `create_metrics` MeterProviderBuilder and mirrors the configuration in the test `test_gauge_aggregation` builder.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: widening duration histogram buckets for site-explorer to address the issue of observations concentrating in the +Inf bucket.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the problem, solution, and test plan for the histogram bucket expansion.
Linked Issues check	✅ Passed	The changes directly address issue `#2352` by implementing explicit millisecond histogram buckets for site-explorer duration metrics, replacing inadequate default buckets with a custom configuration extending to one hour.
Out of Scope Changes check	✅ Passed	All modifications are directly scoped to implementing explicit histogram buckets for site-explorer metrics; no extraneous changes are present outside the stated objectives.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR improves observability for carbide-site-explorer by configuring explicit OpenTelemetry histogram buckets (in milliseconds) that extend up to 1 hour, so long-running site explorer operations no longer concentrate in the +Inf bucket and become unusable for monitoring.

Changes:

Add a reusable site_explorer_latency_histogram_view() helper that builds an explicit-bucket histogram view for site explorer duration metrics.
Export the helper from carbide-site-explorer and register the views in carbide-api-core’s metrics setup.
Add dependencies (carbide-metrics-utils, opentelemetry_sdk) and a unit test to ensure the view builder succeeds.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
crates/site-explorer/src/metrics.rs	Adds explicit histogram bucket boundaries + a helper for constructing OTel views; adds a basic build test.
crates/site-explorer/src/lib.rs	Re-exports the view helper for use by other crates.
crates/site-explorer/Cargo.toml	Adds dependencies needed to build OTel metric views.
crates/api-core/src/logging/setup.rs	Registers the new site-explorer histogram views during meter provider construction (and mirrors it in tests).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai

🧹 Nitpick comments (1)

crates/site-explorer/src/metrics.rs (1)
769-777: ⚡ Quick win

Prefer table-driven cases for the view-construction test.

This test calls the same fallible operation with multiple input variants; converting it to a scenario table will make extension and failure labeling cleaner.

As per coding guidelines, “Reach for a table whenever two or more tests call the same operation with different inputs.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/metrics.rs` around lines 769 - 777, The test
function site_explorer_latency_histogram_views_build calls the same operation
site_explorer_latency_histogram_view multiple times with different inputs.
Refactor this into a table-driven test by creating a collection of test case
tuples containing the input string and a descriptive label for each variant,
then iterate over this collection to call the function once per case. This will
make the test more maintainable and easier to extend with additional cases in
the future.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/site-explorer/src/metrics.rs`:
- Around line 769-777: The test function
site_explorer_latency_histogram_views_build calls the same operation
site_explorer_latency_histogram_view multiple times with different inputs.
Refactor this into a table-driven test by creating a collection of test case
tuples containing the input string and a descriptive label for each variant,
then iterate over this collection to call the function once per case. This will
make the test more maintainable and easier to extend with additional cases in
the future.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c35b0c86-848f-422b-a183-f8d94cf1ebec

📥 Commits

Reviewing files that changed from the base of the PR and between d5e5b61 and c623bed.

📒 Files selected for processing (4)

crates/api-core/src/logging/setup.rs
crates/site-explorer/Cargo.toml
crates/site-explorer/src/lib.rs
crates/site-explorer/src/metrics.rs

Keep the default millisecond buckets through 10 seconds so sub-second endpoint timings stay useful, then extend the upper range to one hour. Narrow the view filter to carbide_site_explorer_*_latency so histograms that declare seconds units are not given millisecond bucket boundaries. Signed-off-by: Susan Poudel <susanpdl77@gmail.com>

Susanpdl requested a review from a team as a code owner June 19, 2026 15:34

Copilot AI review requested due to automatic review settings June 19, 2026 15:34

Copilot started reviewing on behalf of Susanpdl June 19, 2026 15:35 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Comment thread crates/site-explorer/src/metrics.rs

Comment thread crates/api-core/src/logging/setup.rs

Comment thread crates/api-core/src/logging/setup.rs

coderabbitai Bot reviewed Jun 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(site-explorer): widen duration histogram buckets#2704

fix(site-explorer): widen duration histogram buckets#2704
Susanpdl wants to merge 2 commits into
NVIDIA:mainfrom
Susanpdl:fix/site-explorer-histogram-buckets

Susanpdl commented Jun 19, 2026

Uh oh!

copy-pr-bot Bot commented Jun 19, 2026

Uh oh!

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Susanpdl commented Jun 19, 2026

Summary

Test plan

Uh oh!

copy-pr-bot Bot commented Jun 19, 2026

Uh oh!

coderabbitai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading