Skip to content

Stats api transport for /_plugins/_analytics/stats endpoint#21877

Draft
OVI3D0 wants to merge 6 commits into
opensearch-project:mainfrom
OVI3D0:stats-api-transport
Draft

Stats api transport for /_plugins/_analytics/stats endpoint#21877
OVI3D0 wants to merge 6 commits into
opensearch-project:mainfrom
OVI3D0:stats-api-transport

Conversation

@OVI3D0
Copy link
Copy Markdown
Member

@OVI3D0 OVI3D0 commented May 28, 2026

Description

Based on #21796 , adds TransportNodesAction to broadcast/merge results, since right now this API is per-node only

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

PR Reviewer Guide 🔍

(Review updated until commit 186d905)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

The failure path in buildProfilingRowsListener constructs a fallback QueryProfile when exec.getGraph() is null, but the ternary expression is incomplete. If exec is null, exec.getGraph() throws NullPointerException before the null-check on the graph. The condition should be exec != null && exec.getGraph() != null with parentheses or split into nested checks.

QueryProfile qp = exec != null && exec.getGraph() != null
    ? QueryProfileBuilder.snapshot(exec.getGraph(), context, fullPlan, planningTimeMs)
    : new QueryProfile(context.queryId(), java.util.List.of(), planningTimeMs, 0L, java.util.List.of());
Possible Issue

In recordProfile, the code checks profile.stages() == null and returns early, but then iterates profile.stages() without null-guarding individual StageProfile elements. If the list contains a null entry, stage.executionType() will throw NullPointerException. Either filter nulls during iteration or document that the list must not contain nulls.

if (profile.stages() == null) return;
for (StageProfile stage : profile.stages()) {
    StageCounters c = bucket(stage.executionType());
    switch (stage.state()) {
        case "SUCCEEDED" -> c.succeeded.increment();
        case "FAILED" -> c.failed.increment();
        case "CANCELLED" -> c.cancelled.increment();
        default -> {
            // CREATED / RUNNING — stage didn't reach terminal; don't count toward terminal buckets.
        }
    }
    c.started.increment();
    c.rowsProcessedTotal.add(stage.rowsProcessed());
    if (stage.elapsedMs() > 0) {
        c.elapsed.record(stage.elapsedMs());
    }
    if (stage.tasks() == null) continue;

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

PR Code Suggestions ✨

Latest suggestions up to 186d905

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Preserve throwable cause chain

When result.failure() is not an Exception, wrapping it in a generic RuntimeException
loses the original throwable's type and stack trace. Preserve the cause chain by
passing the original throwable to the RuntimeException constructor.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [314-325]

 executeInternal(request.getPlan(), request.getQueryCtx(), false, ActionListener.wrap(result -> {
     if (result.isSuccess()) {
         convertingListener.onResponse(new AnalyticsQueryResponse(result.rows()));
     } else {
-        // executeWithProfile delivers failures via onResponse with a populated
-        // ProfiledResult.failure so the profile is always recorded; surface it
-        // here as a true onFailure so the caller doesn't see a null-rows response.
         convertingListener.onFailure(
-            result.failure() instanceof Exception ex ? ex : new RuntimeException(result.failure())
+            result.failure() instanceof Exception ex ? ex : new RuntimeException("Query execution failed", result.failure())
         );
     }
 }, convertingListener::onFailure));
Suggestion importance[1-10]: 7

__

Why: This is a valid improvement that preserves the original throwable's information when wrapping non-Exception throwables. The suggested change adds a descriptive message and properly chains the cause, which aids debugging.

Medium
Log negative latency recordings

Negative latency values are silently dropped, which could mask timing bugs or clock
skew issues. Consider logging a warning when negative values are encountered to aid
debugging, especially since this affects production metrics.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/stats/AnalyticsStatsCollector.java [139-143]

 void record(long valueMs) {
-    if (valueMs < 0) return;
+    if (valueMs < 0) {
+        logger.warn("Dropping negative latency value: {}ms", valueMs);
+        return;
+    }
     count.increment();
     sumMs.add(valueMs);
 }
Suggestion importance[1-10]: 5

__

Why: Adding logging for negative latency values would help debug timing issues, but the suggestion doesn't account for the missing logger declaration in the class. The improvement is moderate since negative values should be rare in production.

Low
Clarify profile parameter semantics

The profile parameter is passed to buildProfilingRowsListener but the method always
builds a profile regardless of this flag. Consider renaming the parameter to
includeProfileInResponse for clarity, or verify that the profiling overhead is
acceptable for all queries since timing is now always enabled.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [242-250]

+// Renamed for clarity - profile is always built, flag controls response inclusion
 ActionListener<Iterable<Object[]>> rowsListener = buildProfilingRowsListener(
     execRef,
     context,
     fullPlan,
     planningTimeMs,
     statsCollector,
-    profile,
+    profile, // controls response inclusion only
     listener
 );
Suggestion importance[1-10]: 3

__

Why: The suggestion correctly identifies that the profile parameter now only controls response inclusion, but the existing code already has a clear comment explaining this behavior. The suggested comment is slightly clearer but offers minimal improvement.

Low

Previous suggestions

Suggestions up to commit e14ee7a
CategorySuggestion                                                                                                                                    Impact
General
Isolate contributor failures with try-catch

If a SearchStatsContributor throws an exception during contributeSearchStats(), the
entire stats collection fails. Consider wrapping each contributor invocation in a
try-catch block to isolate failures and log them without breaking the stats
endpoint.

server/src/main/java/org/opensearch/indices/IndicesService.java [900-907]

 case Search:
     commonStats.search.add(oldShardsStats.searchStats);
     for (SearchStatsContributor contributor : searchStatsContributors) {
-        SearchStats contributed = contributor.contributeSearchStats();
-        if (contributed != null) {
-            commonStats.search.add(contributed);
+        try {
+            SearchStats contributed = contributor.contributeSearchStats();
+            if (contributed != null) {
+                commonStats.search.add(contributed);
+            }
+        } catch (Exception e) {
+            logger.warn("SearchStatsContributor failed", e);
         }
     }
     break;
Suggestion importance[1-10]: 7

__

Why: This is a valid defensive programming suggestion. If a SearchStatsContributor throws an exception, it would break the entire stats collection. Wrapping each contributor call in try-catch with logging isolates failures and improves robustness of the stats endpoint.

Medium
Preserve original throwable as cause

When result.failure() is not an Exception, wrapping it in a generic RuntimeException
loses the original throwable's type and stack trace. Consider using a more specific
exception type or preserving the original throwable as the cause.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [314-325]

 executeInternal(request.getPlan(), request.getQueryCtx(), false, ActionListener.wrap(result -> {
     if (result.isSuccess()) {
         convertingListener.onResponse(new AnalyticsQueryResponse(result.rows()));
     } else {
-        convertingListener.onFailure(
-            result.failure() instanceof Exception ex ? ex : new RuntimeException(result.failure())
-        );
+        Throwable failure = result.failure();
+        Exception ex = failure instanceof Exception ? (Exception) failure : new RuntimeException("Query execution failed", failure);
+        convertingListener.onFailure(ex);
     }
 }, convertingListener::onFailure));
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that wrapping a non-Exception Throwable in a generic RuntimeException loses context. Preserving the original as a cause improves debugging. However, the scenario of result.failure() being a non-Exception Throwable is likely rare in practice.

Low
Clarify profile parameter intent

The profile parameter is passed to buildProfilingRowsListener but the method always
builds a profile regardless of this flag. Consider renaming the parameter to
includeProfileInResponse to clarify its purpose, or verify that the profiling
overhead is acceptable for all queries since timing is now always enabled.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [242-250]

+// Clarify that profile is always built, flag controls response inclusion only
 ActionListener<Iterable<Object[]>> rowsListener = buildProfilingRowsListener(
     execRef,
     context,
     fullPlan,
     planningTimeMs,
     statsCollector,
-    profile,
+    /* includeProfileInResponse= */ profile,
     listener
 );
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that the profile parameter now only controls response inclusion, not whether profiling occurs. However, the code comment at lines 180-183 already explains this behavior clearly. The suggested inline comment adds marginal value.

Low
Log warnings for invalid timing

The method silently drops profiles with zero or negative timing values. This could
mask issues where timing capture failed. Consider logging a warning when timing
values are unexpectedly zero or negative to aid debugging.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/stats/AnalyticsStatsCollector.java [58-65]

 public void recordProfile(QueryProfile profile) {
     if (profile == null) return;
     if (profile.planningTimeMs() > 0) {
         queriesPlanning.record(profile.planningTimeMs());
+    } else if (profile.planningTimeMs() < 0) {
+        logger.warn("Negative planning time in profile: {}", profile.planningTimeMs());
     }
     if (profile.executionTimeMs() > 0) {
         queriesElapsed.record(profile.executionTimeMs());
+    } else if (profile.executionTimeMs() < 0) {
+        logger.warn("Negative execution time in profile: {}", profile.executionTimeMs());
     }
Suggestion importance[1-10]: 4

__

Why: Adding logging for negative timing values could help debugging, but the LatencyTotals.record method already drops negative values at line 140. The suggestion adds defensive logging but doesn't address a critical issue since zero values are legitimate (e.g., very fast operations).

Low
Suggestions up to commit 54f8850
CategorySuggestion                                                                                                                                    Impact
Possible issue
Isolate contributor failures

If a SearchStatsContributor throws an exception during contributeSearchStats(), the
entire _nodes/stats request fails. Wrap the contributor invocation in a try-catch
block to isolate failures and log them without breaking the stats aggregation.

server/src/main/java/org/opensearch/indices/IndicesService.java [900-907]

 case Search:
     commonStats.search.add(oldShardsStats.searchStats);
     for (SearchStatsContributor contributor : searchStatsContributors) {
-        SearchStats contributed = contributor.contributeSearchStats();
-        if (contributed != null) {
-            commonStats.search.add(contributed);
+        try {
+            SearchStats contributed = contributor.contributeSearchStats();
+            if (contributed != null) {
+                commonStats.search.add(contributed);
+            }
+        } catch (Exception e) {
+            logger.warn("SearchStatsContributor failed", e);
         }
     }
     break;
Suggestion importance[1-10]: 7

__

Why: This is a valid suggestion. If a SearchStatsContributor throws an exception, it would fail the entire _nodes/stats request. Wrapping the contributor invocation in a try-catch block with logging ensures that one misbehaving contributor doesn't break the stats aggregation for all nodes, improving system resilience.

Medium
Guard against null stage entries

The method iterates over profile.stages() without null-checking individual
StageProfile elements. If the list contains null entries due to partial profile
construction during failures, this will throw NullPointerException. Add a null check
inside the loop.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/stats/AnalyticsStatsCollector.java [54-92]

 public void recordProfile(QueryProfile profile) {
     if (profile == null) return;
     if (profile.planningTimeMs() > 0) {
         queriesPlanning.record(profile.planningTimeMs());
     }
     if (profile.executionTimeMs() > 0) {
         queriesElapsed.record(profile.executionTimeMs());
     }
     if (profile.stages() == null) return;
     for (StageProfile stage : profile.stages()) {
+        if (stage == null) continue;
         ...
     }
 }
Suggestion importance[1-10]: 5

__

Why: This is a reasonable defensive programming suggestion. While the QueryProfile construction should ensure non-null stage entries, adding a null check inside the loop prevents potential NPEs during partial profile construction in failure scenarios, improving robustness with minimal overhead.

Low
Validate statsCollector before use

The statsCollector is recorded on every query path but never validated for null. If
the Guice injection fails or the component isn't properly wired, this will cause a
NullPointerException during query execution. Add a null check before passing it to
the listener builder.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [242-250]

+if (statsCollector == null) {
+    throw new IllegalStateException("AnalyticsStatsCollector not initialized");
+}
 ActionListener<Iterable<Object[]>> rowsListener = buildProfilingRowsListener(
     execRef,
     context,
     fullPlan,
     planningTimeMs,
     statsCollector,
     profile,
     listener
 );
Suggestion importance[1-10]: 3

__

Why: While adding a null check for statsCollector could prevent potential NPEs, this is a Guice-injected dependency that should always be present if the plugin is properly initialized. The suggestion adds defensive programming but addresses a scenario that indicates a deeper initialization problem rather than a runtime edge case.

Low
Suggestions up to commit c39dc73
CategorySuggestion                                                                                                                                    Impact
Possible issue
Isolate contributor exceptions from stats endpoint

If a SearchStatsContributor throws an exception in contributeSearchStats(), the
entire stats() call fails and breaks the _nodes/stats endpoint. Wrap the contributor
invocation in a try-catch to isolate plugin failures and log them without disrupting
the stats response.

server/src/main/java/org/opensearch/indices/IndicesService.java [900-907]

 case Search:
     commonStats.search.add(oldShardsStats.searchStats);
     for (SearchStatsContributor contributor : searchStatsContributors) {
-        SearchStats contributed = contributor.contributeSearchStats();
-        if (contributed != null) {
-            commonStats.search.add(contributed);
+        try {
+            SearchStats contributed = contributor.contributeSearchStats();
+            if (contributed != null) {
+                commonStats.search.add(contributed);
+            }
+        } catch (Exception e) {
+            // Log and continue - don't let contributor failures break node stats
         }
     }
     break;
Suggestion importance[1-10]: 8

__

Why: This is a critical improvement. If a SearchStatsContributor throws an exception, it would break the entire _nodes/stats endpoint, affecting monitoring and observability. Wrapping the contributor invocation in try-catch ensures plugin failures are isolated and don't disrupt core functionality, which is essential for system stability.

Medium
Add null-checks for list elements

The method iterates over profile.stages() and stage.tasks() without null-checks on
individual elements. If a StageProfile or TaskProfile in the list is null due to a
partial profile build, a NullPointerException will occur. Add null-checks before
accessing stage and task properties.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/stats/AnalyticsStatsCollector.java [54-92]

 public void recordProfile(QueryProfile profile) {
     if (profile == null) return;
     if (profile.planningTimeMs() > 0) {
         queriesPlanning.record(profile.planningTimeMs());
     }
     if (profile.executionTimeMs() > 0) {
         queriesElapsed.record(profile.executionTimeMs());
     }
     if (profile.stages() == null) return;
     for (StageProfile stage : profile.stages()) {
+        if (stage == null) continue;
         ...
+        if (stage.tasks() == null) continue;
+        for (TaskProfile task : stage.tasks()) {
+            if (task == null) continue;
+            ...
+        }
     }
 }
Suggestion importance[1-10]: 6

__

Why: Adding null-checks for individual StageProfile and TaskProfile elements improves robustness against partial or malformed profiles. While the current code may not encounter null elements in practice, defensive programming here prevents potential NullPointerExceptions and makes the code more resilient to future changes.

Low
Add exception handling for histogram recording

The recorder.recordValue(clamped) call can throw ArrayIndexOutOfBoundsException if
the value exceeds the histogram's configured range despite clamping. Add a try-catch
block around the record operation to prevent crashes from unexpected edge cases,
logging the error and continuing gracefully.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/stats/LatencyHistogram.java [49-55]

 void record(long valueMs) {
     if (valueMs < 0) return;
     long clamped = Math.min(valueMs, MAX_TRACKABLE_MS);
-    recorder.recordValue(clamped);
+    try {
+        recorder.recordValue(clamped);
+    } catch (ArrayIndexOutOfBoundsException e) {
+        // Log and continue - don't let histogram failures break stats collection
+        return;
+    }
     count.increment();
-    sumMs.add(valueMs); // record the unclamped sum so totals stay accurate
+    sumMs.add(valueMs);
 }
Suggestion importance[1-10]: 5

__

Why: While adding exception handling is generally good practice, HdrHistogram.Recorder.recordValue() with a clamped value within the configured range should not throw ArrayIndexOutOfBoundsException. The clamping to MAX_TRACKABLE_MS ensures values stay within bounds. This suggestion addresses a theoretical edge case but may not reflect actual risk.

Low
Suggestions up to commit c68d9d6
CategorySuggestion                                                                                                                                    Impact
Possible issue
Compute full plan unconditionally for profiling

The fullPlan variable is computed only when profile is true (line 184), but is now
always passed to buildProfilingRowsListener. When profile is false, fullPlan will be
null, which may cause issues in profile building. Compute fullPlan unconditionally
since profiles are now always built.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [194]

-ActionListener<Iterable<Object[]>> rowsListener = buildProfilingRowsListener(
-    execRef,
-    context,
-    fullPlan,
-    planningTimeMs,
-    statsCollector,
-    profile,
-    listener
-);
+final String fullPlan = org.apache.calcite.plan.RelOptUtil.toString(plan);
+QueryDAG dag = DAGBuilder.build(plan, capabilityRegistry, clusterService, indexNameExpressionResolver);
Suggestion importance[1-10]: 9

__

Why: This is a critical bug. Line 194 shows fullPlan is computed conditionally (profile ? org.apache.calcite.plan.RelOptUtil.toString(plan) : null), but line 242-250 always passes fullPlan to buildProfilingRowsListener. When profile=false, fullPlan will be null, potentially causing issues in profile building since profiles are now always constructed (as per the PR changes).

High
Handle histogram recording exceptions gracefully

The recorder.recordValue(clamped) call can throw ArrayIndexOutOfBoundsException if
the value exceeds the histogram's configured range despite clamping. Add a try-catch
block to handle this edge case gracefully and prevent crashes from unexpected
latency spikes.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/stats/LatencyHistogram.java [49-55]

 void record(long valueMs) {
     if (valueMs < 0) return;
     long clamped = Math.min(valueMs, MAX_TRACKABLE_MS);
-    recorder.recordValue(clamped);
+    try {
+        recorder.recordValue(clamped);
+    } catch (ArrayIndexOutOfBoundsException e) {
+        // Value exceeded histogram range despite clamping; skip recording to histogram
+    }
     count.increment();
-    sumMs.add(valueMs); // record the unclamped sum so totals stay accurate
+    sumMs.add(valueMs);
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion addresses a potential edge case, but HdrHistogram.Recorder.recordValue with proper configuration (MAX_TRACKABLE_MS and clamping) should not throw ArrayIndexOutOfBoundsException. The clamping at line 51 ensures values stay within bounds. Adding a try-catch for an unlikely scenario adds complexity without clear benefit.

Low
General
Preserve throwable cause chain in exception

When result.failure() is not an Exception, wrapping it in RuntimeException loses the
original throwable's stack trace and type information. Use new
RuntimeException(result.failure().getMessage(), result.failure()) to preserve the
cause chain for better debugging.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [314-325]

 executeInternal(request.getPlan(), request.getQueryCtx(), false, ActionListener.wrap(result -> {
     if (result.isSuccess()) {
         convertingListener.onResponse(new AnalyticsQueryResponse(result.rows()));
     } else {
+        Throwable failure = result.failure();
         convertingListener.onFailure(
-            result.failure() instanceof Exception ex ? ex : new RuntimeException(result.failure())
+            failure instanceof Exception ex ? ex : new RuntimeException(failure.getMessage(), failure)
         );
     }
 }, convertingListener::onFailure));
Suggestion importance[1-10]: 7

__

Why: The suggestion improves error handling by preserving the original throwable as the cause when wrapping non-Exception throwables in RuntimeException. This maintains the full stack trace and cause chain, which is valuable for debugging. The current code at line 322 loses this information.

Medium
Record zero-duration queries in statistics

The method records planning and execution times only when they are positive, but
zero-duration queries are valid and should be counted. Remove the > 0 checks to
ensure all completed queries contribute to the count metric, even if their duration
rounds to zero milliseconds.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/stats/AnalyticsStatsCollector.java [54-61]

 public void recordProfile(QueryProfile profile) {
     if (profile == null) return;
-    if (profile.planningTimeMs() > 0) {
+    if (profile.planningTimeMs() >= 0) {
         queriesPlanning.record(profile.planningTimeMs());
     }
-    if (profile.executionTimeMs() > 0) {
+    if (profile.executionTimeMs() >= 0) {
         queriesElapsed.record(profile.executionTimeMs());
     }
     if (profile.stages() == null) return;
     for (StageProfile stage : profile.stages()) {
         ...
     }
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that zero-duration queries should be counted. However, the > 0 checks at lines 56 and 59 prevent recording zero values, which means very fast queries won't contribute to count metrics. Changing to >= 0 would ensure all completed queries are tracked, improving metric accuracy.

Low
Suggestions up to commit f6cafe2
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against null tasks list

The method checks profile.stages() == null but doesn't guard against stage.tasks()
being null before iterating. If a StageProfile has a null tasks() list, the
subsequent for (TaskProfile task : stage.tasks()) will throw a NullPointerException.
Add a null check before iterating over tasks.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/stats/AnalyticsStatsCollector.java [78-92]

 public void recordProfile(QueryProfile profile) {
     if (profile == null) return;
     if (profile.planningTimeMs() > 0) {
         queriesPlanning.record(profile.planningTimeMs());
     }
     if (profile.executionTimeMs() > 0) {
         queriesElapsed.record(profile.executionTimeMs());
     }
     if (profile.stages() == null) return;
     for (StageProfile stage : profile.stages()) {
         ...
+        if (stage.tasks() == null) continue;
+        for (TaskProfile task : stage.tasks()) {
+            ...
+        }
     }
 }
Suggestion importance[1-10]: 8

__

Why: This is a valid bug catch. The code checks if (stage.tasks() == null) continue; at line 78 in the diff, but the suggestion correctly identifies that this check is missing and could cause a NullPointerException. The improved code adds the necessary null guard.

Medium
General
Verify profile-building exception handling

The buildProfilingRowsListener is now always invoked (even when profile=false) to
feed the stats collector. If QueryProfileBuilder.snapshot throws an exception, it
will propagate through the listener chain. Ensure that exceptions during profile
building don't fail the query response when profiling is disabled.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/DefaultPlanExecutor.java [242-250]

+ActionListener<Iterable<Object[]>> rowsListener = buildProfilingRowsListener(
+    execRef,
+    context,
+    fullPlan,
+    planningTimeMs,
+    statsCollector,
+    profile,
+    listener
+);
 
-
Suggestion importance[1-10]: 7

__

Why: Valid concern about exception handling when profile=false. The suggestion correctly identifies that buildProfilingRowsListener is now always invoked and exceptions during profile building could affect query responses. However, the 'improved_code' is identical to 'existing_code', making this more of a verification request than a concrete fix.

Medium
Isolate contributor exceptions

If a SearchStatsContributor throws an exception during contributeSearchStats(), the
entire stats() call will fail and the node stats endpoint will return an error. Wrap
the contributor invocation in a try-catch to isolate failures and log them without
breaking the stats response.

server/src/main/java/org/opensearch/indices/IndicesService.java [900-907]

 case Search:
     commonStats.search.add(oldShardsStats.searchStats);
     for (SearchStatsContributor contributor : searchStatsContributors) {
-        SearchStats contributed = contributor.contributeSearchStats();
-        if (contributed != null) {
-            commonStats.search.add(contributed);
+        try {
+            SearchStats contributed = contributor.contributeSearchStats();
+            if (contributed != null) {
+                commonStats.search.add(contributed);
+            }
+        } catch (Exception e) {
+            logger.warn("SearchStatsContributor threw exception", e);
         }
     }
     break;
Suggestion importance[1-10]: 7

__

Why: Good defensive programming suggestion. If a SearchStatsContributor throws an exception, it would break the entire stats endpoint. Wrapping in try-catch with logging isolates failures and maintains availability of the stats API, though the impact depends on contributor implementation quality.

Medium
Log negative latency values

The record method silently drops negative values without logging or tracking them.
This could mask bugs where negative latencies are passed (e.g., from incorrect
timestamp arithmetic). Consider adding a warning log or an assertion to surface
these invalid inputs during development.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/stats/LatencyHistogram.java [49-55]

 void record(long valueMs) {
-    if (valueMs < 0) return;
+    if (valueMs < 0) {
+        logger.warn("Attempted to record negative latency: {} ms", valueMs);
+        return;
+    }
     long clamped = Math.min(valueMs, MAX_TRACKABLE_MS);
     recorder.recordValue(clamped);
     count.increment();
     sumMs.add(valueMs);
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion to log negative values is reasonable for debugging, but silently dropping them is acceptable for production code. The check prevents invalid data from corrupting the histogram. Adding logging would help during development but isn't critical.

Low

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for e5b8d3c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@OVI3D0 OVI3D0 force-pushed the stats-api-transport branch from e5b8d3c to 97ea963 Compare May 29, 2026 01:25
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 97ea963

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 97ea963: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@OVI3D0 OVI3D0 force-pushed the stats-api-transport branch from 97ea963 to f6cafe2 Compare May 29, 2026 05:12
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit f6cafe2

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for f6cafe2: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

❌ Patch coverage is 81.81818% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.53%. Comparing base (e0404f4) to head (54f8850).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
server/src/main/java/org/opensearch/node/Node.java 33.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21877      +/-   ##
============================================
+ Coverage     73.36%   73.53%   +0.16%     
- Complexity    75430    75594     +164     
============================================
  Files          6034     6034              
  Lines        342604   342615      +11     
  Branches      49279    49282       +3     
============================================
+ Hits         251357   251935     +578     
+ Misses        71220    70673     -547     
+ Partials      20027    20007      -20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@OVI3D0 OVI3D0 force-pushed the stats-api-transport branch from f6cafe2 to c68d9d6 Compare May 29, 2026 16:57
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit c68d9d6

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for c68d9d6: SUCCESS

@OVI3D0 OVI3D0 force-pushed the stats-api-transport branch from c68d9d6 to c39dc73 Compare May 29, 2026 19:12
@github-actions
Copy link
Copy Markdown
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit c39dc73.

PathLineSeverityDescription
sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/stats/LatencyHistogram.java15lowNew file imports org.HdrHistogram.Histogram and org.HdrHistogram.Recorder (HdrHistogram library) without a corresponding visible build-file dependency change. If this is a new transitive or direct dependency introduction not captured in this diff, it warrants verification that the artifact resolves to the expected org.hdrhistogram:HdrHistogram artifact and has not been tampered with.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit c39dc73

@OVI3D0 OVI3D0 force-pushed the stats-api-transport branch from c39dc73 to 54f8850 Compare May 29, 2026 19:21
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 54f8850

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 54f8850: SUCCESS

OVI3D0 and others added 4 commits May 29, 2026 15:15
Data shape for the analytics engine's _stats rollup endpoint. AnalyticsStats
is the immutable snapshot record (queries / stages_by_type / fragments)
carrying cumulative latency totals per bucket: count and sum_ms only.
Mirrors the contract of the analytics DataFusion backend's stats endpoint:
expose raw counters and totals, no derived percentiles or averages — those
are computed by the metrics backend (Tumbler / Prometheus / etc.) from
interval diffs.

Marked @experimentalapi — sandbox plugin, shapes can evolve.

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Exposes node-local analytics-engine query / stage / fragment timing
distributions at GET /_plugins/_analytics/stats. Designed for oncall
triage of mustang-specific metrics that don't have a home in the
existing _nodes/stats API: planning time, per-stage-type rollups,
per-fragment timing.

Counters that overlap with core node stats (total queries, success/fail
counts) are intentionally NOT exposed here — those come through the
analytics engine's _nodes/stats integration so dashboards see mustang
queries in the same fields they already scrape.

The collector consumes the QueryProfile each query already produces via
QueryProfileBuilder.snapshot(). DefaultPlanExecutor is wired so every
query (regular path and _explain) builds the profile and feeds it into
the collector at the success/failure terminal; the regular path drops
the profile from the response after recording, the _explain path
returns it inline.

HdrHistogram (3-sigfig precision, 1ms..10min range) backs the latency
distributions. LongAdder counters for stage / fragment counts.
ConcurrentHashMap for per-stage-type buckets, keyed by
StageExecutionType enum name (SHARD_FRAGMENT, COORDINATOR_REDUCE, ...).

Output shape:

  analytics:
    queries:
      elapsed_ms:  { count, sum_ms, max_ms, p50_ms, p95_ms, p99_ms }
      planning_ms: { count, sum_ms, max_ms, p50_ms, p95_ms, p99_ms }
    stages_by_type:
      <ExecutionType>:
        started, succeeded, failed, cancelled, rows_processed_total,
        elapsed_ms: { ... percentiles ... }
    fragments:
      total, succeeded, failed,
      elapsed_ms: { ... percentiles ... }

Per-node only for v1; cluster-wide aggregation will reuse the
_nodes/stats extension points landing in opensearch-project#21809 rather than introduce a
new TransportNodesAction here.

Output types marked @experimentalapi — sandbox plugin, shapes can evolve.

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Provisions the calcs dataset, fires 90 PPL queries across three shapes
(project, filter, aggregate) so the per-stage-type buckets see varied
work and percentiles have meaningful spread, then asserts shape and
contents of the response from the busiest node.

REST round-robin makes a single GET land on whichever node the client
picks; the busiest-node helper polls a few times and selects the
snapshot reflecting the most activity so all three buckets reliably
populate.

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Introduces a server-side SPI that allows plugins executing searches
outside the standard Lucene path to contribute their query metrics
into the existing node-level SearchStats (query_total, query_time, etc.).

Server changes:
- SearchStatsContributor interface in plugins package
- IndicesService.stats() merges contributed SearchStats during Search flag
- Node.java discovers contributors and passes them to IndicesService

Analytics engine plugin:
- AnalyticsPlugin implements SearchStatsContributor
- contributeSearchStats() reads cumulative count and elapsed time from the
  AnalyticsStatsCollector snapshot already maintained for the
  _plugins/_analytics/stats endpoint, avoiding any duplicate counter

This ensures existing monitoring tooling that reads query_total from
_nodes/stats automatically picks up analytics engine query counts
without any tooling changes.

Co-authored-by: Finn Carroll <carrofin@amazon.com>
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
@OVI3D0 OVI3D0 force-pushed the stats-api-transport branch from 54f8850 to e14ee7a Compare May 29, 2026 22:19
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit e14ee7a

OVI3D0 added 2 commits May 29, 2026 16:03
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Fans out the analytics stats rollup to every cluster node and renders one
entry per node, mirroring _nodes/stats's shape:

  {
    "_nodes":   { "total": N, "successful": N, "failed": 0 },
    "cluster_name": "...",
    "nodes": {
      "<node-id>": { "analytics": { "queries": {...}, "stages_by_type": {...}, "fragments": {...} } },
      ...
    }
  }

PR opensearch-project#21796 only returned the rollup for whichever node the REST client
happened to land on. With the request load-balancer round-robining, a
single GET only saw a slice of the cluster's activity. Promoting the
endpoint to a TransportNodesAction makes it cluster-wide so existing
tooling can poll one URL and see every node's distribution.

Each node still computes its own percentiles locally — the coordinator
collects per-node AnalyticsStats and renders them side-by-side. No
histogram merging is required because there is no top-level rollup; if
one is added later it will need the histograms on the wire.

Changes:
- AnalyticsStats and its nested records implement Writeable.
- New transport package with AnalyticsStatsAction, AnalyticsStatsRequest,
  AnalyticsStatsNodeRequest, AnalyticsStatsNodeResponse,
  AnalyticsStatsResponse, and TransportAnalyticsStatsAction.
- RestAnalyticsStatsAction dispatches via NodesResponseRestListener,
  dropping its direct AnalyticsStatsCollector reference.
- AnalyticsPlugin registers the action in getActions() and constructs the
  REST handler without the collector.
- AnalyticsStatsApiIT updated to assert the cluster-wide shape and to
  verify at least one node recorded a query, stage, and fragment.
- AnalyticsStatsTests covers the StreamInput/StreamOutput round-trip for
  every record.

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
@OVI3D0 OVI3D0 force-pushed the stats-api-transport branch from e14ee7a to 186d905 Compare May 29, 2026 23:03
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 186d905

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 186d905:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants