Stats API transport for /_plugins/_analytics/stats endpoint #21877
OVI3D0 wants to merge 6 commits
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #21877 +/- ##
============================================
+ Coverage 73.36% 73.53% +0.16%
- Complexity 75430 75594 +164
============================================
Files 6034 6034
Lines 342604 342615 +11
Branches 49279 49282 +3
============================================
+ Hits 251357 251935 +578
+ Misses 71220 70673 -547
+ Partials 20027 20007 -20
Data shape for the analytics engine's _stats rollup endpoint.
AnalyticsStats is the immutable snapshot record (queries / stages_by_type / fragments) carrying cumulative latency totals per bucket: count and sum_ms only.
Mirrors the contract of the analytics DataFusion backend's stats endpoint: expose raw counters and totals, no derived percentiles or averages. Those are computed by the metrics backend (Tumbler / Prometheus / etc.) from interval diffs.
Marked @ExperimentalApi: sandbox plugin, shapes can evolve.
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
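The "count and sum_ms only" contract can be illustrated with a minimal sketch. The names `LatencyBucket`, `LatencySnapshot`, and `meanMsSince` are hypothetical and are not taken from the PR; the point is only to show how a metrics backend could derive an average from the difference between two cumulative snapshots.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical immutable snapshot of one bucket: raw cumulative totals only. */
record LatencySnapshot(long count, long sumMs) {
    /** Mean latency over the interval between an earlier snapshot and this one. */
    double meanMsSince(LatencySnapshot earlier) {
        long deltaCount = count - earlier.count;
        long deltaSum = sumMs - earlier.sumMs;
        return deltaCount == 0 ? 0.0 : (double) deltaSum / deltaCount;
    }
}

/** Hypothetical recorder; LongAdder keeps increments cheap under contention. */
final class LatencyBucket {
    private final LongAdder count = new LongAdder();
    private final LongAdder sumMs = new LongAdder();

    void record(long elapsedMs) {
        count.increment();
        sumMs.add(elapsedMs);
    }

    LatencySnapshot snapshot() {
        return new LatencySnapshot(count.sum(), sumMs.sum());
    }
}
```

Because the node only ever reports monotonically growing totals, the scraper can pick any interval it likes without the node knowing about it.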
Exposes node-local analytics-engine query / stage / fragment timing
distributions at GET /_plugins/_analytics/stats. Designed for oncall
triage of mustang-specific metrics that don't have a home in the
existing _nodes/stats API: planning time, per-stage-type rollups,
per-fragment timing.
Counters that overlap with core node stats (total queries, success/fail
counts) are intentionally NOT exposed here — those come through the
analytics engine's _nodes/stats integration so dashboards see mustang
queries in the same fields they already scrape.
The collector consumes the QueryProfile each query already produces via
QueryProfileBuilder.snapshot(). DefaultPlanExecutor is wired so every
query (regular path and _explain) builds the profile and feeds it into
the collector at the success/failure terminal; the regular path drops
the profile from the response after recording, the _explain path
returns it inline.
HdrHistogram (3-sigfig precision, 1ms..10min range) backs the latency
distributions. LongAdder counters for stage / fragment counts.
ConcurrentHashMap for per-stage-type buckets, keyed by
StageExecutionType enum name (SHARD_FRAGMENT, COORDINATOR_REDUCE, ...).
Output shape:
  analytics:
    queries:
      elapsed_ms: { count, sum_ms, max_ms, p50_ms, p95_ms, p99_ms }
      planning_ms: { count, sum_ms, max_ms, p50_ms, p95_ms, p99_ms }
    stages_by_type:
      <ExecutionType>:
        started, succeeded, failed, cancelled, rows_processed_total,
        elapsed_ms: { ... percentiles ... }
    fragments:
      total, succeeded, failed,
      elapsed_ms: { ... percentiles ... }
Per-node only for v1; cluster-wide aggregation will reuse the
_nodes/stats extension points landing in opensearch-project#21809 rather than introduce a
new TransportNodesAction here.
Output types marked @ExperimentalApi: sandbox plugin, shapes can evolve.
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
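As a rough illustration of the per-stage-type buckets described in this commit, the sketch below keys a ConcurrentHashMap by enum name and uses LongAdder counters. `StageType`, `StageBucket`, and `StageStatsCollector` are hypothetical stand-ins, not the PR's classes, and the HdrHistogram-backed elapsed_ms distribution is left out so the example needs only the JDK.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Stand-in for StageExecutionType; only the two names the commit message lists. */
enum StageType { SHARD_FRAGMENT, COORDINATOR_REDUCE }

/** Hypothetical per-type counters; the latency histogram is omitted here. */
final class StageBucket {
    final LongAdder started = new LongAdder();
    final LongAdder succeeded = new LongAdder();
    final LongAdder failed = new LongAdder();
    final LongAdder rowsProcessedTotal = new LongAdder();
}

final class StageStatsCollector {
    private final Map<String, StageBucket> byType = new ConcurrentHashMap<>();

    /** Records one finished stage; the bucket is created lazily on first use. */
    void recordStage(StageType type, boolean success, long rows) {
        StageBucket bucket = byType.computeIfAbsent(type.name(), k -> new StageBucket());
        bucket.started.increment();
        (success ? bucket.succeeded : bucket.failed).increment();
        bucket.rowsProcessedTotal.add(rows);
    }

    /** Sorted copy of the succeeded counts, keyed by stage type name. */
    Map<String, Long> succeededByType() {
        Map<String, Long> out = new TreeMap<>();
        byType.forEach((name, bucket) -> out.put(name, bucket.succeeded.sum()));
        return out;
    }

    /** Total rows recorded for one stage type, or 0 if the type was never seen. */
    long rowsProcessed(StageType type) {
        StageBucket bucket = byType.get(type.name());
        return bucket == null ? 0L : bucket.rowsProcessedTotal.sum();
    }
}
```

Keying by the enum name rather than the enum itself means the map keys can be written straight into the response as field names.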
Provisions the calcs dataset, fires 90 PPL queries across three shapes (project, filter, aggregate) so the per-stage-type buckets see varied work and percentiles have meaningful spread, then asserts shape and contents of the response from the busiest node.
REST round-robin makes a single GET land on whichever node the client picks; the busiest-node helper polls a few times and selects the snapshot reflecting the most activity so all three buckets reliably populate.
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
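The busiest-node selection can be sketched as follows. This is an assumption about how such a helper might look, not the test's actual code: `NodeSnapshot`, `BusiestNode.pick`, and the use of a query count as the activity measure are all hypothetical, and the `Supplier` stands in for one GET against the round-robin REST client.

```java
import java.util.function.Supplier;

/** Hypothetical view of one poll result: which node answered and how busy it was. */
record NodeSnapshot(String nodeId, long queryCount) {}

final class BusiestNode {
    /**
     * Polls the supplier a fixed number of times and keeps the snapshot
     * with the highest query count. Each call to fetch.get() models one GET
     * that may land on a different node.
     */
    static NodeSnapshot pick(Supplier<NodeSnapshot> fetch, int attempts) {
        if (attempts < 1) {
            throw new IllegalArgumentException("attempts must be >= 1");
        }
        NodeSnapshot best = fetch.get();
        for (int i = 1; i < attempts; i++) {
            NodeSnapshot next = fetch.get();
            if (next.queryCount() > best.queryCount()) {
                best = next;
            }
        }
        return best;
    }
}
```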
Introduces a server-side SPI that allows plugins executing searches outside the standard Lucene path to contribute their query metrics into the existing node-level SearchStats (query_total, query_time, etc.).
Server changes:
- SearchStatsContributor interface in plugins package
- IndicesService.stats() merges contributed SearchStats during Search flag
- Node.java discovers contributors and passes them to IndicesService
Analytics engine plugin:
- AnalyticsPlugin implements SearchStatsContributor
- contributeSearchStats() reads cumulative count and elapsed time from the AnalyticsStatsCollector snapshot already maintained for the _plugins/_analytics/stats endpoint, avoiding any duplicate counter
This ensures existing monitoring tooling that reads query_total from _nodes/stats automatically picks up analytics engine query counts without any tooling changes.
Co-authored-by: Finn Carroll <carrofin@amazon.com>
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
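The merge step can be pictured with a simplified sketch. The types below are hypothetical stand-ins: `QueryTotals` models only the two SearchStats fields named above, and `StatsContributor` only assumes a no-argument `contributeSearchStats()`, so the real interface's signature may differ.

```java
import java.util.List;

/** Simplified stand-in for SearchStats: query_total and query_time only. */
record QueryTotals(long queryTotal, long queryTimeMs) {
    QueryTotals plus(QueryTotals other) {
        return new QueryTotals(queryTotal + other.queryTotal, queryTimeMs + other.queryTimeMs);
    }
}

/** Hypothetical shape of the SPI that a plugin would implement. */
interface StatsContributor {
    QueryTotals contributeSearchStats();
}

final class StatsMerger {
    /** Adds every contributor's cumulative totals onto the node's own totals. */
    static QueryTotals merge(QueryTotals nodeLocal, List<StatsContributor> contributors) {
        QueryTotals total = nodeLocal;
        for (StatsContributor contributor : contributors) {
            total = total.plus(contributor.contributeSearchStats());
        }
        return total;
    }
}
```

Since the contributor reads from a snapshot the plugin already maintains, the merge adds no second counter to keep in sync.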
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Fans out the analytics stats rollup to every cluster node and renders one
entry per node, mirroring _nodes/stats's shape:
{
  "_nodes": { "total": N, "successful": N, "failed": 0 },
  "cluster_name": "...",
  "nodes": {
    "<node-id>": { "analytics": { "queries": {...}, "stages_by_type": {...}, "fragments": {...} } },
    ...
  }
}
PR opensearch-project#21796 only returned the rollup for whichever node the REST client
happened to land on. With the request load-balancer round-robining, a
single GET only saw a slice of the cluster's activity. Promoting the
endpoint to a TransportNodesAction makes it cluster-wide so existing
tooling can poll one URL and see every node's distribution.
Each node still computes its own percentiles locally — the coordinator
collects per-node AnalyticsStats and renders them side-by-side. No
histogram merging is required because there is no top-level rollup; if
one is added later it will need the histograms on the wire.
Changes:
- AnalyticsStats and its nested records implement Writeable.
- New transport package with AnalyticsStatsAction, AnalyticsStatsRequest,
AnalyticsStatsNodeRequest, AnalyticsStatsNodeResponse,
AnalyticsStatsResponse, and TransportAnalyticsStatsAction.
- RestAnalyticsStatsAction dispatches via NodesResponseRestListener,
dropping its direct AnalyticsStatsCollector reference.
- AnalyticsPlugin registers the action in getActions() and constructs the
REST handler without the collector.
- AnalyticsStatsApiIT updated to assert the cluster-wide shape and to
verify at least one node recorded a query, stage, and fragment.
- AnalyticsStatsTests covers the StreamInput/StreamOutput round-trip for
every record.
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
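The round-trip property that AnalyticsStatsTests is said to cover can be sketched with JDK streams. This is only an analogy: `DataOutputStream` and `DataInputStream` stand in for OpenSearch's StreamOutput and StreamInput, and `FragmentStats` and `RoundTrip` are hypothetical names built from the fragments fields (total, succeeded, failed) listed in the output shape.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical wire-serializable record; fields are read in the order they are written. */
record FragmentStats(long total, long succeeded, long failed) {
    void writeTo(DataOutputStream out) throws IOException {
        out.writeLong(total);
        out.writeLong(succeeded);
        out.writeLong(failed);
    }

    static FragmentStats readFrom(DataInputStream in) throws IOException {
        return new FragmentStats(in.readLong(), in.readLong(), in.readLong());
    }
}

final class RoundTrip {
    /** Serializes to bytes and back, as a node-to-coordinator hop would. */
    static FragmentStats copy(FragmentStats original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.writeTo(out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return FragmentStats.readFrom(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test for each record then reduces to asserting that the copy equals the original, which record equality provides for free.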
Description
Based on #21796, adds TransportNodesAction to broadcast/merge results, since right now this API is per-node only.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.