Refactor datafusion plugin stats to node level metrics#21846
Refactor datafusion plugin stats to node level metrics#21846AjayRajNelapudi wants to merge 3 commits into
Conversation
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit 2a99a20.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks. |
Signed-off-by: Ajay Raj Nelapudi <ajnelapu@amazon.com>
0c6b8d0 to
c650d10
Compare
PR Reviewer Guide 🔍(Review updated until commit 2a99a20)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 2a99a20 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit 2a99a20
Suggestions up to commit 79f5fd2
Suggestions up to commit c650d10
Suggestions up to commit c650d10
Suggestions up to commit c650d10
|
|
❌ Gradle check result for c650d10: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
| public static final String VALID_STATS = "io_runtime, cpu_runtime, coordinator_reduce, " | ||
| + "query_execution, stream_next, plan_setup, datanode_gate, coordinator_gate"; | ||
|
|
||
| private static final Set<String> VALID_STAT_NAMES = Set.of( |
There was a problem hiding this comment.
nit: can we derive VALID_STATS from VALID_STAT_NAMES so there is single place where edits need to done for adding stats?
There was a problem hiding this comment.
Derived in next commit on same PR
|
Persistent review updated to latest commit c650d10 |
|
❌ Gradle check result for c650d10: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Persistent review updated to latest commit c650d10 |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #21846 +/- ##
============================================
+ Coverage 73.34% 73.36% +0.01%
+ Complexity 75417 75392 -25
============================================
Files 6032 6030 -2
Lines 342404 342308 -96
Branches 49235 49231 -4
============================================
- Hits 251142 251139 -3
+ Misses 71272 71234 -38
+ Partials 19990 19935 -55 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Bukhtawar
left a comment
There was a problem hiding this comment.
Can we add ITs for node and cluster level?
Signed-off-by: Ajay Raj Nelapudi <ajnelapu@amazon.com>
|
Persistent review updated to latest commit 79f5fd2 |
Signed-off-by: Ajay Raj Nelapudi <ajnelapu@amazon.com>
76cca8e to
2a99a20
Compare
|
Persistent review updated to latest commit 2a99a20 |
|
❌ Gradle check result for 2a99a20: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Persistent review updated to latest commit 2a99a20 |
|
❌ Gradle check result for 2a99a20: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
Refactor datafusion plugin stats to node level metrics
AI generated test report
DataFusion Cluster Stats — Endpoint Test Report
Test 1: All stats from all nodes
Expected: Returns _nodes, cluster_name, nodes with all 8 stat sections. No name/host/transport_address.
Actual:
{ "_nodes": { "total": 1, "successful": 1, "failed": 0 }, "cluster_name": "runTask", "nodes": { "iXuve5W0TZqx77qY_CY9pQ": { "io_runtime": { "workers_count": 16, "total_polls_count": 0, "total_busy_duration_ms": 0, "total_overflow_count": 0, "global_queue_depth": 0, "blocking_queue_depth": 0, "num_alive_tasks": 0, "spawned_tasks_count": 0, "total_local_queue_depth": 0 }, "cpu_runtime": { "workers_count": 8, "total_polls_count": 0, "total_busy_duration_ms": 0, "total_overflow_count": 0, "global_queue_depth": 0, "blocking_queue_depth": 0, "num_alive_tasks": 0, "spawned_tasks_count": 0, "total_local_queue_depth": 0 }, "coordinator_reduce": { "total_poll_duration_ms": 0, "total_scheduled_duration_ms": 0, "total_idle_duration_ms": 0 }, "query_execution": { "total_poll_duration_ms": 0, "total_scheduled_duration_ms": 0, "total_idle_duration_ms": 0 }, "stream_next": { "total_poll_duration_ms": 0, "total_scheduled_duration_ms": 0, "total_idle_duration_ms": 0 }, "plan_setup": { "total_poll_duration_ms": 0, "total_scheduled_duration_ms": 0, "total_idle_duration_ms": 0 }, "datanode_gate": { "max_permits": 12, "active_permits": 0, "total_wait_duration_ms": 0, "total_batches_started": 0 }, "coordinator_gate": { "max_permits": 12, "active_permits": 0, "total_wait_duration_ms": 0, "total_batches_started": 0 } } } }Result: ✅ PASS
Test 2: Local node only (_local)
Expected: Returns stats for local node only.
Actual:
{ "_nodes": { "total": 1, "successful": 1, "failed": 0 }, "cluster_name": "runTask", "nodes": { "iXuve5W0TZqx77qY_CY9pQ": { "io_runtime": { "workers_count": 16, "total_polls_count": 0, "total_busy_duration_ms": 0, "total_overflow_count": 0, "global_queue_depth": 0, "blocking_queue_depth": 0, "num_alive_tasks": 0, "spawned_tasks_count": 0, "total_local_queue_depth": 0 }, "cpu_runtime": { "workers_count": 8, "total_polls_count": 0, "total_busy_duration_ms": 0, "total_overflow_count": 0, "global_queue_depth": 0, "blocking_queue_depth": 0, "num_alive_tasks": 0, "spawned_tasks_count": 0, "total_local_queue_depth": 0 }, "coordinator_reduce": { "total_poll_duration_ms": 0, "total_scheduled_duration_ms": 0, "total_idle_duration_ms": 0 }, "query_execution": { "total_poll_duration_ms": 0, "total_scheduled_duration_ms": 0, "total_idle_duration_ms": 0 }, "stream_next": { "total_poll_duration_ms": 0, "total_scheduled_duration_ms": 0, "total_idle_duration_ms": 0 }, "plan_setup": { "total_poll_duration_ms": 0, "total_scheduled_duration_ms": 0, "total_idle_duration_ms": 0 }, "datanode_gate": { "max_permits": 12, "active_permits": 0, "total_wait_duration_ms": 0, "total_batches_started": 0 }, "coordinator_gate": { "max_permits": 12, "active_permits": 0, "total_wait_duration_ms": 0, "total_batches_started": 0 } } } }Result: ✅ PASS
Test 3: Single stat filter (io_runtime)
Expected: Only io_runtime section per node.
Actual:
{ "_nodes": { "total": 1, "successful": 1, "failed": 0 }, "cluster_name": "runTask", "nodes": { "iXuve5W0TZqx77qY_CY9pQ": { "io_runtime": { "workers_count": 16, "total_polls_count": 0, "total_busy_duration_ms": 0, "total_overflow_count": 0, "global_queue_depth": 0, "blocking_queue_depth": 0, "num_alive_tasks": 0, "spawned_tasks_count": 0, "total_local_queue_depth": 0 } } } }Result: ✅ PASS
Test 4: Multiple stat filter (io_runtime,cpu_runtime)
Expected: Only io_runtime and cpu_runtime.
Actual:
{ "_nodes": { "total": 1, "successful": 1, "failed": 0 }, "cluster_name": "runTask", "nodes": { "iXuve5W0TZqx77qY_CY9pQ": { "io_runtime": { "workers_count": 16, "total_polls_count": 0, "total_busy_duration_ms": 0, "total_overflow_count": 0, "global_queue_depth": 0, "blocking_queue_depth": 0, "num_alive_tasks": 0, "spawned_tasks_count": 0, "total_local_queue_depth": 0 }, "cpu_runtime": { "workers_count": 8, "total_polls_count": 0, "total_busy_duration_ms": 0, "total_overflow_count": 0, "global_queue_depth": 0, "blocking_queue_depth": 0, "num_alive_tasks": 0, "spawned_tasks_count": 0, "total_local_queue_depth": 0 } } } }Result: ✅ PASS
Test 5: Node + stat filter (_local/datanode_gate)
Expected: Only datanode_gate for local node.
Actual:
{ "_nodes": { "total": 1, "successful": 1, "failed": 0 }, "cluster_name": "runTask", "nodes": { "iXuve5W0TZqx77qY_CY9pQ": { "datanode_gate": { "max_permits": 12, "active_permits": 0, "total_wait_duration_ms": 0, "total_batches_started": 0 } } } }Result: ✅ PASS
Test 6: Invalid stat name → HTTP 400
Expected: HTTP 400 with error listing valid names.
Actual: HTTP 400
Result: ✅ PASS
Test 7: Mixed valid+invalid → HTTP 400
Expected: HTTP 400 even with one invalid stat.
Actual: HTTP 400
Result: ✅ PASS
Test 8: Deprecated legacy route
Expected: HTTP 200 with same data via legacy prefix.
Actual: HTTP 200
Result: ✅ PASS
Test 9: No IP leakage
Expected: No name/host/transport_address in per-node entries.
Actual: Node keys = ['io_runtime', 'cpu_runtime', 'coordinator_reduce', 'query_execution', 'stream_next', 'plan_setup', 'datanode_gate', 'coordinator_gate']
IP Leak: False
Result: ✅ PASS
Summary
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.