feat: Wire up native cancellation stats from Rust to Java#21802
feat: Wire up native cancellation stats from Rust to Java#21802aravindsagar wants to merge 11 commits into
Conversation
Implement the cancellation stats counters in the Rust native layer: - Add QueryType enum (Shard/Coordinator) to distinguish query origins - Record cancelled_at timestamp when cancel_query() fires - Compute current_count_post_cancel via live registry scan - Increment total_count_post_cancel on Drop when past threshold - Make threshold configurable via df_set_cancel_stats_threshold_ms - Thread QueryType through all QueryTrackingContext call sites Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Aravind Sagar <sagarara@amazon.com>
PR Reviewer Guide 🔍(Review updated until commit 5573fe8)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 5573fe8 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit 5573fe8
Suggestions up to commit b44fe1c
Suggestions up to commit 6bfb2c1
Suggestions up to commit 65ee3dd
Suggestions up to commit ae384b9
|
…ion stats - Read total counters BEFORE scanning registry in df_native_node_stats. Prevents a query that drops mid-read from appearing in both the scan (current) and the incremented total counter simultaneously. - Use try_lock instead of lock in count_cancelled_running to avoid holding a Mutex during DashMap iteration. If contended with a concurrent cancel_query call, the entry is skipped (conservative undercount for that instant) rather than risking lock ordering issues. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Aravind Sagar <sagarara@amazon.com>
|
Persistent review updated to latest commit 78ef24e |
Move the cancelled_at timestamp write after cancellation_token.cancel() so the query is confirmed cancelled before being marked for stats. Avoids relying on DashMap locking as an implicit ordering guarantee. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Aravind Sagar <sagarara@amazon.com>
|
Persistent review updated to latest commit f9e96ae |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #21802 +/- ##
============================================
- Coverage 73.51% 73.35% -0.17%
+ Complexity 75582 75455 -127
============================================
Files 6034 6034
Lines 342661 342661
Branches 49294 49294
============================================
- Hits 251918 251353 -565
- Misses 70712 71283 +571
+ Partials 20031 20025 -6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Aravind Sagar <sagarara@amazon.com>
|
Persistent review updated to latest commit ae384b9 |
Resolve conflicts: - ffm.rs: keep both df_set_cancel_stats_threshold_ms and df_query_registry_top_n_by_current - query_tracker.rs: add QueryType::Shard to new top-N snapshot tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Aravind Sagar <sagarara@amazon.com>
|
Persistent review updated to latest commit 65ee3dd |
|
Persistent review updated to latest commit 6bfb2c1 |
| /** | ||
| * Sets the cancellation stats threshold in milliseconds. | ||
| * Queries cancelled for less than this duration are not counted in stats. | ||
| * Primarily for testing — production uses the default (10 000 ms). | ||
| */ | ||
| public static void setCancelStatsThresholdMs(long millis) { |
There was a problem hiding this comment.
nit: should this setting be configurable?
| /// Data-node shard fragment execution (AnalyticsShardTask on the Java side). | ||
| Shard, | ||
| /// Coordinator-side local reduce execution (AnalyticsQueryTask on the Java side). | ||
| Coordinator, |
There was a problem hiding this comment.
Rename to fragment execution/reduce for consistency
| /// CPU task abort handle, set after the stream is created. | ||
| pub abort_handle: OnceLock<AbortHandle>, | ||
| /// Instant when cancellation was signalled, or None if not cancelled. | ||
| pub cancelled_at: Mutex<Option<Instant>>, |
There was a problem hiding this comment.
Can we replace it with AtomicU64 and CAS to simplify locks
let nanos = PROCESS_START.elapsed().as_nanos() as u64;
tracker.cancelled_at_nanos.compare_exchange(0, nanos, Ordering::Release, Ordering::Relaxed).ok();
Basically 0 denotes not cancelled. >0 denotes nanos since PROCESS_START when cancel was fired
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit 72f79f1.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks. |
- Add QueryType::Shard to new QueryTrackingContext::new call in execute_query (benchmark path added on main) - Revert Cargo.toml profile.release back to lto=true, codegen-units=1 (accidentally committed as lto=false during merge) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Aravind Sagar <sagarara@amazon.com>
|
Persistent review updated to latest commit b44fe1c |
Use lock-free AtomicU64 (nanos since process start) with CAS for cancelled_at, eliminating the per-entry Mutex entirely. This removes any deadlock concern between DashMap iteration and the per-entry lock. - cancelled_at_nanos: 0 = not cancelled, >0 = nanos since PROCESS_START - cancel_query uses compare_exchange(0, nanos, Release, Relaxed) - count_cancelled_running is fully lock-free (atomic load per entry) - Drop reads cancelled_at_nanos with Acquire ordering - Remove threshold-sensitive unit tests that were flaky under parallel execution (Drop-to-total path covered by manual cluster testing) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Aravind Sagar <sagarara@amazon.com>
|
Persistent review updated to latest commit 5573fe8 |
|
❌ Gradle check result for 5573fe8: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
gradle-check timed out ( |
|
Persistent review updated to latest commit 5573fe8 |
Description
Implement the cancellation stats counters in the Rust native layer:
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.