Fix native memory duress probe #2

Open
Bukhtawar wants to merge 10 commits into pradeep-L:searchNMBP from Bukhtawar:fix-native-memory-duress-probe

@Bukhtawar

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Pradeep L added 9 commits May 19, 2026 02:12
Signed-off-by: Pradeep L <spradeel@amazon.com>

# Conflicts:
#	sandbox/libs/analytics-framework/src/main/java/org/opensearch/analytics/spi/AnalyticsSearchBackendPlugin.java
#	sandbox/plugins/analytics-backend-datafusion/rust/src/query_tracker.rs
#	sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DataFusionAnalyticsBackendPlugin.java
#	sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DataFusionPlugin.java
#	sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/nativelib/NativeBridge.java

# Conflicts:
#	server/src/main/java/org/opensearch/monitor/os/OsProbe.java
Rename the per-task setting native_memory_bytes_threshold to native_memory_percent_threshold to match its actual semantics. The setting was declared as Setting<Long> with a '_bytes_threshold' key, but NativeMemoryUsageTracker interpreted the value as a percentage. Switch to Setting<Double> bounded to [0.0, 1.0], mirroring heap_percent_threshold. The effective per-task byte threshold is now budget * fraction, where budget is the native-memory budget installed by the backend.

Also includes in-progress native-memory backpressure plumbing: NativeMemoryUsageTracker budget supplier, AnalyticsShardTask now extends SearchShardTask so SBP observes it, DataFusion plugin wires currentBytesByTaskId snapshot supplier, native registry top-N FFM call, and OsProbe getProcessNativeMemoryBytes helper. Several rough edges flagged in code review remain (see PR description).

Signed-off-by: Pradeep L <spradeel@amazon.com>
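
The threshold arithmetic described in the commit message above can be illustrated with a minimal sketch. The class and method names here (NativeMemoryThresholdSketch, effectiveThresholdBytes, exceeds) are hypothetical and are not taken from the PR. Returning 0 for a missing budget and using >= for the comparison are assumptions as well. Only the budget * fraction relationship and the [0.0, 1.0] bound come from the commit message.

```java
/**
 * Illustrative only: hypothetical helper showing budget * fraction.
 * Not the actual NativeMemoryUsageTracker code.
 */
final class NativeMemoryThresholdSketch {

    /** Effective per-task byte threshold = budget * fraction, fraction in [0.0, 1.0]. */
    static long effectiveThresholdBytes(long budgetBytes, double fraction) {
        if (Double.isNaN(fraction) || fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0.0, 1.0], got " + fraction);
        }
        if (budgetBytes <= 0) {
            // Assumption: no backend-installed budget means the check is inert.
            return 0L;
        }
        return (long) (budgetBytes * fraction);
    }

    /** Assumption: a task exceeds the threshold when its usage is at or above it. */
    static boolean exceeds(long taskBytes, long budgetBytes, double fraction) {
        long threshold = effectiveThresholdBytes(budgetBytes, fraction);
        return threshold > 0 && taskBytes >= threshold;
    }

    public static void main(String[] args) {
        long budget = 8L * 1024 * 1024 * 1024; // 8 GiB budget, example value
        System.out.println(effectiveThresholdBytes(budget, 0.25)); // prints 2147483648
        System.out.println(exceeds(3L * 1024 * 1024 * 1024, budget, 0.25)); // prints true
    }
}
```

With a Setting<Long> under a '_bytes_threshold' key, an operator could reasonably set a value such as 2147483648 and have it read as a percentage. Bounding the value to [0.0, 1.0] rejects that kind of input at validation time.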
Tracing for the FFM rust path and the Java BP tick.
Rust ffm.rs: log enter/ok-exit/err-exit for df_execute_query and df_execute_with_context.
Rust query_tracker.rs: log cancel_query found/not-found and drain_completed_query drained/not-drained.
NativeMemoryUsageTracker.evaluate now logs every code path (skip, inert, below-threshold, exceeds).
NativeMemoryUsageTracker.refresh distinguishes null vs empty supplier and logs heaviest 5 task IDs.
SearchBackpressureService.doRun logs per-tracker reasons-produced count, merge result, limiter outcome.

Signed-off-by: Pradeep L <spradeel@amazon.com>
Signed-off-by: Pradeep L <spradeel@amazon.com>
Signed-off-by: Pradeep L <spradeel@amazon.com>
Signed-off-by: Pradeep L <spradeel@amazon.com>
Signed-off-by: Pradeep L <spradeel@amazon.com>
Signed-off-by: Pradeep L <spradeel@amazon.com>
Signed-off-by: Pradeep L <spradeel@amazon.com>
@Bukhtawar force-pushed the fix-native-memory-duress-probe branch from 9fc0af4 to 861c6e5 on May 19, 2026 17:13
1. Duress probe now uses NativeMemoryUsageService pool usage (sum of
   the per-task snapshot) instead of OsProbe.getProcessNativeMemoryBytes().
   Process RSS overcounts: it includes Netty direct buffers, thread
   stacks, and mmap, not just DataFusion's pool.

2. Remove NODE_NATIVE_MEMORY_LIMIT_SETTING from NodeDuressSettings. It is
   redundant with NativeMemoryUsageService.getBudgetBytes(), which is
   installed by the backend plugin. Having two budget settings that can
   diverge is a misconfiguration trap.

3. Replace the platform gate (isNativeTrackingSupported: Linux/macOS only)
   with hasSnapshotProvider(). The feature is active when a backend
   installs a supplier, not based on the OS.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
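
The three changes in the commit message above can be sketched roughly as follows. This is a hedged illustration, not the PR's code. The class NativeMemoryDuressSketch, its constructor, and the duressFraction parameter are hypothetical. Comparing pool usage against budget * duressFraction is also an assumption, since the commit message does not say how the duress decision is computed. Only the ideas of summing a per-task snapshot, using a single backend-installed budget, and gating on hasSnapshotProvider() come from the source.

```java
import java.util.Map;
import java.util.function.Supplier;

/** Illustrative only: a hypothetical stand-in for the duress probe wiring. */
final class NativeMemoryDuressSketch {
    private final long budgetBytes;      // single budget, as if installed by the backend plugin
    private final double duressFraction; // hypothetical: fraction of budget that counts as duress
    private volatile Supplier<Map<Long, Long>> snapshotProvider; // task ID -> current bytes

    NativeMemoryDuressSketch(long budgetBytes, double duressFraction) {
        this.budgetBytes = budgetBytes;
        this.duressFraction = duressFraction;
    }

    void setSnapshotProvider(Supplier<Map<Long, Long>> provider) {
        this.snapshotProvider = provider;
    }

    /** Feature gate: active only when a backend has installed a supplier. */
    boolean hasSnapshotProvider() {
        return snapshotProvider != null;
    }

    /** Pool usage is the sum of the per-task snapshot, not process RSS. */
    long poolUsageBytes() {
        Supplier<Map<Long, Long>> provider = snapshotProvider;
        if (provider == null) {
            return 0L;
        }
        Map<Long, Long> snapshot = provider.get();
        if (snapshot == null) {
            return 0L;
        }
        long total = 0L;
        for (long bytes : snapshot.values()) {
            total += bytes;
        }
        return total;
    }

    /** Assumption: duress means pool usage at or above budget * duressFraction. */
    boolean isInDuress() {
        if (!hasSnapshotProvider() || budgetBytes <= 0) {
            return false; // inert without a provider or a budget
        }
        return poolUsageBytes() >= (long) (budgetBytes * duressFraction);
    }

    public static void main(String[] args) {
        NativeMemoryDuressSketch probe = new NativeMemoryDuressSketch(1000L, 0.8);
        System.out.println(probe.isInDuress()); // prints false: no provider installed
        probe.setSnapshotProvider(() -> Map.of(1L, 600L, 2L, 300L));
        System.out.println(probe.poolUsageBytes()); // prints 900
        System.out.println(probe.isInDuress());     // prints true: 900 >= 800
    }
}
```

Because the sum covers only what the backend reports per task, memory outside the pool, such as Netty direct buffers or thread stacks, cannot push the probe into duress.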
@Bukhtawar force-pushed the fix-native-memory-duress-probe branch from 861c6e5 to e048863 on May 19, 2026 18:24