fix(proxy): exactly-once on ambiguous EDR failure, rollback retry + terminal state#27
Merged
Merged
Conversation
…erminal state Four production-grade fixes in the containment proxy, each confirmed by the prior audit. T63 / CNT-05 (P1): the nonce was released on ANY EdrError, including a transport drop AFTER the request reached the EDR, which could double-execute a containment on retry. EdrError now distinguishes a pre-send transport failure (connection refused, DNS: the action provably did NOT run, safe to release the nonce) from an ambiguous one (timeout after send, lost response: the action MAY have run). reqwest errors are classified at the dispatch boundary via is_connect(). On /execute an ambiguous failure now HOLDS the nonce and returns a distinct needs_reconciliation state (may_have_executed:true) plus an audit entry that makes clear the action may have executed; only a provable non-event releases the nonce for a clean retry. T64 / RB-01 (P1): a transient EDR 5xx / ambiguous timeout on /rollback used to return a generic 502 and lean entirely on the Python pager. /rollback now retries the idempotent contain/lift up to ROLLBACK_MAX_ATTEMPTS with a short linear backoff (4xx and pre-send are not retried), and on exhaustion emits a distinct terminal ROLLBACK_FAILED audit entry + response body the pager keys on. Audit-before-act ordering is preserved. T79 PRX-08 (P3): the proxy booted on a signing secret < 32 bytes with only a warning. It now fails closed (panic) in production, mirroring the no-secret panic, and keeps the warning for dev/CI. The check is extracted as a pure, unit-tested function. T79 PRX-07 (P3): the per-tenant path rebuilt a fresh CrowdstrikeClient per request, so the per-instance token cache never helped and every action paid a full OAuth round-trip. Tokens are now cached process-wide, keyed by credential identity (base_url + client_id + a hash of the secret, never the plaintext secret), so consecutive actions for the same tenant reuse a valid bearer while tenants stay isolated. Adds tests for the pre-send vs ambiguous classification, the nonce hold/release decision, the rollback retry + terminal ROLLBACK_FAILED path, the PRX-08 fail-closed rule, and the PRX-07 cross-request token reuse + isolation.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Four production-grade fixes in the containment proxy, each confirmed by the prior audit. No new dependencies; only
src/main.rsandsrc/edr.rschange.T63 / CNT-05 (P1) — double-execute on post-send transport failure
The nonce was released on ANY
EdrError, including a transport drop AFTER the request reached the EDR.nonce.rswarns to release "only when sure the side effect did NOT happen", so this was a latent double-execute on retry.EdrErrornow splitsTransportPreSend(connection refused, DNS, TLS handshake: the request never left, the action provably did not run) fromTransport(ambiguous: timeout after send, lost response, body-read error: the action MAY have run)..send()viais_connect(). A failed OAuth token fetch is always pre-send (the device-action call is never made without a bearer), so it is release-safe.EdrError::side_effect_definitely_did_not_happen()drives the nonce-release decision in the handler./execute, an ambiguous failure now HOLDS the nonce (a retry is deduped to 409, never re-run) and returns a distinctneeds_reconciliationbody withmay_have_executed: true, plus aNEEDS_RECONCILIATION_*audit entry that makes the ambiguity explicit. A provable pre-send non-event still releases the nonce and auditsFAILED_*.T64 / RB-01 (P1) — no in-proxy ROLLBACK_FAILED terminal state / no retry
A transient EDR 5xx / timeout on
/rollbackreturned a generic 502 and leaned entirely on the Python pager./rollbacknow retries the idempotent contain/lift up toROLLBACK_MAX_ATTEMPTS(3) with a short linear backoff. Only retry-safe failures (server 5xx, ambiguous transport) are retried; 4xx, pre-send, and unsupported/misconfigured are not.ROLLBACK_FAILEDaudit entry (ROLLBACK_FAILED_*) and arollback_failedresponse body the Python pager can key on.T79 PRX-08 (P3) — weak signing secret
The proxy booted on a signing secret
< 32bytes with only awarn!. It now fails closed (panic) in production, mirroring the existing no-secret panic, and keeps the warning for dev/CI. Extracted as a pure, unit-testedsecret_strength_error.T79 PRX-07 (P3) — dead token cache
The per-tenant path rebuilt a fresh
CrowdstrikeClientper request, so the per-instanceMutex<Option<CachedToken>>never survived a request and every action paid a full OAuth round-trip. Tokens are now cached process-wide inCROWDSTRIKE_TOKEN_CACHE, keyed by credential identity (base_url+client_id+ a SHA-256 hash of the secret, never the plaintext secret). Consecutive actions for the same tenant reuse a valid bearer; distinct credentials never share one.Tests
/executepre-send releases the nonce (retry re-dispatches), ambiguous holds it (retry → 409) and flags reconciliation/rollbackretries a transient 5xx then succeeds; exhausts retries then emits terminalROLLBACK_FAILED(holds nonce on ambiguous 5xx)Gates (all green)
cargo fmt --check— cleancargo clippy --all-targets -- -D warnings— cleancargo test— 80 passed, 0 failedcargo audit— exit 0, no vulnerabilities (one pre-existing unmaintained-crate warning,rustls-pemfileviaaxum-server, unrelated to this PR)Deferred (out of scope for this PR, larger changes)
/audit/exportpagination (the matched slice is still materialised for the JSON array; bounding it needs pagination or per-tenant log shards).Mutexis held briefly across the append; reworking it is a larger change).