Skip to content

fix(proxy): exactly-once on ambiguous EDR failure, rollback retry + terminal state#27

Merged
keirsalterego merged 1 commit into
mainfrom
fix/proxy-exactly-once-and-rollback
Jun 15, 2026
Merged

fix(proxy): exactly-once on ambiguous EDR failure, rollback retry + terminal state#27
keirsalterego merged 1 commit into
mainfrom
fix/proxy-exactly-once-and-rollback

Conversation

@keirsalterego

Copy link
Copy Markdown
Contributor

Four production-grade fixes in the containment proxy, each confirmed by the prior audit. No new dependencies; only src/main.rs and src/edr.rs change.

T63 / CNT-05 (P1) — double-execute on post-send transport failure

The nonce was released on ANY EdrError, including a transport drop AFTER the request reached the EDR. nonce.rs warns to release "only when sure the side effect did NOT happen", so this was a latent double-execute on retry.

  • EdrError now splits TransportPreSend (connection refused, DNS, TLS handshake: the request never left, the action provably did not run) from Transport (ambiguous: timeout after send, lost response, body-read error: the action MAY have run).
  • reqwest errors are classified at each dispatch .send() via is_connect(). A failed OAuth token fetch is always pre-send (the device-action call is never made without a bearer), so it is release-safe.
  • EdrError::side_effect_definitely_did_not_happen() drives the nonce-release decision in the handler.
  • On /execute, an ambiguous failure now HOLDS the nonce (a retry is deduped to 409, never re-run) and returns a distinct needs_reconciliation body with may_have_executed: true, plus a NEEDS_RECONCILIATION_* audit entry that makes the ambiguity explicit. A provable pre-send non-event still releases the nonce and audits FAILED_*.

T64 / RB-01 (P1) — no in-proxy ROLLBACK_FAILED terminal state / no retry

A transient EDR 5xx / timeout on /rollback returned a generic 502 and leaned entirely on the Python pager.

  • /rollback now retries the idempotent contain/lift up to ROLLBACK_MAX_ATTEMPTS (3) with a short linear backoff. Only retry-safe failures (server 5xx, ambiguous transport) are retried; 4xx, pre-send, and unsupported/misconfigured are not.
  • On exhaustion (or a non-retryable failure) it emits a distinct terminal ROLLBACK_FAILED audit entry (ROLLBACK_FAILED_*) and a rollback_failed response body the Python pager can key on.
  • Audit-before-act ordering is preserved: the intent entry is written in step 5, every companion entry is written before the handler returns.

T79 PRX-08 (P3) — weak signing secret

The proxy booted on a signing secret < 32 bytes with only a warn!. It now fails closed (panic) in production, mirroring the existing no-secret panic, and keeps the warning for dev/CI. Extracted as a pure, unit-tested secret_strength_error.

T79 PRX-07 (P3) — dead token cache

The per-tenant path rebuilt a fresh CrowdstrikeClient per request, so the per-instance Mutex<Option<CachedToken>> never survived a request and every action paid a full OAuth round-trip. Tokens are now cached process-wide in CROWDSTRIKE_TOKEN_CACHE, keyed by credential identity (base_url + client_id + a SHA-256 hash of the secret, never the plaintext secret). Consecutive actions for the same tenant reuse a valid bearer; distinct credentials never share one.

Tests

  • pre-send vs ambiguous classification + release-safety predicate
  • /execute pre-send releases the nonce (retry re-dispatches), ambiguous holds it (retry → 409) and flags reconciliation
  • /rollback retries a transient 5xx then succeeds; exhausts retries then emits terminal ROLLBACK_FAILED (holds nonce on ambiguous 5xx)
  • PRX-08 fail-closed only in production
  • PRX-07 cross-request token reuse for the same credential, isolation across distinct credentials, key stability + secret sensitivity

Gates (all green)

  • cargo fmt --check — clean
  • cargo clippy --all-targets -- -D warnings — clean
  • cargo test — 80 passed, 0 failed
  • cargo audit — exit 0, no vulnerabilities (one pre-existing unmaintained-crate warning, rustls-pemfile via axum-server, unrelated to this PR)

Deferred (out of scope for this PR, larger changes)

  • PRX-09/audit/export pagination (the matched slice is still materialised for the JSON array; bounding it needs pagination or per-tenant log shards).
  • PRX-06 — audit-chain mutex (the chain Mutex is held briefly across the append; reworking it is a larger change).

…erminal state

Four production-grade fixes in the containment proxy, each confirmed by the
prior audit.

T63 / CNT-05 (P1): the nonce was released on ANY EdrError, including a
transport drop AFTER the request reached the EDR, which could double-execute a
containment on retry. EdrError now distinguishes a pre-send transport failure
(connection refused, DNS: the action provably did NOT run, safe to release the
nonce) from an ambiguous one (timeout after send, lost response: the action MAY
have run). reqwest errors are classified at the dispatch boundary via
is_connect(). On /execute an ambiguous failure now HOLDS the nonce and returns a
distinct needs_reconciliation state (may_have_executed:true) plus an audit entry
that makes clear the action may have executed; only a provable non-event
releases the nonce for a clean retry.

T64 / RB-01 (P1): a transient EDR 5xx / ambiguous timeout on /rollback used to
return a generic 502 and lean entirely on the Python pager. /rollback now
retries the idempotent contain/lift up to ROLLBACK_MAX_ATTEMPTS with a short
linear backoff (4xx and pre-send are not retried), and on exhaustion emits a
distinct terminal ROLLBACK_FAILED audit entry + response body the pager keys on.
Audit-before-act ordering is preserved.

T79 PRX-08 (P3): the proxy booted on a signing secret < 32 bytes with only a
warning. It now fails closed (panic) in production, mirroring the no-secret
panic, and keeps the warning for dev/CI. The check is extracted as a pure,
unit-tested function.

T79 PRX-07 (P3): the per-tenant path rebuilt a fresh CrowdstrikeClient per
request, so the per-instance token cache never helped and every action paid a
full OAuth round-trip. Tokens are now cached process-wide, keyed by credential
identity (base_url + client_id + a hash of the secret, never the plaintext
secret), so consecutive actions for the same tenant reuse a valid bearer while
tenants stay isolated.

Adds tests for the pre-send vs ambiguous classification, the nonce
hold/release decision, the rollback retry + terminal ROLLBACK_FAILED path, the
PRX-08 fail-closed rule, and the PRX-07 cross-request token reuse + isolation.
@keirsalterego keirsalterego merged commit 4fbdbbe into main Jun 15, 2026
2 checks passed
@keirsalterego keirsalterego deleted the fix/proxy-exactly-once-and-rollback branch June 15, 2026 06:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant