perf(proxy): open audit file outside the chain lock (T72) #28

Merged
keirsalterego merged 2 commits into main from perf/audit-open-outside-lock on Jun 15, 2026
@keirsalterego

Contributor

append_audit held the chain mutex across open + write + fsync, so a slow file open ran under the lock and stalled other tenants' actions. This change moves OpenOptions::open ahead of the lock; only the hash-link, write, and fsync stay inside it, so on-disk order still matches chain order. Concurrent appends open their own append-mode handles and serialize only for the durable write. cargo fmt/clippy/test all green (80 tests).

…erminal state

Four production-grade fixes in the containment proxy, each confirmed by the
prior audit.

T63 / CNT-05 (P1): the nonce was released on ANY EdrError, including a
transport drop AFTER the request reached the EDR, which could double-execute a
containment on retry. EdrError now distinguishes a pre-send transport failure
(connection refused, DNS: the action provably did NOT run, safe to release the
nonce) from an ambiguous one (timeout after send, lost response: the action MAY
have run). reqwest errors are classified at the dispatch boundary via
is_connect(). On /execute an ambiguous failure now HOLDS the nonce and returns a
distinct needs_reconciliation state (may_have_executed:true) plus an audit entry
that makes clear the action may have executed; only a provable non-event
releases the nonce for a clean retry.
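
The hold/release rule above can be sketched roughly as follows. This is a minimal illustration, not the proxy's code: the variant names, `classify_transport`, `ExecuteFailure`, and `on_execute_error` are hypothetical, and the boolean argument stands in for the result of reqwest's `Error::is_connect()` so the sketch needs no external crates.

```rust
/// Hypothetical split of a transport failure, mirroring the description above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EdrError {
    /// Connection refused, DNS failure: the request never reached the EDR.
    PreSend(String),
    /// Timeout after send, lost response: the action may have run.
    Ambiguous(String),
}

/// What /execute reports on failure (names are assumptions for illustration).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExecuteFailure {
    /// Provable non-event: the nonce is released and a clean retry is safe.
    RetrySafe,
    /// The nonce stays held until someone reconciles against the EDR.
    NeedsReconciliation { may_have_executed: bool },
}

/// `is_connect` stands in for `reqwest::Error::is_connect()` evaluated at the
/// dispatch boundary. Anything that is not a connect failure is treated as
/// ambiguous, which is the conservative choice.
pub fn classify_transport(is_connect: bool, detail: &str) -> EdrError {
    if is_connect {
        EdrError::PreSend(detail.to_string())
    } else {
        EdrError::Ambiguous(detail.to_string())
    }
}

/// Decide the response state for a failed /execute dispatch.
pub fn on_execute_error(err: &EdrError) -> ExecuteFailure {
    match err {
        EdrError::PreSend(_) => ExecuteFailure::RetrySafe,
        EdrError::Ambiguous(_) => ExecuteFailure::NeedsReconciliation {
            may_have_executed: true,
        },
    }
}

/// Only a provable non-event releases the nonce.
pub fn releases_nonce(outcome: &ExecuteFailure) -> bool {
    matches!(outcome, ExecuteFailure::RetrySafe)
}
```

Keeping the decision in a pure function like `on_execute_error` makes the "hold on doubt" default easy to unit-test without a live EDR.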

T64 / RB-01 (P1): a transient EDR 5xx / ambiguous timeout on /rollback used to
return a generic 502 and lean entirely on the Python pager. /rollback now
retries the idempotent contain/lift up to ROLLBACK_MAX_ATTEMPTS with a short
linear backoff (4xx and pre-send are not retried), and on exhaustion emits a
distinct terminal ROLLBACK_FAILED audit entry + response body the pager keys on.
Audit-before-act ordering is preserved.
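
A rough sketch of the retry loop described above, under stated assumptions: the value of `ROLLBACK_MAX_ATTEMPTS`, the backoff step, and the `LiftError` and `RollbackOutcome` types are all hypothetical, and the audit-before-act write is omitted for brevity. The EDR call is passed in as a closure so the sketch stays self-contained.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Assumed value; the source only names the constant.
pub const ROLLBACK_MAX_ATTEMPTS: u32 = 3;
/// Assumed step for the linear backoff (attempt n waits n * step).
const BACKOFF_STEP: Duration = Duration::from_millis(10);

/// Hypothetical failure classes for the lift call.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LiftError {
    PreSend,     // never reached the EDR
    Client(u16), // 4xx
    Server(u16), // 5xx
    Ambiguous,   // timeout after send
}

impl LiftError {
    /// Per the description: 4xx and pre-send failures are not retried.
    fn is_retryable(&self) -> bool {
        matches!(self, LiftError::Server(_) | LiftError::Ambiguous)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum RollbackOutcome {
    Lifted { attempts: u32 },
    /// Non-retryable error, surfaced immediately.
    Rejected(LiftError),
    /// Terminal state: the caller would emit the ROLLBACK_FAILED audit entry.
    RollbackFailed { attempts: u32 },
}

pub fn rollback_with_retry<F>(mut lift: F) -> RollbackOutcome
where
    F: FnMut() -> Result<(), LiftError>,
{
    for attempt in 1..=ROLLBACK_MAX_ATTEMPTS {
        match lift() {
            Ok(()) => return RollbackOutcome::Lifted { attempts: attempt },
            Err(e) if !e.is_retryable() => return RollbackOutcome::Rejected(e),
            // Retryable and attempts remain: wait a little longer each time.
            Err(_) if attempt < ROLLBACK_MAX_ATTEMPTS => sleep(BACKOFF_STEP * attempt),
            // Retryable but out of attempts: fall through to the terminal state.
            Err(_) => {}
        }
    }
    RollbackOutcome::RollbackFailed { attempts: ROLLBACK_MAX_ATTEMPTS }
}
```

Retrying is only reasonable here because the lift is described as idempotent; a non-idempotent action would need the nonce-hold treatment instead.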

T79 / PRX-08 (P3): the proxy booted on a signing secret shorter than 32 bytes with only a
warning. It now fails closed (panic) in production, mirroring the no-secret
panic, and keeps the warning for dev/CI. The check is extracted as a pure,
unit-tested function.
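
One way the extracted check might look, as a sketch only: `check_signing_secret`, `SecretCheck`, and `enforce_signing_secret` are hypothetical names, and how the proxy detects "production" is not stated in the source, so it is passed in as a plain boolean.

```rust
/// Minimum length from the description above.
pub const MIN_SIGNING_SECRET_BYTES: usize = 32;

/// Result of the pure check; the caller decides how to act on it.
#[derive(Debug, PartialEq, Eq)]
pub enum SecretCheck {
    Ok,
    /// Short secret outside production: log a warning and continue.
    Warn { len: usize },
    /// Short secret in production: refuse to boot.
    Refuse { len: usize },
}

/// Pure function: no logging, no panics, so it is trivial to unit-test.
pub fn check_signing_secret(secret: &[u8], production: bool) -> SecretCheck {
    let len = secret.len();
    if len >= MIN_SIGNING_SECRET_BYTES {
        SecretCheck::Ok
    } else if production {
        SecretCheck::Refuse { len }
    } else {
        SecretCheck::Warn { len }
    }
}

/// Boot-time wrapper that applies the side effects (assumed shape).
pub fn enforce_signing_secret(secret: &[u8], production: bool) {
    match check_signing_secret(secret, production) {
        SecretCheck::Ok => {}
        SecretCheck::Warn { len } => eprintln!(
            "warning: signing secret is {len} bytes, expected at least {MIN_SIGNING_SECRET_BYTES}"
        ),
        SecretCheck::Refuse { len } => panic!(
            "signing secret is {len} bytes, refusing to start with fewer than {MIN_SIGNING_SECRET_BYTES}"
        ),
    }
}
```

Separating the decision from the panic keeps the fail-closed rule testable without having to catch an unwinding panic in the test suite.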

T79 / PRX-07 (P3): the per-tenant path rebuilt a fresh CrowdstrikeClient per
request, so the per-instance token cache never helped and every action paid a
full OAuth round-trip. Tokens are now cached process-wide, keyed by credential
identity (base_url + client_id + a hash of the secret, never the plaintext
secret), so consecutive actions for the same tenant reuse a valid bearer while
tenants stay isolated.
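
A minimal sketch of a process-wide token cache keyed by credential identity. Everything here is illustrative: `CredentialKey`, `bearer_for`, and the `fetch` closure (standing in for the OAuth round-trip) are hypothetical, and the secret digest uses the standard library's non-cryptographic `DefaultHasher` only to keep the example dependency-free; a real implementation would presumably use a cryptographic hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Mutex, OnceLock};
use std::time::{Duration, Instant};

/// Cache key: identifies a credential without storing the plaintext secret.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct CredentialKey {
    base_url: String,
    client_id: String,
    secret_digest: u64,
}

impl CredentialKey {
    pub fn new(base_url: &str, client_id: &str, secret: &str) -> Self {
        // Stand-in digest (NOT cryptographic); only the digest is retained.
        let mut h = DefaultHasher::new();
        secret.hash(&mut h);
        Self {
            base_url: base_url.to_string(),
            client_id: client_id.to_string(),
            secret_digest: h.finish(),
        }
    }
}

struct CachedToken {
    bearer: String,
    expires_at: Instant,
}

/// Process-wide cache, initialized on first use.
fn cache() -> &'static Mutex<HashMap<CredentialKey, CachedToken>> {
    static CACHE: OnceLock<Mutex<HashMap<CredentialKey, CachedToken>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Return a cached bearer if still valid, otherwise call `fetch` and cache it.
/// Simplification: the lock is held across `fetch`, which serializes token
/// fetches across all tenants; a real cache would likely avoid that.
pub fn bearer_for<F>(key: &CredentialKey, fetch: F) -> String
where
    F: FnOnce() -> (String, Duration),
{
    let mut map = cache().lock().expect("token cache mutex poisoned");
    if let Some(tok) = map.get(key) {
        if tok.expires_at > Instant::now() {
            return tok.bearer.clone();
        }
    }
    let (bearer, ttl) = fetch();
    map.insert(
        key.clone(),
        CachedToken { bearer: bearer.clone(), expires_at: Instant::now() + ttl },
    );
    bearer
}
```

Because the secret digest is part of the key, rotating a tenant's secret produces a new key and therefore a fresh token, while two tenants never share an entry.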

Adds tests for the pre-send vs ambiguous classification, the nonce
hold/release decision, the rollback retry + terminal ROLLBACK_FAILED path, the
PRX-08 fail-closed rule, and the PRX-07 cross-request token reuse + isolation.

append_audit held the chain mutex across open + write + fsync, so a slow
file open ran under the lock and stalled other tenants' actions. Move
OpenOptions::open ahead of the lock; only the hash-link, write, and fsync
stay inside it, so on-disk order still matches chain order. Concurrent
appends open their own append-mode handles (the kernel append flag keeps
writes from stomping on each other) and serialize only for the durable
write.
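
The reordering can be sketched as below. This is an illustration under assumptions, not the merged code: the `Chain` struct, `link_hash`, the line format, and the `append_audit` signature are hypothetical, and the link hash uses the non-cryptographic `DefaultHasher` purely as a stand-in.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::OpenOptions;
use std::hash::{Hash, Hasher};
use std::io::Write;
use std::path::Path;
use std::sync::Mutex;

/// Hypothetical chain state: the hash of the last entry written.
pub struct Chain {
    pub prev_hash: u64,
}

/// Stand-in link hash (NOT cryptographic), for illustration only.
fn link_hash(prev: u64, payload: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

pub fn append_audit(chain: &Mutex<Chain>, path: &Path, payload: &str) -> std::io::Result<u64> {
    // Potentially slow step, done BEFORE taking the lock so a stalled open
    // cannot block other appenders. Each call gets its own append-mode handle.
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;

    // Critical section: hash-link, write, fsync. Keeping all three under the
    // lock is what makes on-disk order match chain order.
    let mut guard = chain.lock().expect("chain mutex poisoned");
    let hash = link_hash(guard.prev_hash, payload);
    let line = format!("{:016x} {:016x} {}\n", guard.prev_hash, hash, payload);
    file.write_all(line.as_bytes())?;
    file.sync_all()?;

    // Advance the chain only after the entry is durable.
    guard.prev_hash = hash;
    Ok(hash)
}
```

If the open fails, the function returns before the lock is ever taken, so an unreachable audit path no longer holds up other tenants at all.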

cargo fmt/clippy/test all green (80 tests).
keirsalterego merged commit c161746 into main on Jun 15, 2026
2 checks passed
keirsalterego deleted the perf/audit-open-outside-lock branch on June 15, 2026 12:41