Close out #707: mirror ingest invariants on every seam, enforce quarantine, bound the deferred-peel lifecycle by erskingardner · Pull Request #726 · marmot-protocol/mdk

erskingardner · 2026-07-04T14:36:52Z

Closes the remaining open children of tracking issue #707 (one convergence contract: every application-visible invariant enforced identically on the direct-ingest, stored-convergence/replay, and fork-recovery seams). Four commits, one per cluster.

Fixes #383
Fixes #394
Fixes #369
Fixes #364
Fixes #365
Fixes #339
Fixes #707

Commit 1 — seam parity (#383)

The replay seam could emit MessageReceived with an empty, unauthenticated sender (S3 regression): the ApplicationMessage arm defaulted an unresolvable MLS sender to empty bytes and validated the inner event against hex(""), so a crafted MarmotAppEvent with pubkey: "" surfaced with a blank author. The per-seam validator pairs are collapsed into the shared chokepoints (identity::member_id_of_sender, typed app_payload::validate_app_payload_for_sender — which now rejects an empty MemberId outright), the replay arm pushes Ignored for anything that fails attribution (terminal via existing canonicalization), and emit_application_replay_events skips empty senders as a backstop. The mirror-on-all-seams contract is now documented as a hard invariant in crates/cgka-engine/AGENTS.md.

Commit 2 — fail-closed hardening (#394, #369)

The fork-recovery missing-snapshot branch no longer panics via unreachable!() on an invariant break reachable from attacker-delivered same-epoch fork commits; both arms share one fail-closed helper (detect_fork + audit + EpochInvalidated + typed ForkedEpoch).
persist_stored_message_payload's idempotency short-circuit now compares the encoded payload bytes, so a same-(group, epoch, state) persist under a different StoredMessagePayload variant overwrites instead of silently keeping the old variant. Verified independent of the Materialize own confirmed commits as pre-validated convergence candidates (#706) #723 OwnCommitWire enrichment (which writes via put_message directly).

Commit 3 — quarantine enforcement (#364, #365)

quarantined_groups is now enforced on every live surface through one chokepoint (Engine::ensure_group_live): all durable/MLS accessors return UnknownGroup (matching epoch() and the documented "vanish" contract — including group_context(), which previously exported a quarantined group's exporter secrets); send, convergence, and queued-intent drains refuse to run; inbound input is retained durably (PeelDeferred) and classified with the new StaleReason::Quarantined; nothing can set_stable a quarantined group out of band. Repair via retry_hydrate_quarantined_group or an authenticated re-join welcome clears the quarantine and replays the retained input.

Commit 4 — deferred-peel lifecycle (#339, full scope)

PeelDeferred rows were retried forever at O(rows × snapshots) per convergence pass and grew the durable store without bound under a re-wrap flood (raw transport ids are attacker-controllable). The lifecycle is now: context-fingerprint gating (retry only when the (epoch, snapshot-set) peel context actually changes), a per-row retry budget → terminal permanently_undecryptable, a per-group cap of 256 retained rows (overflow dropped unpersisted), bounded 64-row sweeps so backlog never starves current events, and post-peel pre-membership classification via the new Group.join_epoch (pre_membership_event vs valid_history_snapshot_missing). One MessageDisposition taxonomy supplies the audit reason tags; MessageStateChanged gains optional retry_count/sweeps_waited, and each sweep emits an aggregate privacy-safe tracing line with queue depth and duration.

Note on #339's "identify pre-membership before peel" criterion: the transport wrapper carries no epoch, so pre-peel classification is physically impossible; the fix achieves the criterion's intent (no unbounded retries, no cryptographic amplification, terminal classification) via the gate + budget + post-peel classification.

Verification

just fast-ci green; full workspace suite green (1992 tests, 0 failed); clippy clean across the workspace. New coverage: app_payload unit tests + convergence_app_message_with_unresolvable_sender_emits_no_event_and_lands_terminal (#383), same_key_persist_with_different_payload_variant_overwrites (#369), five quarantine-enforcement tests incl. the #707 acceptance-criterion test quarantined_group_rejects_valid_inbound_commit_on_do_ingest (#364/#365), and six lifecycle tests in the new tests/deferred_peel_lifecycle.rs (#339) with send_preflight_retries_deferred_peels_after_convergence_apply unmodified.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added hydration-quarantine handling: quarantined groups are blocked until recovered via retry hydration or authenticated re-join, with replay resuming afterward.
- Enhanced deferred-peel lifecycle for more reliable retry/recovery, including bounded retries and improved convergence behavior.
Bug Fixes
- Empty or unresolvable senders are now rejected consistently, preventing invalid application events from being accepted or surfaced in group events.
- Improved message storage idempotency so updated payload variants are preserved correctly.
- Added more accurate group join-epoch metadata to improve join, replay, and recovery consistency.

…eam (#383) The stored-convergence replay path could emit MessageReceived with an empty, unauthenticated MemberId: the ApplicationMessage arm defaulted an unresolvable MLS sender to empty bytes and validated the inner event against hex("") — so a crafted MarmotAppEvent with pubkey = "" surfaced to the application with a blank author, regressing audit item S3 that hardened the direct ingest seam against exactly this. Collapse the per-seam validator pairs into the shared chokepoints the mirror-on-all-seams contract requires: - Delete openmls_projection::sender_member_id; the replay path now uses identity::member_id_of_sender like the direct path. - Delete the bool app_payload_is_valid_for_sender; both seams share the typed validate_app_payload_for_sender, which now rejects an empty MemberId outright so an unattributable message can never validate. - The replay ApplicationMessage arm pushes ApplicationProcessed only for a resolvable, validated sender; anything else is Ignored, which keeps the message out of app_messages_by_id and lands it terminal (EpochInvalidated) once its epoch is at or below the settled tip. - emit_application_replay_events skips empty-sender observations as a belt-and-suspenders backstop. Document the mirror-on-all-seams convention (direct ingest, stored-convergence/replay, fork recovery + the shared chokepoints) as a hard invariant in AGENTS.md and update audit item S3. Tests: app_payload unit tests pin the chokepoint (empty sender rejected even when the inner pubkey is empty — the exploit precondition); convergence_app_message_with_unresolvable_sender_emits_no_event_and_lands_terminal drives the forged payload through the real ingest → convergence → replay path and asserts no MessageReceived with an empty MemberId and a terminal stored state. Part of #707. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…sage idempotency (#394, #369) Two independent fail-closed/correctness fixes on the ingest path: - #394: the fork-recovery missing-snapshot branch panicked via unreachable!() if resolve_fork_candidate ever returned a resolution while recovery_snapshot_name_for_fork saw no snapshot — an implicit coupling across two fork_recovery lookups, on a path reachable from attacker-delivered same-epoch fork commits. The duplicated MissingSnapshot handling is factored into forked_epoch_fail_closed (detect fork, audit, invalidate the stored message, typed EngineError::ForkedEpoch), and the former-unreachable arm now fails closed through it with a privacy-safe warning instead of panicking. Matches the documented "ForkedEpoch is the fallback for missing snapshots or unrecoverable shapes" rule; behavior on the real MissingSnapshot arms is unchanged (covered by tests/fork_detection.rs). - #369: persist_stored_message_payload short-circuited on a (group, epoch, state) match without comparing the stored payload, so a second persist of the same id under a different StoredMessagePayload variant (RawTransport vs OpenMlsWire) was silently dropped and a reader on the other processing path lost the record. The short-circuit now also compares the encoded payload bytes and overwrites on any difference. Verified unaffected: the #723 OwnCommitWire enrichment writes through storage.put_message directly and never routes through this helper. Part of #707. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

quarantined_groups was consulted only at session-open hydration and in retry_hydrate_quarantined_group. The live data path never checked it: a quarantined group with loadable OpenMLS state kept ingesting commits (can_ingest treats a missing epoch-state entry as ingestible), the merge called set_stable and silently re-activated the group out of band, the convergence path fell back to the durable record epoch, the outbound send/drain path could stage and confirm commits, and every durable-record accessor leaked the "vanished" group's data — members() returned the full roster while epoch() said UnknownGroup, and group_context() would export a quarantined group's exporter secrets. Enforce the documented "vanish" contract through one grep-auditable chokepoint, Engine::ensure_group_live: - Accessors (members, group_record, group_context, admin_pubkeys, app_component, feature_status, upgradeable/upgrade capabilities, the safe-export family, own_leaf_index) return UnknownGroup, matching epoch(). - do_send and converge_and_drain_queued_outbound_intents refuse to run; converge_stored_openmls_messages reports a Blocked no-op run; retry_deferred_peels skips the group (plus a defensive match arm so the catch-all "mark Processed" arm can never retire the replay buffer). - ingest_group_message classifies input for a quarantined group as the new StaleReason::Quarantined and retains the raw transport durably as PeelDeferred — excluded from settlement math, replayed automatically once repair clears the quarantine. - hydrate_one_stored_group indexes the transport routing id right after the MLS load (before validation) so inbound for a quarantined-but-loadable group still resolves and can be retained. - retry_hydrate_quarantined_group schedules the recovered group for the app's convergence drain, which replays the retained input. - An authenticated re-join welcome for a quarantined id clears the quarantine (it re-validates every leaf — strictly stronger than retry-hydrate) and emits GroupHydrationRecovered, so an unrepairable group can always be recovered by re-invite. Tests cover the #707 acceptance criterion (quarantined group rejects a valid inbound commit on ingest AND convergence, stays quarantined until explicit repair, epoch()/members() agree), the full accessor sweep, the post-repair replay catch-up, and the re-join recovery path. Part of #707. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

PeelDeferred rows were retried forever: every convergence drain re-ran every retained row through a full rollback+peel probe per retained snapshot (O(rows x snapshots) per pass, up to 16 passes), the only terminal exit was the post-peel too_distant_in_the_past classification that garbage never reaches, and rows are keyed on attacker-controllable raw transport ids, so a peer flooding undecryptable group-routed events grew the durable store without bound. Pre-membership history hit the same treadmill: nothing distinguished never-decryptable-by-design from recoverable-historical. Layered lifecycle, all consistent with the physics that a failed peel can only start succeeding when the (epoch, retained-snapshot-set) peel context changes: - Event-driven retries: a per-group context fingerprint is armed after a full unproductive cycle over the backlog and whole sweeps are skipped until the context actually changes. New deferrals don't clear it — a row is only deferred after failing a live peel under the same context. - Retry budget: a row that exhausts MAX_DEFERRED_PEEL_ATTEMPTS (32, test-tunable via set_deferred_peel_retry_budget) without ever peeling goes terminal Failed (permanently_undecryptable). Redelivery under a fresh transport id remains the safety valve. - Flood cap: at MAX_PEEL_DEFERRED_ROWS_PER_GROUP (256) retained rows, new undecryptable input is dropped unpersisted (audit Rejection, in-process dedup) — no unbounded durable growth. Applies to the quarantine replay buffer too. - Bounded sweeps: at most 64 rows per sweep with a resume cursor, so a large historical backlog never starves current-event processing. - Pre-membership classification: new Group.join_epoch (serde default; no SQL migration — records are JSON blobs) written at create/join. An application message whose MLS epoch precedes it is terminal before any OpenMLS processing (pre_membership_event); the too_distant_in_the_past site refines to pre_membership_event vs valid_history_snapshot_missing. Creators record EpochId(0) = no bound. - One disposition taxonomy (message_disposition.rs) supplies the stable audit reason tags across both seams. - Telemetry: MessageStateChanged gains optional retry_count and sweeps_waited (schema updated; per-event data stays in the local forensic audit log); each sweep emits one aggregate privacy-safe tracing line (rows_attempted, backlog, progressed, terminal, queue_depth, sweep_duration_ms). Tests in tests/deferred_peel_lifecycle.rs cover the unchanged-context gate, retry-after-epoch-advance, budget-terminal, flood cap, terminal pre-membership classification, and join-epoch bookkeeping; the existing send_preflight_retries_deferred_peels_after_convergence_apply stays green unmodified. Part of #707. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cursor · 2026-07-04T14:36:56Z

Bugbot is not enabled for your account, so this pull request was not reviewed.

Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs.

coderabbitai · 2026-07-04T14:37:21Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 37137fc3-34f6-4e32-ad96-3a0dd19edc52

📥 Commits

Reviewing files that changed from the base of the PR and between 974c2de and 9701fa8.

📒 Files selected for processing (5)

crates/cgka-engine/src/engine.rs
crates/cgka-engine/src/message_processor/ingest.rs
crates/cgka-engine/src/message_processor/mod.rs
crates/cgka-engine/tests/distributed_convergence.rs
crates/cgka-engine/tests/hydration_quarantine.rs

🚧 Files skipped from review as they are similar to previous changes (3)

crates/cgka-engine/tests/hydration_quarantine.rs
crates/cgka-engine/src/message_processor/mod.rs
crates/cgka-engine/src/engine.rs

Walkthrough

This PR adds join-epoch tracking to groups, enforces hydration quarantine across live access, ingest, and convergence paths, introduces deferred-peel retry state and audit fields, aligns sender validation across ingest and replay seams, and fixes stored-message payload overwrite handling.

Changes

Engine hardening: quarantine, deferred-peel retry, sender parity

Layer / File(s)	Summary
Group join epoch tracking `crates/traits/src/group.rs`, `crates/cgka-engine/src/group_lifecycle.rs`, `crates/cgka-conformance-simulator/tests/openmls_replay_probe.rs`, `crates/marmot-account/tests/runtime.rs`, `crates/storage-sqlite/src/storage/test_support.rs`, `crates/traits/tests/snapshots.rs`	Adds `join_epoch` to `Group` and populates it in creation/join paths, fixtures, and snapshots.
Hydration quarantine enforcement `crates/cgka-engine/src/engine.rs`, `crates/cgka-engine/src/message_processor/mod.rs`, `crates/cgka-engine/src/message_processor/ingest.rs`, `crates/cgka-engine/src/distributed_convergence.rs`, `crates/cgka-engine/src/group_lifecycle.rs`, `crates/traits/src/ingest.rs`, `crates/cgka-engine/AGENTS.md`, `crates/cgka-engine/tests/hydration_quarantine.rs`	Adds quarantine helpers, gates live accessors and outbound/ingest/convergence paths, clears quarantine on repair or welcome re-join, and updates quarantine-focused tests and guidance.
Deferred-peel retry lifecycle `crates/cgka-engine/src/message_disposition.rs`, `crates/cgka-engine/src/message_processor/mod.rs`, `crates/cgka-engine/src/message_processor/ingest.rs`, `crates/cgka-engine/src/audit_helpers.rs`, `crates/marmot-forensics/src/audit.rs`, `crates/marmot-forensics/schema/audit-log-event.v2.schema.json`, `crates/marmot-forensics/src/audit/tests.rs`, `crates/cgka-engine/tests/deferred_peel_lifecycle.rs`	Adds deferred-peel classification, retry budgeting, sweep gating, retirement bookkeeping, audit fields, schema updates, and lifecycle coverage.
Sender validation parity and fail-closed replay `crates/cgka-engine/src/app_payload.rs`, `crates/cgka-engine/src/openmls_projection.rs`, `crates/cgka-engine/src/distributed_convergence.rs`, `crates/cgka-engine/src/message_processor/ingest.rs`, `crates/cgka-engine/AGENTS.md`, `crates/cgka-engine/tests/distributed_convergence.rs`	Rejects empty authenticated senders, reuses resolved sender identity in replay, skips empty-sender replay events, and routes fork recovery through a fail-closed error path.
Stored payload overwrite handling `crates/cgka-engine/src/message_processor/store.rs`	Changes the idempotency short-circuit to compare encoded payload bytes and adds a regression test for differing payload variants.

Estimated code review effort: 5 (Critical) | ~120 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the PR’s main themes: mirrored ingest invariants, quarantine enforcement, and deferred-peel bounding.
Linked Issues check	✅ Passed	The changes cover the linked fixes: replay sender validation, fail-closed fork recovery, payload overwrite semantics, quarantine enforcement, and bounded deferred-peel retries.
Out of Scope Changes check	✅ Passed	The modified files all support the same tracking-issue work, with tests and docs added alongside the functional fixes.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch claude/strange-rubin-4a814d

_{Comment @coderabbitai help to get the list of available commands.}

stage-review · 2026-07-04T14:37:35Z

Ready to review this PR? Stage has broken it down into 6 individual chapters for you:

	Title
1	Define shared message disposition taxonomy
2	Unify application payload validation across seams
3	Harden fail-closed logic and idempotency
4	Enforce hydration quarantine on live surfaces
5	Implement bounded deferred-peel lifecycle
6	Other changes

_{Chapters generated by Stage for commit 9701fa8 on Jul 4, 2026 3:43pm UTC.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/cgka-engine/src/engine.rs (1)

1365-1376: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Misplaced doc comment: set_recorder's doc now describes set_deferred_peel_retry_budget.

The new method was inserted between set_recorder's doc block (Lines 1365-1368) and set_recorder itself. Consecutive /// lines coalesce, so set_deferred_peel_retry_budget now carries two contradictory doc paragraphs (recorder replacement + retry budget) and set_recorder at Line 1377 is left undocumented.

📝 Proposed fix: move the recorder doc back onto set_recorder

-    /// Replace the installed forensic recorder on a live engine. Dropping the
-    /// prior recorder flushes and closes any file it held. Used to start or
-    /// stop audit logging in place when the audit switch is toggled, without
-    /// rebuilding the engine. Pass [`NoopRecorder`] to stop recording.
     /// Override the deferred-peel retry budget (mdk#339). Rows that
     /// exceed it without ever peeling transition to terminal `Failed`
     /// (`permanently_undecryptable`). Exposed so tests can exhaust the budget
     /// quickly; production uses the default.
     pub fn set_deferred_peel_retry_budget(&mut self, budget: u32) {
         self.deferred_peel_retry_budget = budget.max(1);
     }
 
+    /// Replace the installed forensic recorder on a live engine. Dropping the
+    /// prior recorder flushes and closes any file it held. Used to start or
+    /// stop audit logging in place when the audit switch is toggled, without
+    /// rebuilding the engine. Pass [`NoopRecorder`] to stop recording.
     pub fn set_recorder(&mut self, recorder: Box<dyn ForensicRecorder>) {
         self.recorder = recorder;
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/cgka-engine/src/engine.rs` around lines 1365 - 1376, The doc block
before set_deferred_peel_retry_budget is accidentally inheriting set_recorder’s
documentation, leaving set_recorder undocumented and giving
set_deferred_peel_retry_budget mixed semantics. Move the recorder-related ///
lines back so they immediately precede set_recorder, and keep only the
retry-budget paragraph attached to set_deferred_peel_retry_budget. Use the
set_recorder and set_deferred_peel_retry_budget method names to verify each doc
comment matches the correct function.

🧹 Nitpick comments (3)

crates/cgka-engine/tests/hydration_quarantine.rs (1)
605-645: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Duplicate test helpers across files.

route, welcome_for, and evolution are byte-for-byte duplicates of the same-named helpers in crates/cgka-engine/tests/deferred_peel_lifecycle.rs. A shared tests/support/mod.rs module already exists (e.g. proof_signer) — consider moving these there to avoid drift between the two copies.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/cgka-engine/tests/hydration_quarantine.rs` around lines 605 - 645, The
test helpers `route`, `welcome_for`, and `evolution` are duplicated in this test
module and `deferred_peel_lifecycle.rs`, so move the shared logic into the
existing `tests/support/mod.rs` alongside `proof_signer`. Update both test files
to import and use the shared helpers, keeping the function names and behavior
consistent to prevent future drift.
crates/cgka-engine/src/message_processor/ingest.rs (2)
452-483: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicate pre-membership classification condition.

join_epoch.0 > 0 && msg_epoch < join_epoch is repeated verbatim in both the early pre-processing short-circuit (Lines 458-460) and the post-TooDistantInThePast refinement (Line 551). Extracting a small shared helper (e.g. fn is_pre_membership(join_epoch, msg_epoch) -> bool) would prevent the two call sites from silently diverging if the bound logic changes later.
♻️ Proposed refactor
+    fn is_pre_membership_event(join_epoch: EpochId, msg_epoch: EpochId) -> bool {
+        join_epoch.0 > 0 && msg_epoch < join_epoch
+    }
+
     // at line ~459
-                let join_epoch = self.group_join_epoch(&group_id);
-                if join_epoch.0 > 0 && msg_epoch < join_epoch {
+                let join_epoch = self.group_join_epoch(&group_id);
+                if Self::is_pre_membership_event(join_epoch, msg_epoch) {

     // at line ~551
-                    let join_epoch = self.group_join_epoch(&group_id);
-                    let tag = if join_epoch.0 > 0 && msg_epoch < join_epoch {
+                    let join_epoch = self.group_join_epoch(&group_id);
+                    let tag = if Self::is_pre_membership_event(join_epoch, msg_epoch) {
Also applies to: 546-558
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/cgka-engine/src/message_processor/ingest.rs` around lines 452 - 483,
The pre-membership check in ingest::ingest_message is duplicated, so the logic
can drift between the early short-circuit and the later TooDistantInThePast
refinement. Factor the repeated join_epoch.0 > 0 && msg_epoch < join_epoch
condition into a small shared helper (for example, a method near
group_join_epoch or a private predicate in message_processor::ingest) and call
it from both sites to keep the classification behavior consistent.
280-308: 🚀 Performance & Scalability | 🔵 Trivial

LGTM on the flood-cap logic itself — correctly keyed on the pre-shadow raw transport id.

Separately: the Rejection audit event fired here (and at the parallel quarantine-cap path, lines 123-131) has no rate limiting. Since raw transport ids are attacker-controlled (per the adjacent comment), a sustained flood of undecryptable/quarantined-group input past the cap will still generate one audit write per rejected message indefinitely, even though durable storage growth is bounded. Worth considering a per-group/window throttle on Rejection emission for this path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/cgka-engine/src/message_processor/ingest.rs` around lines 280 - 308,
The flood-cap logic is fine, but the `Rejection` audit emission in
`ingest_group_message` and the matching quarantine-cap path is unthrottled, so
attacker-controlled raw transport ids can still trigger unlimited audit writes.
Add a per-group/window rate limit or suppression gate around
`self.audit_group(... Rejection ...)` in these cap-rejection branches, while
preserving the existing `seen_message_ids` and `StaleReason::PeelFailed`
behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/cgka-engine/src/engine.rs`:
- Around line 1365-1376: The doc block before set_deferred_peel_retry_budget is
accidentally inheriting set_recorder’s documentation, leaving set_recorder
undocumented and giving set_deferred_peel_retry_budget mixed semantics. Move the
recorder-related /// lines back so they immediately precede set_recorder, and
keep only the retry-budget paragraph attached to set_deferred_peel_retry_budget.
Use the set_recorder and set_deferred_peel_retry_budget method names to verify
each doc comment matches the correct function.

---

Nitpick comments:
In `@crates/cgka-engine/src/message_processor/ingest.rs`:
- Around line 452-483: The pre-membership check in ingest::ingest_message is
duplicated, so the logic can drift between the early short-circuit and the later
TooDistantInThePast refinement. Factor the repeated join_epoch.0 > 0 &&
msg_epoch < join_epoch condition into a small shared helper (for example, a
method near group_join_epoch or a private predicate in
message_processor::ingest) and call it from both sites to keep the
classification behavior consistent.
- Around line 280-308: The flood-cap logic is fine, but the `Rejection` audit
emission in `ingest_group_message` and the matching quarantine-cap path is
unthrottled, so attacker-controlled raw transport ids can still trigger
unlimited audit writes. Add a per-group/window rate limit or suppression gate
around `self.audit_group(... Rejection ...)` in these cap-rejection branches,
while preserving the existing `seen_message_ids` and `StaleReason::PeelFailed`
behavior.

In `@crates/cgka-engine/tests/hydration_quarantine.rs`:
- Around line 605-645: The test helpers `route`, `welcome_for`, and `evolution`
are duplicated in this test module and `deferred_peel_lifecycle.rs`, so move the
shared logic into the existing `tests/support/mod.rs` alongside `proof_signer`.
Update both test files to import and use the shared helpers, keeping the
function names and behavior consistent to prevent future drift.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f7460494-3c03-4110-8a93-6f11e236d4b4

📥 Commits

Reviewing files that changed from the base of the PR and between e712232 and 974c2de.

⛔ Files ignored due to path filters (2)

crates/traits/tests/snapshots/snapshots__group.snap is excluded by !**/*.snap
crates/traits/tests/snapshots/snapshots__stale_quarantined.snap is excluded by !**/*.snap

📒 Files selected for processing (24)

crates/cgka-conformance-simulator/tests/openmls_replay_probe.rs
crates/cgka-engine/AGENTS.md
crates/cgka-engine/src/app_payload.rs
crates/cgka-engine/src/audit_helpers.rs
crates/cgka-engine/src/distributed_convergence.rs
crates/cgka-engine/src/engine.rs
crates/cgka-engine/src/group_lifecycle.rs
crates/cgka-engine/src/lib.rs
crates/cgka-engine/src/message_disposition.rs
crates/cgka-engine/src/message_processor/ingest.rs
crates/cgka-engine/src/message_processor/mod.rs
crates/cgka-engine/src/message_processor/store.rs
crates/cgka-engine/src/openmls_projection.rs
crates/cgka-engine/tests/deferred_peel_lifecycle.rs
crates/cgka-engine/tests/distributed_convergence.rs
crates/cgka-engine/tests/hydration_quarantine.rs
crates/marmot-account/tests/runtime.rs
crates/marmot-forensics/schema/audit-log-event.v2.schema.json
crates/marmot-forensics/src/audit.rs
crates/marmot-forensics/src/audit/tests.rs
crates/storage-sqlite/src/storage/test_support.rs
crates/traits/src/group.rs
crates/traits/src/ingest.rs
crates/traits/tests/snapshots.rs

erskingardner · 2026-07-04T15:02:59Z

Adversarial review finding 1: crates/cgka-engine/src/message_processor/mod.rs:793-815

Including MessageState::PeelDeferred in replay_buffered_messages makes the re-join recovery path replay quarantine-retained inputs, but the success path still falls through Ok(_) => {}. If a retained raw row is successfully applied or terminally classified during do_join_welcome replay, the content-derived row is updated, but the original raw replay-buffer row remains durable PeelDeferred and its cap slot is never released. That leaves stale quarantine input in the retry lifecycle and can permanently consume the per-group cap after a re-join recovery.

Please mirror the retirement logic from retry_deferred_peels: when was_peel_deferred and the replay result is no longer buffered/quarantined, mark the raw row processed or otherwise terminal and call note_peel_deferred_row_retired.

erskingardner · 2026-07-04T15:03:06Z

Adversarial review finding 2: crates/cgka-engine/src/message_processor/mod.rs:535-564

retry_deferred_peels leaves every StaleReason::PeelFailed result in PeelDeferred, but ingest_group_message uses PeelFailed for more than "still cannot peel". It also returns PeelFailed after the raw row has successfully peeled and the content-derived message is terminally rejected, for example invalid/unattributable application payloads. In those cases this retry loop consumes an attempt, leaves the raw row durable as PeelDeferred, records no progress, and can arm the unchanged-context gate, so terminal input remains retained until some unrelated context change or eventual budget exhaustion.

That undermines the new lifecycle promise that post-peel terminal classifications leave the deferred queue. Please distinguish "peel still failed" from "peeled and terminally rejected" here, or make the terminal PeelFailed paths mark/retire the raw deferred row the same way the pre-membership/too-distant paths already call mark_raw_transport_message_failed_if_deferred.

…e, throttle cap-rejection audit - Move set_recorder's doc block back onto set_recorder; the new set_deferred_peel_retry_budget had been inserted between them, leaving set_recorder undocumented and giving the budget setter two contradictory doc paragraphs. - Replace group_join_epoch with a shared msg_is_pre_membership predicate so the pre-processing short-circuit and the too_distant_in_the_past refinement can't diverge on the join-epoch bound. - Throttle the deferred-peel / quarantine flood-cap Rejection audit to once per cap-full episode (re-armed when the backlog drops below the cap), so attacker-controlled transport ids flooding past the cap no longer emit one audit write per rejected message. Skipped (CodeRabbit nit): consolidating route/welcome_for/evolution into tests/support/mod.rs — that module is compiled into 17 integration-test binaries with no #![allow(dead_code)], so shared-but-unused helpers would fail the -D warnings gate; the duplication is contained to test scaffolding. just fast-ci green; cgka-engine suite green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cursor · 2026-07-04T15:12:46Z

Bugbot is not enabled for your account, so this pull request was not reviewed.

Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs.

erskingardner · 2026-07-04T15:12:58Z

Addressed the CodeRabbit review in ffaee58:

Misplaced doc comment (engine.rs, Minor) — fixed. set_recorder's doc block was split from it by the inserted set_deferred_peel_retry_budget; moved it back so each method carries its own doc.
Duplicate pre-membership condition (ingest.rs) — fixed. Replaced group_join_epoch with a shared msg_is_pre_membership predicate used by both the pre-processing short-circuit and the too_distant_in_the_past refinement, so the join-epoch bound can't diverge.
Unthrottled Rejection audit under flood (ingest.rs) — fixed. The flood-cap Rejection is now audited once per cap-full episode (re-armed via note_peel_deferred_row_retired once the backlog drops below the cap), so an attacker-paced flood past the cap no longer emits one audit write per rejected message. Applied to both the deferral-cap and quarantine-cap paths.
Duplicate test helpers (route/welcome_for/evolution) — skipped, with reason. tests/support/mod.rs is compiled into 17 integration-test binaries and has no #![allow(dead_code)], so shared-but-unused helpers would fail the -D warnings gate in the binaries that don't use them. The duplication is contained to test scaffolding, so I left the local copies in place rather than regress CI.

just fast-ci green; full cgka-engine suite green.

…#726 review) Two cap-slot leaks in the deferred-peel lifecycle, both keeping a raw PeelDeferred row (and its per-group cap slot) alive after the message was actually resolved: - Finding 2 (retry seam): ingest_group_message returns StaleReason::PeelFailed both when a peel genuinely still fails AND when the row peeled successfully but the content was terminally rejected (unattributable sender, invalid app payload, non-MLS body, welcome in a group envelope, stale-epoch-no-snapshot). retry_deferred_peels treated every PeelFailed as "still cannot peel" (no-op), so the terminal cases stayed durably PeelDeferred. Each terminal-after-peel site now calls mark_raw_transport_message_failed_if_deferred (mirroring the existing pre-membership / too-distant paths), so the raw row is marked Failed and released from the retry lifecycle; a no-op on the direct path. - Finding 1 (replay seam): replay_buffered_messages left a successfully applied (Ok(Processed)) or terminally reclassified PeelDeferred row in place via the Ok(_) => {} arm, leaking its cap slot after re-join / publish-confirm replay. The match now mirrors retry_deferred_peels: deferred rows are retired (marked Processed + note_peel_deferred_row_retired) on apply/buffer and terminal reclassification, and only left in place for still-un-peelable PeelFailed or Quarantined. Non-deferred (Created/ Retryable) rows keep their prior behavior. Tests: deferred_row_terminally_rejected_after_peel_leaves_the_deferred_queue (a forged unattributable payload deferred at a future epoch, then re-peeled and rejected, ends Failed not PeelDeferred) and rejoin_welcome_replays_and_retires_quarantine_retained_input (quarantine- retained input replayed by do_join_welcome leaves zero PeelDeferred rows). cgka-engine suite green; clippy + fmt clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cursor · 2026-07-04T15:42:30Z

Bugbot is not enabled for your account, so this pull request was not reviewed.

Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs.

erskingardner · 2026-07-04T15:42:47Z

Addressed both adversarial-review findings in 9701fa8. Both were real per-group cap-slot leaks — a raw PeelDeferred row kept alive (holding a cap slot) after the message was actually resolved.

Finding 2 (retry seam) — fixed at the root. ingest_group_message returns StaleReason::PeelFailed both when a peel genuinely still fails and when the row peeled successfully but the content was terminally rejected (unattributable sender, invalid app payload, non-MLS body, welcome-in-group-envelope, stale-epoch-no-snapshot). Each of those terminal-after-peel sites now calls mark_raw_transport_message_failed_if_deferred — exactly as the pre-membership / too-distant paths already do — so the raw row is marked Failed and released from the retry lifecycle (a no-op on the direct path). With that in place, retry_deferred_peels's PeelFailed no-op arm is correct: by the time it sees PeelFailed, a peeled-and-terminal row is already retired, and a genuinely-still-deferred row is correctly left in place.

Finding 1 (replay seam) — fixed. replay_buffered_messages's Ok(_) => {} arm now mirrors retry_deferred_peels: a was_peel_deferred row is retired (marked Processed + note_peel_deferred_row_retired) on apply/buffer and on terminal reclassification, and is left in place only for still-un-peelable PeelFailed or Quarantined. Non-deferred (Created/Retryable) rows keep their prior behavior.

Regression tests:

deferred_row_terminally_rejected_after_peel_leaves_the_deferred_queue — a forged unattributable payload deferred at a future epoch, then re-peeled after catch-up and rejected, ends Failed (not stuck PeelDeferred).
rejoin_welcome_replays_and_retires_quarantine_retained_input — input retained while quarantined and replayed by do_join_welcome leaves zero PeelDeferred rows.

cgka-engine suite green; clippy + fmt clean.

erskingardner and others added 4 commits July 4, 2026 14:40

This was referenced Jul 4, 2026

[Tracking] One convergence contract: mirror every ingest invariant on the stored-convergence/replay seam and rehydrate epoch state at restart #707

Closed

cgka-engine: third sender resolver in app_components.rs bypasses validate_credential_identity #728

Open

coderabbitai Bot reviewed Jul 4, 2026

View reviewed changes

erskingardner merged commit 3ab4ec4 into master Jul 4, 2026
21 checks passed

erskingardner deleted the claude/strange-rubin-4a814d branch July 4, 2026 16:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Close out #707: mirror ingest invariants on every seam, enforce quarantine, bound the deferred-peel lifecycle#726

Close out #707: mirror ingest invariants on every seam, enforce quarantine, bound the deferred-peel lifecycle#726
erskingardner merged 6 commits into
masterfrom
claude/strange-rubin-4a814d

erskingardner commented Jul 4, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

cursor Bot commented Jul 4, 2026

Uh oh!

coderabbitai Bot commented Jul 4, 2026 •

edited

Loading

Uh oh!

stage-review Bot commented Jul 4, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

erskingardner commented Jul 4, 2026

Uh oh!

erskingardner commented Jul 4, 2026

Uh oh!

cursor Bot commented Jul 4, 2026

Uh oh!

erskingardner commented Jul 4, 2026

Uh oh!

cursor Bot commented Jul 4, 2026

Uh oh!

erskingardner commented Jul 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

erskingardner commented Jul 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Commit 1 — seam parity (#383)

Commit 2 — fail-closed hardening (#394, #369)

Commit 3 — quarantine enforcement (#364, #365)

Commit 4 — deferred-peel lifecycle (#339, full scope)

Verification

Summary by CodeRabbit

Uh oh!

cursor Bot commented Jul 4, 2026

Uh oh!

coderabbitai Bot commented Jul 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

stage-review Bot commented Jul 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

erskingardner commented Jul 4, 2026

Uh oh!

erskingardner commented Jul 4, 2026

Uh oh!

cursor Bot commented Jul 4, 2026

Uh oh!

erskingardner commented Jul 4, 2026

Uh oh!

cursor Bot commented Jul 4, 2026

Uh oh!

erskingardner commented Jul 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

erskingardner commented Jul 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 4, 2026 •

edited

Loading

stage-review Bot commented Jul 4, 2026 •

edited

Loading