Skip to content

Close out #707: mirror ingest invariants on every seam, enforce quarantine, bound the deferred-peel lifecycle#726

Merged
erskingardner merged 6 commits into
masterfrom
claude/strange-rubin-4a814d
Jul 4, 2026
Merged

Close out #707: mirror ingest invariants on every seam, enforce quarantine, bound the deferred-peel lifecycle#726
erskingardner merged 6 commits into
masterfrom
claude/strange-rubin-4a814d

Conversation

@erskingardner

@erskingardner erskingardner commented Jul 4, 2026

Copy link
Copy Markdown
Member

Closes the remaining open children of tracking issue #707 (one convergence contract: every application-visible invariant enforced identically on the direct-ingest, stored-convergence/replay, and fork-recovery seams). Four commits, one per cluster.

Fixes #383
Fixes #394
Fixes #369
Fixes #364
Fixes #365
Fixes #339
Fixes #707

Commit 1 — seam parity (#383)

The replay seam could emit MessageReceived with an empty, unauthenticated sender (S3 regression): the ApplicationMessage arm defaulted an unresolvable MLS sender to empty bytes and validated the inner event against hex(""), so a crafted MarmotAppEvent with pubkey: "" surfaced with a blank author. The per-seam validator pairs are collapsed into the shared chokepoints (identity::member_id_of_sender, typed app_payload::validate_app_payload_for_sender — which now rejects an empty MemberId outright), the replay arm pushes Ignored for anything that fails attribution (terminal via existing canonicalization), and emit_application_replay_events skips empty senders as a backstop. The mirror-on-all-seams contract is now documented as a hard invariant in crates/cgka-engine/AGENTS.md.

Commit 2 — fail-closed hardening (#394, #369)

  • The fork-recovery missing-snapshot branch no longer panics via unreachable!() on an invariant break reachable from attacker-delivered same-epoch fork commits; both arms share one fail-closed helper (detect_fork + audit + EpochInvalidated + typed ForkedEpoch).
  • persist_stored_message_payload's idempotency short-circuit now compares the encoded payload bytes, so a same-(group, epoch, state) persist under a different StoredMessagePayload variant overwrites instead of silently keeping the old variant. Verified independent of the Materialize own confirmed commits as pre-validated convergence candidates (#706) #723 OwnCommitWire enrichment (which writes via put_message directly).

Commit 3 — quarantine enforcement (#364, #365)

quarantined_groups is now enforced on every live surface through one chokepoint (Engine::ensure_group_live): all durable/MLS accessors return UnknownGroup (matching epoch() and the documented "vanish" contract — including group_context(), which previously exported a quarantined group's exporter secrets); send, convergence, and queued-intent drains refuse to run; inbound input is retained durably (PeelDeferred) and classified with the new StaleReason::Quarantined; nothing can set_stable a quarantined group out of band. Repair via retry_hydrate_quarantined_group or an authenticated re-join welcome clears the quarantine and replays the retained input.

Commit 4 — deferred-peel lifecycle (#339, full scope)

PeelDeferred rows were retried forever at O(rows × snapshots) per convergence pass and grew the durable store without bound under a re-wrap flood (raw transport ids are attacker-controllable). The lifecycle is now: context-fingerprint gating (retry only when the (epoch, snapshot-set) peel context actually changes), a per-row retry budget → terminal permanently_undecryptable, a per-group cap of 256 retained rows (overflow dropped unpersisted), bounded 64-row sweeps so backlog never starves current events, and post-peel pre-membership classification via the new Group.join_epoch (pre_membership_event vs valid_history_snapshot_missing). One MessageDisposition taxonomy supplies the audit reason tags; MessageStateChanged gains optional retry_count/sweeps_waited, and each sweep emits an aggregate privacy-safe tracing line with queue depth and duration.

Note on #339's "identify pre-membership before peel" criterion: the transport wrapper carries no epoch, so pre-peel classification is physically impossible; the fix achieves the criterion's intent (no unbounded retries, no cryptographic amplification, terminal classification) via the gate + budget + post-peel classification.

Verification

just fast-ci green; full workspace suite green (1992 tests, 0 failed); clippy clean across the workspace. New coverage: app_payload unit tests + convergence_app_message_with_unresolvable_sender_emits_no_event_and_lands_terminal (#383), same_key_persist_with_different_payload_variant_overwrites (#369), five quarantine-enforcement tests incl. the #707 acceptance-criterion test quarantined_group_rejects_valid_inbound_commit_on_do_ingest (#364/#365), and six lifecycle tests in the new tests/deferred_peel_lifecycle.rs (#339) with send_preflight_retries_deferred_peels_after_convergence_apply unmodified.

🤖 Generated with Claude Code


Open in Stage

Summary by CodeRabbit

  • New Features
    • Added hydration-quarantine handling: quarantined groups are blocked until recovered via retry hydration or authenticated re-join, with replay resuming afterward.
    • Enhanced deferred-peel lifecycle for more reliable retry/recovery, including bounded retries and improved convergence behavior.
  • Bug Fixes
    • Empty or unresolvable senders are now rejected consistently, preventing invalid application events from being accepted or surfaced in group events.
    • Improved message storage idempotency so updated payload variants are preserved correctly.
    • Added more accurate group join-epoch metadata to improve join, replay, and recovery consistency.

erskingardner and others added 4 commits July 4, 2026 14:40
…eam (#383)

The stored-convergence replay path could emit MessageReceived with an
empty, unauthenticated MemberId: the ApplicationMessage arm defaulted an
unresolvable MLS sender to empty bytes and validated the inner event
against hex("") — so a crafted MarmotAppEvent with pubkey = "" surfaced
to the application with a blank author, regressing audit item S3 that
hardened the direct ingest seam against exactly this.

Collapse the per-seam validator pairs into the shared chokepoints the
mirror-on-all-seams contract requires:

- Delete openmls_projection::sender_member_id; the replay path now uses
  identity::member_id_of_sender like the direct path.
- Delete the bool app_payload_is_valid_for_sender; both seams share the
  typed validate_app_payload_for_sender, which now rejects an empty
  MemberId outright so an unattributable message can never validate.
- The replay ApplicationMessage arm pushes ApplicationProcessed only for
  a resolvable, validated sender; anything else is Ignored, which keeps
  the message out of app_messages_by_id and lands it terminal
  (EpochInvalidated) once its epoch is at or below the settled tip.
- emit_application_replay_events skips empty-sender observations as a
  belt-and-suspenders backstop.

Document the mirror-on-all-seams convention (direct ingest,
stored-convergence/replay, fork recovery + the shared chokepoints) as a
hard invariant in AGENTS.md and update audit item S3.

Tests: app_payload unit tests pin the chokepoint (empty sender rejected
even when the inner pubkey is empty — the exploit precondition);
convergence_app_message_with_unresolvable_sender_emits_no_event_and_lands_terminal
drives the forged payload through the real ingest → convergence → replay
path and asserts no MessageReceived with an empty MemberId and a
terminal stored state.

Part of #707.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sage idempotency (#394, #369)

Two independent fail-closed/correctness fixes on the ingest path:

- #394: the fork-recovery missing-snapshot branch panicked via
  unreachable!() if resolve_fork_candidate ever returned a resolution
  while recovery_snapshot_name_for_fork saw no snapshot — an implicit
  coupling across two fork_recovery lookups, on a path reachable from
  attacker-delivered same-epoch fork commits. The duplicated
  MissingSnapshot handling is factored into forked_epoch_fail_closed
  (detect fork, audit, invalidate the stored message, typed
  EngineError::ForkedEpoch), and the former-unreachable arm now fails
  closed through it with a privacy-safe warning instead of panicking.
  Matches the documented "ForkedEpoch is the fallback for missing
  snapshots or unrecoverable shapes" rule; behavior on the real
  MissingSnapshot arms is unchanged (covered by tests/fork_detection.rs).

- #369: persist_stored_message_payload short-circuited on a
  (group, epoch, state) match without comparing the stored payload, so a
  second persist of the same id under a different StoredMessagePayload
  variant (RawTransport vs OpenMlsWire) was silently dropped and a
  reader on the other processing path lost the record. The short-circuit
  now also compares the encoded payload bytes and overwrites on any
  difference. Verified unaffected: the #723 OwnCommitWire enrichment
  writes through storage.put_message directly and never routes through
  this helper.

Part of #707.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
quarantined_groups was consulted only at session-open hydration and in
retry_hydrate_quarantined_group. The live data path never checked it: a
quarantined group with loadable OpenMLS state kept ingesting commits
(can_ingest treats a missing epoch-state entry as ingestible), the merge
called set_stable and silently re-activated the group out of band, the
convergence path fell back to the durable record epoch, the outbound
send/drain path could stage and confirm commits, and every
durable-record accessor leaked the "vanished" group's data — members()
returned the full roster while epoch() said UnknownGroup, and
group_context() would export a quarantined group's exporter secrets.

Enforce the documented "vanish" contract through one grep-auditable
chokepoint, Engine::ensure_group_live:

- Accessors (members, group_record, group_context, admin_pubkeys,
  app_component, feature_status, upgradeable/upgrade capabilities, the
  safe-export family, own_leaf_index) return UnknownGroup, matching
  epoch().
- do_send and converge_and_drain_queued_outbound_intents refuse to run;
  converge_stored_openmls_messages reports a Blocked no-op run;
  retry_deferred_peels skips the group (plus a defensive match arm so
  the catch-all "mark Processed" arm can never retire the replay
  buffer).
- ingest_group_message classifies input for a quarantined group as the
  new StaleReason::Quarantined and retains the raw transport durably as
  PeelDeferred — excluded from settlement math, replayed automatically
  once repair clears the quarantine.
- hydrate_one_stored_group indexes the transport routing id right after
  the MLS load (before validation) so inbound for a
  quarantined-but-loadable group still resolves and can be retained.
- retry_hydrate_quarantined_group schedules the recovered group for the
  app's convergence drain, which replays the retained input.
- An authenticated re-join welcome for a quarantined id clears the
  quarantine (it re-validates every leaf — strictly stronger than
  retry-hydrate) and emits GroupHydrationRecovered, so an unrepairable
  group can always be recovered by re-invite.

Tests cover the #707 acceptance criterion (quarantined group rejects a
valid inbound commit on ingest AND convergence, stays quarantined until
explicit repair, epoch()/members() agree), the full accessor sweep, the
post-repair replay catch-up, and the re-join recovery path.

Part of #707.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
PeelDeferred rows were retried forever: every convergence drain re-ran
every retained row through a full rollback+peel probe per retained
snapshot (O(rows x snapshots) per pass, up to 16 passes), the only
terminal exit was the post-peel too_distant_in_the_past classification
that garbage never reaches, and rows are keyed on attacker-controllable
raw transport ids, so a peer flooding undecryptable group-routed events
grew the durable store without bound. Pre-membership history hit the
same treadmill: nothing distinguished never-decryptable-by-design from
recoverable-historical.

Layered lifecycle, all consistent with the physics that a failed peel
can only start succeeding when the (epoch, retained-snapshot-set) peel
context changes:

- Event-driven retries: a per-group context fingerprint is armed after a
  full unproductive cycle over the backlog and whole sweeps are skipped
  until the context actually changes. New deferrals don't clear it — a
  row is only deferred after failing a live peel under the same context.
- Retry budget: a row that exhausts MAX_DEFERRED_PEEL_ATTEMPTS (32,
  test-tunable via set_deferred_peel_retry_budget) without ever peeling
  goes terminal Failed (permanently_undecryptable). Redelivery under a
  fresh transport id remains the safety valve.
- Flood cap: at MAX_PEEL_DEFERRED_ROWS_PER_GROUP (256) retained rows,
  new undecryptable input is dropped unpersisted (audit Rejection,
  in-process dedup) — no unbounded durable growth. Applies to the
  quarantine replay buffer too.
- Bounded sweeps: at most 64 rows per sweep with a resume cursor, so a
  large historical backlog never starves current-event processing.
- Pre-membership classification: new Group.join_epoch (serde default;
  no SQL migration — records are JSON blobs) written at create/join.
  An application message whose MLS epoch precedes it is terminal before
  any OpenMLS processing (pre_membership_event); the
  too_distant_in_the_past site refines to pre_membership_event vs
  valid_history_snapshot_missing. Creators record EpochId(0) = no bound.
- One disposition taxonomy (message_disposition.rs) supplies the stable
  audit reason tags across both seams.
- Telemetry: MessageStateChanged gains optional retry_count and
  sweeps_waited (schema updated; per-event data stays in the local
  forensic audit log); each sweep emits one aggregate privacy-safe
  tracing line (rows_attempted, backlog, progressed, terminal,
  queue_depth, sweep_duration_ms).

Tests in tests/deferred_peel_lifecycle.rs cover the unchanged-context
gate, retry-after-epoch-advance, budget-terminal, flood cap, terminal
pre-membership classification, and join-epoch bookkeeping; the existing
send_preflight_retries_deferred_peels_after_convergence_apply stays
green unmodified.

Part of #707.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cursor

cursor Bot commented Jul 4, 2026

Copy link
Copy Markdown

Bugbot is not enabled for your account, so this pull request was not reviewed.

Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs.

@coderabbitai

coderabbitai Bot commented Jul 4, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 37137fc3-34f6-4e32-ad96-3a0dd19edc52

📥 Commits

Reviewing files that changed from the base of the PR and between 974c2de and 9701fa8.

📒 Files selected for processing (5)
  • crates/cgka-engine/src/engine.rs
  • crates/cgka-engine/src/message_processor/ingest.rs
  • crates/cgka-engine/src/message_processor/mod.rs
  • crates/cgka-engine/tests/distributed_convergence.rs
  • crates/cgka-engine/tests/hydration_quarantine.rs
🚧 Files skipped from review as they are similar to previous changes (3)
  • crates/cgka-engine/tests/hydration_quarantine.rs
  • crates/cgka-engine/src/message_processor/mod.rs
  • crates/cgka-engine/src/engine.rs

Walkthrough

This PR adds join-epoch tracking to groups, enforces hydration quarantine across live access, ingest, and convergence paths, introduces deferred-peel retry state and audit fields, aligns sender validation across ingest and replay seams, and fixes stored-message payload overwrite handling.

Changes

Engine hardening: quarantine, deferred-peel retry, sender parity

Layer / File(s) Summary
Group join epoch tracking
crates/traits/src/group.rs, crates/cgka-engine/src/group_lifecycle.rs, crates/cgka-conformance-simulator/tests/openmls_replay_probe.rs, crates/marmot-account/tests/runtime.rs, crates/storage-sqlite/src/storage/test_support.rs, crates/traits/tests/snapshots.rs
Adds join_epoch to Group and populates it in creation/join paths, fixtures, and snapshots.
Hydration quarantine enforcement
crates/cgka-engine/src/engine.rs, crates/cgka-engine/src/message_processor/mod.rs, crates/cgka-engine/src/message_processor/ingest.rs, crates/cgka-engine/src/distributed_convergence.rs, crates/cgka-engine/src/group_lifecycle.rs, crates/traits/src/ingest.rs, crates/cgka-engine/AGENTS.md, crates/cgka-engine/tests/hydration_quarantine.rs
Adds quarantine helpers, gates live accessors and outbound/ingest/convergence paths, clears quarantine on repair or welcome re-join, and updates quarantine-focused tests and guidance.
Deferred-peel retry lifecycle
crates/cgka-engine/src/message_disposition.rs, crates/cgka-engine/src/message_processor/mod.rs, crates/cgka-engine/src/message_processor/ingest.rs, crates/cgka-engine/src/audit_helpers.rs, crates/marmot-forensics/src/audit.rs, crates/marmot-forensics/schema/audit-log-event.v2.schema.json, crates/marmot-forensics/src/audit/tests.rs, crates/cgka-engine/tests/deferred_peel_lifecycle.rs
Adds deferred-peel classification, retry budgeting, sweep gating, retirement bookkeeping, audit fields, schema updates, and lifecycle coverage.
Sender validation parity and fail-closed replay
crates/cgka-engine/src/app_payload.rs, crates/cgka-engine/src/openmls_projection.rs, crates/cgka-engine/src/distributed_convergence.rs, crates/cgka-engine/src/message_processor/ingest.rs, crates/cgka-engine/AGENTS.md, crates/cgka-engine/tests/distributed_convergence.rs
Rejects empty authenticated senders, reuses resolved sender identity in replay, skips empty-sender replay events, and routes fork recovery through a fail-closed error path.
Stored payload overwrite handling
crates/cgka-engine/src/message_processor/store.rs
Changes the idempotency short-circuit to compare encoded payload bytes and adds a regression test for differing payload variants.

Estimated code review effort: 5 (Critical) | ~120 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the PR’s main themes: mirrored ingest invariants, quarantine enforcement, and deferred-peel bounding.
Linked Issues check ✅ Passed The changes cover the linked fixes: replay sender validation, fail-closed fork recovery, payload overwrite semantics, quarantine enforcement, and bounded deferred-peel retries.
Out of Scope Changes check ✅ Passed The modified files all support the same tracking-issue work, with tests and docs added alongside the functional fixes.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/strange-rubin-4a814d

Comment @coderabbitai help to get the list of available commands.

@stage-review

stage-review Bot commented Jul 4, 2026

Copy link
Copy Markdown

Ready to review this PR? Stage has broken it down into 6 individual chapters for you:

Title
1 Define shared message disposition taxonomy
2 Unify application payload validation across seams
3 Harden fail-closed logic and idempotency
4 Enforce hydration quarantine on live surfaces
5 Implement bounded deferred-peel lifecycle
6 Other changes
Open in Stage

Chapters generated by Stage for commit 9701fa8 on Jul 4, 2026 3:43pm UTC.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/cgka-engine/src/engine.rs (1)

1365-1376: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Misplaced doc comment: set_recorder's doc now describes set_deferred_peel_retry_budget.

The new method was inserted between set_recorder's doc block (Lines 1365-1368) and set_recorder itself. Consecutive /// lines coalesce, so set_deferred_peel_retry_budget now carries two contradictory doc paragraphs (recorder replacement + retry budget) and set_recorder at Line 1377 is left undocumented.

📝 Proposed fix: move the recorder doc back onto set_recorder
-    /// Replace the installed forensic recorder on a live engine. Dropping the
-    /// prior recorder flushes and closes any file it held. Used to start or
-    /// stop audit logging in place when the audit switch is toggled, without
-    /// rebuilding the engine. Pass [`NoopRecorder`] to stop recording.
     /// Override the deferred-peel retry budget (mdk#339). Rows that
     /// exceed it without ever peeling transition to terminal `Failed`
     /// (`permanently_undecryptable`). Exposed so tests can exhaust the budget
     /// quickly; production uses the default.
     pub fn set_deferred_peel_retry_budget(&mut self, budget: u32) {
         self.deferred_peel_retry_budget = budget.max(1);
     }
 
+    /// Replace the installed forensic recorder on a live engine. Dropping the
+    /// prior recorder flushes and closes any file it held. Used to start or
+    /// stop audit logging in place when the audit switch is toggled, without
+    /// rebuilding the engine. Pass [`NoopRecorder`] to stop recording.
     pub fn set_recorder(&mut self, recorder: Box<dyn ForensicRecorder>) {
         self.recorder = recorder;
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/cgka-engine/src/engine.rs` around lines 1365 - 1376, The doc block
before set_deferred_peel_retry_budget is accidentally inheriting set_recorder’s
documentation, leaving set_recorder undocumented and giving
set_deferred_peel_retry_budget mixed semantics. Move the recorder-related ///
lines back so they immediately precede set_recorder, and keep only the
retry-budget paragraph attached to set_deferred_peel_retry_budget. Use the
set_recorder and set_deferred_peel_retry_budget method names to verify each doc
comment matches the correct function.
🧹 Nitpick comments (3)
crates/cgka-engine/tests/hydration_quarantine.rs (1)

605-645: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Duplicate test helpers across files.

route, welcome_for, and evolution are byte-for-byte duplicates of the same-named helpers in crates/cgka-engine/tests/deferred_peel_lifecycle.rs. A shared tests/support/mod.rs module already exists (e.g. proof_signer) — consider moving these there to avoid drift between the two copies.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/cgka-engine/tests/hydration_quarantine.rs` around lines 605 - 645, The
test helpers `route`, `welcome_for`, and `evolution` are duplicated in this test
module and `deferred_peel_lifecycle.rs`, so move the shared logic into the
existing `tests/support/mod.rs` alongside `proof_signer`. Update both test files
to import and use the shared helpers, keeping the function names and behavior
consistent to prevent future drift.
crates/cgka-engine/src/message_processor/ingest.rs (2)

452-483: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicate pre-membership classification condition.

join_epoch.0 > 0 && msg_epoch < join_epoch is repeated verbatim in both the early pre-processing short-circuit (Lines 458-460) and the post-TooDistantInThePast refinement (Line 551). Extracting a small shared helper (e.g. fn is_pre_membership(join_epoch, msg_epoch) -> bool) would prevent the two call sites from silently diverging if the bound logic changes later.

♻️ Proposed refactor
+    fn is_pre_membership_event(join_epoch: EpochId, msg_epoch: EpochId) -> bool {
+        join_epoch.0 > 0 && msg_epoch < join_epoch
+    }
+
     // at line ~459
-                let join_epoch = self.group_join_epoch(&group_id);
-                if join_epoch.0 > 0 && msg_epoch < join_epoch {
+                let join_epoch = self.group_join_epoch(&group_id);
+                if Self::is_pre_membership_event(join_epoch, msg_epoch) {

     // at line ~551
-                    let join_epoch = self.group_join_epoch(&group_id);
-                    let tag = if join_epoch.0 > 0 && msg_epoch < join_epoch {
+                    let join_epoch = self.group_join_epoch(&group_id);
+                    let tag = if Self::is_pre_membership_event(join_epoch, msg_epoch) {

Also applies to: 546-558

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/cgka-engine/src/message_processor/ingest.rs` around lines 452 - 483,
The pre-membership check in ingest::ingest_message is duplicated, so the logic
can drift between the early short-circuit and the later TooDistantInThePast
refinement. Factor the repeated join_epoch.0 > 0 && msg_epoch < join_epoch
condition into a small shared helper (for example, a method near
group_join_epoch or a private predicate in message_processor::ingest) and call
it from both sites to keep the classification behavior consistent.

280-308: 🚀 Performance & Scalability | 🔵 Trivial

LGTM on the flood-cap logic itself — correctly keyed on the pre-shadow raw transport id.

Separately: the Rejection audit event fired here (and at the parallel quarantine-cap path, lines 123-131) has no rate limiting. Since raw transport ids are attacker-controlled (per the adjacent comment), a sustained flood of undecryptable/quarantined-group input past the cap will still generate one audit write per rejected message indefinitely, even though durable storage growth is bounded. Worth considering a per-group/window throttle on Rejection emission for this path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/cgka-engine/src/message_processor/ingest.rs` around lines 280 - 308,
The flood-cap logic is fine, but the `Rejection` audit emission in
`ingest_group_message` and the matching quarantine-cap path is unthrottled, so
attacker-controlled raw transport ids can still trigger unlimited audit writes.
Add a per-group/window rate limit or suppression gate around
`self.audit_group(... Rejection ...)` in these cap-rejection branches, while
preserving the existing `seen_message_ids` and `StaleReason::PeelFailed`
behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/cgka-engine/src/engine.rs`:
- Around line 1365-1376: The doc block before set_deferred_peel_retry_budget is
accidentally inheriting set_recorder’s documentation, leaving set_recorder
undocumented and giving set_deferred_peel_retry_budget mixed semantics. Move the
recorder-related /// lines back so they immediately precede set_recorder, and
keep only the retry-budget paragraph attached to set_deferred_peel_retry_budget.
Use the set_recorder and set_deferred_peel_retry_budget method names to verify
each doc comment matches the correct function.

---

Nitpick comments:
In `@crates/cgka-engine/src/message_processor/ingest.rs`:
- Around line 452-483: The pre-membership check in ingest::ingest_message is
duplicated, so the logic can drift between the early short-circuit and the later
TooDistantInThePast refinement. Factor the repeated join_epoch.0 > 0 &&
msg_epoch < join_epoch condition into a small shared helper (for example, a
method near group_join_epoch or a private predicate in
message_processor::ingest) and call it from both sites to keep the
classification behavior consistent.
- Around line 280-308: The flood-cap logic is fine, but the `Rejection` audit
emission in `ingest_group_message` and the matching quarantine-cap path is
unthrottled, so attacker-controlled raw transport ids can still trigger
unlimited audit writes. Add a per-group/window rate limit or suppression gate
around `self.audit_group(... Rejection ...)` in these cap-rejection branches,
while preserving the existing `seen_message_ids` and `StaleReason::PeelFailed`
behavior.

In `@crates/cgka-engine/tests/hydration_quarantine.rs`:
- Around line 605-645: The test helpers `route`, `welcome_for`, and `evolution`
are duplicated in this test module and `deferred_peel_lifecycle.rs`, so move the
shared logic into the existing `tests/support/mod.rs` alongside `proof_signer`.
Update both test files to import and use the shared helpers, keeping the
function names and behavior consistent to prevent future drift.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f7460494-3c03-4110-8a93-6f11e236d4b4

📥 Commits

Reviewing files that changed from the base of the PR and between e712232 and 974c2de.

⛔ Files ignored due to path filters (2)
  • crates/traits/tests/snapshots/snapshots__group.snap is excluded by !**/*.snap
  • crates/traits/tests/snapshots/snapshots__stale_quarantined.snap is excluded by !**/*.snap
📒 Files selected for processing (24)
  • crates/cgka-conformance-simulator/tests/openmls_replay_probe.rs
  • crates/cgka-engine/AGENTS.md
  • crates/cgka-engine/src/app_payload.rs
  • crates/cgka-engine/src/audit_helpers.rs
  • crates/cgka-engine/src/distributed_convergence.rs
  • crates/cgka-engine/src/engine.rs
  • crates/cgka-engine/src/group_lifecycle.rs
  • crates/cgka-engine/src/lib.rs
  • crates/cgka-engine/src/message_disposition.rs
  • crates/cgka-engine/src/message_processor/ingest.rs
  • crates/cgka-engine/src/message_processor/mod.rs
  • crates/cgka-engine/src/message_processor/store.rs
  • crates/cgka-engine/src/openmls_projection.rs
  • crates/cgka-engine/tests/deferred_peel_lifecycle.rs
  • crates/cgka-engine/tests/distributed_convergence.rs
  • crates/cgka-engine/tests/hydration_quarantine.rs
  • crates/marmot-account/tests/runtime.rs
  • crates/marmot-forensics/schema/audit-log-event.v2.schema.json
  • crates/marmot-forensics/src/audit.rs
  • crates/marmot-forensics/src/audit/tests.rs
  • crates/storage-sqlite/src/storage/test_support.rs
  • crates/traits/src/group.rs
  • crates/traits/src/ingest.rs
  • crates/traits/tests/snapshots.rs

@erskingardner

Copy link
Copy Markdown
Member Author

Adversarial review finding 1: crates/cgka-engine/src/message_processor/mod.rs:793-815

Including MessageState::PeelDeferred in replay_buffered_messages makes the re-join recovery path replay quarantine-retained inputs, but the success path still falls through Ok(_) => {}. If a retained raw row is successfully applied or terminally classified during do_join_welcome replay, the content-derived row is updated, but the original raw replay-buffer row remains durable PeelDeferred and its cap slot is never released. That leaves stale quarantine input in the retry lifecycle and can permanently consume the per-group cap after a re-join recovery.

Please mirror the retirement logic from retry_deferred_peels: when was_peel_deferred and the replay result is no longer buffered/quarantined, mark the raw row processed or otherwise terminal and call note_peel_deferred_row_retired.

@erskingardner

Copy link
Copy Markdown
Member Author

Adversarial review finding 2: crates/cgka-engine/src/message_processor/mod.rs:535-564

retry_deferred_peels leaves every StaleReason::PeelFailed result in PeelDeferred, but ingest_group_message uses PeelFailed for more than "still cannot peel". It also returns PeelFailed after the raw row has successfully peeled and the content-derived message is terminally rejected, for example invalid/unattributable application payloads. In those cases this retry loop consumes an attempt, leaves the raw row durable as PeelDeferred, records no progress, and can arm the unchanged-context gate, so terminal input remains retained until some unrelated context change or eventual budget exhaustion.

That undermines the new lifecycle promise that post-peel terminal classifications leave the deferred queue. Please distinguish "peel still failed" from "peeled and terminally rejected" here, or make the terminal PeelFailed paths mark/retire the raw deferred row the same way the pre-membership/too-distant paths already call mark_raw_transport_message_failed_if_deferred.

…e, throttle cap-rejection audit

- Move set_recorder's doc block back onto set_recorder; the new
  set_deferred_peel_retry_budget had been inserted between them, leaving
  set_recorder undocumented and giving the budget setter two
  contradictory doc paragraphs.
- Replace group_join_epoch with a shared msg_is_pre_membership predicate
  so the pre-processing short-circuit and the too_distant_in_the_past
  refinement can't diverge on the join-epoch bound.
- Throttle the deferred-peel / quarantine flood-cap Rejection audit to
  once per cap-full episode (re-armed when the backlog drops below the
  cap), so attacker-controlled transport ids flooding past the cap no
  longer emit one audit write per rejected message.

Skipped (CodeRabbit nit): consolidating route/welcome_for/evolution into
tests/support/mod.rs — that module is compiled into 17 integration-test
binaries with no #![allow(dead_code)], so shared-but-unused helpers would
fail the -D warnings gate; the duplication is contained to test scaffolding.

just fast-ci green; cgka-engine suite green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cursor

cursor Bot commented Jul 4, 2026

Copy link
Copy Markdown

Bugbot is not enabled for your account, so this pull request was not reviewed.

Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs.

@erskingardner

Copy link
Copy Markdown
Member Author

Addressed the CodeRabbit review in ffaee58:

  • Misplaced doc comment (engine.rs, Minor) — fixed. set_recorder's doc block was split from it by the inserted set_deferred_peel_retry_budget; moved it back so each method carries its own doc.
  • Duplicate pre-membership condition (ingest.rs) — fixed. Replaced group_join_epoch with a shared msg_is_pre_membership predicate used by both the pre-processing short-circuit and the too_distant_in_the_past refinement, so the join-epoch bound can't diverge.
  • Unthrottled Rejection audit under flood (ingest.rs) — fixed. The flood-cap Rejection is now audited once per cap-full episode (re-armed via note_peel_deferred_row_retired once the backlog drops below the cap), so an attacker-paced flood past the cap no longer emits one audit write per rejected message. Applied to both the deferral-cap and quarantine-cap paths.
  • Duplicate test helpers (route/welcome_for/evolution) — skipped, with reason. tests/support/mod.rs is compiled into 17 integration-test binaries and has no #![allow(dead_code)], so shared-but-unused helpers would fail the -D warnings gate in the binaries that don't use them. The duplication is contained to test scaffolding, so I left the local copies in place rather than regress CI.

just fast-ci green; full cgka-engine suite green.

…#726 review)

Two cap-slot leaks in the deferred-peel lifecycle, both keeping a raw
PeelDeferred row (and its per-group cap slot) alive after the message was
actually resolved:

- Finding 2 (retry seam): ingest_group_message returns
  StaleReason::PeelFailed both when a peel genuinely still fails AND when
  the row peeled successfully but the content was terminally rejected
  (unattributable sender, invalid app payload, non-MLS body, welcome in a
  group envelope, stale-epoch-no-snapshot). retry_deferred_peels treated
  every PeelFailed as "still cannot peel" (no-op), so the terminal cases
  stayed durably PeelDeferred. Each terminal-after-peel site now calls
  mark_raw_transport_message_failed_if_deferred (mirroring the existing
  pre-membership / too-distant paths), so the raw row is marked Failed and
  released from the retry lifecycle; a no-op on the direct path.

- Finding 1 (replay seam): replay_buffered_messages left a successfully
  applied (Ok(Processed)) or terminally reclassified PeelDeferred row in
  place via the Ok(_) => {} arm, leaking its cap slot after re-join /
  publish-confirm replay. The match now mirrors retry_deferred_peels:
  deferred rows are retired (marked Processed + note_peel_deferred_row_retired)
  on apply/buffer and terminal reclassification, and only left in place for
  still-un-peelable PeelFailed or Quarantined. Non-deferred (Created/
  Retryable) rows keep their prior behavior.

Tests: deferred_row_terminally_rejected_after_peel_leaves_the_deferred_queue
(a forged unattributable payload deferred at a future epoch, then re-peeled
and rejected, ends Failed not PeelDeferred) and
rejoin_welcome_replays_and_retires_quarantine_retained_input (quarantine-
retained input replayed by do_join_welcome leaves zero PeelDeferred rows).

cgka-engine suite green; clippy + fmt clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cursor

cursor Bot commented Jul 4, 2026

Copy link
Copy Markdown

Bugbot is not enabled for your account, so this pull request was not reviewed.

Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs.

@erskingardner

Copy link
Copy Markdown
Member Author

Addressed both adversarial-review findings in 9701fa8. Both were real per-group cap-slot leaks — a raw PeelDeferred row kept alive (holding a cap slot) after the message was actually resolved.

Finding 2 (retry seam) — fixed at the root. ingest_group_message returns StaleReason::PeelFailed both when a peel genuinely still fails and when the row peeled successfully but the content was terminally rejected (unattributable sender, invalid app payload, non-MLS body, welcome-in-group-envelope, stale-epoch-no-snapshot). Each of those terminal-after-peel sites now calls mark_raw_transport_message_failed_if_deferred — exactly as the pre-membership / too-distant paths already do — so the raw row is marked Failed and released from the retry lifecycle (a no-op on the direct path). With that in place, retry_deferred_peels's PeelFailed no-op arm is correct: by the time it sees PeelFailed, a peeled-and-terminal row is already retired, and a genuinely-still-deferred row is correctly left in place.

Finding 1 (replay seam) — fixed. replay_buffered_messages's Ok(_) => {} arm now mirrors retry_deferred_peels: a was_peel_deferred row is retired (marked Processed + note_peel_deferred_row_retired) on apply/buffer and on terminal reclassification, and is left in place only for still-un-peelable PeelFailed or Quarantined. Non-deferred (Created/Retryable) rows keep their prior behavior.

Regression tests:

  • deferred_row_terminally_rejected_after_peel_leaves_the_deferred_queue — a forged unattributable payload deferred at a future epoch, then re-peeled after catch-up and rejected, ends Failed (not stuck PeelDeferred).
  • rejoin_welcome_replays_and_retires_quarantine_retained_input — input retained while quarantined and replayed by do_join_welcome leaves zero PeelDeferred rows.

cgka-engine suite green; clippy + fmt clean.

@erskingardner erskingardner merged commit 3ab4ec4 into master Jul 4, 2026
21 checks passed
@erskingardner erskingardner deleted the claude/strange-rubin-4a814d branch July 4, 2026 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment