Close out #707: mirror ingest invariants on every seam, enforce quarantine, bound the deferred-peel lifecycle#726
Conversation
…eam (#383) The stored-convergence replay path could emit MessageReceived with an empty, unauthenticated MemberId: the ApplicationMessage arm defaulted an unresolvable MLS sender to empty bytes and validated the inner event against hex("") — so a crafted MarmotAppEvent with pubkey = "" surfaced to the application with a blank author, regressing audit item S3 that hardened the direct ingest seam against exactly this. Collapse the per-seam validator pairs into the shared chokepoints the mirror-on-all-seams contract requires: - Delete openmls_projection::sender_member_id; the replay path now uses identity::member_id_of_sender like the direct path. - Delete the bool app_payload_is_valid_for_sender; both seams share the typed validate_app_payload_for_sender, which now rejects an empty MemberId outright so an unattributable message can never validate. - The replay ApplicationMessage arm pushes ApplicationProcessed only for a resolvable, validated sender; anything else is Ignored, which keeps the message out of app_messages_by_id and lands it terminal (EpochInvalidated) once its epoch is at or below the settled tip. - emit_application_replay_events skips empty-sender observations as a belt-and-suspenders backstop. Document the mirror-on-all-seams convention (direct ingest, stored-convergence/replay, fork recovery + the shared chokepoints) as a hard invariant in AGENTS.md and update audit item S3. Tests: app_payload unit tests pin the chokepoint (empty sender rejected even when the inner pubkey is empty — the exploit precondition); convergence_app_message_with_unresolvable_sender_emits_no_event_and_lands_terminal drives the forged payload through the real ingest → convergence → replay path and asserts no MessageReceived with an empty MemberId and a terminal stored state. Part of #707. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sage idempotency (#394, #369) Two independent fail-closed/correctness fixes on the ingest path: - #394: the fork-recovery missing-snapshot branch panicked via unreachable!() if resolve_fork_candidate ever returned a resolution while recovery_snapshot_name_for_fork saw no snapshot — an implicit coupling across two fork_recovery lookups, on a path reachable from attacker-delivered same-epoch fork commits. The duplicated MissingSnapshot handling is factored into forked_epoch_fail_closed (detect fork, audit, invalidate the stored message, typed EngineError::ForkedEpoch), and the former-unreachable arm now fails closed through it with a privacy-safe warning instead of panicking. Matches the documented "ForkedEpoch is the fallback for missing snapshots or unrecoverable shapes" rule; behavior on the real MissingSnapshot arms is unchanged (covered by tests/fork_detection.rs). - #369: persist_stored_message_payload short-circuited on a (group, epoch, state) match without comparing the stored payload, so a second persist of the same id under a different StoredMessagePayload variant (RawTransport vs OpenMlsWire) was silently dropped and a reader on the other processing path lost the record. The short-circuit now also compares the encoded payload bytes and overwrites on any difference. Verified unaffected: the #723 OwnCommitWire enrichment writes through storage.put_message directly and never routes through this helper. Part of #707. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
quarantined_groups was consulted only at session-open hydration and in retry_hydrate_quarantined_group. The live data path never checked it: a quarantined group with loadable OpenMLS state kept ingesting commits (can_ingest treats a missing epoch-state entry as ingestible), the merge called set_stable and silently re-activated the group out of band, the convergence path fell back to the durable record epoch, the outbound send/drain path could stage and confirm commits, and every durable-record accessor leaked the "vanished" group's data — members() returned the full roster while epoch() said UnknownGroup, and group_context() would export a quarantined group's exporter secrets. Enforce the documented "vanish" contract through one grep-auditable chokepoint, Engine::ensure_group_live: - Accessors (members, group_record, group_context, admin_pubkeys, app_component, feature_status, upgradeable/upgrade capabilities, the safe-export family, own_leaf_index) return UnknownGroup, matching epoch(). - do_send and converge_and_drain_queued_outbound_intents refuse to run; converge_stored_openmls_messages reports a Blocked no-op run; retry_deferred_peels skips the group (plus a defensive match arm so the catch-all "mark Processed" arm can never retire the replay buffer). - ingest_group_message classifies input for a quarantined group as the new StaleReason::Quarantined and retains the raw transport durably as PeelDeferred — excluded from settlement math, replayed automatically once repair clears the quarantine. - hydrate_one_stored_group indexes the transport routing id right after the MLS load (before validation) so inbound for a quarantined-but-loadable group still resolves and can be retained. - retry_hydrate_quarantined_group schedules the recovered group for the app's convergence drain, which replays the retained input. - An authenticated re-join welcome for a quarantined id clears the quarantine (it re-validates every leaf — strictly stronger than retry-hydrate) and emits GroupHydrationRecovered, so an unrepairable group can always be recovered by re-invite. Tests cover the #707 acceptance criterion (quarantined group rejects a valid inbound commit on ingest AND convergence, stays quarantined until explicit repair, epoch()/members() agree), the full accessor sweep, the post-repair replay catch-up, and the re-join recovery path. Part of #707. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
PeelDeferred rows were retried forever: every convergence drain re-ran every retained row through a full rollback+peel probe per retained snapshot (O(rows x snapshots) per pass, up to 16 passes), the only terminal exit was the post-peel too_distant_in_the_past classification that garbage never reaches, and rows are keyed on attacker-controllable raw transport ids, so a peer flooding undecryptable group-routed events grew the durable store without bound. Pre-membership history hit the same treadmill: nothing distinguished never-decryptable-by-design from recoverable-historical. Layered lifecycle, all consistent with the physics that a failed peel can only start succeeding when the (epoch, retained-snapshot-set) peel context changes: - Event-driven retries: a per-group context fingerprint is armed after a full unproductive cycle over the backlog and whole sweeps are skipped until the context actually changes. New deferrals don't clear it — a row is only deferred after failing a live peel under the same context. - Retry budget: a row that exhausts MAX_DEFERRED_PEEL_ATTEMPTS (32, test-tunable via set_deferred_peel_retry_budget) without ever peeling goes terminal Failed (permanently_undecryptable). Redelivery under a fresh transport id remains the safety valve. - Flood cap: at MAX_PEEL_DEFERRED_ROWS_PER_GROUP (256) retained rows, new undecryptable input is dropped unpersisted (audit Rejection, in-process dedup) — no unbounded durable growth. Applies to the quarantine replay buffer too. - Bounded sweeps: at most 64 rows per sweep with a resume cursor, so a large historical backlog never starves current-event processing. - Pre-membership classification: new Group.join_epoch (serde default; no SQL migration — records are JSON blobs) written at create/join. An application message whose MLS epoch precedes it is terminal before any OpenMLS processing (pre_membership_event); the too_distant_in_the_past site refines to pre_membership_event vs valid_history_snapshot_missing. Creators record EpochId(0) = no bound. - One disposition taxonomy (message_disposition.rs) supplies the stable audit reason tags across both seams. - Telemetry: MessageStateChanged gains optional retry_count and sweeps_waited (schema updated; per-event data stays in the local forensic audit log); each sweep emits one aggregate privacy-safe tracing line (rows_attempted, backlog, progressed, terminal, queue_depth, sweep_duration_ms). Tests in tests/deferred_peel_lifecycle.rs cover the unchanged-context gate, retry-after-epoch-advance, budget-terminal, flood cap, terminal pre-membership classification, and join-epoch bookkeeping; the existing send_preflight_retries_deferred_peels_after_convergence_apply stays green unmodified. Part of #707. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Bugbot is not enabled for your account, so this pull request was not reviewed. Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (3)
WalkthroughThis PR adds join-epoch tracking to groups, enforces hydration quarantine across live access, ingest, and convergence paths, introduces deferred-peel retry state and audit fields, aligns sender validation across ingest and replay seams, and fixes stored-message payload overwrite handling. ChangesEngine hardening: quarantine, deferred-peel retry, sender parity
Estimated code review effort: 5 (Critical) | ~120 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
Ready to review this PR? Stage has broken it down into 6 individual chapters for you: Chapters generated by Stage for commit 9701fa8 on Jul 4, 2026 3:43pm UTC. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/cgka-engine/src/engine.rs (1)
1365-1376: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winMisplaced doc comment:
set_recorder's doc now describesset_deferred_peel_retry_budget.The new method was inserted between
set_recorder's doc block (Lines 1365-1368) andset_recorderitself. Consecutive///lines coalesce, soset_deferred_peel_retry_budgetnow carries two contradictory doc paragraphs (recorder replacement + retry budget) andset_recorderat Line 1377 is left undocumented.📝 Proposed fix: move the recorder doc back onto
set_recorder- /// Replace the installed forensic recorder on a live engine. Dropping the - /// prior recorder flushes and closes any file it held. Used to start or - /// stop audit logging in place when the audit switch is toggled, without - /// rebuilding the engine. Pass [`NoopRecorder`] to stop recording. /// Override the deferred-peel retry budget (mdk#339). Rows that /// exceed it without ever peeling transition to terminal `Failed` /// (`permanently_undecryptable`). Exposed so tests can exhaust the budget /// quickly; production uses the default. pub fn set_deferred_peel_retry_budget(&mut self, budget: u32) { self.deferred_peel_retry_budget = budget.max(1); } + /// Replace the installed forensic recorder on a live engine. Dropping the + /// prior recorder flushes and closes any file it held. Used to start or + /// stop audit logging in place when the audit switch is toggled, without + /// rebuilding the engine. Pass [`NoopRecorder`] to stop recording. pub fn set_recorder(&mut self, recorder: Box<dyn ForensicRecorder>) { self.recorder = recorder; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/cgka-engine/src/engine.rs` around lines 1365 - 1376, The doc block before set_deferred_peel_retry_budget is accidentally inheriting set_recorder’s documentation, leaving set_recorder undocumented and giving set_deferred_peel_retry_budget mixed semantics. Move the recorder-related /// lines back so they immediately precede set_recorder, and keep only the retry-budget paragraph attached to set_deferred_peel_retry_budget. Use the set_recorder and set_deferred_peel_retry_budget method names to verify each doc comment matches the correct function.
🧹 Nitpick comments (3)
crates/cgka-engine/tests/hydration_quarantine.rs (1)
605-645: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueDuplicate test helpers across files.
route,welcome_for, andevolutionare byte-for-byte duplicates of the same-named helpers incrates/cgka-engine/tests/deferred_peel_lifecycle.rs. A sharedtests/support/mod.rsmodule already exists (e.g.proof_signer) — consider moving these there to avoid drift between the two copies.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/cgka-engine/tests/hydration_quarantine.rs` around lines 605 - 645, The test helpers `route`, `welcome_for`, and `evolution` are duplicated in this test module and `deferred_peel_lifecycle.rs`, so move the shared logic into the existing `tests/support/mod.rs` alongside `proof_signer`. Update both test files to import and use the shared helpers, keeping the function names and behavior consistent to prevent future drift.crates/cgka-engine/src/message_processor/ingest.rs (2)
452-483: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDuplicate pre-membership classification condition.
join_epoch.0 > 0 && msg_epoch < join_epochis repeated verbatim in both the early pre-processing short-circuit (Lines 458-460) and the post-TooDistantInThePastrefinement (Line 551). Extracting a small shared helper (e.g.fn is_pre_membership(join_epoch, msg_epoch) -> bool) would prevent the two call sites from silently diverging if the bound logic changes later.♻️ Proposed refactor
+ fn is_pre_membership_event(join_epoch: EpochId, msg_epoch: EpochId) -> bool { + join_epoch.0 > 0 && msg_epoch < join_epoch + } + // at line ~459 - let join_epoch = self.group_join_epoch(&group_id); - if join_epoch.0 > 0 && msg_epoch < join_epoch { + let join_epoch = self.group_join_epoch(&group_id); + if Self::is_pre_membership_event(join_epoch, msg_epoch) { // at line ~551 - let join_epoch = self.group_join_epoch(&group_id); - let tag = if join_epoch.0 > 0 && msg_epoch < join_epoch { + let join_epoch = self.group_join_epoch(&group_id); + let tag = if Self::is_pre_membership_event(join_epoch, msg_epoch) {Also applies to: 546-558
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/cgka-engine/src/message_processor/ingest.rs` around lines 452 - 483, The pre-membership check in ingest::ingest_message is duplicated, so the logic can drift between the early short-circuit and the later TooDistantInThePast refinement. Factor the repeated join_epoch.0 > 0 && msg_epoch < join_epoch condition into a small shared helper (for example, a method near group_join_epoch or a private predicate in message_processor::ingest) and call it from both sites to keep the classification behavior consistent.
280-308: 🚀 Performance & Scalability | 🔵 TrivialLGTM on the flood-cap logic itself — correctly keyed on the pre-shadow raw transport id.
Separately: the
Rejectionaudit event fired here (and at the parallel quarantine-cap path, lines 123-131) has no rate limiting. Since raw transport ids are attacker-controlled (per the adjacent comment), a sustained flood of undecryptable/quarantined-group input past the cap will still generate one audit write per rejected message indefinitely, even though durable storage growth is bounded. Worth considering a per-group/window throttle onRejectionemission for this path.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/cgka-engine/src/message_processor/ingest.rs` around lines 280 - 308, The flood-cap logic is fine, but the `Rejection` audit emission in `ingest_group_message` and the matching quarantine-cap path is unthrottled, so attacker-controlled raw transport ids can still trigger unlimited audit writes. Add a per-group/window rate limit or suppression gate around `self.audit_group(... Rejection ...)` in these cap-rejection branches, while preserving the existing `seen_message_ids` and `StaleReason::PeelFailed` behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/cgka-engine/src/engine.rs`:
- Around line 1365-1376: The doc block before set_deferred_peel_retry_budget is
accidentally inheriting set_recorder’s documentation, leaving set_recorder
undocumented and giving set_deferred_peel_retry_budget mixed semantics. Move the
recorder-related /// lines back so they immediately precede set_recorder, and
keep only the retry-budget paragraph attached to set_deferred_peel_retry_budget.
Use the set_recorder and set_deferred_peel_retry_budget method names to verify
each doc comment matches the correct function.
---
Nitpick comments:
In `@crates/cgka-engine/src/message_processor/ingest.rs`:
- Around line 452-483: The pre-membership check in ingest::ingest_message is
duplicated, so the logic can drift between the early short-circuit and the later
TooDistantInThePast refinement. Factor the repeated join_epoch.0 > 0 &&
msg_epoch < join_epoch condition into a small shared helper (for example, a
method near group_join_epoch or a private predicate in
message_processor::ingest) and call it from both sites to keep the
classification behavior consistent.
- Around line 280-308: The flood-cap logic is fine, but the `Rejection` audit
emission in `ingest_group_message` and the matching quarantine-cap path is
unthrottled, so attacker-controlled raw transport ids can still trigger
unlimited audit writes. Add a per-group/window rate limit or suppression gate
around `self.audit_group(... Rejection ...)` in these cap-rejection branches,
while preserving the existing `seen_message_ids` and `StaleReason::PeelFailed`
behavior.
In `@crates/cgka-engine/tests/hydration_quarantine.rs`:
- Around line 605-645: The test helpers `route`, `welcome_for`, and `evolution`
are duplicated in this test module and `deferred_peel_lifecycle.rs`, so move the
shared logic into the existing `tests/support/mod.rs` alongside `proof_signer`.
Update both test files to import and use the shared helpers, keeping the
function names and behavior consistent to prevent future drift.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f7460494-3c03-4110-8a93-6f11e236d4b4
⛔ Files ignored due to path filters (2)
crates/traits/tests/snapshots/snapshots__group.snapis excluded by!**/*.snapcrates/traits/tests/snapshots/snapshots__stale_quarantined.snapis excluded by!**/*.snap
📒 Files selected for processing (24)
crates/cgka-conformance-simulator/tests/openmls_replay_probe.rscrates/cgka-engine/AGENTS.mdcrates/cgka-engine/src/app_payload.rscrates/cgka-engine/src/audit_helpers.rscrates/cgka-engine/src/distributed_convergence.rscrates/cgka-engine/src/engine.rscrates/cgka-engine/src/group_lifecycle.rscrates/cgka-engine/src/lib.rscrates/cgka-engine/src/message_disposition.rscrates/cgka-engine/src/message_processor/ingest.rscrates/cgka-engine/src/message_processor/mod.rscrates/cgka-engine/src/message_processor/store.rscrates/cgka-engine/src/openmls_projection.rscrates/cgka-engine/tests/deferred_peel_lifecycle.rscrates/cgka-engine/tests/distributed_convergence.rscrates/cgka-engine/tests/hydration_quarantine.rscrates/marmot-account/tests/runtime.rscrates/marmot-forensics/schema/audit-log-event.v2.schema.jsoncrates/marmot-forensics/src/audit.rscrates/marmot-forensics/src/audit/tests.rscrates/storage-sqlite/src/storage/test_support.rscrates/traits/src/group.rscrates/traits/src/ingest.rscrates/traits/tests/snapshots.rs
|
Adversarial review finding 1: Including Please mirror the retirement logic from |
|
Adversarial review finding 2:
That undermines the new lifecycle promise that post-peel terminal classifications leave the deferred queue. Please distinguish "peel still failed" from "peeled and terminally rejected" here, or make the terminal |
…e, throttle cap-rejection audit - Move set_recorder's doc block back onto set_recorder; the new set_deferred_peel_retry_budget had been inserted between them, leaving set_recorder undocumented and giving the budget setter two contradictory doc paragraphs. - Replace group_join_epoch with a shared msg_is_pre_membership predicate so the pre-processing short-circuit and the too_distant_in_the_past refinement can't diverge on the join-epoch bound. - Throttle the deferred-peel / quarantine flood-cap Rejection audit to once per cap-full episode (re-armed when the backlog drops below the cap), so attacker-controlled transport ids flooding past the cap no longer emit one audit write per rejected message. Skipped (CodeRabbit nit): consolidating route/welcome_for/evolution into tests/support/mod.rs — that module is compiled into 17 integration-test binaries with no #![allow(dead_code)], so shared-but-unused helpers would fail the -D warnings gate; the duplication is contained to test scaffolding. just fast-ci green; cgka-engine suite green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Bugbot is not enabled for your account, so this pull request was not reviewed. Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs. |
|
Addressed the CodeRabbit review in ffaee58:
|
…#726 review) Two cap-slot leaks in the deferred-peel lifecycle, both keeping a raw PeelDeferred row (and its per-group cap slot) alive after the message was actually resolved: - Finding 2 (retry seam): ingest_group_message returns StaleReason::PeelFailed both when a peel genuinely still fails AND when the row peeled successfully but the content was terminally rejected (unattributable sender, invalid app payload, non-MLS body, welcome in a group envelope, stale-epoch-no-snapshot). retry_deferred_peels treated every PeelFailed as "still cannot peel" (no-op), so the terminal cases stayed durably PeelDeferred. Each terminal-after-peel site now calls mark_raw_transport_message_failed_if_deferred (mirroring the existing pre-membership / too-distant paths), so the raw row is marked Failed and released from the retry lifecycle; a no-op on the direct path. - Finding 1 (replay seam): replay_buffered_messages left a successfully applied (Ok(Processed)) or terminally reclassified PeelDeferred row in place via the Ok(_) => {} arm, leaking its cap slot after re-join / publish-confirm replay. The match now mirrors retry_deferred_peels: deferred rows are retired (marked Processed + note_peel_deferred_row_retired) on apply/buffer and terminal reclassification, and only left in place for still-un-peelable PeelFailed or Quarantined. Non-deferred (Created/ Retryable) rows keep their prior behavior. Tests: deferred_row_terminally_rejected_after_peel_leaves_the_deferred_queue (a forged unattributable payload deferred at a future epoch, then re-peeled and rejected, ends Failed not PeelDeferred) and rejoin_welcome_replays_and_retires_quarantine_retained_input (quarantine- retained input replayed by do_join_welcome leaves zero PeelDeferred rows). cgka-engine suite green; clippy + fmt clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Bugbot is not enabled for your account, so this pull request was not reviewed. Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs. |
|
Addressed both adversarial-review findings in 9701fa8. Both were real per-group cap-slot leaks — a raw Finding 2 (retry seam) — fixed at the root. Finding 1 (replay seam) — fixed. Regression tests:
|
Closes the remaining open children of tracking issue #707 (one convergence contract: every application-visible invariant enforced identically on the direct-ingest, stored-convergence/replay, and fork-recovery seams). Four commits, one per cluster.
Fixes #383
Fixes #394
Fixes #369
Fixes #364
Fixes #365
Fixes #339
Fixes #707
Commit 1 — seam parity (#383)
The replay seam could emit
MessageReceivedwith an empty, unauthenticated sender (S3 regression): the ApplicationMessage arm defaulted an unresolvable MLS sender to empty bytes and validated the inner event againsthex(""), so a craftedMarmotAppEventwithpubkey: ""surfaced with a blank author. The per-seam validator pairs are collapsed into the shared chokepoints (identity::member_id_of_sender, typedapp_payload::validate_app_payload_for_sender— which now rejects an emptyMemberIdoutright), the replay arm pushesIgnoredfor anything that fails attribution (terminal via existing canonicalization), andemit_application_replay_eventsskips empty senders as a backstop. The mirror-on-all-seams contract is now documented as a hard invariant incrates/cgka-engine/AGENTS.md.Commit 2 — fail-closed hardening (#394, #369)
unreachable!()on an invariant break reachable from attacker-delivered same-epoch fork commits; both arms share one fail-closed helper (detect_fork+ audit +EpochInvalidated+ typedForkedEpoch).persist_stored_message_payload's idempotency short-circuit now compares the encoded payload bytes, so a same-(group, epoch, state)persist under a differentStoredMessagePayloadvariant overwrites instead of silently keeping the old variant. Verified independent of the Materialize own confirmed commits as pre-validated convergence candidates (#706) #723OwnCommitWireenrichment (which writes viaput_messagedirectly).Commit 3 — quarantine enforcement (#364, #365)
quarantined_groupsis now enforced on every live surface through one chokepoint (Engine::ensure_group_live): all durable/MLS accessors returnUnknownGroup(matchingepoch()and the documented "vanish" contract — includinggroup_context(), which previously exported a quarantined group's exporter secrets);send, convergence, and queued-intent drains refuse to run; inbound input is retained durably (PeelDeferred) and classified with the newStaleReason::Quarantined; nothing canset_stablea quarantined group out of band. Repair viaretry_hydrate_quarantined_groupor an authenticated re-join welcome clears the quarantine and replays the retained input.Commit 4 — deferred-peel lifecycle (#339, full scope)
PeelDeferredrows were retried forever at O(rows × snapshots) per convergence pass and grew the durable store without bound under a re-wrap flood (raw transport ids are attacker-controllable). The lifecycle is now: context-fingerprint gating (retry only when the (epoch, snapshot-set) peel context actually changes), a per-row retry budget → terminalpermanently_undecryptable, a per-group cap of 256 retained rows (overflow dropped unpersisted), bounded 64-row sweeps so backlog never starves current events, and post-peel pre-membership classification via the newGroup.join_epoch(pre_membership_eventvsvalid_history_snapshot_missing). OneMessageDispositiontaxonomy supplies the audit reason tags;MessageStateChangedgains optionalretry_count/sweeps_waited, and each sweep emits an aggregate privacy-safe tracing line with queue depth and duration.Note on #339's "identify pre-membership before peel" criterion: the transport wrapper carries no epoch, so pre-peel classification is physically impossible; the fix achieves the criterion's intent (no unbounded retries, no cryptographic amplification, terminal classification) via the gate + budget + post-peel classification.
Verification
just fast-cigreen; full workspace suite green (1992 tests, 0 failed); clippy clean across the workspace. New coverage:app_payloadunit tests +convergence_app_message_with_unresolvable_sender_emits_no_event_and_lands_terminal(#383),same_key_persist_with_different_payload_variant_overwrites(#369), five quarantine-enforcement tests incl. the #707 acceptance-criterion testquarantined_group_rejects_valid_inbound_commit_on_do_ingest(#364/#365), and six lifecycle tests in the newtests/deferred_peel_lifecycle.rs(#339) withsend_preflight_retries_deferred_peels_after_convergence_applyunmodified.🤖 Generated with Claude Code
Summary by CodeRabbit