Skip to content

Clear removed marker when branch selection supersedes the local member's removal#724

Merged
erskingardner merged 3 commits into
masterfrom
claude/intelligent-goldberg-1d95c1
Jul 4, 2026
Merged

Clear removed marker when branch selection supersedes the local member's removal#724
erskingardner merged 3 commits into
masterfrom
claude/intelligent-goldberg-1d95c1

Conversation

@erskingardner

@erskingardner erskingardner commented Jul 4, 2026

Copy link
Copy Markdown
Member

What

Handles the #702 × #703 interaction left out of scope during the #702 merge: a commit that removes the local member's own leaf is applied (realizing removal — Group.removed set, send gate closed, group presented as removed), and that commit later loses branch selection. The withdrawal machinery from #702 already retracts the self-removed notification; this PR pins the marker semantics so the device does not stay self-quarantined for a group it canonically still belongs to.

  • distributed_convergence::emit_convergence_events: when the selected canonical branch records the local member's membership, a surviving removed marker is cleared — the explicit, state-derived inverse of realize_self_eviction. Outbound intents purged at realization stay purged; the app re-issues sends against the reopened gate.
  • Doc updates in engine.rs / traits::Group: the marker is terminal only while the removal remains canonical; it clears via authenticated re-join or supersession of the removal that set it.
  • CLI CHANGELOG (Unreleased/Fixed): sends work again on such a group instead of failing invalid_transition forever.

Spec basis

member-departure.md defines removal evidence as "an accepted commit on the selected canonical branch … recorded in retained canonical state" and makes realization state-derived; convergence.md ("Applying the selected branch") requires that a change that lost branch selection "MUST NOT remain visible to the application as a completed change". A superseded removal was never canonically applied, so clearing is spec-required, not a violation of the "terminal for that group copy" rule — that rule presumes the removal stays canonical. Since the terminal sentence can be misread as unconditional, a spec clarification is filed as marmot-protocol/marmot#220.

What a reviewer should know

The marker could not actually survive a convergence reorg on master. The reorg apply rolls back to the retained anchor before replaying the winner, and SQLite group snapshots capture the whole group record — so the anchor restore already reset removed: false incidentally (see snapshot_rollback_restores_group_messages_queue_caps_and_openmls_group_state). The resulting view was spec-correct, but only as an undocumented side effect of snapshot mechanics; a future change narrowing snapshot scope (plausible, if someone deliberately preserves the "terminal" marker across rollbacks) would silently reintroduce the quarantine bug. This PR makes the invariant explicit in code, docs, and tests.

Tests

  • superseded_self_removal_clears_removed_marker_and_restores_send — the previously untested scenario end-to-end: Carol applies a peer commit removing her own leaf (marker set, send rejected with the Removed invalid-transition, self-removed notification emitted), then the removal loses a same-epoch fork to a rename through stored convergence. Asserts the full resulting view: withdrawal names the superseded removal in the stamped id space, CommitRolledBack fires, no ForkRecovered, marker cleared, membership + the winner's rename presented, roster-correction MemberAdded emitted, and the previously rejected send succeeds. Passes with or without the code change — it pins the behavior against regressions in either mechanism (anchor restore or explicit clear).
  • convergence_apply_clears_removed_marker_without_canonical_evidence — covers the explicit reconciliation branch directly (marker forced on the record, mirroring the existing pathological-copy tests in invite_leave.rs); fails without the fix.

Verified: both new tests, full cargo test -p cgka-engine, cargo test -p cgka-traits, just fast-ci, cargo fmt --check.

Related: the open post-restart own-commit branch-selection investigation is how this scenario becomes reachable in practice.

🤖 Generated with Claude Code


Open in Stage

Summary by CodeRabbit

  • Bug Fixes
    • Removed groups now clear their “removed” marker when later convergence indicates the device is still an active member.
    • App sending eligibility is restored after convergence reorg/branch selection supersedes a prior self-removal state.
    • Superseded self-removal no longer leaves the group copy incorrectly self-quarantined.
  • Tests
    • Added regression tests covering removed-marker reconciliation across stored-convergence apply/reorg edge cases.
  • Documentation
    • Updated the changelog and expanded “removed” marker documentation to clarify all clearing scenarios.

…r's removal

A commit that removes the local member's own leaf realizes removal
(Group.removed set, send gate closed, group presented as removed). When
that commit later loses branch selection, the withdrawal machinery from
#702 already retracts the self-removed notification, but the marker
semantics were unspecified and untested: supersession tests used
rename/retention commits only, and eviction tests never superseded a
removal.

Spec reading (marmot-protocol/marmot): member-departure.md defines
removal evidence as "an accepted commit on the selected canonical
branch ... recorded in retained canonical state" and makes realization
state-derived; convergence.md "Applying the selected branch" requires
that a change that lost branch selection "MUST NOT remain visible to
the application as a completed change". A superseded removal was never
canonically applied, so the copy must stop self-quarantining — the
"terminal for that group copy" rule presumes the removal stays
canonical (clarification filed as marmot-protocol/marmot#220).

Implementation-wise the reorg apply already restored the pre-removal
group record from the retained anchor (snapshots capture the whole
record), so the marker could not actually survive a convergence reorg —
but that was incidental to snapshot mechanics, undocumented, and
untested. This change makes the semantics explicit:

- distributed_convergence: when the selected canonical branch records
  the local member's membership, clear a surviving removed marker as
  the state-derived inverse of realize_self_eviction (reconciles a
  pathological marker no anchor restore can see; keeps the send-gate
  posture derivable from canonical state alone).
- engine/traits docs: the marker clears via authenticated re-join OR
  supersession of the removal that set it; it is terminal only while
  the removal remains canonical.
- tests: superseded_self_removal_clears_removed_marker_and_restores_send
  pins the full resulting view end-to-end (a client whose own removal
  loses branch selection: withdrawal names the superseded removal in
  the stamped id space, marker cleared, membership and the winner's
  rename presented, roster-correction row emitted, send gate reopens);
  convergence_apply_clears_removed_marker_without_canonical_evidence
  pins the explicit reconciliation guard.
- cli CHANGELOG: sends work again on a group whose removal was
  superseded, instead of failing invalid_transition forever.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cursor

cursor Bot commented Jul 4, 2026

Copy link
Copy Markdown

Bugbot is not enabled for your account, so this pull request was not reviewed.

Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs.

@stage-review

stage-review Bot commented Jul 4, 2026

Copy link
Copy Markdown

Ready to review this PR? Stage has broken it down into 3 individual chapters for you:

Title
1 Update documentation for removed-marker lifecycle
2 Implement removed-marker reconciliation in convergence
3 Verify marker clearing and send restoration
Open in Stage

Chapters generated by Stage for commit 101873a on Jul 4, 2026 9:44am UTC.

@coderabbitai

coderabbitai Bot commented Jul 4, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b317e655-335a-4eb1-8231-08bcb23e0d4e

📥 Commits

Reviewing files that changed from the base of the PR and between 28018da and 101873a.

📒 Files selected for processing (2)
  • crates/cgka-engine/src/distributed_convergence.rs
  • crates/cgka-engine/tests/distributed_convergence.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/cgka-engine/src/distributed_convergence.rs

Walkthrough

This PR clears a stale Group.removed marker during convergence when canonical branch selection still includes the local member, updates related documentation, and adds regression tests plus a changelog note.

Changes

Removed marker reconciliation

Layer / File(s) Summary
Core reconciliation logic and docs
crates/cgka-engine/src/distributed_convergence.rs, crates/cgka-engine/src/engine.rs, crates/traits/src/group.rs
emit_convergence_events now snapshots group.removed, verifies the live MLS leaf for self_id, clears a stale stored marker when canonical membership still includes self, and logs a privacy-safe breadcrumb; related docs describe the clearing path.
Regression tests for marker clearing
crates/cgka-engine/tests/distributed_convergence.rs, crates/cli/CHANGELOG.md
Three integration tests cover superseded self-removal, a pathological removed marker with canonical membership, and a fully evicted-shaped record healed by convergence apply; the changelog records the user-facing fix.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Possibly related issues

Possibly related PRs

  • marmot-protocol/mdk#703: Shares the same group.removed state path in distributed_convergence.rs, with one PR setting the marker and this PR clearing it when convergence supersedes the removal.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately summarizes the main behavior change in the PR.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/intelligent-goldberg-1d95c1

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/cgka-engine/src/distributed_convergence.rs (1)

634-668: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Avoid unconditional get_group fetch on the common (already-not-removed) path.

This block runs on every settled convergence pass where self remains a member, and unconditionally calls storage.get_group to check group.removed, even though current_group.removed was already loaded earlier in this same function (before current_group.members was consumed at the for member in current_group.members loop). Capturing that bool before the move and gating the extra fetch/write on it would save a redundant storage read on the hot common case (marker not set), matching the perf-consciousness already applied elsewhere in this file (e.g., the seen_message_ids_hex cache noted for the "up to 16" pass drain).

♻️ Suggested refactor
+        let self_was_removed = current_group.removed;
         let previous_ids: HashSet<MemberId> = previous_members
@@
         if current_ids.contains(self.identity.self_id()) {
-            let mut group = self
-                .storage
-                .get_group(group_id)
-                .map_err(|e| OpenMlsProjectionError::Storage(format!("{e:?}")))?;
-            if group.removed {
-                group.removed = false;
-                self.storage
-                    .put_group(&group)
-                    .map_err(|e| OpenMlsProjectionError::Storage(format!("{e:?}")))?;
-                tracing::info!(
-                    target: "cgka_engine::distributed_convergence",
-                    method = "emit_convergence_events",
-                    "cleared removed marker: selected canonical branch records local membership"
-                );
+            if self_was_removed {
+                let mut group = self
+                    .storage
+                    .get_group(group_id)
+                    .map_err(|e| OpenMlsProjectionError::Storage(format!("{e:?}")))?;
+                if group.removed {
+                    group.removed = false;
+                    self.storage
+                        .put_group(&group)
+                        .map_err(|e| OpenMlsProjectionError::Storage(format!("{e:?}")))?;
+                    tracing::info!(
+                        target: "cgka_engine::distributed_convergence",
+                        method = "emit_convergence_events",
+                        "cleared removed marker: selected canonical branch records local membership"
+                    );
+                }
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/cgka-engine/src/distributed_convergence.rs` around lines 634 - 668,
The `emit_convergence_events` path is doing an unnecessary `storage.get_group`
on every settled pass even when the group is already not removed. Capture the
earlier `current_group.removed` state before `current_group.members` is
consumed, then use that bool to skip the extra read/write unless the marker was
actually set. Keep the existing clear-and-log behavior only for the removed
case, and update the later `get_group`/`put_group` logic accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/cgka-engine/src/distributed_convergence.rs`:
- Around line 634-668: The `emit_convergence_events` path is doing an
unnecessary `storage.get_group` on every settled pass even when the group is
already not removed. Capture the earlier `current_group.removed` state before
`current_group.members` is consumed, then use that bool to skip the extra
read/write unless the marker was actually set. Keep the existing clear-and-log
behavior only for the removed case, and update the later `get_group`/`put_group`
logic accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31a018d5-10db-491a-8a39-630198feaac4

📥 Commits

Reviewing files that changed from the base of the PR and between ddf2602 and 0dfab10.

📒 Files selected for processing (5)
  • crates/cgka-engine/src/distributed_convergence.rs
  • crates/cgka-engine/src/engine.rs
  • crates/cgka-engine/tests/distributed_convergence.rs
  • crates/cli/CHANGELOG.md
  • crates/traits/src/group.rs

Review follow-up (CodeRabbit): emit_convergence_events already holds the
post-apply group record, so capture its removed flag before the members
vec is consumed and skip the extra get_group read on the common
(not-removed) path. The rare marked path still re-checks under a fresh
record before writing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cursor

cursor Bot commented Jul 4, 2026

Copy link
Copy Markdown

Bugbot is not enabled for your account, so this pull request was not reviewed.

Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs.

@erskingardner erskingardner left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial review summary

Solid fix for a real spec gap: making the removed-marker clear explicit rather than relying on anchor-restore side effects is the right defensive move, and convergence_apply_clears_removed_marker_without_canonical_evidence is the test that actually gates the new code.

I don't see a correctness bug in the convergence guard itself — the mutual exclusion with the removal loop in the same pass looks sound, and the group_record_was_removed snapshot + re-fetch pattern is a reasonable optimization.

Main gaps are parity (direct seam still snapshot-only), coverage (the big E2E test passes without this diff), and restart (hydration never reconciles a stale marker). Inline notes below.

Overall: approve direction, with follow-ups worth tracking before/soon after merge.

Comment thread crates/cgka-engine/src/distributed_convergence.rs
Comment thread crates/cgka-engine/src/distributed_convergence.rs
Comment thread crates/cgka-engine/src/distributed_convergence.rs
Comment thread crates/cgka-engine/tests/distributed_convergence.rs
Comment thread crates/cgka-engine/tests/distributed_convergence.rs
@erskingardner

Copy link
Copy Markdown
Member Author

Two additional adversarial notes that don't map to changed lines in this diff:

1. Direct-seam parity gap (message_processor/ingest.rs)

The post-ForkRecovered block (~889–902) handles leave-request restoration when after_ids.contains(self) but never explicitly clears Group.removed. Same story as pre-fix convergence: snapshot rollback to the incumbent's pre-commit anchor normally restores removed = false, but that is incidental — the direct seam sets g.removed = true on self-removal (~810–815) without a symmetric clear on supersession.

If snapshot scope ever narrows (the scenario this PR explicitly guards against on the convergence side), the direct seam would reintroduce the quarantine bug for clients that applied a self-removal via ingest rather than stored convergence. Worth either mirroring this guard there or documenting the shared snapshot dependency in one place.

2. Session restart / hydration

hydrate_one_stored_group never reconciles Group.removed against MLS membership at open. This fix only runs on the next converge_stored_openmls_messages pass. The PR description already flags the post-restart branch-selection path as the practical trigger — until that lands, a device that persists a pathological removed = true and restarts stays send-gated until convergence runs again. Not necessarily in scope for #724, but worth a tracked follow-up (hydrate-time reconcile or a startup convergence sweep).

…ped record

Review follow-ups (PR #724):

- The reconciliation guard no longer trusts the Marmot record roster
  alone: self_leaf_is_active_in_mls requires the live OpenMLS group to
  load, be active, and carry our identity on its own leaf (fail-closed)
  before the marker clears, so the heal stays derivable from canonical
  MLS state even if the record roster were stale.
- Trimmed the inline spec rationale to a pointer; the full removed-marker
  lifecycle now lives in module-level rustdoc on distributed_convergence.
- New test convergence_apply_heals_fully_evicted_shaped_record: the full
  post-realize_self_eviction record shape (removed = true, self stripped
  from members) heals on a forward convergence apply — roster rebuilt
  from replayed MLS state, marker cleared, send gate reopened, roster-
  correction MemberAdded emitted. Fails without the reconciliation guard
  (verified against the pre-PR source).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cursor

cursor Bot commented Jul 4, 2026

Copy link
Copy Markdown

Bugbot is not enabled for your account, so this pull request was not reviewed.

Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs.

@erskingardner erskingardner merged commit b255836 into master Jul 4, 2026
21 checks passed
@erskingardner erskingardner deleted the claude/intelligent-goldberg-1d95c1 branch July 4, 2026 10:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant