Skip to content

draft: model-driven raft hardening and issue-verbiage workspace#1

Closed
anyasabo wants to merge 17 commits into
mainfrom
issue/malformed-config-decode-panic-d41a38e
Closed

draft: model-driven raft hardening and issue-verbiage workspace#1
anyasabo wants to merge 17 commits into
mainfrom
issue/malformed-config-decode-panic-d41a38e

Conversation

@anyasabo
Copy link
Copy Markdown
Owner

@anyasabo anyasabo commented May 6, 2026

Summary

  • Collect model-driven bug-hunting commits and deterministic regressions into a single draft workspace PR on the fork for iterative review.
  • Include hardening for malformed configuration decode paths so RPC-driven append/snapshot handling returns errors rather than panicking on invalid config payloads.
  • Keep this PR as a discussion vehicle to refine risk framing, reproduction detail, and production detection guidance before opening upstream issue/PR threads.

Test plan

  • go test -run "TestRaft_(AppendEntriesMalformedConfigurationReturnsError|InstallSnapshotMalformedConfigurationReturnsError|AppendEntriesConfigProcessFailureRefreshesLastLog)" -count=1 .
  • go test -run "TestRaft_(AppendEntries|InstallSnapshot)" -count=1 .
  • go test -run "Test(Configuration_encodeDecodeConfiguration|Raft_(AppendEntriesConfigProcessFailureRefreshesLastLog|AppendEntriesMalformedConfigurationReturnsError|InstallSnapshotMalformedConfigurationReturnsError))" -count=1 .
  • TLC fixed/bug A/B runs for newly added formal models in this branch.

Made with Cursor

anyasabo and others added 17 commits May 6, 2026 16:49
Add focused TLA+ models and deterministic tests for pre-vote, known-leader gating, and leadership transfer so liveness regressions are caught before they ship.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a focused TLA+ model for explicit transfer-to-nonvoter behavior, including a bug profile that demonstrates false success when leader loss is treated as transfer completion.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a focused TLA+ model for transfer-to-target races with mid-flight demotion, including a bug profile that demonstrates false success when leader loss is treated as transfer completion.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a focused TLA+ model for transfer requests when the target is removed mid-flight, including a bug profile that demonstrates false success when leader loss is treated as transfer completion.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a snapshot catch-up TLA+ model with fixed and bug profiles so term regression,
config-mutation-during-snapshot, and leader-reset failures are caught early. Also
harden leadership transfer by rejecting non-voter targets and add deterministic
regression tests for stale-term installSnapshot and outstanding config-change restore guards.

Co-authored-by: Cursor <cursoragent@cursor.com>
Capture the next high-yield mixed-version election risks and define model toggles,
properties, and deterministic test mappings so the next modeling cycle is ready
to execute immediately.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a protocol-compat election TLA+ model with fixed and bug profiles to catch
candidate identity decoding errors, transfer-flag drops, and missing-addr
pre-vote behavior. Derive deterministic tests that pin Addr precedence in
requestVote and ensure missing-addr pre-vote requests remain rejected when a
leader is known.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a replication-target selection TLA+ model that captures non-voter eligibility,
stale-index preference, and sticky-target risks, then map it to deterministic
pickServer tests that enforce voter-only selection and rebalance after backoff.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a snapshot/append interleaving TLA+ model that catches term, leader, and
commit-index regressions from stale append RPCs after snapshot state is known.
Derive deterministic appendEntries tests to lock in no term/leader overwrite and
no commit-index rollback under post-snapshot stale/lower-commit requests.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a snapshot-stream failure TLA+ model to catch state advancement on short-read
failures and derive a deterministic installSnapshot regression test that ensures
lastSnapshot and lastApplied remain unchanged when a transfer is incomplete.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a restoreUserSnapshot inflight-abort TLA+ model with bug profiles for missing
abort and uncleared inflight queue behavior, then add a deterministic regression
test verifying inflight futures are aborted and removed even when restore fails.

Co-authored-by: Cursor <cursoragent@cursor.com>
Model the conflict-truncation + StoreLogs-failure path and add a regression test
that reproduces stale cached lastLog metadata after durable suffix deletion. Fix
appendEntries to recompute cached lastLog from storage when append fails.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a formal model for latest-configuration rollback when append conflict
truncation crosses configuration entries, and derive a deterministic regression
test that ensures latest configuration resets to committed after truncation.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a formal model for post-snapshot uncommitted config replay followed by
conflict truncation, and add a deterministic snapshot-based appendEntries test
that verifies latest configuration rolls back to committed after truncation.

Co-authored-by: Cursor <cursoragent@cursor.com>
Model prev-log validation at snapshot boundaries and add deterministic tests for
matching vs mismatched boundary terms. Fix appendEntries to use snapshot metadata
for compacted prev indices so post-snapshot replay succeeds correctly.

Co-authored-by: Cursor <cursoragent@cursor.com>
When appendEntries stores replacement logs but fails while processing a config entry, the durable log may advance while cached lastLog stays stale. Refresh cached lastLog from storage on this error path and add a TLA+ model plus deterministic regression test to guard the invariant.

Co-authored-by: Cursor <cursoragent@cursor.com>
RPC handlers should treat malformed configuration bytes as recoverable protocol errors, not process-terminating panics. Add non-panicking decode paths for append and snapshot handling, plus model and regression tests to lock in failure-response behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
@anyasabo
Copy link
Copy Markdown
Owner Author

anyasabo commented May 6, 2026

Closing this broad workspace PR in favor of focused per-bug draft PRs (#2-#6) for easier review and risk analysis.

@anyasabo anyasabo closed this May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant