Skip to content

fix: handle malformed configuration payloads without panics#2

Draft
anyasabo wants to merge 4 commits into
mainfrom
fix/malformed-config-decode-only
Draft

fix: handle malformed configuration payloads without panics#2
anyasabo wants to merge 4 commits into
mainfrom
fix/malformed-config-decode-only

Conversation

@anyasabo
Copy link
Copy Markdown
Owner

@anyasabo anyasabo commented May 6, 2026

Problem

PR #465 changed the legacy v1 decodePeers path to return errors instead of panicking on malformed input. The v2+ DecodeConfiguration call sites still panicked on malformed bytes in key runtime paths.

For a core distributed-systems library, panicking on malformed persisted or remote payloads is an availability risk. These paths should reject bad payloads and keep the process alive.

What changed

  • Added internal decodeConfiguration([]byte) (Configuration, error) as an error-returning counterpart to DecodeConfiguration.
  • Updated appendEntries config-entry handling (processConfigurationLogEntry, LogConfiguration case) to return errors instead of panicking.
  • Updated installSnapshot (SnapshotVersion > 0) to return RPC errors on malformed configuration payloads instead of panicking.
  • Updated runFSM / applySingle (ConfigurationStore path) to log decode failures and skip malformed LogConfiguration entries instead of panicking.
  • Added/strengthened regression tests for malformed configuration payload handling and post-failure state invariants, including startup replay coverage.

Behavior and invariants

AppendEntries (LogConfiguration)

  • Malformed config payload now returns an RPC error (AppendEntriesResponse.Success=false) instead of panicking.
  • Entry may already be persisted in the log store (existing ordering), but:
    • lastLog is not advanced on decode failure,
    • configuration state is not mutated.

InstallSnapshot (SnapshotVersion > 0)

  • Malformed snapshot configuration now returns an RPC error (InstallSnapshotResponse.Success=false) instead of panicking.
  • Snapshot/configuration state remains unchanged on decode failure.

FSM apply (ConfigurationStore)

  • Malformed committed LogConfiguration entries are logged and skipped (no panic).
  • Subsequent valid configuration entries continue to apply.

Startup replay (NewRaft log scan)

  • Replaying a persisted malformed LogConfiguration entry now returns a decode error instead of panicking while scanning configuration entries during startup.

Ordering note

For LogConfiguration, decode now happens before configuration-state mutation in processConfigurationLogEntry. This prevents partial state promotion when decode fails.

The deprecated peer-log branch (LogAddPeerDeprecated / LogRemovePeerDeprecated) is unchanged in this PR.

Scope / follow-up

This PR intentionally focuses on panic-to-error behavior in decode paths.

In the append path where StoreLogs succeeds and later config processing fails, behavior is now:

  • before/after this PR: return an error with persisted entries, and lastLog cache may remain stale on that return path.

Keeping lastLog cache in sync in that failure path is tracked as a separate follow-up in PR #6.

Why this is safe

  • Success-path behavior for valid payloads is preserved.
  • Failure-path behavior changes from process crash to explicit error handling.
  • Existing exported DecodeConfiguration panic semantics are preserved for callers that rely on panic-on-error behavior.

Test plan

  • go test -run "TestRaft_(NewRaftMalformedConfigurationEntryInLogReturnsError|RunFSMMalformedConfigurationSkipsEntryAndContinues|AppendEntriesMalformedConfigurationReturnsError|InstallSnapshotMalformedConfigurationReturnsError)" -count=1 .
  • go test -run "TestRaft_.*Configuration.*" -count=1 .

Configuration decode failures in append and install-snapshot RPC paths should be handled as recoverable errors, not panics. Add non-panicking decode plumbing and targeted regression tests for malformed configuration payloads.

Co-authored-by: Cursor <cursoragent@cursor.com>
anyasabo and others added 3 commits May 7, 2026 14:59
…then tests

- Fix remaining panicking DecodeConfiguration call in fsm.go applySingle
  (ConfigurationStore path) by using error-returning decodeConfiguration.
- Add doc comment to unexported decodeConfiguration helper.
- Strengthen AppendEntries malformed-config test: verify log-store contains
  the persisted entry, lastLog was not advanced, and configuration state
  was not mutated.
- Strengthen InstallSnapshot malformed-config test: verify configuration
  state was not mutated by failed decode.

Co-authored-by: Cursor <cursoragent@cursor.com>
Strengthen append/snapshot malformed configuration tests to assert configuration state immutability, and add runFSM coverage proving malformed configuration entries are skipped without stopping subsequent valid applies.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a regression test proving NewRaft returns a decode error (instead of panicking) when replaying a persisted malformed LogConfiguration entry during startup.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant