feat(snapshot-skip B3): cold-start restoreSnapshotState skip gate + CRC verifier + metrics#934
feat(snapshot-skip B3): cold-start restoreSnapshotState skip gate + CRC verifier + metrics#934bootjp wants to merge 6 commits into
Conversation
Two-phase interface for the cold-start skip path to preserve the
snapshot header state (HLC ceiling, Stage 8a cutover) the FSM's
Restore would normally apply, without running the multi-GiB body
restore.
- ParseSnapshotHeader(r) — reads the v1/v2 header from a caller-
supplied reader (wrapped in a crc32 TeeReader by the engine)
and drains the rest so the wrapping CRC covers the full
payload. Returns the parsed (ceiling, cutover) WITHOUT mutating
FSM state. Errors propagate; no side-effect on error.
- ApplySnapshotHeader(ceiling, cutover) — pure assignment of the
verified header state. Engine calls this only after Parse
returned and the crc32 hash matched the file footer.
Splitting parse from apply lets the CRC verifier stay co-located
with its private helpers in internal/raftengine/etcd (matching
openAndRestoreFSMSnapshot's safety contract) while the v1/v2 header
parser stays inside the kv package where kv.ReadSnapshotHeader
already lives. Neither package imports the other in production.
Interface only; implementation + skip gate land in subsequent
commits. Refs PR #910 design §5 round-7 + post-B2 implementation
series.
Implements the two-phase opt-in interface added in step 1.
- ParseSnapshotHeader(r) — wraps r in a bufio.Reader, calls
kv.ReadSnapshotHeader (handles v1/v2/headerless/unknown magic),
drains remaining body bytes via io.Copy(io.Discard, br) so the
engine's wrapping crc32 TeeReader covers every payload byte.
Returns the parsed (ceiling, cutover) tuple WITHOUT mutating
f.hlc or f.restoredCutover — that side-effect waits for phase 2
after the engine confirms the CRC.
- ApplySnapshotHeader(ceiling, cutover) — pure assignment of the
verified header state. Mirrors the two side-effects from Restore
that are scoped to the header portion: SetPhysicalCeiling +
restoredCutover.
The drain pass is essential: without it, the engine's TeeReader
would only hash the first ~26 bytes of the body (whatever
ReadSnapshotHeader consumed) and the wrapping CRC check would not
correspond to the file footer. With the drain, the CRC matches
restoreAndComputeCRC's full-payload boundary in
openAndRestoreFSMSnapshot.
Added compile-time guard in fsm_applied_index_iface_check.go:
var _ raftengine.SnapshotHeaderApplier = (*kvFSM)(nil)
so a future rename or signature drift fails 'go build' immediately
rather than silently degrading the skip optimisation.
Tests: go vet ./kv/ + golangci-lint clean. Skip-gate consumer lands
in subsequent commits. Refs PR #910 design §5 round-7 + B3 step 1.
…s 3+4) Wires the user-visible cold-start optimisation from the PR #910 design. After this commit the engine restores the FSM only when strictly necessary; the common case (local restart with a fresh fsm.db) collapses to a header read + CRC verification, skipping the multi-GiB body restore that PR #909's HEALTH_TIMEOUT_SECONDS=300 band-aid was sized to absorb. restoreSnapshotState now branches on fsmAlreadyAtIndex: if fsmAlreadyAtIndex(fsm, tok.Index) { return applyHeaderStateOnSkip(fsm, snapPath, tok.CRC32C) } return openAndRestoreFSMSnapshot(...) // unchanged fsmAlreadyAtIndex returns true ONLY when the FSM exposes raftengine.AppliedIndexReader AND LastAppliedIndex reports (have, true, nil) AND have >= want. Any uncertainty -- FSM doesn't opt in, store reports missing meta key, LastAppliedIndex returns an error -- collapses to false so the caller falls back to the full restore path. Strictly-additive guarantee (PR #910 §4): a wasteful restore is strictly safer than a stale-but-incorrect skip. applyHeaderStateOnSkip mirrors openAndRestoreFSMSnapshot's three- step safety contract exactly: 1. fsmMinFileSize -> ErrFSMSnapshotTooSmall 2. readFSMFooter vs tokenCRC -> ErrFSMSnapshotTokenCRC 3. crc32 TeeReader over the full payload -> ErrFSMSnapshotFileCRC Only after all three pass does the FSM's ApplySnapshotHeader run (pure assignment of ceiling + cutover). FSMs that do not implement raftengine.SnapshotHeaderApplier (legacy test fakes) still get the CRC verification but no header side-effects -- they have no header state to carry forward. Helpers extracted to keep cyclomatic complexity under 10: - verifyFSMSnapshotPrefix: steps 1 + 2 (size + footer-vs-token). - readSnapshotHeaderOrDrain: step 3's seam-vs-no-seam branch (delegates to setter.ParseSnapshotHeader OR drains io.Discard through the tee). Imports gained: 'hash/crc32' and the parent raftengine package (for the AppliedIndexReader + SnapshotHeaderApplier interfaces). Tests: go vet + golangci-lint clean across etcd/, kv/, store/; existing test suite still green (internal/raftengine/etcd 21.5s, kv 9.5s, store 23.9s). Refs PR #910 design §4 + §5 round-7; closes the user-visible cold-start side of the implementation series.
…(B3 step 5)
Wires Prometheus metrics + structured INFO logs around the
restoreSnapshotState skip decision so operators can verify the
optimisation engages in production (design target: >= 90% skip rate
in steady state after B2's soak phase).
internal/raftengine/cold_start.go — new ColdStartObserver
interface with three methods (RestoreSkipped / RestoreExecuted /
RestoreFallback). Lives in the parent raftengine package so
monitoring can implement it without circular import. Contract
documents the strictly-additive fallback reason enum (not_reader /
missing_meta / read_err) and the no-blocking requirement (cold-
start critical path).
internal/raftengine/etcd/engine.go — OpenConfig gains an optional
ColdStartObserver field. nil disables metrics; the skip itself
still runs.
internal/raftengine/etcd/wal_store.go — restoreSnapshotState
signature gains (obs, logger) parameters. The body switches from
the bool fsmAlreadyAtIndex to a coldStartDecision enum
(skip / execute / fallback-{not_reader,missing_meta,read_err})
returned by decideSkipOutcome. reportColdStart dispatches to the
observer + emits a zap.Info log on every cold-start.
- 'restoreSnapshotState skipped' with fsm_applied / snapshot_index
/ gap_ahead. The user-visible perf win logged once per restart.
- 'restoreSnapshotState executed (FSM behind snapshot)' with
gap_behind so operators can see how stale the FSM was.
- 'restoreSnapshotState fallback to full restore' with reason so
operators can chase 'why is the optimisation not engaging' in
production.
monitoring/cold_start.go — newColdStartMetrics + ColdStartObserver
struct. Two series:
elastickv_fsm_cold_start_restore_total{outcome,fallback_reason}
elastickv_fsm_cold_start_applied_index_gap{outcome}
(See doc-comments for the precise label semantics.)
monitoring/registry.go — Registry gains ColdStartObserver()
accessor. main.go (next commit, not in this PR) wires it to the
engine via OpenConfig.
Existing test call sites of restoreSnapshotState
(wal_store_test.go * 8 + grpc_transport_test.go * 1) updated to
pass (nil, nil) — tests opt out of metrics. The legacy bootstrap-
path loadWalState recursion in wal_store.go:522 also passes nil
because OpenConfig wiring has not yet reached that call.
Tests + lint: go vet + golangci-lint clean across all touched
packages; go test ./internal/raftengine/etcd/ ./kv/ ./store/
./monitoring/ -short ok 55s total.
Refs PR #910 design §9. B3 step 5 of the implementation series.
…tep 6)
Nine new tests covering the Branch 3 skip gate end-to-end:
skip-gate behaviour (5):
- TestSkipGate_SkipsWhenFSMFreshEnough — applied >= snap.Index
triggers the skip path; FSM body stays empty; observer records
one RestoreSkipped with gap_ahead; ApplySnapshotHeader fires
with the parsed ceiling.
- TestSkipGate_ExecutesWhenFSMStale — applied < snap.Index runs
the full restore; ApplySnapshotHeader does NOT fire; observer
records one RestoreExecuted with gap_behind.
- TestSkipGate_FallbackMissingMeta — appliedPresent=false falls
back to full restore with reason='missing_meta'.
- TestSkipGate_FallbackReadErr — LastAppliedIndex error collapses
to fallback (NOT propagated) with reason='read_err'.
- TestSkipGate_FallbackNotReader — dummyFSM (no
AppliedIndexReader) falls back with reason='not_reader'.
CRC failure modes (3):
- TestApplyHeaderStateOnSkip_TruncatedFile — empty file triggers
step 1 ErrFSMSnapshotTooSmall; FSM state unchanged.
- TestApplyHeaderStateOnSkip_WrongTokenCRC — bad tokenCRC arg
triggers step 2 ErrFSMSnapshotTokenCRC; FSM state unchanged.
- TestApplyHeaderStateOnSkip_BodyCorruption — XOR one byte of
the body in-place; step 1 + 2 pass but step 3
(full-body CRC) catches it via ErrFSMSnapshotFileCRC; FSM
state unchanged.
kvFSM contract (1):
- TestSkipGate_KVFSMHeaderRoundTrip — production *kv.kvFSM
satisfies SnapshotHeaderApplier; v1 header (EKVTHLC1 +
ceilingMs BE) is parsed and applied; hlc.PhysicalCeiling()
reflects the ceiling; RestoredCutover() is 0 for v1.
Test doubles:
- skipGateFSM implements StateMachine + AppliedIndexReader +
SnapshotHeaderApplier; configurable via field assignment.
- recordingObs implements raftengine.ColdStartObserver; records
every callback for assertion.
Test results:
- go test ./internal/raftengine/etcd/ -run 'TestSkipGate|TestApplyHeaderStateOnSkip' -count=1
-> ok 0.028s
- Full short test sweep across etcd/, kv/, store/, monitoring/
-> ok ~57s total.
- go vet + golangci-lint clean across all touched packages.
Closes B3 implementation series. Refs PR #910 design §4 / §5 / §9
test list.
|
Warning Review limit reached
More reviews will be available in 49 minutes and 16 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (17)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (ce277e6):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
@claude review Branch 3 of the PR #910 design — the user-visible cold-start perf win. Follows the merged Branch 2 (PR #915 — metaAppliedIndex plumbing). 6 commits, focused review pathways: Review observation surface:
|
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 4m 16s —— View job PR #934 Review — Branch 3: Cold-Start Snapshot-Restore Skip GateTodo list:
Overall assessmentThe implementation is correct and the core design invariants are faithfully preserved. The three design constraints given in the trigger — strictly-additive fallback (§4), two-phase seam (§5/round-7), and CRC mirroring (§5/round-6) — are all correctly implemented. Four findings below, in descending severity. Finding 1 —
|
There was a problem hiding this comment.
Code Review
This pull request introduces a cold-start snapshot-restore skip-gate optimization. It allows skipping the restoration of a multi-GiB snapshot body if the on-disk state machine (FSM) is already as fresh as or fresher than the snapshot. The changes include the definition of the ColdStartObserver and SnapshotHeaderApplier interfaces, integration into the WAL loading and FSM restore paths, implementation of Prometheus metrics for monitoring skip-gate outcomes, and comprehensive unit tests. There are no review comments, so I have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce277e6667
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // nil disables metrics; the skip itself still runs. See | ||
| // docs/design/2026_06_02_idempotent_snapshot_restore.md §9 and | ||
| // internal/raftengine/cold_start.go for the contract. | ||
| ColdStartObserver raftengine.ColdStartObserver |
There was a problem hiding this comment.
Plumb the cold-start observer through the factory
In the production startup path this observer is never populated: main.go creates monitoring.NewRegistry(...), but engines are opened through internal/raftengine/etcd.Factory.Create, and that method builds OpenConfig without a ColdStartObserver field because raftengine.FactoryConfig has no way to carry it. As a result loadWalState always receives obs == nil in normal server runs, so the new cold-start counters/gauges remain empty even when skips/executions/fallbacks happen. Please thread Registry.ColdStartObserver() through the factory config/open call.
Useful? React with 👍 / 👎.
PR #934 round-1 codex P2: in production, monitoring.NewRegistry's ColdStartObserver was never reaching the engine because internal/raftengine/etcd.Factory.Create built OpenConfig solely from the engine-agnostic raftengine.FactoryConfig (which has no place to carry the observer). The Branch 3 cold-start counters/gauges therefore stayed at zero in real runs even when skips/executions/fallbacks happened. Thread it via the engine-specific etcd.FactoryConfig (already holds TickInterval/HeartbeatTick/etc.) so the change is local to the etcd package and main.go — buildShardGroups, buildRuntimeForGroup, and the shared raftengine.FactoryConfig stay untouched. Wiring: - etcd.FactoryConfig gains ColdStartObserver raftengine.ColdStartObserver - Factory.Create copies it into OpenConfig{} - newRaftFactory in main.go now takes the observer arg and passes metricsRegistry.ColdStartObserver() through - moved the factory construction below metricsRegistry creation so the observer is available at factory build time Tests: - 3 existing newRaftFactory callers updated to pass nil (no metrics needed in those test scopes) - TestSkipGate_*/TestApplyHeaderStateOnSkip_*/TestSkipGate_KVFSMHeaderRoundTrip (9 tests) still pass — they exercise restoreSnapshotState directly with a recording observer test double, unaffected by the factory plumbing Verified: go vet ./..., golangci-lint run ./internal/raftengine/etcd/... ./monitoring/... ./internal/raftengine/... ./., and go test -count=1 ./internal/raftengine/etcd/ all clean.
|
@claude review Round-1 codex P2 addressed (commit 320d06e):
Kept the change inside the engine-specific Verified: |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (320d06e):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
Claude finished @bootjp's task in 2m 33s —— View job PR #934 Round-2 Review — ColdStartObserver factory wiring (commit 320d06e)Todo list:
Codex P2 — Closed correctly ✓The full wiring chain is intact:
Test callers: Bootstrap recursive path ( Nil-interface safety: One minor note (no action required)
Round-1 findings status
OverallP2 is closed correctly. The change is minimal and surgical — only |
|
To use Codex here, create a Codex account and connect to github. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 320d06efb8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| decision, have := decideSkipOutcome(fsm, tok.Index) | ||
| reportColdStart(obs, logger, decision, tok.Index, have) | ||
| if decision == coldStartSkip { | ||
| return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C) |
There was a problem hiding this comment.
Prevent skipping when WAL will replay newer entries
When the FSM has applied entries after the latest persisted snapshot (the normal interval between snapshots, since metaAppliedIndex advances on each Apply), LastAppliedIndex() > tok.Index takes this skip path even though Open still initializes the raft applied floor from LocalSnap.Metadata.Index and openAndReadWAL reloads the post-snapshot entries. Those entries are then replayed into a Pebble store that already contains them; transaction replays can hit OCC conflicts before kvFSM observes their commit timestamps, leaving the restarted node’s HLC below timestamps already present on disk. The skip needs to advance/suppress replay through have, or only skip when there are no post-snapshot WAL entries to replay.
Useful? React with 👍 / 👎.
Summary
Implements Branch 3 of the cold-start snapshot-restore skip optimisation designed in PR #910, building on PR #915 (B2 —
metaAppliedIndexplumbing). This is the user-visible perf win of the series: after this lands, the engine restores the FSM only when the on-disk state is actually behind the snapshot pointer; the common case (local restart with a fresh fsm.db) collapses to a header read + CRC verification, skipping the multi-GiB body restore that PR #909'sHEALTH_TIMEOUT_SECONDS=300band-aid was sized to absorb.Reading order (6 commits, designed for one-at-a-time review)
8312f4baraftengine.SnapshotHeaderAppliertwo-phase interface (ParseSnapshotHeader+ApplySnapshotHeader)024eb43ckvFSMimplements both phases + compile-time guard933dab86applyHeaderStateOnSkip(3-step CRC, mirrorsopenAndRestoreFSMSnapshot) +restoreSnapshotStateskip gatefsmAlreadyAtIndex+ skip-gate wiringraftengine.ColdStartObserver+monitoring.ColdStartMetrics+Registry.ColdStartObserver()ce277e66(Commit 5 was merged into the metrics commit; the actual git log is 5 commits.)
What this does
restoreSnapshotStatebranches ondecideSkipOutcome:decideSkipOutcomereturns one of 5 outcomes:skipLastAppliedIndex >= snap.IndexexecuteLastAppliedIndex < snap.Indexfallback_not_readerAppliedIndexReaderfallback_missing_metafallback_read_errLastAppliedIndexreturned an errorThe three fallback paths preserve the strictly-additive guarantee from the design doc §4: over-restoring is always strictly safer than skipping incorrectly. Errors from
LastAppliedIndexdo not abort cold start — they collapse to fallback.CRC verification (mirrors
openAndRestoreFSMSnapshot)applyHeaderStateOnSkipruns the same three-step safety contract as the full restore path. Failure of any step propagates a typed error WITHOUT mutating FSM state:info.Size() < fsmMinFileSizeErrFSMSnapshotTooSmallreadFSMFootervstokenCRCErrFSMSnapshotTokenCRCcrc32.TeeReadervs footerErrFSMSnapshotFileCRCThe FSM's
ApplySnapshotHeader(HLC ceiling + Stage 8a cutover) only fires after all three pass.Metrics + INFO log
raftengine.ColdStartObserverinterface lives in the parentraftenginepackage so monitoring can implement it without circular import.monitoring.RegistrygainsColdStartObserver(). The engine receives it throughOpenConfig.ColdStartObserver— nil disables metrics, the skip itself still runs.Two Prometheus series:
Plus a structured zap.Info log on every cold-start with fsm_applied / snapshot_index / gap and (for fallback) the reason enum.
Design constraints honoured
decideSkipOutcomecollapses every uncertainty to a fallback variant;LastAppliedIndexerrors do NOT abort cold start.ParseSnapshotHeaderreads the v1/v2 header from a caller-supplied reader and drains the rest;ApplySnapshotHeaderis pure assignment. Splits the parse from the side-effect so the engine can verify the CRC between them.openAndRestoreFSMSnapshot. The full-body CRC pass IS the slow part of this PR (~6s for a 6 GiB FSM at 1 GiB/s SSD read) — but still strictly cheaper than the restore it replaces, which also reads the file once AND writes a temp Pebble database viarestoreBatchLoopInto.not_reporterlabel is namednot_readerin the actual impl (matching the round-5 PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 rename).Engine.applySnapshot(engine.go:1641 InstallSnapshot hot path) is untouched. The TODO sentinel added in PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still names B3 as the follow-up site for the peer-after-InstallSnapshot bump — that's not in scope here either.Test results
What this does NOT do
HEALTH_TIMEOUT_SECONDS=300. That's Branch 4, gated on production data showing steady-state skip rate ≥ 90%.Engine.applySnapshotto populatemetaAppliedIndexfor peers receiving InstallSnapshot. The TODO sentinel from PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still flags this; addressing it requires a separate design pass.ELASTICKV_RAFT_SNAPSHOT_COUNT=10end-to-end). The unit tests cover the gate decision matrix; an integration test would prove the codex round-3 P2 scenario is closed end-to-end in production-like setup, which is most valuable as a follow-up after B3 has soaked.Soak plan
Same as B2: deploy and observe
elastickv_fsm_cold_start_restore_total{outcome="skipped"}ramp up across cold starts. The design target is steady-state skip rate ≥ 90%; the per-outcome label distribution and gap distribution are the primary signals.After one release of soak, Branch 4 can lower
HEALTH_TIMEOUT_SECONDS.Refs
metaAppliedIndexbeing populatedHEALTH_TIMEOUT_SECONDSband-aid this series eventually obviates