Skip to content

feat(snapshot-skip B3): cold-start restoreSnapshotState skip gate + CRC verifier + metrics#934

Open
bootjp wants to merge 6 commits into
mainfrom
impl/snapshot-skip-b3-skip-gate
Open

feat(snapshot-skip B3): cold-start restoreSnapshotState skip gate + CRC verifier + metrics#934
bootjp wants to merge 6 commits into
mainfrom
impl/snapshot-skip-b3-skip-gate

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 5, 2026

Summary

Implements Branch 3 of the cold-start snapshot-restore skip optimisation designed in PR #910, building on PR #915 (B2 — metaAppliedIndex plumbing). This is the user-visible perf win of the series: after this lands, the engine restores the FSM only when the on-disk state is actually behind the snapshot pointer; the common case (local restart with a fresh fsm.db) collapses to a header read + CRC verification, skipping the multi-GiB body restore that PR #909's HEALTH_TIMEOUT_SECONDS=300 band-aid was sized to absorb.

Reading order (6 commits, designed for one-at-a-time review)

# commit scope
1 8312f4ba raftengine.SnapshotHeaderApplier two-phase interface (ParseSnapshotHeader + ApplySnapshotHeader)
2 024eb43c kvFSM implements both phases + compile-time guard
3 933dab86 applyHeaderStateOnSkip (3-step CRC, mirrors openAndRestoreFSMSnapshot) + restoreSnapshotState skip gate
4 (folded into #3) fsmAlreadyAtIndex + skip-gate wiring
5 (next) metrics + INFO log: raftengine.ColdStartObserver + monitoring.ColdStartMetrics + Registry.ColdStartObserver()
6 ce277e66 9 new tests (skip-gate behaviour ×5 + CRC failure modes ×3 + kvFSM contract ×1)

(Commit 5 was merged into the metrics commit; the actual git log is 5 commits.)

What this does

restoreSnapshotState branches on decideSkipOutcome:

decision, have := decideSkipOutcome(fsm, tok.Index)
reportColdStart(obs, logger, decision, tok.Index, have)
if decision == coldStartSkip {
    return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C)
}
return openAndRestoreFSMSnapshot(...) // unchanged

decideSkipOutcome returns one of 5 outcomes:

outcome trigger action
skip LastAppliedIndex >= snap.Index header-only path
execute LastAppliedIndex < snap.Index full restore
fallback_not_reader FSM doesn't implement AppliedIndexReader full restore
fallback_missing_meta meta key absent (pre-upgrade fsm.db) full restore
fallback_read_err LastAppliedIndex returned an error full restore

The three fallback paths preserve the strictly-additive guarantee from the design doc §4: over-restoring is always strictly safer than skipping incorrectly. Errors from LastAppliedIndex do not abort cold start — they collapse to fallback.

CRC verification (mirrors openAndRestoreFSMSnapshot)

applyHeaderStateOnSkip runs the same three-step safety contract as the full restore path. Failure of any step propagates a typed error WITHOUT mutating FSM state:

Step Check Error
1 info.Size() < fsmMinFileSize ErrFSMSnapshotTooSmall
2 readFSMFooter vs tokenCRC ErrFSMSnapshotTokenCRC
3 full-body crc32.TeeReader vs footer ErrFSMSnapshotFileCRC

The FSM's ApplySnapshotHeader (HLC ceiling + Stage 8a cutover) only fires after all three pass.

Metrics + INFO log

raftengine.ColdStartObserver interface lives in the parent raftengine package so monitoring can implement it without circular import. monitoring.Registry gains ColdStartObserver(). The engine receives it through OpenConfig.ColdStartObserver — nil disables metrics, the skip itself still runs.

Two Prometheus series:

elastickv_fsm_cold_start_restore_total{outcome,fallback_reason}
elastickv_fsm_cold_start_applied_index_gap{outcome}

Plus a structured zap.Info log on every cold-start with fsm_applied / snapshot_index / gap and (for fallback) the reason enum.

Design constraints honoured

  • §4 strictly-additive fallback: decideSkipOutcome collapses every uncertainty to a fallback variant; LastAppliedIndex errors do NOT abort cold start.
  • §5 two-phase seam (round-7): ParseSnapshotHeader reads the v1/v2 header from a caller-supplied reader and drains the rest; ApplySnapshotHeader is pure assignment. Splits the parse from the side-effect so the engine can verify the CRC between them.
  • §5 CRC mirroring (round-6): 3-step verification (size + footer-vs-tokenCRC + full-body-CRC) exactly matches openAndRestoreFSMSnapshot. The full-body CRC pass IS the slow part of this PR (~6s for a 6 GiB FSM at 1 GiB/s SSD read) — but still strictly cheaper than the restore it replaces, which also reads the file once AND writes a temp Pebble database via restoreBatchLoopInto.
  • §9 observability: three outcomes + gap label; the design's not_reporter label is named not_reader in the actual impl (matching the round-5 PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 rename).
  • Non-Goals respected: Engine.applySnapshot (engine.go:1641 InstallSnapshot hot path) is untouched. The TODO sentinel added in PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still names B3 as the follow-up site for the peer-after-InstallSnapshot bump — that's not in scope here either.

Test results

go vet ./internal/raftengine/etcd/ ./internal/raftengine/ ./kv/ ./store/ ./monitoring/   → exit 0
golangci-lint run ... (same packages)                                                     → 0 issues
go test ./internal/raftengine/etcd/ ./kv/ ./store/ ./monitoring/ -short                   → ok ~57s total
go test ./internal/raftengine/etcd/ -run 'TestSkipGate|TestApplyHeaderStateOnSkip'        → ok 0.028s

What this does NOT do

  • Does NOT change HEALTH_TIMEOUT_SECONDS=300. That's Branch 4, gated on production data showing steady-state skip rate ≥ 90%.
  • Does NOT wire Engine.applySnapshot to populate metaAppliedIndex for peers receiving InstallSnapshot. The TODO sentinel from PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still flags this; addressing it requires a separate design pass.
  • Does NOT add an idle-cluster integration test (ELASTICKV_RAFT_SNAPSHOT_COUNT=10 end-to-end). The unit tests cover the gate decision matrix; an integration test would prove the codex round-3 P2 scenario is closed end-to-end in production-like setup, which is most valuable as a follow-up after B3 has soaked.

Soak plan

Same as B2: deploy and observe elastickv_fsm_cold_start_restore_total{outcome="skipped"} ramp up across cold starts. The design target is steady-state skip rate ≥ 90%; the per-outcome label distribution and gap distribution are the primary signals.

After one release of soak, Branch 4 can lower HEALTH_TIMEOUT_SECONDS.

Refs

bootjp added 5 commits June 5, 2026 15:42
Two-phase interface for the cold-start skip path to preserve the
snapshot header state (HLC ceiling, Stage 8a cutover) the FSM's
Restore would normally apply, without running the multi-GiB body
restore.

  - ParseSnapshotHeader(r) — reads the v1/v2 header from a caller-
    supplied reader (wrapped in a crc32 TeeReader by the engine)
    and drains the rest so the wrapping CRC covers the full
    payload. Returns the parsed (ceiling, cutover) WITHOUT mutating
    FSM state. Errors propagate; no side-effect on error.
  - ApplySnapshotHeader(ceiling, cutover) — pure assignment of the
    verified header state. Engine calls this only after Parse
    returned and the crc32 hash matched the file footer.

Splitting parse from apply lets the CRC verifier stay co-located
with its private helpers in internal/raftengine/etcd (matching
openAndRestoreFSMSnapshot's safety contract) while the v1/v2 header
parser stays inside the kv package where kv.ReadSnapshotHeader
already lives. Neither package imports the other in production.

Interface only; implementation + skip gate land in subsequent
commits. Refs PR #910 design §5 round-7 + post-B2 implementation
series.
Implements the two-phase opt-in interface added in step 1.

  - ParseSnapshotHeader(r) — wraps r in a bufio.Reader, calls
    kv.ReadSnapshotHeader (handles v1/v2/headerless/unknown magic),
    drains remaining body bytes via io.Copy(io.Discard, br) so the
    engine's wrapping crc32 TeeReader covers every payload byte.
    Returns the parsed (ceiling, cutover) tuple WITHOUT mutating
    f.hlc or f.restoredCutover — that side-effect waits for phase 2
    after the engine confirms the CRC.

  - ApplySnapshotHeader(ceiling, cutover) — pure assignment of the
    verified header state. Mirrors the two side-effects from Restore
    that are scoped to the header portion: SetPhysicalCeiling +
    restoredCutover.

The drain pass is essential: without it, the engine's TeeReader
would only hash the first ~26 bytes of the body (whatever
ReadSnapshotHeader consumed) and the wrapping CRC check would not
correspond to the file footer. With the drain, the CRC matches
restoreAndComputeCRC's full-payload boundary in
openAndRestoreFSMSnapshot.

Added compile-time guard in fsm_applied_index_iface_check.go:

  var _ raftengine.SnapshotHeaderApplier = (*kvFSM)(nil)

so a future rename or signature drift fails 'go build' immediately
rather than silently degrading the skip optimisation.

Tests: go vet ./kv/ + golangci-lint clean. Skip-gate consumer lands
in subsequent commits. Refs PR #910 design §5 round-7 + B3 step 1.
…s 3+4)

Wires the user-visible cold-start optimisation from the PR #910
design. After this commit the engine restores the FSM only when
strictly necessary; the common case (local restart with a fresh
fsm.db) collapses to a header read + CRC verification, skipping
the multi-GiB body restore that PR #909's HEALTH_TIMEOUT_SECONDS=300
band-aid was sized to absorb.

restoreSnapshotState now branches on fsmAlreadyAtIndex:

  if fsmAlreadyAtIndex(fsm, tok.Index) {
      return applyHeaderStateOnSkip(fsm, snapPath, tok.CRC32C)
  }
  return openAndRestoreFSMSnapshot(...) // unchanged

fsmAlreadyAtIndex returns true ONLY when the FSM exposes
raftengine.AppliedIndexReader AND LastAppliedIndex reports
(have, true, nil) AND have >= want. Any uncertainty -- FSM doesn't
opt in, store reports missing meta key, LastAppliedIndex returns an
error -- collapses to false so the caller falls back to the full
restore path. Strictly-additive guarantee (PR #910 §4): a wasteful
restore is strictly safer than a stale-but-incorrect skip.

applyHeaderStateOnSkip mirrors openAndRestoreFSMSnapshot's three-
step safety contract exactly:

  1. fsmMinFileSize  -> ErrFSMSnapshotTooSmall
  2. readFSMFooter vs tokenCRC -> ErrFSMSnapshotTokenCRC
  3. crc32 TeeReader over the full payload -> ErrFSMSnapshotFileCRC

Only after all three pass does the FSM's ApplySnapshotHeader run
(pure assignment of ceiling + cutover). FSMs that do not implement
raftengine.SnapshotHeaderApplier (legacy test fakes) still get the
CRC verification but no header side-effects -- they have no header
state to carry forward.

Helpers extracted to keep cyclomatic complexity under 10:

  - verifyFSMSnapshotPrefix: steps 1 + 2 (size + footer-vs-token).
  - readSnapshotHeaderOrDrain: step 3's seam-vs-no-seam branch
    (delegates to setter.ParseSnapshotHeader OR drains
    io.Discard through the tee).

Imports gained: 'hash/crc32' and the parent raftengine package
(for the AppliedIndexReader + SnapshotHeaderApplier interfaces).

Tests: go vet + golangci-lint clean across etcd/, kv/, store/;
existing test suite still green
(internal/raftengine/etcd 21.5s, kv 9.5s, store 23.9s).

Refs PR #910 design §4 + §5 round-7; closes the user-visible
cold-start side of the implementation series.
…(B3 step 5)

Wires Prometheus metrics + structured INFO logs around the
restoreSnapshotState skip decision so operators can verify the
optimisation engages in production (design target: >= 90% skip rate
in steady state after B2's soak phase).

internal/raftengine/cold_start.go — new ColdStartObserver
interface with three methods (RestoreSkipped / RestoreExecuted /
RestoreFallback). Lives in the parent raftengine package so
monitoring can implement it without circular import. Contract
documents the strictly-additive fallback reason enum (not_reader /
missing_meta / read_err) and the no-blocking requirement (cold-
start critical path).

internal/raftengine/etcd/engine.go — OpenConfig gains an optional
ColdStartObserver field. nil disables metrics; the skip itself
still runs.

internal/raftengine/etcd/wal_store.go — restoreSnapshotState
signature gains (obs, logger) parameters. The body switches from
the bool fsmAlreadyAtIndex to a coldStartDecision enum
(skip / execute / fallback-{not_reader,missing_meta,read_err})
returned by decideSkipOutcome. reportColdStart dispatches to the
observer + emits a zap.Info log on every cold-start.

  - 'restoreSnapshotState skipped' with fsm_applied / snapshot_index
    / gap_ahead. The user-visible perf win logged once per restart.
  - 'restoreSnapshotState executed (FSM behind snapshot)' with
    gap_behind so operators can see how stale the FSM was.
  - 'restoreSnapshotState fallback to full restore' with reason so
    operators can chase 'why is the optimisation not engaging' in
    production.

monitoring/cold_start.go — newColdStartMetrics + ColdStartObserver
struct. Two series:

  elastickv_fsm_cold_start_restore_total{outcome,fallback_reason}
  elastickv_fsm_cold_start_applied_index_gap{outcome}

(See doc-comments for the precise label semantics.)

monitoring/registry.go — Registry gains ColdStartObserver()
accessor. main.go (next commit, not in this PR) wires it to the
engine via OpenConfig.

Existing test call sites of restoreSnapshotState
(wal_store_test.go * 8 + grpc_transport_test.go * 1) updated to
pass (nil, nil) — tests opt out of metrics. The legacy bootstrap-
path loadWalState recursion in wal_store.go:522 also passes nil
because OpenConfig wiring has not yet reached that call.

Tests + lint: go vet + golangci-lint clean across all touched
packages; go test ./internal/raftengine/etcd/ ./kv/ ./store/
./monitoring/ -short ok 55s total.

Refs PR #910 design §9. B3 step 5 of the implementation series.
…tep 6)

Nine new tests covering the Branch 3 skip gate end-to-end:

skip-gate behaviour (5):
  - TestSkipGate_SkipsWhenFSMFreshEnough — applied >= snap.Index
    triggers the skip path; FSM body stays empty; observer records
    one RestoreSkipped with gap_ahead; ApplySnapshotHeader fires
    with the parsed ceiling.
  - TestSkipGate_ExecutesWhenFSMStale — applied < snap.Index runs
    the full restore; ApplySnapshotHeader does NOT fire; observer
    records one RestoreExecuted with gap_behind.
  - TestSkipGate_FallbackMissingMeta — appliedPresent=false falls
    back to full restore with reason='missing_meta'.
  - TestSkipGate_FallbackReadErr — LastAppliedIndex error collapses
    to fallback (NOT propagated) with reason='read_err'.
  - TestSkipGate_FallbackNotReader — dummyFSM (no
    AppliedIndexReader) falls back with reason='not_reader'.

CRC failure modes (3):
  - TestApplyHeaderStateOnSkip_TruncatedFile — empty file triggers
    step 1 ErrFSMSnapshotTooSmall; FSM state unchanged.
  - TestApplyHeaderStateOnSkip_WrongTokenCRC — bad tokenCRC arg
    triggers step 2 ErrFSMSnapshotTokenCRC; FSM state unchanged.
  - TestApplyHeaderStateOnSkip_BodyCorruption — XOR one byte of
    the body in-place; step 1 + 2 pass but step 3
    (full-body CRC) catches it via ErrFSMSnapshotFileCRC; FSM
    state unchanged.

kvFSM contract (1):
  - TestSkipGate_KVFSMHeaderRoundTrip — production *kv.kvFSM
    satisfies SnapshotHeaderApplier; v1 header (EKVTHLC1 +
    ceilingMs BE) is parsed and applied; hlc.PhysicalCeiling()
    reflects the ceiling; RestoredCutover() is 0 for v1.

Test doubles:
  - skipGateFSM implements StateMachine + AppliedIndexReader +
    SnapshotHeaderApplier; configurable via field assignment.
  - recordingObs implements raftengine.ColdStartObserver; records
    every callback for assertion.

Test results:
  - go test ./internal/raftengine/etcd/ -run 'TestSkipGate|TestApplyHeaderStateOnSkip' -count=1
    -> ok 0.028s
  - Full short test sweep across etcd/, kv/, store/, monitoring/
    -> ok ~57s total.
  - go vet + golangci-lint clean across all touched packages.

Closes B3 implementation series. Refs PR #910 design §4 / §5 / §9
test list.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 49 minutes and 16 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f8010ef1-7281-4083-9b71-fd37d71bc4f0

📥 Commits

Reviewing files that changed from the base of the PR and between d23cf69 and 320d06e.

📒 Files selected for processing (17)
  • internal/raftengine/cold_start.go
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/factory.go
  • internal/raftengine/etcd/grpc_transport_test.go
  • internal/raftengine/etcd/wal_store.go
  • internal/raftengine/etcd/wal_store_skip_gate_test.go
  • internal/raftengine/etcd/wal_store_test.go
  • internal/raftengine/statemachine.go
  • kv/fsm.go
  • kv/fsm_applied_index_iface_check.go
  • main.go
  • main_bootstrap_e2e_test.go
  • main_bootstrap_test.go
  • monitoring/cold_start.go
  • monitoring/hlc_test.go
  • monitoring/registry.go
  • multiraft_runtime_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch impl/snapshot-skip-b3-skip-gate

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (ce277e6):

  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 5, 2026

@claude review

Branch 3 of the PR #910 design — the user-visible cold-start perf win. Follows the merged Branch 2 (PR #915 — metaAppliedIndex plumbing). 6 commits, focused review pathways:

Review observation surface:

  • §4 strictly-additive fallback: decideSkipOutcome collapses every uncertainty (FSM not a reader, meta key missing, read error) to a fallback enum; LastAppliedIndex errors MUST NOT abort cold start; tests TestSkipGate_Fallback* cover all three reasons.
  • §5 two-phase seam (round-7): ParseSnapshotHeader(r io.Reader) (ceiling, cutover uint64, err error) + ApplySnapshotHeader(ceiling, cutover uint64). Engine wraps r in a crc32 TeeReader; FSM Parse reads + drains; engine verifies CRC; engine calls Apply ONLY on verification success. The body bytes are read but discarded (fsm.db already has equivalent state).
  • §5 CRC three-step (round-6): size + footer-vs-tokenCRC + full-body-CRC, byte-for-byte matching openAndRestoreFSMSnapshot's safety contract. Tests TestApplyHeaderStateOnSkip_* inject each failure mode and assert FSM state unchanged.
  • §6 nosync mode: not relevant here (B3 reads only; B2 handled the write side).
  • §9 observability: 3 outcomes + gap label + structured INFO log. raftengine.ColdStartObserver interface in the parent package so monitoring can implement without circular import.
  • Compile-time guards: var _ raftengine.SnapshotHeaderApplier = (*kvFSM)(nil) added next to the existing AppliedIndexReader/Writer guards — a rename or signature drift fails go build rather than silently degrading the skip to full-restore-always.
  • Non-Goals respected: Engine.applySnapshot untouched (peer-after-InstallSnapshot bump remains a B4 / follow-up; the TODO sentinel from PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still names it).
  • Caller audit (per /loop semantic-change rule): restoreSnapshotState signature changed (added obs + logger). All in-package callers updated; 9 test call sites updated to pass (nil, nil) — tests opt out of metrics.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

Claude finished @bootjp's task in 4m 16s —— View job


PR #934 Review — Branch 3: Cold-Start Snapshot-Restore Skip Gate

Todo list:

  • Read changed files and understand the implementation
  • Review decideSkipOutcome / strictly-additive fallback (§4)
  • Review two-phase seam: ParseSnapshotHeader + ApplySnapshotHeader (§5 round-7)
  • Review CRC three-step verification (§5 round-6)
  • Review observability (ColdStartObserver + metrics) (§9)
  • Review compile-time guards
  • Review test coverage
  • TLA+ spec divergence check (kv/fsm.go anchor)
  • Final summary

Overall assessment

The implementation is correct and the core design invariants are faithfully preserved. The three design constraints given in the trigger — strictly-additive fallback (§4), two-phase seam (§5/round-7), and CRC mirroring (§5/round-6) — are all correctly implemented. Four findings below, in descending severity.


Finding 1 — ParseSnapshotHeader error short-circuits before the full-body CRC is verified (design trade-off to document)

Location: internal/raftengine/etcd/wal_store.go:409-415

ceiling, cutover, err := readSnapshotHeaderOrDrain(setter, hasSetter, tee)
if err != nil {
    return err   // ← returns header error; h.Sum32() is never reached
}

if h.Sum32() != footer {
    return errors.Wrapf(ErrFSMSnapshotFileCRC, ...)
}

If ParseSnapshotHeader returns an error (e.g., ErrSnapshotHeaderUnknownMagic), the full-body CRC step is skipped. This is not a safety bug — the header error propagates, so the cold start correctly fails closed and FSM state is not mutated. However, it means a file that passes the step-2 (footer-vs-token) check but has a corrupt header surfaces as a header format error rather than ErrFSMSnapshotFileCRC.

This differs subtly from openAndRestoreFSMSnapshot where restoreAndComputeCRC runs the full-body CRC before fsm.Restore can even report a parse error. The skip path inverts the order: header parse first (inside the CRC pass), CRC checked after. The design deliberately separates parse from apply, so this ordering is intentional. A one-line comment explaining why early-return on parse error is acceptable here would help future readers avoid "fixing" this to match the full-restore flow.


Finding 2 — have == want boundary case is not tested

Location: internal/raftengine/etcd/wal_store_skip_gate_test.go

TestSkipGate_SkipsWhenFSMFreshEnough uses appliedIdx = 200, snapIndex = 100 (well ahead). TestSkipGate_ExecutesWhenFSMStale uses appliedIdx = 100, snapIndex = 200 (well behind). The default branch in decideSkipOutcome fires at have >= want, so have == want is a skip. This boundary case matters for two reasons:

  1. The observer records gap_ahead = have - snapIndex = 0 — a zero gap is valid, but needs explicit confirmation.
  2. The appliedIndexGap gauge would be set to 0, which Prometheus will emit and operators should expect.

[Fix this →](https://claude.ai/code?q=Add%20TestSkipGate_SkipsAtExactBoundary%20to%20internal%2Frightengine%2Fetcd%2Fwal_store_skip_gate_test.go%20on%20branch%20impl%2Fsnapshot-skip-b3-skip-gate.%20Use%20appliedIdx%20%3D%20snapIndex%20%3D%20100%2C%20verify%20restoreSnapshotState%20returns%20nil%2C%20fsm.bodyBytes%20is%20empty%2C%20fsm.restoredHeader%20is%20true%2C%20and%20obs.skipped%20%3D%20%5B%5D uint64%7B0%7D&repo=bootjp/elastickv)


Finding 3 — RestoreFallback discards snapIndex; the appliedIndexGap metric emits nothing for fallback — confirm intentional

Location: monitoring/cold_start.go:87-93

func (o *ColdStartObserver) RestoreFallback(snapIndex uint64, reason string) {
    _ = snapIndex    // ← explicitly discarded
    if o == nil || o.metrics == nil {
        return
    }
    o.metrics.restoreTotal.WithLabelValues("fallback", reason).Inc()
    // no appliedIndexGap update
}

This is correct per the design ("store could not authoritatively name a value"). But operators watching only elastickv_fsm_cold_start_applied_index_gap may see a missing data point and confuse it with a broken metric rather than an expected fallback. A Prometheus absente() alert on outcome=~"skipped|executed" would be the right operational guidance. Consider adding this to the runbook or alerting notes, not to the code.


Finding 4 — TLA+ spec: no divergence

kv/fsm.go anchor (applyHLCLease / SetPhysicalCeiling):

The new ApplySnapshotHeader calls:

if f.hlc != nil && ceiling > 0 {
    f.hlc.SetPhysicalCeiling(int64(ceiling))
}

This is a cold-start-only path, invoked before the Raft event loop starts. It sets the same ceiling that Restore → ReadSnapshotHeader → SetPhysicalCeiling would have set in the full-restore path. The HLC-4 preconditions are unaffected:

  • (i) bounded skew — ceiling is read from the persisted snapshot, not the local wall clock.
  • (ii) logical-counter handoff on leader change — not applicable at cold start; leader election has not happened yet.
  • (iii) commit-time ceiling fence — the skip path issues no writes.

No OCC, MVCC visibility, or routing-catalog invariants are touched. No spec update required.


Minor notes (no action required)

  • //nolint:mnd on 1<<20 in ParseSnapshotHeader (kv/fsm.go): Correct given the cross-package constant boundary — fsmRestoreReadAhead lives in internal/raftengine/etcd which kv cannot import. The inline literal matches the value. ✅

  • fallback_reason label is present but "" for skipped/executed outcomes — This is idiomatic Prometheus (sparse label rather than absent label) and avoids cardinality explosion. Consistent with conventions used elsewhere in the codebase. ✅

  • Seek-back symmetry: applyHeaderStateOnSkip correctly calls file.Seek(0, io.SeekStart) after verifyFSMSnapshotPrefix leaves the file cursor after the footer, mirroring openAndRestoreFSMSnapshot:294. ✅

  • bootstrapNewCluster / persistBootState pass nil observer — correct; the bootstrap snapshot restore is a no-op for fresh clusters, and the comment in persistBootState explicitly explains why no observer is needed. ✅

  • monitoring/hlc_test.go alignment-only diff (column alignment) — no logic change. ✅


Summary

Core correctness is solid. Finding 1 (header-error vs CRC-error ordering) is a design consequence worth documenting inline. Finding 2 (boundary test at have == want) is the only gap in coverage. Finding 3 is an operational note. No spec divergence.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a cold-start snapshot-restore skip-gate optimization. It allows skipping the restoration of a multi-GiB snapshot body if the on-disk state machine (FSM) is already as fresh as or fresher than the snapshot. The changes include the definition of the ColdStartObserver and SnapshotHeaderApplier interfaces, integration into the WAL loading and FSM restore paths, implementation of Prometheus metrics for monitoring skip-gate outcomes, and comprehensive unit tests. There are no review comments, so I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce277e6667

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// nil disables metrics; the skip itself still runs. See
// docs/design/2026_06_02_idempotent_snapshot_restore.md §9 and
// internal/raftengine/cold_start.go for the contract.
ColdStartObserver raftengine.ColdStartObserver
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Plumb the cold-start observer through the factory

In the production startup path this observer is never populated: main.go creates monitoring.NewRegistry(...), but engines are opened through internal/raftengine/etcd.Factory.Create, and that method builds OpenConfig without a ColdStartObserver field because raftengine.FactoryConfig has no way to carry it. As a result loadWalState always receives obs == nil in normal server runs, so the new cold-start counters/gauges remain empty even when skips/executions/fallbacks happen. Please thread Registry.ColdStartObserver() through the factory config/open call.

Useful? React with 👍 / 👎.

PR #934 round-1 codex P2: in production, monitoring.NewRegistry's
ColdStartObserver was never reaching the engine because
internal/raftengine/etcd.Factory.Create built OpenConfig solely from the
engine-agnostic raftengine.FactoryConfig (which has no place to carry the
observer). The Branch 3 cold-start counters/gauges therefore stayed at
zero in real runs even when skips/executions/fallbacks happened.

Thread it via the engine-specific etcd.FactoryConfig (already holds
TickInterval/HeartbeatTick/etc.) so the change is local to the etcd
package and main.go — buildShardGroups, buildRuntimeForGroup, and the
shared raftengine.FactoryConfig stay untouched.

Wiring:
- etcd.FactoryConfig gains ColdStartObserver raftengine.ColdStartObserver
- Factory.Create copies it into OpenConfig{}
- newRaftFactory in main.go now takes the observer arg and passes
  metricsRegistry.ColdStartObserver() through
- moved the factory construction below metricsRegistry creation so the
  observer is available at factory build time

Tests:
- 3 existing newRaftFactory callers updated to pass nil (no metrics
  needed in those test scopes)
- TestSkipGate_*/TestApplyHeaderStateOnSkip_*/TestSkipGate_KVFSMHeaderRoundTrip
  (9 tests) still pass — they exercise restoreSnapshotState directly with
  a recording observer test double, unaffected by the factory plumbing

Verified: go vet ./..., golangci-lint run ./internal/raftengine/etcd/...
./monitoring/... ./internal/raftengine/... ./., and
go test -count=1 ./internal/raftengine/etcd/ all clean.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 5, 2026

@claude review

Round-1 codex P2 addressed (commit 320d06e):

monitoring.Registry.ColdStartObserver() now flows through to OpenConfig.ColdStartObserver in production:

  • etcd.FactoryConfig gained ColdStartObserver raftengine.ColdStartObserver
  • Factory.Create copies it into the OpenConfig{} literal at factory.go:68
  • main.go reorders so metricsRegistry is built before newRaftFactory, then passes metricsRegistry.ColdStartObserver() in
  • Three newRaftFactory callers in tests pass nil (those scopes do not exercise metrics)

Kept the change inside the engine-specific etcd.FactoryConfig rather than the shared raftengine.FactoryConfig to avoid touching buildShardGroups, buildRuntimeForGroup, and unrelated test wiring.

Verified: go vet ./..., golangci-lint run (0 issues across raftengine/etcd, monitoring, raftengine, root), go test ./internal/raftengine/etcd/ green, all 9 TestSkipGate_* / TestApplyHeaderStateOnSkip_* still pass.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (320d06e):

  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

Claude finished @bootjp's task in 2m 33s —— View job


PR #934 Round-2 Review — ColdStartObserver factory wiring (commit 320d06e)

Todo list:

  • Load tools
  • Read changed files (factory.go, main.go, engine.go, cold_start.go, registry.go)
  • Verify ColdStartObserver flows correctly end-to-end in production path
  • Check test callers pass nil cleanly
  • Re-verify Round-1 findings status (boundary test, header-error comment)
  • Post final review

Codex P2 — Closed correctly ✓

The full wiring chain is intact:

Step Location Verified
NewRegistry builds coldStartObs monitoring/registry.go:57
Registry.ColdStartObserver() returns it monitoring/registry.go:65-70
metricsRegistry constructed before factory call main.go:323 vs main.go:331
newRaftFactory takes coldStartObs raftengine.ColdStartObserver main.go:53
FactoryConfig.ColdStartObserver carries it factory.go:26
Factory.Create copies into OpenConfig{ColdStartObserver: ...} factory.go:68
Open passes cfg.ColdStartObserver to loadWalState wal_store.go:43
loadWalStaterestoreSnapshotStatereportColdStart wal_store.go:135, 264
reportColdStart guards every dispatch with obs != nil wal_store.go:329,340,354

Test callers: main_bootstrap_test.go:94, main_bootstrap_e2e_test.go:354, multiraft_runtime_test.go:54,109 all pass nil correctly — tests don't exercise metrics.

Bootstrap recursive path (wal_store.go:527): passes nil with an explanatory comment — this is the fresh-bootstrap path that runs before OpenConfig wiring, and a fresh-bootstrap restore is a no-op anyway. Correct.

Nil-interface safety: reportColdStart guards with obs != nil (interface-level check), which is the right check for an interface type. *ColdStartObserver methods additionally check o == nil for the concrete receiver, providing defence-in-depth if the observer is ever passed as a non-nil interface wrapping a nil pointer. ✓


One minor note (no action required)

factory.go:24 and main.go:325-330 both reference "PR #934 round-1 codex P2". Per the codebase's comment convention, PR references in inline comments rot as the codebase evolves; the stable reference is docs/design/2026_06_02_idempotent_snapshot_restore.md §9. The information is accurate now but won't age well. Suggest replacing with the design-doc reference on cleanup, though it's not blocking.


Round-1 findings status

Finding Status
F1 — document header-error-before-CRC ordering in applyHeaderStateOnSkip Open (minor, documentation only)
F2 — add TestSkipGate_SkipsAtExactBoundary (have == want, gap=0) Open (test completeness)
F3 — operational note on appliedIndexGap absence for fallback Acknowledged, no code change needed
F4 — TLA+ spec: no divergence Confirmed
Codex P2 — ColdStartObserver not plumbed in production path Closed by 320d06e

Overall

P2 is closed correctly. The change is minimal and surgical — only FactoryConfig, Factory.Create, and the main initialisation order needed touching, which is exactly the right scope. No new issues introduced. The two remaining open findings (F1 comment, F2 boundary test) are cosmetic/test-completeness items that don't affect correctness or safety.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 320d06efb8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +264 to +267
decision, have := decideSkipOutcome(fsm, tok.Index)
reportColdStart(obs, logger, decision, tok.Index, have)
if decision == coldStartSkip {
return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prevent skipping when WAL will replay newer entries

When the FSM has applied entries after the latest persisted snapshot (the normal interval between snapshots, since metaAppliedIndex advances on each Apply), LastAppliedIndex() > tok.Index takes this skip path even though Open still initializes the raft applied floor from LocalSnap.Metadata.Index and openAndReadWAL reloads the post-snapshot entries. Those entries are then replayed into a Pebble store that already contains them; transaction replays can hit OCC conflicts before kvFSM observes their commit timestamps, leaving the restarted node’s HLC below timestamps already present on disk. The skip needs to advance/suppress replay through have, or only skip when there are no post-snapshot WAL entries to replay.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant