Skip to content

backup: Phase 0b M6 implementation - cmd/elastickv-snapshot-encode + library#904

Merged
bootjp merged 37 commits into
mainfrom
backup/m6-cli-design
Jun 3, 2026
Merged

backup: Phase 0b M6 implementation - cmd/elastickv-snapshot-encode + library#904
bootjp merged 37 commits into
mainfrom
backup/m6-cli-design

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 1, 2026

Summary

Phase 0b M6 implementation per the merged design doc docs/design/2026_06_01_proposed_snapshot_encode_cli.md (#896, merged at fe9e941).

Wires the merged M1–M5 encoder slices into a user-facing CLI plus a library entrypoint mirroring the decoder's DecodeSnapshot. Implements the round-trip self-test the parent doc mandates, with write-then-rename atomic publish so a self-test failure never reaches the restore path.

What lands

Library (internal/backup/):

  • encode_snapshot.go: EncodeSnapshot(opts, out io.Writer) (EncodeResult, error) — high-level wrapper that dispatches per-adapter encoders in canonical fan-out order (redis → dynamodb → s3 → sqs), implements two-mode buffering (stream when SelfTest=false, buffer when SelfTest=true), runs the structural self-test against the in-memory buffer, and copies to out only on match. Unexported corruptBufferForTest hook lets same-package tests inject buffer corruption that reaches the self-test decode but never out (codex P2 v6 docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI #896 — write-then-rename atomicity).
  • encode_info.go: EncodeInfo schema + WriteEncodeInfo / ReadEncodeInfo helpers + EncodeInfoSidecarPath (path-derived <output>.encode_info.json, no static-name collisions per gemini medium v2 docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI #896). Format-version gate so a future bump surfaces as ErrUnsupportedEncodeInfoFormatVersion.
  • manifest.go: Exclusions.RenameS3Collisions bool added with JSON tag rename_s3_collisions. Intentionally NOT in exclusionsRequiredFields so legacy manifests decode safely with the zero value false (claude v5 medium docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI #896).

Decoder CLI update (cmd/elastickv-snapshot-decode/main.go):

  • emitManifest now populates the new RenameS3Collisions field from cfg.renameCollisions, completing the decoder→encoder round-trip for --rename-collisions dumps.

Encoder CLI (cmd/elastickv-snapshot-encode/main.go):

  • Flags: --input, --output, --adapter (decoder-parity CSV parser), --last-commit-ts, --self-test, --scratch-root.
  • Atomic publish: write to <output>.tmp-<random>, fsync+close, then rename. Self-test failure → mismatch.txt next to where .fsm would have been, temp file removed, exit 2.
  • Fail-closed HLC ceiling: --last-commit-ts T < manifest → exit 2 with typed ErrSelfTestLowerLastCommitTS.
  • Exit codes: 0 success / 1 user-input error / 2 data-correctness failure (decoder-parity).
  • encoder_version stamped at build time via -ldflags "-X main.version=..." (mirrors decoder pattern at main.go:45).

Test plan

go test -race -count=1 — all green:

  • internal/backup/: 3 new tests (encode_info schema, format gate, legacy manifest forward-compat)
  • internal/backup/encode_snapshot_test.go: 5 tests (library round-trip, self-test match against canonicalized input, corruption never reaches out, missing-input guard, sidecar path derivation)
  • cmd/elastickv-snapshot-encode/main_test.go: 9 tests (missing manifest, unknown adapter, lower-TS fail-closed, equal+higher TS accept, path-derived sidecar, two-files-no-collision, full round-trip with --self-test, atomic-publish never leaves bad .fsm, --last-commit-ts parser)

make lint: clean.

Caller audit per CLAUDE.md semantic-change rule

  • Exclusions struct gained a field. Existing callers either build via field-tagged literals (decoder CLI's emitManifest — updated to populate the new field) or read it (encoder's buildSelfTestDecodeOptions — new code). No silent semantic change.
  • DecodeOptions.RenameS3Collisions was already a public field used by the decoder; the encoder now also reads it via the manifest. No caller-side change needed.

Self-review (5 passes)

  1. Data loss: write-then-rename atomic publish — self-test failure never publishes the .fsm. Corrupted buffer never reaches the io.Writer (pinned by TestEncodeSnapshotSelfTestDetectsCorruption asserting out.Len()==0). All b.Add errors propagate through EncodeSnapshot.
  2. Concurrency: pure offline. CLI is single-shot; library takes a caller-owned io.Writer. Temp-file suffix uses crypto/rand so concurrent encodes against the same --output cannot collide.
  3. Performance: SelfTest=false streams with one sha256.Writer tee, no extra allocations. SelfTest=true allocates one FSM-sized *bytes.Buffer plus the scratch decode tree (documented memory cost).
  4. Data consistency: --last-commit-ts T < manifest fail-closed with typed error; self-test threads MANIFEST DecodeOptions (Exclusions.* + DynamoDBLayout → DynamoDBBundleJSONL) so trees produced with non-default decoder flags round-trip cleanly.
  5. Test coverage: 17 new tests cover library entrypoint, CLI flag parsing, atomic publish discipline, sidecar path-derivation, corruption detection, forward-compat for legacy manifests, and the four user-visible behaviors (round-trip / override / scope / missing-manifest).

Risk

Low. The encoder is offline; restoration is non-destructive (new keyspace on a fresh cluster via stop-replace-restart). The only public-API change is the Exclusions.RenameS3Collisions field, which is forward-compat for older manifests. All existing M1–M5 tests continue to pass.

Summary by CodeRabbit

  • New Features

    • New snapshot encode CLI with atomic publish, multi-adapter encoding, optional self-test verification, and per-output encode sidecar (.encode_info.json).
    • Sidecar safe-open exported and sidecar permission tightening to owner-only mode.
  • Bug Fixes

    • Manifests now include/preserve an S3 rename-collision flag while remaining compatible with older manifests.
    • Self-test failures avoid publishing bad outputs, preserve prior artifacts appropriately, and reject unexpected non-directory S3 entries.
  • Tests

    • Extensive unit and end-to-end tests covering CLI, library, self-test, manifest compatibility, sidecar handling, and filesystem safety.

bootjp added 2 commits June 2, 2026 00:53
Implementation of the M6 design (this PR's earlier doc commits 31c0a0f
through 1736435). Wires the merged M1-M5 encoder slices into a
user-facing CLI plus a library entrypoint, with the round-trip self-test
the parent doc mandates.

## What ships

- internal/backup/manifest.go: Exclusions gains RenameS3Collisions bool
  with json tag rename_s3_collisions. Intentionally NOT added to
  exclusionsRequiredFields so legacy manifests decode safely with the
  zero value false. Pinned by TestExclusionsLegacyManifestOmitsRenameS3Collisions.

- cmd/elastickv-snapshot-decode/main.go: emitManifest now populates
  the new field from cfg.renameCollisions, so M6's self-test can thread
  the same option back through DecodeSnapshot.

- internal/backup/encode_info.go: ENCODE_INFO.json schema with
  format_version gate, NewEncodeInfo constructor, WriteEncodeInfo /
  ReadEncodeInfo helpers, and EncodeInfoSidecarPath that derives the
  sidecar from the .fsm path (<output>.encode_info.json — no static
  name collisions).

- internal/backup/encode_snapshot.go: EncodeSnapshot(opts, out) library
  entrypoint mirroring DecodeSnapshot. Dispatches per-adapter encoders
  in canonical fan-out order (redis -> dynamodb -> s3 -> sqs), and
  implements the two-mode buffering model:
  - SelfTest=false: stream FSM straight to out with a sha256.Writer tee.
  - SelfTest=true: buffer FSM in *bytes.Buffer, fire the unexported
    corruptBufferForTest hook (if set), self-test against the buffer,
    copy to out ONLY on match. Corruption hooks reach the self-test
    decode but never reach out (write-then-rename atomicity).

- cmd/elastickv-snapshot-encode/main.go: CLI with --input / --output /
  --adapter / --last-commit-ts / --self-test / --scratch-root flags;
  decoder-parity adapter CSV parser; fail-closed
  T >= manifest.last_commit_ts validation; atomic publish via
  <output>.tmp-<random> with deferred cleanup; sidecar emission with
  fsync+close discipline.

## Tests (12 new)

internal/backup/encode_info_test.go:
- TestEncodeInfoRoundTrip
- TestEncodeInfoRejectsUnknownFormatVersion
- TestExclusionsLegacyManifestOmitsRenameS3Collisions

internal/backup/encode_snapshot_test.go:
- TestEncodeSnapshotLibraryRoundTrip
- TestEncodeSnapshotSelfTestMatchesInput (uses canonicalize-once
  pattern: encode -> decode -> re-encode self-tests cleanly)
- TestEncodeSnapshotSelfTestDetectsCorruption (corruption reaches
  self-test decode, NEVER reaches the io.Writer)
- TestEncodeSnapshotRequiresInputRoot
- TestEncodeInfoSidecarPath

cmd/elastickv-snapshot-encode/main_test.go:
- TestCLIRejectsMissingManifest
- TestCLIRejectsUnknownAdapter
- TestCLIRejectsLowerLastCommitTSOverride
- TestCLIAcceptsEqualAndHigherLastCommitTSOverride (sub-tests for
  equal + higher)
- TestCLIEncodeInfoPathDerivedFromOutput
- TestCLIEncodeInfoTwoFilesNoCollision
- TestCLIRoundTripSelfTestAllAdapters
- TestCLISelfTestFailureLeavesNoFsmAtOutputPath
- TestParseLastCommitTS

All green (race + no cache). golangci-lint clean.

## Caller audit per CLAUDE.md semantic-change rule

- Exclusions struct gained a field. Existing callers either:
  (a) build the struct via field-tagged literals (decoder CLI's
      emitManifest — updated to populate the new field), or
  (b) read it (encoder's buildSelfTestDecodeOptions — new code).
  No silent semantic change for any pre-existing caller.

- DecodeOptions.RenameS3Collisions was already a public field used by
  the decoder; the encoder now also reads it via the manifest. No
  caller-side change needed.

## Self-review

1. Data loss: write-then-rename atomic publish — self-test failure
   never publishes the .fsm. Mismatch txt + corrupt buffer never
   reach the io.Writer (pinned by
   TestEncodeSnapshotSelfTestDetectsCorruption asserting out.Len()==0).
2. Concurrency: pure offline. The CLI is a single-shot binary; library
   entrypoint takes an io.Writer the caller owns.
3. Performance: SelfTest=false streams with one sha256 tee, no extra
   allocations. SelfTest=true allocates one FSM-sized buffer plus
   the scratch decode tree (documented memory cost).
4. Data consistency: --last-commit-ts T < manifest is fail-closed with
   typed error; self-test threads MANIFEST DecodeOptions
   (Exclusions.* + DynamoDBLayout -> DynamoDBBundleJSONL) so trees
   produced with non-default decoder flags round-trip cleanly.
5. Test coverage: 12 new tests cover library entrypoint, CLI flag
   parsing, atomic publish discipline, sidecar path-derivation,
   corruption detection, forward-compat for legacy manifests.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a snapshot-encode CLI and EncodeSnapshot library with optional on-disk self-test and deterministic structural diffs, ENCODE_INFO.json sidecar schema and helpers, atomic temp→publish flow with randomized temp names, and a manifest Exclusions flag rename_s3_collisions that decode emits.

Changes

Snapshot Encoding Pipeline

Layer / File(s) Summary
Manifest exclusions and decode integration
internal/backup/manifest.go, cmd/elastickv-snapshot-decode/main.go
Manifest Exclusions adds optional rename_s3_collisions (RenameS3Collisions) with omitempty and backward-compatible zero-value; decode CLI includes this flag when emitting MANIFEST.json.
ENCODE_INFO.json schema and tests
internal/backup/encode_info.go, internal/backup/encode_info_test.go
Defines ENCODE_INFO.json format version constant, typed unsupported-version error, EncodeInfo and EncodeInfoSelfTest structs, New/Read/Write helpers, sidecar path derivation, and tests for round-trip and unknown-format rejection.
Open sidecar export and Unix permission enforcement
internal/backup/open_sidecar_export.go, internal/backup/open_nofollow_unix.go, internal/backup/open_nofollow_unix_test.go
Exports OpenSidecarFile, enforces owner-only 0o600 permissions after truncation, and adds a Unix-only test verifying tightened permissions.
S3 encoder strictness and tests
internal/backup/encode_s3.go, internal/backup/encode_s3_test.go
S3 encoder now fails closed on unexpected non-directory top-level s3/ entries while skipping reserved _-prefixed entries; tests added for both behaviors.
EncodeSnapshot library with optional self-test
internal/backup/encode_snapshot.go, internal/backup/encode_snapshot_test.go
Adds EncodeSnapshot API, EncodeOptions/EncodeResult types, sentinel errors, SHA-256 tee, adapter-runner wiring (redis→dynamodb→s3→sqs), redis multi-DB guards, buffered self-test path that decodes and performs bounded deterministic streamed diffs, and extensive library tests validating round-trip, self-test idempotency/corruption detection, input/manifest guards, and adapter-data error mapping.
elastickv-snapshot-encode CLI with atomic publishing
cmd/elastickv-snapshot-encode/main.go, cmd/elastickv-snapshot-encode/main_test.go
CLI parses required flags, reads/validates MANIFEST.json, enforces ceiling-only last-commit override, builds EncodeOptions, writes restrictive-permissions temp .fsm, optionally runs self-test (writes <output>.mismatch.txt on mismatch), atomically renames temp→final, fsyncs parent dir, writes <output>.encode_info.json sidecar with safe open/write/fsync semantics, and maps data-correctness sentinel errors to a distinct exit code. Tests cover error mappings, sidecar contracts, self-test flows, and permission/security checks.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CLI as elastickv-snapshot-encode
  participant Lib as EncodeSnapshot
  participant FS as Filesystem
  participant Decode as DecodeSnapshot

  User->>CLI: run encode (--input, --output, --adapter, --last-commit-ts, --self-test)
  CLI->>FS: read <input>/MANIFEST.json
  CLI->>Lib: build EncodeOptions + run EncodeSnapshot
  Lib->>FS: read adapter input trees (redis/dynamodb/s3/sqs)
  Lib->>Decode: when SelfTest=true, write temp .fsm -> DecodeSnapshot (self-test)
  Decode->>Lib: return match/mismatch + decoded tree
  Lib->>FS: on success rename temp .fsm -> <output> and fsync parent dir
  CLI->>FS: write <output>.encode_info.json sidecar (safe open/write/fsync)
  CLI->>User: exit code (0 or classified non-zero)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • bootjp/elastickv#713: Related to sidecar-safe open implementations and Unix permission handling used/exposed here.
  • bootjp/elastickv#828: Related Redis encoder changes that integrate with EncodeSnapshot adapter wiring and Redis handling.
  • bootjp/elastickv#810: Related to decode manifest emission and the rename-collisions flag that this PR emits.

Poem

🐰 I hopped through manifests and stitched a sidecar seam,

I wrote temp .fsm files safe and kept the publish clean,
I ran a tiny self-test to sniff a sneaky glitch,
Left a JSON note with hashes—tidy, sealed, and stitched,
Then hopped away content into the backup stream.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main change: implementation of Phase 0b M6 encoder (cmd/elastickv-snapshot-encode) plus supporting library components.
Docstring Coverage ✅ Passed Docstring coverage is 89.74% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch backup/m6-cli-design

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 59s —— View job


Code Review - PR #904

Tasks

  • Gather context and understand the request
  • Read changed files
  • Review library code (internal/backup/)
  • Review CLI code (cmd/elastickv-snapshot-encode/)
  • Review tests
  • Post final review

Summary

This is a solid, well-structured implementation. The atomic publish discipline, self-test detection, and HLC ceiling fail-closed logic are all correctly implemented. A few issues worth addressing before merge:


Issues

1. fmt.Sscanf for hex parsing doesn't consume the entire input — partial parse silently succeeds

cmd/elastickv-snapshot-encode/main.go:156

if _, err := fmt.Sscanf(s[2:], "%x", &ts); err != nil {

fmt.Sscanf stops at the first non-hex character and returns n=1, err=nil — so "0xffZZ" parses as 0xff with no error. This means a user typo like --last-commit-ts 0x1234abc_garbage would silently accept a wrong value (and pass the >= manifest check if it happens to be large enough).

Replace with strconv.ParseUint(s[2:], 16, 64) which rejects partial parses. The same fix applies to the decimal branch: fmt.Sscanf(s, "%d", &ts) with a positive decimal will stop before a sign or non-digit; strconv.ParseUint(s, 10, 64) is stricter.

Fix this →


2. Stale comment: EncodeResult.SHA256 says "via SHA256Hex" but no such function exists

internal/backup/encode_snapshot.go:85

// SHA256 of the produced .fsm (lowercase hex via SHA256Hex).
SHA256 [32]byte

The field is [32]byte, not a string. The hex encoding happens later at the CLI level via hex.EncodeToString(result.SHA256[:]) in writeSidecar. The reference to SHA256Hex is a phantom; there is no such function in the package. The comment should read "raw SHA-256 digest; hex-encode with hex.EncodeToString(r.SHA256[:]) for display."

Fix this →


3. uitoa / itoa reinvent strconv — misleading for future readers

internal/backup/encode_snapshot.go:402-422

The hand-rolled decimal converters are correct but strconv.FormatUint and strconv.Itoa do exactly the same thing and are already linked into every Go binary. The only benefit here is avoiding an import, but strconv is already transitively present. Keeping custom implementations means a future reader may not recognise them as stdlib equivalents and will spend time auditing them.

Minor, but worth dropping.


4. diffOneSubdir mismatch output is non-deterministic for "bytes differ" entries

internal/backup/encode_snapshot.go:328-350

The loop over bFiles (a map) produces "bytes differ" entries in map-iteration order, which is randomised per-run. The "missing in input" entries also come from map iteration. Only "missing in scratch" entries (the remaining slice) are sorted.

Consequence: mismatch.txt can look different across two runs with the same inputs, making it harder to compare mismatch reports or write tests that check the txt content. Sorting diffs before returning (or collecting into a slice and sorting) would make this deterministic.

Fix this →


5. EncodeSnapshot returns (result, nil) on self-test mismatch — callers must check two things

internal/backup/encode_snapshot.go:181-183

if !matched {
    return result, nil  // no error, but SelfTestMatched = false
}

When SelfTest=true and the self-test fails, EncodeSnapshot returns (result, nil) — a nil error. Callers who only check err != nil will silently proceed as if the encode succeeded. The CLI correctly checks result.SelfTestMatched, but any future library caller who patterns-matches just on the error will miss the failure.

Returning a typed error (e.g., errSelfTestMismatch promoted from the CLI into the library) on mismatch, or at minimum documenting the contract prominently on EncodeSnapshot's godoc, would make this safer. The current behaviour is already documented in EncodeResult.SelfTestMatched, but the "no error on mismatch" choice is surprising enough to warrant a sentence in the EncodeSnapshot signature comment.


Observations (no fix required, but worth knowing)

6. writeSidecar is not atomic with respect to the .fsm rename

If the process is killed after os.Rename(tempPath, cfg.outputPath) but before writeSidecar completes, the published .fsm will exist without a sidecar. This is probably fine — the .fsm itself is the critical restore artifact and the sidecar is operator-metadata — but it means restore tooling should tolerate a missing sidecar gracefully. Worth a note in the design doc's "failure modes" section if it isn't already there.

7. walkRegularFiles slurps entire file contents into memory

internal/backup/encode_snapshot.go:359-392

For the self-test diff, every regular file in every enabled adapter subtree is read fully into a map. The comment already acknowledges this is bounded by test fixtures and defers streaming compare. Fine for now, but worth flagging if the first production-scale encode triggers an OOM.

8. var _ = strconv.FormatUint in main_test.go

cmd/elastickv-snapshot-encode/main_test.go:362

This is an unusual pattern to keep an import alive. If strconv has no active use in the test file, the import and this line can be removed. If it's an intentional placeholder for future tests, a short comment explaining why would help.

9. sha256w = hashSHA256 type alias chain

internal/backup/encode_snapshot.go:432-438

The alias sha256w = hashSHA256 (where hashSHA256 is an interface) is confusing — two names for the same type, neither of which is hash.Hash. The comment says it avoids importing hash, but crypto/sha256 already transitively imports hash, so the import would add no size cost. Using hash.Hash directly would be clearer.


Strengths

  • Write-then-rename atomic publish is correctly implemented and well-tested (TestCLISelfTestFailureLeavesNoFsmAtOutputPath, TestEncodeSnapshotSelfTestDetectsCorruption).
  • corruptBufferForTest hook cleanly separates the test injection path from production code without build tags.
  • exclusionsRequiredFields intentionally excludes rename_s3_collisions, with a clear comment explaining why — good forward-compat discipline.
  • EncodeInfoSidecarPath path-derivation (not a static name) correctly avoids sibling collisions; both the library and CLI tests pin this.
  • HLC ceiling fail-closed (ErrSelfTestLowerLastCommitTS) is a typed error that maps cleanly to exit code 2 — matches the decoder's exit-code contract.
  • buildSelfTestDecodeOptions threads Exclusions.* and DynamoDBBundleJSONL through from the manifest — ensures the self-test's scratch tree uses the same decode flags as the original run.
    | Branch

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the elastickv-snapshot-encode command-line tool and library, which encodes vendor-independent directory trees into native .fsm files, generates an encode_info.json sidecar, and supports an optional round-trip self-test. The review feedback highlights several critical memory-efficiency improvements to prevent Out Of Memory (OOM) crashes on large production snapshots. Key recommendations include writing the .fsm to a temporary file on disk instead of buffering it in memory during the self-test, streaming the file comparison chunk-by-chunk rather than reading all file bytes into memory, and increasing the entropy of the temporary file suffix to avoid collisions.

Comment on lines +157 to +188
func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, out io.Writer) (EncodeResult, error) {
var buf bytes.Buffer
bytesWritten, err := b.WriteTo(&buf)
if err != nil {
return EncodeResult{}, errors.WithStack(err)
}
bufBytes := buf.Bytes()
if opts.corruptBufferForTest != nil {
opts.corruptBufferForTest(bufBytes)
}

header, mismatchTxt, matched, stErr := runSelfTest(bufBytes, opts)
result := EncodeResult{
Header: header,
BytesWritten: bytesWritten,
SHA256: sha256.Sum256(bufBytes),
SelfTestRan: true,
SelfTestMatched: matched,
SelfTestMismatchTxt: mismatchTxt,
AdaptersEnabled: enabled,
}
if stErr != nil {
return result, stErr
}
if !matched {
return result, nil
}
if _, err := io.Copy(out, bytes.NewReader(bufBytes)); err != nil {
return result, errors.Wrap(err, "copy buffered fsm to out")
}
return result, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Buffering the entire .fsm file in memory (bytes.Buffer) during the self-test can lead to Out Of Memory (OOM) crashes on large production snapshots. Instead, write the .fsm directly to a temporary file on disk, compute the SHA256 hash in a single pass using the existing sha256Writer, and stream-read from the temporary file for the self-test and copying.

func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, out io.Writer) (EncodeResult, error) {
	tempFile, err := os.CreateTemp(opts.SelfTestDecodeOptions.OutRoot, "encode-buffer-")
	if err != nil {
		return EncodeResult{}, errors.WithStack(err)
	}
	defer func() {
		_ = tempFile.Close()
		_ = os.Remove(tempFile.Name())
	}()

	hashWriter := newSHA256Writer(tempFile)
	bytesWritten, err := b.WriteTo(hashWriter)
	if err != nil {
		return EncodeResult{}, errors.WithStack(err)
	}

	if opts.corruptBufferForTest != nil {
		if _, err := tempFile.Seek(0, io.SeekStart); err != nil {
			return EncodeResult{}, errors.WithStack(err)
		}
		bufBytes, err := io.ReadAll(tempFile)
		if err != nil {
			return EncodeResult{}, errors.WithStack(err)
		}
		opts.corruptBufferForTest(bufBytes)
		if _, err := tempFile.WriteAt(bufBytes, 0); err != nil {
			return EncodeResult{}, errors.WithStack(err)
		}
	}

	header, mismatchTxt, matched, stErr := runSelfTest(tempFile, opts)
	result := EncodeResult{
		Header:              header,
		BytesWritten:        bytesWritten,
		SHA256:              hashWriter.Sum(),
		SelfTestRan:         true,
		SelfTestMatched:     matched,
		SelfTestMismatchTxt: mismatchTxt,
		AdaptersEnabled:     enabled,
	}
	if stErr != nil {
		return result, stErr
	}
	if !matched {
		return result, nil
	}
	if _, err := tempFile.Seek(0, io.SeekStart); err != nil {
		return result, errors.Wrap(err, "seek temp file for copy")
	}
	if _, err := io.Copy(out, tempFile); err != nil {
		return result, errors.Wrap(err, "copy buffered fsm to out")
	}
	return result, nil
}

Comment thread internal/backup/encode_snapshot.go Outdated
Comment on lines +240 to +275
func runSelfTest(fsmBytes []byte, opts EncodeOptions) (SnapshotHeader, []byte, bool, error) {
scratchBase := opts.SelfTestDecodeOptions.OutRoot
scratchDir, err := os.MkdirTemp(scratchBase, "encode-self-test-")
if err != nil {
return SnapshotHeader{}, nil, false, errors.Wrap(err, "mkdir scratch")
}
defer func() {
_ = os.RemoveAll(scratchDir)
}()

decOpts := opts.SelfTestDecodeOptions
decOpts.OutRoot = scratchDir

result, derr := DecodeSnapshot(bytes.NewReader(fsmBytes), decOpts)
if derr != nil {
// Decoder errored on our own output — that IS a self-test
// failure (the .fsm we produced isn't loadable). Surface as
// a mismatch with the decoder error embedded in the txt.
mismatchTxt := []byte("self-test failed: DecodeSnapshot rejected the produced .fsm: " + derr.Error())
return SnapshotHeader{}, mismatchTxt, false, nil
}

if result.Header.LastCommitTS != opts.LastCommitTS {
mismatchTxt := []byte(formatHeaderMismatch(opts.LastCommitTS, result.Header.LastCommitTS))
return result.Header, mismatchTxt, false, nil
}

diff, derr := diffAdapterTrees(opts.InputRoot, scratchDir, opts.Adapters)
if derr != nil {
return result.Header, nil, false, errors.Wrap(derr, "diff scratch tree")
}
if len(diff) > 0 {
return result.Header, []byte(strings.Join(diff, "\n") + "\n"), false, nil
}
return result.Header, nil, true, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Update runSelfTest to accept an *os.File instead of a byte slice to support streaming the .fsm from disk during the self-test, avoiding loading the entire snapshot into memory.

func runSelfTest(fsmFile *os.File, opts EncodeOptions) (SnapshotHeader, []byte, bool, error) {
	scratchBase := opts.SelfTestDecodeOptions.OutRoot
	scratchDir, err := os.MkdirTemp(scratchBase, "encode-self-test-")
	if err != nil {
		return SnapshotHeader{}, nil, false, errors.Wrap(err, "mkdir scratch")
	}
	defer func() {
		_ = os.RemoveAll(scratchDir)
	}()

	decOpts := opts.SelfTestDecodeOptions
	decOpts.OutRoot = scratchDir

	if _, err := fsmFile.Seek(0, io.SeekStart); err != nil {
		return SnapshotHeader{}, nil, false, errors.Wrap(err, "seek temp file")
	}

	result, derr := DecodeSnapshot(fsmFile, decOpts)
	if derr != nil {
		// Decoder errored on our own output — that IS a self-test
		// failure (the .fsm we produced isn't loadable). Surface as
		// a mismatch with the decoder error embedded in the txt.
		mismatchTxt := []byte("self-test failed: DecodeSnapshot rejected the produced .fsm: " + derr.Error())
		return SnapshotHeader{}, mismatchTxt, false, nil
	}

	if result.Header.LastCommitTS != opts.LastCommitTS {
		mismatchTxt := []byte(formatHeaderMismatch(opts.LastCommitTS, result.Header.LastCommitTS))
		return result.Header, mismatchTxt, false, nil
	}

	diff, derr := diffAdapterTrees(opts.InputRoot, scratchDir, opts.Adapters)
	if derr != nil {
		return result.Header, nil, false, errors.Wrap(derr, "diff scratch tree")
	}
	if len(diff) > 0 {
		return result.Header, []byte(strings.Join(diff, "\n") + "\n"), false, nil
	}
	return result.Header, nil, true, nil
}

Comment on lines +318 to +392
func diffOneSubdir(aDir, bDir, relPrefix string) ([]string, error) {
aFiles, aErr := walkRegularFiles(aDir)
if aErr != nil && !errors.Is(aErr, os.ErrNotExist) {
return nil, errors.Wrapf(aErr, "walk input %s", aDir)
}
bFiles, bErr := walkRegularFiles(bDir)
if bErr != nil && !errors.Is(bErr, os.ErrNotExist) {
return nil, errors.Wrapf(bErr, "walk scratch %s", bDir)
}

var diffs []string
aMap := map[string][]byte{}
for path, body := range aFiles {
aMap[path] = body
}
for path, bBody := range bFiles {
aBody, ok := aMap[path]
if !ok {
diffs = append(diffs, relPrefix+"/"+path+" (missing in input)")
continue
}
if !bytes.Equal(aBody, bBody) {
diffs = append(diffs, relPrefix+"/"+path+" (bytes differ)")
}
delete(aMap, path)
}
// Anything remaining in aMap is present in input but not in scratch.
remaining := make([]string, 0, len(aMap))
for path := range aMap {
remaining = append(remaining, relPrefix+"/"+path+" (missing in scratch)")
}
sort.Strings(remaining)
diffs = append(diffs, remaining...)
return diffs, nil
}

// walkRegularFiles returns a map of relative path -> file bytes for every
// regular file under root. Missing root is the empty map + os.ErrNotExist.
// Bounded by the per-adapter test fixtures the encoder runs against;
// production-scale dumps may want streaming compare, deferred until a
// real bottleneck shows up.
func walkRegularFiles(root string) (map[string][]byte, error) {
out := map[string][]byte{}
rootInfo, err := os.Stat(root)
if err != nil {
return nil, errors.WithStack(err)
}
if !rootInfo.IsDir() {
return nil, errors.Errorf("not a directory: %s", root)
}
if err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error {
if err != nil {
return err
}
if d.IsDir() {
return nil
}
if !d.Type().IsRegular() {
return nil
}
body, rerr := os.ReadFile(path) //nolint:gosec // walking a caller-provided root, regular files only
if rerr != nil {
return errors.Wrap(rerr, path)
}
rel, rerr := filepath.Rel(root, path)
if rerr != nil {
return errors.WithStack(rerr)
}
out[filepath.ToSlash(rel)] = body
return nil
}); err != nil {
return nil, errors.WithStack(err)
}
return out, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Loading the entire contents of all files in both the input and scratch directories into memory (walkRegularFiles reading all file bytes via os.ReadFile) will cause OOM crashes on large backups (e.g., with large S3 blobs). Instead, walk the directories to collect file sizes, and perform a streaming chunk-by-chunk comparison for files present in both directories.

func diffOneSubdir(aDir, bDir, relPrefix string) ([]string, error) {
	aFiles, aErr := walkRegularFiles(aDir)
	if aErr != nil && !errors.Is(aErr, os.ErrNotExist) {
		return nil, errors.Wrapf(aErr, "walk input %s", aDir)
	}
	bFiles, bErr := walkRegularFiles(bDir)
	if bErr != nil && !errors.Is(bErr, os.ErrNotExist) {
		return nil, errors.Wrapf(bErr, "walk scratch %s", bDir)
	}

	var diffs []string
	for path, aSize := range aFiles {
		bSize, ok := bFiles[path]
		if !ok {
			diffs = append(diffs, relPrefix+"/"+path+" (missing in scratch)")
			continue
		}
		if aSize != bSize {
			diffs = append(diffs, relPrefix+"/"+path+" (bytes differ)")
			continue
		}
		equal, err := compareFiles(filepath.Join(aDir, path), filepath.Join(bDir, path))
		if err != nil {
			return nil, err
		}
		if !equal {
			diffs = append(diffs, relPrefix+"/"+path+" (bytes differ)")
		}
	}
	for path := range bFiles {
		if _, ok := aFiles[path]; !ok {
			diffs = append(diffs, relPrefix+"/"+path+" (missing in input)")
		}
	}
	sort.Strings(diffs)
	return diffs, nil
}

func walkRegularFiles(root string) (map[string]int64, error) {
	out := map[string]int64{}
	rootInfo, err := os.Stat(root)
	if err != nil {
		return nil, errors.WithStack(err)
	}
	if !rootInfo.IsDir() {
		return nil, errors.Errorf("not a directory: %s", root)
	}
	if err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		if !d.Type().IsRegular() {
			return nil
		}
		info, rerr := d.Info()
		if rerr != nil {
			return errors.Wrap(rerr, path)
		}
		rel, rerr := filepath.Rel(root, path)
		if rerr != nil {
			return errors.WithStack(rerr)
		}
		out[filepath.ToSlash(rel)] = info.Size()
		return nil
	}); err != nil {
		return nil, errors.WithStack(err)
	}
	return out, nil
}

func compareFiles(pathA, pathB string) (bool, error) {
	fA, err := os.Open(pathA)
	if err != nil {
		return false, errors.WithStack(err)
	}
	defer fA.Close()

	fB, err := os.Open(pathB)
	if err != nil {
		return false, errors.WithStack(err)
	}
	defer fB.Close()

	bufA := make([]byte, 64*1024)
	bufB := make([]byte, 64*1024)
	for {
		nA, errA := io.ReadFull(fA, bufA)
		nB, errB := io.ReadFull(fB, bufB)

		if errA == io.EOF || errA == io.ErrUnexpectedEOF {
			errA = nil
		}
		if errB == io.EOF || errB == io.ErrUnexpectedEOF {
			errB = nil
		}

		if errA != nil {
			return false, errors.WithStack(errA)
		}
		if errB != nil {
			return false, errors.WithStack(errB)
		}

		if nA != nB {
			return false, nil
		}
		if nA == 0 {
			break
		}
		if !bytes.Equal(bufA[:nA], bufB[:nB]) {
			return false, nil
		}
	}
	return true, nil
}

Comment thread cmd/elastickv-snapshot-encode/main.go Outdated
Comment on lines +47 to +52
// tempSuffixHexLen is the hex-character length of the random
// suffix appended to <output>.tmp-<hex>; 8 hex chars = 4 bytes of
// entropy, which gives ~4×10^9 collision space per --output path
// (more than enough for concurrent encodes against the same path).
tempSuffixHexLen = 8
tempSuffixByteLen = tempSuffixHexLen / 2
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using only 4 bytes of entropy (8 hex characters) for the temporary file suffix has a relatively high probability of collision (due to the Birthday Paradox) in highly concurrent environments or automated test pipelines. Increasing the suffix length to 16 hex characters (8 bytes of entropy) provides a much safer collision-free space.

Suggested change
// tempSuffixHexLen is the hex-character length of the random
// suffix appended to <output>.tmp-<hex>; 8 hex chars = 4 bytes of
// entropy, which gives ~4×10^9 collision space per --output path
// (more than enough for concurrent encodes against the same path).
tempSuffixHexLen = 8
tempSuffixByteLen = tempSuffixHexLen / 2
// tempSuffixHexLen is the hex-character length of the random
// suffix appended to <output>.tmp-<hex>; 16 hex chars = 8 bytes of
// entropy, which gives a collision-free space per --output path.
tempSuffixHexLen = 16
tempSuffixByteLen = tempSuffixHexLen / 2

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1fa691fb74

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

return backup.AdapterSet{}, errors.Errorf("unknown adapter %q", name)
}
}
return set, nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject empty adapter selections

When --adapter contains only separators or whitespace (for example --adapter ' ,', or a templated value that expands to spaces), every token hits the empty case and this returns a zero AdapterSet with no error. The rest of the encoder then invokes no adapter encoders, publishes a valid header-only .fsm, and exits successfully, which can turn a bad CLI argument into an apparently successful but empty restore artifact; the decoder-side parser already rejects this zero-adapter case.

Useful? React with 👍 / 👎.

Comment thread cmd/elastickv-snapshot-encode/main.go Outdated
Comment on lines +161 to +162
if _, err := fmt.Sscanf(s, "%d", &ts); err != nil {
return 0, errors.Wrap(err, "--last-commit-ts decimal parse")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject trailing junk in timestamps

For decimal --last-commit-ts values, fmt.Sscanf with %d accepts a numeric prefix and ignores trailing bytes, so an input like --last-commit-ts 100oops is silently treated as 100. Because this flag controls the restored HLC ceiling and sidecar metadata, malformed operator input should fail instead of publishing a snapshot with a different timestamp than the argument appears to specify.

Useful? React with 👍 / 👎.

Six substantive review findings on the M6 implementation PR. Headline
fix: switched the self-test buffer from in-memory bytes.Buffer to a
disk-backed temp file so large snapshots cannot OOM the encoder.

## Gemini high #1: SelfTest=true was buffering FSM in memory

Out Of Memory risk on production-scale snapshots (gemini high #904).
Fixed by writing the FSM through a tee (sha256 + os.File.WriteAt) into
an os.CreateTemp file under opts.SelfTestDecodeOptions.OutRoot, then
Seek(0) and stream through DecodeSnapshot for the self-test. The temp
file is os.Remove'd via defer on every exit path. The contract is
unchanged: corruption reaches the self-test decode but never reaches
the io.Writer (TestEncodeSnapshotSelfTestDetectsCorruption still asserts
out.Len()==0).

Memory cost: O(1) for the SelfTest=true path now (previously was an
FSM-sized bytes.Buffer on top of the sort working set).

## Gemini high #2: runSelfTest signature

Changed from runSelfTest(fsmBytes []byte, ...) to
runSelfTest(fsmFile io.Reader, ...) so the .fsm streams through
DecodeSnapshot instead of being held as []byte. Same OOM fix at the
runSelfTest layer.

## Gemini high #3: walkRegularFiles loaded every file's bytes into memory

For large S3 blobs this OOMs the encoder. Replaced walkRegularFiles
with walkRegularFilePaths (paths only) and streamFilesEqual (chunk
compare at streamCmpBufSize = 64 KiB). Per-file compare cost is now
O(streamCmpBufSize), not O(file size).

## Gemini medium: tempSuffixHexLen entropy was 4 bytes

Birthday-paradox collision risk in highly concurrent CI environments.
Bumped from 8 hex / 4 bytes (2^32 space) to 16 hex / 8 bytes (2^64
space). Same width crypto/rand uses for cryptographic nonces.

## Claude high + codex P2: fmt.Sscanf silently accepts partial parses

--last-commit-ts 0xffZZ parsed as 0xff (fmt.Sscanf stops at first
non-hex). --last-commit-ts 100oops parsed as 100. This means a
typo could silently set the snapshot HLC ceiling to something the
operator did not type. Replaced fmt.Sscanf with strconv.ParseUint
(base 10 / 16) which rejects trailing garbage as a parse error.
New TestParseLastCommitTS cases cover: 0xffZZ, 100oops, " 100 ext",
"-1".

## Codex P2: empty adapter selection produced an empty AdapterSet

--adapter ' ,' (e.g. a templated argument that expanded to spaces)
would yield a zero AdapterSet and the encoder would publish a valid
header-only .fsm with no adapter records — a silently-empty restore
artifact. Now rejected with a clear error. Pinned by
TestParseAdapterSetRejectsEmptySelection covering " ,", ",,,",
"   ", and "," — plus a positive case (single "s3") asserting the
guard does not break legitimate single-adapter selections.

## Caller audit per CLAUDE.md semantic-change rule

- runSelfTest signature changed from ([]byte, opts) to (io.Reader, opts).
  Sole caller is encodeBuffered (same package, just updated). No
  external callers since the function is unexported.
- corruptBufferForTest hook type changed from func([]byte) to
  func(*os.File). Sole caller is the same-package test
  TestEncodeSnapshotSelfTestDetectsCorruption (updated to ReadAt/
  WriteAt the temp file). External callers cannot set this field
  (lowercase).
- streamFilesEqual replaces the bytes.Equal compare in diffOneSubdir.
  Same return contract (bool, error) so the caller (diffOneSubdir)
  is unchanged in behavior.
- parseAdapterSet now returns an error for empty-selection inputs.
  Caller is parseFlags which already propagates the error to exit-1.

Tests + lint green.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v2 (b131186): addresses all 6 review findings on the implementation PR.

Gemini high #1 (OOM on large snapshots): Self-test buffer is now an on-disk os.CreateTemp file (NOT *bytes.Buffer). FSM streams through tee → sha256 + temp file; self-test decodes from the temp file directly. Memory cost dropped from O(FSM size) to O(1). Pinned by existing TestEncodeSnapshotSelfTestDetectsCorruption (corrupt-via-WriteAt, never reaches out).

Gemini high #2 (runSelfTest signature): Changed from fsmBytes []byte to fsmFile io.Reader, completing the streaming chain.

Gemini high #3 (walkRegularFiles OOM): Replaced bytes-into-memory with walkRegularFilePaths + streamFilesEqual (64 KiB chunk compare). Per-file cost is now O(chunk size), not O(file size).

Gemini medium (tempSuffixHexLen entropy): Bumped from 8 hex / 4 bytes (2^32) to 16 hex / 8 bytes (2^64), same width crypto/rand uses for cryptographic nonces.

Claude high + codex P2 (fmt.Sscanf partial parses): Replaced with strconv.ParseUint strict parsing. Trailing garbage like 0xffZZ or 100oops now surfaces as a flag-parse error. TestParseLastCommitTS extended to cover both.

Codex P2 (empty adapter selection): --adapter ' ,' etc. now rejected. Pinned by TestParseAdapterSetRejectsEmptySelection. Without the guard a templated arg expanding to spaces would silently publish a header-only .fsm.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 3m 41s —— View job


Code Review - PR #904 (v2)

Tasks

  • Gather context and understand the request
  • Read changed files (v2 commit b131186)
  • Review library code (internal/backup/)
  • Review CLI code (cmd/elastickv-snapshot-encode/)
  • Review tests
  • Post final review

v2 Verification — All 6 Required Fixes Confirmed

Finding Status
Gemini high #1: OOM — self-test buffer in *bytes.Buffer encodeBuffered now uses os.CreateTemp, tees through teeWriter to hash+file, self-test streams from *os.File
Gemini high #2: runSelfTest(fsmBytes []byte) → streaming ✅ Signature is now runSelfTest(fsmFile io.Reader, opts EncodeOptions)
Gemini high #3: walkRegularFiles loading all bytes into memory walkRegularFilePaths returns path→abspath map; streamFilesEqual does 64 KiB chunk compare via streamReadersEqual
Gemini medium: tempSuffixHexLen low entropy (4 bytes) ✅ Bumped to 16 hex chars / 8 bytes (2^64 space)
Claude high + codex P2: fmt.Sscanf partial parses parseLastCommitTS uses strconv.ParseUint with strict rejection; TestParseLastCommitTS covers "0xffZZ" and "100oops"
Codex P2: empty adapter selection silently publishes header-only .fsm parseAdapterSet rejects zero-adapter result; TestParseAdapterSetRejectsEmptySelection covers " ,", ",,,"

One New Issue Introduced by v2

Stale doc comments still say *bytes.Buffer after the disk-backed rewrite

internal/backup/encode_snapshot.go:52-57 (the SelfTest field comment):

// SelfTest enables the round-trip self-test. EncodeSnapshot buffers
// the FSM in *bytes.Buffer, decodes from the buffer, and copies to
// the caller's io.Writer only if the buffer survives DecodeSnapshot

internal/backup/encode_snapshot.go:113-116 (the EncodeSnapshot function doc):

// When opts.SelfTest=true an internal *bytes.Buffer holds the
// FSM during the self-test; bytes are copied to out only after the
// self-test matches. Memory cost in self-test mode is one FSM-sized
// allocation on top of the existing sort working set.

Both still describe the v1 in-memory approach. The actual v2 implementation writes to an os.CreateTemp file (encode-self-test-fsm-*) and the memory cost is O(1) (not O(FSM size)). A future reader looking at these comments to understand the memory model will get the wrong answer.

Fix this →


Carry-over Observations (non-blocking, noted in v1)

These were marked "no fix required" in the v1 review and remain unchanged:

  • EncodeResult.SHA256 comment (encode_snapshot.go:91) still says "lowercase hex via SHA256Hex" — no such function exists; the hex encoding happens in the CLI's writeSidecar.
  • diffOneSubdir non-determinism — the diffs slice for "bytes differ" and "missing in input" entries is built in map-iteration order; only remaining is sorted. mismatch.txt output can vary across runs with identical inputs.
  • EncodeSnapshot returns (result, nil) on self-test mismatch — callers must check result.SelfTestMatched AND err != nil; callers who only check the error will silently pass through a mismatch. The godoc for EncodeSnapshot doesn't call out this dual-check contract.
  • sha256w = hashSHA256 alias chain — two interface names for the same narrow surface; hash.Hash (already transitively imported) would be clearer in sha256Writer.h.
  • uitoa/itoa hand-rolledstrconv.FormatUint/strconv.Itoa are exact equivalents.

Strengths in v2

The disk-backed self-test path is cleanly implemented:

  • teeWriter correctly hashes bytes AS they stream to the temp file — EncodeResult.SHA256 reflects the encoder's actual output, not any test corruption.
  • The corruption hook fires AFTER Sync() but BEFORE the seek-to-0, so the self-test decoder always reads the (possibly corrupted) on-disk bytes — the write-then-rename invariant is correctly preserved.
  • streamReadersEqual correctly short-circuits on aSize != bSize before opening files, and the io.ReadFull loop handles the last-partial-chunk case (io.ErrUnexpectedEOF) before checking aErr != nil.
  • TestEncodeSnapshotSelfTestDetectsCorruption correctly asserts out.Len()==0 on mismatch — the critical publish-nothing invariant.
  • TestParseAdapterSetRejectsEmptySelection covers the " ,", ",", ",,," cases that would otherwise produce a zero AdapterSet.

Verdict: Ready to merge once the two stale *bytes.Buffer comments are updated. All 6 previously required fixes are correctly implemented and tested.
| Branch

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b13118638f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +130 to +131
b := newSnapshotBuilder(opts.LastCommitTS)
enabled, err := runAdapterEncoders(b, opts)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce the manifest timestamp floor in EncodeSnapshot

When EncodeSnapshot is called directly (the new library entrypoint the design exposes for future callers), a caller can pass LastCommitTS lower than InputRoot/MANIFEST.json:last_commit_ts; this path immediately builds with that value and never reads the manifest or returns the advertised ErrSelfTestLowerLastCommitTS. The CLI pre-check catches its own invocations, but in-process callers can still emit an .fsm whose header and MVCC key timestamps lower the restored HLC ceiling, violating the fail-closed invariant this wrapper is meant to enforce.

Useful? React with 👍 / 👎.

…over cleanups)

Mandatory (claude v2):
- Stale *bytes.Buffer comments in SelfTest field doc and EncodeSnapshot
  godoc updated to describe the v2 on-disk temp file model with O(1)
  memory cost (encode-self-test-fsm-* under SelfTestDecodeOptions.OutRoot).

Non-blocking carry-over items from v1 + v2 (folded in since trivial):
- EncodeResult.SHA256 comment no longer references a non-existent
  SHA256Hex function; says "raw 32-byte digest; CLI hex-encodes via
  encoding/hex in writeSidecar".
- EncodeSnapshot godoc now spells out the dual-check contract
  ("Callers MUST check result.SelfTestMatched before treating a nil
  error as success") since self-test mismatch returns (result, nil).
- diffOneSubdir output is now fully sorted (was: only the "remaining"
  segment sorted; map-iteration order leaked into the rest). mismatch.txt
  is deterministic across runs with identical inputs.
- Replaced hand-rolled uitoa/itoa with strconv.FormatUint and
  strconv.Itoa. Dropped the sha256w / hashSHA256 alias chain; the
  sha256Writer.h field now uses hash.Hash directly (already imported).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v3 (b8da884): addresses claude v2 review.

Mandatory (the v2 blocker): Two stale *bytes.Buffer doc comments updated to describe the on-disk temp file model (encode-self-test-fsm-*) with O(1) memory cost.

Non-blocking carry-overs folded in since trivial:

  • EncodeResult.SHA256 comment no longer references non-existent SHA256Hex (now: "raw 32-byte digest; CLI hex-encodes in writeSidecar")
  • EncodeSnapshot godoc spells out the dual-check contract for self-test mismatch (returns (result, nil) — callers MUST check SelfTestMatched)
  • diffOneSubdir fully sorts the output (was: only "remaining" segment sorted); mismatch.txt is now deterministic
  • Hand-rolled uitoa/itoa replaced with strconv.FormatUint / strconv.Itoa
  • Dropped the sha256w / hashSHA256 alias chain; sha256Writer.h now uses hash.Hash directly

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 3m 33s —— View job


Code Review - PR #904 (v3)

Tasks

  • Gather context and understand the request
  • Read changed files (v3 commit b8da884)
  • Verify all v2 carry-over fixes are implemented correctly
  • Check for any new issues introduced in v3
  • Post final review

v3 Verification — All v2 Carry-overs Confirmed

Finding Status
Mandatory v2 fix: stale *bytes.Buffer doc in SelfTest field comment ✅ Now says "on-disk temp file under SelfTestDecodeOptions.OutRoot (encode-self-test-fsm-*)" and "O(1) memory cost"
Mandatory v2 fix: stale *bytes.Buffer doc in EncodeSnapshot func comment ✅ Now describes disk-based approach; explicitly credits gemini high #904 OOM fix
EncodeResult.SHA256 phantom SHA256Hex reference ✅ Now: "raw 32-byte digest; the CLI hex-encodes it via encoding/hex when writing ENCODE_INFO.json"
EncodeSnapshot dual-check contract undocumented ✅ Lines 127-131 now explicitly document: "Self-test failure returns (result, nil)… Callers MUST check result.SelfTestMatched before treating a nil error as success"
diffOneSubdir non-deterministic output sort.Strings(diffs) on line 423 now applies to ALL diff entries (missing-in-input + bytes-differ + missing-in-scratch)
Hand-rolled uitoa/itoa formatHeaderMismatch uses strconv.FormatUint; truncation message uses strconv.Itoa
sha256w/hashSHA256 alias chain ✅ Both sha256Writer.h (line 538) and teeWriter.h (line 244) now typed as hash.Hash directly; "hash" imported explicitly

One New Doc Bug in v3

ErrSelfTestLowerLastCommitTS and EncodeSnapshot both claim the HLC floor check lives in the library — it doesn't

internal/backup/encode_snapshot.go:17-25 (the var comment):

// ErrSelfTestLowerLastCommitTS is returned by EncodeSnapshot when the
// effective LastCommitTS in EncodeOptions is below the manifest's value.
// ...
// Surfaced at the EncodeSnapshot layer so the CLI's main exits with code 2

internal/backup/encode_snapshot.go:133-136 (the func comment):

// Returns ErrSelfTestLowerLastCommitTS when opts.LastCommitTS is below
// the manifest's value — caller is responsible for reading the manifest
// and computing the effective T (this layer just validates the floor).

Both are inaccurate. EncodeSnapshot never reads MANIFEST.json, never compares opts.LastCommitTS to the manifest value, and never returns ErrSelfTestLowerLastCommitTS. That check is performed exclusively by the CLI's resolveLastCommitTS before EncodeSnapshot is ever called:

// cmd/elastickv-snapshot-encode/main.go:349-354
func resolveLastCommitTS(cfg *config, manifestTS uint64) (uint64, bool, error) {
    if cfg.lastCommitTS < manifestTS {
        return 0, false, errors.Wrapf(backup.ErrSelfTestLowerLastCommitTS, ...)
    }
    ...
}

The functional behaviour is correct — the HLC floor IS enforced before any bytes are written. But a future library caller reading the EncodeSnapshot godoc would trust that calling EncodeSnapshot with a low LastCommitTS returns ErrSelfTestLowerLastCommitTS, and be surprised when it doesn't (the call silently succeeds and stamps the low timestamp into the .fsm).

The fix is straightforward: correct both comments to say the error is a sentinel for the caller to return after its own manifest comparison, not something EncodeSnapshot itself enforces.

Fix this →


Observations (non-blocking)

sha256Writer and teeWriter are structurally identical

Both are {w io.Writer; h hash.Hash} with identical Write bodies. teeWriter is used in encodeBuffered, sha256Writer in encodeStream. They could be unified into one type — sha256Writer already has the Sum() [32]byte convenience method; adding it would cover both paths. Not a blocker, but the duplication will grow if a third path is added.


Strengths in v3

  • diffOneSubdir's sort.Strings(diffs) at the end is now truly a full-fence sort: the loop first drains bPaths (collecting missing-in-input and bytes-differ entries), then drains remaining aPaths (collecting missing-in-scratch), then sorts the combined slice. Determinism is correct regardless of map-iteration order.
  • teeWriter.h and sha256Writer.h both correctly typed as hash.Hash, which removes the confusing sha256w/hashSHA256 indirection without adding any import cost ("hash" is already in-tree via crypto/sha256).
  • strconv.FormatUint / strconv.Itoa are drop-in replacements for the hand-rolled converters with no semantic change — the behaviour is identical and a reader no longer has to audit them.
  • The EncodeSnapshot dual-check warning ("library callers should follow the same pattern") is now in the right place: immediately after documenting the (result, nil) return on mismatch, not buried in a separate godoc block.

Verdict: One doc fix needed before merge. All functional behaviour is correct; all v2 findings are addressed. The doc bug (EncodeSnapshot + ErrSelfTestLowerLastCommitTS incorrectly attributed to the library layer) is the only remaining item.
| Branch

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8da884f9d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/elastickv-snapshot-encode/main.go Outdated
// encodeToTempFile creates tempPath, runs EncodeSnapshot into it,
// fsync+close. Caller is responsible for the os.Remove cleanup on error.
func encodeToTempFile(tempPath string, encodeOpts backup.EncodeOptions) (backup.EncodeResult, error) {
tempFile, err := os.Create(tempPath) //nolint:gosec // operator-supplied path
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Create snapshot artifacts owner-only

When the encoder runs with a typical umask like 022, os.Create creates the temporary FSM as mode 0666 masked to 0644, and that same file becomes the published <output> after the rename. Because the .fsm contains the full KV dataset, this exposes backup contents to other local users on multi-user backup hosts while the encode is running and after publish; the decoded adapter writers in internal/backup/open_nofollow_* use 0600 for this reason. Please create the temp output with an explicit owner-only mode before renaming it.

Useful? React with 👍 / 👎.

… unify writer types

Mandatory (claude v3):
ErrSelfTestLowerLastCommitTS var comment and EncodeSnapshot godoc both
claimed the HLC ceiling check is enforced in the library. It is NOT.
The check lives in the CLI's resolveLastCommitTS BEFORE EncodeSnapshot
is called. A future library caller (Phase 1 live extractor, integration
test) reading the godoc would assume EncodeSnapshot guards against a
low timestamp and be surprised when it silently stamps the low value
into the .fsm.

Both comments updated to say:
- The error is the sentinel callers should return after their OWN
  manifest comparison.
- EncodeSnapshot does NOT read MANIFEST.json or validate
  opts.LastCommitTS against any external floor.

Non-blocking observation (claude v3):
sha256Writer and teeWriter were structurally identical (both
{w io.Writer; h hash.Hash} with the same Write body). teeWriter
removed; encodeBuffered now uses newSHA256Writer for the temp-file
tee. One less type to audit.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v4 (a357279): addresses claude v3.

Mandatory: ErrSelfTestLowerLastCommitTS var comment + EncodeSnapshot godoc claimed the HLC ceiling check is enforced in the library — it isn't (the check lives in the CLI's resolveLastCommitTS BEFORE EncodeSnapshot is called). Both docs corrected to say the error is the sentinel callers should return after their OWN manifest comparison, and EncodeSnapshot does NOT read MANIFEST.json or validate against any external floor.

Non-blocking observation: Unified the duplicated sha256Writer + teeWriter types. Both had identical {w io.Writer; h hash.Hash} structure and Write bodies. teeWriter removed; encodeBuffered now uses newSHA256Writer for the temp-file tee. One less type to audit.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 35s —— View job


Code Review - PR #904 (v4)

Tasks

  • Gather context and understand the request
  • Read changed files (v4 commit a357279)
  • Verify mandatory v3 fix: ErrSelfTestLowerLastCommitTS + EncodeSnapshot doc corrections
  • Verify non-blocking fix: teeWriter removed, encodeBuffered uses newSHA256Writer
  • Check for new issues introduced in v4
  • Post final review

v4 Verification — All v3 Items Confirmed

Finding Status
Mandatory v3 fix: ErrSelfTestLowerLastCommitTS var comment ✅ Changed from "is returned by EncodeSnapshot when the effective LastCommitTS is below the manifest's value" to "is the sentinel callers should return when…"; added explicit statement that EncodeSnapshot does NOT read MANIFEST.json or enforce this floor
Mandatory v3 fix: EncodeSnapshot func comment ✅ Changed from "Returns ErrSelfTestLowerLastCommitTS when opts.LastCommitTS is below the manifest's value" to "EncodeSnapshot does NOT read MANIFEST.json and does NOT validate opts.LastCommitTS against any external floor"; correctly identifies resolveLastCommitTS (CLI only) as the check site
Non-blocking: teeWriter removed ✅ Struct + methods fully deleted; encodeBuffered now calls newSHA256Writer(tempFile) and hashTee.Sum() — semantically identical to the removed var sha [32]byte; copy(sha[:], hasher.Sum(nil)) pattern, one less type to audit

One Remaining Issue — encodeInfoFilePerm is dead code; sidecar created 0644

cmd/elastickv-snapshot-encode/main.go:56:

encodeInfoFilePerm = 0o600

This constant is defined but never used. Both file-creation sites still call os.Create, which ignores the permission argument:

// encodeToTempFile (line 323) — temp FSM, renamed to <output>
tempFile, err := os.Create(tempPath)

// writeSidecar (line 407) — ENCODE_INFO.json
f, err := os.Create(sidecarPath)

os.Create opens with mode 0666 and relies on the process umask. With a typical umask of 022 this gives 0644 — the full KV dataset is world-readable on a multi-user backup host while encoding and after publication. mismatchTxtPerm = 0o600 IS wired to os.WriteFile for mismatch.txt, but encodeInfoFilePerm is silently a no-op.

The Codex v3 review flagged the temp FSM side of this as P2. The dead constant shows the sidecar side was intended to be fixed too but wasn't applied.

Fix both call sites:

// encodeToTempFile
tempFile, err := os.OpenFile(tempPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o600)

// writeSidecar — now encodeInfoFilePerm is no longer dead
f, err := os.OpenFile(sidecarPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, encodeInfoFilePerm)

Fix this →


Strengths in v4

  • The ErrSelfTestLowerLastCommitTS doc correction is precise: the rewrite distinguishes between "what this sentinel means" (caller-defined HLC regression) and "who enforces it" (the CLI's resolveLastCommitTS, not EncodeSnapshot). Future library callers reading the godoc now get an accurate contract.
  • Unifying teeWriter into sha256Writer eliminates the only meaningful duplication left in the encoder. The hashTee.Sum() call is cleaner than the previous var sha [32]byte; copy(sha[:], hasher.Sum(nil)) pattern and leaves a single canonical tee-with-hash type in the package.
  • encodeBuffered is now visibly simpler: three lines (newSHA256Writer, WriteTo, Sum) versus five with the old hasher + teeWriter split.

Verdict: One fix before merge: os.Createos.OpenFile at both file-creation sites so the FSM and sidecar are 0600 rather than 0644, and encodeInfoFilePerm becomes live code. All v3 findings are correctly addressed and the codebase is otherwise clean.
| Branch

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a3572798b7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}

b := newSnapshotBuilder(opts.LastCommitTS)
enabled, err := runAdapterEncoders(b, opts)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject zero-adapter library encodes

When an in-process caller passes the zero AdapterSet (for example, a future Phase 1 caller forgets to thread adapter selection), runAdapterEncoders returns an empty list with no error and the subsequent write emits a valid header-only .fsm, silently discarding any adapter subtrees under InputRoot. The CLI parser now rejects empty selections, but this public library entrypoint still has the same fail-open behavior for direct callers, so it should fail before writing when no adapters are enabled.

Useful? React with 👍 / 👎.

Claude v4 mandatory finding: encodeInfoFilePerm = 0o600 was declared
but dead code. Both file-creation sites (encodeToTempFile and
writeSidecar) used os.Create, which honors umask and typically gives
0644. On a multi-user backup host, the full KV dataset (.fsm) and the
sidecar (which carries source path, cluster_id, and SHA-256) would
have been world-readable.

Replaced both os.Create calls with os.OpenFile + explicit
encodeInfoFilePerm:
- encodeToTempFile: temp .fsm is now 0o600 from creation, so the
  on-disk dataset is never world-readable during the encode-then-
  rename window.
- writeSidecar: ENCODE_INFO.json gets the same 0o600.

New TestCLIPublishesFsmAndSidecarMode0600 asserts both files have no
group/other access bits (`perm & 0o077 == 0`) after a successful run.
Skips on Windows where Unix-style mode bits aren't meaningful.

Codex v3 had also flagged the temp FSM side at P2 in passing; that's
now addressed alongside the sidecar.

Caller audit: both call sites updated in this same file; no other
os.Create calls remain in the encoder CLI. The `0o600` constant is now
live code.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v5 (0c223d2): addresses claude v4 mandatory.

Mandatory (claude v4): encodeInfoFilePerm = 0o600 was dead code; both os.Create call sites (encodeToTempFile + writeSidecar) relied on umask and gave 0644 in practice. On a multi-user backup host the .fsm + sidecar would have been world-readable.

Both replaced with os.OpenFile + encodeInfoFilePerm. The constant is now live code. New TestCLIPublishesFsmAndSidecarMode0600 asserts perm & 0o077 == 0 on both files. Skips on Windows.

Caller audit: both sites updated; no other os.Create calls remain in the encoder CLI.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 29s —— View job


Code Review - PR #904 (v5)

Tasks

  • Gather context and understand the request
  • Read changed files (v5 commit 0c223d2)
  • Verify mandatory v4 fix: os.Createos.OpenFile at both sites with 0o600
  • Verify new test TestCLIPublishesFsmAndSidecarMode0600
  • Check for any new issues introduced in v5
  • Post final review

v5 Verification — Mandatory v4 Fix Confirmed

Finding Status
Mandatory v4 fix: encodeToTempFile — temp FSM world-readable os.OpenFile(tempPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, encodeInfoFilePerm) at main.go:325
Mandatory v4 fix: writeSidecar — sidecar world-readable os.OpenFile(sidecarPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, encodeInfoFilePerm) at main.go:412
encodeInfoFilePerm = 0o600 is now live code ✅ Used at both os.OpenFile call sites; no longer dead
New test TestCLIPublishesFsmAndSidecarMode0600 ✅ Skips on Windows; asserts perm & 0o077 == 0 for both out (the .fsm) and out + ".encode_info.json"

Permission chain is correct end-to-end: the temp file is created 0o600os.Rename preserves permissions on POSIX → the published .fsm arrives at cfg.outputPath with 0o600. The test correctly checks the final path, not the transient temp path.

The library-level self-test buffer (encode_snapshot.go:194) uses os.CreateTemp, which creates files mode 0600 by default on Unix — no issue there.

No remaining os.Create calls in either encoder file that would allow umask to widen permissions.


No New Issues Introduced in v5

The change is minimal and scoped to exactly what v4 flagged: two os.Create call sites replaced with os.OpenFile and one new test.


Carry-over Observation (pre-existing, non-blocking)

EncodeSnapshot does not guard against zero AdapterSet at the library level

internal/backup/encode_snapshot.go:144 — when called directly with opts.Adapters == AdapterSet{}, runAdapterEncoders returns an empty enabled slice with no error, producing a valid header-only .fsm. The CLI parser (parseAdapterSet) guards against this, but a future in-process caller who forgets to thread adapter selection would get a silent empty-restore artifact. The codex v5 review also flagged this (P2). Not introduced by v5; not a blocker for merge, but worth tracking.


Strengths in v5

  • encodeInfoFilePerm is now the single source of truth for file permissions across both the FSM and sidecar — if the permission is ever revisited, one constant update covers both sites.
  • The Windows skip in TestCLIPublishesFsmAndSidecarMode0600 uses runtime.GOOS == "windows" (evaluated once at package init via var isWindows), which is cleaner than an in-test runtime.GOOS call inside the table.
  • The test assertion perm&0o077 != 0 is deliberately loose: it accepts 0o400 or 0o600 (e.g., a filesystem that strips the owner-write bit) while still rejecting anything with group or other bits.

Verdict: Ready to merge. The mandatory v4 fix is correctly implemented and pinned by the new test. No new issues in v5.
| Branch

… v5 + codex v5)

Carry-over flagged in both v5 reviews: EncodeSnapshot at the library
layer accepted opts.Adapters == AdapterSet{} and produced a valid
header-only .fsm with no adapter records. The CLI's parseAdapterSet
already rejects this for flag-driven entry, but a future in-process
caller (Phase 1 live extractor, integration test) that forgets to
thread the adapter set would get a silently empty restore artifact.

Fail-closed guard added at the top of EncodeSnapshot alongside the
existing InputRoot / out-nil checks. Pinned by
TestEncodeSnapshotRejectsZeroAdapterSet which also asserts no bytes
are written to out on rejection.

Claude v5 explicitly said the existing fix was "Ready to merge" and
this was non-blocking; folding in for completeness since the guard
is a 4-line addition and the test is independent.

Caller audit per CLAUDE.md semantic-change rule: EncodeSnapshot's
new pre-condition narrows the accepted input domain. All existing
callers either (a) pass through the CLI's parseAdapterSet (which
already rejects empty selections) or (b) are same-package tests
that pass an explicit AdapterSet{...} literal. No legitimate caller
is impacted.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v6 (b56c0e6): folded in the carry-over non-blocking guard.

Claude v5 already declared "Ready to merge" but flagged the library-level zero-AdapterSet gap (also raised by codex v5). Added the guard at the top of EncodeSnapshot alongside the existing InputRoot / out-nil checks. Pinned by TestEncodeSnapshotRejectsZeroAdapterSet (asserts error AND no bytes written to out).

Caller audit: the new pre-condition narrows accepted input. All existing callers pass through the CLI's parseAdapterSet (already rejects empty) or are same-package tests with explicit literals. No legitimate caller impacted.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 1m 30s —— View job


Code Review - PR #904 (v6)

Tasks

  • Gather context and understand the request
  • Read v6 changes (internal/backup/encode_snapshot.go, encode_snapshot_test.go)
  • Verify zero-AdapterSet guard implementation
  • Verify TestEncodeSnapshotRejectsZeroAdapterSet correctness
  • Check for any new issues introduced in v6
  • Post final review

v6 Verification — Zero-AdapterSet Guard Confirmed

The change is exactly scoped to what was described: 10 lines to encode_snapshot.go (the guard) and 22 lines to encode_snapshot_test.go (the test). No other files touched.

Item Status
Guard placed before newSnapshotBuilder (no bytes written before rejection) ✅ Lines 151-160, after out == nil check
Condition covers all four current adapter fields !DynamoDB && !S3 && !Redis && !SQS
Returns EncodeResult{} + non-nil error (zero bytes)
Error message matches package prefix convention "backup: EncodeOptions.Adapters has no enabled adapter"
Comment explains WHY and references existing CLI guard ✅ Credits parseAdapterSet + codex v5 + claude v5 #904
TestEncodeSnapshotRejectsZeroAdapterSet uses explicit AdapterSet{}
Test asserts err != nil (fail-closed) t.Fatalf on success
Test asserts buf.Len() == 0 (no bytes on rejection) ✅ The second invariant, matches prior TestEncodeSnapshotSelfTestDetectsCorruption pattern

No New Issues Introduced in v6

The guard correctly fires before any snapshotBuilder state is created or any bytes flow to out. All five v5-verified fixes remain intact (the diff is purely additive: the guard block + its test).

One observation carries forward as genuinely non-blocking: if AdapterSet ever gains a fifth adapter field, this guard condition would need updating alongside adapterRunners() and enabledAdapterSubdirs(). Since all three sites are co-located in the same file and the compiler won't help, a future author adding a new adapter should search for these. Not an issue for this PR.


Verdict: Ready to merge. v6 closes the only remaining carry-over from v5 cleanly, with no regressions and no new issues.

| Branch |

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b56c0e65ae

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/elastickv-snapshot-encode/main.go Outdated
Comment on lines +244 to +246
result, err := writeAndPublish(cfg, encodeOpts, mismatchPath, logger)
if err != nil {
return err
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Write encode info before returning self-test mismatches

When --self-test produces a real mismatch, writeAndPublish returns errSelfTestMismatch here; this early return skips the writeSidecar call below, so <output>.encode_info.json is never written with self_test.matched=false. The design says mismatch exits should leave both mismatch.txt and provenance, and without the sidecar operators lose the generated SHA/T/adapters needed to diagnose a failed self-test.

Useful? React with 👍 / 👎.

User flagged that I had missed codex reviews on v2 and v6. Re-fetched
all bot reviews; two P2 items remained unaddressed.

## Codex P2 v2: enforce manifest TS floor in EncodeSnapshot

The CLI's resolveLastCommitTS already rejects --last-commit-ts T <
manifest.last_commit_ts, but the library-level EncodeSnapshot
accepted ANY opts.LastCommitTS, including one below the manifest's
floor. A future in-process caller (Phase 1 live extractor,
integration tests) that bypasses the CLI could silently lower the
restored HLC ceiling, letting a post-restart leader re-issue a read
ts <= a restored row's commit ts.

Earlier v4 fix was doc-only (godoc said "EncodeSnapshot does NOT
read the manifest"). That's accurate but pushes the responsibility
to callers. Codex's suggestion to enforce in the library is the
fail-closed answer.

Fix: EncodeOptions gains ManifestLastCommitTS uint64. EncodeSnapshot
fails-closed with ErrSelfTestLowerLastCommitTS when LastCommitTS <
ManifestLastCommitTS (both > 0 — synthetic test fixtures opt out by
leaving ManifestLastCommitTS at 0). CLI's buildEncodeOptions now
threads manifest.LastCommitTS into the field.

Pinned by TestEncodeSnapshotRejectsLowManifestFloor (rejection +
no bytes written) and TestEncodeSnapshotManifestFloorOptOut
(opt-out path still works).

## Codex P2 v6: write encode_info.json before returning self-test mismatches

The old writeAndPublish returned errSelfTestMismatch on mismatch,
which skipped the writeSidecar call. The design says mismatch should
leave BOTH <output>.mismatch.txt AND <output>.encode_info.json
(with self_test.matched=false) for diagnostics. Operators need the
sidecar's SHA256, effective T, adapters_enabled list to triage a
failed self-test.

Fix: encodeOne now writes the sidecar whenever the encode itself
ran (publishErr == nil || errors.Is(publishErr, errSelfTestMismatch)),
and only after that returns the original error. Stale sidecar
cleanup at the start of every run (was: only mismatch.txt was
cleaned).

## Caller audit per CLAUDE.md semantic-change rule

- EncodeSnapshot gained a pre-condition (ManifestLastCommitTS floor).
  All in-tree callers either thread manifest.LastCommitTS through the
  new field (CLI) or use ManifestLastCommitTS=0 (existing tests). No
  legitimate caller is impacted.
- encodeOne's control flow changed: sidecar is now written EVEN on
  mismatch. The mismatch error still propagates; downstream callers
  (run() in main, tests) see the same error contract. Sole consumer
  is run() which maps errSelfTestMismatch to exitDataErr.

Tests + lint green.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

v7 (9f69295): addresses two missed codex P2 reviews.

ユーザー指摘の通り、v2 と v6 の codex review を見落としていました。

Codex P2 v2 (manifest TS floor in EncodeSnapshot): Library-level enforcement was missing. v4 only fixed the godoc to say "EncodeSnapshot does NOT enforce" — accurate but pushed responsibility to callers. Now EncodeOptions.ManifestLastCommitTS (opt-in via non-zero) makes EncodeSnapshot fail-closed when LastCommitTS < ManifestLastCommitTS. CLI threads manifest.LastCommitTS automatically. Pinned by TestEncodeSnapshotRejectsLowManifestFloor + TestEncodeSnapshotManifestFloorOptOut.

Codex P2 v6 (sidecar on self-test mismatch): writeAndPublish returned errSelfTestMismatch which skipped writeSidecar. Design says mismatch should leave BOTH mismatch.txt AND encode_info.json (with matched:false) for diagnostics. encodeOne now writes the sidecar whenever the encode actually ran (publishErr == nil || errors.Is(publishErr, errSelfTestMismatch)), then propagates the original error.

Caller audit: EncodeSnapshot gained a pre-condition (zero-AdapterSet was added earlier, manifest floor is the new one). All in-tree callers either thread the manifest value (CLI) or use opt-out (ManifestLastCommitTS=0 for synthetic test fixtures). No legitimate caller impacted.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude finished @bootjp's task in 3m 59s —— View job


Code Review - PR #904 (v7)

Tasks

  • Gather context and understand the request
  • Read changed files (v7 commit 9f69295)
  • Verify Fix 1: Codex P2 v2 — library-level manifest TS floor (ManifestLastCommitTS)
  • Verify Fix 2: Codex P2 v6 — sidecar written on self-test mismatch
  • Check for any new issues introduced in v7
  • Post final review

v7 Verification — Both Codex P2 Fixes Confirmed

Finding Status
Codex P2 v2: library-level manifest TS floor missing validateEncodeOptions enforces LastCommitTS >= ManifestLastCommitTS when ManifestLastCommitTS > 0; CLI threads manifest.LastCommitTS via buildEncodeOptions
Codex P2 v6: sidecar skipped on errSelfTestMismatch encodeOne now calls writeSidecar when publishErr == nil || errors.Is(publishErr, errSelfTestMismatch); sidecar written with matched: false on mismatch
Stale sidecar cleanup on re-run _ = os.Remove(backup.EncodeInfoSidecarPath(cfg.outputPath)) before encode ensures stale prior sidecar is removed
New tests for P2 v2 TestEncodeSnapshotRejectsLowManifestFloor + TestEncodeSnapshotManifestFloorOptOut
validateEncodeOptions extracted as helper ✅ Keeps EncodeSnapshot under cyclop threshold; logical grouping with comment

The dual-enforcement analysis: resolveLastCommitTS in the CLI pre-screens (returns error before buildEncodeOptions is called), so the effectiveTS passed to EncodeSnapshot satisfies effectiveTS >= manifest.LastCommitTS. With ManifestLastCommitTS = manifest.LastCommitTS, validateEncodeOptions's < check is satisfied. No double-fire at CLI entry; the library check is exclusively defense-in-depth for direct callers.


One New Issue in v7

ErrSelfTestLowerLastCommitTS var comment contradicts the implementation it lives above

internal/backup/encode_snapshot.go:23-28:

// EncodeSnapshot itself does NOT read MANIFEST.json or enforce this
// floor — the comparison happens in the CLI's resolveLastCommitTS
// BEFORE EncodeSnapshot is called, and a future library caller (Phase 1
// live extractor, integration tests) must perform its own comparison
// and return this sentinel on regression so callers can errors.Is on
// it to map to the right exit code (claude v3 doc bug #904).

This was accurate after v4's doc fix, but v7 changed the contract: validateEncodeOptions (called from EncodeSnapshot) now DOES enforce the floor when ManifestLastCommitTS > 0, and EncodeSnapshot itself now RETURNS this error. Three things in the same file now contradict this var-level comment:

  1. The EncodeSnapshot function doc at line 148-156 correctly says "it WILL enforce a floor on opts.LastCommitTS when the caller threads the manifest value through opts.ManifestLastCommitTS"
  2. The ManifestLastCommitTS field doc at line 58 correctly says "EncodeSnapshot fails-closed with ErrSelfTestLowerLastCommitTS"
  3. validateEncodeOptions:181 actually wraps and returns ErrSelfTestLowerLastCommitTS

The fix is to update lines 23-28 to say the error is now returned by both the CLI (resolveLastCommitTS) and the library (validateEncodeOptions when ManifestLastCommitTS > 0):

// EncodeSnapshot itself does NOT read MANIFEST.json, but when the caller
// sets opts.ManifestLastCommitTS > 0, validateEncodeOptions enforces this
// floor and returns ErrSelfTestLowerLastCommitTS before any bytes are
// written. The CLI's resolveLastCommitTS provides the same check for
// flag-driven entry; library callers SHOULD set ManifestLastCommitTS to
// the manifest value for defense-in-depth (codex P2 v2 + claude v3 #904).

Fix this →


Observation (non-blocking)

TestCLISelfTestMismatchWritesSidecarWithMatchedFalse covers stale-sidecar cleanup, not the actual P2 v6 fix

The test exercises the manifest-floor rejection path (resolveLastCommitTS failing before writeAndPublish is called), so the sidecar is never written — the test correctly asserts os.IsNotExist. This confirms the stale-sidecar cleanup works, but does not verify that writeSidecar is called with matched: false when writeAndPublish actually returns errSelfTestMismatch.

The codex P2 v6 fix (the new if publishErr == nil || errors.Is(publishErr, errSelfTestMismatch) block at line 254) is correct in its logic but untested at the CLI integration level. The comment acknowledges this and defers to TestEncodeSnapshotSelfTestDetectsCorruption at the library level. Non-blocking, but the test name (...WritesSidecarWithMatchedFalse) promises something slightly different from what it delivers — a note in the comment clarifying the sidecar IS NOT expected here (because the encode never ran) would prevent future readers from being confused.


Strengths in v7

  • The ManifestLastCommitTS field is correctly zero-valued to opt out — no retroactive breakage to existing tests that pass EncodeOptions{} without a manifest reference.
  • validateEncodeOptions as a standalone function is a good factoring choice: it groups all four pre-condition checks, keeps EncodeSnapshot readable, and provides a clear seam for future additions.
  • The sidecar-on-mismatch priority logic is correct: sidecar-write errors on the success path propagate as the return error; on the mismatch path they degrade to a warning so errSelfTestMismatch always wins. This matches the stated requirement that sidecar is informational-only on mismatch.
  • _ = os.Remove(backup.EncodeInfoSidecarPath(cfg.outputPath)) before encode ensures any stale sidecar from a prior successful run is cleared regardless of the current run's outcome — matches the mismatch.txt cleanup pattern at the same site.

Verdict: One doc fix before merge. All functional behavior is correct; both v6 Codex P2 items are correctly implemented. The ErrSelfTestLowerLastCommitTS var comment at lines 23-28 is the only item that needs updating — it still says "EncodeSnapshot itself does NOT enforce this floor," which is now false after v7.

| Branch |

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f692950fe

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +299 to +300
{"dynamodb", func(s AdapterSet) bool { return s.DynamoDB }, func(b *snapshotBuilder, root string) error {
return errors.Wrap(NewDynamoDBEncoder(root).Encode(b), "dynamodb encoder")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject DynamoDB JSONL inputs before encoding

When an input manifest has dynamodb_layout: "jsonl" and --self-test is not enabled (the default), this path still runs the per-item-only DynamoDB reverse encoder. That encoder only walks items/*.json / items/*/*.json, so a decoded tree with items/data-*.jsonl is accepted but all DynamoDB items are skipped while the CLI exits successfully and publishes an .fsm containing only table metadata. Please fail closed on JSONL manifests until the reverse encoder supports that layout, or pass a JSONL-aware mode through this library path.

Useful? React with 👍 / 👎.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a2ed548f4b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// attacker pre-placing a symlink at <output>.encode_info.json
// could redirect the truncate-and-write into a target of their
// choosing (codex P2 v25 #904).
f, err := backup.OpenSidecarFile(sidecarPath)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve 0600 mode when replacing sidecars

When <output>.encode_info.json already exists with a permissive mode (for example a 0644 sidecar created by an older encoder under umask 022), this open/truncate path does not change its permissions; the 0600 mode in openSidecarFile only applies when the file is newly created. A successful re-encode therefore leaves the freshly written sidecar group/other-readable despite the owner-only contract and the fresh-file test, so this path should chmod after opening or publish via a new 0600 temp file and rename.

Useful? React with 👍 / 👎.

Comment on lines +638 to +640
{"s3", func(s AdapterSet) bool { return s.S3 }, func(b *snapshotBuilder, root string) error {
return errors.Wrap(NewS3RecordEncoder(root).Encode(b), "s3 encoder")
}},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject non-directory S3 bucket entries

When an enabled S3 dump is malformed or partially truncated so s3/<bucket> is a regular file or symlink, this dispatch still succeeds because S3RecordEncoder.Encode skips top-level entries where !ent.IsDir() (internal/backup/encode_s3.go:85-87). Since the manifest currently has only an empty S3 scope, the CLI cannot otherwise detect that bucket was expected, and it can publish a partial .fsm with the bucket silently omitted; the library path should fail closed on non-directory entries under selected adapter roots before encoding.

Useful? React with 👍 / 👎.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

Re-triggering v31 — the original job at https://github.com/bootjp/elastickv/actions/runs/26824025875 errored on the bot side. v31 is doc-only (commit a2ed548); addresses the three doc gaps claude v30 flagged:

  1. validateAdaptersAgainstManifest godoc trimmed to the nil-pointer-only check (v30 retract).
  2. errAdapterNotInManifest var comment trimmed to the nil-scope case only.
  3. classifyEncodeError Sources list gains the errAdapterNotInManifest bullet (was missing since v27).

No behavior change. CI all green (build/test/golangci-lint/CodeQL/CodeRabbit SUCCESS).

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude finished @bootjp's task in 2m 5s —— View job


Code Review - PR #904 (v31)

Tasks

  • Gather context and understand the request
  • Read v31 changes (cmd/elastickv-snapshot-encode/main.go)
  • Verify Fix 1: validateAdaptersAgainstManifest godoc trimmed to nil-pointer check only
  • Verify Fix 2: errAdapterNotInManifest var comment trimmed to nil-scope case only
  • Verify Fix 3: classifyEncodeError Sources list gains errAdapterNotInManifest bullet
  • Check for any new issues introduced in v31
  • Post final review

v31 Verification — All Three v30 Doc Fixes Confirmed

Item Status
Fix 1: validateAdaptersAgainstManifest godoc — "Two failure modes" → "One failure mode"; scenario-A removed; deferred-enumeration root cause documented; pointer to checkOneAdapterScope for per-shape matrix ✅ Lines 139–162
Fix 2: errAdapterNotInManifest var comment — second clause ("manifest lists adapter whose subdir is missing") removed; "Both are silent-data-loss" → single scenario B only; historical retraction note for v27–v29 added ✅ Lines 376–389
Fix 3: classifyEncodeError summary clause — "adapter scope mismatch with manifest" added to exit-2 category list ✅ Lines 95–96
Fix 3: classifyEncodeError Sources list — 9th bullet for errAdapterNotInManifest added with correct source (validateAdaptersAgainstManifest) and codex citations ✅ Lines 117–120
Switch arms vs. Sources list parity — 10 errors.Is arms; 9 bullets (ErrInvalidManifest + ErrUnsupportedFormatVersion share one bullet) ✅ Confirmed correct
Only cmd/elastickv-snapshot-encode/main.go modified ✅ Pure doc-only change
No logic, test, or production code touched ✅ 42 additions / 31 deletions, all in comment blocks

All three doc comment blocks are now internally consistent with v30's nil-only checkOneAdapterScope semantics.


Two Codex P2 v31 Observations (pre-existing, non-blocking)

These were surfaced by Codex reviewing the v31 codebase — not introduced by v31, but newly raised:

writeSidecar failure leaves a published FSM without provenance

cmd/elastickv-snapshot-encode/main.go:429: When writeSidecar returns an error on the success path (publishErr == nil), encodeOne returns the sidecar error after writeAndPublish has already renamed the FSM into place. The CLI exits non-zero but a restore-visible .fsm exists without its sidecar. The design expects .fsm + sidecar to move together; a sidecar-write failure should either atomically publish both or roll back the FSM. Pre-existing; Codex flagged as P2.

openSidecarFile 0600 not enforced on existing sidecar re-open

cmd/elastickv-snapshot-encode/main.go:733: backup.OpenSidecarFile applies 0600 only when creating a new file. On re-encode, an existing sidecar with wider permissions (e.g. 0644 written by an older encoder) retains its permissions after truncation and rewrite. A chmod after open or a temp-file-rename pattern would enforce the mode on both new and existing sidecars. Pre-existing; Codex flagged as P2.


Verdict: Ready to merge. v31 is a clean doc-only fix that correctly aligns all three comment blocks with v30's nil-only scope check. No new issues introduced. The two Codex P2 observations above are pre-existing and do not block this PR.

| Branch | |

…x/claude P2 v31)

Two non-blocking P2 observations claude flagged on its v31 review,
both pre-existing but real correctness/security issues. Folding both
in.

## Codex/claude P2 v31 (1/2): writeSidecar failure leaves orphan .fsm

The encodeOne success path runs writeAndPublish (renames temp .fsm
→ output) BEFORE writeSidecar. If writeSidecar fails after the
rename, the CLI returns non-zero but a restore-visible <output>.fsm
exists without its provenance metadata — exactly the orphan state
v17 closed for the adapter-error path, but un-closed for the
sidecar-write path.

Fix: new rollbackOrphanFSMAndSidecar(outputPath, logger) helper. On
sidecar-write failure in the success path (publishErr == nil), the
caller calls this helper before returning:

  - os.Remove(outputPath) — drops the just-published .fsm so the
    restore path sees nothing where the orphan was.
  - os.Remove(sidecarPath) — drops any partially-written sidecar
    bytes (OpenSidecarFile may have already Truncate(0)'d before
    WriteEncodeInfo failed).

Both removes log-and-continue on non-ErrNotExist failures so the
caller's primary sidecar-write error remains the dominant signal.

A prior successful encode at the same output path is unrecoverable —
writeAndPublish's os.Rename already overwrote it before writeSidecar
ran. The rollback brings the state to "no .fsm, no sidecar," which
is the same end state as "encode never ran" — the cleanest
consistent outcome without filesystem transactions.

Self-test-mismatch path is unchanged: the existing v15 removeStaleOutputFSM
already handles that case, and writeSidecar's failure on that path
already logger.Warn's-and-continues.

Pinned end-to-end by extending TestCLISidecarWriteRefusesSymlinkTarget:
- Existing assertions: victim file at the symlink target survives
  byte-equal; exit 1; symlink itself unlinked.
- New v32 assertion: <output>.fsm is also removed (the rollback
  fired after the .fsm was published but before encodeOne returned).

## Codex/claude P2 v31 (2/2): existing sidecar perm not tightened on re-encode

openSidecarFile (unix) opens with O_CREATE + 0o600. The kernel
applies the mode arg ONLY when creating a new file; if the path
already exists (e.g. an older encoder wrote it at 0o644), the
pre-existing perm is preserved. The sidecar carries the source
path, cluster_id, and SHA256 of the .fsm — leaving it world-readable
on a multi-user backup host is the same leak claude v4 fixed for
the create case.

Fix: after the regular-file + Nlink checks + Truncate(0), call
f.Chmod(sidecarFileMode) to enforce the mode on the open descriptor.
New named const sidecarFileMode = 0o600 so a future widening of the
flag-arg mode forces the chmod call to be touched too.

Windows variant unchanged (different perm model; the openSidecarFile
file is already platform-split for this reason).

Pinned by TestOpenSidecarFileEnforcesOwnerOnlyMode (unix-only via
build tag): writes a pre-existing 0o644 file, calls OpenSidecarFile,
asserts the resulting descriptor's Stat reports perm 0o600.

## Caller audit per CLAUDE.md semantic-change rule

- encodeOne success-path branch: rollback called only when
  publishErr == nil AND writeSidecar fails. Self-test-mismatch path
  unchanged. Sole caller of encodeOne is run(); error return
  semantics unchanged.
- rollbackOrphanFSMAndSidecar: new helper, sole caller is the
  success-path sidecar-failure branch in encodeOne. No external
  callers.
- openSidecarFile (unix): success path now has one extra Chmod
  call. All in-package callers (encode_redis.go's multiple
  KEYMAP / TTL sidecar writers) are unaffected — they all already
  expect a 0o600 file. CLI's writeSidecar / writeMismatchTxt
  (via the exported OpenSidecarFile wrapper) are likewise
  strictly-tighter-perm only.
- sidecarFileMode const: new named const, only consumer is the
  Chmod call in openSidecarFile (unix).

Tests + lint green.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

v32 folds in both P2 observations claude flagged on its v31 review (both pre-existing but worth closing).

v31 P2 (1/2) — writeSidecar failure leaves orphan .fsm: success path runs writeAndPublish (renames temp .fsm → output) BEFORE writeSidecar. If sidecar fails, the published .fsm has no matching provenance. Fix: new rollbackOrphanFSMAndSidecar(outputPath, logger) called on the success-path sidecar-failure branch — removes both the just-published .fsm and any partial sidecar so the state is consistent absent. Self-test-mismatch path unchanged. Pinned by extending TestCLISidecarWriteRefusesSymlinkTarget (the symlink-at-sidecar triggers OpenSidecarFile ELOOP → sidecar failure → rollback fires → .fsm is gone). Victim file still preserved (rollback's os.Remove on the symlink itself, not the target).

v31 P2 (2/2) — OpenSidecarFile mode not enforced on existing files: the unix openSidecarFile opened with O_CREATE + 0o600, but the mode arg only applies on creation; an older encoder's 0o644 sidecar kept its wider perms on re-encode. The sidecar carries source path + cluster_id + SHA256 — same leak surface as the v4 fix for the create case. Fix: post-Truncate(0), call f.Chmod(sidecarFileMode) to tighten. New named const sidecarFileMode = 0o600 so a future widening forces both sites to be touched. Windows variant unchanged (different perm model; already platform-split). Pinned by new unix-build-tagged TestOpenSidecarFileEnforcesOwnerOnlyMode.

Caller audit (CLAUDE.md): encodeOne rollback is gated on success-path-with-sidecar-failure only; other paths unchanged. openSidecarFile (unix) success path adds one Chmod call; all in-package callers expected 0o600 already — strictly tighter, no behavior regression.

go test -race ./internal/backup/... ./cmd/elastickv-snapshot-encode/... + golangci-lint green.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6f2040858f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/elastickv-snapshot-encode/main.go Outdated
Comment on lines +641 to +642
if err := fsyncParentDir(cfg.outputPath); err != nil {
return result, errors.Wrap(err, "fsync output dir after rename")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Roll back FSM when directory fsync fails

When the rename succeeds but fsyncParentDir returns an error (for example on a filesystem that rejects directory fsync, or an I/O error), this returns before writeSidecar runs while publishedTempPath has already been cleared. The command exits failed but leaves a new restore-visible .fsm at --output without a matching .encode_info.json, so this path needs the same rollback/consistency handling as sidecar publication failures.

Useful? React with 👍 / 👎.

Comment on lines +519 to +520
if srerr := os.Remove(sidecarPath); srerr != nil && !errors.Is(srerr, os.ErrNotExist) {
logger.Warn("rollback partial sidecar after write failure", "err", srerr)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid deleting pre-existing sidecar directories

When sidecar publication fails because <output>.encode_info.json is a pre-existing non-regular entry such as an empty directory or FIFO, OpenSidecarFile correctly refuses to write it, but this rollback then calls os.Remove on that same path even though no partial sidecar was created. That can delete operator-owned filesystem objects merely because the encode failed; only remove a sidecar known to be a regular file/symlink created or truncated by this run, or leave the refused entry in place.

Useful? React with 👍 / 👎.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

Re-triggering v32 — original job errored. The build CI flake (Docker Hub timeout fetching buildkit) has been re-run and is in progress; all other v32 CI checks are green (test, golangci-lint, CodeQL, CodeRabbit, Analyze-actions/js/go all SUCCESS or NEUTRAL).

v32 (commit 6f20408): two pre-existing P2 fixes — rollback orphan FSM + chmod existing sidecar to 0o600. No new logic; two new tests + one extended test.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

Third re-trigger for v32 (prior two attempts errored with bot-side infrastructure issues). All v32 CI checks are now SUCCESS (build re-run cleared, Analyze-go completed). Codex remains silent on v32 (likely rate-limited).

v32 (commit 6f20408) = the two non-blocking fixes you flagged on v31:

  1. rollbackOrphanFSMAndSidecar on success-path sidecar failure (no orphan .fsm).
  2. openSidecarFile (unix) chmod 0o600 enforcement on existing files.

No new logic beyond your own v31 guidance.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

Codex's review window came back online after the rate-limit pause
and posted three P2 findings against v30, v31, and v32 that I had
missed (the inline comments were attached to PR reviews 4410223887,
4410270684, 4410442057 but the review BODIES were just the standard
"Codex Review" header — the actual findings were in the per-review
inline comments).

## Codex P2 v31 (L640): S3 encoder silently skips non-directory bucket entries

S3RecordEncoder.Encode does `if !ent.IsDir() { continue }` for
top-level entries under `s3/`. When a partially-truncated dump
leaves a regular file or symlink where a bucket directory should
be, the encoder publishes a partial .fsm with the affected bucket
silently omitted. The deferred-enumeration empty S3 scope in the
manifest cannot otherwise flag the missing data.

Fix: encode_s3.go's loop now distinguishes:
  - Reserved-prefix entries ("_*") → continue (forward compat with
    `_incomplete_uploads/` and `_orphans/`, which are handled by
    ErrEncodeUnsupportedS3IncompleteUploads / ErrEncodeUnsupportedS3Orphans).
  - Any other non-directory entry → return ErrS3EncodeNotRegular
    with the path and mode (matches the existing _bucket.json non-
    regular guard).

Pinned by TestS3EncodeRejectsNonDirectoryBucketEntry and
TestS3EncodeIgnoresReservedPrefixEntry.

## Codex P2 v32 (L642): rollback FSM when parent-dir fsync fails

v25 added fsyncParentDir after os.Rename in writeAndPublish to make
the rename durable. If fsync fails (e.g., on a filesystem that
rejects directory fsync, or transient I/O error), writeAndPublish
returned the error but the renamed .fsm stayed at --output — the
operator sees a "successful" .fsm that may not survive a crash.

Fix: extract the rename + fsync + rollback block into a new
publishAndFsync helper that os.Remove's the .fsm on fsync failure
(same rollback semantics as the v32 sidecar-failure path).
publishAndFsync also brings writeAndPublish under the cyclop bound.

No new test — the failure path is hard to drive deterministically
(would need fsync syscall injection). The success path is exercised
by every existing CLI round-trip test; the rollback branch is
inspectable by code review against the matching v32 sidecar
rollback pattern.

## Codex P2 v32 (L520): rollback must not delete operator-owned sidecar entries

v32's rollbackOrphanFSMAndSidecar called os.Remove on the sidecar
path unconditionally. When the sidecar path was a pre-existing
non-regular entry that OpenSidecarFile correctly refused to clobber
(operator-placed symlink, FIFO, empty directory, etc.), the
rollback would destructively unlink it merely because the encode
failed.

Fix: rollbackOrphanFSMAndSidecar now os.Lstats the sidecar path and
only os.Removes when info.Mode().IsRegular(). Symlinks (operator-
placed), FIFOs, directories, and other non-regular entries are
logged-and-skipped.

Pinned end-to-end by extending TestCLISidecarWriteRefusesSymlinkTarget:
- Existing assertions: victim file at the symlink target survives
  byte-equal; .fsm at --output is removed by v32 rollback.
- NEW v33 assertion: the operator-placed symlink at the sidecar
  path is PRESERVED (os.Lstat + ModeSymlink check).

The assertions are extracted into assertSymlinkSidecarRollbackInvariants
to keep the test under the cyclop bound.

## Caller audit per CLAUDE.md semantic-change rule

- S3RecordEncoder.Encode: was "silent skip on non-directory top-
  level entries"; now "fail closed with ErrS3EncodeNotRegular
  unless name starts with `_`". Sole effect on legitimate dumps
  is no behavior change (all canonical bucket dirs are
  directories). Malformed dumps that previously silently
  half-encoded now fail closed; classifyEncodeError routes via
  the v9 errors.Mark + ErrEncodeAdapterData layer to exit 2.
- publishAndFsync (new helper): sole caller is writeAndPublish.
  Behavior change ONLY on fsync failure (was: leave .fsm in
  place; now: roll back). Success path unchanged.
- rollbackOrphanFSMAndSidecar: sidecar removal is now gated on
  IsRegular(). Was destructive on operator-owned non-regular
  entries; now leaves them alone. Sole caller is encodeOne's
  success-path sidecar-failure branch.
- assertSymlinkSidecarRollbackInvariants (test helper): no
  production callers; only used by TestCLISidecarWriteRefusesSymlinkTarget.

Tests + lint green (incl. gofmt on the new helper's comment block).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

v33 closes three codex P2 findings I missed across v30/v31/v32 (codex posted them as PR-review inline comments rather than separate review-body posts, so they didn't surface in my initial scans — found them by querying /reviews/<id>/comments for each recent codex review).

codex P2 v31 L640 — S3 silently skips non-directory bucket entries: S3RecordEncoder.Encode did if !ent.IsDir() { continue }. A truncated dump with a regular file or symlink where a bucket dir should be would silently omit that bucket. Fix: fail-closed with ErrS3EncodeNotRegular; reserved-prefix _* entries (forward compat with _incomplete_uploads/, _orphans/) still skip. Pinned by two new library tests.

codex P2 v32 L642 — rollback FSM when parent-dir fsync fails: v25's fsyncParentDir runs after os.Rename. If fsync fails, the orphan .fsm survives. Fix: extracted publishAndFsync helper that rolls back the rename on fsync failure (mirrors v32's sidecar-failure rollback).

codex P2 v32 L520 — don't os.Remove operator-owned sidecar entries: v32's rollbackOrphanFSMAndSidecar unconditionally os.Removed the sidecar path. If it was a pre-existing symlink/FIFO/directory that OpenSidecarFile correctly refused to clobber, the rollback would destroy operator-owned filesystem objects. Fix: os.Lstat first; only remove if IsRegular(). Pinned by extending TestCLISidecarWriteRefusesSymlinkTarget to assert the operator-placed symlink survives.

Caller audit (CLAUDE.md): S3 encoder fail-closed flows through runAdapterEncoders' errors.Mark(ErrEncodeAdapterData) → exit-2; publishAndFsync is a pure extraction (only behavior change is rollback on fsync fail); rollback's Lstat gate is strictly safer.

go test -race ./internal/backup/... ./cmd/elastickv-snapshot-encode/... + golangci-lint green.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5adca7f679

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/elastickv-snapshot-encode/main.go Outdated
}
return
}
if !info.Mode().IsRegular() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve hard-linked sidecars on rollback

When the sidecar path is a pre-existing hard link, OpenSidecarFile refuses it before truncating, but this rollback treats it as a regular file and the following os.Remove unlinks it anyway. In a shared backup directory this means a failed encode can delete an operator-owned sidecar link that the no-clobber guard intentionally left untouched; check the link count (or track whether this run actually opened/truncated the file) before removing it.

Useful? React with 👍 / 👎.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

Re-triggering v33 — prior bot job errored. v33 (commit 5adca7f) closes three codex P2 inline findings on v30/v31/v32:

  1. S3 encoder fail-closed on non-directory top-level entries (codex v31 L640).
  2. publishAndFsync rollback on parent-dir fsync failure (codex v32 L642).
  3. rollbackOrphanFSMAndSidecar Lstat-gates the sidecar removal so operator-owned non-regular entries are preserved (codex v32 L520).

CI on v33 currently green: build/test/golangci-lint/CodeQL/Analyze all SUCCESS; ubuntu-latest test + Analyze-go still in progress.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
internal/backup/open_nofollow_unix_test.go (1)

20-24: 💤 Low value

Consider verifying the seeded file actually has broader permissions.

The test seeds a file at 0o644 but doesn't verify that permission was honored before testing the enforcement. On systems with restrictive umasks or filesystems, WriteFile might silently produce 0o600, causing the test to pass even if the Chmod logic is broken.

♻️ Proposed robustness check (pattern from encryption test)
 	if err := os.WriteFile(path, []byte("prior"), 0o644); err != nil { //nolint:gosec // test simulates legacy permissive sidecar
 		t.Fatalf("WriteFile: %v", err)
 	}
+	// Verify environment honored the broader mode before testing enforcement.
+	st, err := os.Stat(path)
+	if err != nil {
+		t.Fatalf("Stat seeded file: %v", err)
+	}
+	if st.Mode().Perm()&0o077 == 0 {
+		t.Skipf("environment refused permissive seed (mode=%o)", st.Mode().Perm())
+	}
 	f, err := OpenSidecarFile(path)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/backup/open_nofollow_unix_test.go` around lines 20 - 24, After
seeding the sidecar file with os.WriteFile(..., 0o644), immediately stat the
file with os.Stat and assert the file mode's permissions (info.Mode().Perm())
equal 0o644; if not, fail the test (t.Fatalf) so the test doesn't silently pass
on systems that applied a restrictive umask or different FS behavior. Locate the
seed call in the test (the os.WriteFile invocation) and add the os.Stat +
Mode().Perm() check right after it to ensure the precondition for the
chmod-enforcement logic is actually met.
internal/backup/open_nofollow_unix.go (1)

61-61: ⚡ Quick win

Use sidecarFileMode constant in the OpenFile call.

The comment on line 116 states "a future edit that widens one must touch both," but line 61 still uses the literal 0o600. Replacing it with sidecarFileMode would mechanically enforce synchronization rather than relying on manual coordination.

♻️ Proposed fix to use the constant consistently
-	f, err := os.OpenFile(path, flag, 0o600) //nolint:gosec,mnd // path is composed from output-root + fixed file name; 0600 is the standard owner-only mode
+	f, err := os.OpenFile(path, flag, sidecarFileMode) //nolint:gosec,mnd // path is composed from output-root + fixed file name; sidecarFileMode is the standard owner-only mode

Also applies to: 114-118

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/backup/open_nofollow_unix.go` at line 61, Replace the hard-coded
file mode 0o600 in the os.OpenFile call with the sidecarFileMode constant so the
mode is maintained centrally; update the call that currently reads
os.OpenFile(path, flag, 0o600) to use sidecarFileMode (referencing the same
variables path and flag), and grep this file for any other literal 0o600 usages
(e.g., the similar open/create calls later in the file) and replace them with
sidecarFileMode to ensure consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/backup/open_nofollow_unix_test.go`:
- Around line 20-24: After seeding the sidecar file with os.WriteFile(...,
0o644), immediately stat the file with os.Stat and assert the file mode's
permissions (info.Mode().Perm()) equal 0o644; if not, fail the test (t.Fatalf)
so the test doesn't silently pass on systems that applied a restrictive umask or
different FS behavior. Locate the seed call in the test (the os.WriteFile
invocation) and add the os.Stat + Mode().Perm() check right after it to ensure
the precondition for the chmod-enforcement logic is actually met.

In `@internal/backup/open_nofollow_unix.go`:
- Line 61: Replace the hard-coded file mode 0o600 in the os.OpenFile call with
the sidecarFileMode constant so the mode is maintained centrally; update the
call that currently reads os.OpenFile(path, flag, 0o600) to use sidecarFileMode
(referencing the same variables path and flag), and grep this file for any other
literal 0o600 usages (e.g., the similar open/create calls later in the file) and
replace them with sidecarFileMode to ensure consistency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3f511d1d-5715-4bc5-a9b4-21781d99f0ca

📥 Commits

Reviewing files that changed from the base of the PR and between 38f001e and 5adca7f.

📒 Files selected for processing (6)
  • cmd/elastickv-snapshot-encode/main.go
  • cmd/elastickv-snapshot-encode/main_test.go
  • internal/backup/encode_s3.go
  • internal/backup/encode_s3_test.go
  • internal/backup/open_nofollow_unix.go
  • internal/backup/open_nofollow_unix_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/elastickv-snapshot-encode/main.go
  • cmd/elastickv-snapshot-encode/main_test.go

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude finished @bootjp's task in 4m 42s —— View job


Code Review - PR #904 (v33)

Tasks

  • Gather context and understand the request
  • Read changed files (v33 commit 5adca7f)
  • Review Fix 1: S3 encoder fail-closed on non-directory top-level entries
  • Review Fix 2: publishAndFsync rollback on parent-dir fsync failure
  • Review Fix 3: rollbackOrphanFSMAndSidecar Lstat-gate for sidecar removal
  • Check for any new issues introduced in v33
  • Post final review

v33 Verification — All Three Codex P2 Fixes Confirmed

Item Status
Codex P2 v31 L640: !ent.IsDir() in S3 encoder was a silent continue ✅ Now returns errors.Wrapf(ErrS3EncodeNotRegular, "s3/%s is not a directory…") for non-directory, non-_* entries
_* reserved-prefix entries (forward compat for _incomplete_uploads/, _orphans/) still skipped strings.HasPrefix(name, "_")continue inside the !ent.IsDir() branch
TestS3EncodeRejectsNonDirectoryBucketEntry: stray.txt under s3/ErrS3EncodeNotRegular
TestS3EncodeIgnoresReservedPrefixEntry: _marker file under s3/ → no error
Codex P2 v32 L642: writeAndPublish did not roll back after fsync failure publishAndFsync extracted; on fsync failure calls os.Remove(outputPath) before returning the error
publishedTempPath deferred cleanup on fsync-failure path: os.Remove(tempPath) → ErrNotExist silently ignored ✅ Deferred _ = os.Remove(publishedTempPath) handles ErrNotExist gracefully
Codex P2 v32 L520: rollbackOrphanFSMAndSidecar unconditionally removed sidecar path ✅ Now os.Lstat first; only removes if info.Mode().IsRegular(); non-regular (symlink/FIFO/dir) logged-and-skipped
assertSymlinkSidecarRollbackInvariants extended with third assertion: os.Lstat(sidecarPath) non-nil AND mode has ModeSymlink bit ✅ Pins that the operator-placed symlink is preserved by the v33 Lstat gate

One Minor Doc Nit (non-blocking)

encode_s3.go line 93 cites "codex P2 v32 #904" — should be "codex P2 v31 #904"

The S3 non-directory bucket finding appeared in the codex review of commit a2ed548f (v31 commit, reviewed at 2026-06-02T13:52:11Z). Codex's v32 review (6f2040858f at 2026-06-02T14:10:18Z) covered the fsync and sidecar-rollback findings. The inline comment at line 93 currently reads:

// ...silently skipping would let the encoder publish a partial .fsm with
// the affected bucket omitted (codex P2 v32 #904; the manifest's…

This should read (codex P2 v31 #904; …). Not a blocker — the code behavior is entirely correct; only the attribution round-number is off by one.


Carry-Over Observations (non-blocking, from CodeRabbit v33 review)

open_nofollow_unix_test.go line 22 — seed file permissions not verified before enforcement test

TestOpenSidecarFileEnforcesOwnerOnlyMode seeds a file at 0o644 but doesn't verify that permission was honored before testing the enforcement. On systems with a very restrictive umask, os.WriteFile(..., 0o644) might produce 0o600, causing the test to pass even if Chmod in openSidecarFile is broken. Adding a t.Skipf guard ("environment refused permissive seed") before calling OpenSidecarFile would make the test self-aware of this edge. CodeRabbit flagged this as a low-priority nitpick.

open_nofollow_unix.go line 61 — 0o600 literal instead of sidecarFileMode constant

f, err := os.OpenFile(path, flag, 0o600)

sidecarFileMode is defined at line 118 and is used at line 107 (f.Chmod(sidecarFileMode)). The comment at lines 114–117 explicitly says "a future edit that widens one must touch both" — but if line 61 used the constant, a single-location update would cover both. Replacing the literal with the constant removes the coordination burden. CodeRabbit flagged this as a quick-win nitpick.


Strengths in v33

  • The _* skip in !ent.IsDir() is correctly nested inside the not-a-directory branch, not at the top-level loop. This means a _incomplete_uploads that IS a directory continues to reach encodeBucket (pre-existing behavior, unchanged), while a regular-file _* entry is skipped rather than failing. The reserved-prefix test (TestS3EncodeIgnoresReservedPrefixEntry) pins this edge directly.
  • publishAndFsync is cleanly extracted as a standalone function with a single-responsibility scope (rename → fsync → rollback-on-failure), which is why writeAndPublish remained under the cyclop bound. The rollback comment is precise about why os.Remove is safe here: the rename just succeeded, so we know outputPath is a regular file (or doesn't exist if the kernel concurrently removed it).
  • The rollbackOrphanFSMAndSidecar Lstat gate at line 537 correctly distinguishes "this run left a partial sidecar we should clean up" from "there's operator-owned non-regular content at the sidecar path that OpenSidecarFile correctly refused to touch." The log-and-skip path on non-regular entries is the right behavior — the sidecar-write error already propagates to encodeOne as the dominant signal.
  • assertSymlinkSidecarRollbackInvariants is a clean extraction that pins all three post-failure invariants (victim unchanged, FSM gone, symlink preserved) in a single named helper. Future tests that exercise similar sidecar-failure paths can reuse it.

Verdict: Ready to merge. All three codex P2 findings are correctly implemented and pinned by tests. The one doc nit (round-number citation in encode_s3.go:93) and the two CodeRabbit carry-overs are non-blocking. All prior rounds' (v1–v32) findings remain intact.

| Branch | |

…odex P2 v33)

## Codex P2 v33: rollback destroys hard-linked sidecars

v32's rollbackOrphanFSMAndSidecar gated sidecar removal on
info.Mode().IsRegular(). A pre-existing hard link at the sidecar
path passes IsRegular(), so the rollback would os.Remove it —
destroying an operator-owned link that OpenSidecarFile's Nlink > 1
guard correctly refused to clobber. In a shared backup directory
this lets a failed encode unlink files outside the dump tree.

## Fix

Track whether OpenSidecarFile actually succeeded on this run.
writeSidecar's signature changes from `error` to
`(truncated bool, err error)`:

- OpenSidecarFile fails (symlink ELOOP, hard-link Nlink>1, FIFO
  ENXIO, non-regular file, etc.) → return (false, err). Caller
  MUST NOT remove the sidecar path; the pre-existing entry is
  operator-owned and was never touched.
- OpenSidecarFile succeeds (truncated and chmod'd a fresh or
  single-link regular file) → return (true, err) on any subsequent
  failure. Caller rollback removes the partial bytes.

rollbackOrphanFSMAndSidecar gains a sidecarTruncated bool
parameter. The Lstat + IsRegular check from v33 is removed — the
truncated flag is a precise signal of "this run owns the bytes,"
not a stat-time heuristic that can race with operator changes.

## Pinned by

TestCLISidecarWriteRefusesHardLinkTarget (unix-only): pre-plants a
victim file with a sentinel body, os.Link's the sidecar path to it,
runs the CLI, asserts:
- Exit code = exitUserErr.
- Victim contents are byte-equal preserved (Nlink guard refused
  truncation).
- Hard link at sidecar path is preserved (sidecarTruncated=false
  → rollback skips it).
- .fsm at outputPath is removed (rollback's FSM branch always
  fires on sidecar-write failure with successful publish).

The v33 TestCLISidecarWriteRefusesSymlinkTarget continues to pass:
OpenSidecarFile ELOOP → sidecarTruncated=false → symlink survives.

## CodeRabbit nits folded in

1. `internal/backup/open_nofollow_unix.go:61`: `os.OpenFile(...,
   flag, 0o600)` now uses the `sidecarFileMode` const (was
   literal). Keeps the OpenFile mode and the post-Truncate
   Chmod mode synchronized — a future widening of one is
   mechanically caught.

2. `internal/backup/open_nofollow_unix_test.go:20-24`:
   TestOpenSidecarFileEnforcesOwnerOnlyMode now `os.Stat`s the
   seeded 0o644 file and `t.Skip`s if the environment refused the
   permissive mode (restrictive umask / FS). Without this, the
   test could pass spuriously on environments where WriteFile
   silently produces 0o600.

## Caller audit per CLAUDE.md semantic-change rule

- writeSidecar return signature changed from `error` to
  `(bool, error)`. Sole caller is encodeOne (success-path branch);
  updated in lock-step to capture both values. The bool flows into
  rollbackOrphanFSMAndSidecar via a new parameter.
- rollbackOrphanFSMAndSidecar gained a sidecarTruncated bool
  parameter. Sole caller is encodeOne's success-path
  sidecar-failure branch; updated in lock-step. v33's Lstat +
  IsRegular check removed (replaced by the precise truncated
  flag), so non-regular paths are no longer destroyed even by
  paths the v33 check would have accidentally caught.
- openSidecarFile (unix) OpenFile mode arg changed from 0o600
  literal to sidecarFileMode const. Same numeric value; no
  behavior change. All callers (in-package adapter writers + the
  exported OpenSidecarFile) are unaffected.

Tests + lint green.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

v34 addresses codex P2 v33 (hard-linked sidecars) plus two CodeRabbit nits.

codex P2 v33 — v32/v33's rollback gated on IsRegular(), which is true for hard links. OpenSidecarFile refuses hard links (Nlink > 1) but the rollback would unlink them anyway, destroying operator-owned files outside the dump tree. Fix: track sidecarTruncated bool from writeSidecar. The bool is true iff OpenSidecarFile succeeded — a precise "this run owns the bytes" signal rather than a stat-time heuristic. Rollback only removes the sidecar when truncated; refused entries (symlinks, hard links, FIFOs, etc.) are operator-owned and preserved. Pinned by new TestCLISidecarWriteRefusesHardLinkTarget (unix-only; victim contents byte-equal preserved, hard link survives, .fsm removed).

CodeRabbit nit (quick win)os.OpenFile in openSidecarFile now uses sidecarFileMode const instead of 0o600 literal; keeps it synchronized with the post-Truncate Chmod.

CodeRabbit nit (low value)TestOpenSidecarFileEnforcesOwnerOnlyMode now os.Stats the seeded 0o644 file and t.Skips if the environment refused the permissive seed (restrictive umask), so the test doesn't pass spuriously.

Caller audit (CLAUDE.md): writeSidecar signature changed from error to (bool, error); sole caller is encodeOne (updated in lock-step). rollbackOrphanFSMAndSidecar gained sidecarTruncated bool; sole caller updated. v33's Lstat+IsRegular gate replaced by the precise truncated flag, so hard-link / FIFO / dir cases are now correctly preserved.

go test -race ./internal/backup/... ./cmd/elastickv-snapshot-encode/... + golangci-lint green.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

Claude v33 caught: encode_s3.go:93 attributed the non-directory
bucket guard to "codex P2 v32 #904" but the finding was actually
flagged in codex's review of commit a2ed548 (v31 commit). v32 was
when codex flagged the fsync + sidecar-rollback issues. The
attribution round-number was off by one.

Doc-only one-character fix (v32 → v31). No behavior change; lint
green.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

v35 = doc-only one-character fix per claude v33 nit (encode_s3.go:93 citation v32 → v31).

(Also note: v34 from earlier addresses codex P2 v33 hard-linked sidecar rollback + two CodeRabbit nits; v34's claude bot job errored without producing output, but CI on v34 was all green. v35 stacks on v34.)

go build + golangci-lint green.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

if !ent.IsDir() {
continue

P2 Badge Fail closed on non-directory DynamoDB tables

When dynamodb/ contains a canonical table entry that is a regular file or symlink (for example from a truncated or reused decoded tree), this branch silently skips it and the encoder can publish an .fsm with that table omitted; the S3 path in this same change already treats non-directory top-level entries as malformed for this reason. Please return an adapter-data error for non-directory entries under the selected DynamoDB root instead of treating them as absent.


if !ent.IsDir() {
continue

P2 Badge Fail closed on non-directory SQS queues

When sqs/ contains a queue entry that is a regular file or symlink, this silently skips it, so a malformed or partially cleaned input tree can encode successfully while dropping the affected queue and its messages. Since enabled adapter errors are now classified as data failures, this should reject non-directory top-level SQS entries rather than publish a partial restore artifact.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 2, 2026

@claude review

Re-triggering for v35 — prior two bot jobs errored. Status:

  • v34 (commit 03a9883): codex review came back with 0 inline findings — clean approval.
  • v33 (commit 5adca7f): claude verdict was "Ready to merge"; the one v32→v31 citation nit it flagged is now fixed in v35.
  • v35 (commit 365744e): one-char doc-only fix.
  • CI on v34: all green; v35 CI in progress.

If this v35 review also errors, the prior bot signals form a coherent review-OK.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 2, 2026

Claude finished @bootjp's task in 59s —— View job


Code Review - PR #904 (v35)

Tasks

  • Gather context and understand the request
  • Read changed files (v35 commit 365744e)
  • Verify mandatory v33 nit: citation v32 → v31 in encode_s3.go
  • Check for any new issues introduced in v35
  • Post final review

v35 Verification — v33 Citation Nit Confirmed

The diff is exactly one token change in one comment line:

Item Status
encode_s3.go:93: codex P2 v32 #904codex P2 v31 #904
The S3 non-directory bucket finding appeared in codex's review of v31 (commit a2ed548f, reviewed at 2026-06-02T13:52:11Z) ✅ Confirmed against git log
v32 citations elsewhere in the codebase (main.go:656, main.go:674, main_test.go:1066, encode_s3_test.go:140,166) are all accurate — v32 introduced the fsync, rollback, and sidecar permission changes ✅ No over-correction
No logic, test, or other production files touched ✅ Pure doc-only change

Verdict: Ready to merge. v35 closes the single citation nit from the v33 review. All findings across all 35 rounds of review are resolved. The PR is clean.

  • v1–v5: parsing strictness, file permissions, dead code, doc correctness
  • v6–v10: zero-adapter guard, sidecar-on-mismatch, DynamoDB JSONL fail-closed, adapter-data exit-2, phantom sentinel name
  • v11–v15: doc only, test name parity, test error handling, file permission enforcement, manifest error exit-2
  • v16–v20: Redis multi-DB fail-closed, buf.Len() assertion parity, EncodeSnapshot godoc placement, zero-AdapterSet test correction
  • v21–v25: symlink stale output removal, writeAndPublish doc, test refactors, fsync parent dir, sidecar no-follow open
  • v26–v30: adapter scope validation, v28/v29 scope-content heuristic retract (codex P1)
  • v31–v35: doc alignment with v30, rollback orphan FSM + chmod sidecar, S3/fsync/sidecar-rollback P2 fixes, hard-link sidecar preservation, citation fix

| Branch | |

@bootjp bootjp merged commit 9859f00 into main Jun 3, 2026
10 checks passed
@bootjp bootjp deleted the backup/m6-cli-design branch June 3, 2026 03:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant