backup: Phase 0b M6 implementation - cmd/elastickv-snapshot-encode + library#904
Conversation
Implementation of the M6 design (this PR's earlier doc commits 31c0a0f through 1736435). Wires the merged M1-M5 encoder slices into a user-facing CLI plus a library entrypoint, with the round-trip self-test the parent doc mandates. ## What ships - internal/backup/manifest.go: Exclusions gains RenameS3Collisions bool with json tag rename_s3_collisions. Intentionally NOT added to exclusionsRequiredFields so legacy manifests decode safely with the zero value false. Pinned by TestExclusionsLegacyManifestOmitsRenameS3Collisions. - cmd/elastickv-snapshot-decode/main.go: emitManifest now populates the new field from cfg.renameCollisions, so M6's self-test can thread the same option back through DecodeSnapshot. - internal/backup/encode_info.go: ENCODE_INFO.json schema with format_version gate, NewEncodeInfo constructor, WriteEncodeInfo / ReadEncodeInfo helpers, and EncodeInfoSidecarPath that derives the sidecar from the .fsm path (<output>.encode_info.json — no static name collisions). - internal/backup/encode_snapshot.go: EncodeSnapshot(opts, out) library entrypoint mirroring DecodeSnapshot. Dispatches per-adapter encoders in canonical fan-out order (redis -> dynamodb -> s3 -> sqs), and implements the two-mode buffering model: - SelfTest=false: stream FSM straight to out with a sha256.Writer tee. - SelfTest=true: buffer FSM in *bytes.Buffer, fire the unexported corruptBufferForTest hook (if set), self-test against the buffer, copy to out ONLY on match. Corruption hooks reach the self-test decode but never reach out (write-then-rename atomicity). - cmd/elastickv-snapshot-encode/main.go: CLI with --input / --output / --adapter / --last-commit-ts / --self-test / --scratch-root flags; decoder-parity adapter CSV parser; fail-closed T >= manifest.last_commit_ts validation; atomic publish via <output>.tmp-<random> with deferred cleanup; sidecar emission with fsync+close discipline. ## Tests (12 new) internal/backup/encode_info_test.go: - TestEncodeInfoRoundTrip - TestEncodeInfoRejectsUnknownFormatVersion - TestExclusionsLegacyManifestOmitsRenameS3Collisions internal/backup/encode_snapshot_test.go: - TestEncodeSnapshotLibraryRoundTrip - TestEncodeSnapshotSelfTestMatchesInput (uses canonicalize-once pattern: encode -> decode -> re-encode self-tests cleanly) - TestEncodeSnapshotSelfTestDetectsCorruption (corruption reaches self-test decode, NEVER reaches the io.Writer) - TestEncodeSnapshotRequiresInputRoot - TestEncodeInfoSidecarPath cmd/elastickv-snapshot-encode/main_test.go: - TestCLIRejectsMissingManifest - TestCLIRejectsUnknownAdapter - TestCLIRejectsLowerLastCommitTSOverride - TestCLIAcceptsEqualAndHigherLastCommitTSOverride (sub-tests for equal + higher) - TestCLIEncodeInfoPathDerivedFromOutput - TestCLIEncodeInfoTwoFilesNoCollision - TestCLIRoundTripSelfTestAllAdapters - TestCLISelfTestFailureLeavesNoFsmAtOutputPath - TestParseLastCommitTS All green (race + no cache). golangci-lint clean. ## Caller audit per CLAUDE.md semantic-change rule - Exclusions struct gained a field. Existing callers either: (a) build the struct via field-tagged literals (decoder CLI's emitManifest — updated to populate the new field), or (b) read it (encoder's buildSelfTestDecodeOptions — new code). No silent semantic change for any pre-existing caller. - DecodeOptions.RenameS3Collisions was already a public field used by the decoder; the encoder now also reads it via the manifest. No caller-side change needed. ## Self-review 1. Data loss: write-then-rename atomic publish — self-test failure never publishes the .fsm. Mismatch txt + corrupt buffer never reach the io.Writer (pinned by TestEncodeSnapshotSelfTestDetectsCorruption asserting out.Len()==0). 2. Concurrency: pure offline. The CLI is a single-shot binary; library entrypoint takes an io.Writer the caller owns. 3. Performance: SelfTest=false streams with one sha256 tee, no extra allocations. SelfTest=true allocates one FSM-sized buffer plus the scratch decode tree (documented memory cost). 4. Data consistency: --last-commit-ts T < manifest is fail-closed with typed error; self-test threads MANIFEST DecodeOptions (Exclusions.* + DynamoDBLayout -> DynamoDBBundleJSONL) so trees produced with non-default decoder flags round-trip cleanly. 5. Test coverage: 12 new tests cover library entrypoint, CLI flag parsing, atomic publish discipline, sidecar path-derivation, corruption detection, forward-compat for legacy manifests.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a snapshot-encode CLI and EncodeSnapshot library with optional on-disk self-test and deterministic structural diffs, ENCODE_INFO.json sidecar schema and helpers, atomic temp→publish flow with randomized temp names, and a manifest Exclusions flag rename_s3_collisions that decode emits. ChangesSnapshot Encoding Pipeline
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI as elastickv-snapshot-encode
participant Lib as EncodeSnapshot
participant FS as Filesystem
participant Decode as DecodeSnapshot
User->>CLI: run encode (--input, --output, --adapter, --last-commit-ts, --self-test)
CLI->>FS: read <input>/MANIFEST.json
CLI->>Lib: build EncodeOptions + run EncodeSnapshot
Lib->>FS: read adapter input trees (redis/dynamodb/s3/sqs)
Lib->>Decode: when SelfTest=true, write temp .fsm -> DecodeSnapshot (self-test)
Decode->>Lib: return match/mismatch + decoded tree
Lib->>FS: on success rename temp .fsm -> <output> and fsync parent dir
CLI->>FS: write <output>.encode_info.json sidecar (safe open/write/fsync)
CLI->>User: exit code (0 or classified non-zero)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review |
|
Claude finished @bootjp's task in 2m 59s —— View job Code Review - PR #904
|
There was a problem hiding this comment.
Code Review
This pull request implements the elastickv-snapshot-encode command-line tool and library, which encodes vendor-independent directory trees into native .fsm files, generates an encode_info.json sidecar, and supports an optional round-trip self-test. The review feedback highlights several critical memory-efficiency improvements to prevent Out Of Memory (OOM) crashes on large production snapshots. Key recommendations include writing the .fsm to a temporary file on disk instead of buffering it in memory during the self-test, streaming the file comparison chunk-by-chunk rather than reading all file bytes into memory, and increasing the entropy of the temporary file suffix to avoid collisions.
| func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, out io.Writer) (EncodeResult, error) { | ||
| var buf bytes.Buffer | ||
| bytesWritten, err := b.WriteTo(&buf) | ||
| if err != nil { | ||
| return EncodeResult{}, errors.WithStack(err) | ||
| } | ||
| bufBytes := buf.Bytes() | ||
| if opts.corruptBufferForTest != nil { | ||
| opts.corruptBufferForTest(bufBytes) | ||
| } | ||
|
|
||
| header, mismatchTxt, matched, stErr := runSelfTest(bufBytes, opts) | ||
| result := EncodeResult{ | ||
| Header: header, | ||
| BytesWritten: bytesWritten, | ||
| SHA256: sha256.Sum256(bufBytes), | ||
| SelfTestRan: true, | ||
| SelfTestMatched: matched, | ||
| SelfTestMismatchTxt: mismatchTxt, | ||
| AdaptersEnabled: enabled, | ||
| } | ||
| if stErr != nil { | ||
| return result, stErr | ||
| } | ||
| if !matched { | ||
| return result, nil | ||
| } | ||
| if _, err := io.Copy(out, bytes.NewReader(bufBytes)); err != nil { | ||
| return result, errors.Wrap(err, "copy buffered fsm to out") | ||
| } | ||
| return result, nil | ||
| } |
There was a problem hiding this comment.
Buffering the entire .fsm file in memory (bytes.Buffer) during the self-test can lead to Out Of Memory (OOM) crashes on large production snapshots. Instead, write the .fsm directly to a temporary file on disk, compute the SHA256 hash in a single pass using the existing sha256Writer, and stream-read from the temporary file for the self-test and copying.
func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, out io.Writer) (EncodeResult, error) {
tempFile, err := os.CreateTemp(opts.SelfTestDecodeOptions.OutRoot, "encode-buffer-")
if err != nil {
return EncodeResult{}, errors.WithStack(err)
}
defer func() {
_ = tempFile.Close()
_ = os.Remove(tempFile.Name())
}()
hashWriter := newSHA256Writer(tempFile)
bytesWritten, err := b.WriteTo(hashWriter)
if err != nil {
return EncodeResult{}, errors.WithStack(err)
}
if opts.corruptBufferForTest != nil {
if _, err := tempFile.Seek(0, io.SeekStart); err != nil {
return EncodeResult{}, errors.WithStack(err)
}
bufBytes, err := io.ReadAll(tempFile)
if err != nil {
return EncodeResult{}, errors.WithStack(err)
}
opts.corruptBufferForTest(bufBytes)
if _, err := tempFile.WriteAt(bufBytes, 0); err != nil {
return EncodeResult{}, errors.WithStack(err)
}
}
header, mismatchTxt, matched, stErr := runSelfTest(tempFile, opts)
result := EncodeResult{
Header: header,
BytesWritten: bytesWritten,
SHA256: hashWriter.Sum(),
SelfTestRan: true,
SelfTestMatched: matched,
SelfTestMismatchTxt: mismatchTxt,
AdaptersEnabled: enabled,
}
if stErr != nil {
return result, stErr
}
if !matched {
return result, nil
}
if _, err := tempFile.Seek(0, io.SeekStart); err != nil {
return result, errors.Wrap(err, "seek temp file for copy")
}
if _, err := io.Copy(out, tempFile); err != nil {
return result, errors.Wrap(err, "copy buffered fsm to out")
}
return result, nil
}| func runSelfTest(fsmBytes []byte, opts EncodeOptions) (SnapshotHeader, []byte, bool, error) { | ||
| scratchBase := opts.SelfTestDecodeOptions.OutRoot | ||
| scratchDir, err := os.MkdirTemp(scratchBase, "encode-self-test-") | ||
| if err != nil { | ||
| return SnapshotHeader{}, nil, false, errors.Wrap(err, "mkdir scratch") | ||
| } | ||
| defer func() { | ||
| _ = os.RemoveAll(scratchDir) | ||
| }() | ||
|
|
||
| decOpts := opts.SelfTestDecodeOptions | ||
| decOpts.OutRoot = scratchDir | ||
|
|
||
| result, derr := DecodeSnapshot(bytes.NewReader(fsmBytes), decOpts) | ||
| if derr != nil { | ||
| // Decoder errored on our own output — that IS a self-test | ||
| // failure (the .fsm we produced isn't loadable). Surface as | ||
| // a mismatch with the decoder error embedded in the txt. | ||
| mismatchTxt := []byte("self-test failed: DecodeSnapshot rejected the produced .fsm: " + derr.Error()) | ||
| return SnapshotHeader{}, mismatchTxt, false, nil | ||
| } | ||
|
|
||
| if result.Header.LastCommitTS != opts.LastCommitTS { | ||
| mismatchTxt := []byte(formatHeaderMismatch(opts.LastCommitTS, result.Header.LastCommitTS)) | ||
| return result.Header, mismatchTxt, false, nil | ||
| } | ||
|
|
||
| diff, derr := diffAdapterTrees(opts.InputRoot, scratchDir, opts.Adapters) | ||
| if derr != nil { | ||
| return result.Header, nil, false, errors.Wrap(derr, "diff scratch tree") | ||
| } | ||
| if len(diff) > 0 { | ||
| return result.Header, []byte(strings.Join(diff, "\n") + "\n"), false, nil | ||
| } | ||
| return result.Header, nil, true, nil | ||
| } |
There was a problem hiding this comment.
Update runSelfTest to accept an *os.File instead of a byte slice to support streaming the .fsm from disk during the self-test, avoiding loading the entire snapshot into memory.
func runSelfTest(fsmFile *os.File, opts EncodeOptions) (SnapshotHeader, []byte, bool, error) {
scratchBase := opts.SelfTestDecodeOptions.OutRoot
scratchDir, err := os.MkdirTemp(scratchBase, "encode-self-test-")
if err != nil {
return SnapshotHeader{}, nil, false, errors.Wrap(err, "mkdir scratch")
}
defer func() {
_ = os.RemoveAll(scratchDir)
}()
decOpts := opts.SelfTestDecodeOptions
decOpts.OutRoot = scratchDir
if _, err := fsmFile.Seek(0, io.SeekStart); err != nil {
return SnapshotHeader{}, nil, false, errors.Wrap(err, "seek temp file")
}
result, derr := DecodeSnapshot(fsmFile, decOpts)
if derr != nil {
// Decoder errored on our own output — that IS a self-test
// failure (the .fsm we produced isn't loadable). Surface as
// a mismatch with the decoder error embedded in the txt.
mismatchTxt := []byte("self-test failed: DecodeSnapshot rejected the produced .fsm: " + derr.Error())
return SnapshotHeader{}, mismatchTxt, false, nil
}
if result.Header.LastCommitTS != opts.LastCommitTS {
mismatchTxt := []byte(formatHeaderMismatch(opts.LastCommitTS, result.Header.LastCommitTS))
return result.Header, mismatchTxt, false, nil
}
diff, derr := diffAdapterTrees(opts.InputRoot, scratchDir, opts.Adapters)
if derr != nil {
return result.Header, nil, false, errors.Wrap(derr, "diff scratch tree")
}
if len(diff) > 0 {
return result.Header, []byte(strings.Join(diff, "\n") + "\n"), false, nil
}
return result.Header, nil, true, nil
}| func diffOneSubdir(aDir, bDir, relPrefix string) ([]string, error) { | ||
| aFiles, aErr := walkRegularFiles(aDir) | ||
| if aErr != nil && !errors.Is(aErr, os.ErrNotExist) { | ||
| return nil, errors.Wrapf(aErr, "walk input %s", aDir) | ||
| } | ||
| bFiles, bErr := walkRegularFiles(bDir) | ||
| if bErr != nil && !errors.Is(bErr, os.ErrNotExist) { | ||
| return nil, errors.Wrapf(bErr, "walk scratch %s", bDir) | ||
| } | ||
|
|
||
| var diffs []string | ||
| aMap := map[string][]byte{} | ||
| for path, body := range aFiles { | ||
| aMap[path] = body | ||
| } | ||
| for path, bBody := range bFiles { | ||
| aBody, ok := aMap[path] | ||
| if !ok { | ||
| diffs = append(diffs, relPrefix+"/"+path+" (missing in input)") | ||
| continue | ||
| } | ||
| if !bytes.Equal(aBody, bBody) { | ||
| diffs = append(diffs, relPrefix+"/"+path+" (bytes differ)") | ||
| } | ||
| delete(aMap, path) | ||
| } | ||
| // Anything remaining in aMap is present in input but not in scratch. | ||
| remaining := make([]string, 0, len(aMap)) | ||
| for path := range aMap { | ||
| remaining = append(remaining, relPrefix+"/"+path+" (missing in scratch)") | ||
| } | ||
| sort.Strings(remaining) | ||
| diffs = append(diffs, remaining...) | ||
| return diffs, nil | ||
| } | ||
|
|
||
| // walkRegularFiles returns a map of relative path -> file bytes for every | ||
| // regular file under root. Missing root is the empty map + os.ErrNotExist. | ||
| // Bounded by the per-adapter test fixtures the encoder runs against; | ||
| // production-scale dumps may want streaming compare, deferred until a | ||
| // real bottleneck shows up. | ||
| func walkRegularFiles(root string) (map[string][]byte, error) { | ||
| out := map[string][]byte{} | ||
| rootInfo, err := os.Stat(root) | ||
| if err != nil { | ||
| return nil, errors.WithStack(err) | ||
| } | ||
| if !rootInfo.IsDir() { | ||
| return nil, errors.Errorf("not a directory: %s", root) | ||
| } | ||
| if err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error { | ||
| if err != nil { | ||
| return err | ||
| } | ||
| if d.IsDir() { | ||
| return nil | ||
| } | ||
| if !d.Type().IsRegular() { | ||
| return nil | ||
| } | ||
| body, rerr := os.ReadFile(path) //nolint:gosec // walking a caller-provided root, regular files only | ||
| if rerr != nil { | ||
| return errors.Wrap(rerr, path) | ||
| } | ||
| rel, rerr := filepath.Rel(root, path) | ||
| if rerr != nil { | ||
| return errors.WithStack(rerr) | ||
| } | ||
| out[filepath.ToSlash(rel)] = body | ||
| return nil | ||
| }); err != nil { | ||
| return nil, errors.WithStack(err) | ||
| } | ||
| return out, nil | ||
| } |
There was a problem hiding this comment.
Loading the entire contents of all files in both the input and scratch directories into memory (walkRegularFiles reading all file bytes via os.ReadFile) will cause OOM crashes on large backups (e.g., with large S3 blobs). Instead, walk the directories to collect file sizes, and perform a streaming chunk-by-chunk comparison for files present in both directories.
func diffOneSubdir(aDir, bDir, relPrefix string) ([]string, error) {
aFiles, aErr := walkRegularFiles(aDir)
if aErr != nil && !errors.Is(aErr, os.ErrNotExist) {
return nil, errors.Wrapf(aErr, "walk input %s", aDir)
}
bFiles, bErr := walkRegularFiles(bDir)
if bErr != nil && !errors.Is(bErr, os.ErrNotExist) {
return nil, errors.Wrapf(bErr, "walk scratch %s", bDir)
}
var diffs []string
for path, aSize := range aFiles {
bSize, ok := bFiles[path]
if !ok {
diffs = append(diffs, relPrefix+"/"+path+" (missing in scratch)")
continue
}
if aSize != bSize {
diffs = append(diffs, relPrefix+"/"+path+" (bytes differ)")
continue
}
equal, err := compareFiles(filepath.Join(aDir, path), filepath.Join(bDir, path))
if err != nil {
return nil, err
}
if !equal {
diffs = append(diffs, relPrefix+"/"+path+" (bytes differ)")
}
}
for path := range bFiles {
if _, ok := aFiles[path]; !ok {
diffs = append(diffs, relPrefix+"/"+path+" (missing in input)")
}
}
sort.Strings(diffs)
return diffs, nil
}
func walkRegularFiles(root string) (map[string]int64, error) {
out := map[string]int64{}
rootInfo, err := os.Stat(root)
if err != nil {
return nil, errors.WithStack(err)
}
if !rootInfo.IsDir() {
return nil, errors.Errorf("not a directory: %s", root)
}
if err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error {
if err != nil {
return err
}
if d.IsDir() {
return nil
}
if !d.Type().IsRegular() {
return nil
}
info, rerr := d.Info()
if rerr != nil {
return errors.Wrap(rerr, path)
}
rel, rerr := filepath.Rel(root, path)
if rerr != nil {
return errors.WithStack(rerr)
}
out[filepath.ToSlash(rel)] = info.Size()
return nil
}); err != nil {
return nil, errors.WithStack(err)
}
return out, nil
}
func compareFiles(pathA, pathB string) (bool, error) {
fA, err := os.Open(pathA)
if err != nil {
return false, errors.WithStack(err)
}
defer fA.Close()
fB, err := os.Open(pathB)
if err != nil {
return false, errors.WithStack(err)
}
defer fB.Close()
bufA := make([]byte, 64*1024)
bufB := make([]byte, 64*1024)
for {
nA, errA := io.ReadFull(fA, bufA)
nB, errB := io.ReadFull(fB, bufB)
if errA == io.EOF || errA == io.ErrUnexpectedEOF {
errA = nil
}
if errB == io.EOF || errB == io.ErrUnexpectedEOF {
errB = nil
}
if errA != nil {
return false, errors.WithStack(errA)
}
if errB != nil {
return false, errors.WithStack(errB)
}
if nA != nB {
return false, nil
}
if nA == 0 {
break
}
if !bytes.Equal(bufA[:nA], bufB[:nB]) {
return false, nil
}
}
return true, nil
}| // tempSuffixHexLen is the hex-character length of the random | ||
| // suffix appended to <output>.tmp-<hex>; 8 hex chars = 4 bytes of | ||
| // entropy, which gives ~4×10^9 collision space per --output path | ||
| // (more than enough for concurrent encodes against the same path). | ||
| tempSuffixHexLen = 8 | ||
| tempSuffixByteLen = tempSuffixHexLen / 2 |
There was a problem hiding this comment.
Using only 4 bytes of entropy (8 hex characters) for the temporary file suffix has a relatively high probability of collision (due to the Birthday Paradox) in highly concurrent environments or automated test pipelines. Increasing the suffix length to 16 hex characters (8 bytes of entropy) provides a much safer collision-free space.
| // tempSuffixHexLen is the hex-character length of the random | |
| // suffix appended to <output>.tmp-<hex>; 8 hex chars = 4 bytes of | |
| // entropy, which gives ~4×10^9 collision space per --output path | |
| // (more than enough for concurrent encodes against the same path). | |
| tempSuffixHexLen = 8 | |
| tempSuffixByteLen = tempSuffixHexLen / 2 | |
| // tempSuffixHexLen is the hex-character length of the random | |
| // suffix appended to <output>.tmp-<hex>; 16 hex chars = 8 bytes of | |
| // entropy, which gives a collision-free space per --output path. | |
| tempSuffixHexLen = 16 | |
| tempSuffixByteLen = tempSuffixHexLen / 2 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1fa691fb74
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return backup.AdapterSet{}, errors.Errorf("unknown adapter %q", name) | ||
| } | ||
| } | ||
| return set, nil |
There was a problem hiding this comment.
Reject empty adapter selections
When --adapter contains only separators or whitespace (for example --adapter ' ,', or a templated value that expands to spaces), every token hits the empty case and this returns a zero AdapterSet with no error. The rest of the encoder then invokes no adapter encoders, publishes a valid header-only .fsm, and exits successfully, which can turn a bad CLI argument into an apparently successful but empty restore artifact; the decoder-side parser already rejects this zero-adapter case.
Useful? React with 👍 / 👎.
| if _, err := fmt.Sscanf(s, "%d", &ts); err != nil { | ||
| return 0, errors.Wrap(err, "--last-commit-ts decimal parse") |
There was a problem hiding this comment.
Reject trailing junk in timestamps
For decimal --last-commit-ts values, fmt.Sscanf with %d accepts a numeric prefix and ignores trailing bytes, so an input like --last-commit-ts 100oops is silently treated as 100. Because this flag controls the restored HLC ceiling and sidecar metadata, malformed operator input should fail instead of publishing a snapshot with a different timestamp than the argument appears to specify.
Useful? React with 👍 / 👎.
Six substantive review findings on the M6 implementation PR. Headline fix: switched the self-test buffer from in-memory bytes.Buffer to a disk-backed temp file so large snapshots cannot OOM the encoder. ## Gemini high #1: SelfTest=true was buffering FSM in memory Out Of Memory risk on production-scale snapshots (gemini high #904). Fixed by writing the FSM through a tee (sha256 + os.File.WriteAt) into an os.CreateTemp file under opts.SelfTestDecodeOptions.OutRoot, then Seek(0) and stream through DecodeSnapshot for the self-test. The temp file is os.Remove'd via defer on every exit path. The contract is unchanged: corruption reaches the self-test decode but never reaches the io.Writer (TestEncodeSnapshotSelfTestDetectsCorruption still asserts out.Len()==0). Memory cost: O(1) for the SelfTest=true path now (previously was an FSM-sized bytes.Buffer on top of the sort working set). ## Gemini high #2: runSelfTest signature Changed from runSelfTest(fsmBytes []byte, ...) to runSelfTest(fsmFile io.Reader, ...) so the .fsm streams through DecodeSnapshot instead of being held as []byte. Same OOM fix at the runSelfTest layer. ## Gemini high #3: walkRegularFiles loaded every file's bytes into memory For large S3 blobs this OOMs the encoder. Replaced walkRegularFiles with walkRegularFilePaths (paths only) and streamFilesEqual (chunk compare at streamCmpBufSize = 64 KiB). Per-file compare cost is now O(streamCmpBufSize), not O(file size). ## Gemini medium: tempSuffixHexLen entropy was 4 bytes Birthday-paradox collision risk in highly concurrent CI environments. Bumped from 8 hex / 4 bytes (2^32 space) to 16 hex / 8 bytes (2^64 space). Same width crypto/rand uses for cryptographic nonces. ## Claude high + codex P2: fmt.Sscanf silently accepts partial parses --last-commit-ts 0xffZZ parsed as 0xff (fmt.Sscanf stops at first non-hex). --last-commit-ts 100oops parsed as 100. This means a typo could silently set the snapshot HLC ceiling to something the operator did not type. Replaced fmt.Sscanf with strconv.ParseUint (base 10 / 16) which rejects trailing garbage as a parse error. New TestParseLastCommitTS cases cover: 0xffZZ, 100oops, " 100 ext", "-1". ## Codex P2: empty adapter selection produced an empty AdapterSet --adapter ' ,' (e.g. a templated argument that expanded to spaces) would yield a zero AdapterSet and the encoder would publish a valid header-only .fsm with no adapter records — a silently-empty restore artifact. Now rejected with a clear error. Pinned by TestParseAdapterSetRejectsEmptySelection covering " ,", ",,,", " ", and "," — plus a positive case (single "s3") asserting the guard does not break legitimate single-adapter selections. ## Caller audit per CLAUDE.md semantic-change rule - runSelfTest signature changed from ([]byte, opts) to (io.Reader, opts). Sole caller is encodeBuffered (same package, just updated). No external callers since the function is unexported. - corruptBufferForTest hook type changed from func([]byte) to func(*os.File). Sole caller is the same-package test TestEncodeSnapshotSelfTestDetectsCorruption (updated to ReadAt/ WriteAt the temp file). External callers cannot set this field (lowercase). - streamFilesEqual replaces the bytes.Equal compare in diffOneSubdir. Same return contract (bool, error) so the caller (diffOneSubdir) is unchanged in behavior. - parseAdapterSet now returns an error for empty-selection inputs. Caller is parseFlags which already propagates the error to exit-1. Tests + lint green.
|
v2 (b131186): addresses all 6 review findings on the implementation PR. Gemini high #1 (OOM on large snapshots): Self-test buffer is now an on-disk Gemini high #2 (runSelfTest signature): Changed from Gemini high #3 (walkRegularFiles OOM): Replaced bytes-into-memory with Gemini medium (tempSuffixHexLen entropy): Bumped from 8 hex / 4 bytes (2^32) to 16 hex / 8 bytes (2^64), same width crypto/rand uses for cryptographic nonces. Claude high + codex P2 (fmt.Sscanf partial parses): Replaced with Codex P2 (empty adapter selection): @claude review |
|
Claude finished @bootjp's task in 3m 41s —— View job Code Review - PR #904 (v2)Tasks
v2 Verification — All 6 Required Fixes Confirmed
One New Issue Introduced by v2Stale doc comments still say
// SelfTest enables the round-trip self-test. EncodeSnapshot buffers
// the FSM in *bytes.Buffer, decodes from the buffer, and copies to
// the caller's io.Writer only if the buffer survives DecodeSnapshot
// When opts.SelfTest=true an internal *bytes.Buffer holds the
// FSM during the self-test; bytes are copied to out only after the
// self-test matches. Memory cost in self-test mode is one FSM-sized
// allocation on top of the existing sort working set.Both still describe the v1 in-memory approach. The actual v2 implementation writes to an Carry-over Observations (non-blocking, noted in v1)These were marked "no fix required" in the v1 review and remain unchanged:
Strengths in v2The disk-backed self-test path is cleanly implemented:
Verdict: Ready to merge once the two stale |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b13118638f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| b := newSnapshotBuilder(opts.LastCommitTS) | ||
| enabled, err := runAdapterEncoders(b, opts) |
There was a problem hiding this comment.
Enforce the manifest timestamp floor in EncodeSnapshot
When EncodeSnapshot is called directly (the new library entrypoint the design exposes for future callers), a caller can pass LastCommitTS lower than InputRoot/MANIFEST.json:last_commit_ts; this path immediately builds with that value and never reads the manifest or returns the advertised ErrSelfTestLowerLastCommitTS. The CLI pre-check catches its own invocations, but in-process callers can still emit an .fsm whose header and MVCC key timestamps lower the restored HLC ceiling, violating the fail-closed invariant this wrapper is meant to enforce.
Useful? React with 👍 / 👎.
…over cleanups)
Mandatory (claude v2):
- Stale *bytes.Buffer comments in SelfTest field doc and EncodeSnapshot
godoc updated to describe the v2 on-disk temp file model with O(1)
memory cost (encode-self-test-fsm-* under SelfTestDecodeOptions.OutRoot).
Non-blocking carry-over items from v1 + v2 (folded in since trivial):
- EncodeResult.SHA256 comment no longer references a non-existent
SHA256Hex function; says "raw 32-byte digest; CLI hex-encodes via
encoding/hex in writeSidecar".
- EncodeSnapshot godoc now spells out the dual-check contract
("Callers MUST check result.SelfTestMatched before treating a nil
error as success") since self-test mismatch returns (result, nil).
- diffOneSubdir output is now fully sorted (was: only the "remaining"
segment sorted; map-iteration order leaked into the rest). mismatch.txt
is deterministic across runs with identical inputs.
- Replaced hand-rolled uitoa/itoa with strconv.FormatUint and
strconv.Itoa. Dropped the sha256w / hashSHA256 alias chain; the
sha256Writer.h field now uses hash.Hash directly (already imported).
|
v3 (b8da884): addresses claude v2 review. Mandatory (the v2 blocker): Two stale Non-blocking carry-overs folded in since trivial:
@claude review |
|
Claude finished @bootjp's task in 3m 33s —— View job Code Review - PR #904 (v3)Tasks
v3 Verification — All v2 Carry-overs Confirmed
One New Doc Bug in v3
// ErrSelfTestLowerLastCommitTS is returned by EncodeSnapshot when the
// effective LastCommitTS in EncodeOptions is below the manifest's value.
// ...
// Surfaced at the EncodeSnapshot layer so the CLI's main exits with code 2
// Returns ErrSelfTestLowerLastCommitTS when opts.LastCommitTS is below
// the manifest's value — caller is responsible for reading the manifest
// and computing the effective T (this layer just validates the floor).Both are inaccurate. // cmd/elastickv-snapshot-encode/main.go:349-354
func resolveLastCommitTS(cfg *config, manifestTS uint64) (uint64, bool, error) {
if cfg.lastCommitTS < manifestTS {
return 0, false, errors.Wrapf(backup.ErrSelfTestLowerLastCommitTS, ...)
}
...
}The functional behaviour is correct — the HLC floor IS enforced before any bytes are written. But a future library caller reading the The fix is straightforward: correct both comments to say the error is a sentinel for the caller to return after its own manifest comparison, not something Observations (non-blocking)
Both are Strengths in v3
Verdict: One doc fix needed before merge. All functional behaviour is correct; all v2 findings are addressed. The doc bug ( |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8da884f9d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // encodeToTempFile creates tempPath, runs EncodeSnapshot into it, | ||
| // fsync+close. Caller is responsible for the os.Remove cleanup on error. | ||
| func encodeToTempFile(tempPath string, encodeOpts backup.EncodeOptions) (backup.EncodeResult, error) { | ||
| tempFile, err := os.Create(tempPath) //nolint:gosec // operator-supplied path |
There was a problem hiding this comment.
Create snapshot artifacts owner-only
When the encoder runs with a typical umask like 022, os.Create creates the temporary FSM as mode 0666 masked to 0644, and that same file becomes the published <output> after the rename. Because the .fsm contains the full KV dataset, this exposes backup contents to other local users on multi-user backup hosts while the encode is running and after publish; the decoded adapter writers in internal/backup/open_nofollow_* use 0600 for this reason. Please create the temp output with an explicit owner-only mode before renaming it.
Useful? React with 👍 / 👎.
… unify writer types
Mandatory (claude v3):
ErrSelfTestLowerLastCommitTS var comment and EncodeSnapshot godoc both
claimed the HLC ceiling check is enforced in the library. It is NOT.
The check lives in the CLI's resolveLastCommitTS BEFORE EncodeSnapshot
is called. A future library caller (Phase 1 live extractor, integration
test) reading the godoc would assume EncodeSnapshot guards against a
low timestamp and be surprised when it silently stamps the low value
into the .fsm.
Both comments updated to say:
- The error is the sentinel callers should return after their OWN
manifest comparison.
- EncodeSnapshot does NOT read MANIFEST.json or validate
opts.LastCommitTS against any external floor.
Non-blocking observation (claude v3):
sha256Writer and teeWriter were structurally identical (both
{w io.Writer; h hash.Hash} with the same Write body). teeWriter
removed; encodeBuffered now uses newSHA256Writer for the temp-file
tee. One less type to audit.
|
v4 (a357279): addresses claude v3. Mandatory: ErrSelfTestLowerLastCommitTS var comment + EncodeSnapshot godoc claimed the HLC ceiling check is enforced in the library — it isn't (the check lives in the CLI's Non-blocking observation: Unified the duplicated @claude review |
|
Claude finished @bootjp's task in 2m 35s —— View job Code Review - PR #904 (v4)Tasks
v4 Verification — All v3 Items Confirmed
One Remaining Issue —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a3572798b7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| b := newSnapshotBuilder(opts.LastCommitTS) | ||
| enabled, err := runAdapterEncoders(b, opts) |
There was a problem hiding this comment.
Reject zero-adapter library encodes
When an in-process caller passes the zero AdapterSet (for example, a future Phase 1 caller forgets to thread adapter selection), runAdapterEncoders returns an empty list with no error and the subsequent write emits a valid header-only .fsm, silently discarding any adapter subtrees under InputRoot. The CLI parser now rejects empty selections, but this public library entrypoint still has the same fail-open behavior for direct callers, so it should fail before writing when no adapters are enabled.
Useful? React with 👍 / 👎.
Claude v4 mandatory finding: encodeInfoFilePerm = 0o600 was declared but dead code. Both file-creation sites (encodeToTempFile and writeSidecar) used os.Create, which honors umask and typically gives 0644. On a multi-user backup host, the full KV dataset (.fsm) and the sidecar (which carries source path, cluster_id, and SHA-256) would have been world-readable. Replaced both os.Create calls with os.OpenFile + explicit encodeInfoFilePerm: - encodeToTempFile: temp .fsm is now 0o600 from creation, so the on-disk dataset is never world-readable during the encode-then- rename window. - writeSidecar: ENCODE_INFO.json gets the same 0o600. New TestCLIPublishesFsmAndSidecarMode0600 asserts both files have no group/other access bits (`perm & 0o077 == 0`) after a successful run. Skips on Windows where Unix-style mode bits aren't meaningful. Codex v3 had also flagged the temp FSM side at P2 in passing; that's now addressed alongside the sidecar. Caller audit: both call sites updated in this same file; no other os.Create calls remain in the encoder CLI. The `0o600` constant is now live code.
|
v5 (0c223d2): addresses claude v4 mandatory. Mandatory (claude v4): Both replaced with Caller audit: both sites updated; no other @claude review |
|
Claude finished @bootjp's task in 2m 29s —— View job Code Review - PR #904 (v5)Tasks
v5 Verification — Mandatory v4 Fix Confirmed
Permission chain is correct end-to-end: the temp file is created The library-level self-test buffer ( No remaining No New Issues Introduced in v5The change is minimal and scoped to exactly what v4 flagged: two Carry-over Observation (pre-existing, non-blocking)
Strengths in v5
Verdict: Ready to merge. The mandatory v4 fix is correctly implemented and pinned by the new test. No new issues in v5. |
… v5 + codex v5)
Carry-over flagged in both v5 reviews: EncodeSnapshot at the library
layer accepted opts.Adapters == AdapterSet{} and produced a valid
header-only .fsm with no adapter records. The CLI's parseAdapterSet
already rejects this for flag-driven entry, but a future in-process
caller (Phase 1 live extractor, integration test) that forgets to
thread the adapter set would get a silently empty restore artifact.
Fail-closed guard added at the top of EncodeSnapshot alongside the
existing InputRoot / out-nil checks. Pinned by
TestEncodeSnapshotRejectsZeroAdapterSet which also asserts no bytes
are written to out on rejection.
Claude v5 explicitly said the existing fix was "Ready to merge" and
this was non-blocking; folding in for completeness since the guard
is a 4-line addition and the test is independent.
Caller audit per CLAUDE.md semantic-change rule: EncodeSnapshot's
new pre-condition narrows the accepted input domain. All existing
callers either (a) pass through the CLI's parseAdapterSet (which
already rejects empty selections) or (b) are same-package tests
that pass an explicit AdapterSet{...} literal. No legitimate caller
is impacted.
|
v6 (b56c0e6): folded in the carry-over non-blocking guard. Claude v5 already declared "Ready to merge" but flagged the library-level zero-AdapterSet gap (also raised by codex v5). Added the guard at the top of Caller audit: the new pre-condition narrows accepted input. All existing callers pass through the CLI's @claude review |
|
Claude finished @bootjp's task in 1m 30s —— View job Code Review - PR #904 (v6)Tasks
v6 Verification — Zero-AdapterSet Guard ConfirmedThe change is exactly scoped to what was described: 10 lines to
No New Issues Introduced in v6The guard correctly fires before any One observation carries forward as genuinely non-blocking: if Verdict: Ready to merge. v6 closes the only remaining carry-over from v5 cleanly, with no regressions and no new issues. | Branch | |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b56c0e65ae
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| result, err := writeAndPublish(cfg, encodeOpts, mismatchPath, logger) | ||
| if err != nil { | ||
| return err |
There was a problem hiding this comment.
Write encode info before returning self-test mismatches
When --self-test produces a real mismatch, writeAndPublish returns errSelfTestMismatch here; this early return skips the writeSidecar call below, so <output>.encode_info.json is never written with self_test.matched=false. The design says mismatch exits should leave both mismatch.txt and provenance, and without the sidecar operators lose the generated SHA/T/adapters needed to diagnose a failed self-test.
Useful? React with 👍 / 👎.
User flagged that I had missed codex reviews on v2 and v6. Re-fetched all bot reviews; two P2 items remained unaddressed. ## Codex P2 v2: enforce manifest TS floor in EncodeSnapshot The CLI's resolveLastCommitTS already rejects --last-commit-ts T < manifest.last_commit_ts, but the library-level EncodeSnapshot accepted ANY opts.LastCommitTS, including one below the manifest's floor. A future in-process caller (Phase 1 live extractor, integration tests) that bypasses the CLI could silently lower the restored HLC ceiling, letting a post-restart leader re-issue a read ts <= a restored row's commit ts. Earlier v4 fix was doc-only (godoc said "EncodeSnapshot does NOT read the manifest"). That's accurate but pushes the responsibility to callers. Codex's suggestion to enforce in the library is the fail-closed answer. Fix: EncodeOptions gains ManifestLastCommitTS uint64. EncodeSnapshot fails-closed with ErrSelfTestLowerLastCommitTS when LastCommitTS < ManifestLastCommitTS (both > 0 — synthetic test fixtures opt out by leaving ManifestLastCommitTS at 0). CLI's buildEncodeOptions now threads manifest.LastCommitTS into the field. Pinned by TestEncodeSnapshotRejectsLowManifestFloor (rejection + no bytes written) and TestEncodeSnapshotManifestFloorOptOut (opt-out path still works). ## Codex P2 v6: write encode_info.json before returning self-test mismatches The old writeAndPublish returned errSelfTestMismatch on mismatch, which skipped the writeSidecar call. The design says mismatch should leave BOTH <output>.mismatch.txt AND <output>.encode_info.json (with self_test.matched=false) for diagnostics. Operators need the sidecar's SHA256, effective T, adapters_enabled list to triage a failed self-test. Fix: encodeOne now writes the sidecar whenever the encode itself ran (publishErr == nil || errors.Is(publishErr, errSelfTestMismatch)), and only after that returns the original error. Stale sidecar cleanup at the start of every run (was: only mismatch.txt was cleaned). ## Caller audit per CLAUDE.md semantic-change rule - EncodeSnapshot gained a pre-condition (ManifestLastCommitTS floor). All in-tree callers either thread manifest.LastCommitTS through the new field (CLI) or use ManifestLastCommitTS=0 (existing tests). No legitimate caller is impacted. - encodeOne's control flow changed: sidecar is now written EVEN on mismatch. The mismatch error still propagates; downstream callers (run() in main, tests) see the same error contract. Sole consumer is run() which maps errSelfTestMismatch to exitDataErr. Tests + lint green.
|
v7 (9f69295): addresses two missed codex P2 reviews. ユーザー指摘の通り、v2 と v6 の codex review を見落としていました。 Codex P2 v2 (manifest TS floor in EncodeSnapshot): Library-level enforcement was missing. v4 only fixed the godoc to say "EncodeSnapshot does NOT enforce" — accurate but pushed responsibility to callers. Now Codex P2 v6 (sidecar on self-test mismatch): Caller audit: @claude review |
|
Claude finished @bootjp's task in 3m 59s —— View job Code Review - PR #904 (v7)Tasks
v7 Verification — Both Codex P2 Fixes Confirmed
The dual-enforcement analysis: One New Issue in v7
// EncodeSnapshot itself does NOT read MANIFEST.json or enforce this
// floor — the comparison happens in the CLI's resolveLastCommitTS
// BEFORE EncodeSnapshot is called, and a future library caller (Phase 1
// live extractor, integration tests) must perform its own comparison
// and return this sentinel on regression so callers can errors.Is on
// it to map to the right exit code (claude v3 doc bug #904).This was accurate after v4's doc fix, but v7 changed the contract:
The fix is to update lines 23-28 to say the error is now returned by both the CLI ( // EncodeSnapshot itself does NOT read MANIFEST.json, but when the caller
// sets opts.ManifestLastCommitTS > 0, validateEncodeOptions enforces this
// floor and returns ErrSelfTestLowerLastCommitTS before any bytes are
// written. The CLI's resolveLastCommitTS provides the same check for
// flag-driven entry; library callers SHOULD set ManifestLastCommitTS to
// the manifest value for defense-in-depth (codex P2 v2 + claude v3 #904).Observation (non-blocking)
The test exercises the manifest-floor rejection path ( The codex P2 v6 fix (the new Strengths in v7
Verdict: One doc fix before merge. All functional behavior is correct; both v6 Codex P2 items are correctly implemented. The | Branch | |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9f692950fe
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| {"dynamodb", func(s AdapterSet) bool { return s.DynamoDB }, func(b *snapshotBuilder, root string) error { | ||
| return errors.Wrap(NewDynamoDBEncoder(root).Encode(b), "dynamodb encoder") |
There was a problem hiding this comment.
Reject DynamoDB JSONL inputs before encoding
When an input manifest has dynamodb_layout: "jsonl" and --self-test is not enabled (the default), this path still runs the per-item-only DynamoDB reverse encoder. That encoder only walks items/*.json / items/*/*.json, so a decoded tree with items/data-*.jsonl is accepted but all DynamoDB items are skipped while the CLI exits successfully and publishes an .fsm containing only table metadata. Please fail closed on JSONL manifests until the reverse encoder supports that layout, or pass a JSONL-aware mode through this library path.
Useful? React with 👍 / 👎.
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a2ed548f4b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // attacker pre-placing a symlink at <output>.encode_info.json | ||
| // could redirect the truncate-and-write into a target of their | ||
| // choosing (codex P2 v25 #904). | ||
| f, err := backup.OpenSidecarFile(sidecarPath) |
There was a problem hiding this comment.
Preserve 0600 mode when replacing sidecars
When <output>.encode_info.json already exists with a permissive mode (for example a 0644 sidecar created by an older encoder under umask 022), this open/truncate path does not change its permissions; the 0600 mode in openSidecarFile only applies when the file is newly created. A successful re-encode therefore leaves the freshly written sidecar group/other-readable despite the owner-only contract and the fresh-file test, so this path should chmod after opening or publish via a new 0600 temp file and rename.
Useful? React with 👍 / 👎.
| {"s3", func(s AdapterSet) bool { return s.S3 }, func(b *snapshotBuilder, root string) error { | ||
| return errors.Wrap(NewS3RecordEncoder(root).Encode(b), "s3 encoder") | ||
| }}, |
There was a problem hiding this comment.
Reject non-directory S3 bucket entries
When an enabled S3 dump is malformed or partially truncated so s3/<bucket> is a regular file or symlink, this dispatch still succeeds because S3RecordEncoder.Encode skips top-level entries where !ent.IsDir() (internal/backup/encode_s3.go:85-87). Since the manifest currently has only an empty S3 scope, the CLI cannot otherwise detect that bucket was expected, and it can publish a partial .fsm with the bucket silently omitted; the library path should fail closed on non-directory entries under selected adapter roots before encoding.
Useful? React with 👍 / 👎.
|
@claude review Re-triggering v31 — the original job at https://github.com/bootjp/elastickv/actions/runs/26824025875 errored on the bot side. v31 is doc-only (commit a2ed548); addresses the three doc gaps claude v30 flagged:
No behavior change. CI all green (build/test/golangci-lint/CodeQL/CodeRabbit SUCCESS). |
|
Claude finished @bootjp's task in 2m 5s —— View job Code Review - PR #904 (v31)Tasks
v31 Verification — All Three v30 Doc Fixes Confirmed
All three doc comment blocks are now internally consistent with v30's nil-only Two Codex P2 v31 Observations (pre-existing, non-blocking)These were surfaced by Codex reviewing the v31 codebase — not introduced by v31, but newly raised:
Verdict: Ready to merge. v31 is a clean doc-only fix that correctly aligns all three comment blocks with v30's nil-only scope check. No new issues introduced. The two Codex P2 observations above are pre-existing and do not block this PR. | Branch | | |
…x/claude P2 v31)
Two non-blocking P2 observations claude flagged on its v31 review,
both pre-existing but real correctness/security issues. Folding both
in.
## Codex/claude P2 v31 (1/2): writeSidecar failure leaves orphan .fsm
The encodeOne success path runs writeAndPublish (renames temp .fsm
→ output) BEFORE writeSidecar. If writeSidecar fails after the
rename, the CLI returns non-zero but a restore-visible <output>.fsm
exists without its provenance metadata — exactly the orphan state
v17 closed for the adapter-error path, but un-closed for the
sidecar-write path.
Fix: new rollbackOrphanFSMAndSidecar(outputPath, logger) helper. On
sidecar-write failure in the success path (publishErr == nil), the
caller calls this helper before returning:
- os.Remove(outputPath) — drops the just-published .fsm so the
restore path sees nothing where the orphan was.
- os.Remove(sidecarPath) — drops any partially-written sidecar
bytes (OpenSidecarFile may have already Truncate(0)'d before
WriteEncodeInfo failed).
Both removes log-and-continue on non-ErrNotExist failures so the
caller's primary sidecar-write error remains the dominant signal.
A prior successful encode at the same output path is unrecoverable —
writeAndPublish's os.Rename already overwrote it before writeSidecar
ran. The rollback brings the state to "no .fsm, no sidecar," which
is the same end state as "encode never ran" — the cleanest
consistent outcome without filesystem transactions.
Self-test-mismatch path is unchanged: the existing v15 removeStaleOutputFSM
already handles that case, and writeSidecar's failure on that path
already logger.Warn's-and-continues.
Pinned end-to-end by extending TestCLISidecarWriteRefusesSymlinkTarget:
- Existing assertions: victim file at the symlink target survives
byte-equal; exit 1; symlink itself unlinked.
- New v32 assertion: <output>.fsm is also removed (the rollback
fired after the .fsm was published but before encodeOne returned).
## Codex/claude P2 v31 (2/2): existing sidecar perm not tightened on re-encode
openSidecarFile (unix) opens with O_CREATE + 0o600. The kernel
applies the mode arg ONLY when creating a new file; if the path
already exists (e.g. an older encoder wrote it at 0o644), the
pre-existing perm is preserved. The sidecar carries the source
path, cluster_id, and SHA256 of the .fsm — leaving it world-readable
on a multi-user backup host is the same leak claude v4 fixed for
the create case.
Fix: after the regular-file + Nlink checks + Truncate(0), call
f.Chmod(sidecarFileMode) to enforce the mode on the open descriptor.
New named const sidecarFileMode = 0o600 so a future widening of the
flag-arg mode forces the chmod call to be touched too.
Windows variant unchanged (different perm model; the openSidecarFile
file is already platform-split for this reason).
Pinned by TestOpenSidecarFileEnforcesOwnerOnlyMode (unix-only via
build tag): writes a pre-existing 0o644 file, calls OpenSidecarFile,
asserts the resulting descriptor's Stat reports perm 0o600.
## Caller audit per CLAUDE.md semantic-change rule
- encodeOne success-path branch: rollback called only when
publishErr == nil AND writeSidecar fails. Self-test-mismatch path
unchanged. Sole caller of encodeOne is run(); error return
semantics unchanged.
- rollbackOrphanFSMAndSidecar: new helper, sole caller is the
success-path sidecar-failure branch in encodeOne. No external
callers.
- openSidecarFile (unix): success path now has one extra Chmod
call. All in-package callers (encode_redis.go's multiple
KEYMAP / TTL sidecar writers) are unaffected — they all already
expect a 0o600 file. CLI's writeSidecar / writeMismatchTxt
(via the exported OpenSidecarFile wrapper) are likewise
strictly-tighter-perm only.
- sidecarFileMode const: new named const, only consumer is the
Chmod call in openSidecarFile (unix).
Tests + lint green.
|
@claude review v32 folds in both P2 observations claude flagged on its v31 review (both pre-existing but worth closing). v31 P2 (1/2) — writeSidecar failure leaves orphan .fsm: success path runs v31 P2 (2/2) — OpenSidecarFile mode not enforced on existing files: the unix Caller audit (CLAUDE.md):
|
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6f2040858f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err := fsyncParentDir(cfg.outputPath); err != nil { | ||
| return result, errors.Wrap(err, "fsync output dir after rename") |
There was a problem hiding this comment.
Roll back FSM when directory fsync fails
When the rename succeeds but fsyncParentDir returns an error (for example on a filesystem that rejects directory fsync, or an I/O error), this returns before writeSidecar runs while publishedTempPath has already been cleared. The command exits failed but leaves a new restore-visible .fsm at --output without a matching .encode_info.json, so this path needs the same rollback/consistency handling as sidecar publication failures.
Useful? React with 👍 / 👎.
| if srerr := os.Remove(sidecarPath); srerr != nil && !errors.Is(srerr, os.ErrNotExist) { | ||
| logger.Warn("rollback partial sidecar after write failure", "err", srerr) |
There was a problem hiding this comment.
Avoid deleting pre-existing sidecar directories
When sidecar publication fails because <output>.encode_info.json is a pre-existing non-regular entry such as an empty directory or FIFO, OpenSidecarFile correctly refuses to write it, but this rollback then calls os.Remove on that same path even though no partial sidecar was created. That can delete operator-owned filesystem objects merely because the encode failed; only remove a sidecar known to be a regular file/symlink created or truncated by this run, or leave the refused entry in place.
Useful? React with 👍 / 👎.
|
@claude review Re-triggering v32 — original job errored. The build CI flake (Docker Hub timeout fetching buildkit) has been re-run and is in progress; all other v32 CI checks are green (test, golangci-lint, CodeQL, CodeRabbit, Analyze-actions/js/go all SUCCESS or NEUTRAL). v32 (commit 6f20408): two pre-existing P2 fixes — rollback orphan FSM + chmod existing sidecar to 0o600. No new logic; two new tests + one extended test. |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
|
@claude review Third re-trigger for v32 (prior two attempts errored with bot-side infrastructure issues). All v32 CI checks are now SUCCESS (build re-run cleared, Analyze-go completed). Codex remains silent on v32 (likely rate-limited). v32 (commit 6f20408) = the two non-blocking fixes you flagged on v31:
No new logic beyond your own v31 guidance. |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
Codex's review window came back online after the rate-limit pause
and posted three P2 findings against v30, v31, and v32 that I had
missed (the inline comments were attached to PR reviews 4410223887,
4410270684, 4410442057 but the review BODIES were just the standard
"Codex Review" header — the actual findings were in the per-review
inline comments).
## Codex P2 v31 (L640): S3 encoder silently skips non-directory bucket entries
S3RecordEncoder.Encode does `if !ent.IsDir() { continue }` for
top-level entries under `s3/`. When a partially-truncated dump
leaves a regular file or symlink where a bucket directory should
be, the encoder publishes a partial .fsm with the affected bucket
silently omitted. The deferred-enumeration empty S3 scope in the
manifest cannot otherwise flag the missing data.
Fix: encode_s3.go's loop now distinguishes:
- Reserved-prefix entries ("_*") → continue (forward compat with
`_incomplete_uploads/` and `_orphans/`, which are handled by
ErrEncodeUnsupportedS3IncompleteUploads / ErrEncodeUnsupportedS3Orphans).
- Any other non-directory entry → return ErrS3EncodeNotRegular
with the path and mode (matches the existing _bucket.json non-
regular guard).
Pinned by TestS3EncodeRejectsNonDirectoryBucketEntry and
TestS3EncodeIgnoresReservedPrefixEntry.
## Codex P2 v32 (L642): rollback FSM when parent-dir fsync fails
v25 added fsyncParentDir after os.Rename in writeAndPublish to make
the rename durable. If fsync fails (e.g., on a filesystem that
rejects directory fsync, or transient I/O error), writeAndPublish
returned the error but the renamed .fsm stayed at --output — the
operator sees a "successful" .fsm that may not survive a crash.
Fix: extract the rename + fsync + rollback block into a new
publishAndFsync helper that os.Remove's the .fsm on fsync failure
(same rollback semantics as the v32 sidecar-failure path).
publishAndFsync also brings writeAndPublish under the cyclop bound.
No new test — the failure path is hard to drive deterministically
(would need fsync syscall injection). The success path is exercised
by every existing CLI round-trip test; the rollback branch is
inspectable by code review against the matching v32 sidecar
rollback pattern.
## Codex P2 v32 (L520): rollback must not delete operator-owned sidecar entries
v32's rollbackOrphanFSMAndSidecar called os.Remove on the sidecar
path unconditionally. When the sidecar path was a pre-existing
non-regular entry that OpenSidecarFile correctly refused to clobber
(operator-placed symlink, FIFO, empty directory, etc.), the
rollback would destructively unlink it merely because the encode
failed.
Fix: rollbackOrphanFSMAndSidecar now os.Lstats the sidecar path and
only os.Removes when info.Mode().IsRegular(). Symlinks (operator-
placed), FIFOs, directories, and other non-regular entries are
logged-and-skipped.
Pinned end-to-end by extending TestCLISidecarWriteRefusesSymlinkTarget:
- Existing assertions: victim file at the symlink target survives
byte-equal; .fsm at --output is removed by v32 rollback.
- NEW v33 assertion: the operator-placed symlink at the sidecar
path is PRESERVED (os.Lstat + ModeSymlink check).
The assertions are extracted into assertSymlinkSidecarRollbackInvariants
to keep the test under the cyclop bound.
## Caller audit per CLAUDE.md semantic-change rule
- S3RecordEncoder.Encode: was "silent skip on non-directory top-
level entries"; now "fail closed with ErrS3EncodeNotRegular
unless name starts with `_`". Sole effect on legitimate dumps
is no behavior change (all canonical bucket dirs are
directories). Malformed dumps that previously silently
half-encoded now fail closed; classifyEncodeError routes via
the v9 errors.Mark + ErrEncodeAdapterData layer to exit 2.
- publishAndFsync (new helper): sole caller is writeAndPublish.
Behavior change ONLY on fsync failure (was: leave .fsm in
place; now: roll back). Success path unchanged.
- rollbackOrphanFSMAndSidecar: sidecar removal is now gated on
IsRegular(). Was destructive on operator-owned non-regular
entries; now leaves them alone. Sole caller is encodeOne's
success-path sidecar-failure branch.
- assertSymlinkSidecarRollbackInvariants (test helper): no
production callers; only used by TestCLISidecarWriteRefusesSymlinkTarget.
Tests + lint green (incl. gofmt on the new helper's comment block).
|
@claude review v33 closes three codex P2 findings I missed across v30/v31/v32 (codex posted them as PR-review inline comments rather than separate review-body posts, so they didn't surface in my initial scans — found them by querying codex P2 v31 L640 — S3 silently skips non-directory bucket entries: codex P2 v32 L642 — rollback FSM when parent-dir fsync fails: v25's codex P2 v32 L520 — don't os.Remove operator-owned sidecar entries: v32's Caller audit (CLAUDE.md): S3 encoder fail-closed flows through
|
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5adca7f679
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| return | ||
| } | ||
| if !info.Mode().IsRegular() { |
There was a problem hiding this comment.
Preserve hard-linked sidecars on rollback
When the sidecar path is a pre-existing hard link, OpenSidecarFile refuses it before truncating, but this rollback treats it as a regular file and the following os.Remove unlinks it anyway. In a shared backup directory this means a failed encode can delete an operator-owned sidecar link that the no-clobber guard intentionally left untouched; check the link count (or track whether this run actually opened/truncated the file) before removing it.
Useful? React with 👍 / 👎.
|
@claude review Re-triggering v33 — prior bot job errored. v33 (commit 5adca7f) closes three codex P2 inline findings on v30/v31/v32:
CI on v33 currently green: build/test/golangci-lint/CodeQL/Analyze all SUCCESS; ubuntu-latest test + Analyze-go still in progress. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
internal/backup/open_nofollow_unix_test.go (1)
20-24: 💤 Low valueConsider verifying the seeded file actually has broader permissions.
The test seeds a file at
0o644but doesn't verify that permission was honored before testing the enforcement. On systems with restrictive umasks or filesystems,WriteFilemight silently produce0o600, causing the test to pass even if theChmodlogic is broken.♻️ Proposed robustness check (pattern from encryption test)
if err := os.WriteFile(path, []byte("prior"), 0o644); err != nil { //nolint:gosec // test simulates legacy permissive sidecar t.Fatalf("WriteFile: %v", err) } + // Verify environment honored the broader mode before testing enforcement. + st, err := os.Stat(path) + if err != nil { + t.Fatalf("Stat seeded file: %v", err) + } + if st.Mode().Perm()&0o077 == 0 { + t.Skipf("environment refused permissive seed (mode=%o)", st.Mode().Perm()) + } f, err := OpenSidecarFile(path)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/backup/open_nofollow_unix_test.go` around lines 20 - 24, After seeding the sidecar file with os.WriteFile(..., 0o644), immediately stat the file with os.Stat and assert the file mode's permissions (info.Mode().Perm()) equal 0o644; if not, fail the test (t.Fatalf) so the test doesn't silently pass on systems that applied a restrictive umask or different FS behavior. Locate the seed call in the test (the os.WriteFile invocation) and add the os.Stat + Mode().Perm() check right after it to ensure the precondition for the chmod-enforcement logic is actually met.internal/backup/open_nofollow_unix.go (1)
61-61: ⚡ Quick winUse
sidecarFileModeconstant in theOpenFilecall.The comment on line 116 states "a future edit that widens one must touch both," but line 61 still uses the literal
0o600. Replacing it withsidecarFileModewould mechanically enforce synchronization rather than relying on manual coordination.♻️ Proposed fix to use the constant consistently
- f, err := os.OpenFile(path, flag, 0o600) //nolint:gosec,mnd // path is composed from output-root + fixed file name; 0600 is the standard owner-only mode + f, err := os.OpenFile(path, flag, sidecarFileMode) //nolint:gosec,mnd // path is composed from output-root + fixed file name; sidecarFileMode is the standard owner-only modeAlso applies to: 114-118
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/backup/open_nofollow_unix.go` at line 61, Replace the hard-coded file mode 0o600 in the os.OpenFile call with the sidecarFileMode constant so the mode is maintained centrally; update the call that currently reads os.OpenFile(path, flag, 0o600) to use sidecarFileMode (referencing the same variables path and flag), and grep this file for any other literal 0o600 usages (e.g., the similar open/create calls later in the file) and replace them with sidecarFileMode to ensure consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/backup/open_nofollow_unix_test.go`:
- Around line 20-24: After seeding the sidecar file with os.WriteFile(...,
0o644), immediately stat the file with os.Stat and assert the file mode's
permissions (info.Mode().Perm()) equal 0o644; if not, fail the test (t.Fatalf)
so the test doesn't silently pass on systems that applied a restrictive umask or
different FS behavior. Locate the seed call in the test (the os.WriteFile
invocation) and add the os.Stat + Mode().Perm() check right after it to ensure
the precondition for the chmod-enforcement logic is actually met.
In `@internal/backup/open_nofollow_unix.go`:
- Line 61: Replace the hard-coded file mode 0o600 in the os.OpenFile call with
the sidecarFileMode constant so the mode is maintained centrally; update the
call that currently reads os.OpenFile(path, flag, 0o600) to use sidecarFileMode
(referencing the same variables path and flag), and grep this file for any other
literal 0o600 usages (e.g., the similar open/create calls later in the file) and
replace them with sidecarFileMode to ensure consistency.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3f511d1d-5715-4bc5-a9b4-21781d99f0ca
📒 Files selected for processing (6)
cmd/elastickv-snapshot-encode/main.gocmd/elastickv-snapshot-encode/main_test.gointernal/backup/encode_s3.gointernal/backup/encode_s3_test.gointernal/backup/open_nofollow_unix.gointernal/backup/open_nofollow_unix_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- cmd/elastickv-snapshot-encode/main.go
- cmd/elastickv-snapshot-encode/main_test.go
|
Claude finished @bootjp's task in 4m 42s —— View job Code Review - PR #904 (v33)Tasks
v33 Verification — All Three Codex P2 Fixes Confirmed
One Minor Doc Nit (non-blocking)
The S3 non-directory bucket finding appeared in the codex review of commit // ...silently skipping would let the encoder publish a partial .fsm with
// the affected bucket omitted (codex P2 v32 #904; the manifest's…This should read Carry-Over Observations (non-blocking, from CodeRabbit v33 review)
f, err := os.OpenFile(path, flag, 0o600)
Strengths in v33
Verdict: Ready to merge. All three codex P2 findings are correctly implemented and pinned by tests. The one doc nit (round-number citation in | Branch | | |
…odex P2 v33) ## Codex P2 v33: rollback destroys hard-linked sidecars v32's rollbackOrphanFSMAndSidecar gated sidecar removal on info.Mode().IsRegular(). A pre-existing hard link at the sidecar path passes IsRegular(), so the rollback would os.Remove it — destroying an operator-owned link that OpenSidecarFile's Nlink > 1 guard correctly refused to clobber. In a shared backup directory this lets a failed encode unlink files outside the dump tree. ## Fix Track whether OpenSidecarFile actually succeeded on this run. writeSidecar's signature changes from `error` to `(truncated bool, err error)`: - OpenSidecarFile fails (symlink ELOOP, hard-link Nlink>1, FIFO ENXIO, non-regular file, etc.) → return (false, err). Caller MUST NOT remove the sidecar path; the pre-existing entry is operator-owned and was never touched. - OpenSidecarFile succeeds (truncated and chmod'd a fresh or single-link regular file) → return (true, err) on any subsequent failure. Caller rollback removes the partial bytes. rollbackOrphanFSMAndSidecar gains a sidecarTruncated bool parameter. The Lstat + IsRegular check from v33 is removed — the truncated flag is a precise signal of "this run owns the bytes," not a stat-time heuristic that can race with operator changes. ## Pinned by TestCLISidecarWriteRefusesHardLinkTarget (unix-only): pre-plants a victim file with a sentinel body, os.Link's the sidecar path to it, runs the CLI, asserts: - Exit code = exitUserErr. - Victim contents are byte-equal preserved (Nlink guard refused truncation). - Hard link at sidecar path is preserved (sidecarTruncated=false → rollback skips it). - .fsm at outputPath is removed (rollback's FSM branch always fires on sidecar-write failure with successful publish). The v33 TestCLISidecarWriteRefusesSymlinkTarget continues to pass: OpenSidecarFile ELOOP → sidecarTruncated=false → symlink survives. ## CodeRabbit nits folded in 1. `internal/backup/open_nofollow_unix.go:61`: `os.OpenFile(..., flag, 0o600)` now uses the `sidecarFileMode` const (was literal). Keeps the OpenFile mode and the post-Truncate Chmod mode synchronized — a future widening of one is mechanically caught. 2. `internal/backup/open_nofollow_unix_test.go:20-24`: TestOpenSidecarFileEnforcesOwnerOnlyMode now `os.Stat`s the seeded 0o644 file and `t.Skip`s if the environment refused the permissive mode (restrictive umask / FS). Without this, the test could pass spuriously on environments where WriteFile silently produces 0o600. ## Caller audit per CLAUDE.md semantic-change rule - writeSidecar return signature changed from `error` to `(bool, error)`. Sole caller is encodeOne (success-path branch); updated in lock-step to capture both values. The bool flows into rollbackOrphanFSMAndSidecar via a new parameter. - rollbackOrphanFSMAndSidecar gained a sidecarTruncated bool parameter. Sole caller is encodeOne's success-path sidecar-failure branch; updated in lock-step. v33's Lstat + IsRegular check removed (replaced by the precise truncated flag), so non-regular paths are no longer destroyed even by paths the v33 check would have accidentally caught. - openSidecarFile (unix) OpenFile mode arg changed from 0o600 literal to sidecarFileMode const. Same numeric value; no behavior change. All callers (in-package adapter writers + the exported OpenSidecarFile) are unaffected. Tests + lint green.
|
@claude review v34 addresses codex P2 v33 (hard-linked sidecars) plus two CodeRabbit nits. codex P2 v33 — v32/v33's rollback gated on CodeRabbit nit (quick win) — CodeRabbit nit (low value) — Caller audit (CLAUDE.md):
|
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
Claude v33 caught: encode_s3.go:93 attributed the non-directory bucket guard to "codex P2 v32 #904" but the finding was actually flagged in codex's review of commit a2ed548 (v31 commit). v32 was when codex flagged the fsync + sidecar-rollback issues. The attribution round-number was off by one. Doc-only one-character fix (v32 → v31). No behavior change; lint green.
|
@claude review v35 = doc-only one-character fix per claude v33 nit ( (Also note: v34 from earlier addresses codex P2 v33 hard-linked sidecar rollback + two CodeRabbit nits; v34's claude bot job errored without producing output, but CI on v34 was all green. v35 stacks on v34.)
|
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
There was a problem hiding this comment.
💡 Codex Review
elastickv/internal/backup/encode_dynamodb.go
Lines 101 to 102 in 365744e
When dynamodb/ contains a canonical table entry that is a regular file or symlink (for example from a truncated or reused decoded tree), this branch silently skips it and the encoder can publish an .fsm with that table omitted; the S3 path in this same change already treats non-directory top-level entries as malformed for this reason. Please return an adapter-data error for non-directory entries under the selected DynamoDB root instead of treating them as absent.
elastickv/internal/backup/encode_sqs.go
Lines 142 to 143 in 365744e
When sqs/ contains a queue entry that is a regular file or symlink, this silently skips it, so a malformed or partially cleaned input tree can encode successfully while dropping the affected queue and its messages. Since enabled adapter errors are now classified as data failures, this should reject non-directory top-level SQS entries rather than publish a partial restore artifact.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@claude review Re-triggering for v35 — prior two bot jobs errored. Status:
If this v35 review also errors, the prior bot signals form a coherent review-OK. |
|
Claude finished @bootjp's task in 59s —— View job Code Review - PR #904 (v35)Tasks
v35 Verification — v33 Citation Nit ConfirmedThe diff is exactly one token change in one comment line:
Verdict: Ready to merge. v35 closes the single citation nit from the v33 review. All findings across all 35 rounds of review are resolved. The PR is clean.
| Branch | | |

Summary
Phase 0b M6 implementation per the merged design doc
docs/design/2026_06_01_proposed_snapshot_encode_cli.md(#896, merged at fe9e941).Wires the merged M1–M5 encoder slices into a user-facing CLI plus a library entrypoint mirroring the decoder's
DecodeSnapshot. Implements the round-trip self-test the parent doc mandates, with write-then-rename atomic publish so a self-test failure never reaches the restore path.What lands
Library (
internal/backup/):encode_snapshot.go:EncodeSnapshot(opts, out io.Writer) (EncodeResult, error)— high-level wrapper that dispatches per-adapter encoders in canonical fan-out order (redis → dynamodb → s3 → sqs), implements two-mode buffering (stream whenSelfTest=false, buffer whenSelfTest=true), runs the structural self-test against the in-memory buffer, and copies tooutonly on match. UnexportedcorruptBufferForTesthook lets same-package tests inject buffer corruption that reaches the self-test decode but neverout(codex P2 v6 docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI #896 — write-then-rename atomicity).encode_info.go:EncodeInfoschema +WriteEncodeInfo/ReadEncodeInfohelpers +EncodeInfoSidecarPath(path-derived<output>.encode_info.json, no static-name collisions per gemini medium v2 docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI #896). Format-version gate so a future bump surfaces asErrUnsupportedEncodeInfoFormatVersion.manifest.go:Exclusions.RenameS3Collisions booladded with JSON tagrename_s3_collisions. Intentionally NOT inexclusionsRequiredFieldsso legacy manifests decode safely with the zero valuefalse(claude v5 medium docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI #896).Decoder CLI update (
cmd/elastickv-snapshot-decode/main.go):emitManifestnow populates the newRenameS3Collisionsfield fromcfg.renameCollisions, completing the decoder→encoder round-trip for--rename-collisionsdumps.Encoder CLI (
cmd/elastickv-snapshot-encode/main.go):--input,--output,--adapter(decoder-parity CSV parser),--last-commit-ts,--self-test,--scratch-root.<output>.tmp-<random>, fsync+close, then rename. Self-test failure →mismatch.txtnext to where.fsmwould have been, temp file removed, exit 2.--last-commit-ts T < manifest→ exit 2 with typedErrSelfTestLowerLastCommitTS.encoder_versionstamped at build time via-ldflags "-X main.version=..."(mirrors decoder pattern atmain.go:45).Test plan
go test -race -count=1— all green:internal/backup/: 3 new tests (encode_info schema, format gate, legacy manifest forward-compat)internal/backup/encode_snapshot_test.go: 5 tests (library round-trip, self-test match against canonicalized input, corruption never reachesout, missing-input guard, sidecar path derivation)cmd/elastickv-snapshot-encode/main_test.go: 9 tests (missing manifest, unknown adapter, lower-TS fail-closed, equal+higher TS accept, path-derived sidecar, two-files-no-collision, full round-trip with--self-test, atomic-publish never leaves bad.fsm,--last-commit-tsparser)make lint: clean.Caller audit per CLAUDE.md semantic-change rule
Exclusionsstruct gained a field. Existing callers either build via field-tagged literals (decoder CLI'semitManifest— updated to populate the new field) or read it (encoder'sbuildSelfTestDecodeOptions— new code). No silent semantic change.DecodeOptions.RenameS3Collisionswas already a public field used by the decoder; the encoder now also reads it via the manifest. No caller-side change needed.Self-review (5 passes)
.fsm. Corrupted buffer never reaches theio.Writer(pinned byTestEncodeSnapshotSelfTestDetectsCorruptionassertingout.Len()==0). Allb.Adderrors propagate throughEncodeSnapshot.io.Writer. Temp-file suffix usescrypto/randso concurrent encodes against the same--outputcannot collide.SelfTest=falsestreams with onesha256.Writertee, no extra allocations.SelfTest=trueallocates one FSM-sized*bytes.Bufferplus the scratch decode tree (documented memory cost).--last-commit-ts T < manifestfail-closed with typed error; self-test threads MANIFEST DecodeOptions (Exclusions.*+DynamoDBLayout → DynamoDBBundleJSONL) so trees produced with non-default decoder flags round-trip cleanly.Risk
Low. The encoder is offline; restoration is non-destructive (new keyspace on a fresh cluster via stop-replace-restart). The only public-API change is the
Exclusions.RenameS3Collisionsfield, which is forward-compat for older manifests. All existing M1–M5 tests continue to pass.Summary by CodeRabbit
New Features
Bug Fixes
Tests