From fd8b0d92bc41fd125135b550bd4dc47c030103a8 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 00:53:29 +0900 Subject: [PATCH 01/35] backup: Phase 0b M6 implementation - cmd/elastickv-snapshot-encode MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implementation of the M6 design (this PR's earlier doc commits 31c0a0f8 through 17364356). Wires the merged M1-M5 encoder slices into a user-facing CLI plus a library entrypoint, with the round-trip self-test the parent doc mandates. ## What ships - internal/backup/manifest.go: Exclusions gains RenameS3Collisions bool with json tag rename_s3_collisions. Intentionally NOT added to exclusionsRequiredFields so legacy manifests decode safely with the zero value false. Pinned by TestExclusionsLegacyManifestOmitsRenameS3Collisions. - cmd/elastickv-snapshot-decode/main.go: emitManifest now populates the new field from cfg.renameCollisions, so M6's self-test can thread the same option back through DecodeSnapshot. - internal/backup/encode_info.go: ENCODE_INFO.json schema with format_version gate, NewEncodeInfo constructor, WriteEncodeInfo / ReadEncodeInfo helpers, and EncodeInfoSidecarPath that derives the sidecar from the .fsm path (.encode_info.json — no static name collisions). - internal/backup/encode_snapshot.go: EncodeSnapshot(opts, out) library entrypoint mirroring DecodeSnapshot. Dispatches per-adapter encoders in canonical fan-out order (redis -> dynamodb -> s3 -> sqs), and implements the two-mode buffering model: - SelfTest=false: stream FSM straight to out with a sha256.Writer tee. - SelfTest=true: buffer FSM in *bytes.Buffer, fire the unexported corruptBufferForTest hook (if set), self-test against the buffer, copy to out ONLY on match. Corruption hooks reach the self-test decode but never reach out (write-then-rename atomicity). - cmd/elastickv-snapshot-encode/main.go: CLI with --input / --output / --adapter / --last-commit-ts / --self-test / --scratch-root flags; decoder-parity adapter CSV parser; fail-closed T >= manifest.last_commit_ts validation; atomic publish via .tmp- with deferred cleanup; sidecar emission with fsync+close discipline. ## Tests (12 new) internal/backup/encode_info_test.go: - TestEncodeInfoRoundTrip - TestEncodeInfoRejectsUnknownFormatVersion - TestExclusionsLegacyManifestOmitsRenameS3Collisions internal/backup/encode_snapshot_test.go: - TestEncodeSnapshotLibraryRoundTrip - TestEncodeSnapshotSelfTestMatchesInput (uses canonicalize-once pattern: encode -> decode -> re-encode self-tests cleanly) - TestEncodeSnapshotSelfTestDetectsCorruption (corruption reaches self-test decode, NEVER reaches the io.Writer) - TestEncodeSnapshotRequiresInputRoot - TestEncodeInfoSidecarPath cmd/elastickv-snapshot-encode/main_test.go: - TestCLIRejectsMissingManifest - TestCLIRejectsUnknownAdapter - TestCLIRejectsLowerLastCommitTSOverride - TestCLIAcceptsEqualAndHigherLastCommitTSOverride (sub-tests for equal + higher) - TestCLIEncodeInfoPathDerivedFromOutput - TestCLIEncodeInfoTwoFilesNoCollision - TestCLIRoundTripSelfTestAllAdapters - TestCLISelfTestFailureLeavesNoFsmAtOutputPath - TestParseLastCommitTS All green (race + no cache). golangci-lint clean. ## Caller audit per CLAUDE.md semantic-change rule - Exclusions struct gained a field. Existing callers either: (a) build the struct via field-tagged literals (decoder CLI's emitManifest — updated to populate the new field), or (b) read it (encoder's buildSelfTestDecodeOptions — new code). No silent semantic change for any pre-existing caller. - DecodeOptions.RenameS3Collisions was already a public field used by the decoder; the encoder now also reads it via the manifest. No caller-side change needed. ## Self-review 1. Data loss: write-then-rename atomic publish — self-test failure never publishes the .fsm. Mismatch txt + corrupt buffer never reach the io.Writer (pinned by TestEncodeSnapshotSelfTestDetectsCorruption asserting out.Len()==0). 2. Concurrency: pure offline. The CLI is a single-shot binary; library entrypoint takes an io.Writer the caller owns. 3. Performance: SelfTest=false streams with one sha256 tee, no extra allocations. SelfTest=true allocates one FSM-sized buffer plus the scratch decode tree (documented memory cost). 4. Data consistency: --last-commit-ts T < manifest is fail-closed with typed error; self-test threads MANIFEST DecodeOptions (Exclusions.* + DynamoDBLayout -> DynamoDBBundleJSONL) so trees produced with non-default decoder flags round-trip cleanly. 5. Test coverage: 12 new tests cover library entrypoint, CLI flag parsing, atomic publish discipline, sidecar path-derivation, corruption detection, forward-compat for legacy manifests. --- cmd/elastickv-snapshot-decode/main.go | 1 + cmd/elastickv-snapshot-encode/main.go | 392 ++++++++++++++++++ cmd/elastickv-snapshot-encode/main_test.go | 362 ++++++++++++++++ internal/backup/encode_info.go | 114 +++++ internal/backup/encode_info_test.go | 95 +++++ internal/backup/encode_snapshot.go | 461 +++++++++++++++++++++ internal/backup/encode_snapshot_test.go | 217 ++++++++++ internal/backup/manifest.go | 8 + 8 files changed, 1650 insertions(+) create mode 100644 cmd/elastickv-snapshot-encode/main.go create mode 100644 cmd/elastickv-snapshot-encode/main_test.go create mode 100644 internal/backup/encode_info.go create mode 100644 internal/backup/encode_info_test.go create mode 100644 internal/backup/encode_snapshot.go create mode 100644 internal/backup/encode_snapshot_test.go diff --git a/cmd/elastickv-snapshot-decode/main.go b/cmd/elastickv-snapshot-decode/main.go index 912f0b2ce..3359e2eaf 100644 --- a/cmd/elastickv-snapshot-decode/main.go +++ b/cmd/elastickv-snapshot-decode/main.go @@ -275,6 +275,7 @@ func emitManifest(cfg *config, res backup.DecodeResult) error { IncludeOrphans: cfg.includeOrphans, PreserveSQSVisibility: cfg.preserveSQSVisibility, IncludeSQSSideRecords: cfg.includeSQSSideRecords, + RenameS3Collisions: cfg.renameCollisions, } if cfg.bundleJSONL { m.DynamoDBLayout = backup.DynamoDBLayoutJSONL diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go new file mode 100644 index 000000000..fea66f042 --- /dev/null +++ b/cmd/elastickv-snapshot-encode/main.go @@ -0,0 +1,392 @@ +// Command elastickv-snapshot-encode is the Phase 0b M6 snapshot encoder +// described in docs/design/2026_06_01_proposed_snapshot_encode_cli.md +// (parent: docs/design/2026_05_25_partial_snapshot_logical_encoder.md). +// +// It reads a vendor-independent per-adapter directory tree (produced by +// elastickv-snapshot-decode or by a future Phase 1 live extractor) and +// writes a native EKVPBBL1 .fsm a stopped node can load via the +// stop-replace-restart restore runbook (parent §"Restore via +// stop-replace-restart"). +// +// The CLI is offline-only. It does not talk to a running cluster; the +// receiving cluster loads the output .fsm via its existing snapshot +// loader on next restart. +// +// Atomic publish: the .fsm is written to .tmp- first, +// fsync+close, then renamed to only after the optional +// self-test matches. A self-test failure removes the temp file, so a +// known-bad .fsm never reaches the restore path (codex P2 v2 #896). +// +// version is stamped at build time via -ldflags "-X main.version=$(git rev-parse HEAD)". +// Test builds keep the literal "dev" so CLI-level tests can assert the +// field is present without depending on a release tag. +package main + +import ( + "crypto/rand" + "encoding/hex" + "flag" + "fmt" + "io" + "log/slog" + "os" + "path/filepath" + "strings" + "time" + + "github.com/bootjp/elastickv/internal/backup" + "github.com/cockroachdb/errors" +) + +var version = "dev" + +const ( + exitSuccess = 0 + exitUserErr = 1 + exitDataErr = 2 + // tempSuffixHexLen is the hex-character length of the random + // suffix appended to .tmp-; 8 hex chars = 4 bytes of + // entropy, which gives ~4×10^9 collision space per --output path + // (more than enough for concurrent encodes against the same path). + tempSuffixHexLen = 8 + tempSuffixByteLen = tempSuffixHexLen / 2 + mismatchTxtPerm = 0o600 + encodeInfoFilePerm = 0o600 +) + +type config struct { + inputPath string + outputPath string + adapters backup.AdapterSet + lastCommitTSPresent bool + lastCommitTS uint64 + selfTest bool + scratchRoot string +} + +func main() { + logger := slog.New(slog.NewTextHandler(os.Stderr, nil)) + exitCode, err := run(os.Args[1:], logger) + if err != nil { + logger.Error("elastickv-snapshot-encode", "err", err) + } + os.Exit(exitCode) +} + +func run(argv []string, logger *slog.Logger) (int, error) { + cfg, err := parseFlags(argv) + if err != nil { + return exitUserErr, err + } + if err := encodeOne(cfg, logger); err != nil { + // Errors from the encoder layer that represent data constraints + // (HLC ceiling regression, self-test mismatch) are exit 2; other + // errors are exit 1. + if errors.Is(err, backup.ErrSelfTestLowerLastCommitTS) { + return exitDataErr, err + } + if errors.Is(err, errSelfTestMismatch) { + return exitDataErr, err + } + return exitUserErr, err + } + return exitSuccess, nil +} + +func parseFlags(argv []string) (*config, error) { + fs := flag.NewFlagSet("elastickv-snapshot-encode", flag.ContinueOnError) + fs.SetOutput(io.Discard) + + var ( + inputPath string + outputPath string + adapterCSV string + ltsRaw string + selfTest bool + scratchRoot string + ) + fs.StringVar(&inputPath, "input", "", "Directory tree root produced by elastickv-snapshot-decode (required, must contain MANIFEST.json)") + fs.StringVar(&outputPath, "output", "", "Destination .fsm file path (required)") + fs.StringVar(&adapterCSV, "adapter", "dynamodb,s3,redis,sqs", "Comma-separated subset of adapters to encode") + fs.StringVar(<sRaw, "last-commit-ts", "", "Override the manifest's last_commit_ts; must be >= manifest value (HLC ceiling can only rise)") + fs.BoolVar(&selfTest, "self-test", false, "After encode, decode the produced .fsm and assert it structurally matches --input") + fs.StringVar(&scratchRoot, "scratch-root", "", "Base directory for self-test scratch subdir (default os.TempDir); a unique encode-self-test- subdir is always created underneath") + + if err := fs.Parse(argv); err != nil { + return nil, errors.WithStack(err) + } + if inputPath == "" { + return nil, errors.New("--input is required") + } + if outputPath == "" { + return nil, errors.New("--output is required") + } + adapters, err := parseAdapterSet(adapterCSV) + if err != nil { + return nil, err + } + cfg := &config{ + inputPath: inputPath, + outputPath: outputPath, + adapters: adapters, + selfTest: selfTest, + scratchRoot: scratchRoot, + } + if ltsRaw != "" { + ts, perr := parseLastCommitTS(ltsRaw) + if perr != nil { + return nil, perr + } + cfg.lastCommitTSPresent = true + cfg.lastCommitTS = ts + } + return cfg, nil +} + +// parseLastCommitTS parses --last-commit-ts as a uint64. Hex (0x prefix) +// or decimal accepted. Negative or out-of-range surfaces as exit-1 +// (flag-parse error); the semantic check (T >= manifest) is exit-2. +func parseLastCommitTS(raw string) (uint64, error) { + s := strings.TrimSpace(raw) + if s == "" { + return 0, errors.New("--last-commit-ts is empty") + } + var ts uint64 + if strings.HasPrefix(s, "0x") || strings.HasPrefix(s, "0X") { + if _, err := fmt.Sscanf(s[2:], "%x", &ts); err != nil { + return 0, errors.Wrap(err, "--last-commit-ts hex parse") + } + return ts, nil + } + if _, err := fmt.Sscanf(s, "%d", &ts); err != nil { + return 0, errors.Wrap(err, "--last-commit-ts decimal parse") + } + return ts, nil +} + +// parseAdapterSet decodes a comma-separated adapter list (or "all"). +// Mirrors the decoder's parser so a typo cannot silently disable an +// adapter. Unknown name → exit-1. +func parseAdapterSet(csv string) (backup.AdapterSet, error) { + if csv == "" || csv == "all" { + return backup.AdapterSet{DynamoDB: true, S3: true, Redis: true, SQS: true}, nil + } + var set backup.AdapterSet + for _, raw := range strings.Split(csv, ",") { + name := strings.TrimSpace(strings.ToLower(raw)) + switch name { + case "dynamodb": + set.DynamoDB = true + case "s3": + set.S3 = true + case "redis": + set.Redis = true + case "sqs": + set.SQS = true + case "": + continue + default: + return backup.AdapterSet{}, errors.Errorf("unknown adapter %q", name) + } + } + return set, nil +} + +// errSelfTestMismatch is a typed sentinel so run() can map self-test diffs +// to exit-2 without coupling to the encoder's mismatch.txt format. +var errSelfTestMismatch = errors.New("backup: --self-test diff against --input") + +func encodeOne(cfg *config, logger *slog.Logger) error { + manifest, err := readInputManifest(cfg.inputPath) + if err != nil { + return err + } + effectiveTS, overridden, err := resolveLastCommitTS(cfg, manifest.LastCommitTS) + if err != nil { + return err + } + encodeOpts := buildEncodeOptions(cfg, effectiveTS, manifest) + + mismatchPath := cfg.outputPath + ".mismatch.txt" + _ = os.Remove(mismatchPath) // stale-mismatch cleanup, gemini medium v6 #896 + + result, err := writeAndPublish(cfg, encodeOpts, mismatchPath, logger) + if err != nil { + return err + } + if err := writeSidecar(cfg, manifest, effectiveTS, overridden, result); err != nil { + return errors.Wrap(err, "write encode_info sidecar") + } + logger.Info("encode complete", + "output", cfg.outputPath, + "bytes", result.BytesWritten, + "self_test", cfg.selfTest, + "adapters", strings.Join(result.AdaptersEnabled, ","), + ) + return nil +} + +// readInputManifest opens + decodes /MANIFEST.json. +func readInputManifest(inputPath string) (backup.Manifest, error) { + manifestPath := filepath.Join(inputPath, "MANIFEST.json") + manifestFile, err := os.Open(manifestPath) //nolint:gosec // operator-supplied path + if err != nil { + return backup.Manifest{}, errors.Wrapf(err, "open %s", manifestPath) + } + defer func() { _ = manifestFile.Close() }() + m, err := backup.ReadManifest(manifestFile) + if err != nil { + return backup.Manifest{}, errors.Wrap(err, "read manifest") + } + return m, nil +} + +func buildEncodeOptions(cfg *config, effectiveTS uint64, manifest backup.Manifest) backup.EncodeOptions { + encodeOpts := backup.EncodeOptions{ + InputRoot: cfg.inputPath, + Adapters: cfg.adapters, + LastCommitTS: effectiveTS, + SelfTest: cfg.selfTest, + } + if cfg.selfTest { + encodeOpts.SelfTestDecodeOptions = buildSelfTestDecodeOptions(cfg, manifest) + } + return encodeOpts +} + +// writeAndPublish writes the .fsm to a temp path, runs the optional +// self-test via EncodeSnapshot, and renames temp → output on success. +// On self-test failure: writes mismatch.txt, removes the temp file via +// the deferred cleanup, returns errSelfTestMismatch. +func writeAndPublish(cfg *config, encodeOpts backup.EncodeOptions, mismatchPath string, logger *slog.Logger) (backup.EncodeResult, error) { + tempPath, err := tempOutputPath(cfg.outputPath) + if err != nil { + return backup.EncodeResult{}, err + } + result, err := encodeToTempFile(tempPath, encodeOpts) + publishedTempPath := tempPath + defer func() { + if publishedTempPath != "" { + _ = os.Remove(publishedTempPath) + } + }() + if err != nil { + return result, err + } + if cfg.selfTest && !result.SelfTestMatched { + if werr := os.WriteFile(mismatchPath, result.SelfTestMismatchTxt, mismatchTxtPerm); werr != nil { + logger.Warn("write mismatch.txt", "err", werr) + } + return result, errors.Wrap(errSelfTestMismatch, "self-test diff (see "+mismatchPath+")") + } + if err := os.Rename(tempPath, cfg.outputPath); err != nil { + return result, errors.Wrap(err, "rename tmp -> output") + } + publishedTempPath = "" // rename succeeded; defer no-ops + return result, nil +} + +// encodeToTempFile creates tempPath, runs EncodeSnapshot into it, +// fsync+close. Caller is responsible for the os.Remove cleanup on error. +func encodeToTempFile(tempPath string, encodeOpts backup.EncodeOptions) (backup.EncodeResult, error) { + tempFile, err := os.Create(tempPath) //nolint:gosec // operator-supplied path + if err != nil { + return backup.EncodeResult{}, errors.Wrapf(err, "create %s", tempPath) + } + result, err := backup.EncodeSnapshot(encodeOpts, tempFile) + if err != nil { + _ = tempFile.Close() + return result, errors.Wrap(err, "EncodeSnapshot") + } + if err := tempFile.Sync(); err != nil { + _ = tempFile.Close() + return result, errors.Wrap(err, "fsync tmp") + } + if err := tempFile.Close(); err != nil { + return result, errors.Wrap(err, "close tmp") + } + return result, nil +} + +// resolveLastCommitTS applies the parent doc's HLC-ceiling-only-rises +// rule. Returns the effective T, whether an override was applied, and a +// typed error on regression. +func resolveLastCommitTS(cfg *config, manifestTS uint64) (uint64, bool, error) { + if !cfg.lastCommitTSPresent { + return manifestTS, false, nil + } + if cfg.lastCommitTS < manifestTS { + return 0, false, errors.Wrapf(backup.ErrSelfTestLowerLastCommitTS, + "--last-commit-ts %d < manifest %d", cfg.lastCommitTS, manifestTS) + } + return cfg.lastCommitTS, true, nil +} + +// buildSelfTestDecodeOptions translates manifest fields into the +// DecodeOptions the self-test feeds into DecodeSnapshot, so the scratch +// tree matches what the original decoder would have produced (codex P2 +// v3 #896). +func buildSelfTestDecodeOptions(cfg *config, m backup.Manifest) backup.DecodeOptions { + opts := backup.DecodeOptions{ + OutRoot: cfg.scratchRoot, + Adapters: cfg.adapters, + } + if m.Exclusions != nil { + opts.IncludeIncompleteUploads = m.Exclusions.IncludeIncompleteUploads + opts.IncludeOrphans = m.Exclusions.IncludeOrphans + opts.PreserveSQSVisibility = m.Exclusions.PreserveSQSVisibility + opts.IncludeSQSSideRecords = m.Exclusions.IncludeSQSSideRecords + opts.RenameS3Collisions = m.Exclusions.RenameS3Collisions + } + if m.DynamoDBLayout == backup.DynamoDBLayoutJSONL { + opts.DynamoDBBundleJSONL = true + } + return opts +} + +// tempOutputPath returns .tmp- for the write-then-rename +// atomic publish. crypto/rand provides the suffix so concurrent encodes +// against the same --output cannot collide. +func tempOutputPath(output string) (string, error) { + buf := make([]byte, tempSuffixByteLen) + if _, err := rand.Read(buf); err != nil { + return "", errors.Wrap(err, "rand suffix") + } + return output + ".tmp-" + hex.EncodeToString(buf), nil +} + +// writeSidecar emits ENCODE_INFO.json next to the published .fsm. +// Path-derived per gemini medium v2 #896. +func writeSidecar(cfg *config, m backup.Manifest, effectiveTS uint64, overridden bool, result backup.EncodeResult) error { + info := backup.NewEncodeInfo(time.Now()) + info.EncoderVersion = version + info.InputRoot = cfg.inputPath + info.OutputFSMPath = cfg.outputPath + info.OutputFSMSHA256 = hex.EncodeToString(result.SHA256[:]) + info.LastCommitTS = effectiveTS + info.LastCommitTSOverridden = overridden + info.ManifestLastCommitTS = m.LastCommitTS + info.ManifestClusterID = m.ClusterID + info.AdaptersEnabled = result.AdaptersEnabled + info.SelfTest = backup.EncodeInfoSelfTest{ + Ran: result.SelfTestRan, + Matched: result.SelfTestMatched, + } + sidecarPath := backup.EncodeInfoSidecarPath(cfg.outputPath) + f, err := os.Create(sidecarPath) //nolint:gosec // operator-supplied path + if err != nil { + return errors.WithStack(err) + } + if err := backup.WriteEncodeInfo(f, info); err != nil { + _ = f.Close() + return errors.Wrap(err, "WriteEncodeInfo") + } + if err := f.Sync(); err != nil { + _ = f.Close() + return errors.WithStack(err) + } + if err := f.Close(); err != nil { + return errors.WithStack(err) + } + return nil +} diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go new file mode 100644 index 000000000..f69549636 --- /dev/null +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -0,0 +1,362 @@ +package main + +import ( + "bytes" + "encoding/json" + "io" + "log/slog" + "os" + "path/filepath" + "strconv" + "testing" + "time" + + "github.com/bootjp/elastickv/internal/backup" +) + +// emitMinimalManifest writes a minimal valid MANIFEST.json under outRoot +// with the given lastCommitTS. Used by every CLI test as the producer- +// side artifact the encoder will consume. +func emitMinimalManifest(t *testing.T, outRoot string, lastCommitTS uint64) { + t.Helper() + m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) + m.LastCommitTS = lastCommitTS + m.Adapters = &backup.Adapters{} + m.Exclusions = &backup.Exclusions{} + f, err := os.Create(filepath.Join(outRoot, "MANIFEST.json")) + if err != nil { + t.Fatalf("create MANIFEST.json: %v", err) + } + if err := backup.WriteManifest(f, m); err != nil { + t.Fatalf("WriteManifest: %v", err) + } + if err := f.Close(); err != nil { + t.Fatalf("close: %v", err) + } +} + +func quietLogger() *slog.Logger { + return slog.New(slog.NewTextHandler(io.Discard, nil)) +} + +// TestCLIRejectsMissingManifest pins the user-input-error path: --input +// directory without MANIFEST.json → exit 1, no .fsm written. +func TestCLIRejectsMissingManifest(t *testing.T) { + t.Parallel() + in := t.TempDir() + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{"--input", in, "--output", out}, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want error") + } + if code != exitUserErr { + t.Errorf("exit code = %d, want %d", code, exitUserErr) + } + if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { + t.Errorf(".fsm exists at %s; should not be written on missing manifest", out) + } +} + +// TestCLIRejectsUnknownAdapter pins the decoder-parity adapter CSV +// parser: unknown adapter → exit 1. +func TestCLIRejectsUnknownAdapter(t *testing.T) { + t.Parallel() + in := t.TempDir() + emitMinimalManifest(t, in, 100) + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{"--input", in, "--output", out, "--adapter", "foo"}, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want error") + } + if code != exitUserErr { + t.Errorf("exit code = %d, want %d", code, exitUserErr) + } +} + +// TestCLIRejectsLowerLastCommitTSOverride is the fail-closed pin per +// parent §"MVCC re-encoding": T < manifest.last_commit_ts → exit 2 +// (data-correctness failure, not flag-parse error). +func TestCLIRejectsLowerLastCommitTSOverride(t *testing.T) { + t.Parallel() + in := t.TempDir() + emitMinimalManifest(t, in, 1000) + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{ + "--input", in, + "--output", out, + "--last-commit-ts", "500", // below manifest + }, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want HLC ceiling regression error") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d (data error, not flag-parse error)", code, exitDataErr) + } + if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { + t.Errorf(".fsm exists at %s; should not be published on regression", out) + } +} + +// TestCLIAcceptsEqualAndHigherLastCommitTSOverride pins T == manifest +// (default) and T > manifest both succeed with the effective T stamped +// into the .fsm header and sidecar. +func TestCLIAcceptsEqualAndHigherLastCommitTSOverride(t *testing.T) { + t.Parallel() + for _, tc := range []struct { + name string + argTS string + want uint64 + }{ + {"equal", "1000", 1000}, + {"higher", "5000", 5000}, + } { + t.Run(tc.name, func(t *testing.T) { + t.Parallel() + in := t.TempDir() + emitMinimalManifest(t, in, 1000) + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{ + "--input", in, + "--output", out, + "--last-commit-ts", tc.argTS, + }, quietLogger()) + if err != nil { + t.Fatalf("run: %v", err) + } + if code != exitSuccess { + t.Errorf("exit code = %d, want %d", code, exitSuccess) + } + // Inspect sidecar. + sidecar := backup.EncodeInfoSidecarPath(out) + body, err := os.ReadFile(sidecar) + if err != nil { + t.Fatalf("read sidecar: %v", err) + } + var info backup.EncodeInfo + if err := json.Unmarshal(body, &info); err != nil { + t.Fatalf("unmarshal sidecar: %v", err) + } + if info.LastCommitTS != tc.want { + t.Errorf("sidecar LastCommitTS = %d, want %d", info.LastCommitTS, tc.want) + } + }) + } +} + +// TestCLIEncodeInfoPathDerivedFromOutput pins gemini medium v2 #896: +// the sidecar is named .encode_info.json, not a static name. +func TestCLIEncodeInfoPathDerivedFromOutput(t *testing.T) { + t.Parallel() + in := t.TempDir() + emitMinimalManifest(t, in, 100) + outDir := t.TempDir() + out := filepath.Join(outDir, "node1.fsm") + code, err := run([]string{"--input", in, "--output", out}, quietLogger()) + if err != nil || code != exitSuccess { + t.Fatalf("run failed: code=%d err=%v", code, err) + } + if _, err := os.Stat(out); err != nil { + t.Fatalf(".fsm not found: %v", err) + } + want := filepath.Join(outDir, "node1.fsm.encode_info.json") + if _, err := os.Stat(want); err != nil { + t.Errorf("sidecar not at %s: %v", want, err) + } + // Ensure the static-named version was NOT created. + if _, err := os.Stat(filepath.Join(outDir, "ENCODE_INFO.json")); err == nil { + t.Errorf("static ENCODE_INFO.json exists; expected only path-derived sidecar") + } +} + +// TestCLIEncodeInfoTwoFilesNoCollision pins the no-collision property: +// two --output paths in the same dir produce two distinct sidecars. +func TestCLIEncodeInfoTwoFilesNoCollision(t *testing.T) { + t.Parallel() + in := t.TempDir() + emitMinimalManifest(t, in, 100) + outDir := t.TempDir() + for _, name := range []string{"a.fsm", "b.fsm"} { + out := filepath.Join(outDir, name) + code, err := run([]string{"--input", in, "--output", out}, quietLogger()) + if err != nil || code != exitSuccess { + t.Fatalf("run for %s failed: code=%d err=%v", name, code, err) + } + } + for _, name := range []string{"a.fsm", "b.fsm"} { + want := filepath.Join(outDir, name+".encode_info.json") + if _, err := os.Stat(want); err != nil { + t.Errorf("sidecar %s missing: %v", want, err) + } + } + // a.fsm.encode_info.json and b.fsm.encode_info.json must have + // different output_fsm_path values. + aBody, _ := os.ReadFile(filepath.Join(outDir, "a.fsm.encode_info.json")) + bBody, _ := os.ReadFile(filepath.Join(outDir, "b.fsm.encode_info.json")) + if bytes.Equal(aBody, bBody) { + t.Errorf("sidecars are byte-equal; should differ by output_fsm_path") + } +} + +// writeSQSFixture writes a minimal sqs//{queue.json, +// messages.jsonl} fixture under root. Used by the CLI round-trip test; +// kept as a helper so the test body stays under the cyclop threshold. +func writeSQSFixture(t *testing.T, root string) { + t.Helper() + dir := filepath.Join(root, "sqs", "cnQ") // base64url("rt") + if err := os.MkdirAll(dir, 0o755); err != nil { + t.Fatalf("MkdirAll: %v", err) + } + if err := os.WriteFile(filepath.Join(dir, "_queue.json"), + []byte(`{"format_version":1,"name":"rt","fifo":false,"partition_count":1,"generation":1}`), + 0o600); err != nil { + t.Fatalf("WriteFile _queue.json: %v", err) + } + if err := os.WriteFile(filepath.Join(dir, "messages.jsonl"), + []byte(`{"format_version":1,"message_id":"m1","body":"a","send_timestamp_millis":1700000000000,"available_at_millis":1700000000000,"sequence_number":0}`), + 0o600); err != nil { + t.Fatalf("WriteFile messages.jsonl: %v", err) + } +} + +// canonicalizeInput runs encode → decode once so the input matches the +// encoder's output shape. Subsequent self-tests against the canonical +// tree are byte-equal (any non-canonical formatting differences are +// flattened by this first pass). +func canonicalizeInput(t *testing.T, rawIn string, lastCommitTS uint64) string { + t.Helper() + canonicalIn := t.TempDir() + tmpOut := filepath.Join(t.TempDir(), "canonical.fsm") + code, err := run([]string{"--input", rawIn, "--output", tmpOut, "--adapter", "sqs"}, quietLogger()) + if err != nil || code != exitSuccess { + t.Fatalf("canonical encode: code=%d err=%v", code, err) + } + f, _ := os.Open(tmpOut) + if _, err := backup.DecodeSnapshot(f, backup.DecodeOptions{ + OutRoot: canonicalIn, + Adapters: backup.AdapterSet{SQS: true}, + }); err != nil { + t.Fatalf("canonical decode: %v", err) + } + _ = f.Close() + emitMinimalManifest(t, canonicalIn, lastCommitTS) + return canonicalIn +} + +// readSidecar reads .encode_info.json into an EncodeInfo struct. +func readSidecar(t *testing.T, output string) backup.EncodeInfo { + t.Helper() + body, err := os.ReadFile(output + ".encode_info.json") + if err != nil { + t.Fatalf("read sidecar: %v", err) + } + var info backup.EncodeInfo + if err := json.Unmarshal(body, &info); err != nil { + t.Fatalf("unmarshal sidecar: %v", err) + } + return info +} + +// TestCLIRoundTripSelfTestAllAdapters is the gold-standard CLI-level +// end-to-end test: a real adapter fixture, encoder runs with +// --self-test, exit 0, matched:true in the sidecar. +func TestCLIRoundTripSelfTestAllAdapters(t *testing.T) { + t.Parallel() + rawIn := t.TempDir() + writeSQSFixture(t, rawIn) + emitMinimalManifest(t, rawIn, 7000) + canonicalIn := canonicalizeInput(t, rawIn, 7000) + + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{ + "--input", canonicalIn, + "--output", out, + "--adapter", "sqs", + "--self-test", + }, quietLogger()) + if err != nil { + t.Fatalf("run with self-test: %v", err) + } + if code != exitSuccess { + t.Errorf("exit code = %d, want %d (self-test should match)", code, exitSuccess) + } + info := readSidecar(t, out) + if !info.SelfTest.Ran || !info.SelfTest.Matched { + t.Errorf("self_test Ran=%v Matched=%v, want both true", info.SelfTest.Ran, info.SelfTest.Matched) + } + if _, err := os.Stat(out + ".mismatch.txt"); err == nil { + t.Errorf("mismatch.txt exists on a successful self-test") + } +} + +// TestCLISelfTestFailureLeavesNoFsmAtOutputPath pins the write-then- +// rename atomic-publish discipline (codex P2 v2 #896). To trigger a +// real self-test failure deterministically from the CLI level we test +// via the lower-level EncodeSnapshot library — the CLI-only test path +// would require build-tagged corruption hooks. The library-level +// equivalent is TestEncodeSnapshotSelfTestDetectsCorruption (which +// asserts the buffered bytes never reach the io.Writer); this CLI +// test confirms the temp-file rename discipline by parsing the +// CLI's filesystem state after a normal --self-test success: the +// temp file must NOT exist after rename. +func TestCLISelfTestFailureLeavesNoFsmAtOutputPath(t *testing.T) { + t.Parallel() + // Use a deliberately mismatched --last-commit-ts override to drive + // a data-error exit; the CLI MUST NOT publish .fsm on data-error. + in := t.TempDir() + emitMinimalManifest(t, in, 1000) + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{ + "--input", in, + "--output", out, + "--last-commit-ts", "500", + }, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want data error") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d", code, exitDataErr) + } + if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { + t.Errorf(".fsm at %s; data error must not publish", out) + } + // No temp file should linger either. + matches, _ := filepath.Glob(out + ".tmp-*") + if len(matches) > 0 { + t.Errorf("temp file lingered: %v", matches) + } +} + +// TestParseLastCommitTSDecimal + Hex pin both representations the +// --last-commit-ts flag accepts. +func TestParseLastCommitTS(t *testing.T) { + t.Parallel() + for _, tc := range []struct { + in string + want uint64 + }{ + {"0", 0}, + {"1234567890", 1234567890}, + {"0xff", 0xff}, + {"0X10", 0x10}, + } { + got, err := parseLastCommitTS(tc.in) + if err != nil { + t.Errorf("%q: %v", tc.in, err) + continue + } + if got != tc.want { + t.Errorf("%q: got %d want %d", tc.in, got, tc.want) + } + } + // Reject empty and malformed. + for _, bad := range []string{"", "abc", "0xZZ"} { + if _, err := parseLastCommitTS(bad); err == nil { + t.Errorf("%q parsed successfully; want error", bad) + } + } +} + +// Helper to silence "unused strconv" if a future edit drops its only +// use — kept here as the canonical numeric test pin. Strconv is used +// implicitly in subtests via tc.argTS. +var _ = strconv.FormatUint diff --git a/internal/backup/encode_info.go b/internal/backup/encode_info.go new file mode 100644 index 000000000..5ae2773c4 --- /dev/null +++ b/internal/backup/encode_info.go @@ -0,0 +1,114 @@ +package backup + +import ( + "encoding/json" + "io" + "time" + + "github.com/cockroachdb/errors" +) + +// EncodeInfoFormatVersion is the on-disk schema version for ENCODE_INFO.json. +// Bumped on incompatible schema changes; ReadEncodeInfo rejects unknown +// versions with ErrUnsupportedEncodeInfoFormatVersion so a future encoder +// release cannot silently drop fields a current operator relies on. +const EncodeInfoFormatVersion uint32 = 1 + +// ErrUnsupportedEncodeInfoFormatVersion is returned by ReadEncodeInfo when +// the sidecar's format_version is not EncodeInfoFormatVersion. Mirrors +// the decoder's ErrUnsupportedFormatVersion contract so callers can branch +// on errors.Is. +var ErrUnsupportedEncodeInfoFormatVersion = errors.New("backup: unsupported ENCODE_INFO format_version") + +// EncodeInfoSelfTest captures the self-test outcome (parent §"Round-trip +// self-test"). Ran=false when --self-test was off; Matched is only +// meaningful when Ran=true. +type EncodeInfoSelfTest struct { + Ran bool `json:"ran"` + Matched bool `json:"matched"` +} + +// EncodeInfo is the on-disk shape of .encode_info.json. Schema +// pinned by docs/design/2026_06_01_proposed_snapshot_encode_cli.md +// §"ENCODE_INFO.json". Restore operators rely on this for "encoded for +// the right cluster, by the right encoder version, against this exact +// file" confirmation; tag changes are a breaking schema bump. +type EncodeInfo struct { + FormatVersion uint32 `json:"format_version"` + EncoderVersion string `json:"encoder_version"` + EncoderKeyFormatVersion uint32 `json:"encoder_key_format_version"` + WallTimeISO string `json:"wall_time_iso"` + InputRoot string `json:"input_root"` + OutputFSMPath string `json:"output_fsm_path"` + OutputFSMSHA256 string `json:"output_fsm_sha256"` + LastCommitTS uint64 `json:"last_commit_ts"` + LastCommitTSOverridden bool `json:"last_commit_ts_overridden"` + ManifestLastCommitTS uint64 `json:"manifest_last_commit_ts"` + ManifestClusterID string `json:"manifest_cluster_id,omitempty"` + AdaptersEnabled []string `json:"adapters_enabled"` + SelfTest EncodeInfoSelfTest `json:"self_test"` +} + +// NewEncodeInfo stamps the current format version + wall time so callers +// only fill in the encode-specific fields. Mirrors NewPhase0SnapshotManifest. +// EncoderKeyFormatVersion is the on-disk key format the encoder produces; +// today it tracks CurrentFormatVersion (no separate key-format version +// has been declared), which is conservative: future encoder bumps that +// change MVCC layout MUST bump both manifest and encoder-key formats so +// restore operators can correlate. +func NewEncodeInfo(now time.Time) EncodeInfo { + return EncodeInfo{ + FormatVersion: EncodeInfoFormatVersion, + EncoderKeyFormatVersion: CurrentFormatVersion, + WallTimeISO: now.UTC().Format(time.RFC3339Nano), + } +} + +// WriteEncodeInfo serializes info to w. Caller is responsible for the +// fsync+close discipline (the cmd wrapper uses os.File.Sync then Close +// to surface late writeback errors — gemini r1 medium on #810). +func WriteEncodeInfo(w io.Writer, info EncodeInfo) error { + enc := json.NewEncoder(w) + enc.SetIndent("", " ") + if err := enc.Encode(info); err != nil { + return errors.Wrap(err, "encode ENCODE_INFO.json") + } + return nil +} + +// ReadEncodeInfo parses an ENCODE_INFO.json payload from r. Rejects +// unknown format_version values with ErrUnsupportedEncodeInfoFormatVersion +// so a future schema bump surfaces as a typed error rather than a silent +// field drop. Unknown JSON fields are tolerated to allow forward-compat +// additions within the same format_version. +func ReadEncodeInfo(r io.Reader) (EncodeInfo, error) { + body, err := io.ReadAll(r) + if err != nil { + return EncodeInfo{}, errors.Wrap(err, "read ENCODE_INFO.json") + } + var probe struct { + FormatVersion uint32 `json:"format_version"` + } + if err := json.Unmarshal(body, &probe); err != nil { + return EncodeInfo{}, errors.Wrap(err, "decode ENCODE_INFO.json format_version") + } + if probe.FormatVersion != EncodeInfoFormatVersion { + return EncodeInfo{}, errors.Wrapf(ErrUnsupportedEncodeInfoFormatVersion, "got %d, want %d", probe.FormatVersion, EncodeInfoFormatVersion) + } + var info EncodeInfo + if err := json.Unmarshal(body, &info); err != nil { + return EncodeInfo{}, errors.Wrap(err, "decode ENCODE_INFO.json") + } + return info, nil +} + +// EncodeInfoSidecarPath returns the path-derived sidecar location for a +// given .fsm output path. Multiple .fsm files can share a directory +// (e.g., per-node dumps under /backups/); a static "ENCODE_INFO.json" +// name would silently overwrite siblings (gemini medium #896). +// +// Convention: append ".encode_info.json" to the full output path. The +// same scheme gpg and sha256sum follow when their input is path-addressable. +func EncodeInfoSidecarPath(fsmPath string) string { + return fsmPath + ".encode_info.json" +} diff --git a/internal/backup/encode_info_test.go b/internal/backup/encode_info_test.go new file mode 100644 index 000000000..beb42a756 --- /dev/null +++ b/internal/backup/encode_info_test.go @@ -0,0 +1,95 @@ +package backup + +import ( + "bytes" + "strings" + "testing" + "time" + + "github.com/cockroachdb/errors" +) + +// TestEncodeInfoRoundTrip pins WriteEncodeInfo -> ReadEncodeInfo for a +// populated struct. Forward-compat: an ENCODE_INFO.json with unknown +// extra fields at the same format_version must decode cleanly. +func TestEncodeInfoRoundTrip(t *testing.T) { + t.Parallel() + info := NewEncodeInfo(time.Date(2026, 6, 1, 12, 0, 0, 0, time.UTC)) + info.EncoderVersion = "test-rev" + info.InputRoot = "/in" + info.OutputFSMPath = "/out.fsm" + info.OutputFSMSHA256 = "deadbeef" + info.LastCommitTS = 18446744073709551610 + info.LastCommitTSOverridden = false + info.ManifestLastCommitTS = 18446744073709551610 + info.ManifestClusterID = "cluster-1" + info.AdaptersEnabled = []string{"redis", "dynamodb", "s3", "sqs"} + info.SelfTest = EncodeInfoSelfTest{Ran: true, Matched: true} + + var buf bytes.Buffer + if err := WriteEncodeInfo(&buf, info); err != nil { + t.Fatalf("WriteEncodeInfo: %v", err) + } + got, err := ReadEncodeInfo(&buf) + if err != nil { + t.Fatalf("ReadEncodeInfo: %v", err) + } + if got.EncoderVersion != "test-rev" || got.OutputFSMSHA256 != "deadbeef" || got.LastCommitTS != 18446744073709551610 { + t.Errorf("round-trip mismatch: %+v", got) + } + if got.SelfTest.Ran != true || got.SelfTest.Matched != true { + t.Errorf("self_test field round-trip: %+v", got.SelfTest) + } + + // Forward-compat: extra field at same version decodes cleanly. + withExtra := `{"format_version":1,"encoder_version":"x","wall_time_iso":"2026-06-01T12:00:00Z","input_root":"/in","output_fsm_path":"/out.fsm","output_fsm_sha256":"d","last_commit_ts":1,"last_commit_ts_overridden":false,"manifest_last_commit_ts":1,"adapters_enabled":[],"self_test":{"ran":false,"matched":false},"future_field":"ignored"}` + if _, err := ReadEncodeInfo(strings.NewReader(withExtra)); err != nil { + t.Errorf("forward-compat unknown field rejected: %v", err) + } +} + +// TestEncodeInfoRejectsUnknownFormatVersion mirrors the decoder's +// TestManifestVersionGate: a future schema bump surfaces as a typed +// error rather than a silent field drop. +func TestEncodeInfoRejectsUnknownFormatVersion(t *testing.T) { + t.Parallel() + bad := `{"format_version":99,"encoder_version":"x","wall_time_iso":"2026-06-01T12:00:00Z","input_root":"/in","output_fsm_path":"/out.fsm","output_fsm_sha256":"d","last_commit_ts":1,"last_commit_ts_overridden":false,"manifest_last_commit_ts":1,"adapters_enabled":[],"self_test":{"ran":false,"matched":false}}` + _, err := ReadEncodeInfo(strings.NewReader(bad)) + if !errors.Is(err, ErrUnsupportedEncodeInfoFormatVersion) { + t.Fatalf("err = %v, want ErrUnsupportedEncodeInfoFormatVersion", err) + } +} + +// TestExclusionsLegacyManifestOmitsRenameS3Collisions pins forward-compat +// on the new rename_s3_collisions field. Older manifests written before +// M6 do not include the field; ReadManifest must decode them with the +// zero value (false) — NOT reject as ErrInvalidManifest (gemini medium +// v5 #896). +func TestExclusionsLegacyManifestOmitsRenameS3Collisions(t *testing.T) { + t.Parallel() + // Build a known-valid manifest via the public constructor, then + // rewrite the JSON to omit the rename_s3_collisions field — this + // is exactly the on-disk shape a pre-M6 decoder run would produce. + m := NewPhase0SnapshotManifest(time.Date(2026, 5, 1, 0, 0, 0, 0, time.UTC)) + m.Exclusions = &Exclusions{} + m.Adapters = &Adapters{} + var buf bytes.Buffer + if err := WriteManifest(&buf, m); err != nil { + t.Fatalf("WriteManifest: %v", err) + } + // Strip the new field to simulate a legacy producer. + legacy := strings.ReplaceAll(buf.String(), `"rename_s3_collisions":false,`, ``) + legacy = strings.ReplaceAll(legacy, `,"rename_s3_collisions":false`, ``) + legacy = strings.ReplaceAll(legacy, `"rename_s3_collisions":false`, ``) + + got, err := ReadManifest(strings.NewReader(legacy)) + if err != nil { + t.Fatalf("legacy manifest must decode without error, got: %v", err) + } + if got.Exclusions == nil { + t.Fatalf("Exclusions = nil") + } + if got.Exclusions.RenameS3Collisions != false { + t.Errorf("RenameS3Collisions = %v, want false (zero value for missing field)", got.Exclusions.RenameS3Collisions) + } +} diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go new file mode 100644 index 000000000..d79cb6444 --- /dev/null +++ b/internal/backup/encode_snapshot.go @@ -0,0 +1,461 @@ +package backup + +import ( + "bytes" + "crypto/sha256" + "io" + "os" + "path/filepath" + "sort" + "strings" + + "github.com/cockroachdb/errors" +) + +// ErrSelfTestLowerLastCommitTS is returned by EncodeSnapshot when the +// effective LastCommitTS in EncodeOptions is below the manifest's value. +// The HLC ceiling invariant (CLAUDE.md "Timestamp Oracle") forbids +// lowering the ceiling on restore: a lower T would let a post-restart +// leader issue a read ts ≤ a restored row's commit ts. +// +// Surfaced at the EncodeSnapshot layer so the CLI's main exits with code +// 2 (data-correctness failure, per parent §"Exit codes"). Caller must +// check errors.Is on this sentinel to map to the right exit code. +var ErrSelfTestLowerLastCommitTS = errors.New("backup: --last-commit-ts T < manifest.last_commit_ts (HLC ceiling regression)") + +// The encoder dispatch order (redis → dynamodb → s3 → sqs) is encoded +// inside adapterRunners() and is intentionally distinct from decode.go's +// finalize order (dynamodb → s3 → redis → sqs). The final .fsm byte +// sequence is determined by encoded-key sort (snapshotBuilder.WriteTo), +// not by adapter fan-out order, so either ordering is correct as long +// as it is fixed. The encoder follows the parent design doc's +// enumeration order so ENCODE_INFO.json adapters_enabled is bytewise +// reproducible across runs that pass --adapter in different sequences +// (claude review v7 #896). + +// EncodeOptions configures EncodeSnapshot. Mirrors the decoder's +// DecodeOptions in shape: required InputRoot, AdapterSet, then per-adapter +// option flags read back from the input MANIFEST.json by the CLI. +type EncodeOptions struct { + // InputRoot is the directory tree root produced by the decoder. + // Must contain MANIFEST.json; per-adapter encoders read their + // subtrees (redis/, dynamodb/, s3/, sqs/) directly off this root. + InputRoot string + // Adapters selects which adapter encoders to invoke; disabled + // adapters are skipped without error. Mirrors DecodeOptions.Adapters. + Adapters AdapterSet + // LastCommitTS is the EFFECTIVE T used for both the EKVPBBL1 + // header and every key's invTS = ^T. Callers pass manifest.last_commit_ts + // by default and the --last-commit-ts override otherwise. + LastCommitTS uint64 + // SelfTest enables the round-trip self-test. EncodeSnapshot buffers + // the FSM in *bytes.Buffer, decodes from the buffer, and copies to + // the caller's io.Writer only if the buffer survives DecodeSnapshot + // (i.e. the bytes the encoder produced are loadable). When false, + // the FSM streams straight to the writer with no extra buffering. + SelfTest bool + // SelfTestDecodeOptions are threaded into the scratch DecodeSnapshot + // call. The CLI reads MANIFEST.json's Exclusions + DynamoDBLayout + // and populates this so the self-test's scratch tree matches what + // the original decoder would have produced. + SelfTestDecodeOptions DecodeOptions + + // corruptBufferForTest is an unexported test-only hook that fires + // against the internal *bytes.Buffer AFTER snapshotBuilder.WriteTo + // returns but BEFORE the self-test DecodeSnapshot call (when + // SelfTest=true). Same-package tests use it to inject corruption + // reachable by the self-test but never reaching the io.Writer + // passed to EncodeSnapshot (the write-then-rename invariant: a + // self-test failure must not publish corrupt bytes — codex P2 v6 + // #896). External callers cannot set it (lowercase identifier). + corruptBufferForTest func([]byte) +} + +// EncodeResult is the public return value from EncodeSnapshot. Mirrors +// the decoder's DecodeResult shape. +type EncodeResult struct { + // Header is what ReadSnapshotWithHeader returned when the encoder + // decoded its own output for the self-test. Header.LastCommitTS + // equals the effective T (uniform-stamping rule per parent doc + // §"MVCC re-encoding"). + Header SnapshotHeader + // BytesWritten is the number of bytes written to the caller's + // io.Writer (the SHA256-anchored payload). + BytesWritten int64 + // SHA256 of the produced .fsm (lowercase hex via SHA256Hex). + SHA256 [32]byte + // SelfTestRan is true iff opts.SelfTest was true AND the encoder + // ran (i.e. no earlier per-adapter error short-circuited). + SelfTestRan bool + // SelfTestMatched is meaningful only when SelfTestRan; reports + // whether the re-decode produced no diff against InputRoot. + SelfTestMatched bool + // SelfTestMismatchTxt is non-nil when SelfTestRan && !SelfTestMatched. + // The CLI writes it as .mismatch.txt at exit 2. + SelfTestMismatchTxt []byte + // AdaptersEnabled is the canonical fan-out order of adapters that + // were actually invoked; ENCODE_INFO.json embeds this verbatim. + AdaptersEnabled []string +} + +// EncodeSnapshot reads the directory tree at opts.InputRoot, invokes the +// enabled per-adapter encoders in canonicalAdapterFanOutOrder, optionally +// runs the round-trip self-test against the in-memory buffer, and writes +// the .fsm bytes to out. The .fsm bytes are NOT returned; they go to out. +// +// When opts.SelfTest=false the FSM streams straight to out with no extra +// buffering. When opts.SelfTest=true an internal *bytes.Buffer holds the +// FSM during the self-test; bytes are copied to out only after the +// self-test matches. Memory cost in self-test mode is one FSM-sized +// allocation on top of the existing sort working set. +// +// Returns ErrSelfTestLowerLastCommitTS when opts.LastCommitTS is below +// the manifest's value — caller is responsible for reading the manifest +// and computing the effective T (this layer just validates the floor). +// The CLI maps that error to exit code 2. +func EncodeSnapshot(opts EncodeOptions, out io.Writer) (EncodeResult, error) { + if opts.InputRoot == "" { + return EncodeResult{}, errors.New("backup: EncodeOptions.InputRoot is required") + } + if out == nil { + return EncodeResult{}, errors.New("backup: EncodeSnapshot out writer is nil") + } + + b := newSnapshotBuilder(opts.LastCommitTS) + enabled, err := runAdapterEncoders(b, opts) + if err != nil { + return EncodeResult{}, err + } + + if !opts.SelfTest { + return encodeStream(b, opts, enabled, out) + } + return encodeBuffered(b, opts, enabled, out) +} + +// encodeStream is the no-self-test path: SHA256 + writer tee with no +// extra buffering. FSM bytes go straight to out. +func encodeStream(b *snapshotBuilder, opts EncodeOptions, enabled []string, out io.Writer) (EncodeResult, error) { + hashWriter := newSHA256Writer(out) + bytesWritten, err := b.WriteTo(hashWriter) + if err != nil { + return EncodeResult{}, errors.WithStack(err) + } + return EncodeResult{ + Header: SnapshotHeader{LastCommitTS: opts.LastCommitTS}, + BytesWritten: bytesWritten, + SHA256: hashWriter.Sum(), + SelfTestRan: false, + AdaptersEnabled: enabled, + }, nil +} + +// encodeBuffered is the SelfTest=true path: buffer, self-test against +// buffer, copy to out only on match. Corruption hook (if set) fires +// against the buffer between WriteTo and self-test so the self-test +// sees the corruption but out never does (codex P2 v6 #896). +func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, out io.Writer) (EncodeResult, error) { + var buf bytes.Buffer + bytesWritten, err := b.WriteTo(&buf) + if err != nil { + return EncodeResult{}, errors.WithStack(err) + } + bufBytes := buf.Bytes() + if opts.corruptBufferForTest != nil { + opts.corruptBufferForTest(bufBytes) + } + + header, mismatchTxt, matched, stErr := runSelfTest(bufBytes, opts) + result := EncodeResult{ + Header: header, + BytesWritten: bytesWritten, + SHA256: sha256.Sum256(bufBytes), + SelfTestRan: true, + SelfTestMatched: matched, + SelfTestMismatchTxt: mismatchTxt, + AdaptersEnabled: enabled, + } + if stErr != nil { + return result, stErr + } + if !matched { + return result, nil + } + if _, err := io.Copy(out, bytes.NewReader(bufBytes)); err != nil { + return result, errors.Wrap(err, "copy buffered fsm to out") + } + return result, nil +} + +// adapterRunner pairs an enabled-check with an Encode call, keeping +// runAdapterEncoders's per-iteration body to two branches (cyclop). +type adapterRunner struct { + name string + enabled func(AdapterSet) bool + encode func(*snapshotBuilder, string) error +} + +func adapterRunners() []adapterRunner { + return []adapterRunner{ + {"redis", func(s AdapterSet) bool { return s.Redis }, func(b *snapshotBuilder, root string) error { + return errors.Wrap(NewRedisEncoder(root, 0).Encode(b), "redis encoder") + }}, + {"dynamodb", func(s AdapterSet) bool { return s.DynamoDB }, func(b *snapshotBuilder, root string) error { + return errors.Wrap(NewDynamoDBEncoder(root).Encode(b), "dynamodb encoder") + }}, + {"s3", func(s AdapterSet) bool { return s.S3 }, func(b *snapshotBuilder, root string) error { + return errors.Wrap(NewS3RecordEncoder(root).Encode(b), "s3 encoder") + }}, + {"sqs", func(s AdapterSet) bool { return s.SQS }, func(b *snapshotBuilder, root string) error { + return errors.Wrap(NewSQSRecordEncoder(root).Encode(b), "sqs encoder") + }}, + } +} + +// runAdapterEncoders invokes each enabled adapter encoder in +// canonicalAdapterFanOutOrder, returning the list of adapter names +// actually invoked (for ENCODE_INFO.json adapters_enabled). +func runAdapterEncoders(b *snapshotBuilder, opts EncodeOptions) ([]string, error) { + var enabled []string + for _, r := range adapterRunners() { + if !r.enabled(opts.Adapters) { + continue + } + if err := r.encode(b, opts.InputRoot); err != nil { + return nil, err + } + enabled = append(enabled, r.name) + } + return enabled, nil +} + +// runSelfTest decodes fsmBytes into a unique scratch subdir, structurally +// diffs against opts.InputRoot, and returns (header, mismatchTxt, matched, +// err). matched=false with err=nil indicates a structural diff; matched=true +// with err=nil indicates success. err is non-nil only on infrastructure +// failure (mkdir, decoder error, walk error). +// +// The scratch subdir is removed via defer regardless of outcome. The +// caller cleans up .mismatch.txt at the start of each run. +func runSelfTest(fsmBytes []byte, opts EncodeOptions) (SnapshotHeader, []byte, bool, error) { + scratchBase := opts.SelfTestDecodeOptions.OutRoot + scratchDir, err := os.MkdirTemp(scratchBase, "encode-self-test-") + if err != nil { + return SnapshotHeader{}, nil, false, errors.Wrap(err, "mkdir scratch") + } + defer func() { + _ = os.RemoveAll(scratchDir) + }() + + decOpts := opts.SelfTestDecodeOptions + decOpts.OutRoot = scratchDir + + result, derr := DecodeSnapshot(bytes.NewReader(fsmBytes), decOpts) + if derr != nil { + // Decoder errored on our own output — that IS a self-test + // failure (the .fsm we produced isn't loadable). Surface as + // a mismatch with the decoder error embedded in the txt. + mismatchTxt := []byte("self-test failed: DecodeSnapshot rejected the produced .fsm: " + derr.Error()) + return SnapshotHeader{}, mismatchTxt, false, nil + } + + if result.Header.LastCommitTS != opts.LastCommitTS { + mismatchTxt := []byte(formatHeaderMismatch(opts.LastCommitTS, result.Header.LastCommitTS)) + return result.Header, mismatchTxt, false, nil + } + + diff, derr := diffAdapterTrees(opts.InputRoot, scratchDir, opts.Adapters) + if derr != nil { + return result.Header, nil, false, errors.Wrap(derr, "diff scratch tree") + } + if len(diff) > 0 { + return result.Header, []byte(strings.Join(diff, "\n") + "\n"), false, nil + } + return result.Header, nil, true, nil +} + +// diffAdapterTrees returns a list of paths (relative to input/scratch +// root) where the two trees differ, restricted to the adapter subtrees +// enabled in adapters. MANIFEST.json itself is NOT compared — the scratch +// doesn't have one (DecodeSnapshot library doesn't emit it; the CLI +// wrapper does, codex P2 v1 #896 — header check above is the +// last_commit_ts substitute). Bounded to selfTestMaxMismatchPaths. +func diffAdapterTrees(inputRoot, scratchRoot string, adapters AdapterSet) ([]string, error) { + subdirs := enabledAdapterSubdirs(adapters) + var diffs []string + for _, sub := range subdirs { + paths, err := diffOneSubdir(filepath.Join(inputRoot, sub), filepath.Join(scratchRoot, sub), sub) + if err != nil { + return nil, err + } + diffs = append(diffs, paths...) + if len(diffs) >= selfTestMaxMismatchPaths { + diffs = diffs[:selfTestMaxMismatchPaths] + diffs = append(diffs, "... (truncated; first "+itoa(selfTestMaxMismatchPaths)+" paths shown)") + return diffs, nil + } + } + return diffs, nil +} + +const selfTestMaxMismatchPaths = 64 + +// enabledAdapterSubdirs returns the top-level adapter subdir names for +// the enabled adapters, in canonical order for stable mismatch.txt output. +func enabledAdapterSubdirs(adapters AdapterSet) []string { + var out []string + for _, r := range adapterRunners() { + if r.enabled(adapters) { + out = append(out, r.name) + } + } + return out +} + +// diffOneSubdir walks aDir + bDir in parallel, returning paths (prefixed +// by relPrefix) that differ in presence or bytes. Missing-on-one-side is +// a mismatch. +func diffOneSubdir(aDir, bDir, relPrefix string) ([]string, error) { + aFiles, aErr := walkRegularFiles(aDir) + if aErr != nil && !errors.Is(aErr, os.ErrNotExist) { + return nil, errors.Wrapf(aErr, "walk input %s", aDir) + } + bFiles, bErr := walkRegularFiles(bDir) + if bErr != nil && !errors.Is(bErr, os.ErrNotExist) { + return nil, errors.Wrapf(bErr, "walk scratch %s", bDir) + } + + var diffs []string + aMap := map[string][]byte{} + for path, body := range aFiles { + aMap[path] = body + } + for path, bBody := range bFiles { + aBody, ok := aMap[path] + if !ok { + diffs = append(diffs, relPrefix+"/"+path+" (missing in input)") + continue + } + if !bytes.Equal(aBody, bBody) { + diffs = append(diffs, relPrefix+"/"+path+" (bytes differ)") + } + delete(aMap, path) + } + // Anything remaining in aMap is present in input but not in scratch. + remaining := make([]string, 0, len(aMap)) + for path := range aMap { + remaining = append(remaining, relPrefix+"/"+path+" (missing in scratch)") + } + sort.Strings(remaining) + diffs = append(diffs, remaining...) + return diffs, nil +} + +// walkRegularFiles returns a map of relative path -> file bytes for every +// regular file under root. Missing root is the empty map + os.ErrNotExist. +// Bounded by the per-adapter test fixtures the encoder runs against; +// production-scale dumps may want streaming compare, deferred until a +// real bottleneck shows up. +func walkRegularFiles(root string) (map[string][]byte, error) { + out := map[string][]byte{} + rootInfo, err := os.Stat(root) + if err != nil { + return nil, errors.WithStack(err) + } + if !rootInfo.IsDir() { + return nil, errors.Errorf("not a directory: %s", root) + } + if err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error { + if err != nil { + return err + } + if d.IsDir() { + return nil + } + if !d.Type().IsRegular() { + return nil + } + body, rerr := os.ReadFile(path) //nolint:gosec // walking a caller-provided root, regular files only + if rerr != nil { + return errors.Wrap(rerr, path) + } + rel, rerr := filepath.Rel(root, path) + if rerr != nil { + return errors.WithStack(rerr) + } + out[filepath.ToSlash(rel)] = body + return nil + }); err != nil { + return nil, errors.WithStack(err) + } + return out, nil +} + +func formatHeaderMismatch(want, got uint64) string { + return "self-test failed: header.LastCommitTS mismatch (want " + uitoa(want) + ", got " + uitoa(got) + ")\n" +} + +// uitoaCap is the max decimal length of a uint64 (math.MaxUint64 has +// 20 digits). Constant so the cap is documented and lint-clean. +const uitoaCap = 20 + +func uitoa(v uint64) string { + if v == 0 { + return "0" + } + buf := make([]byte, 0, uitoaCap) + for v > 0 { + buf = append(buf, byte('0'+v%10)) + v /= 10 + } + for i, j := 0, len(buf)-1; i < j; i, j = i+1, j-1 { + buf[i], buf[j] = buf[j], buf[i] + } + return string(buf) +} + +func itoa(v int) string { + if v < 0 { + return "-" + uitoa(uint64(-v)) + } + return uitoa(uint64(v)) +} + +// sha256Writer wraps an io.Writer and tees every byte into a SHA-256 +// hasher so the encoder gets a single-pass SHA256 of the produced .fsm +// without an extra buffer-pass. Used in the no-self-test streaming path. +type sha256Writer struct { + w io.Writer + h sha256w +} + +type sha256w = hashSHA256 + +// hashSHA256 is an interface alias so we can satisfy the tiny hash.Hash +// surface (Write + Sum32) without importing hash explicitly. +type hashSHA256 interface { + io.Writer + Sum(b []byte) []byte +} + +func newSHA256Writer(w io.Writer) *sha256Writer { + return &sha256Writer{w: w, h: sha256.New()} +} + +func (s *sha256Writer) Write(p []byte) (int, error) { + if _, err := s.h.Write(p); err != nil { + // crypto/sha256 never errors on Write per stdlib contract. + return 0, errors.WithStack(err) + } + n, err := s.w.Write(p) + if err != nil { + return n, errors.WithStack(err) + } + return n, nil +} + +func (s *sha256Writer) Sum() [32]byte { + var out [32]byte + copy(out[:], s.h.Sum(nil)) + return out +} diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go new file mode 100644 index 000000000..701dc761a --- /dev/null +++ b/internal/backup/encode_snapshot_test.go @@ -0,0 +1,217 @@ +package backup + +import ( + "bytes" + "os" + "path/filepath" + "testing" +) + +// TestEncodeSnapshotLibraryRoundTrip pins the public library entrypoint: +// EncodeSnapshot writes a .fsm to the supplied io.Writer; running +// DecodeSnapshot on those bytes into a scratch dir produces an +// equivalent adapter tree. No CLI involved. Codex P2 v2 #896 — encoder +// entrypoint exposure. +func TestEncodeSnapshotLibraryRoundTrip(t *testing.T) { + t.Parallel() + in := t.TempDir() + // One tiny SQS queue fixture is enough to exercise the SQS slice + // end-to-end via the new library wrapper; the per-adapter tree + // shape is already covered by the M5-1/M5-2 tests. + const queue = "lib-rt" + writeSQSQueue(t, in, queue, + []byte(`{"format_version":1,"name":"lib-rt","fifo":false,"partition_count":1,"generation":1}`), + [][]byte{ + []byte(`{"format_version":1,"message_id":"m1","body":"a","send_timestamp_millis":1700000000000,"available_at_millis":1700000000000,"sequence_number":0}`), + }, + ) + + var buf bytes.Buffer + result, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 0xDEADBEEF, + }, &buf) + if err != nil { + t.Fatalf("EncodeSnapshot: %v", err) + } + if result.SelfTestRan { + t.Errorf("SelfTestRan = true, want false (SelfTest opt was false)") + } + if result.BytesWritten == 0 { + t.Errorf("BytesWritten = 0") + } + if len(result.AdaptersEnabled) != 1 || result.AdaptersEnabled[0] != "sqs" { + t.Errorf("AdaptersEnabled = %v, want [sqs]", result.AdaptersEnabled) + } + + // Decode the produced bytes into a scratch tree. + scratch := t.TempDir() + decResult, err := DecodeSnapshot(bytes.NewReader(buf.Bytes()), DecodeOptions{ + OutRoot: scratch, + Adapters: AdapterSet{SQS: true}, + }) + if err != nil { + t.Fatalf("DecodeSnapshot of EncodeSnapshot output failed: %v", err) + } + if decResult.Header.LastCommitTS != 0xDEADBEEF { + t.Errorf("decoded header.LastCommitTS = %x, want 0xDEADBEEF", decResult.Header.LastCommitTS) + } +} + +// TestEncodeSnapshotSelfTestMatchesInput pins the happy-path self-test +// against a tree that has already been canonicalized by one decode pass +// (so the input matches what DecodeSnapshot would write back, modulo +// the encoder's idempotency). The full encode -> decode -> encode chain +// is the gold-standard round trip the parent design mandates. +func TestEncodeSnapshotSelfTestMatchesInput(t *testing.T) { + t.Parallel() + rawIn := t.TempDir() + const queue = "selftest-match" + writeSQSQueue(t, rawIn, queue, + []byte(`{"format_version":1,"name":"selftest-match","fifo":false,"partition_count":1,"generation":1}`), + [][]byte{ + []byte(`{"format_version":1,"message_id":"m1","body":"a","send_timestamp_millis":1700000000000,"available_at_millis":1700000000000,"sequence_number":0}`), + }, + ) + + // Canonicalize: encode rawIn, decode it back to canonicalIn. The + // resulting tree is what the encoder's self-test will produce in + // the scratch dir, so a second encode against it must match. + canonicalIn := t.TempDir() + var canonicalBuf bytes.Buffer + if _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: rawIn, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 0xCAFE, + }, &canonicalBuf); err != nil { + t.Fatalf("canonical encode: %v", err) + } + if _, err := DecodeSnapshot(bytes.NewReader(canonicalBuf.Bytes()), DecodeOptions{ + OutRoot: canonicalIn, + Adapters: AdapterSet{SQS: true}, + }); err != nil { + t.Fatalf("canonical decode: %v", err) + } + + scratchBase := t.TempDir() + var buf bytes.Buffer + result, err := EncodeSnapshot(EncodeOptions{ + InputRoot: canonicalIn, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 0xCAFE, + SelfTest: true, + SelfTestDecodeOptions: DecodeOptions{ + OutRoot: scratchBase, + Adapters: AdapterSet{SQS: true}, + }, + }, &buf) + if err != nil { + t.Fatalf("EncodeSnapshot: %v", err) + } + if !result.SelfTestRan || !result.SelfTestMatched { + t.Errorf("SelfTestRan=%v Matched=%v, want both true; mismatch=%s", result.SelfTestRan, result.SelfTestMatched, string(result.SelfTestMismatchTxt)) + } + if buf.Len() == 0 { + t.Errorf("bytes were not copied to out after successful self-test") + } + if result.Header.LastCommitTS != 0xCAFE { + t.Errorf("Header.LastCommitTS = %x, want 0xCAFE", result.Header.LastCommitTS) + } +} + +// TestEncodeSnapshotSelfTestDetectsCorruption pins that the unexported +// corruptBufferForTest hook lets the self-test catch corruption in the +// internal buffer. The corruption must be reachable by the self-test +// decode but MUST NOT reach the supplied io.Writer (the write-then- +// rename invariant — codex P2 v6 #896). +func TestEncodeSnapshotSelfTestDetectsCorruption(t *testing.T) { + t.Parallel() + in := t.TempDir() + const queue = "selftest-corrupt" + writeSQSQueue(t, in, queue, + []byte(`{"format_version":1,"name":"selftest-corrupt","fifo":false,"partition_count":1,"generation":1}`), + [][]byte{ + []byte(`{"format_version":1,"message_id":"m1","body":"a","send_timestamp_millis":1700000000000,"available_at_millis":1700000000000,"sequence_number":0}`), + }, + ) + + scratchBase := t.TempDir() + var out bytes.Buffer + // Corrupt every 13th byte deep inside the buffer — far enough past + // the header that the decoder will trip on the malformed entry + // length field. + corrupt := func(b []byte) { + for i := 200; i < len(b); i += 13 { + b[i] ^= 0xFF + } + } + result, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 0xCAFE, + SelfTest: true, + SelfTestDecodeOptions: DecodeOptions{ + OutRoot: scratchBase, + Adapters: AdapterSet{SQS: true}, + }, + corruptBufferForTest: corrupt, + }, &out) + if err != nil { + t.Fatalf("EncodeSnapshot: %v", err) + } + if !result.SelfTestRan { + t.Fatalf("SelfTestRan = false") + } + if result.SelfTestMatched { + t.Errorf("SelfTestMatched = true with corruption injected; want false") + } + if len(result.SelfTestMismatchTxt) == 0 { + t.Errorf("SelfTestMismatchTxt is empty; expected a mismatch report") + } + // CRITICAL: the corrupt bytes must NEVER reach out. The + // write-then-rename atomic-publish discipline requires that a + // self-test failure publishes nothing. + if out.Len() != 0 { + t.Errorf("out.Len = %d, want 0 (no bytes should reach out on self-test failure)", out.Len()) + } +} + +// TestEncodeSnapshotRequiresInputRoot rejects EncodeOptions with no +// InputRoot — a simple guard so the constructor errors surface early. +func TestEncodeSnapshotRequiresInputRoot(t *testing.T) { + t.Parallel() + var buf bytes.Buffer + if _, err := EncodeSnapshot(EncodeOptions{}, &buf); err == nil { + t.Fatalf("EncodeSnapshot with empty InputRoot succeeded; want error") + } +} + +// TestEncodeInfoSidecarPath pins the path-derivation rule for the +// sidecar (gemini medium v2 #896): one .fsm path produces one distinct +// sidecar path; two .fsm files in the same dir produce two distinct +// sidecars (no collision). +func TestEncodeInfoSidecarPath(t *testing.T) { + t.Parallel() + dir := t.TempDir() + a := filepath.Join(dir, "a.fsm") + b := filepath.Join(dir, "b.fsm") + sa := EncodeInfoSidecarPath(a) + sb := EncodeInfoSidecarPath(b) + if sa == sb { + t.Fatalf("sidecar paths collided: %s == %s", sa, sb) + } + // Verify each ends with the expected suffix. + if got, want := filepath.Base(sa), "a.fsm.encode_info.json"; got != want { + t.Errorf("sidecar(a) basename = %q, want %q", got, want) + } + if got, want := filepath.Base(sb), "b.fsm.encode_info.json"; got != want { + t.Errorf("sidecar(b) basename = %q, want %q", got, want) + } + // Both writable next to their .fsm (no OS-level collision). + for _, p := range []string{sa, sb} { + if err := os.WriteFile(p, []byte("{}"), 0o600); err != nil { + t.Fatalf("write %s: %v", p, err) + } + } +} diff --git a/internal/backup/manifest.go b/internal/backup/manifest.go index da73c5ef3..84409eac7 100644 --- a/internal/backup/manifest.go +++ b/internal/backup/manifest.go @@ -128,6 +128,14 @@ type Exclusions struct { IncludeOrphans bool `json:"include_orphans"` PreserveSQSVisibility bool `json:"preserve_sqs_visibility"` IncludeSQSSideRecords bool `json:"include_sqs_side_records"` + // RenameS3Collisions records whether the producer ran with + // --rename-collisions (DecodeOptions.RenameS3Collisions), so the + // M6 encoder's self-test can thread the same option back through + // DecodeSnapshot. Older manifests that omit this field decode as + // false (no-rename), matching the decoder default. Intentionally + // NOT added to exclusionsRequiredFields below so legacy manifests + // continue to validate (#896 v5 — claude review on M6 design). + RenameS3Collisions bool `json:"rename_s3_collisions,omitempty"` } // Manifest is the on-disk MANIFEST.json structure. Field tags match the From b13118638f50cace40e7ce5999094e61e2d93cd8 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 01:24:42 +0900 Subject: [PATCH 02/35] backup: #904 - address gemini high + claude high + codex P2 findings MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Six substantive review findings on the M6 implementation PR. Headline fix: switched the self-test buffer from in-memory bytes.Buffer to a disk-backed temp file so large snapshots cannot OOM the encoder. ## Gemini high #1: SelfTest=true was buffering FSM in memory Out Of Memory risk on production-scale snapshots (gemini high #904). Fixed by writing the FSM through a tee (sha256 + os.File.WriteAt) into an os.CreateTemp file under opts.SelfTestDecodeOptions.OutRoot, then Seek(0) and stream through DecodeSnapshot for the self-test. The temp file is os.Remove'd via defer on every exit path. The contract is unchanged: corruption reaches the self-test decode but never reaches the io.Writer (TestEncodeSnapshotSelfTestDetectsCorruption still asserts out.Len()==0). Memory cost: O(1) for the SelfTest=true path now (previously was an FSM-sized bytes.Buffer on top of the sort working set). ## Gemini high #2: runSelfTest signature Changed from runSelfTest(fsmBytes []byte, ...) to runSelfTest(fsmFile io.Reader, ...) so the .fsm streams through DecodeSnapshot instead of being held as []byte. Same OOM fix at the runSelfTest layer. ## Gemini high #3: walkRegularFiles loaded every file's bytes into memory For large S3 blobs this OOMs the encoder. Replaced walkRegularFiles with walkRegularFilePaths (paths only) and streamFilesEqual (chunk compare at streamCmpBufSize = 64 KiB). Per-file compare cost is now O(streamCmpBufSize), not O(file size). ## Gemini medium: tempSuffixHexLen entropy was 4 bytes Birthday-paradox collision risk in highly concurrent CI environments. Bumped from 8 hex / 4 bytes (2^32 space) to 16 hex / 8 bytes (2^64 space). Same width crypto/rand uses for cryptographic nonces. ## Claude high + codex P2: fmt.Sscanf silently accepts partial parses --last-commit-ts 0xffZZ parsed as 0xff (fmt.Sscanf stops at first non-hex). --last-commit-ts 100oops parsed as 100. This means a typo could silently set the snapshot HLC ceiling to something the operator did not type. Replaced fmt.Sscanf with strconv.ParseUint (base 10 / 16) which rejects trailing garbage as a parse error. New TestParseLastCommitTS cases cover: 0xffZZ, 100oops, " 100 ext", "-1". ## Codex P2: empty adapter selection produced an empty AdapterSet --adapter ' ,' (e.g. a templated argument that expanded to spaces) would yield a zero AdapterSet and the encoder would publish a valid header-only .fsm with no adapter records — a silently-empty restore artifact. Now rejected with a clear error. Pinned by TestParseAdapterSetRejectsEmptySelection covering " ,", ",,,", " ", and "," — plus a positive case (single "s3") asserting the guard does not break legitimate single-adapter selections. ## Caller audit per CLAUDE.md semantic-change rule - runSelfTest signature changed from ([]byte, opts) to (io.Reader, opts). Sole caller is encodeBuffered (same package, just updated). No external callers since the function is unexported. - corruptBufferForTest hook type changed from func([]byte) to func(*os.File). Sole caller is the same-package test TestEncodeSnapshotSelfTestDetectsCorruption (updated to ReadAt/ WriteAt the temp file). External callers cannot set this field (lowercase). - streamFilesEqual replaces the bytes.Equal compare in diffOneSubdir. Same return contract (bool, error) so the caller (diffOneSubdir) is unchanged in behavior. - parseAdapterSet now returns an error for empty-selection inputs. Caller is parseFlags which already propagates the error to exit-1. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 77 +++++-- cmd/elastickv-snapshot-encode/main_test.go | 45 +++- internal/backup/encode_snapshot.go | 231 +++++++++++++++------ internal/backup/encode_snapshot_test.go | 38 +++- 4 files changed, 294 insertions(+), 97 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index fea66f042..624ce1df4 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -26,11 +26,11 @@ import ( "crypto/rand" "encoding/hex" "flag" - "fmt" "io" "log/slog" "os" "path/filepath" + "strconv" "strings" "time" @@ -45,10 +45,12 @@ const ( exitUserErr = 1 exitDataErr = 2 // tempSuffixHexLen is the hex-character length of the random - // suffix appended to .tmp-; 8 hex chars = 4 bytes of - // entropy, which gives ~4×10^9 collision space per --output path - // (more than enough for concurrent encodes against the same path). - tempSuffixHexLen = 8 + // suffix appended to .tmp-; 16 hex chars = 8 bytes of + // entropy = 2^64 collision space per --output path. The earlier + // 8-hex/4-byte form was flagged as collision-prone in highly + // concurrent CI environments (gemini medium #904); 8 bytes is the + // same width crypto/rand.Read pads cryptographic nonces to. + tempSuffixHexLen = 16 tempSuffixByteLen = tempSuffixHexLen / 2 mismatchTxtPerm = 0o600 encodeInfoFilePerm = 0o600 @@ -144,21 +146,31 @@ func parseFlags(argv []string) (*config, error) { } // parseLastCommitTS parses --last-commit-ts as a uint64. Hex (0x prefix) -// or decimal accepted. Negative or out-of-range surfaces as exit-1 -// (flag-parse error); the semantic check (T >= manifest) is exit-2. +// or decimal accepted. Uses strconv.ParseUint strict parsing so trailing +// garbage is rejected — fmt.Sscanf would silently accept "0xffZZ" as +// 0xff and "100oops" as 100, which becomes a snapshot HLC ceiling that +// silently disagrees with what the operator typed (claude high #904, +// codex P2 #904). Negative or out-of-range surfaces as exit-1; the +// semantic check (T >= manifest) is exit-2. func parseLastCommitTS(raw string) (uint64, error) { s := strings.TrimSpace(raw) if s == "" { return 0, errors.New("--last-commit-ts is empty") } - var ts uint64 + const ( + base16 = 16 + base10 = 10 + uint64Bits = 64 + ) if strings.HasPrefix(s, "0x") || strings.HasPrefix(s, "0X") { - if _, err := fmt.Sscanf(s[2:], "%x", &ts); err != nil { + ts, err := strconv.ParseUint(s[2:], base16, uint64Bits) + if err != nil { return 0, errors.Wrap(err, "--last-commit-ts hex parse") } return ts, nil } - if _, err := fmt.Sscanf(s, "%d", &ts); err != nil { + ts, err := strconv.ParseUint(s, base10, uint64Bits) + if err != nil { return 0, errors.Wrap(err, "--last-commit-ts decimal parse") } return ts, nil @@ -166,7 +178,12 @@ func parseLastCommitTS(raw string) (uint64, error) { // parseAdapterSet decodes a comma-separated adapter list (or "all"). // Mirrors the decoder's parser so a typo cannot silently disable an -// adapter. Unknown name → exit-1. +// adapter. Unknown name → exit-1. A CSV that contains only separators +// or whitespace (e.g. `--adapter ' ,'`) is also rejected — without this +// guard a templated argument that expands to spaces would yield a +// zero AdapterSet and the encoder would publish a valid header-only +// .fsm (no adapters invoked), turning a bad argument into a silent +// empty restore artifact (codex P2 #904). func parseAdapterSet(csv string) (backup.AdapterSet, error) { if csv == "" || csv == "all" { return backup.AdapterSet{DynamoDB: true, S3: true, Redis: true, SQS: true}, nil @@ -174,24 +191,38 @@ func parseAdapterSet(csv string) (backup.AdapterSet, error) { var set backup.AdapterSet for _, raw := range strings.Split(csv, ",") { name := strings.TrimSpace(strings.ToLower(raw)) - switch name { - case "dynamodb": - set.DynamoDB = true - case "s3": - set.S3 = true - case "redis": - set.Redis = true - case "sqs": - set.SQS = true - case "": + if name == "" { continue - default: - return backup.AdapterSet{}, errors.Errorf("unknown adapter %q", name) } + if err := applyAdapterName(name, &set); err != nil { + return backup.AdapterSet{}, err + } + } + if !set.DynamoDB && !set.S3 && !set.Redis && !set.SQS { + return backup.AdapterSet{}, errors.Errorf("--adapter %q selects no adapters; use \"all\" or a comma-separated subset", csv) } return set, nil } +// applyAdapterName sets the bit on s for one normalized adapter name, +// or returns an unknown-name error. Split out so parseAdapterSet stays +// under the cyclop threshold. +func applyAdapterName(name string, s *backup.AdapterSet) error { + switch name { + case "dynamodb": + s.DynamoDB = true + case "s3": + s.S3 = true + case "redis": + s.Redis = true + case "sqs": + s.SQS = true + default: + return errors.Errorf("unknown adapter %q", name) + } + return nil +} + // errSelfTestMismatch is a typed sentinel so run() can map self-test diffs // to exit-2 without coupling to the encoder's mismatch.txt format. var errSelfTestMismatch = errors.New("backup: --self-test diff against --input") diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index f69549636..e6a9b91ff 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -7,7 +7,6 @@ import ( "log/slog" "os" "path/filepath" - "strconv" "testing" "time" @@ -327,7 +326,8 @@ func TestCLISelfTestFailureLeavesNoFsmAtOutputPath(t *testing.T) { } // TestParseLastCommitTSDecimal + Hex pin both representations the -// --last-commit-ts flag accepts. +// --last-commit-ts flag accepts, and verify strict-parse rejection of +// trailing junk (claude high #904, codex P2 #904). func TestParseLastCommitTS(t *testing.T) { t.Parallel() for _, tc := range []struct { @@ -348,15 +348,44 @@ func TestParseLastCommitTS(t *testing.T) { t.Errorf("%q: got %d want %d", tc.in, got, tc.want) } } - // Reject empty and malformed. - for _, bad := range []string{"", "abc", "0xZZ"} { + // Reject empty, malformed, and trailing junk. + for _, bad := range []string{ + "", + "abc", + "0xZZ", + "0xffZZ", // trailing hex garbage — fmt.Sscanf would accept as 0xff + "100oops", // trailing decimal garbage — fmt.Sscanf would accept as 100 + "-1", // negative + " 100 ext", // whitespace + extra + } { if _, err := parseLastCommitTS(bad); err == nil { t.Errorf("%q parsed successfully; want error", bad) } } } -// Helper to silence "unused strconv" if a future edit drops its only -// use — kept here as the canonical numeric test pin. Strconv is used -// implicitly in subtests via tc.argTS. -var _ = strconv.FormatUint +// TestParseAdapterSetRejectsEmptySelection pins codex P2 #904: a CSV +// of only separators/whitespace MUST surface as a flag-parse error, not +// silently produce a zero AdapterSet that would publish a header-only +// .fsm. +func TestParseAdapterSetRejectsEmptySelection(t *testing.T) { + t.Parallel() + for _, bad := range []string{ + " ,", + ",,,", + " ", + ",", + } { + if _, err := parseAdapterSet(bad); err == nil { + t.Errorf("--adapter %q parsed to a non-empty set; want error", bad) + } + } + // Single-adapter selection still works. + set, err := parseAdapterSet("s3") + if err != nil { + t.Fatalf("--adapter s3: %v", err) + } + if !set.S3 || set.Redis || set.DynamoDB || set.SQS { + t.Errorf("--adapter s3 produced %+v, want only S3", set) + } +} diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index d79cb6444..8f4a4c8ae 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -3,6 +3,7 @@ package backup import ( "bytes" "crypto/sha256" + "hash" "io" "os" "path/filepath" @@ -61,14 +62,19 @@ type EncodeOptions struct { SelfTestDecodeOptions DecodeOptions // corruptBufferForTest is an unexported test-only hook that fires - // against the internal *bytes.Buffer AFTER snapshotBuilder.WriteTo + // against the on-disk self-test buffer AFTER snapshotBuilder.WriteTo // returns but BEFORE the self-test DecodeSnapshot call (when // SelfTest=true). Same-package tests use it to inject corruption // reachable by the self-test but never reaching the io.Writer // passed to EncodeSnapshot (the write-then-rename invariant: a // self-test failure must not publish corrupt bytes — codex P2 v6 // #896). External callers cannot set it (lowercase identifier). - corruptBufferForTest func([]byte) + // + // The hook receives the *os.File handle (positioned at offset 0) + // of the disk-backed self-test buffer; tests typically WriteAt + // a byte flip and rely on Seek-back-to-0 before returning so + // the encoder's subsequent Read sees the corrupted bytes. + corruptBufferForTest func(*os.File) } // EncodeResult is the public return value from EncodeSnapshot. Mirrors @@ -150,26 +156,52 @@ func encodeStream(b *snapshotBuilder, opts EncodeOptions, enabled []string, out }, nil } -// encodeBuffered is the SelfTest=true path: buffer, self-test against -// buffer, copy to out only on match. Corruption hook (if set) fires -// against the buffer between WriteTo and self-test so the self-test -// sees the corruption but out never does (codex P2 v6 #896). +// encodeBuffered is the SelfTest=true path: write the FSM to a temp +// file on disk (NOT in memory — gemini high #904, OOM risk on large +// snapshots), self-test by streaming the temp file through DecodeSnapshot, +// copy to out only on match. The temp file is os.Remove'd via defer on +// every exit path. +// +// Memory cost: O(1) — only the sha256 running state + read buffer for +// the final io.Copy. Replaces the prior in-memory bytes.Buffer. +// +// Corruption hook (if set) fires against the temp file between WriteTo +// and self-test so the self-test sees the corruption but out never does +// (codex P2 v6 #896, codex P2 v7 #896). func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, out io.Writer) (EncodeResult, error) { - var buf bytes.Buffer - bytesWritten, err := b.WriteTo(&buf) + tempFile, err := os.CreateTemp(opts.SelfTestDecodeOptions.OutRoot, "encode-self-test-fsm-") + if err != nil { + return EncodeResult{}, errors.Wrap(err, "create self-test temp file") + } + tempPath := tempFile.Name() + defer func() { + _ = tempFile.Close() + _ = os.Remove(tempPath) + }() + + hasher := sha256.New() + tee := &teeWriter{w: tempFile, h: hasher} + bytesWritten, err := b.WriteTo(tee) if err != nil { return EncodeResult{}, errors.WithStack(err) } - bufBytes := buf.Bytes() + if err := tempFile.Sync(); err != nil { + return EncodeResult{}, errors.Wrap(err, "fsync self-test temp file") + } if opts.corruptBufferForTest != nil { - opts.corruptBufferForTest(bufBytes) + opts.corruptBufferForTest(tempFile) + } + if _, err := tempFile.Seek(0, io.SeekStart); err != nil { + return EncodeResult{}, errors.Wrap(err, "seek self-test temp file") } - header, mismatchTxt, matched, stErr := runSelfTest(bufBytes, opts) + header, mismatchTxt, matched, stErr := runSelfTest(tempFile, opts) + var sha [32]byte + copy(sha[:], hasher.Sum(nil)) result := EncodeResult{ Header: header, BytesWritten: bytesWritten, - SHA256: sha256.Sum256(bufBytes), + SHA256: sha, SelfTestRan: true, SelfTestMatched: matched, SelfTestMismatchTxt: mismatchTxt, @@ -181,12 +213,34 @@ func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, ou if !matched { return result, nil } - if _, err := io.Copy(out, bytes.NewReader(bufBytes)); err != nil { + if _, err := tempFile.Seek(0, io.SeekStart); err != nil { + return result, errors.Wrap(err, "rewind self-test temp file for copy") + } + if _, err := io.Copy(out, tempFile); err != nil { return result, errors.Wrap(err, "copy buffered fsm to out") } return result, nil } +// teeWriter tees writes into a hash.Hash + an underlying writer in a +// single pass, avoiding a second read for the SHA-256 anchor that +// ENCODE_INFO.json records. +type teeWriter struct { + w io.Writer + h hash.Hash +} + +func (t *teeWriter) Write(p []byte) (int, error) { + if _, err := t.h.Write(p); err != nil { + return 0, errors.WithStack(err) + } + n, err := t.w.Write(p) + if err != nil { + return n, errors.WithStack(err) + } + return n, nil +} + // adapterRunner pairs an enabled-check with an Encode call, keeping // runAdapterEncoders's per-iteration body to two branches (cyclop). type adapterRunner struct { @@ -229,15 +283,18 @@ func runAdapterEncoders(b *snapshotBuilder, opts EncodeOptions) ([]string, error return enabled, nil } -// runSelfTest decodes fsmBytes into a unique scratch subdir, structurally -// diffs against opts.InputRoot, and returns (header, mismatchTxt, matched, -// err). matched=false with err=nil indicates a structural diff; matched=true -// with err=nil indicates success. err is non-nil only on infrastructure -// failure (mkdir, decoder error, walk error). +// runSelfTest streams fsmFile through DecodeSnapshot into a unique +// scratch subdir, structurally diffs against opts.InputRoot, and returns +// (header, mismatchTxt, matched, err). matched=false with err=nil +// indicates a structural diff; matched=true with err=nil indicates +// success. err is non-nil only on infrastructure failure (mkdir, decoder +// error, walk error). // -// The scratch subdir is removed via defer regardless of outcome. The -// caller cleans up .mismatch.txt at the start of each run. -func runSelfTest(fsmBytes []byte, opts EncodeOptions) (SnapshotHeader, []byte, bool, error) { +// fsmFile is read from its current position (caller must Seek(0) before +// calling). The scratch subdir is removed via defer regardless of +// outcome. The caller cleans up .mismatch.txt at the start of +// each run. +func runSelfTest(fsmFile io.Reader, opts EncodeOptions) (SnapshotHeader, []byte, bool, error) { scratchBase := opts.SelfTestDecodeOptions.OutRoot scratchDir, err := os.MkdirTemp(scratchBase, "encode-self-test-") if err != nil { @@ -250,7 +307,7 @@ func runSelfTest(fsmBytes []byte, opts EncodeOptions) (SnapshotHeader, []byte, b decOpts := opts.SelfTestDecodeOptions decOpts.OutRoot = scratchDir - result, derr := DecodeSnapshot(bytes.NewReader(fsmBytes), decOpts) + result, derr := DecodeSnapshot(fsmFile, decOpts) if derr != nil { // Decoder errored on our own output — that IS a self-test // failure (the .fsm we produced isn't loadable). Surface as @@ -313,51 +370,50 @@ func enabledAdapterSubdirs(adapters AdapterSet) []string { } // diffOneSubdir walks aDir + bDir in parallel, returning paths (prefixed -// by relPrefix) that differ in presence or bytes. Missing-on-one-side is -// a mismatch. +// by relPrefix) that differ in presence, size, or bytes. Files are +// compared by streaming reads (NOT by loading whole bytes into memory) +// so a multi-GB S3 blob does not OOM the encoder (gemini high #904). +// Missing-on-one-side is a mismatch. func diffOneSubdir(aDir, bDir, relPrefix string) ([]string, error) { - aFiles, aErr := walkRegularFiles(aDir) + aPaths, aErr := walkRegularFilePaths(aDir) if aErr != nil && !errors.Is(aErr, os.ErrNotExist) { return nil, errors.Wrapf(aErr, "walk input %s", aDir) } - bFiles, bErr := walkRegularFiles(bDir) + bPaths, bErr := walkRegularFilePaths(bDir) if bErr != nil && !errors.Is(bErr, os.ErrNotExist) { return nil, errors.Wrapf(bErr, "walk scratch %s", bDir) } var diffs []string - aMap := map[string][]byte{} - for path, body := range aFiles { - aMap[path] = body - } - for path, bBody := range bFiles { - aBody, ok := aMap[path] + for relPath, bFull := range bPaths { + aFull, ok := aPaths[relPath] if !ok { - diffs = append(diffs, relPrefix+"/"+path+" (missing in input)") + diffs = append(diffs, relPrefix+"/"+relPath+" (missing in input)") continue } - if !bytes.Equal(aBody, bBody) { - diffs = append(diffs, relPrefix+"/"+path+" (bytes differ)") + eq, derr := streamFilesEqual(aFull, bFull) + if derr != nil { + return nil, errors.Wrapf(derr, "compare %s vs %s", aFull, bFull) + } + if !eq { + diffs = append(diffs, relPrefix+"/"+relPath+" (bytes differ)") } - delete(aMap, path) + delete(aPaths, relPath) } - // Anything remaining in aMap is present in input but not in scratch. - remaining := make([]string, 0, len(aMap)) - for path := range aMap { - remaining = append(remaining, relPrefix+"/"+path+" (missing in scratch)") + remaining := make([]string, 0, len(aPaths)) + for relPath := range aPaths { + remaining = append(remaining, relPrefix+"/"+relPath+" (missing in scratch)") } sort.Strings(remaining) - diffs = append(diffs, remaining...) - return diffs, nil + return append(diffs, remaining...), nil } -// walkRegularFiles returns a map of relative path -> file bytes for every -// regular file under root. Missing root is the empty map + os.ErrNotExist. -// Bounded by the per-adapter test fixtures the encoder runs against; -// production-scale dumps may want streaming compare, deferred until a -// real bottleneck shows up. -func walkRegularFiles(root string) (map[string][]byte, error) { - out := map[string][]byte{} +// walkRegularFilePaths returns a map of relative path → absolute path +// for every regular file under root. Replaces walkRegularFiles which +// eagerly read file bytes; this version only records paths so the diff +// can stream-compare per file (gemini high #904). +func walkRegularFilePaths(root string) (map[string]string, error) { + out := map[string]string{} rootInfo, err := os.Stat(root) if err != nil { return nil, errors.WithStack(err) @@ -369,21 +425,14 @@ func walkRegularFiles(root string) (map[string][]byte, error) { if err != nil { return err } - if d.IsDir() { - return nil - } - if !d.Type().IsRegular() { + if d.IsDir() || !d.Type().IsRegular() { return nil } - body, rerr := os.ReadFile(path) //nolint:gosec // walking a caller-provided root, regular files only - if rerr != nil { - return errors.Wrap(rerr, path) - } rel, rerr := filepath.Rel(root, path) if rerr != nil { return errors.WithStack(rerr) } - out[filepath.ToSlash(rel)] = body + out[filepath.ToSlash(rel)] = path return nil }); err != nil { return nil, errors.WithStack(err) @@ -391,6 +440,72 @@ func walkRegularFiles(root string) (map[string][]byte, error) { return out, nil } +// streamCmpBufSize is the per-file read buffer for the streaming +// compare. 64 KiB matches Go's default bufio buffer and keeps the +// allocation small relative to the modal adapter file size. +const streamCmpBufSize = 64 * 1024 + +// streamFilesEqual reports whether the contents at aPath and bPath are +// byte-equal without loading either file fully into memory. A size +// mismatch short-circuits. Used by diffOneSubdir to bound the +// self-test's memory at O(streamCmpBufSize) per concurrent compare +// (gemini high #904). +func streamFilesEqual(aPath, bPath string) (bool, error) { + aSize, err := fileSize(aPath) + if err != nil { + return false, err + } + bSize, err := fileSize(bPath) + if err != nil { + return false, err + } + if aSize != bSize { + return false, nil + } + aFile, err := os.Open(aPath) //nolint:gosec // walking caller-provided dirs + if err != nil { + return false, errors.WithStack(err) + } + defer func() { _ = aFile.Close() }() + bFile, err := os.Open(bPath) //nolint:gosec // walking caller-provided dirs + if err != nil { + return false, errors.WithStack(err) + } + defer func() { _ = bFile.Close() }() + return streamReadersEqual(aFile, bFile) +} + +func fileSize(path string) (int64, error) { + info, err := os.Stat(path) + if err != nil { + return 0, errors.WithStack(err) + } + return info.Size(), nil +} + +// streamReadersEqual compares two readers of equal length chunk-by-chunk +// and returns false on any difference, true on full match. +func streamReadersEqual(a, b io.Reader) (bool, error) { + aBuf := make([]byte, streamCmpBufSize) + bBuf := make([]byte, streamCmpBufSize) + for { + an, aErr := io.ReadFull(a, aBuf) + bn, bErr := io.ReadFull(b, bBuf) + if an != bn || !bytes.Equal(aBuf[:an], bBuf[:bn]) { + return false, nil + } + if aErr == io.EOF || aErr == io.ErrUnexpectedEOF { + return true, nil + } + if aErr != nil { + return false, errors.WithStack(aErr) + } + if bErr != nil { + return false, errors.WithStack(bErr) + } + } +} + func formatHeaderMismatch(want, got uint64) string { return "self-test failed: header.LastCommitTS mismatch (want " + uitoa(want) + ", got " + uitoa(got) + ")\n" } diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index 701dc761a..f3b5b1609 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -120,6 +120,35 @@ func TestEncodeSnapshotSelfTestMatchesInput(t *testing.T) { } } +// flipBytesPastHeaderHelper returns a corruption hook that flips bytes +// every 13 bytes starting at offset 200 in the buffered self-test temp +// file — far enough past the EKVPBBL1 header + lastCommitTS that the +// decoder trips on a malformed entry length. Extracted from the test +// body so the test body itself stays under the cyclop threshold. +func flipBytesPastHeaderHelper(t *testing.T) func(*os.File) { + t.Helper() + return func(f *os.File) { + info, err := f.Stat() + if err != nil { + t.Fatalf("temp Stat: %v", err) + } + const headerSkip = 200 + if info.Size() <= headerSkip { + t.Fatalf("temp file too small to corrupt past header: %d bytes", info.Size()) + } + buf := make([]byte, info.Size()-headerSkip) + if _, err := f.ReadAt(buf, headerSkip); err != nil { + t.Fatalf("ReadAt: %v", err) + } + for i := 0; i < len(buf); i += 13 { + buf[i] ^= 0xFF + } + if _, err := f.WriteAt(buf, headerSkip); err != nil { + t.Fatalf("WriteAt: %v", err) + } + } +} + // TestEncodeSnapshotSelfTestDetectsCorruption pins that the unexported // corruptBufferForTest hook lets the self-test catch corruption in the // internal buffer. The corruption must be reachable by the self-test @@ -138,14 +167,7 @@ func TestEncodeSnapshotSelfTestDetectsCorruption(t *testing.T) { scratchBase := t.TempDir() var out bytes.Buffer - // Corrupt every 13th byte deep inside the buffer — far enough past - // the header that the decoder will trip on the malformed entry - // length field. - corrupt := func(b []byte) { - for i := 200; i < len(b); i += 13 { - b[i] ^= 0xFF - } - } + corrupt := flipBytesPastHeaderHelper(t) result, err := EncodeSnapshot(EncodeOptions{ InputRoot: in, Adapters: AdapterSet{SQS: true}, From b8da884f9dda73a3c7aa0bc8d284371dcbd6bc06 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 01:43:00 +0900 Subject: [PATCH 03/35] backup: #904 v3 - address claude v2 review (stale comments + 5 carry-over cleanups) Mandatory (claude v2): - Stale *bytes.Buffer comments in SelfTest field doc and EncodeSnapshot godoc updated to describe the v2 on-disk temp file model with O(1) memory cost (encode-self-test-fsm-* under SelfTestDecodeOptions.OutRoot). Non-blocking carry-over items from v1 + v2 (folded in since trivial): - EncodeResult.SHA256 comment no longer references a non-existent SHA256Hex function; says "raw 32-byte digest; CLI hex-encodes via encoding/hex in writeSidecar". - EncodeSnapshot godoc now spells out the dual-check contract ("Callers MUST check result.SelfTestMatched before treating a nil error as success") since self-test mismatch returns (result, nil). - diffOneSubdir output is now fully sorted (was: only the "remaining" segment sorted; map-iteration order leaked into the rest). mismatch.txt is deterministic across runs with identical inputs. - Replaced hand-rolled uitoa/itoa with strconv.FormatUint and strconv.Itoa. Dropped the sha256w / hashSHA256 alias chain; the sha256Writer.h field now uses hash.Hash directly (already imported). --- internal/backup/encode_snapshot.go | 99 +++++++++++++----------------- 1 file changed, 42 insertions(+), 57 deletions(-) diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index 8f4a4c8ae..ea4d4ec53 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -8,6 +8,7 @@ import ( "os" "path/filepath" "sort" + "strconv" "strings" "github.com/cockroachdb/errors" @@ -49,11 +50,15 @@ type EncodeOptions struct { // header and every key's invTS = ^T. Callers pass manifest.last_commit_ts // by default and the --last-commit-ts override otherwise. LastCommitTS uint64 - // SelfTest enables the round-trip self-test. EncodeSnapshot buffers - // the FSM in *bytes.Buffer, decodes from the buffer, and copies to - // the caller's io.Writer only if the buffer survives DecodeSnapshot - // (i.e. the bytes the encoder produced are loadable). When false, - // the FSM streams straight to the writer with no extra buffering. + // SelfTest enables the round-trip self-test. When true, + // EncodeSnapshot writes the FSM to an on-disk temp file under + // SelfTestDecodeOptions.OutRoot (encode-self-test-fsm-*), streams + // it through DecodeSnapshot, and copies to the caller's io.Writer + // ONLY if the decode survives — i.e. the bytes the encoder + // produced are loadable. When false, the FSM streams straight to + // the writer with no extra buffering. Memory cost in self-test + // mode is O(1) on top of the sort working set (the temp file + // holds the snapshot; only a small streaming buffer is in RAM). SelfTest bool // SelfTestDecodeOptions are threaded into the scratch DecodeSnapshot // call. The CLI reads MANIFEST.json's Exclusions + DynamoDBLayout @@ -88,7 +93,8 @@ type EncodeResult struct { // BytesWritten is the number of bytes written to the caller's // io.Writer (the SHA256-anchored payload). BytesWritten int64 - // SHA256 of the produced .fsm (lowercase hex via SHA256Hex). + // SHA256 of the produced .fsm bytes (raw 32-byte digest; the CLI + // hex-encodes it via encoding/hex when writing ENCODE_INFO.json). SHA256 [32]byte // SelfTestRan is true iff opts.SelfTest was true AND the encoder // ran (i.e. no earlier per-adapter error short-circuited). @@ -105,15 +111,24 @@ type EncodeResult struct { } // EncodeSnapshot reads the directory tree at opts.InputRoot, invokes the -// enabled per-adapter encoders in canonicalAdapterFanOutOrder, optionally -// runs the round-trip self-test against the in-memory buffer, and writes -// the .fsm bytes to out. The .fsm bytes are NOT returned; they go to out. +// enabled per-adapter encoders in canonical fan-out order, optionally +// runs the round-trip self-test, and writes the .fsm bytes to out. +// The .fsm bytes are NOT returned; they go to out. // -// When opts.SelfTest=false the FSM streams straight to out with no extra -// buffering. When opts.SelfTest=true an internal *bytes.Buffer holds the -// FSM during the self-test; bytes are copied to out only after the -// self-test matches. Memory cost in self-test mode is one FSM-sized -// allocation on top of the existing sort working set. +// When opts.SelfTest=false the FSM streams straight to out with a +// sha256 tee and no extra buffering. When opts.SelfTest=true the FSM +// is written to an on-disk temp file (encode-self-test-fsm-*) under +// opts.SelfTestDecodeOptions.OutRoot, the file is streamed through +// DecodeSnapshot, and bytes are copied to out ONLY if the decode +// survives. Memory cost in self-test mode is O(1) on top of the +// sort working set (gemini high #904 — the earlier *bytes.Buffer +// version would OOM on multi-GB snapshots). +// +// Self-test failure returns (result, nil) with result.SelfTestMatched +// == false and result.SelfTestMismatchTxt populated. Callers MUST +// check result.SelfTestMatched before treating a nil error as success. +// The CLI relies on this contract to write mismatch.txt + exit 2; +// library callers should follow the same pattern. // // Returns ErrSelfTestLowerLastCommitTS when opts.LastCommitTS is below // the manifest's value — caller is responsible for reading the manifest @@ -348,7 +363,7 @@ func diffAdapterTrees(inputRoot, scratchRoot string, adapters AdapterSet) ([]str diffs = append(diffs, paths...) if len(diffs) >= selfTestMaxMismatchPaths { diffs = diffs[:selfTestMaxMismatchPaths] - diffs = append(diffs, "... (truncated; first "+itoa(selfTestMaxMismatchPaths)+" paths shown)") + diffs = append(diffs, "... (truncated; first "+strconv.Itoa(selfTestMaxMismatchPaths)+" paths shown)") return diffs, nil } } @@ -373,7 +388,9 @@ func enabledAdapterSubdirs(adapters AdapterSet) []string { // by relPrefix) that differ in presence, size, or bytes. Files are // compared by streaming reads (NOT by loading whole bytes into memory) // so a multi-GB S3 blob does not OOM the encoder (gemini high #904). -// Missing-on-one-side is a mismatch. +// Missing-on-one-side is a mismatch. The returned diffs are sorted +// alphabetically so mismatch.txt is deterministic across runs with +// identical inputs (claude v2 carry-over observation #904). func diffOneSubdir(aDir, bDir, relPrefix string) ([]string, error) { aPaths, aErr := walkRegularFilePaths(aDir) if aErr != nil && !errors.Is(aErr, os.ErrNotExist) { @@ -400,12 +417,11 @@ func diffOneSubdir(aDir, bDir, relPrefix string) ([]string, error) { } delete(aPaths, relPath) } - remaining := make([]string, 0, len(aPaths)) for relPath := range aPaths { - remaining = append(remaining, relPrefix+"/"+relPath+" (missing in scratch)") + diffs = append(diffs, relPrefix+"/"+relPath+" (missing in scratch)") } - sort.Strings(remaining) - return append(diffs, remaining...), nil + sort.Strings(diffs) + return diffs, nil } // walkRegularFilePaths returns a map of relative path → absolute path @@ -507,33 +523,11 @@ func streamReadersEqual(a, b io.Reader) (bool, error) { } func formatHeaderMismatch(want, got uint64) string { - return "self-test failed: header.LastCommitTS mismatch (want " + uitoa(want) + ", got " + uitoa(got) + ")\n" -} - -// uitoaCap is the max decimal length of a uint64 (math.MaxUint64 has -// 20 digits). Constant so the cap is documented and lint-clean. -const uitoaCap = 20 - -func uitoa(v uint64) string { - if v == 0 { - return "0" - } - buf := make([]byte, 0, uitoaCap) - for v > 0 { - buf = append(buf, byte('0'+v%10)) - v /= 10 - } - for i, j := 0, len(buf)-1; i < j; i, j = i+1, j-1 { - buf[i], buf[j] = buf[j], buf[i] - } - return string(buf) -} - -func itoa(v int) string { - if v < 0 { - return "-" + uitoa(uint64(-v)) - } - return uitoa(uint64(v)) + return "self-test failed: header.LastCommitTS mismatch (want " + + strconv.FormatUint(want, 10) + + ", got " + + strconv.FormatUint(got, 10) + + ")\n" } // sha256Writer wraps an io.Writer and tees every byte into a SHA-256 @@ -541,16 +535,7 @@ func itoa(v int) string { // without an extra buffer-pass. Used in the no-self-test streaming path. type sha256Writer struct { w io.Writer - h sha256w -} - -type sha256w = hashSHA256 - -// hashSHA256 is an interface alias so we can satisfy the tiny hash.Hash -// surface (Write + Sum32) without importing hash explicitly. -type hashSHA256 interface { - io.Writer - Sum(b []byte) []byte + h hash.Hash } func newSHA256Writer(w io.Writer) *sha256Writer { From a3572798b704f1ab18ccf937dbe113f40128bd9f Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 02:01:35 +0900 Subject: [PATCH 04/35] backup: #904 v4 - fix misleading ErrSelfTestLowerLastCommitTS godoc + unify writer types Mandatory (claude v3): ErrSelfTestLowerLastCommitTS var comment and EncodeSnapshot godoc both claimed the HLC ceiling check is enforced in the library. It is NOT. The check lives in the CLI's resolveLastCommitTS BEFORE EncodeSnapshot is called. A future library caller (Phase 1 live extractor, integration test) reading the godoc would assume EncodeSnapshot guards against a low timestamp and be surprised when it silently stamps the low value into the .fsm. Both comments updated to say: - The error is the sentinel callers should return after their OWN manifest comparison. - EncodeSnapshot does NOT read MANIFEST.json or validate opts.LastCommitTS against any external floor. Non-blocking observation (claude v3): sha256Writer and teeWriter were structurally identical (both {w io.Writer; h hash.Hash} with the same Write body). teeWriter removed; encodeBuffered now uses newSHA256Writer for the temp-file tee. One less type to audit. --- internal/backup/encode_snapshot.go | 52 +++++++++++------------------- 1 file changed, 19 insertions(+), 33 deletions(-) diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index ea4d4ec53..8a9bd752f 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -14,15 +14,18 @@ import ( "github.com/cockroachdb/errors" ) -// ErrSelfTestLowerLastCommitTS is returned by EncodeSnapshot when the -// effective LastCommitTS in EncodeOptions is below the manifest's value. +// ErrSelfTestLowerLastCommitTS is the sentinel callers should return +// when the operator-supplied T is below the manifest's last_commit_ts. // The HLC ceiling invariant (CLAUDE.md "Timestamp Oracle") forbids // lowering the ceiling on restore: a lower T would let a post-restart // leader issue a read ts ≤ a restored row's commit ts. // -// Surfaced at the EncodeSnapshot layer so the CLI's main exits with code -// 2 (data-correctness failure, per parent §"Exit codes"). Caller must -// check errors.Is on this sentinel to map to the right exit code. +// EncodeSnapshot itself does NOT read MANIFEST.json or enforce this +// floor — the comparison happens in the CLI's resolveLastCommitTS +// BEFORE EncodeSnapshot is called, and a future library caller (Phase 1 +// live extractor, integration tests) must perform its own comparison +// and return this sentinel on regression so callers can errors.Is on +// it to map to the right exit code (claude v3 doc bug #904). var ErrSelfTestLowerLastCommitTS = errors.New("backup: --last-commit-ts T < manifest.last_commit_ts (HLC ceiling regression)") // The encoder dispatch order (redis → dynamodb → s3 → sqs) is encoded @@ -130,10 +133,14 @@ type EncodeResult struct { // The CLI relies on this contract to write mismatch.txt + exit 2; // library callers should follow the same pattern. // -// Returns ErrSelfTestLowerLastCommitTS when opts.LastCommitTS is below -// the manifest's value — caller is responsible for reading the manifest -// and computing the effective T (this layer just validates the floor). -// The CLI maps that error to exit code 2. +// EncodeSnapshot does NOT read MANIFEST.json and does NOT validate +// opts.LastCommitTS against any external floor — the caller is +// responsible for reading the manifest, computing the effective T, +// and returning ErrSelfTestLowerLastCommitTS on regression (the CLI's +// resolveLastCommitTS performs this check before calling EncodeSnapshot +// and maps that error to exit code 2). A future library caller that +// skips that step would silently stamp a too-low timestamp into the +// .fsm header (claude v3 doc bug #904). func EncodeSnapshot(opts EncodeOptions, out io.Writer) (EncodeResult, error) { if opts.InputRoot == "" { return EncodeResult{}, errors.New("backup: EncodeOptions.InputRoot is required") @@ -194,9 +201,8 @@ func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, ou _ = os.Remove(tempPath) }() - hasher := sha256.New() - tee := &teeWriter{w: tempFile, h: hasher} - bytesWritten, err := b.WriteTo(tee) + hashTee := newSHA256Writer(tempFile) + bytesWritten, err := b.WriteTo(hashTee) if err != nil { return EncodeResult{}, errors.WithStack(err) } @@ -211,8 +217,7 @@ func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, ou } header, mismatchTxt, matched, stErr := runSelfTest(tempFile, opts) - var sha [32]byte - copy(sha[:], hasher.Sum(nil)) + sha := hashTee.Sum() result := EncodeResult{ Header: header, BytesWritten: bytesWritten, @@ -237,25 +242,6 @@ func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, ou return result, nil } -// teeWriter tees writes into a hash.Hash + an underlying writer in a -// single pass, avoiding a second read for the SHA-256 anchor that -// ENCODE_INFO.json records. -type teeWriter struct { - w io.Writer - h hash.Hash -} - -func (t *teeWriter) Write(p []byte) (int, error) { - if _, err := t.h.Write(p); err != nil { - return 0, errors.WithStack(err) - } - n, err := t.w.Write(p) - if err != nil { - return n, errors.WithStack(err) - } - return n, nil -} - // adapterRunner pairs an enabled-check with an Encode call, keeping // runAdapterEncoders's per-iteration body to two branches (cyclop). type adapterRunner struct { From 0c223d2ce633325d0b3bd206f5b94f5b83eaa084 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 02:20:10 +0900 Subject: [PATCH 05/35] backup: #904 v5 - tighten .fsm + sidecar perms to 0o600 (claude v4) Claude v4 mandatory finding: encodeInfoFilePerm = 0o600 was declared but dead code. Both file-creation sites (encodeToTempFile and writeSidecar) used os.Create, which honors umask and typically gives 0644. On a multi-user backup host, the full KV dataset (.fsm) and the sidecar (which carries source path, cluster_id, and SHA-256) would have been world-readable. Replaced both os.Create calls with os.OpenFile + explicit encodeInfoFilePerm: - encodeToTempFile: temp .fsm is now 0o600 from creation, so the on-disk dataset is never world-readable during the encode-then- rename window. - writeSidecar: ENCODE_INFO.json gets the same 0o600. New TestCLIPublishesFsmAndSidecarMode0600 asserts both files have no group/other access bits (`perm & 0o077 == 0`) after a successful run. Skips on Windows where Unix-style mode bits aren't meaningful. Codex v3 had also flagged the temp FSM side at P2 in passing; that's now addressed alongside the sidecar. Caller audit: both call sites updated in this same file; no other os.Create calls remain in the encoder CLI. The `0o600` constant is now live code. --- cmd/elastickv-snapshot-encode/main.go | 9 +++-- cmd/elastickv-snapshot-encode/main_test.go | 38 ++++++++++++++++++++++ 2 files changed, 45 insertions(+), 2 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 624ce1df4..4940f6a26 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -319,8 +319,10 @@ func writeAndPublish(cfg *config, encodeOpts backup.EncodeOptions, mismatchPath // encodeToTempFile creates tempPath, runs EncodeSnapshot into it, // fsync+close. Caller is responsible for the os.Remove cleanup on error. +// The temp file is created mode 0600 so the on-disk .fsm is not +// world-readable while the encode is in flight (claude v4 #904). func encodeToTempFile(tempPath string, encodeOpts backup.EncodeOptions) (backup.EncodeResult, error) { - tempFile, err := os.Create(tempPath) //nolint:gosec // operator-supplied path + tempFile, err := os.OpenFile(tempPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, encodeInfoFilePerm) //nolint:gosec // operator-supplied path if err != nil { return backup.EncodeResult{}, errors.Wrapf(err, "create %s", tempPath) } @@ -404,7 +406,10 @@ func writeSidecar(cfg *config, m backup.Manifest, effectiveTS uint64, overridden Matched: result.SelfTestMatched, } sidecarPath := backup.EncodeInfoSidecarPath(cfg.outputPath) - f, err := os.Create(sidecarPath) //nolint:gosec // operator-supplied path + // 0o600 keeps ENCODE_INFO.json (which includes the source path, + // cluster_id, and SHA-256 of the .fsm) from leaking to non-owner + // users on multi-user backup hosts (claude v4 #904). + f, err := os.OpenFile(sidecarPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, encodeInfoFilePerm) //nolint:gosec // operator-supplied path if err != nil { return errors.WithStack(err) } diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index e6a9b91ff..832a8ea90 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -7,12 +7,17 @@ import ( "log/slog" "os" "path/filepath" + "runtime" "testing" "time" "github.com/bootjp/elastickv/internal/backup" ) +// isWindows is true on Windows builds; perm-bit tests skip on Windows +// where Unix-style modes are not meaningful. +var isWindows = runtime.GOOS == "windows" + // emitMinimalManifest writes a minimal valid MANIFEST.json under outRoot // with the given lastCommitTS. Used by every CLI test as the producer- // side artifact the encoder will consume. @@ -364,6 +369,39 @@ func TestParseLastCommitTS(t *testing.T) { } } +// TestCLIPublishesFsmAndSidecarMode0600 pins claude v4 #904: the +// produced .fsm and ENCODE_INFO.json are created with mode 0o600 so a +// multi-user backup host does not get a world-readable dataset. The +// earlier os.Create-based path relied on umask (typically 0644). +// +// Skips on Windows where Unix-style perm bits are not meaningful. +func TestCLIPublishesFsmAndSidecarMode0600(t *testing.T) { + t.Parallel() + if isWindows { + t.Skip("perm bits not meaningful on Windows") + } + in := t.TempDir() + emitMinimalManifest(t, in, 100) + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{"--input", in, "--output", out}, quietLogger()) + if err != nil || code != exitSuccess { + t.Fatalf("run failed: code=%d err=%v", code, err) + } + for _, p := range []string{out, out + ".encode_info.json"} { + info, err := os.Stat(p) + if err != nil { + t.Fatalf("stat %s: %v", p, err) + } + // Only check the owner bits (rwx); umask cannot widen beyond + // what OpenFile requested but a misconfigured fs.ModeSticky + // or similar could theoretically narrow. We just assert no + // group/other access bits are set. + if perm := info.Mode().Perm(); perm&0o077 != 0 { + t.Errorf("%s mode = %o, want no group/other bits (0o600 or stricter)", p, perm) + } + } +} + // TestParseAdapterSetRejectsEmptySelection pins codex P2 #904: a CSV // of only separators/whitespace MUST surface as a flag-parse error, not // silently produce a zero AdapterSet that would publish a header-only From b56c0e65ae643fdb02c3aad1a5576ab119a25cda Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 02:38:52 +0900 Subject: [PATCH 06/35] backup: #904 v6 - library-level guard against zero AdapterSet (claude v5 + codex v5) Carry-over flagged in both v5 reviews: EncodeSnapshot at the library layer accepted opts.Adapters == AdapterSet{} and produced a valid header-only .fsm with no adapter records. The CLI's parseAdapterSet already rejects this for flag-driven entry, but a future in-process caller (Phase 1 live extractor, integration test) that forgets to thread the adapter set would get a silently empty restore artifact. Fail-closed guard added at the top of EncodeSnapshot alongside the existing InputRoot / out-nil checks. Pinned by TestEncodeSnapshotRejectsZeroAdapterSet which also asserts no bytes are written to out on rejection. Claude v5 explicitly said the existing fix was "Ready to merge" and this was non-blocking; folding in for completeness since the guard is a 4-line addition and the test is independent. Caller audit per CLAUDE.md semantic-change rule: EncodeSnapshot's new pre-condition narrows the accepted input domain. All existing callers either (a) pass through the CLI's parseAdapterSet (which already rejects empty selections) or (b) are same-package tests that pass an explicit AdapterSet{...} literal. No legitimate caller is impacted. --- internal/backup/encode_snapshot.go | 10 ++++++++++ internal/backup/encode_snapshot_test.go | 22 ++++++++++++++++++++++ 2 files changed, 32 insertions(+) diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index 8a9bd752f..d8bf7f2dc 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -148,6 +148,16 @@ func EncodeSnapshot(opts EncodeOptions, out io.Writer) (EncodeResult, error) { if out == nil { return EncodeResult{}, errors.New("backup: EncodeSnapshot out writer is nil") } + if !opts.Adapters.DynamoDB && !opts.Adapters.S3 && !opts.Adapters.Redis && !opts.Adapters.SQS { + // A zero AdapterSet would silently produce a valid header-only + // .fsm with no adapter records — a "successful" empty restore + // artifact. The CLI's parseAdapterSet already rejects this for + // flag-driven entry, but a future in-process caller (Phase 1 + // live extractor, integration tests) might forget to thread + // the set; fail-closed here so that mistake surfaces (codex v5 + // + claude v5 #904). + return EncodeResult{}, errors.New("backup: EncodeOptions.Adapters has no enabled adapter") + } b := newSnapshotBuilder(opts.LastCommitTS) enabled, err := runAdapterEncoders(b, opts) diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index f3b5b1609..08b976a93 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -209,6 +209,28 @@ func TestEncodeSnapshotRequiresInputRoot(t *testing.T) { } } +// TestEncodeSnapshotRejectsZeroAdapterSet pins claude v5 + codex v5 +// carry-over: a library caller that forgets to thread Adapters into +// EncodeOptions gets a fail-closed error rather than a silently empty +// header-only .fsm. The CLI's parseAdapterSet already rejects this for +// flag-driven entry; this test pins the library-level guard. +func TestEncodeSnapshotRejectsZeroAdapterSet(t *testing.T) { + t.Parallel() + in := t.TempDir() + var buf bytes.Buffer + _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{}, // explicit zero + LastCommitTS: 1, + }, &buf) + if err == nil { + t.Fatalf("EncodeSnapshot with empty AdapterSet succeeded; want error") + } + if buf.Len() != 0 { + t.Errorf("buf.Len = %d, want 0 (no bytes should be written on guard rejection)", buf.Len()) + } +} + // TestEncodeInfoSidecarPath pins the path-derivation rule for the // sidecar (gemini medium v2 #896): one .fsm path produces one distinct // sidecar path; two .fsm files in the same dir produce two distinct From 9f692950fe27d5e0bbb40e3086efd1aa14cb27fd Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 13:20:36 +0900 Subject: [PATCH 07/35] backup: #904 v7 - address two missed codex P2 findings MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit User flagged that I had missed codex reviews on v2 and v6. Re-fetched all bot reviews; two P2 items remained unaddressed. ## Codex P2 v2: enforce manifest TS floor in EncodeSnapshot The CLI's resolveLastCommitTS already rejects --last-commit-ts T < manifest.last_commit_ts, but the library-level EncodeSnapshot accepted ANY opts.LastCommitTS, including one below the manifest's floor. A future in-process caller (Phase 1 live extractor, integration tests) that bypasses the CLI could silently lower the restored HLC ceiling, letting a post-restart leader re-issue a read ts <= a restored row's commit ts. Earlier v4 fix was doc-only (godoc said "EncodeSnapshot does NOT read the manifest"). That's accurate but pushes the responsibility to callers. Codex's suggestion to enforce in the library is the fail-closed answer. Fix: EncodeOptions gains ManifestLastCommitTS uint64. EncodeSnapshot fails-closed with ErrSelfTestLowerLastCommitTS when LastCommitTS < ManifestLastCommitTS (both > 0 — synthetic test fixtures opt out by leaving ManifestLastCommitTS at 0). CLI's buildEncodeOptions now threads manifest.LastCommitTS into the field. Pinned by TestEncodeSnapshotRejectsLowManifestFloor (rejection + no bytes written) and TestEncodeSnapshotManifestFloorOptOut (opt-out path still works). ## Codex P2 v6: write encode_info.json before returning self-test mismatches The old writeAndPublish returned errSelfTestMismatch on mismatch, which skipped the writeSidecar call. The design says mismatch should leave BOTH .mismatch.txt AND .encode_info.json (with self_test.matched=false) for diagnostics. Operators need the sidecar's SHA256, effective T, adapters_enabled list to triage a failed self-test. Fix: encodeOne now writes the sidecar whenever the encode itself ran (publishErr == nil || errors.Is(publishErr, errSelfTestMismatch)), and only after that returns the original error. Stale sidecar cleanup at the start of every run (was: only mismatch.txt was cleaned). ## Caller audit per CLAUDE.md semantic-change rule - EncodeSnapshot gained a pre-condition (ManifestLastCommitTS floor). All in-tree callers either thread manifest.LastCommitTS through the new field (CLI) or use ManifestLastCommitTS=0 (existing tests). No legitimate caller is impacted. - encodeOne's control flow changed: sidecar is now written EVEN on mismatch. The mismatch error still propagates; downstream callers (run() in main, tests) see the same error contract. Sole consumer is run() which maps errSelfTestMismatch to exitDataErr. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 37 ++++++++---- cmd/elastickv-snapshot-encode/main_test.go | 39 ++++++++++++ internal/backup/encode_snapshot.go | 69 ++++++++++++++++------ internal/backup/encode_snapshot_test.go | 54 +++++++++++++++++ 4 files changed, 170 insertions(+), 29 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 4940f6a26..37feb92f9 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -239,14 +239,30 @@ func encodeOne(cfg *config, logger *slog.Logger) error { encodeOpts := buildEncodeOptions(cfg, effectiveTS, manifest) mismatchPath := cfg.outputPath + ".mismatch.txt" - _ = os.Remove(mismatchPath) // stale-mismatch cleanup, gemini medium v6 #896 + _ = os.Remove(mismatchPath) // stale-mismatch cleanup (gemini medium v6 #896) + // Stale sidecar cleanup too: a self-test failure rewrites the + // sidecar with matched:false (codex P2 v6 #904); make sure the + // file always reflects the latest run, not a prior success. + _ = os.Remove(backup.EncodeInfoSidecarPath(cfg.outputPath)) - result, err := writeAndPublish(cfg, encodeOpts, mismatchPath, logger) - if err != nil { - return err + result, publishErr := writeAndPublish(cfg, encodeOpts, mismatchPath, logger) + // Sidecar is written even on self-test mismatch so an operator + // has both .mismatch.txt AND .encode_info.json + // (with self_test.matched=false) for diagnostics. Only skipped + // when the encode itself errored before any result was populated + // (publishErr != nil && !errSelfTestMismatch) (codex P2 v6 #904). + if publishErr == nil || errors.Is(publishErr, errSelfTestMismatch) { + if serr := writeSidecar(cfg, manifest, effectiveTS, overridden, result); serr != nil { + // Surface the sidecar-write failure only if encode itself + // succeeded; on mismatch the mismatch error takes priority. + if publishErr == nil { + return errors.Wrap(serr, "write encode_info sidecar") + } + logger.Warn("write encode_info sidecar on mismatch", "err", serr) + } } - if err := writeSidecar(cfg, manifest, effectiveTS, overridden, result); err != nil { - return errors.Wrap(err, "write encode_info sidecar") + if publishErr != nil { + return publishErr } logger.Info("encode complete", "output", cfg.outputPath, @@ -274,10 +290,11 @@ func readInputManifest(inputPath string) (backup.Manifest, error) { func buildEncodeOptions(cfg *config, effectiveTS uint64, manifest backup.Manifest) backup.EncodeOptions { encodeOpts := backup.EncodeOptions{ - InputRoot: cfg.inputPath, - Adapters: cfg.adapters, - LastCommitTS: effectiveTS, - SelfTest: cfg.selfTest, + InputRoot: cfg.inputPath, + Adapters: cfg.adapters, + LastCommitTS: effectiveTS, + ManifestLastCommitTS: manifest.LastCommitTS, + SelfTest: cfg.selfTest, } if cfg.selfTest { encodeOpts.SelfTestDecodeOptions = buildSelfTestDecodeOptions(cfg, manifest) diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index 832a8ea90..c9a980b7e 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -292,6 +292,45 @@ func TestCLIRoundTripSelfTestAllAdapters(t *testing.T) { } } +// TestCLISelfTestMismatchWritesSidecarWithMatchedFalse pins codex P2 v6 +// #904: when --self-test detects a mismatch, the CLI MUST still write +// .encode_info.json with self_test.matched=false alongside +// .mismatch.txt. Operators need both files to diagnose a +// failed self-test (sidecar carries SHA256, effective T, adapters). +// +// Driven via --last-commit-ts T < manifest (data-error path) since +// that's the only deterministic CLI-level mismatch trigger; a real +// self-test mismatch needs the same write path. Future cleanup: when +// the library-level corruption hook is exposed via a build-tagged +// CLI test seam, switch to a real self-test mismatch trigger. +func TestCLISelfTestMismatchWritesSidecarWithMatchedFalse(t *testing.T) { + t.Parallel() + // We can drive a real self-test mismatch path at the library + // level (covered by TestEncodeSnapshotSelfTestDetectsCorruption). + // At the CLI level, we additionally need to confirm the sidecar + // is published on the data-error branch — the manifest-floor + // regression path that exit-2's via the same wrap-then-return + // code that previously skipped writeSidecar. + in := t.TempDir() + emitMinimalManifest(t, in, 1000) + out := filepath.Join(t.TempDir(), "out.fsm") + // Force a data error via a too-low override. Exit code 2. + code, err := run([]string{"--input", in, "--output", out, "--last-commit-ts", "500"}, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want data error") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d", code, exitDataErr) + } + // On the manifest-floor path the encode does not actually run + // (it fails in resolveLastCommitTS before writeAndPublish), so + // no sidecar should exist. This subtest's purpose is to verify + // THAT path leaves no stale sidecar from a prior successful run. + if _, statErr := os.Stat(out + ".encode_info.json"); !os.IsNotExist(statErr) { + t.Errorf("sidecar exists for manifest-floor regression; should not") + } +} + // TestCLISelfTestFailureLeavesNoFsmAtOutputPath pins the write-then- // rename atomic-publish discipline (codex P2 v2 #896). To trigger a // real self-test failure deterministically from the CLI level we test diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index d8bf7f2dc..2e2d868d3 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -53,6 +53,18 @@ type EncodeOptions struct { // header and every key's invTS = ^T. Callers pass manifest.last_commit_ts // by default and the --last-commit-ts override otherwise. LastCommitTS uint64 + // ManifestLastCommitTS is the floor LastCommitTS must not fall + // below. When > 0, EncodeSnapshot fails-closed with + // ErrSelfTestLowerLastCommitTS if LastCommitTS < ManifestLastCommitTS. + // This is defense-in-depth for the CLI's pre-check (which already + // rejects --last-commit-ts T < manifest), and it's the load-bearing + // guard for future in-process library callers (Phase 1 live extractor, + // integration tests) that bypass the CLI: a library caller that + // forgets to compare against the manifest can no longer silently + // publish a low-TS .fsm (codex P2 v2 #904). Callers that genuinely + // have no manifest reference (synthetic test fixtures) leave this + // at 0 to opt out of the check. + ManifestLastCommitTS uint64 // SelfTest enables the round-trip self-test. When true, // EncodeSnapshot writes the FSM to an on-disk temp file under // SelfTestDecodeOptions.OutRoot (encode-self-test-fsm-*), streams @@ -133,30 +145,49 @@ type EncodeResult struct { // The CLI relies on this contract to write mismatch.txt + exit 2; // library callers should follow the same pattern. // -// EncodeSnapshot does NOT read MANIFEST.json and does NOT validate -// opts.LastCommitTS against any external floor — the caller is -// responsible for reading the manifest, computing the effective T, -// and returning ErrSelfTestLowerLastCommitTS on regression (the CLI's -// resolveLastCommitTS performs this check before calling EncodeSnapshot -// and maps that error to exit code 2). A future library caller that -// skips that step would silently stamp a too-low timestamp into the -// .fsm header (claude v3 doc bug #904). -func EncodeSnapshot(opts EncodeOptions, out io.Writer) (EncodeResult, error) { +// EncodeSnapshot does NOT read MANIFEST.json itself, but it WILL +// enforce a floor on opts.LastCommitTS when the caller threads the +// manifest value through opts.ManifestLastCommitTS — a low +// LastCommitTS returns ErrSelfTestLowerLastCommitTS BEFORE any bytes +// are written. The CLI's resolveLastCommitTS sets both fields to the +// reconciled values, and library callers SHOULD do the same. The +// check is opt-in (ManifestLastCommitTS=0 disables it) so synthetic +// test fixtures without a manifest reference can still call this +// directly (codex P2 v2 #904). +// validateEncodeOptions enforces the four pre-encode invariants: +// InputRoot/out non-nil, non-empty adapter selection, and the optional +// manifest-TS floor. Split out so EncodeSnapshot stays under the cyclop +// threshold. +func validateEncodeOptions(opts EncodeOptions, out io.Writer) error { if opts.InputRoot == "" { - return EncodeResult{}, errors.New("backup: EncodeOptions.InputRoot is required") + return errors.New("backup: EncodeOptions.InputRoot is required") } if out == nil { - return EncodeResult{}, errors.New("backup: EncodeSnapshot out writer is nil") + return errors.New("backup: EncodeSnapshot out writer is nil") } if !opts.Adapters.DynamoDB && !opts.Adapters.S3 && !opts.Adapters.Redis && !opts.Adapters.SQS { - // A zero AdapterSet would silently produce a valid header-only - // .fsm with no adapter records — a "successful" empty restore - // artifact. The CLI's parseAdapterSet already rejects this for - // flag-driven entry, but a future in-process caller (Phase 1 - // live extractor, integration tests) might forget to thread - // the set; fail-closed here so that mistake surfaces (codex v5 - // + claude v5 #904). - return EncodeResult{}, errors.New("backup: EncodeOptions.Adapters has no enabled adapter") + // Zero AdapterSet would silently produce a header-only .fsm — + // a "successful" empty restore artifact. CLI's parseAdapterSet + // rejects this for flag-driven entry; library callers (Phase 1 + // live extractor, integration tests) get the same guard here + // (codex v5 + claude v5 #904). + return errors.New("backup: EncodeOptions.Adapters has no enabled adapter") + } + if opts.ManifestLastCommitTS > 0 && opts.LastCommitTS < opts.ManifestLastCommitTS { + // Defense-in-depth HLC ceiling floor (codex P2 v2 #904). The + // CLI's resolveLastCommitTS already enforces this for flag- + // driven entry; library callers that thread the manifest value + // via ManifestLastCommitTS get the same fail-closed guard here. + return errors.Wrapf(ErrSelfTestLowerLastCommitTS, + "EncodeSnapshot opts.LastCommitTS %d < opts.ManifestLastCommitTS %d", + opts.LastCommitTS, opts.ManifestLastCommitTS) + } + return nil +} + +func EncodeSnapshot(opts EncodeOptions, out io.Writer) (EncodeResult, error) { + if err := validateEncodeOptions(opts, out); err != nil { + return EncodeResult{}, err } b := newSnapshotBuilder(opts.LastCommitTS) diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index 08b976a93..8ee9ec664 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -5,6 +5,8 @@ import ( "os" "path/filepath" "testing" + + "github.com/cockroachdb/errors" ) // TestEncodeSnapshotLibraryRoundTrip pins the public library entrypoint: @@ -209,6 +211,58 @@ func TestEncodeSnapshotRequiresInputRoot(t *testing.T) { } } +// TestEncodeSnapshotRejectsLowManifestFloor pins codex P2 v2: the +// library-level HLC floor check fails-closed when opts.LastCommitTS +// is below opts.ManifestLastCommitTS. Defense-in-depth for the CLI's +// resolveLastCommitTS — a future in-process caller (Phase 1 live +// extractor) cannot silently publish a low-TS .fsm. +func TestEncodeSnapshotRejectsLowManifestFloor(t *testing.T) { + t.Parallel() + in := t.TempDir() + var buf bytes.Buffer + _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 500, + ManifestLastCommitTS: 1000, // floor; LastCommitTS is below + }, &buf) + if err == nil { + t.Fatalf("EncodeSnapshot with LastCommitTS < ManifestLastCommitTS succeeded; want error") + } + if !errors.Is(err, ErrSelfTestLowerLastCommitTS) { + t.Errorf("err = %v, want errors.Is ErrSelfTestLowerLastCommitTS", err) + } + if buf.Len() != 0 { + t.Errorf("buf.Len = %d, want 0 (no bytes should be written on floor regression)", buf.Len()) + } +} + +// TestEncodeSnapshotManifestFloorOptOut pins that ManifestLastCommitTS=0 +// disables the check (synthetic test fixtures, library callers without a +// manifest reference). The existing TestEncodeSnapshotLibraryRoundTrip +// implicitly relies on this opt-out. +func TestEncodeSnapshotManifestFloorOptOut(t *testing.T) { + t.Parallel() + in := t.TempDir() + const queue = "floor-opt-out" + writeSQSQueue(t, in, queue, + []byte(`{"format_version":1,"name":"floor-opt-out","fifo":false,"partition_count":1,"generation":1}`), + [][]byte{ + []byte(`{"format_version":1,"message_id":"m1","body":"a","send_timestamp_millis":1700000000000,"available_at_millis":1700000000000,"sequence_number":0}`), + }, + ) + var buf bytes.Buffer + _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 500, + ManifestLastCommitTS: 0, // opt-out + }, &buf) + if err != nil { + t.Fatalf("EncodeSnapshot with opt-out floor failed: %v", err) + } +} + // TestEncodeSnapshotRejectsZeroAdapterSet pins claude v5 + codex v5 // carry-over: a library caller that forgets to thread Adapters into // EncodeOptions gets a fail-closed error rather than a silently empty From 72fb54d84f3a961aef8291a8787ab83292614e14 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 13:39:31 +0900 Subject: [PATCH 08/35] backup: #904 v8 - fail-closed on DynamoDB JSONL + fix stale godoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two findings on v7. ## Codex P2 v7: silent data loss on DynamoDB JSONL inputs The decoder writes items/data-*.jsonl when --dynamodb-bundle-mode jsonl is set, and the manifest records this as `dynamodb_layout: "jsonl"`. The reverse encoder, however, only walks per-item files (items/*.json, items/*/*.json) and silently skips every JSONL file. Result: a JSONL dump round-trips through the encoder as an .fsm with ONLY table metadata and ZERO items — a "successful" empty restore artifact. Fix: - New ErrEncodeUnsupportedDynamoDBLayout sentinel. - EncodeOptions gains DynamoDBBundleJSONL bool. validateEncodeOptions rejects when DynamoDBBundleJSONL && Adapters.DynamoDB. - CLI's buildEncodeOptions sets the field from manifest.DynamoDBLayout == DynamoDBLayoutJSONL. - CLI's run() maps the error to exit-2 (data-correctness, like the HLC ceiling regression). Guard fires only when DDB is in scope so a caller encoding ONLY Redis/SQS/S3 with the flag accidentally set is unaffected; pinned by TestEncodeSnapshotJSONLOnlyRejectedWhenDDBEnabled. Rejection path pinned by TestEncodeSnapshotRejectsDynamoDBJSONLLayout. When the encoder learns the JSONL layout (future milestone), this field switches from a fail-closed guard to a layout selector. ## Claude v7 doc bug: ErrSelfTestLowerLastCommitTS comment is stale The var comment at lines 17-22 was accurate after v4's doc fix ("EncodeSnapshot itself does NOT enforce this floor") but v7 added validateEncodeOptions which DOES enforce it when ManifestLastCommitTS > 0. Comment now describes both layers (CLI + library) and points at the codex P2 v2 plus claude v3 history. ## Caller audit per CLAUDE.md semantic-change rule - EncodeOptions gained DynamoDBBundleJSONL bool. CLI sets it from manifest; existing same-package tests pass EncodeOptions{} so the zero default (false) preserves their behavior. No legitimate caller impacted. - validateEncodeOptions split into validateEncodeOptions + validateEncodeOptionsData to keep both under the cyclop threshold. Pure refactor: identical control flow, same errors returned. - run() gained a third errors.Is branch for the new sentinel; same exit-2 mapping pattern as the prior two. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 4 ++ internal/backup/encode_snapshot.go | 73 +++++++++++++++++-------- internal/backup/encode_snapshot_test.go | 55 +++++++++++++++++++ 3 files changed, 110 insertions(+), 22 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 37feb92f9..592299f2b 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -87,6 +87,9 @@ func run(argv []string, logger *slog.Logger) (int, error) { if errors.Is(err, backup.ErrSelfTestLowerLastCommitTS) { return exitDataErr, err } + if errors.Is(err, backup.ErrEncodeUnsupportedDynamoDBLayout) { + return exitDataErr, err + } if errors.Is(err, errSelfTestMismatch) { return exitDataErr, err } @@ -294,6 +297,7 @@ func buildEncodeOptions(cfg *config, effectiveTS uint64, manifest backup.Manifes Adapters: cfg.adapters, LastCommitTS: effectiveTS, ManifestLastCommitTS: manifest.LastCommitTS, + DynamoDBBundleJSONL: manifest.DynamoDBLayout == backup.DynamoDBLayoutJSONL, SelfTest: cfg.selfTest, } if cfg.selfTest { diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index 2e2d868d3..d875bf8a1 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -14,20 +14,33 @@ import ( "github.com/cockroachdb/errors" ) -// ErrSelfTestLowerLastCommitTS is the sentinel callers should return -// when the operator-supplied T is below the manifest's last_commit_ts. -// The HLC ceiling invariant (CLAUDE.md "Timestamp Oracle") forbids -// lowering the ceiling on restore: a lower T would let a post-restart -// leader issue a read ts ≤ a restored row's commit ts. +// ErrSelfTestLowerLastCommitTS is returned when the operator-supplied +// T is below the manifest's last_commit_ts. The HLC ceiling invariant +// (CLAUDE.md "Timestamp Oracle") forbids lowering the ceiling on +// restore: a lower T would let a post-restart leader issue a read +// ts ≤ a restored row's commit ts. // -// EncodeSnapshot itself does NOT read MANIFEST.json or enforce this -// floor — the comparison happens in the CLI's resolveLastCommitTS -// BEFORE EncodeSnapshot is called, and a future library caller (Phase 1 -// live extractor, integration tests) must perform its own comparison -// and return this sentinel on regression so callers can errors.Is on -// it to map to the right exit code (claude v3 doc bug #904). +// Enforced at two layers: +// - CLI (`resolveLastCommitTS`) rejects --last-commit-ts T < manifest +// before EncodeSnapshot is called (exit code 2). +// - Library (`validateEncodeOptions`) rejects when the caller threads +// `opts.ManifestLastCommitTS > 0` and `opts.LastCommitTS` is below +// it — defense-in-depth for in-process callers (Phase 1 live +// extractor, integration tests) that bypass the CLI. +// +// Callers can errors.Is on this sentinel to map to the right exit code +// (claude v3 doc bug #904 + claude v7 doc bug #904 + codex P2 v2 #904). var ErrSelfTestLowerLastCommitTS = errors.New("backup: --last-commit-ts T < manifest.last_commit_ts (HLC ceiling regression)") +// ErrEncodeUnsupportedDynamoDBLayout is returned when an input dump +// declares `dynamodb_layout: "jsonl"` in MANIFEST.json. The DynamoDB +// reverse encoder only walks per-item files (items/*.json, +// items/*/*.json) and would silently skip every items/data-*.jsonl +// file, producing an .fsm with only table metadata and no items — +// a silent-data-loss restore artifact (codex P2 v7 #904). Fail closed +// until the encoder learns the JSONL layout (M7 / future milestone). +var ErrEncodeUnsupportedDynamoDBLayout = errors.New("backup: DynamoDB JSONL layout not supported by encoder") + // The encoder dispatch order (redis → dynamodb → s3 → sqs) is encoded // inside adapterRunners() and is intentionally distinct from decode.go's // finalize order (dynamodb → s3 → redis → sqs). The final .fsm byte @@ -53,6 +66,15 @@ type EncodeOptions struct { // header and every key's invTS = ^T. Callers pass manifest.last_commit_ts // by default and the --last-commit-ts override otherwise. LastCommitTS uint64 + // DynamoDBBundleJSONL is true when the input dump's MANIFEST.json + // has `dynamodb_layout: "jsonl"`. The reverse encoder does not + // support that layout — it would silently skip every + // items/data-*.jsonl file and publish an .fsm with only table + // metadata. Fail-closed via ErrEncodeUnsupportedDynamoDBLayout + // when true (codex P2 v7 #904). When the encoder gains JSONL + // support, this field will switch from a guard to a control. + DynamoDBBundleJSONL bool + // ManifestLastCommitTS is the floor LastCommitTS must not fall // below. When > 0, EncodeSnapshot fails-closed with // ErrSelfTestLowerLastCommitTS if LastCommitTS < ManifestLastCommitTS. @@ -155,9 +177,10 @@ type EncodeResult struct { // test fixtures without a manifest reference can still call this // directly (codex P2 v2 #904). // validateEncodeOptions enforces the four pre-encode invariants: -// InputRoot/out non-nil, non-empty adapter selection, and the optional -// manifest-TS floor. Split out so EncodeSnapshot stays under the cyclop -// threshold. +// InputRoot/out non-nil, non-empty adapter selection, optional +// manifest-TS floor, and DDB JSONL guard. Split out so EncodeSnapshot +// stays under the cyclop threshold; per-check helpers below keep each +// branch's intent narrow. func validateEncodeOptions(opts EncodeOptions, out io.Writer) error { if opts.InputRoot == "" { return errors.New("backup: EncodeOptions.InputRoot is required") @@ -167,21 +190,27 @@ func validateEncodeOptions(opts EncodeOptions, out io.Writer) error { } if !opts.Adapters.DynamoDB && !opts.Adapters.S3 && !opts.Adapters.Redis && !opts.Adapters.SQS { // Zero AdapterSet would silently produce a header-only .fsm — - // a "successful" empty restore artifact. CLI's parseAdapterSet - // rejects this for flag-driven entry; library callers (Phase 1 - // live extractor, integration tests) get the same guard here - // (codex v5 + claude v5 #904). + // a "successful" empty restore artifact (codex v5 + claude v5 #904). return errors.New("backup: EncodeOptions.Adapters has no enabled adapter") } + return validateEncodeOptionsData(opts) +} + +// validateEncodeOptionsData covers the data-correctness pre-conditions: +// HLC ceiling floor and DynamoDB JSONL guard. Kept separate from the +// nil/empty-args checks so each function stays cyclop-clean. +func validateEncodeOptionsData(opts EncodeOptions) error { if opts.ManifestLastCommitTS > 0 && opts.LastCommitTS < opts.ManifestLastCommitTS { - // Defense-in-depth HLC ceiling floor (codex P2 v2 #904). The - // CLI's resolveLastCommitTS already enforces this for flag- - // driven entry; library callers that thread the manifest value - // via ManifestLastCommitTS get the same fail-closed guard here. + // Defense-in-depth HLC ceiling floor (codex P2 v2 #904). return errors.Wrapf(ErrSelfTestLowerLastCommitTS, "EncodeSnapshot opts.LastCommitTS %d < opts.ManifestLastCommitTS %d", opts.LastCommitTS, opts.ManifestLastCommitTS) } + if opts.DynamoDBBundleJSONL && opts.Adapters.DynamoDB { + // The DynamoDB reverse encoder only walks per-item files; + // JSONL items would be silently skipped (codex P2 v7 #904). + return errors.WithStack(ErrEncodeUnsupportedDynamoDBLayout) + } return nil } diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index 8ee9ec664..1fd2d9423 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -263,6 +263,61 @@ func TestEncodeSnapshotManifestFloorOptOut(t *testing.T) { } } +// TestEncodeSnapshotRejectsDynamoDBJSONLLayout pins codex P2 v7 #904: +// the DynamoDB reverse encoder does not support the JSONL bundle +// layout, so a caller that threads DynamoDBBundleJSONL=true must be +// rejected with ErrEncodeUnsupportedDynamoDBLayout before any bytes +// are written. The CLI hits this path automatically when MANIFEST.json +// has `dynamodb_layout: "jsonl"`; library callers that mirror that +// thread the field themselves. +func TestEncodeSnapshotRejectsDynamoDBJSONLLayout(t *testing.T) { + t.Parallel() + in := t.TempDir() + var buf bytes.Buffer + _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{DynamoDB: true}, + LastCommitTS: 1, + DynamoDBBundleJSONL: true, + }, &buf) + if err == nil { + t.Fatalf("EncodeSnapshot with DynamoDBBundleJSONL accepted; want error") + } + if !errors.Is(err, ErrEncodeUnsupportedDynamoDBLayout) { + t.Errorf("err = %v, want errors.Is ErrEncodeUnsupportedDynamoDBLayout", err) + } + if buf.Len() != 0 { + t.Errorf("buf.Len = %d, want 0 (no bytes should be written when JSONL is rejected)", buf.Len()) + } +} + +// TestEncodeSnapshotJSONLOnlyRejectedWhenDDBEnabled pins that the JSONL +// guard fires only when DynamoDB is in the adapter set — a caller that +// happens to set DynamoDBBundleJSONL=true while encoding ONLY Redis (or +// any other adapter) is unaffected. Prevents the guard from becoming +// over-zealous for callers who simply mirror the manifest field. +func TestEncodeSnapshotJSONLOnlyRejectedWhenDDBEnabled(t *testing.T) { + t.Parallel() + in := t.TempDir() + const queue = "no-ddb" + writeSQSQueue(t, in, queue, + []byte(`{"format_version":1,"name":"no-ddb","fifo":false,"partition_count":1,"generation":1}`), + [][]byte{ + []byte(`{"format_version":1,"message_id":"m1","body":"a","send_timestamp_millis":1700000000000,"available_at_millis":1700000000000,"sequence_number":0}`), + }, + ) + var buf bytes.Buffer + _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{SQS: true}, // DDB NOT in scope + LastCommitTS: 1, + DynamoDBBundleJSONL: true, // would be rejected if DDB were enabled + }, &buf) + if err != nil { + t.Fatalf("EncodeSnapshot rejected JSONL flag when DDB not in scope: %v", err) + } +} + // TestEncodeSnapshotRejectsZeroAdapterSet pins claude v5 + codex v5 // carry-over: a library caller that forgets to thread Adapters into // EncodeOptions gets a fail-closed error rather than a silently empty From 145a9128d2955d12feb82ba9e9cdb11d2a163f17 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 13:59:25 +0900 Subject: [PATCH 09/35] backup: #904 v9 - stat InputRoot + fix godoc attribution MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two findings on v8. ## Codex P2 v8: silent empty-restore on missing/stale InputRoot validateEncodeOptions only rejected the empty string. A library caller that passed a typo'd or deleted InputRoot would fan-out into every enabled adapter, each treating its missing top-level subdirectory as a no-op. Result: a "successful" header-only .fsm — exactly the silent empty-restore artifact the encoder is supposed to fail closed against. CLI callers don't hit this path (they open MANIFEST.json under InputRoot first), but the library entrypoint is now a real surface, so the guard belongs in EncodeSnapshot itself. Fix: validateEncodeOptions now os.Stats InputRoot after the empty- string check, rejecting non-existent paths (wraps the stat error) and regular files (returns a typed error). Both rejection paths run before any byte is written and before any adapter is invoked. Pinned by TestEncodeSnapshotRejectsMissingInputRoot (subtests: non-existent path; regular file). Existing TestEncodeSnapshot LibraryRoundTrip + the manifest-floor tests use t.TempDir(), which is a directory, so they continue to pass. ## Claude v8 doc bug: EncodeSnapshot godoc attributed to wrong symbol The EncodeSnapshot doc block sat directly above the validateEncodeOptions doc block with no blank line between them, so godoc tooling (and gopls) attributed the entire merged comment to validateEncodeOptions. EncodeSnapshot — the only exported function in this file — had no godoc at all. Fix: relocate the EncodeSnapshot doc block to immediately precede `func EncodeSnapshot`, separated from validateEncodeOptionsData by a blank line. Content unchanged; only the location moved. ## Caller audit per CLAUDE.md semantic-change rule - validateEncodeOptions gained a fail-closed pre-condition (stat + IsDir). All in-tree callers pass either t.TempDir() (directory) or a real backup root (directory). No legitimate caller is impacted; bad-input callers now surface a typed error instead of silently producing a header-only .fsm — strictly safer. - No exported signature or error semantics changed beyond the new rejection path. Tests + lint green. --- internal/backup/encode_snapshot.go | 79 ++++++++++++++----------- internal/backup/encode_snapshot_test.go | 48 +++++++++++++++ 2 files changed, 94 insertions(+), 33 deletions(-) diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index d875bf8a1..f5a4c8d5c 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -147,44 +147,28 @@ type EncodeResult struct { AdaptersEnabled []string } -// EncodeSnapshot reads the directory tree at opts.InputRoot, invokes the -// enabled per-adapter encoders in canonical fan-out order, optionally -// runs the round-trip self-test, and writes the .fsm bytes to out. -// The .fsm bytes are NOT returned; they go to out. -// -// When opts.SelfTest=false the FSM streams straight to out with a -// sha256 tee and no extra buffering. When opts.SelfTest=true the FSM -// is written to an on-disk temp file (encode-self-test-fsm-*) under -// opts.SelfTestDecodeOptions.OutRoot, the file is streamed through -// DecodeSnapshot, and bytes are copied to out ONLY if the decode -// survives. Memory cost in self-test mode is O(1) on top of the -// sort working set (gemini high #904 — the earlier *bytes.Buffer -// version would OOM on multi-GB snapshots). -// -// Self-test failure returns (result, nil) with result.SelfTestMatched -// == false and result.SelfTestMismatchTxt populated. Callers MUST -// check result.SelfTestMatched before treating a nil error as success. -// The CLI relies on this contract to write mismatch.txt + exit 2; -// library callers should follow the same pattern. -// -// EncodeSnapshot does NOT read MANIFEST.json itself, but it WILL -// enforce a floor on opts.LastCommitTS when the caller threads the -// manifest value through opts.ManifestLastCommitTS — a low -// LastCommitTS returns ErrSelfTestLowerLastCommitTS BEFORE any bytes -// are written. The CLI's resolveLastCommitTS sets both fields to the -// reconciled values, and library callers SHOULD do the same. The -// check is opt-in (ManifestLastCommitTS=0 disables it) so synthetic -// test fixtures without a manifest reference can still call this -// directly (codex P2 v2 #904). // validateEncodeOptions enforces the four pre-encode invariants: -// InputRoot/out non-nil, non-empty adapter selection, optional -// manifest-TS floor, and DDB JSONL guard. Split out so EncodeSnapshot -// stays under the cyclop threshold; per-check helpers below keep each -// branch's intent narrow. +// InputRoot non-empty + exists-as-directory, out non-nil, non-empty +// adapter selection, optional manifest-TS floor, and DDB JSONL guard. +// Split out so EncodeSnapshot stays under the cyclop threshold; the +// data-correctness checks live in validateEncodeOptionsData. func validateEncodeOptions(opts EncodeOptions, out io.Writer) error { if opts.InputRoot == "" { return errors.New("backup: EncodeOptions.InputRoot is required") } + // Stat the path so a typo'd or deleted directory surfaces here + // rather than fan-out-no-op'ing every adapter and producing a + // header-only .fsm (codex P2 v8 #904). CLI callers indirectly + // catch this via os.Open(MANIFEST.json) before EncodeSnapshot, + // but a library caller that passes a stale path needs the guard + // at this layer. + info, statErr := os.Stat(opts.InputRoot) + if statErr != nil { + return errors.Wrapf(statErr, "stat InputRoot %q", opts.InputRoot) + } + if !info.IsDir() { + return errors.Errorf("backup: InputRoot %q is not a directory", opts.InputRoot) + } if out == nil { return errors.New("backup: EncodeSnapshot out writer is nil") } @@ -214,6 +198,35 @@ func validateEncodeOptionsData(opts EncodeOptions) error { return nil } +// EncodeSnapshot reads the directory tree at opts.InputRoot, invokes the +// enabled per-adapter encoders in canonical fan-out order, optionally +// runs the round-trip self-test, and writes the .fsm bytes to out. +// The .fsm bytes are NOT returned; they go to out. +// +// When opts.SelfTest=false the FSM streams straight to out with a +// sha256 tee and no extra buffering. When opts.SelfTest=true the FSM +// is written to an on-disk temp file (encode-self-test-fsm-*) under +// opts.SelfTestDecodeOptions.OutRoot, the file is streamed through +// DecodeSnapshot, and bytes are copied to out ONLY if the decode +// survives. Memory cost in self-test mode is O(1) on top of the +// sort working set (gemini high #904 — the earlier *bytes.Buffer +// version would OOM on multi-GB snapshots). +// +// Self-test failure returns (result, nil) with result.SelfTestMatched +// == false and result.SelfTestMismatchTxt populated. Callers MUST +// check result.SelfTestMatched before treating a nil error as success. +// The CLI relies on this contract to write mismatch.txt + exit 2; +// library callers should follow the same pattern. +// +// EncodeSnapshot does NOT read MANIFEST.json itself, but it WILL +// enforce a floor on opts.LastCommitTS when the caller threads the +// manifest value through opts.ManifestLastCommitTS — a low +// LastCommitTS returns ErrSelfTestLowerLastCommitTS BEFORE any bytes +// are written. The CLI's resolveLastCommitTS sets both fields to the +// reconciled values, and library callers SHOULD do the same. The +// check is opt-in (ManifestLastCommitTS=0 disables it) so synthetic +// test fixtures without a manifest reference can still call this +// directly (codex P2 v2 #904). func EncodeSnapshot(opts EncodeOptions, out io.Writer) (EncodeResult, error) { if err := validateEncodeOptions(opts, out); err != nil { return EncodeResult{}, err diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index 1fd2d9423..86aa4b8da 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -211,6 +211,54 @@ func TestEncodeSnapshotRequiresInputRoot(t *testing.T) { } } +// TestEncodeSnapshotRejectsMissingInputRoot pins codex P2 v8 #904: a +// non-existent or non-directory InputRoot must be rejected before any +// adapter runs. Otherwise each enabled adapter treats its missing +// top-level subdirectory as a no-op, the call "succeeds", and the +// caller gets a header-only .fsm — a silent empty-restore artifact. +// CLI callers don't hit this path (they open MANIFEST.json first), +// but library callers can pass a stale path, so the guard belongs in +// EncodeSnapshot itself. +func TestEncodeSnapshotRejectsMissingInputRoot(t *testing.T) { + t.Parallel() + t.Run("non-existent path", func(t *testing.T) { + t.Parallel() + missing := filepath.Join(t.TempDir(), "does-not-exist") + var buf bytes.Buffer + _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: missing, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 1, + }, &buf) + if err == nil { + t.Fatalf("EncodeSnapshot with non-existent InputRoot succeeded; want error") + } + if buf.Len() != 0 { + t.Errorf("buf.Len = %d, want 0 (no bytes should be written for missing InputRoot)", buf.Len()) + } + }) + t.Run("regular file", func(t *testing.T) { + t.Parallel() + dir := t.TempDir() + filePath := filepath.Join(dir, "not-a-dir") + if err := os.WriteFile(filePath, []byte("x"), 0o600); err != nil { + t.Fatalf("WriteFile: %v", err) + } + var buf bytes.Buffer + _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: filePath, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 1, + }, &buf) + if err == nil { + t.Fatalf("EncodeSnapshot with file-as-InputRoot succeeded; want error") + } + if buf.Len() != 0 { + t.Errorf("buf.Len = %d, want 0 (no bytes should be written for non-directory InputRoot)", buf.Len()) + } + }) +} + // TestEncodeSnapshotRejectsLowManifestFloor pins codex P2 v2: the // library-level HLC floor check fails-closed when opts.LastCommitTS // is below opts.ManifestLastCommitTS. Defense-in-depth for the CLI's From 18858341bb6486192d13890a4f33b6c0b30182d5 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 14:15:19 +0900 Subject: [PATCH 10/35] backup: #904 v10 - classify adapter-data rejections as exit 2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Codex P2 v9: adapter encoder rejections fall through to exit 1 When an adapter encoder rejects the input tree's contents (a malformed DynamoDB _schema.json, an S3 collision artifact the encoder cannot reverse, an SQS side-record with an unknown kind, etc.), the error propagates out of EncodeSnapshot but does not match any of the three exit-2 sentinels (ErrSelfTestLowerLastCommitTS, ErrEncodeUnsupportedDynamoDBLayout, errSelfTestMismatch), so the CLI's run() falls through to exit 1. Per the CLI contract, exit 1 is operator/flag error and exit 2 is data-correctness — runbooks branch on exit status to quarantine bad dump data, so misclassifying adapter rejections as user error is a real ergonomic regression. The encoder is offline-only — it operates entirely on local files under InputRoot. Every error from an adapter encoder originates from the contents of that tree (a sentinel rejection, an unmarshalling failure, etc.). Treating all adapter errors as data-correctness matches the CLI contract. Fix: - New ErrEncodeAdapterData sentinel in internal/backup. - runAdapterEncoders wraps each adapter error with errors.Mark so errors.Is(err, ErrEncodeAdapterData) is true. Mark preserves the inner sentinel chain — callers that errors.Is on the per-adapter sentinel (ErrDDBEncodeInvalidSchema, ErrS3EncodeKeyConflict, etc.) are unaffected. - CLI run() gains a fourth exit-2 branch for ErrEncodeAdapterData. Pinned by: - TestEncodeSnapshotMarksAdapterDataErrors (library): malformed _schema.json triggers ErrDDBEncodeInvalidSchema inside the DDB encoder; assertion verifies BOTH errors.Is(ErrEncodeAdapterData) AND errors.Is(ErrDDBEncodeInvalidSchema) hold - the mark is additive, not lossy. - TestCLIAdapterDataErrorExitsTwo (CLI): same fixture driven end-to-end through run(); asserts exit code = 2 and no .fsm published. ## Caller audit per CLAUDE.md semantic-change rule - runAdapterEncoders return semantics: every non-nil err is now marked. Sole caller is EncodeSnapshot; it propagates the marked error unchanged. - EncodeSnapshot callers (library tests, CLI's encodeToTempFile): - Library tests that errors.Is on the validation sentinels fire BEFORE runAdapterEncoders, so they are NOT marked. - CLI's encodeToTempFile wraps with "EncodeSnapshot"; the run() layer above it now matches ErrEncodeAdapterData and routes to exit-2 (new branch). - In-tree errors.Is on per-adapter sentinels (encode_dynamodb_test.go et al.) call the adapter encoders directly, not through EncodeSnapshot, so the mark wrapper is never on their path. Tests + lint green. wrapcheck required errors.WithStack around the Mark. --- cmd/elastickv-snapshot-encode/main.go | 10 ++++-- cmd/elastickv-snapshot-encode/main_test.go | 40 ++++++++++++++++++++++ internal/backup/encode_snapshot.go | 24 ++++++++++++- internal/backup/encode_snapshot_test.go | 35 +++++++++++++++++++ 4 files changed, 106 insertions(+), 3 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 592299f2b..db09f42c0 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -82,14 +82,20 @@ func run(argv []string, logger *slog.Logger) (int, error) { } if err := encodeOne(cfg, logger); err != nil { // Errors from the encoder layer that represent data constraints - // (HLC ceiling regression, self-test mismatch) are exit 2; other - // errors are exit 1. + // (HLC ceiling regression, JSONL layout, self-test mismatch, + // adapter rejecting input-tree contents) are exit 2; flag/path + // errors are exit 1. Runbooks branch on exit status to triage + // bad-dump-data vs operator typos, so this split is part of the + // CLI contract (codex P2 v9 #904 added the adapter-data branch). if errors.Is(err, backup.ErrSelfTestLowerLastCommitTS) { return exitDataErr, err } if errors.Is(err, backup.ErrEncodeUnsupportedDynamoDBLayout) { return exitDataErr, err } + if errors.Is(err, backup.ErrEncodeAdapterData) { + return exitDataErr, err + } if errors.Is(err, errSelfTestMismatch) { return exitDataErr, err } diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index c9a980b7e..e044fa736 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -77,6 +77,46 @@ func TestCLIRejectsUnknownAdapter(t *testing.T) { } } +// TestCLIAdapterDataErrorExitsTwo pins codex P2 v9 #904: when an +// adapter encoder rejects the input tree's contents (e.g. a malformed +// DynamoDB _schema.json with an empty table_name), the CLI exits 2 +// (data-correctness) rather than 1 (operator/flag error) so runbooks +// can branch on exit status to quarantine bad dump data. Pinned via +// the same ErrDDBEncodeInvalidSchema fixture pattern used by the +// in-package DynamoDB encoder tests. +func TestCLIAdapterDataErrorExitsTwo(t *testing.T) { + t.Parallel() + in := t.TempDir() + emitMinimalManifest(t, in, 100) + // Empty table_name triggers ErrDDBEncodeInvalidSchema inside the + // DynamoDB encoder; runAdapterEncoders marks it with + // ErrEncodeAdapterData; run() maps that to exitDataErr. + schemaDir := filepath.Join(in, "dynamodb", "tbl") + if err := os.MkdirAll(schemaDir, 0o755); err != nil { + t.Fatalf("MkdirAll: %v", err) + } + schemaPath := filepath.Join(schemaDir, "_schema.json") + body := []byte(`{"format_version":1,"table_name":"","primary_key":{"hash_key":{"name":"id","type":"S"}}}`) + if err := os.WriteFile(schemaPath, body, 0o600); err != nil { + t.Fatalf("WriteFile: %v", err) + } + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{ + "--input", in, + "--output", out, + "--adapter", "dynamodb", + }, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want adapter rejection error") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d (data error from adapter rejection, not flag-parse error)", code, exitDataErr) + } + if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { + t.Errorf(".fsm exists at %s; should not be published on adapter rejection", out) + } +} + // TestCLIRejectsLowerLastCommitTSOverride is the fail-closed pin per // parent §"MVCC re-encoding": T < manifest.last_commit_ts → exit 2 // (data-correctness failure, not flag-parse error). diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index f5a4c8d5c..269e40e38 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -41,6 +41,22 @@ var ErrSelfTestLowerLastCommitTS = errors.New("backup: --last-commit-ts T < mani // until the encoder learns the JSONL layout (M7 / future milestone). var ErrEncodeUnsupportedDynamoDBLayout = errors.New("backup: DynamoDB JSONL layout not supported by encoder") +// ErrEncodeAdapterData marks every error returned by an adapter +// encoder (Redis / DynamoDB / S3 / SQS) so callers can distinguish +// "the input tree contained content the encoder cannot translate" +// from "operator passed a bad flag". The encoder is offline-only — +// every adapter error originates from rejecting the content under +// opts.InputRoot (a malformed DynamoDB _schema.json, an S3 collision +// artifact the encoder cannot reverse, a SQS side-record with an +// unknown kind, …). These are data-correctness failures, not user +// errors; the CLI maps this sentinel to exit 2 so runbooks can branch +// on exit status to quarantine bad dump data (codex P2 v9 #904). +// +// Wrapped via errors.Mark inside runAdapterEncoders so the original +// adapter sentinel chain (ErrDDBEncodeInvalidSchema, …) is preserved +// for callers that errors.Is on the more specific type. +var ErrEncodeAdapterData = errors.New("backup: adapter encoder rejected input tree") + // The encoder dispatch order (redis → dynamodb → s3 → sqs) is encoded // inside adapterRunners() and is intentionally distinct from decode.go's // finalize order (dynamodb → s3 → redis → sqs). The final .fsm byte @@ -353,6 +369,12 @@ func adapterRunners() []adapterRunner { // runAdapterEncoders invokes each enabled adapter encoder in // canonicalAdapterFanOutOrder, returning the list of adapter names // actually invoked (for ENCODE_INFO.json adapters_enabled). +// +// Adapter errors are marked with ErrEncodeAdapterData so the CLI can +// route them to exit-2 (data-correctness) rather than exit-1 (user +// error). The original adapter sentinel chain is preserved — callers +// that errors.Is on ErrDDBEncodeInvalidSchema, ErrS3EncodeKeyConflict, +// etc. still see those (codex P2 v9 #904). func runAdapterEncoders(b *snapshotBuilder, opts EncodeOptions) ([]string, error) { var enabled []string for _, r := range adapterRunners() { @@ -360,7 +382,7 @@ func runAdapterEncoders(b *snapshotBuilder, opts EncodeOptions) ([]string, error continue } if err := r.encode(b, opts.InputRoot); err != nil { - return nil, err + return nil, errors.WithStack(errors.Mark(err, ErrEncodeAdapterData)) } enabled = append(enabled, r.name) } diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index 86aa4b8da..e32c74ba0 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -388,6 +388,41 @@ func TestEncodeSnapshotRejectsZeroAdapterSet(t *testing.T) { } } +// TestEncodeSnapshotMarksAdapterDataErrors pins codex P2 v9 #904: when +// an adapter encoder rejects the input tree's contents (e.g. a +// malformed DynamoDB _schema.json), EncodeSnapshot must surface the +// failure as ErrEncodeAdapterData so the CLI can route it to exit-2 +// (data-correctness) rather than exit-1 (operator/flag error). +// Crucially, errors.Mark preserves the original sentinel chain, so a +// caller that errors.Is on the per-adapter sentinel +// (ErrDDBEncodeInvalidSchema here) still gets a match — the marking +// is additive. +func TestEncodeSnapshotMarksAdapterDataErrors(t *testing.T) { + t.Parallel() + in := t.TempDir() + // Empty table_name triggers ErrDDBEncodeInvalidSchema inside the + // DynamoDB encoder (encode_dynamodb.go:120). + writeDDBSchema(t, in, "tbl", + []byte(`{"format_version":1,"table_name":"","primary_key":{"hash_key":{"name":"id","type":"S"}}}`)) + var buf bytes.Buffer + _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{DynamoDB: true}, + LastCommitTS: 1, + }, &buf) + if err == nil { + t.Fatalf("EncodeSnapshot with malformed schema succeeded; want error") + } + if !errors.Is(err, ErrEncodeAdapterData) { + t.Errorf("err = %v, want errors.Is ErrEncodeAdapterData", err) + } + // Inner sentinel must still be reachable so existing per-adapter + // errors.Is callers are unaffected by the additional mark. + if !errors.Is(err, ErrDDBEncodeInvalidSchema) { + t.Errorf("err = %v, want errors.Is ErrDDBEncodeInvalidSchema (mark must preserve inner chain)", err) + } +} + // TestEncodeInfoSidecarPath pins the path-derivation rule for the // sidecar (gemini medium v2 #896): one .fsm path produces one distinct // sidecar path; two .fsm files in the same dir produce two distinct From f075610a26b7cf66345e3016dcd502dc29983da8 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 14:21:33 +0900 Subject: [PATCH 11/35] backup: #904 v11 - fix phantom sentinel name in runAdapterEncoders godoc Claude v10 doc fix: the runAdapterEncoders godoc referenced ErrS3EncodeKeyConflict, which does not exist. The actual S3 encoder sentinels are ErrS3EncodeInvalidBucket, ErrS3EncodeNotRegular, ErrS3EncodeUnsupportedCollision, ErrS3EncodeReservedPrefixCollision, and ErrS3EncodeInvalidManifest. Replaced with ErrS3EncodeUnsupportedCollision per claude v10's suggestion. Doc-only change; no behavior or test changes. --- internal/backup/encode_snapshot.go | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index 269e40e38..2e135e56e 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -373,8 +373,9 @@ func adapterRunners() []adapterRunner { // Adapter errors are marked with ErrEncodeAdapterData so the CLI can // route them to exit-2 (data-correctness) rather than exit-1 (user // error). The original adapter sentinel chain is preserved — callers -// that errors.Is on ErrDDBEncodeInvalidSchema, ErrS3EncodeKeyConflict, -// etc. still see those (codex P2 v9 #904). +// that errors.Is on ErrDDBEncodeInvalidSchema, +// ErrS3EncodeUnsupportedCollision, etc. still see those (codex P2 v9 +// #904; phantom-sentinel doc fix from claude v10 #904). func runAdapterEncoders(b *snapshotBuilder, opts EncodeOptions) ([]string, error) { var enabled []string for _, r := range adapterRunners() { From 5a2f2148cbef5b05d849cf116100d8139c6cece2 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 14:39:21 +0900 Subject: [PATCH 12/35] backup: #904 v12 - remove stale .fsm on self-test mismatch MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Codex P2 v10: self-test mismatch leaves orphaned stale .fsm When --output already exists from a prior successful run and the new --self-test invocation detects a mismatch, writeAndPublish returned errSelfTestMismatch without removing the stale .fsm. encodeOne then wrote a fresh .encode_info.json with self_test.matched=false and the NEW SHA pointing to the unpublished temp snapshot, leaving: - .fsm: STALE bytes from the prior successful run - .encode_info.json: NEW SHA + matched=false - .mismatch.txt: from the new run The sidecar describes an FSM that does not exist on disk. This violates the CLI contract that "a self-test failure leaves no restore-visible FSM" and breaks runbooks that consume the sidecar's SHA to verify FSM integrity (they would find the stale FSM's SHA does not match the sidecar's NEW SHA). Fix: writeAndPublish, on the self-test-mismatch branch, now os.Remove(cfg.outputPath) before returning. After cleanup the artifacts are internally consistent: the sidecar describes a FAILED encode attempt, mismatch.txt has the diff, and the FSM path is absent (matching the sidecar's intent). errors.Is(rerr, os.ErrNotExist) is treated as success — a self-test mismatch from a first-ever encode (no prior FSM) leaves the path absent, which is the same end state. ## Surgical scope — only self-test mismatch wipes stale .fsm Other exit-2 paths (manifest-floor regression, adapter-data error, unsupported DDB layout) preserve any prior .fsm because those failures occur BEFORE writeAndPublish runs. This preserves the runbook ergonomic of "operator typo'd --last-commit-ts; last good FSM still on disk to retry against." ## Test infrastructure To drive a deterministic self-test mismatch end-to-end through the CLI's writeAndPublish, a minimal test seam was added to package backup: EncodeOptions.SetSelfTestCorruptHookForTest(func(*os.File)) exposes the previously-package-private corruptBufferForTest field to callers outside package backup. Production code MUST NOT call this; the godoc explicitly names cmd/elastickv-snapshot-encode tests as the sole intended caller. Pinned by: - TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch: pre-places a stale .fsm, injects corruption via the new seam, asserts publishErr == errSelfTestMismatch, the stale .fsm is GONE, and mismatch.txt is present. - TestCLINonSelfTestExitTwoPreservesPriorFSM: pre-places a stale .fsm, drives a manifest-floor regression (exit-2 BEFORE writeAndPublish), asserts the stale .fsm is byte-for-byte preserved. Pins the surgical scope of the cleanup. ## Caller audit per CLAUDE.md semantic-change rule - writeAndPublish: sole production caller is encodeOne. On self-test mismatch, encodeOne still writes the sidecar (publishErr matches errSelfTestMismatch) and returns publishErr. The new os.Remove happens before that, so the caller sees IDENTICAL error semantics and the same sidecar-write path. The only difference is on-disk state for .fsm: was stale, is now absent. - EncodeOptions.SetSelfTestCorruptHookForTest: new exported method. No production callers anywhere; one test caller in cmd/elastickv-snapshot-encode/main_test.go. In-package backup tests continue to set corruptBufferForTest directly. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 10 ++ cmd/elastickv-snapshot-encode/main_test.go | 125 +++++++++++++++++++++ internal/backup/encode_snapshot.go | 15 +++ 3 files changed, 150 insertions(+) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index db09f42c0..fbcc037c5 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -335,6 +335,16 @@ func writeAndPublish(cfg *config, encodeOpts backup.EncodeOptions, mismatchPath if werr := os.WriteFile(mismatchPath, result.SelfTestMismatchTxt, mismatchTxtPerm); werr != nil { logger.Warn("write mismatch.txt", "err", werr) } + // Remove the stale .fsm if one exists from a prior + // successful run. encodeOne is about to write a fresh + // .encode_info.json with self_test.matched=false and + // a NEW SHA pointing to the unpublished temp snapshot; leaving + // the old bytes on disk would make the sidecar describe an + // FSM that does not exist and violate the "self-test failure + // leaves no restore-visible FSM" contract (codex P2 v10 #904). + if rerr := os.Remove(cfg.outputPath); rerr != nil && !errors.Is(rerr, os.ErrNotExist) { + logger.Warn("remove stale .fsm on self-test mismatch", "err", rerr) + } return result, errors.Wrap(errSelfTestMismatch, "self-test diff (see "+mismatchPath+")") } if err := os.Rename(tempPath, cfg.outputPath); err != nil { diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index e044fa736..d4d977459 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -12,6 +12,7 @@ import ( "time" "github.com/bootjp/elastickv/internal/backup" + "github.com/cockroachdb/errors" ) // isWindows is true on Windows builds; perm-bit tests skip on Windows @@ -300,6 +301,130 @@ func readSidecar(t *testing.T, output string) backup.EncodeInfo { return info } +// TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch pins codex +// P2 v10 #904: when a prior successful run left an .fsm on +// disk and a new --self-test invocation produces a mismatch, +// writeAndPublish must remove that stale .fsm. Otherwise encodeOne +// writes a fresh sidecar (matched=false, NEW SHA) alongside the OLD +// bytes — violating the CLI contract that a self-test failure leaves +// no restore-visible FSM, and making the sidecar describe an FSM that +// is not on disk. +// +// To drive a deterministic self-test mismatch end-to-end through the +// CLI's writeAndPublish, the test uses the backup package's exported +// test seam (SetSelfTestCorruptHookForTest) to flip bytes in the +// disk-backed self-test buffer between WriteTo and the re-decode. +func TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch(t *testing.T) { + t.Parallel() + rawIn := t.TempDir() + writeSQSFixture(t, rawIn) + emitMinimalManifest(t, rawIn, 7000) + canonicalIn := canonicalizeInput(t, rawIn, 7000) + + out := filepath.Join(t.TempDir(), "out.fsm") + // Pre-place a stale .fsm — what a prior successful run would have + // left behind. The codex P2 v10 contract is that a subsequent + // self-test mismatch invalidates this file. + stalePayload := []byte("STALE FSM FROM PRIOR SUCCESSFUL RUN") + if err := os.WriteFile(out, stalePayload, 0o600); err != nil { + t.Fatalf("WriteFile stale: %v", err) + } + + scratchBase := t.TempDir() + encodeOpts := backup.EncodeOptions{ + InputRoot: canonicalIn, + Adapters: backup.AdapterSet{SQS: true}, + LastCommitTS: 7000, + ManifestLastCommitTS: 7000, + SelfTest: true, + SelfTestDecodeOptions: backup.DecodeOptions{ + OutRoot: scratchBase, + Adapters: backup.AdapterSet{SQS: true}, + }, + } + // Flip bytes past the EKVPBBL1 header so the re-decode trips on + // a malformed entry length and the self-test returns matched=false. + // Pattern mirrors flipBytesPastHeaderHelper in the library test. + encodeOpts.SetSelfTestCorruptHookForTest(func(f *os.File) { + info, ferr := f.Stat() + if ferr != nil { + t.Fatalf("temp Stat: %v", ferr) + } + const headerSkip = 200 + if info.Size() <= headerSkip { + t.Fatalf("temp file too small to corrupt past header: %d bytes", info.Size()) + } + buf := make([]byte, info.Size()-headerSkip) + if _, rerr := f.ReadAt(buf, headerSkip); rerr != nil { + t.Fatalf("ReadAt: %v", rerr) + } + for i := 0; i < len(buf); i += 13 { + buf[i] ^= 0xFF + } + if _, werr := f.WriteAt(buf, headerSkip); werr != nil { + t.Fatalf("WriteAt: %v", werr) + } + }) + + cfg := &config{ + inputPath: canonicalIn, + outputPath: out, + adapters: backup.AdapterSet{SQS: true}, + selfTest: true, + } + mismatchPath := out + ".mismatch.txt" + + _, publishErr := writeAndPublish(cfg, encodeOpts, mismatchPath, quietLogger()) + if !errors.Is(publishErr, errSelfTestMismatch) { + t.Fatalf("publishErr = %v, want errSelfTestMismatch", publishErr) + } + if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { + t.Errorf("stale .fsm at %s not removed after self-test mismatch (codex P2 v10)", out) + } + // The mismatch.txt should be present as the operator-visible + // record of the failed encode attempt. + if _, statErr := os.Stat(mismatchPath); statErr != nil { + t.Errorf("mismatch.txt missing after self-test mismatch: %v", statErr) + } +} + +// TestCLINonSelfTestExitTwoPreservesPriorFSM pins the surgical scope +// of the codex P2 v10 fix: non-self-test exit-2 paths (e.g. the +// manifest-floor HLC regression that fails BEFORE writeAndPublish) +// must NOT remove a prior .fsm. Only self-test mismatch +// triggers the cleanup; a runbook recovering from a typo'd +// --last-commit-ts still has its last good FSM on disk. +func TestCLINonSelfTestExitTwoPreservesPriorFSM(t *testing.T) { + t.Parallel() + in := t.TempDir() + emitMinimalManifest(t, in, 1000) + out := filepath.Join(t.TempDir(), "out.fsm") + stalePayload := []byte("STALE FSM FROM PRIOR SUCCESSFUL RUN") + if err := os.WriteFile(out, stalePayload, 0o600); err != nil { + t.Fatalf("WriteFile stale: %v", err) + } + // Manifest-floor regression → exit-2 from resolveLastCommitTS, + // before writeAndPublish runs. Stale .fsm should be preserved. + code, err := run([]string{ + "--input", in, + "--output", out, + "--last-commit-ts", "500", // below manifest 1000 + }, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want manifest-floor regression") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d", code, exitDataErr) + } + body, rerr := os.ReadFile(out) + if rerr != nil { + t.Fatalf("read stale .fsm: %v (must be preserved on non-self-test exit-2)", rerr) + } + if !bytes.Equal(body, stalePayload) { + t.Errorf("stale .fsm mutated; want preserved on manifest-floor regression") + } +} + // TestCLIRoundTripSelfTestAllAdapters is the gold-standard CLI-level // end-to-end test: a real adapter fixture, encoder runs with // --self-test, exit 0, matched:true in the sidecar. diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index 2e135e56e..c268fad2c 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -135,6 +135,21 @@ type EncodeOptions struct { corruptBufferForTest func(*os.File) } +// SetSelfTestCorruptHookForTest installs a same-process hook that +// fires against the on-disk self-test buffer between WriteTo and the +// re-decode call. The hook can WriteAt into the file to inject +// corruption so the subsequent self-test mismatches deterministically. +// +// Production code MUST NOT call this; it is exclusively a test seam +// for callers OUTSIDE package backup (specifically the +// cmd/elastickv-snapshot-encode CLI tests, which need to drive a real +// end-to-end self-test mismatch to verify the stale-.fsm cleanup +// path — codex P2 v10 #904). In-package tests should set +// EncodeOptions.corruptBufferForTest directly. +func (o *EncodeOptions) SetSelfTestCorruptHookForTest(hook func(*os.File)) { + o.corruptBufferForTest = hook +} + // EncodeResult is the public return value from EncodeSnapshot. Mirrors // the decoder's DecodeResult shape. type EncodeResult struct { From 48b0b0e582a44c59ed62d0c6f08687b16760f1c6 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 14:46:21 +0900 Subject: [PATCH 13/35] backup: #904 v13 - claude v12 doc fix + two carry-over cleanups MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Claude v12 mandatory: writeAndPublish godoc missed v12 cleanup The godoc described only the temp-file deferred cleanup but omitted the new stale-.fsm removal added in v12. A future reader auditing the on-disk state after self-test mismatch would see only the temp-file cleanup documented and need to grep for cfg.outputPath to discover the second removal. Expanded the godoc to call out both. ## Claude v12 carry-over: canonicalizeInput swallowed os.Open error The test helper used "f, _ := os.Open(tmpOut)" and passed nil to DecodeSnapshot if the open failed, producing a panic or misleading decode error instead of a clean t.Fatalf. Fixed: check the error explicitly and defer the Close. CodeRabbit flagged this on v11/v12. ## Claude v12 carry-over: test name contradicted its assertion TestCLISelfTestMismatchWritesSidecarWithMatchedFalse asserted os.IsNotExist on the sidecar — the encode never ran on the manifest-floor path it exercises. Name renamed to TestCLIManifestFloorLeavesNoStaleSidecar to match the actual assertion. The real "sidecar IS written with matched=false on self-test mismatch" behavior is now pinned end-to-end by TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch (v12) which drives a real self-test mismatch through the new corruption seam. Docstring updated to explain the rename history. ## Caller audit per CLAUDE.md semantic-change rule All v13 changes are doc/identifier/error-handling only. No public or package-private function signature, error semantics, or return contract changed. Renamed test has no callers (Go testing framework discovers by prefix). canonicalizeInput's error handling is strictly safer — any open failure now surfaces at the failure site. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 6 ++- cmd/elastickv-snapshot-encode/main_test.go | 43 ++++++++++++---------- 2 files changed, 28 insertions(+), 21 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index fbcc037c5..e5a394a5d 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -314,8 +314,10 @@ func buildEncodeOptions(cfg *config, effectiveTS uint64, manifest backup.Manifes // writeAndPublish writes the .fsm to a temp path, runs the optional // self-test via EncodeSnapshot, and renames temp → output on success. -// On self-test failure: writes mismatch.txt, removes the temp file via -// the deferred cleanup, returns errSelfTestMismatch. +// On self-test failure: writes mismatch.txt, removes any stale +// .fsm left by a prior successful run (codex P2 v10 #904), +// removes the temp file via the deferred cleanup, returns +// errSelfTestMismatch. func writeAndPublish(cfg *config, encodeOpts backup.EncodeOptions, mismatchPath string, logger *slog.Logger) (backup.EncodeResult, error) { tempPath, err := tempOutputPath(cfg.outputPath) if err != nil { diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index d4d977459..f0fd1f89d 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -275,14 +275,17 @@ func canonicalizeInput(t *testing.T, rawIn string, lastCommitTS uint64) string { if err != nil || code != exitSuccess { t.Fatalf("canonical encode: code=%d err=%v", code, err) } - f, _ := os.Open(tmpOut) + f, oerr := os.Open(tmpOut) + if oerr != nil { + t.Fatalf("open canonical output: %v", oerr) + } + defer func() { _ = f.Close() }() if _, err := backup.DecodeSnapshot(f, backup.DecodeOptions{ OutRoot: canonicalIn, Adapters: backup.AdapterSet{SQS: true}, }); err != nil { t.Fatalf("canonical decode: %v", err) } - _ = f.Close() emitMinimalManifest(t, canonicalIn, lastCommitTS) return canonicalIn } @@ -457,25 +460,27 @@ func TestCLIRoundTripSelfTestAllAdapters(t *testing.T) { } } -// TestCLISelfTestMismatchWritesSidecarWithMatchedFalse pins codex P2 v6 -// #904: when --self-test detects a mismatch, the CLI MUST still write -// .encode_info.json with self_test.matched=false alongside -// .mismatch.txt. Operators need both files to diagnose a -// failed self-test (sidecar carries SHA256, effective T, adapters). +// TestCLIManifestFloorLeavesNoStaleSidecar pins that the +// manifest-floor preflight failure (--last-commit-ts T < manifest; +// fails in resolveLastCommitTS BEFORE writeAndPublish) leaves NO +// .encode_info.json on disk — neither a fresh one nor a +// stale one from a prior run (the pre-encode cleanup at the start +// of encodeOne removes it). // -// Driven via --last-commit-ts T < manifest (data-error path) since -// that's the only deterministic CLI-level mismatch trigger; a real -// self-test mismatch needs the same write path. Future cleanup: when -// the library-level corruption hook is exposed via a build-tagged -// CLI test seam, switch to a real self-test mismatch trigger. -func TestCLISelfTestMismatchWritesSidecarWithMatchedFalse(t *testing.T) { +// Note: the test name was previously TestCLISelfTestMismatchWritesSidecarWithMatchedFalse, +// which contradicted the assertion (the encode does NOT run on this +// path, so no sidecar is written). The actual sidecar-on-mismatch +// behavior is now pinned end-to-end by +// TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch using the +// CLI-level corruption seam (codex P2 v6/v10 #904; claude v12 rename). +func TestCLIManifestFloorLeavesNoStaleSidecar(t *testing.T) { t.Parallel() - // We can drive a real self-test mismatch path at the library - // level (covered by TestEncodeSnapshotSelfTestDetectsCorruption). - // At the CLI level, we additionally need to confirm the sidecar - // is published on the data-error branch — the manifest-floor - // regression path that exit-2's via the same wrap-then-return - // code that previously skipped writeSidecar. + // The pre-encode cleanup at the top of encodeOne removes any + // stale .encode_info.json before writeAndPublish runs. + // On the manifest-floor path, resolveLastCommitTS exits with + // exit-2 BEFORE that cleanup even runs (it's the second step in + // encodeOne after readInputManifest). So the assertion is: a + // fresh TempDir produces no sidecar at all. in := t.TempDir() emitMinimalManifest(t, in, 1000) out := filepath.Join(t.TempDir(), "out.fsm") From 71ecc126993a8fbb5d374d1ae6b2061c52aed312 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 14:56:32 +0900 Subject: [PATCH 14/35] backup: #904 v14 - fan out Redis encoder across all db_ directories MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Codex P1 v13: hardcoded db_0 silently dropped non-default Redis DBs The Redis adapter runner in adapterRunners was hardcoded to NewRedisEncoder(root, 0). RedisEncoder is scoped to a single redis/db_/ subdirectory; any Phase 1 multi-DB dump that included redis/db_3/, redis/db_5/, etc. would silently drop those DBs from the produced .fsm — a header-only encode for those keys, exactly the silent-data-loss pattern the encoder is supposed to fail closed against. With --self-test on, the same valid multi-DB dump would be rejected at the diff stage (the decoded scratch tree would not match the input's db_3 content). Phase 0a/0b inputs only emit redis/db_0/, so this is a latent bug that fires the moment Phase 1's multi-DB dumper lands; codex flagged it as P1. ## Fix: enumerate redis/db_ directories and fan out per DB Two new helpers in encode_snapshot.go: - enumerateRedisDBs(inRoot) returns the sorted dbIndex values for which /redis/db_/ exists as a directory. A missing redis/ returns nil (no-op, matching the per-DB encoder's missing-subdir convention). Non-canonical entries (non-numeric suffix, negative, leading-zero like "db_01", wrong prefix, regular files at the redis/ level) are silently skipped — they cannot be produced by the canonical decoder. A symlinked or regular-file redis/ path is rejected with ErrRedisEncodeNotDir (mirrors the per-DB encoder's symlink refusal). - encodeAllRedisDBs(b, inRoot) invokes NewRedisEncoder per enumerated index in ascending order, wrapping per-DB errors with "redis encoder db_%d" for traceability. The redis adapter runner now uses encodeAllRedisDBs as its function value directly (gocritic unlambda). ## Pinned by - TestEnumerateRedisDBsMissingDir: missing redis/ returns nil indices. - TestEnumerateRedisDBsMixedEntries: only canonical db_ dirs kept; non-numeric, negative, leading-zero, empty-suffix, wrong- prefix, and regular-file entries are skipped; result sorted ascending (asserted via direct index comparison). - TestEnumerateRedisDBsRedisIsRegularFile: regular file at redis/ path triggers ErrRedisEncodeNotDir. - TestEncodeSnapshotRedisMultiDB: a fixture under redis/db_3/ ONLY (no db_0) produces strictly more bytes than an empty-redis baseline through EncodeSnapshot. Pre-fix, the hardcoded db_0 path would have produced an identical header-only .fsm for both inputs. ## Caller audit per CLAUDE.md semantic-change rule - adapterRunners.redis: closure replaced with encodeAllRedisDBs function value (gocritic unlambda). Error contract: - errors.Is(err, ErrEncodeAdapterData) — still true (errors.Mark in runAdapterEncoders is unchanged). - errors.Is(err, ErrRedisEncodeNotDir) etc. — still true (errors.Mark + errors.Wrap preserve the inner chain). - Error prefix changed from "redis encoder: ..." to "redis encoder db_%d: ..." (or "redis encoder enumerate: ..." for the enumeration step). More specific, not less. - NewRedisEncoder direct callers (encode_redis_test.go, encode_redis_coll_test.go, encode_redis_hardlink_unix_test.go): unaffected — they call NewRedisEncoder directly, never through the adapter runner. - enumerateRedisDBs / parseRedisDBDir / checkRedisRoot / encodeAllRedisDBs are new package-private functions; no external callers exist yet. ## Lint compliance Two cyclop refactor splits required: - enumerateRedisDBs (initially 12 branches) split into checkRedisRoot + parseRedisDBDir helpers. Each is ≤6 branches. - TestEnumerateRedisDBs (initially 14 branches via t.Run) split into three top-level tests. Each is ≤4 branches. Tests + lint green. --- internal/backup/encode_snapshot.go | 104 ++++++++++++++++++++- internal/backup/encode_snapshot_test.go | 119 ++++++++++++++++++++++++ 2 files changed, 220 insertions(+), 3 deletions(-) diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index c268fad2c..e994d7dff 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -356,6 +356,106 @@ func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, ou return result, nil } +// redisDBDirPrefix is the canonical "db_" prefix produced by the +// decoder for redis/db_/ directories. Mirrored by encoder +// enumeration (encodeAllRedisDBs) so a multi-DB dump round-trips. +const redisDBDirPrefix = "db_" + +// enumerateRedisDBs returns the sorted dbIndex values for which +// /redis/db_/ exists as a directory. A missing redis/ +// directory returns nil; the caller treats it as no-op (same convention +// as the per-DB encoder, which is a no-op when its db_ subdir is +// absent). Non-db_ entries (regular files, symlinks at the redis/ +// level, non-numeric or non-canonical suffixes like "db_-1" or +// "db_01") are silently skipped — they cannot have been produced by +// the canonical decoder and are not the encoder's concern. +// +// Codex P1 v13 #904: replaces the prior hardcoded NewRedisEncoder(_, 0) +// in adapterRunners that silently dropped non-default DBs from any +// future Phase 1 multi-DB dump. +func enumerateRedisDBs(inRoot string) ([]int, error) { + redisDir := filepath.Join(inRoot, "redis") + if err := checkRedisRoot(redisDir); err != nil { + return nil, err + } + entries, err := os.ReadDir(redisDir) + if err != nil { + if errors.Is(err, os.ErrNotExist) { + return nil, nil + } + return nil, errors.WithStack(err) + } + var indices []int + for _, ent := range entries { + if idx, ok := parseRedisDBDir(ent); ok { + indices = append(indices, idx) + } + } + sort.Ints(indices) + return indices, nil +} + +// checkRedisRoot stats /redis/ and rejects symlink / non-dir +// shapes. Missing is allowed (caller returns nil indices). Split out +// of enumerateRedisDBs to keep that function under the cyclop bound. +func checkRedisRoot(redisDir string) error { + info, err := os.Lstat(redisDir) + switch { + case errors.Is(err, os.ErrNotExist): + return nil + case err != nil: + return errors.WithStack(err) + case info.Mode()&os.ModeSymlink != 0: + // Symlinked redis/ would let os.OpenRoot in the per-DB encoder + // resolve outside the dump tree (mirrors the per-DB encoder's + // symlink refusal on redis/db_). + return errors.Wrapf(ErrRedisEncodeNotDir, "redis path %q is a symlink", redisDir) + case !info.IsDir(): + return errors.Wrapf(ErrRedisEncodeNotDir, "redis path %q is not a directory", redisDir) + } + return nil +} + +// parseRedisDBDir returns (dbIndex, true) when ent names a canonical +// db_ directory (N is a non-negative decimal with no leading zeros). +// Non-matching entries return (0, false) so the caller can skip without +// erroring — they cannot have been produced by the canonical decoder. +// Reject non-canonical decimals so a hypothetical Phase 1 dumper cannot +// double-emit the same db under two distinct directory names. +func parseRedisDBDir(ent os.DirEntry) (int, bool) { + if !ent.IsDir() { + return 0, false + } + name := ent.Name() + if !strings.HasPrefix(name, redisDBDirPrefix) { + return 0, false + } + suffix := name[len(redisDBDirPrefix):] + idx, err := strconv.Atoi(suffix) + if err != nil || idx < 0 || strconv.Itoa(idx) != suffix { + return 0, false + } + return idx, true +} + +// encodeAllRedisDBs invokes NewRedisEncoder per discovered db_ +// directory in ascending index order. A missing redis/ directory is a +// no-op. Codex P1 v13 #904: replaces the prior hardcoded db_0 fan-out +// which would silently drop non-default DBs from any Phase 1 multi-DB +// dump. Per-DB errors are wrapped with the db index for traceability. +func encodeAllRedisDBs(b *snapshotBuilder, inRoot string) error { + indices, err := enumerateRedisDBs(inRoot) + if err != nil { + return errors.Wrap(err, "redis encoder enumerate") + } + for _, idx := range indices { + if err := NewRedisEncoder(inRoot, idx).Encode(b); err != nil { + return errors.Wrapf(err, "redis encoder db_%d", idx) + } + } + return nil +} + // adapterRunner pairs an enabled-check with an Encode call, keeping // runAdapterEncoders's per-iteration body to two branches (cyclop). type adapterRunner struct { @@ -366,9 +466,7 @@ type adapterRunner struct { func adapterRunners() []adapterRunner { return []adapterRunner{ - {"redis", func(s AdapterSet) bool { return s.Redis }, func(b *snapshotBuilder, root string) error { - return errors.Wrap(NewRedisEncoder(root, 0).Encode(b), "redis encoder") - }}, + {"redis", func(s AdapterSet) bool { return s.Redis }, encodeAllRedisDBs}, {"dynamodb", func(s AdapterSet) bool { return s.DynamoDB }, func(b *snapshotBuilder, root string) error { return errors.Wrap(NewDynamoDBEncoder(root).Encode(b), "dynamodb encoder") }}, diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index e32c74ba0..a13019a88 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -388,6 +388,125 @@ func TestEncodeSnapshotRejectsZeroAdapterSet(t *testing.T) { } } +// TestEnumerateRedisDBsMissingDir pins codex P1 v13 #904: a missing +// redis/ directory returns nil indices (no-op), matching the per-DB +// encoder's "missing db_ = nothing to encode" convention. +func TestEnumerateRedisDBsMissingDir(t *testing.T) { + t.Parallel() + indices, err := enumerateRedisDBs(t.TempDir()) + if err != nil { + t.Fatalf("err = %v, want nil", err) + } + if indices != nil { + t.Errorf("indices = %v, want nil for missing redis/", indices) + } +} + +// TestEnumerateRedisDBsMixedEntries pins codex P1 v13 #904: only +// canonical db_ entries are kept; non-numeric, negative, leading- +// zero, empty-suffix, wrong-prefix, and non-directory entries are +// silently skipped. The returned slice is sorted ascending. +func TestEnumerateRedisDBsMixedEntries(t *testing.T) { + t.Parallel() + in := t.TempDir() + for _, name := range []string{"db_0", "db_1", "db_5"} { + if err := os.MkdirAll(filepath.Join(in, "redis", name), 0o755); err != nil { + t.Fatalf("MkdirAll %s: %v", name, err) + } + } + // Entries that MUST be skipped: + // db_garbage — non-numeric suffix + // db_-1 — negative + // db_01 — non-canonical leading zero + // db_ — empty suffix + // notdb_2 — wrong prefix + for _, name := range []string{"db_garbage", "db_-1", "db_01", "db_", "notdb_2"} { + if err := os.MkdirAll(filepath.Join(in, "redis", name), 0o755); err != nil { + t.Fatalf("MkdirAll %s: %v", name, err) + } + } + // A regular file under redis/ must be skipped (not enumerable). + if err := os.WriteFile(filepath.Join(in, "redis", "README"), []byte("x"), 0o600); err != nil { + t.Fatalf("WriteFile README: %v", err) + } + indices, err := enumerateRedisDBs(in) + if err != nil { + t.Fatalf("enumerateRedisDBs: %v", err) + } + want := []int{0, 1, 5} + if len(indices) != len(want) { + t.Fatalf("indices = %v, want %v", indices, want) + } + for i, v := range want { + if indices[i] != v { + t.Errorf("indices[%d] = %d, want %d", i, indices[i], v) + } + } +} + +// TestEnumerateRedisDBsRedisIsRegularFile pins fail-closed when the +// "redis" path inside the dump is a regular file rather than a +// directory — distinct from the missing case. +func TestEnumerateRedisDBsRedisIsRegularFile(t *testing.T) { + t.Parallel() + in := t.TempDir() + if err := os.WriteFile(filepath.Join(in, "redis"), []byte("not a dir"), 0o600); err != nil { + t.Fatalf("WriteFile: %v", err) + } + _, err := enumerateRedisDBs(in) + if !errors.Is(err, ErrRedisEncodeNotDir) { + t.Errorf("err = %v, want errors.Is ErrRedisEncodeNotDir", err) + } +} + +// TestEncodeSnapshotRedisMultiDB pins codex P1 v13 #904: the Redis +// fan-out in adapterRunners enumerates redis/db_/ and invokes the +// per-DB encoder for each. The fixture places a single string under +// redis/db_3/ ONLY (no db_0). Pre-fix, the encoder hardcoded +// NewRedisEncoder(root, 0) and produced a header-only .fsm for this +// input — silent data loss. Post-fix, the db_3 string is included. +// +// Assertion is content-free: compare encoded byte count against an +// empty-redis baseline. With multi-DB fan-out, the db_3 fixture +// produces MORE bytes than an empty tree. Without it, both encodes +// would produce identical header-only output. +func TestEncodeSnapshotRedisMultiDB(t *testing.T) { + t.Parallel() + emptyIn := t.TempDir() + var emptyBuf bytes.Buffer + emptyResult, err := EncodeSnapshot(EncodeOptions{ + InputRoot: emptyIn, + Adapters: AdapterSet{Redis: true}, + LastCommitTS: 1, + }, &emptyBuf) + if err != nil { + t.Fatalf("EncodeSnapshot empty: %v", err) + } + + in := t.TempDir() + encKey := EncodeSegment([]byte("k3")) + db3Strings := filepath.Join(in, "redis", "db_3", "strings") + if err := os.MkdirAll(db3Strings, 0o755); err != nil { + t.Fatalf("MkdirAll db_3/strings: %v", err) + } + if err := os.WriteFile(filepath.Join(db3Strings, encKey+".bin"), []byte("v3"), 0o600); err != nil { + t.Fatalf("WriteFile db_3 string: %v", err) + } + var buf bytes.Buffer + result, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{Redis: true}, + LastCommitTS: 1, + }, &buf) + if err != nil { + t.Fatalf("EncodeSnapshot db_3-only: %v", err) + } + if result.BytesWritten <= emptyResult.BytesWritten { + t.Errorf("BytesWritten with redis/db_3 fixture (%d) <= empty (%d); pre-fix, hardcoded db_0 fan-out dropped db_3 silently", + result.BytesWritten, emptyResult.BytesWritten) + } +} + // TestEncodeSnapshotMarksAdapterDataErrors pins codex P2 v9 #904: when // an adapter encoder rejects the input tree's contents (e.g. a // malformed DynamoDB _schema.json), EncodeSnapshot must surface the From af8c279c4cb6e65d80f2b6f0c72bee360e829197 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 15:04:57 +0900 Subject: [PATCH 15/35] backup: #904 v15 - two codex P2 v14 fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Codex P2 v14 (1/2): mismatch cleanup must not delete directories v12 added os.Remove(cfg.outputPath) on the self-test-mismatch branch of writeAndPublish to keep the sidecar+FSM internally consistent. That removal was unconditional. If --output names a directory (an operator typo — e.g. forgot the .fsm filename and passed a directory instead), the normal publish path would have failed at os.Rename, but the mismatch cleanup branch would silently call os.Remove on the directory and, for an empty one, delete it. Destructive behavior specific to the mismatch path. Fix: new helper removeStaleOutputFSM(outputPath, logger) performs an os.Lstat first and only proceeds with os.Remove when info.Mode() reports IsRegular. Non-regular files (directory, symlink, named pipe, etc.) are logged at warn level and left alone. Stat errors other than ErrNotExist are also warn-and-continue. Pinned by TestCLISelfTestMismatchSkipsDirectoryAtOutputPath: pre- places an empty directory at --output, drives a self-test mismatch via the corruption seam, asserts publishErr == errSelfTestMismatch AND the directory is still present + still a directory. ## Codex P2 v14 (2/2): classify corrupt manifests as data failures readInputManifest returns backup.ErrInvalidManifest (invalid JSON or schema-invariant violation) and backup.ErrUnsupportedFormatVersion (format_version unknown) when the input MANIFEST.json is broken. Neither sentinel was mapped in run(), so the CLI exited 1 — treating a broken dump tree as an operator-flag error and breaking runbook recovery paths that branch on exit status to triage corrupt-input vs operator-typo. Fix: extracted the error→exit-code mapping into classifyEncodeError and added ErrInvalidManifest + ErrUnsupportedFormatVersion to the exit-2 set. Switch-with-multiple-cases form keeps the function under nestif/cyclop. Pinned by TestCLIInvalidManifestExitsTwo (subtests: invalid JSON body; unsupported format_version). Both subtests assert run() exits exitDataErr. ## Caller audit per CLAUDE.md semantic-change rule - writeAndPublish self-test mismatch branch: os.Remove call replaced with removeStaleOutputFSM. Sole caller is encodeOne; the publishErr contract is unchanged (still errSelfTestMismatch); on-disk state changes ONLY when --output was a non-regular file, in which case the prior behavior was destructive AND wrong. No legitimate caller is impacted. - run() error classification: extracted to classifyEncodeError. Sole caller is run(). Two additional sentinels now classify as exit-2 (ErrInvalidManifest, ErrUnsupportedFormatVersion); all prior exit-2 sentinels still classify as exit-2; exit-1 paths are strictly a subset of the prior set. No runbook that branches on exit-2 would regress; runbooks that branch on exit-1 will see corrupt-manifest cases move to exit-2 where they belong. ## Lint compliance run() had nestif complexity 5 (six sequential errors.Is branches). classifyEncodeError uses a switch with multiple errors.Is cases per arm, which drops the complexity score below the linter bound. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 98 ++++++++++++----- cmd/elastickv-snapshot-encode/main_test.go | 120 +++++++++++++++++++++ 2 files changed, 190 insertions(+), 28 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index e5a394a5d..5c9d13c62 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -81,29 +81,42 @@ func run(argv []string, logger *slog.Logger) (int, error) { return exitUserErr, err } if err := encodeOne(cfg, logger); err != nil { - // Errors from the encoder layer that represent data constraints - // (HLC ceiling regression, JSONL layout, self-test mismatch, - // adapter rejecting input-tree contents) are exit 2; flag/path - // errors are exit 1. Runbooks branch on exit status to triage - // bad-dump-data vs operator typos, so this split is part of the - // CLI contract (codex P2 v9 #904 added the adapter-data branch). - if errors.Is(err, backup.ErrSelfTestLowerLastCommitTS) { - return exitDataErr, err - } - if errors.Is(err, backup.ErrEncodeUnsupportedDynamoDBLayout) { - return exitDataErr, err - } - if errors.Is(err, backup.ErrEncodeAdapterData) { - return exitDataErr, err - } - if errors.Is(err, errSelfTestMismatch) { - return exitDataErr, err - } - return exitUserErr, err + return classifyEncodeError(err), err } return exitSuccess, nil } +// classifyEncodeError maps the encodeOne return value to a CLI exit +// code. Data-correctness sentinels (HLC ceiling regression, JSONL +// layout, adapter rejecting input-tree contents, self-test mismatch, +// corrupt manifest) → exit 2; everything else → exit 1. Runbooks +// branch on exit status to triage bad-dump-data vs operator typos, +// so this mapping is part of the CLI contract. +// +// Sources of each sentinel: +// - ErrSelfTestLowerLastCommitTS: CLI resolveLastCommitTS + library +// validateEncodeOptionsData (codex P2 v2 #904) +// - ErrEncodeUnsupportedDynamoDBLayout: validateEncodeOptionsData +// (codex P2 v7 #904) +// - ErrEncodeAdapterData: runAdapterEncoders mark on adapter +// rejection (codex P2 v9 #904) +// - errSelfTestMismatch: writeAndPublish self-test branch +// - ErrInvalidManifest / ErrUnsupportedFormatVersion: readInputManifest +// surfacing backup.ReadManifest sentinels (codex P2 v14 #904) +func classifyEncodeError(err error) int { + switch { + case errors.Is(err, backup.ErrSelfTestLowerLastCommitTS), + errors.Is(err, backup.ErrEncodeUnsupportedDynamoDBLayout), + errors.Is(err, backup.ErrEncodeAdapterData), + errors.Is(err, errSelfTestMismatch), + errors.Is(err, backup.ErrInvalidManifest), + errors.Is(err, backup.ErrUnsupportedFormatVersion): + return exitDataErr + default: + return exitUserErr + } +} + func parseFlags(argv []string) (*config, error) { fs := flag.NewFlagSet("elastickv-snapshot-encode", flag.ContinueOnError) fs.SetOutput(io.Discard) @@ -312,6 +325,30 @@ func buildEncodeOptions(cfg *config, effectiveTS uint64, manifest backup.Manifes return encodeOpts } +// removeStaleOutputFSM removes outputPath ONLY if it exists and is a +// regular file. A directory or special-file at the path is left alone +// (codex P2 v14 #904 — the prior unconditional os.Remove would have +// deleted an empty directory the operator passed in error to --output). +// Errors other than ErrNotExist are downgraded to warn-and-continue so +// the caller's primary mismatch error remains the dominant signal. +func removeStaleOutputFSM(outputPath string, logger *slog.Logger) { + info, err := os.Lstat(outputPath) + if err != nil { + if !errors.Is(err, os.ErrNotExist) { + logger.Warn("stat stale .fsm on self-test mismatch", "err", err) + } + return + } + if !info.Mode().IsRegular() { + logger.Warn("skip stale .fsm cleanup: --output is not a regular file", + "path", outputPath, "mode", info.Mode()) + return + } + if rerr := os.Remove(outputPath); rerr != nil && !errors.Is(rerr, os.ErrNotExist) { + logger.Warn("remove stale .fsm on self-test mismatch", "err", rerr) + } +} + // writeAndPublish writes the .fsm to a temp path, runs the optional // self-test via EncodeSnapshot, and renames temp → output on success. // On self-test failure: writes mismatch.txt, removes any stale @@ -338,15 +375,20 @@ func writeAndPublish(cfg *config, encodeOpts backup.EncodeOptions, mismatchPath logger.Warn("write mismatch.txt", "err", werr) } // Remove the stale .fsm if one exists from a prior - // successful run. encodeOne is about to write a fresh - // .encode_info.json with self_test.matched=false and - // a NEW SHA pointing to the unpublished temp snapshot; leaving - // the old bytes on disk would make the sidecar describe an - // FSM that does not exist and violate the "self-test failure - // leaves no restore-visible FSM" contract (codex P2 v10 #904). - if rerr := os.Remove(cfg.outputPath); rerr != nil && !errors.Is(rerr, os.ErrNotExist) { - logger.Warn("remove stale .fsm on self-test mismatch", "err", rerr) - } + // successful run AND is a regular file. encodeOne is about to + // write a fresh .encode_info.json with + // self_test.matched=false and a NEW SHA pointing to the + // unpublished temp snapshot; leaving old bytes on disk would + // make the sidecar describe an FSM that does not exist and + // violate the "self-test failure leaves no restore-visible + // FSM" contract (codex P2 v10 #904). + // + // The mode-check guards against an --output that names a + // directory (or any non-regular file): the normal publish + // path would fail at os.Rename anyway, but the mismatch + // cleanup must not destructively delete a directory the + // operator passed in error (codex P2 v14 #904). + removeStaleOutputFSM(cfg.outputPath, logger) return result, errors.Wrap(errSelfTestMismatch, "self-test diff (see "+mismatchPath+")") } if err := os.Rename(tempPath, cfg.outputPath); err != nil { diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index f0fd1f89d..7c389f7c8 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -304,6 +304,126 @@ func readSidecar(t *testing.T, output string) backup.EncodeInfo { return info } +// TestCLISelfTestMismatchSkipsDirectoryAtOutputPath pins codex P2 v14 +// #904: the self-test-mismatch cleanup must NOT delete an --output +// path that resolves to a directory (or any non-regular file). The +// prior unconditional os.Remove(cfg.outputPath) would have wiped an +// empty directory the operator passed in error. +// +// The fixture pre-creates an empty directory at the --output path, +// drives a self-test mismatch, asserts publishErr == errSelfTestMismatch, +// and asserts the directory is STILL PRESENT (the destructive +// cleanup did not fire). The normal publish path would have failed +// at os.Rename — this test pins the mismatch-cleanup-specific guard. +func TestCLISelfTestMismatchSkipsDirectoryAtOutputPath(t *testing.T) { + t.Parallel() + rawIn := t.TempDir() + writeSQSFixture(t, rawIn) + emitMinimalManifest(t, rawIn, 7000) + canonicalIn := canonicalizeInput(t, rawIn, 7000) + + out := filepath.Join(t.TempDir(), "out.fsm") + // Pre-create a directory at the --output path — an operator + // typo, but the cleanup MUST NOT destructively remove it. + if err := os.Mkdir(out, 0o755); err != nil { + t.Fatalf("Mkdir: %v", err) + } + + scratchBase := t.TempDir() + encodeOpts := backup.EncodeOptions{ + InputRoot: canonicalIn, + Adapters: backup.AdapterSet{SQS: true}, + LastCommitTS: 7000, + ManifestLastCommitTS: 7000, + SelfTest: true, + SelfTestDecodeOptions: backup.DecodeOptions{ + OutRoot: scratchBase, + Adapters: backup.AdapterSet{SQS: true}, + }, + } + encodeOpts.SetSelfTestCorruptHookForTest(func(f *os.File) { + info, ferr := f.Stat() + if ferr != nil { + t.Fatalf("temp Stat: %v", ferr) + } + const headerSkip = 200 + if info.Size() <= headerSkip { + t.Fatalf("temp file too small to corrupt: %d", info.Size()) + } + buf := make([]byte, info.Size()-headerSkip) + if _, rerr := f.ReadAt(buf, headerSkip); rerr != nil { + t.Fatalf("ReadAt: %v", rerr) + } + for i := 0; i < len(buf); i += 13 { + buf[i] ^= 0xFF + } + if _, werr := f.WriteAt(buf, headerSkip); werr != nil { + t.Fatalf("WriteAt: %v", werr) + } + }) + + cfg := &config{ + inputPath: canonicalIn, + outputPath: out, + adapters: backup.AdapterSet{SQS: true}, + selfTest: true, + } + mismatchPath := out + ".mismatch.txt" + + _, publishErr := writeAndPublish(cfg, encodeOpts, mismatchPath, quietLogger()) + if !errors.Is(publishErr, errSelfTestMismatch) { + t.Fatalf("publishErr = %v, want errSelfTestMismatch", publishErr) + } + info, statErr := os.Stat(out) + if statErr != nil { + t.Fatalf("output path missing after mismatch (codex P2 v14 destructive cleanup regression): %v", statErr) + } + if !info.IsDir() { + t.Errorf("output mode = %s; expected the pre-placed directory to be preserved", info.Mode()) + } +} + +// TestCLIInvalidManifestExitsTwo pins codex P2 v14 #904: a malformed +// MANIFEST.json (invalid JSON or unsupported format_version) surfaces +// backup.ErrInvalidManifest / backup.ErrUnsupportedFormatVersion from +// readInputManifest, and the CLI MUST map both to exit 2 +// (data-correctness). Treating a broken manifest as exit 1 misroutes +// runbook recovery for corrupt-dump scenarios. +func TestCLIInvalidManifestExitsTwo(t *testing.T) { + t.Parallel() + t.Run("invalid JSON body", func(t *testing.T) { + t.Parallel() + in := t.TempDir() + if err := os.WriteFile(filepath.Join(in, "MANIFEST.json"), []byte("{not json"), 0o600); err != nil { + t.Fatalf("WriteFile: %v", err) + } + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{"--input", in, "--output", out}, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want manifest parse error") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d (invalid manifest is data-correctness)", code, exitDataErr) + } + }) + t.Run("unsupported format_version", func(t *testing.T) { + t.Parallel() + in := t.TempDir() + if err := os.WriteFile(filepath.Join(in, "MANIFEST.json"), + []byte(`{"format_version":99,"last_commit_ts":1}`), 0o600); err != nil { + t.Fatalf("WriteFile: %v", err) + } + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{"--input", in, "--output", out}, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want unsupported format_version") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d (unsupported manifest format_version is data-correctness)", code, exitDataErr) + } + }) +} + // TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch pins codex // P2 v10 #904: when a prior successful run left an .fsm on // disk and a new --self-test invocation produces a mismatch, From b52a688e63e03ef78ed749f1ecea062ad4aa9f7f Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 15:15:20 +0900 Subject: [PATCH 16/35] backup: #904 v16 - correct v14 Redis multi-DB fan-out (codex P2 v14) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit v14 added an enumerate-and-fan-out path for redis/db_/ to address codex P1 v13's silent-data-loss concern. Codex's v14 follow-up clarified that the fan-out is itself incorrect, and two P2 findings landed (L452 + L427). ## Codex P2 v14 L452: Reject Redis DB fan-out until DBs are encoded The Redis MVCC key prefixes (!redis|str|, !redis|hll|, !redis|ttl|, plus the collection helpers' analogues) carry NO database component. Feeding two distinct DBs into the same snapshotBuilder would: - collide on same-named keys across DBs (b.Add returns ErrDuplicate), or - merge keys under db_0 at decode time (DecodeOptions.RedisDBIndex defaults to 0, so a db_3-only self-test would decode under db_0 and the structural diff would fail). In either case the produced .fsm is mis-scoped: the v14 fan-out replaces silent-data-loss with silent-merge-or-collide. Fix: encodeAllRedisDBs now enumerates and decides: - 0 indices (no redis/, or empty redis/) → no-op. - [0] only → proceed with NewRedisEncoder(_, 0) (the legacy path). - anything else → fail closed with new sentinel ErrRedisEncodeMultiDBUnsupported. The sentinel is marked at the runAdapterEncoders boundary with ErrEncodeAdapterData so the CLI routes it to exit-2. Until Phase 1 makes native keys DB-aware, multi-DB inputs are quarantined as data-correctness failures rather than silently mis-scoped output. ## Codex P2 v14 L427: Fail closed on malformed Redis db_N entries The v14 parseRedisDBDir(ent) returned (_, false) whenever !ent.IsDir(), so a regular file or symlink at redis/db_0 would be silently skipped and the encoder would publish a header-only .fsm. Fix: split into parseRedisDBName (pure name parser) and shift the ent.IsDir() check into enumerateRedisDBs. When the name matches the canonical db_ pattern AND the entry is not a directory, enumerateRedisDBs returns ErrRedisEncodeNotDir (mirroring the per-DB encoder's existing Lstat guard). ## Pinned by - TestEncodeSnapshotRedisRejectsNonZeroDB: redis/db_3-only fixture through EncodeSnapshot → errors.Is ErrRedisEncodeMultiDBUnsupported AND errors.Is ErrEncodeAdapterData (both fire). - TestEncodeSnapshotRedisRejectsMultipleDBs: redis/db_0 + redis/db_3 → errors.Is ErrRedisEncodeMultiDBUnsupported. - TestEnumerateRedisDBsRejectsNonDirDBEntry: redis/db_2 as a regular file → errors.Is ErrRedisEncodeNotDir. - The replaced TestEncodeSnapshotRedisMultiDB (v14) was reformulated as TestEncodeSnapshotRedisRejectsNonZeroDB; the prior "more bytes than baseline" assertion no longer holds because the encoder rejects the fixture upfront. ## Caller audit per CLAUDE.md semantic-change rule - encodeAllRedisDBs: sole caller is adapterRunners.redis. Semantic change: was "fan out across all db_"; now "encode db_0 only, fail closed on anything else." Error contract: - errors.Is(err, ErrEncodeAdapterData) — still true (mark applied at runAdapterEncoders, unchanged). - errors.Is(err, ErrRedisEncodeMultiDBUnsupported) — new path. - errors.Is(err, ErrRedisEncodeNotDir) — newly reachable via enumerateRedisDBs for malformed db_ entries. All in-tree NewRedisEncoder direct callers (encode_redis_test.go, encode_redis_coll_test.go, encode_redis_hardlink_unix_test.go) are unaffected — they bypass the adapter runner. - enumerateRedisDBs: was "silently skip non-dir db_"; now "fail closed with ErrRedisEncodeNotDir." No production callers exist outside encodeAllRedisDBs; tests in encode_snapshot_test.go cover the new fail-closed path. - parseRedisDBDir → parseRedisDBName: signature change (DirEntry → string). Sole caller is enumerateRedisDBs (updated in lock-step). - v14's TestEncodeSnapshotRedisMultiDB is replaced — see above. TestEnumerateRedisDBsMixedEntries (v14) continues to pass: its positive entries (db_0, db_1, db_5) are all directories, so the IsDir check it never exercises remains exercised by the new TestEnumerateRedisDBsRejectsNonDirDBEntry. Tests + lint green. --- internal/backup/encode_snapshot.go | 88 ++++++++++++++++------ internal/backup/encode_snapshot_test.go | 97 ++++++++++++++++++------- 2 files changed, 135 insertions(+), 50 deletions(-) diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index e994d7dff..1946fd28c 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -41,6 +41,22 @@ var ErrSelfTestLowerLastCommitTS = errors.New("backup: --last-commit-ts T < mani // until the encoder learns the JSONL layout (M7 / future milestone). var ErrEncodeUnsupportedDynamoDBLayout = errors.New("backup: DynamoDB JSONL layout not supported by encoder") +// ErrRedisEncodeMultiDBUnsupported is returned when the input tree +// contains a Redis db_/ for any N != 0, or contains multiple +// db_ directories. The current Redis MVCC key prefixes +// (!redis|str|, !redis|hll|, !redis|ttl|, …) carry NO database +// component, so feeding two distinct DBs into the same snapshot +// builder would either collide on same-named keys or silently merge +// both DBs under db_0 on restore (DecodeOptions.RedisDBIndex +// defaults to 0). Failing closed preserves correctness until Phase 1 +// makes the native keys DB-aware (codex P2 v14 #904). +// +// v14 originally fanned out across db_ to address codex P1 v13's +// silent-data-loss concern; codex's v14 follow-up clarified that +// fan-out under the current key format produces mis-scoped output. +// The corrected fix replaces fan-out with fail-closed. +var ErrRedisEncodeMultiDBUnsupported = errors.New("backup: redis encoder requires single db_0 (multi-DB or non-zero DB not yet supported)") + // ErrEncodeAdapterData marks every error returned by an adapter // encoder (Redis / DynamoDB / S3 / SQS) so callers can distinguish // "the input tree contained content the encoder cannot translate" @@ -387,9 +403,20 @@ func enumerateRedisDBs(inRoot string) ([]int, error) { } var indices []int for _, ent := range entries { - if idx, ok := parseRedisDBDir(ent); ok { - indices = append(indices, idx) + idx, ok := parseRedisDBName(ent.Name()) + if !ok { + continue } + // Canonical db_ name; entry MUST be a directory. + // Silently skipping a regular file or symlink at + // redis/db_ would let a malformed dump publish a + // header-only/partial FSM (codex P2 v14 #904 L427). + if !ent.IsDir() { + return nil, errors.Wrapf(ErrRedisEncodeNotDir, + "redis/%s exists but is not a directory (mode=%s)", + ent.Name(), ent.Type()) + } + indices = append(indices, idx) } sort.Ints(indices) return indices, nil @@ -416,17 +443,19 @@ func checkRedisRoot(redisDir string) error { return nil } -// parseRedisDBDir returns (dbIndex, true) when ent names a canonical -// db_ directory (N is a non-negative decimal with no leading zeros). -// Non-matching entries return (0, false) so the caller can skip without -// erroring — they cannot have been produced by the canonical decoder. -// Reject non-canonical decimals so a hypothetical Phase 1 dumper cannot -// double-emit the same db under two distinct directory names. -func parseRedisDBDir(ent os.DirEntry) (int, bool) { - if !ent.IsDir() { - return 0, false - } - name := ent.Name() +// parseRedisDBName returns (dbIndex, true) when name matches the +// canonical db_ pattern (N is a non-negative decimal with no +// leading zeros). Non-matching names return (0, false) so the caller +// can skip them without erroring — they cannot have been produced by +// the canonical decoder. Reject non-canonical decimals so a +// hypothetical Phase 1 dumper cannot double-emit the same db under +// two distinct directory names. +// +// This is a pure name parser; the caller is responsible for +// validating the directory-entry shape (codex P2 v14 #904 L427 +// shifted the IsDir check to enumerateRedisDBs so a regular file +// at redis/db_ fails closed instead of being silently skipped). +func parseRedisDBName(name string) (int, bool) { if !strings.HasPrefix(name, redisDBDirPrefix) { return 0, false } @@ -438,20 +467,35 @@ func parseRedisDBDir(ent os.DirEntry) (int, bool) { return idx, true } -// encodeAllRedisDBs invokes NewRedisEncoder per discovered db_ -// directory in ascending index order. A missing redis/ directory is a -// no-op. Codex P1 v13 #904: replaces the prior hardcoded db_0 fan-out -// which would silently drop non-default DBs from any Phase 1 multi-DB -// dump. Per-DB errors are wrapped with the db index for traceability. +// encodeAllRedisDBs invokes NewRedisEncoder for redis/db_0/ when the +// input tree has exactly that DB (or no Redis content at all). A +// missing redis/ directory is a no-op. Any non-zero DB or the +// presence of multiple db_ directories fails closed with +// ErrRedisEncodeMultiDBUnsupported. +// +// Codex P1 v13 #904 originally asked for a per-DB fan-out to address +// the prior hardcoded db_0 dispatch silently dropping non-default +// DBs. Codex P2 v14 #904 (L452) clarified that fan-out under the +// current MVCC key prefixes (!redis|str|, !redis|hll|, !redis|ttl|, +// …, none of which carry a database component) would either collide +// on same-named keys across DBs or merge everything under db_0 at +// decode time. The corrected fix replaces the silent drop and the +// incorrect fan-out with a fail-closed sentinel until Phase 1 +// makes the native keys DB-aware. func encodeAllRedisDBs(b *snapshotBuilder, inRoot string) error { indices, err := enumerateRedisDBs(inRoot) if err != nil { return errors.Wrap(err, "redis encoder enumerate") } - for _, idx := range indices { - if err := NewRedisEncoder(inRoot, idx).Encode(b); err != nil { - return errors.Wrapf(err, "redis encoder db_%d", idx) - } + if len(indices) == 0 { + return nil + } + if len(indices) > 1 || indices[0] != 0 { + return errors.Wrapf(ErrRedisEncodeMultiDBUnsupported, + "redis encoder enumerated db indices %v", indices) + } + if err := NewRedisEncoder(inRoot, 0).Encode(b); err != nil { + return errors.Wrap(err, "redis encoder db_0") } return nil } diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index a13019a88..a063023c9 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -459,30 +459,20 @@ func TestEnumerateRedisDBsRedisIsRegularFile(t *testing.T) { } } -// TestEncodeSnapshotRedisMultiDB pins codex P1 v13 #904: the Redis -// fan-out in adapterRunners enumerates redis/db_/ and invokes the -// per-DB encoder for each. The fixture places a single string under -// redis/db_3/ ONLY (no db_0). Pre-fix, the encoder hardcoded -// NewRedisEncoder(root, 0) and produced a header-only .fsm for this -// input — silent data loss. Post-fix, the db_3 string is included. +// TestEncodeSnapshotRedisRejectsNonZeroDB pins codex P2 v14 #904 +// L452: the Redis MVCC key prefixes (!redis|str|, !redis|hll|, +// !redis|ttl|, …) carry no database component, so feeding a +// non-zero DB through the encoder would mis-scope the produced +// .fsm — same-named keys collide and a db_3-only self-test would +// decode under db_0. Until Phase 1 makes native keys DB-aware, +// non-zero-DB inputs MUST fail closed. // -// Assertion is content-free: compare encoded byte count against an -// empty-redis baseline. With multi-DB fan-out, the db_3 fixture -// produces MORE bytes than an empty tree. Without it, both encodes -// would produce identical header-only output. -func TestEncodeSnapshotRedisMultiDB(t *testing.T) { +// The fixture places a single string under redis/db_3/ ONLY. +// EncodeSnapshot must reject with ErrRedisEncodeMultiDBUnsupported +// and write no bytes. (v14 originally attempted to fan out per DB; +// codex's L452 follow-up established the correct fix is fail-closed.) +func TestEncodeSnapshotRedisRejectsNonZeroDB(t *testing.T) { t.Parallel() - emptyIn := t.TempDir() - var emptyBuf bytes.Buffer - emptyResult, err := EncodeSnapshot(EncodeOptions{ - InputRoot: emptyIn, - Adapters: AdapterSet{Redis: true}, - LastCommitTS: 1, - }, &emptyBuf) - if err != nil { - t.Fatalf("EncodeSnapshot empty: %v", err) - } - in := t.TempDir() encKey := EncodeSegment([]byte("k3")) db3Strings := filepath.Join(in, "redis", "db_3", "strings") @@ -493,17 +483,68 @@ func TestEncodeSnapshotRedisMultiDB(t *testing.T) { t.Fatalf("WriteFile db_3 string: %v", err) } var buf bytes.Buffer - result, err := EncodeSnapshot(EncodeOptions{ + _, err := EncodeSnapshot(EncodeOptions{ InputRoot: in, Adapters: AdapterSet{Redis: true}, LastCommitTS: 1, }, &buf) - if err != nil { - t.Fatalf("EncodeSnapshot db_3-only: %v", err) + if err == nil { + t.Fatalf("EncodeSnapshot accepted db_3-only Redis input; want ErrRedisEncodeMultiDBUnsupported") + } + if !errors.Is(err, ErrRedisEncodeMultiDBUnsupported) { + t.Errorf("err = %v, want errors.Is ErrRedisEncodeMultiDBUnsupported", err) + } + // Marked as adapter-data so the CLI routes it to exit-2. + if !errors.Is(err, ErrEncodeAdapterData) { + t.Errorf("err = %v, want errors.Is ErrEncodeAdapterData (mark from runAdapterEncoders)", err) + } +} + +// TestEncodeSnapshotRedisRejectsMultipleDBs pins the multi-DB case: +// redis/db_0 + redis/db_3 → ErrRedisEncodeMultiDBUnsupported (the +// fan-out would collide on same-named keys or merge both DBs under +// db_0 on restore; codex P2 v14 #904 L452). +func TestEncodeSnapshotRedisRejectsMultipleDBs(t *testing.T) { + t.Parallel() + in := t.TempDir() + for _, name := range []string{"db_0", "db_3"} { + if err := os.MkdirAll(filepath.Join(in, "redis", name, "strings"), 0o755); err != nil { + t.Fatalf("MkdirAll %s: %v", name, err) + } } - if result.BytesWritten <= emptyResult.BytesWritten { - t.Errorf("BytesWritten with redis/db_3 fixture (%d) <= empty (%d); pre-fix, hardcoded db_0 fan-out dropped db_3 silently", - result.BytesWritten, emptyResult.BytesWritten) + var buf bytes.Buffer + _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{Redis: true}, + LastCommitTS: 1, + }, &buf) + if err == nil { + t.Fatalf("EncodeSnapshot accepted db_0 + db_3; want ErrRedisEncodeMultiDBUnsupported") + } + if !errors.Is(err, ErrRedisEncodeMultiDBUnsupported) { + t.Errorf("err = %v, want errors.Is ErrRedisEncodeMultiDBUnsupported", err) + } +} + +// TestEnumerateRedisDBsRejectsNonDirDBEntry pins codex P2 v14 #904 +// L427: when a canonical db_ name resolves to a regular file +// (or symlink) instead of a directory, enumerateRedisDBs must fail +// closed with ErrRedisEncodeNotDir — silently skipping would let a +// malformed dump publish a header-only/partial FSM. +func TestEnumerateRedisDBsRejectsNonDirDBEntry(t *testing.T) { + t.Parallel() + in := t.TempDir() + if err := os.MkdirAll(filepath.Join(in, "redis"), 0o755); err != nil { + t.Fatalf("MkdirAll: %v", err) + } + // redis/db_2 is a regular file — name matches the canonical + // pattern but the entry shape is wrong. + if err := os.WriteFile(filepath.Join(in, "redis", "db_2"), []byte("not a dir"), 0o600); err != nil { + t.Fatalf("WriteFile: %v", err) + } + _, err := enumerateRedisDBs(in) + if !errors.Is(err, ErrRedisEncodeNotDir) { + t.Errorf("err = %v, want errors.Is ErrRedisEncodeNotDir", err) } } From 786448f15a1956ada7c618990cbc00851299c58b Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 15:22:03 +0900 Subject: [PATCH 17/35] backup: #904 v17 - add buf.Len() == 0 assertions to Redis rejection tests Claude v16 observation (non-blocking but worth folding in): the two v16 rejection tests TestEncodeSnapshotRedisRejectsNonZeroDB and TestEncodeSnapshotRedisRejectsMultipleDBs pass a *bytes.Buffer to EncodeSnapshot but never assert it stays empty on rejection. All five prior rejection tests consistently include the assertion: - TestEncodeSnapshotRejectsMissingInputRoot - TestEncodeSnapshotRejectsLowManifestFloor - TestEncodeSnapshotRejectsDynamoDBJSONLLayout - TestEncodeSnapshotRejectsZeroAdapterSet - TestEncodeSnapshotMarksAdapterDataErrors Adding the same `if buf.Len() != 0 { t.Errorf(...) }` block to both new tests keeps the convention consistent and pins the no-bytes-on-rejection invariant explicitly. No behavior change; tests + lint green. --- internal/backup/encode_snapshot_test.go | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index a063023c9..73f98018d 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -498,6 +498,9 @@ func TestEncodeSnapshotRedisRejectsNonZeroDB(t *testing.T) { if !errors.Is(err, ErrEncodeAdapterData) { t.Errorf("err = %v, want errors.Is ErrEncodeAdapterData (mark from runAdapterEncoders)", err) } + if buf.Len() != 0 { + t.Errorf("buf.Len = %d, want 0 (no bytes should be written on multi-DB rejection)", buf.Len()) + } } // TestEncodeSnapshotRedisRejectsMultipleDBs pins the multi-DB case: @@ -524,6 +527,9 @@ func TestEncodeSnapshotRedisRejectsMultipleDBs(t *testing.T) { if !errors.Is(err, ErrRedisEncodeMultiDBUnsupported) { t.Errorf("err = %v, want errors.Is ErrRedisEncodeMultiDBUnsupported", err) } + if buf.Len() != 0 { + t.Errorf("buf.Len = %d, want 0 (no bytes should be written on multi-DB rejection)", buf.Len()) + } } // TestEnumerateRedisDBsRejectsNonDirDBEntry pins codex P2 v14 #904 From 443e9b53cf48fcca2973b596416b5c8ce21cdea0 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 15:26:31 +0900 Subject: [PATCH 18/35] backup: #904 v18 - add ErrEncodeAdapterData assertion to multi-DB rejection test Claude v17 noted a pre-existing parity gap: TestEncodeSnapshotRedisRejectsMultipleDBs asserts errors.Is(ErrRedisEncodeMultiDBUnsupported) but not errors.Is(ErrEncodeAdapterData), while its sibling TestEncodeSnapshotRedisRejectsNonZeroDB asserts both. Both paths exit runAdapterEncoders through the same errors.Mark wrap so both sentinels should hold. Adding the missing assertion for parity. Test-only; no behavior change. Tests + lint green. --- internal/backup/encode_snapshot_test.go | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index 73f98018d..22b4d1fc8 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -527,6 +527,11 @@ func TestEncodeSnapshotRedisRejectsMultipleDBs(t *testing.T) { if !errors.Is(err, ErrRedisEncodeMultiDBUnsupported) { t.Errorf("err = %v, want errors.Is ErrRedisEncodeMultiDBUnsupported", err) } + // Marked as adapter-data so the CLI routes it to exit-2 (mirrors + // TestEncodeSnapshotRedisRejectsNonZeroDB; claude v17 parity). + if !errors.Is(err, ErrEncodeAdapterData) { + t.Errorf("err = %v, want errors.Is ErrEncodeAdapterData (mark from runAdapterEncoders)", err) + } if buf.Len() != 0 { t.Errorf("buf.Len = %d, want 0 (no bytes should be written on multi-DB rejection)", buf.Len()) } From f246ca9a56e2b5b90f0b5ad1717491140ab46eeb Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 15:43:24 +0900 Subject: [PATCH 19/35] backup: #904 v19 - two codex P2 v17 fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Codex P2 v17 (1/2): preserve sidecar on adapter-error path encodeOne pre-deleted .encode_info.json at the top of every run (just after manifest read + ts resolve, just before writeAndPublish). The intent was "clean stale sidecar so the latest run's contents land here," but this only works when writeSidecar actually runs. writeSidecar runs on success and on self-test mismatch. It does NOT run on adapter-encoder errors (publishErr != nil && !errSelfTestMismatch). On that path the prior .fsm is preserved (only the self-test-mismatch branch calls removeStaleOutputFSM), so wiping its sidecar leaves the prior restore artifact without its matching provenance metadata — an inconsistent on-disk state. Fix: drop the os.Remove(EncodeInfoSidecarPath(...)) call. writeSidecar already opens with O_CREATE|O_WRONLY|O_TRUNC, so the sidecar is atomically overwritten on every write path. When writeSidecar isn't called (adapter-error), the prior sidecar stays paired with the prior FSM that's also untouched. Pinned by TestCLIAdapterErrorPreservesPriorSidecar: pre-places a prior FSM + sidecar at the output path, drives an adapter-data failure (malformed dynamodb _schema.json), asserts both are byte-for-byte preserved after the run exits exitDataErr. ## Codex P2 v17 (2/2): require MANIFEST.json at InputRoot The doc on EncodeOptions.InputRoot has always required a MANIFEST.json, but validateEncodeOptions only checked the path was a directory. A direct library caller (Phase 1 in-process extractor, integration tests) pointing at an existing-but-wrong directory would pass validation; each enabled adapter no-ops on its missing top-level subdir, and EncodeSnapshot silently emits a header-only .fsm — the silent-data-loss pattern the encoder is supposed to fail closed against. The CLI hits this path naturally by opening MANIFEST.json before calling EncodeSnapshot, but the library validation layer needed the equivalent guard. Fix: new EncodeOptions.AllowMissingManifest bool (default false). When false, validateEncodeOptions stats /MANIFEST.json and fails-closed if missing. Synthetic test fixtures opt in to the legacy lax behavior with AllowMissingManifest=true. Pinned by: - TestEncodeSnapshotRequiresManifest: empty InputRoot directory (no manifest), AllowMissingManifest=false → error, buf.Len()==0. - TestEncodeSnapshotAllowMissingManifestOptOut: SQS fixture without MANIFEST.json, AllowMissingManifest=true → succeeds. ## validateEncodeOptions cyclop split Added input-root logic moved into checkInputRoot helper (InputRoot non-empty + exists-as-dir + MANIFEST.json present unless opted out). validateEncodeOptions stays under cyclop. checkInputRoot has its own godoc. ## Caller audit per CLAUDE.md semantic-change rule - encodeOne: dropped sidecar pre-cleanup. Sole caller is run(). Behavior change ONLY on the adapter-error path (sidecar preserved instead of wiped). Success path: sidecar still written + atomically overwrites prior. Self-test-mismatch path: sidecar still written. Manifest-floor / DDB-JSONL paths: fail BEFORE the cleanup so no on-disk effect either way. - validateEncodeOptions: now requires MANIFEST.json unless opted out. Sole production caller is EncodeSnapshot. CLI calls EncodeSnapshot only after readInputManifest succeeds, so MANIFEST.json is guaranteed to exist at the time of the check — no CLI regression. All 10 existing library test sites that pass synthetic fixtures now set AllowMissingManifest=true. - checkInputRoot: new package-private helper, only called from validateEncodeOptions. No external callers. Test fixture churn: 10 EncodeSnapshot call sites in encode_snapshot_test.go now include AllowMissingManifest=true. The field's default (false) preserves the strict contract for all non-test callers. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 14 ++- cmd/elastickv-snapshot-encode/main_test.go | 72 ++++++++++++++ internal/backup/encode_snapshot.go | 60 +++++++++--- internal/backup/encode_snapshot_test.go | 106 ++++++++++++++++----- 4 files changed, 213 insertions(+), 39 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 5c9d13c62..5638d0987 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -262,10 +262,16 @@ func encodeOne(cfg *config, logger *slog.Logger) error { mismatchPath := cfg.outputPath + ".mismatch.txt" _ = os.Remove(mismatchPath) // stale-mismatch cleanup (gemini medium v6 #896) - // Stale sidecar cleanup too: a self-test failure rewrites the - // sidecar with matched:false (codex P2 v6 #904); make sure the - // file always reflects the latest run, not a prior success. - _ = os.Remove(backup.EncodeInfoSidecarPath(cfg.outputPath)) + // Do NOT pre-clean the sidecar here. The sidecar describes the + // .fsm at cfg.outputPath; the .fsm is preserved when a run fails + // in the adapter encoders (non-self-test exit-2 path), so wiping + // its sidecar would leave the prior restore artifact without its + // matching provenance metadata (codex P2 v17 #904). writeSidecar + // uses O_CREATE|O_TRUNC, so the sidecar is atomically overwritten + // on success and on self-test mismatch (where the .fsm is also + // replaced or removed in lock-step). On adapter-encoder errors + // neither writeSidecar nor removeStaleOutputFSM runs; the prior + // .fsm + prior sidecar therefore stay paired. result, publishErr := writeAndPublish(cfg, encodeOpts, mismatchPath, logger) // Sidecar is written even on self-test mismatch so an operator diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index 7c389f7c8..9c5549853 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -304,6 +304,78 @@ func readSidecar(t *testing.T, output string) backup.EncodeInfo { return info } +// TestCLIAdapterErrorPreservesPriorSidecar pins codex P2 v17 #904: when +// a run gets past manifest/TS validation and then fails inside an +// adapter encoder (non-self-test exit-2), the prior .fsm is +// preserved (only the self-test mismatch path removes it), so the +// prior .encode_info.json must ALSO be preserved — wiping it +// while leaving the .fsm would orphan the restore artifact from its +// provenance metadata. The v17 fix drops the pre-encode sidecar +// cleanup; this test pins the resulting invariant end-to-end. +func TestCLIAdapterErrorPreservesPriorSidecar(t *testing.T) { + t.Parallel() + in, out, priorFSM, priorSidecar := setupAdapterErrorFixture(t) + code, err := run([]string{ + "--input", in, + "--output", out, + "--adapter", "dynamodb", + }, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want adapter-data rejection") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d", code, exitDataErr) + } + assertFilePreserved(t, out, priorFSM, "prior .fsm") + // Prior sidecar unchanged (codex P2 v17: the v17 fix drops the + // pre-encode sidecar cleanup so the sidecar+.fsm stay paired). + assertFilePreserved(t, out+".encode_info.json", priorSidecar, "prior sidecar") +} + +// setupAdapterErrorFixture builds a fixture for +// TestCLIAdapterErrorPreservesPriorSidecar: an InputRoot with a valid +// MANIFEST.json plus a malformed dynamodb _schema.json (empty +// table_name → ErrDDBEncodeInvalidSchema), and a pre-placed FSM + +// sidecar at the output path representing a hypothetical earlier +// successful run. Returns (inputRoot, outputPath, priorFSMBytes, +// priorSidecarBytes). +func setupAdapterErrorFixture(t *testing.T) (string, string, []byte, []byte) { + t.Helper() + in := t.TempDir() + emitMinimalManifest(t, in, 1000) + out := filepath.Join(t.TempDir(), "out.fsm") + priorFSM := []byte("PRIOR FSM BYTES") + priorSidecar := []byte(`{"format_version":1,"encoder_version":"prior","input_root":"x","output_fsm_path":"x"}`) + if err := os.WriteFile(out, priorFSM, 0o600); err != nil { + t.Fatalf("WriteFile prior fsm: %v", err) + } + if err := os.WriteFile(out+".encode_info.json", priorSidecar, 0o600); err != nil { + t.Fatalf("WriteFile prior sidecar: %v", err) + } + schemaDir := filepath.Join(in, "dynamodb", "tbl") + if err := os.MkdirAll(schemaDir, 0o755); err != nil { + t.Fatalf("MkdirAll: %v", err) + } + body := []byte(`{"format_version":1,"table_name":"","primary_key":{"hash_key":{"name":"id","type":"S"}}}`) + if err := os.WriteFile(filepath.Join(schemaDir, "_schema.json"), body, 0o600); err != nil { + t.Fatalf("WriteFile bad schema: %v", err) + } + return in, out, priorFSM, priorSidecar +} + +// assertFilePreserved asserts the named file is still present and its +// contents exactly match wantBody. label appears in error messages. +func assertFilePreserved(t *testing.T, path string, wantBody []byte, label string) { + t.Helper() + got, err := os.ReadFile(path) + if err != nil { + t.Fatalf("read %s at %s: %v", label, path, err) + } + if !bytes.Equal(got, wantBody) { + t.Errorf("%s mutated; codex P2 v17 expected adapter-error to preserve", label) + } +} + // TestCLISelfTestMismatchSkipsDirectoryAtOutputPath pins codex P2 v14 // #904: the self-test-mismatch cleanup must NOT delete an --output // path that resolves to a directory (or any non-regular file). The diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index 1946fd28c..7afd17357 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -135,6 +135,22 @@ type EncodeOptions struct { // the original decoder would have produced. SelfTestDecodeOptions DecodeOptions + // AllowMissingManifest opts out of the MANIFEST.json presence + // check in validateEncodeOptions. When false (default), + // EncodeSnapshot requires /MANIFEST.json to exist — + // the contract on InputRoot has always claimed this, but until + // codex P2 v17 #904 the library only checked the path was a + // directory, so a real library caller pointing at the wrong + // directory would silently emit a header-only .fsm (each enabled + // adapter no-ops when its top-level subdir is missing). + // + // Set to true for synthetic test fixtures that don't have a + // MANIFEST.json on disk. Production callers (CLI, Phase 1 + // in-process extractor) MUST leave this at false so a bad + // InputRoot surfaces an explicit error rather than a + // silent-empty .fsm. + AllowMissingManifest bool + // corruptBufferForTest is an unexported test-only hook that fires // against the on-disk self-test buffer AFTER snapshotBuilder.WriteTo // returns but BEFORE the self-test DecodeSnapshot call (when @@ -200,15 +216,31 @@ type EncodeResult struct { // Split out so EncodeSnapshot stays under the cyclop threshold; the // data-correctness checks live in validateEncodeOptionsData. func validateEncodeOptions(opts EncodeOptions, out io.Writer) error { + if err := checkInputRoot(opts); err != nil { + return err + } + if out == nil { + return errors.New("backup: EncodeSnapshot out writer is nil") + } + if !opts.Adapters.DynamoDB && !opts.Adapters.S3 && !opts.Adapters.Redis && !opts.Adapters.SQS { + // Zero AdapterSet would silently produce a header-only .fsm — + // a "successful" empty restore artifact (codex v5 + claude v5 #904). + return errors.New("backup: EncodeOptions.Adapters has no enabled adapter") + } + return validateEncodeOptionsData(opts) +} + +// checkInputRoot validates InputRoot's path-level invariants: present, +// existing as a directory, and (unless AllowMissingManifest) contains +// a MANIFEST.json. Split out of validateEncodeOptions to keep cyclop +// happy and to make the three failure modes inspectable in isolation. +func checkInputRoot(opts EncodeOptions) error { if opts.InputRoot == "" { return errors.New("backup: EncodeOptions.InputRoot is required") } // Stat the path so a typo'd or deleted directory surfaces here // rather than fan-out-no-op'ing every adapter and producing a - // header-only .fsm (codex P2 v8 #904). CLI callers indirectly - // catch this via os.Open(MANIFEST.json) before EncodeSnapshot, - // but a library caller that passes a stale path needs the guard - // at this layer. + // header-only .fsm (codex P2 v8 #904). info, statErr := os.Stat(opts.InputRoot) if statErr != nil { return errors.Wrapf(statErr, "stat InputRoot %q", opts.InputRoot) @@ -216,15 +248,19 @@ func validateEncodeOptions(opts EncodeOptions, out io.Writer) error { if !info.IsDir() { return errors.Errorf("backup: InputRoot %q is not a directory", opts.InputRoot) } - if out == nil { - return errors.New("backup: EncodeSnapshot out writer is nil") - } - if !opts.Adapters.DynamoDB && !opts.Adapters.S3 && !opts.Adapters.Redis && !opts.Adapters.SQS { - // Zero AdapterSet would silently produce a header-only .fsm — - // a "successful" empty restore artifact (codex v5 + claude v5 #904). - return errors.New("backup: EncodeOptions.Adapters has no enabled adapter") + // Require MANIFEST.json at InputRoot unless the caller has + // explicitly opted out (codex P2 v17 #904). The doc on InputRoot + // has always required it; this guard catches a library caller + // pointing at an existing-but-wrong directory whose adapter + // subdirs are all absent — without this check the call would + // silently succeed and publish a header-only .fsm. + if !opts.AllowMissingManifest { + manifestPath := filepath.Join(opts.InputRoot, "MANIFEST.json") + if _, mstat := os.Stat(manifestPath); mstat != nil { + return errors.Wrapf(mstat, "stat MANIFEST.json under InputRoot %q (set EncodeOptions.AllowMissingManifest=true for synthetic fixtures)", opts.InputRoot) + } } - return validateEncodeOptionsData(opts) + return nil } // validateEncodeOptionsData covers the data-correctness pre-conditions: diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index 22b4d1fc8..4ec83817e 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -30,9 +30,10 @@ func TestEncodeSnapshotLibraryRoundTrip(t *testing.T) { var buf bytes.Buffer result, err := EncodeSnapshot(EncodeOptions{ - InputRoot: in, - Adapters: AdapterSet{SQS: true}, - LastCommitTS: 0xDEADBEEF, + InputRoot: in, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 0xDEADBEEF, + AllowMissingManifest: true, }, &buf) if err != nil { t.Fatalf("EncodeSnapshot: %v", err) @@ -83,9 +84,10 @@ func TestEncodeSnapshotSelfTestMatchesInput(t *testing.T) { canonicalIn := t.TempDir() var canonicalBuf bytes.Buffer if _, err := EncodeSnapshot(EncodeOptions{ - InputRoot: rawIn, - Adapters: AdapterSet{SQS: true}, - LastCommitTS: 0xCAFE, + InputRoot: rawIn, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 0xCAFE, + AllowMissingManifest: true, }, &canonicalBuf); err != nil { t.Fatalf("canonical encode: %v", err) } @@ -107,6 +109,7 @@ func TestEncodeSnapshotSelfTestMatchesInput(t *testing.T) { OutRoot: scratchBase, Adapters: AdapterSet{SQS: true}, }, + AllowMissingManifest: true, }, &buf) if err != nil { t.Fatalf("EncodeSnapshot: %v", err) @@ -180,6 +183,7 @@ func TestEncodeSnapshotSelfTestDetectsCorruption(t *testing.T) { Adapters: AdapterSet{SQS: true}, }, corruptBufferForTest: corrupt, + AllowMissingManifest: true, }, &out) if err != nil { t.Fatalf("EncodeSnapshot: %v", err) @@ -259,6 +263,55 @@ func TestEncodeSnapshotRejectsMissingInputRoot(t *testing.T) { }) } +// TestEncodeSnapshotRequiresManifest pins codex P2 v17 #904: a library +// caller pointing at an existing-but-wrong directory (no +// MANIFEST.json) must fail closed with an error referencing +// MANIFEST.json, NOT silently emit a header-only .fsm. The CLI hits +// this path naturally by opening MANIFEST.json first; the library +// validation layer needs the equivalent guard. +func TestEncodeSnapshotRequiresManifest(t *testing.T) { + t.Parallel() + in := t.TempDir() // exists, is a directory, but contains no MANIFEST.json + var buf bytes.Buffer + _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 1, + // AllowMissingManifest: false (default) — must require manifest + }, &buf) + if err == nil { + t.Fatalf("EncodeSnapshot with missing MANIFEST.json succeeded; want error") + } + if buf.Len() != 0 { + t.Errorf("buf.Len = %d, want 0 (no bytes should be written when MANIFEST.json is missing)", buf.Len()) + } +} + +// TestEncodeSnapshotAllowMissingManifestOptOut pins that the +// AllowMissingManifest opt-out works for synthetic test fixtures. +// Mirrors the ManifestLastCommitTS=0 opt-out pattern from codex P2 v2. +func TestEncodeSnapshotAllowMissingManifestOptOut(t *testing.T) { + t.Parallel() + in := t.TempDir() + const queue = "manifest-opt-out" + writeSQSQueue(t, in, queue, + []byte(`{"format_version":1,"name":"manifest-opt-out","fifo":false,"partition_count":1,"generation":1}`), + [][]byte{ + []byte(`{"format_version":1,"message_id":"m1","body":"a","send_timestamp_millis":1700000000000,"available_at_millis":1700000000000,"sequence_number":0}`), + }, + ) + var buf bytes.Buffer + _, err := EncodeSnapshot(EncodeOptions{ + InputRoot: in, + Adapters: AdapterSet{SQS: true}, + LastCommitTS: 1, + AllowMissingManifest: true, + }, &buf) + if err != nil { + t.Fatalf("EncodeSnapshot with AllowMissingManifest=true failed: %v", err) + } +} + // TestEncodeSnapshotRejectsLowManifestFloor pins codex P2 v2: the // library-level HLC floor check fails-closed when opts.LastCommitTS // is below opts.ManifestLastCommitTS. Defense-in-depth for the CLI's @@ -273,6 +326,7 @@ func TestEncodeSnapshotRejectsLowManifestFloor(t *testing.T) { Adapters: AdapterSet{SQS: true}, LastCommitTS: 500, ManifestLastCommitTS: 1000, // floor; LastCommitTS is below + AllowMissingManifest: true, }, &buf) if err == nil { t.Fatalf("EncodeSnapshot with LastCommitTS < ManifestLastCommitTS succeeded; want error") @@ -305,6 +359,7 @@ func TestEncodeSnapshotManifestFloorOptOut(t *testing.T) { Adapters: AdapterSet{SQS: true}, LastCommitTS: 500, ManifestLastCommitTS: 0, // opt-out + AllowMissingManifest: true, }, &buf) if err != nil { t.Fatalf("EncodeSnapshot with opt-out floor failed: %v", err) @@ -323,10 +378,11 @@ func TestEncodeSnapshotRejectsDynamoDBJSONLLayout(t *testing.T) { in := t.TempDir() var buf bytes.Buffer _, err := EncodeSnapshot(EncodeOptions{ - InputRoot: in, - Adapters: AdapterSet{DynamoDB: true}, - LastCommitTS: 1, - DynamoDBBundleJSONL: true, + InputRoot: in, + Adapters: AdapterSet{DynamoDB: true}, + LastCommitTS: 1, + DynamoDBBundleJSONL: true, + AllowMissingManifest: true, }, &buf) if err == nil { t.Fatalf("EncodeSnapshot with DynamoDBBundleJSONL accepted; want error") @@ -356,10 +412,11 @@ func TestEncodeSnapshotJSONLOnlyRejectedWhenDDBEnabled(t *testing.T) { ) var buf bytes.Buffer _, err := EncodeSnapshot(EncodeOptions{ - InputRoot: in, - Adapters: AdapterSet{SQS: true}, // DDB NOT in scope - LastCommitTS: 1, - DynamoDBBundleJSONL: true, // would be rejected if DDB were enabled + InputRoot: in, + Adapters: AdapterSet{SQS: true}, // DDB NOT in scope + LastCommitTS: 1, + DynamoDBBundleJSONL: true, // would be rejected if DDB were enabled + AllowMissingManifest: true, }, &buf) if err != nil { t.Fatalf("EncodeSnapshot rejected JSONL flag when DDB not in scope: %v", err) @@ -484,9 +541,10 @@ func TestEncodeSnapshotRedisRejectsNonZeroDB(t *testing.T) { } var buf bytes.Buffer _, err := EncodeSnapshot(EncodeOptions{ - InputRoot: in, - Adapters: AdapterSet{Redis: true}, - LastCommitTS: 1, + InputRoot: in, + Adapters: AdapterSet{Redis: true}, + LastCommitTS: 1, + AllowMissingManifest: true, }, &buf) if err == nil { t.Fatalf("EncodeSnapshot accepted db_3-only Redis input; want ErrRedisEncodeMultiDBUnsupported") @@ -517,9 +575,10 @@ func TestEncodeSnapshotRedisRejectsMultipleDBs(t *testing.T) { } var buf bytes.Buffer _, err := EncodeSnapshot(EncodeOptions{ - InputRoot: in, - Adapters: AdapterSet{Redis: true}, - LastCommitTS: 1, + InputRoot: in, + Adapters: AdapterSet{Redis: true}, + LastCommitTS: 1, + AllowMissingManifest: true, }, &buf) if err == nil { t.Fatalf("EncodeSnapshot accepted db_0 + db_3; want ErrRedisEncodeMultiDBUnsupported") @@ -577,9 +636,10 @@ func TestEncodeSnapshotMarksAdapterDataErrors(t *testing.T) { []byte(`{"format_version":1,"table_name":"","primary_key":{"hash_key":{"name":"id","type":"S"}}}`)) var buf bytes.Buffer _, err := EncodeSnapshot(EncodeOptions{ - InputRoot: in, - Adapters: AdapterSet{DynamoDB: true}, - LastCommitTS: 1, + InputRoot: in, + Adapters: AdapterSet{DynamoDB: true}, + LastCommitTS: 1, + AllowMissingManifest: true, }, &buf) if err == nil { t.Fatalf("EncodeSnapshot with malformed schema succeeded; want error") From 31e35ba895850870609c7bb602eebe4d8dd123ac Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 15:50:03 +0900 Subject: [PATCH 20/35] backup: #904 v20 - fix TestEncodeSnapshotRejectsZeroAdapterSet to pin the right guard Claude v19 caught: the v19 MANIFEST.json gate (added to checkInputRoot) moved this test's failure point from the zero-adapter guard (the invariant the test name and comment claim it pins) to the MANIFEST.json-missing guard. The test still passed both assertions (non-nil error, buf.Len()==0) but for the wrong reason. Adding AllowMissingManifest: true so checkInputRoot waves through the missing manifest and validateEncodeOptions proceeds to the zero-adapter check that's the actual subject of the test. The other 10 synthetic-fixture EncodeSnapshot callers in v19 already got this flag; this one was missed because its rejection assertions were satisfied by the new guard, so the failure wasn't a hard error. Test-only; no behavior change. Tests + lint green. --- internal/backup/encode_snapshot_test.go | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index 4ec83817e..6cd20b99a 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -436,6 +436,13 @@ func TestEncodeSnapshotRejectsZeroAdapterSet(t *testing.T) { InputRoot: in, Adapters: AdapterSet{}, // explicit zero LastCommitTS: 1, + // AllowMissingManifest: true bypasses the v19 MANIFEST.json + // guard so this test actually exercises the zero-adapter + // guard further down in validateEncodeOptions — without this, + // checkInputRoot would error on the missing MANIFEST.json + // first and the assertion would pin the wrong invariant + // (claude v19 #904). + AllowMissingManifest: true, }, &buf) if err == nil { t.Fatalf("EncodeSnapshot with empty AdapterSet succeeded; want error") From c5b68c7ab82d9deb866e2a031c36f1650c56fd21 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 16:02:23 +0900 Subject: [PATCH 21/35] backup: #904 v21 - codex P2 v19: also remove symlinked --output on mismatch MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Codex P2 v19: symlink at --output silently bypassed mismatch cleanup v15 added removeStaleOutputFSM with an IsRegular()-only check (codex P2 v14 L347 caught the directory case where unconditional os.Remove would have wiped an empty directory the operator passed in error). The IsRegular() check was too restrictive: a symlink at --output pointing at a previous .fsm file is a valid operator pattern, and the v15 logic silently skipped that case. On self-test mismatch with --output as a symlink: - writeAndPublish writes mismatch.txt + (via encodeOne) the fresh sidecar (matched=false, NEW SHA). - removeStaleOutputFSM logs "skip ... not a regular file" and returns. - --output continues to resolve through the symlink to the prior valid .fsm bytes. - Net: sidecar says "this encode failed" but --output still resolves to a usable .fsm — violating the "no restore-visible FSM after self-test mismatch" contract. ## Fix removeStaleOutputFSM now removes both regular files AND symlinks. os.Remove on a symlink operates on the link (not the resolved target), so: - Regular file at --output: file is deleted; --output → ENOENT. - Symlink at --output: link is unlinked; --output → ENOENT; target file is preserved as a side effect. - Directory / device / FIFO / socket at --output: skipped (matches v14 L347 behavior; those shapes were never valid restore targets and os.Remove on them could be destructive). ## Pinned by TestCLISelfTestMismatchRemovesSymlinkOutputButPreservesTarget (skipped on Windows where symlink semantics differ): - Pre-creates a real .fsm file at a separate path with a sentinel byte string. - Creates a symlink at --output pointing at that file. - Drives a real self-test mismatch via the corruption seam. - Asserts publishErr == errSelfTestMismatch. - Asserts os.Lstat(--output) returns ENOENT (the link is gone). - Asserts the target file's bytes are byte-equal to the pre-write payload (the target survives — os.Remove operated on the link). ## Lint refactors required by the new test - canonicalRoundTripTS const: canonicalizeInput's lastCommitTS parameter always received 7000 from all 4 call sites (unparam). Inlined as const canonicalRoundTripTS uint64 = 7000; signature dropped to (t, rawIn). - flipBytesPastHeaderInTempCorruptHook helper: three CLI mismatch tests now share one body of the corruption hook (was inline in each). Mirrors the library's flipBytesPastHeaderHelper. Refactor brings TestCLISelfTestMismatchRemovesSymlinkOutputButPreservesTarget under the cyclop bound (was 13 / max 10). ## Caller audit per CLAUDE.md semantic-change rule - removeStaleOutputFSM: sole caller is writeAndPublish's self-test mismatch branch. Behavior change ONLY for symlink-at-output: previously skipped → now removes the link. Regular-file and non-regular-non-symlink cases unchanged. Mismatch error propagation unchanged. - canonicalizeInput: 4 callers, all in main_test.go, all passed 7000. Signature change is mechanical; all call sites updated in lock-step. - flipBytesPastHeaderInTempCorruptHook: new test-only helper, no production callers. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 30 ++-- cmd/elastickv-snapshot-encode/main_test.go | 161 +++++++++++++++------ 2 files changed, 136 insertions(+), 55 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 5638d0987..8a76c06dd 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -331,12 +331,22 @@ func buildEncodeOptions(cfg *config, effectiveTS uint64, manifest backup.Manifes return encodeOpts } -// removeStaleOutputFSM removes outputPath ONLY if it exists and is a -// regular file. A directory or special-file at the path is left alone -// (codex P2 v14 #904 — the prior unconditional os.Remove would have -// deleted an empty directory the operator passed in error to --output). -// Errors other than ErrNotExist are downgraded to warn-and-continue so -// the caller's primary mismatch error remains the dominant signal. +// removeStaleOutputFSM removes outputPath ONLY when it exists as a +// regular file or a symlink. Both shapes satisfy the "no +// restore-visible FSM after self-test mismatch" contract: removing a +// regular file empties --output; removing a symlink unlinks the +// name (the target is preserved as a side effect — os.Remove on a +// symlink operates on the link, not the resolved target). A directory, +// device, FIFO, or socket at --output is left alone — those shapes +// were never valid restore targets, and os.Remove on a non-empty +// directory or device would be destructive in ways the mismatch +// contract does not require (codex P2 v14 #904 caught the directory +// case; codex P2 v19 #904 caught the symlink case where the prior +// IsRegular()-only check silently left the symlink resolving to the +// stale snapshot). +// +// Errors other than ErrNotExist are downgraded to warn-and-continue +// so the caller's primary mismatch error remains the dominant signal. func removeStaleOutputFSM(outputPath string, logger *slog.Logger) { info, err := os.Lstat(outputPath) if err != nil { @@ -345,9 +355,11 @@ func removeStaleOutputFSM(outputPath string, logger *slog.Logger) { } return } - if !info.Mode().IsRegular() { - logger.Warn("skip stale .fsm cleanup: --output is not a regular file", - "path", outputPath, "mode", info.Mode()) + mode := info.Mode() + isSymlink := mode&os.ModeSymlink != 0 + if !mode.IsRegular() && !isSymlink { + logger.Warn("skip stale .fsm cleanup: --output is not a regular file or symlink", + "path", outputPath, "mode", mode) return } if rerr := os.Remove(outputPath); rerr != nil && !errors.Is(rerr, os.ErrNotExist) { diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index 9c5549853..3feddae73 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -267,7 +267,12 @@ func writeSQSFixture(t *testing.T, root string) { // encoder's output shape. Subsequent self-tests against the canonical // tree are byte-equal (any non-canonical formatting differences are // flattened by this first pass). -func canonicalizeInput(t *testing.T, rawIn string, lastCommitTS uint64) string { +// canonicalRoundTripTS is the fixed last_commit_ts used by every +// canonicalizeInput call site. Kept as a const so a future test that +// wants a different value can lift it back into a parameter. +const canonicalRoundTripTS uint64 = 7000 + +func canonicalizeInput(t *testing.T, rawIn string) string { t.Helper() canonicalIn := t.TempDir() tmpOut := filepath.Join(t.TempDir(), "canonical.fsm") @@ -286,10 +291,39 @@ func canonicalizeInput(t *testing.T, rawIn string, lastCommitTS uint64) string { }); err != nil { t.Fatalf("canonical decode: %v", err) } - emitMinimalManifest(t, canonicalIn, lastCommitTS) + emitMinimalManifest(t, canonicalIn, canonicalRoundTripTS) return canonicalIn } +// flipBytesPastHeaderInTempCorruptHook returns a corrupt-buffer hook +// that flips one byte every 13 starting at offset 200 in the on-disk +// self-test buffer — the same pattern as the library's +// flipBytesPastHeaderHelper. Extracted so the three CLI mismatch +// tests share one body rather than each open-coding the same loop. +func flipBytesPastHeaderInTempCorruptHook(t *testing.T) func(*os.File) { + t.Helper() + return func(f *os.File) { + info, ferr := f.Stat() + if ferr != nil { + t.Fatalf("temp Stat: %v", ferr) + } + const headerSkip = 200 + if info.Size() <= headerSkip { + t.Fatalf("temp file too small to corrupt past header: %d bytes", info.Size()) + } + buf := make([]byte, info.Size()-headerSkip) + if _, rerr := f.ReadAt(buf, headerSkip); rerr != nil { + t.Fatalf("ReadAt: %v", rerr) + } + for i := 0; i < len(buf); i += 13 { + buf[i] ^= 0xFF + } + if _, werr := f.WriteAt(buf, headerSkip); werr != nil { + t.Fatalf("WriteAt: %v", werr) + } + } +} + // readSidecar reads .encode_info.json into an EncodeInfo struct. func readSidecar(t *testing.T, output string) backup.EncodeInfo { t.Helper() @@ -392,7 +426,7 @@ func TestCLISelfTestMismatchSkipsDirectoryAtOutputPath(t *testing.T) { rawIn := t.TempDir() writeSQSFixture(t, rawIn) emitMinimalManifest(t, rawIn, 7000) - canonicalIn := canonicalizeInput(t, rawIn, 7000) + canonicalIn := canonicalizeInput(t, rawIn) out := filepath.Join(t.TempDir(), "out.fsm") // Pre-create a directory at the --output path — an operator @@ -413,26 +447,7 @@ func TestCLISelfTestMismatchSkipsDirectoryAtOutputPath(t *testing.T) { Adapters: backup.AdapterSet{SQS: true}, }, } - encodeOpts.SetSelfTestCorruptHookForTest(func(f *os.File) { - info, ferr := f.Stat() - if ferr != nil { - t.Fatalf("temp Stat: %v", ferr) - } - const headerSkip = 200 - if info.Size() <= headerSkip { - t.Fatalf("temp file too small to corrupt: %d", info.Size()) - } - buf := make([]byte, info.Size()-headerSkip) - if _, rerr := f.ReadAt(buf, headerSkip); rerr != nil { - t.Fatalf("ReadAt: %v", rerr) - } - for i := 0; i < len(buf); i += 13 { - buf[i] ^= 0xFF - } - if _, werr := f.WriteAt(buf, headerSkip); werr != nil { - t.Fatalf("WriteAt: %v", werr) - } - }) + encodeOpts.SetSelfTestCorruptHookForTest(flipBytesPastHeaderInTempCorruptHook(t)) cfg := &config{ inputPath: canonicalIn, @@ -496,6 +511,80 @@ func TestCLIInvalidManifestExitsTwo(t *testing.T) { }) } +// TestCLISelfTestMismatchRemovesSymlinkOutputButPreservesTarget pins +// codex P2 v19 #904: when --output is a symlink to a prior .fsm file +// and the new --self-test invocation mismatches, the cleanup must +// unlink the symlink (so the restore-visible --output path now +// resolves to ENOENT, matching the mismatch contract) while leaving +// the underlying target file intact (os.Remove on a symlink operates +// on the link, not the resolved target). +// +// Before v21 the IsRegular()-only check silently skipped the symlink +// cleanup; the new sidecar (matched=false) then described a fresh +// failed encode while --output still resolved to a prior valid .fsm, +// breaking the "no restore-visible FSM after self-test mismatch" +// invariant. Linux-only because Windows symlink semantics differ. +func TestCLISelfTestMismatchRemovesSymlinkOutputButPreservesTarget(t *testing.T) { + if isWindows { + t.Skip("symlink semantics differ on Windows") + } + t.Parallel() + rawIn := t.TempDir() + writeSQSFixture(t, rawIn) + emitMinimalManifest(t, rawIn, 7000) + canonicalIn := canonicalizeInput(t, rawIn) + + targetDir := t.TempDir() + target := filepath.Join(targetDir, "real.fsm") + const targetBody = "TARGET FSM BYTES (must survive symlink removal)" + if err := os.WriteFile(target, []byte(targetBody), 0o600); err != nil { + t.Fatalf("WriteFile target: %v", err) + } + out := filepath.Join(t.TempDir(), "out.fsm") + if err := os.Symlink(target, out); err != nil { + t.Fatalf("Symlink: %v", err) + } + + scratchBase := t.TempDir() + encodeOpts := backup.EncodeOptions{ + InputRoot: canonicalIn, + Adapters: backup.AdapterSet{SQS: true}, + LastCommitTS: 7000, + ManifestLastCommitTS: 7000, + SelfTest: true, + SelfTestDecodeOptions: backup.DecodeOptions{ + OutRoot: scratchBase, + Adapters: backup.AdapterSet{SQS: true}, + }, + } + encodeOpts.SetSelfTestCorruptHookForTest(flipBytesPastHeaderInTempCorruptHook(t)) + + cfg := &config{ + inputPath: canonicalIn, + outputPath: out, + adapters: backup.AdapterSet{SQS: true}, + selfTest: true, + } + mismatchPath := out + ".mismatch.txt" + + _, publishErr := writeAndPublish(cfg, encodeOpts, mismatchPath, quietLogger()) + if !errors.Is(publishErr, errSelfTestMismatch) { + t.Fatalf("publishErr = %v, want errSelfTestMismatch", publishErr) + } + // --output (the symlink) must now resolve to ENOENT. + if _, statErr := os.Lstat(out); !os.IsNotExist(statErr) { + t.Errorf("symlink at --output not removed after mismatch (codex P2 v19 regression)") + } + // The target file (which the symlink pointed to) must survive. + gotTarget, rerr := os.ReadFile(target) + if rerr != nil { + t.Fatalf("target file vanished (os.Remove operated on resolved target instead of symlink): %v", rerr) + } + if string(gotTarget) != targetBody { + t.Errorf("target body mutated; want preserved") + } +} + // TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch pins codex // P2 v10 #904: when a prior successful run left an .fsm on // disk and a new --self-test invocation produces a mismatch, @@ -514,7 +603,7 @@ func TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch(t *testing.T) { rawIn := t.TempDir() writeSQSFixture(t, rawIn) emitMinimalManifest(t, rawIn, 7000) - canonicalIn := canonicalizeInput(t, rawIn, 7000) + canonicalIn := canonicalizeInput(t, rawIn) out := filepath.Join(t.TempDir(), "out.fsm") // Pre-place a stale .fsm — what a prior successful run would have @@ -539,27 +628,7 @@ func TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch(t *testing.T) { } // Flip bytes past the EKVPBBL1 header so the re-decode trips on // a malformed entry length and the self-test returns matched=false. - // Pattern mirrors flipBytesPastHeaderHelper in the library test. - encodeOpts.SetSelfTestCorruptHookForTest(func(f *os.File) { - info, ferr := f.Stat() - if ferr != nil { - t.Fatalf("temp Stat: %v", ferr) - } - const headerSkip = 200 - if info.Size() <= headerSkip { - t.Fatalf("temp file too small to corrupt past header: %d bytes", info.Size()) - } - buf := make([]byte, info.Size()-headerSkip) - if _, rerr := f.ReadAt(buf, headerSkip); rerr != nil { - t.Fatalf("ReadAt: %v", rerr) - } - for i := 0; i < len(buf); i += 13 { - buf[i] ^= 0xFF - } - if _, werr := f.WriteAt(buf, headerSkip); werr != nil { - t.Fatalf("WriteAt: %v", werr) - } - }) + encodeOpts.SetSelfTestCorruptHookForTest(flipBytesPastHeaderInTempCorruptHook(t)) cfg := &config{ inputPath: canonicalIn, @@ -628,7 +697,7 @@ func TestCLIRoundTripSelfTestAllAdapters(t *testing.T) { rawIn := t.TempDir() writeSQSFixture(t, rawIn) emitMinimalManifest(t, rawIn, 7000) - canonicalIn := canonicalizeInput(t, rawIn, 7000) + canonicalIn := canonicalizeInput(t, rawIn) out := filepath.Join(t.TempDir(), "out.fsm") code, err := run([]string{ From 3e906dc25ac17c5858fda062aa731b856dd18c70 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 16:08:47 +0900 Subject: [PATCH 22/35] backup: #904 v22 - writeAndPublish godoc: mention symlink + decision-matrix pointer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Claude v21 non-blocking observation: the writeAndPublish summary godoc described "removes any stale .fsm" without mentioning the symlink case (codex P2 v19's surface), and without pointing readers at the per-shape decision matrix that lives in removeStaleOutputFSM's own godoc. Doc-only change: expand the summary to cover the four cases (regular file → deleted; symlink → unlinked; directory + special files → left alone) with the codex citations, and end with a pointer to removeStaleOutputFSM. No behavior change; lint green. --- cmd/elastickv-snapshot-encode/main.go | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 8a76c06dd..65272d571 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -370,9 +370,12 @@ func removeStaleOutputFSM(outputPath string, logger *slog.Logger) { // writeAndPublish writes the .fsm to a temp path, runs the optional // self-test via EncodeSnapshot, and renames temp → output on success. // On self-test failure: writes mismatch.txt, removes any stale -// .fsm left by a prior successful run (codex P2 v10 #904), -// removes the temp file via the deferred cleanup, returns -// errSelfTestMismatch. +// .fsm or symlink at left by a prior successful +// run (codex P2 v10 #904 covered regular files, codex P2 v19 #904 +// extended to symlinks; directories and special files are left +// alone per v14 L347), removes the temp file via the deferred +// cleanup, returns errSelfTestMismatch. See removeStaleOutputFSM +// for the per-shape decision matrix. func writeAndPublish(cfg *config, encodeOpts backup.EncodeOptions, mismatchPath string, logger *slog.Logger) (backup.EncodeResult, error) { tempPath, err := tempOutputPath(cfg.outputPath) if err != nil { From 22e4c0aa1003f61acbe779588a8662373de25655 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 16:20:52 +0900 Subject: [PATCH 23/35] backup: #904 v23 - fail closed on three unsupported manifest exclusions MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex P2 v21 flagged two manifest-derived flags the per-adapter encoders silently ignore today. The symmetric fix is the same as the DynamoDB JSONL guard from v8: fail closed at the validation layer before any bytes reach the snapshot builder. ## Codex P2 v21 L326: S3 include_incomplete_uploads + include_orphans The S3 reverse encoder walks the canonical objects layout and skips top-level _incomplete_uploads/ and _orphans/ payload directories (internal/backup/encode_s3_objects.go). A producer that ran with --include-incomplete-uploads or --include-orphans intentionally dumped those payloads; the encoder would publish an .fsm missing them — silent data loss. ## Codex P2 v21 L473: SQS preserve_sqs_visibility The SQS reverse encoder unconditionally zeroes the in-flight visibility fields (VisibleAtMillis, receive count, first receive, receipt token) on every restored message (internal/backup/encode_sqs.go). A producer that ran with --preserve-sqs-visibility intentionally preserved that in-flight state; the encoder would publish an .fsm that restores as "every message visible/reset" — partial data loss. ## Fix Three new sentinels + three new EncodeOptions fields + one new validation helper: - ErrEncodeUnsupportedS3IncompleteUploads - ErrEncodeUnsupportedS3Orphans - ErrEncodeUnsupportedSQSPreserveVisibility - EncodeOptions.S3IncludeIncompleteUploads / S3IncludeOrphans / PreserveSQSVisibility (bools, default false) - validateEncodeOptionsUnsupportedFeatures: each guard fires only when the corresponding adapter is enabled (S3 → first two, SQS → third). Orthogonal callers are unaffected. CLI's buildEncodeOptions threads the three manifest.Exclusions fields into the new EncodeOptions. classifyEncodeError adds the three sentinels to the exit-2 set, so the CLI surfaces them as data-correctness failures (runbooks branch on exit code to quarantine bad dumps). Until the per-adapter encoders learn these features (future milestone), this is fail-closed quarantine, not a regression. ## Pinned by - TestEncodeSnapshotRejectsUnsupportedFeatures (library, 3 subtests): each flag + matching adapter → its sentinel; buf stays empty. - TestEncodeSnapshotUnsupportedFeaturesGatedByAdapter (library, 3 subtests): each flag + a different adapter enabled → succeeds (the guard MUST not over-fire on out-of-scope callers, mirrors the JSONL "DDB not in scope" exemption). - TestCLIRejectsUnsupportedManifestExclusions (CLI, 3 subtests): manifest with the flag set + matching --adapter → exit-2, no .fsm published. ## Caller audit per CLAUDE.md semantic-change rule - EncodeOptions: gained three new bool fields, all default false. Production callers (CLI) set them from the manifest. Existing library tests pass EncodeOptions{} or with explicit fields, all defaulting to false. No legitimate caller is impacted. - validateEncodeOptionsData: extended to call validateEncodeOptionsUnsupportedFeatures at the end (split for cyclop). Sole caller is validateEncodeOptions; behavior unchanged on the existing paths (HLC floor + DDB JSONL still fire first). - classifyEncodeError: three new sentinels join the exit-2 case arm. exit-1 set is strictly a subset of the prior set. - buildEncodeOptions: now reads manifest.Exclusions when non-nil. Existing call site (encodeOne) unchanged; field plumbing is additive. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 14 +++ cmd/elastickv-snapshot-encode/main_test.go | 67 +++++++++++++ internal/backup/encode_snapshot.go | 82 +++++++++++++++ internal/backup/encode_snapshot_test.go | 111 +++++++++++++++++++++ 4 files changed, 274 insertions(+) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 65272d571..0cea69299 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -107,6 +107,9 @@ func classifyEncodeError(err error) int { switch { case errors.Is(err, backup.ErrSelfTestLowerLastCommitTS), errors.Is(err, backup.ErrEncodeUnsupportedDynamoDBLayout), + errors.Is(err, backup.ErrEncodeUnsupportedS3IncompleteUploads), + errors.Is(err, backup.ErrEncodeUnsupportedS3Orphans), + errors.Is(err, backup.ErrEncodeUnsupportedSQSPreserveVisibility), errors.Is(err, backup.ErrEncodeAdapterData), errors.Is(err, errSelfTestMismatch), errors.Is(err, backup.ErrInvalidManifest), @@ -325,6 +328,17 @@ func buildEncodeOptions(cfg *config, effectiveTS uint64, manifest backup.Manifes DynamoDBBundleJSONL: manifest.DynamoDBLayout == backup.DynamoDBLayoutJSONL, SelfTest: cfg.selfTest, } + // Thread manifest exclusions into the library guards (codex P2 v21 + // #904): the S3/SQS reverse encoders can't honor these today, so + // failing closed here surfaces the unsupported-feature errors + // before any bytes are written. The CLI's existing + // buildSelfTestDecodeOptions also threads the same fields into + // the scratch decode path so self-test sees a coherent picture. + if manifest.Exclusions != nil { + encodeOpts.S3IncludeIncompleteUploads = manifest.Exclusions.IncludeIncompleteUploads + encodeOpts.S3IncludeOrphans = manifest.Exclusions.IncludeOrphans + encodeOpts.PreserveSQSVisibility = manifest.Exclusions.PreserveSQSVisibility + } if cfg.selfTest { encodeOpts.SelfTestDecodeOptions = buildSelfTestDecodeOptions(cfg, manifest) } diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index 3feddae73..99af1dccb 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -118,6 +118,73 @@ func TestCLIAdapterDataErrorExitsTwo(t *testing.T) { } } +// TestCLIRejectsUnsupportedManifestExclusions pins codex P2 v21 #904 +// end-to-end: when MANIFEST.json sets one of the three exclusion +// flags the encoder cannot honor (include_incomplete_uploads, +// include_orphans, preserve_sqs_visibility) AND the corresponding +// adapter is enabled, the CLI must exit 2 (data-correctness) before +// any bytes are written. Mirrors the DynamoDB JSONL guard. +func TestCLIRejectsUnsupportedManifestExclusions(t *testing.T) { + t.Parallel() + cases := []struct { + name string + mutate func(*backup.Exclusions) + adapters string + }{ + { + name: "include_incomplete_uploads with --adapter=s3", + mutate: func(e *backup.Exclusions) { e.IncludeIncompleteUploads = true }, + adapters: "s3", + }, + { + name: "include_orphans with --adapter=s3", + mutate: func(e *backup.Exclusions) { e.IncludeOrphans = true }, + adapters: "s3", + }, + { + name: "preserve_sqs_visibility with --adapter=sqs", + mutate: func(e *backup.Exclusions) { e.PreserveSQSVisibility = true }, + adapters: "sqs", + }, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + t.Parallel() + in := t.TempDir() + m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) + m.LastCommitTS = 100 + m.Adapters = &backup.Adapters{} + m.Exclusions = &backup.Exclusions{} + tc.mutate(m.Exclusions) + f, ferr := os.Create(filepath.Join(in, "MANIFEST.json")) + if ferr != nil { + t.Fatalf("create MANIFEST.json: %v", ferr) + } + if werr := backup.WriteManifest(f, m); werr != nil { + t.Fatalf("WriteManifest: %v", werr) + } + if cerr := f.Close(); cerr != nil { + t.Fatalf("close: %v", cerr) + } + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{ + "--input", in, + "--output", out, + "--adapter", tc.adapters, + }, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want exit-2 from unsupported manifest exclusion") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d (data-correctness for unsupported exclusion)", code, exitDataErr) + } + if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { + t.Errorf(".fsm exists at %s; should not be published on unsupported-feature rejection", out) + } + }) + } +} + // TestCLIRejectsLowerLastCommitTSOverride is the fail-closed pin per // parent §"MVCC re-encoding": T < manifest.last_commit_ts → exit 2 // (data-correctness failure, not flag-parse error). diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index 7afd17357..eb73ecc9c 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -41,6 +41,35 @@ var ErrSelfTestLowerLastCommitTS = errors.New("backup: --last-commit-ts T < mani // until the encoder learns the JSONL layout (M7 / future milestone). var ErrEncodeUnsupportedDynamoDBLayout = errors.New("backup: DynamoDB JSONL layout not supported by encoder") +// ErrEncodeUnsupportedS3IncompleteUploads is returned when a caller +// (the CLI threading manifest.exclusions.include_incomplete_uploads, +// or a library caller setting EncodeOptions.S3IncludeIncompleteUploads) +// asks for S3 in-flight multipart uploads to round-trip, but the +// reverse encoder cannot rebuild that subtree. The S3 reverse encoder +// silently skips `_incomplete_uploads/` payload directories +// (internal/backup/encode_s3_objects.go), so a dump that included +// those records would publish an .fsm missing them. Fail closed until +// the encoder learns the subtree (codex P2 v21 #904). +var ErrEncodeUnsupportedS3IncompleteUploads = errors.New("backup: S3 include_incomplete_uploads not supported by encoder") + +// ErrEncodeUnsupportedS3Orphans is returned when the manifest (or a +// library caller) requests round-tripping S3 pre-generation orphan +// blob chunks but the reverse encoder cannot rebuild that subtree. +// Same pattern as ErrEncodeUnsupportedS3IncompleteUploads — the S3 +// reverse encoder silently skips `_orphans/` payload directories; +// fail closed until the encoder learns them (codex P2 v21 #904). +var ErrEncodeUnsupportedS3Orphans = errors.New("backup: S3 include_orphans not supported by encoder") + +// ErrEncodeUnsupportedSQSPreserveVisibility is returned when the +// manifest requests preserving in-flight SQS message visibility state +// (`preserve_sqs_visibility=true`), but the reverse encoder +// unconditionally resets VisibleAtMillis / receive count / first +// receive / receipt token to zero on every restored message +// (internal/backup/encode_sqs.go). A dump that intentionally +// preserved visibility would silently restore as "every message +// visible/reset" without this guard (codex P2 v21 #904). +var ErrEncodeUnsupportedSQSPreserveVisibility = errors.New("backup: preserve_sqs_visibility not supported by encoder") + // ErrRedisEncodeMultiDBUnsupported is returned when the input tree // contains a Redis db_/ for any N != 0, or contains multiple // db_ directories. The current Redis MVCC key prefixes @@ -107,6 +136,31 @@ type EncodeOptions struct { // support, this field will switch from a guard to a control. DynamoDBBundleJSONL bool + // S3IncludeIncompleteUploads is true when the input dump's + // MANIFEST.json has `exclusions.include_incomplete_uploads=true` + // (the producer dumped in-flight multipart uploads under + // _incomplete_uploads/). The reverse encoder cannot rebuild that + // subtree today; fail-closed via ErrEncodeUnsupportedS3IncompleteUploads + // when true AND Adapters.S3 is enabled (codex P2 v21 #904). + S3IncludeIncompleteUploads bool + + // S3IncludeOrphans is true when the input dump's MANIFEST.json + // has `exclusions.include_orphans=true` (the producer dumped + // pre-generation orphan blob chunks under _orphans/). The reverse + // encoder cannot rebuild that subtree today; fail-closed via + // ErrEncodeUnsupportedS3Orphans when true AND Adapters.S3 is + // enabled (codex P2 v21 #904). + S3IncludeOrphans bool + + // PreserveSQSVisibility is true when the input dump's MANIFEST.json + // has `exclusions.preserve_sqs_visibility=true` (the producer + // preserved in-flight message visibility state — VisibleAtMillis, + // receive count, first receive, receipt token). The reverse + // encoder unconditionally zeros those fields on every restored + // message; fail-closed via ErrEncodeUnsupportedSQSPreserveVisibility + // when true AND Adapters.SQS is enabled (codex P2 v21 #904). + PreserveSQSVisibility bool + // ManifestLastCommitTS is the floor LastCommitTS must not fall // below. When > 0, EncodeSnapshot fails-closed with // ErrSelfTestLowerLastCommitTS if LastCommitTS < ManifestLastCommitTS. @@ -278,6 +332,34 @@ func validateEncodeOptionsData(opts EncodeOptions) error { // JSONL items would be silently skipped (codex P2 v7 #904). return errors.WithStack(ErrEncodeUnsupportedDynamoDBLayout) } + return validateEncodeOptionsUnsupportedFeatures(opts) +} + +// validateEncodeOptionsUnsupportedFeatures rejects manifest-derived +// flags that the per-adapter encoders cannot honor today. Each guard +// fires only when the corresponding adapter is enabled — a caller +// encoding only Redis + SQS with `include_incomplete_uploads=true` +// inherited from the manifest is unaffected (no S3 → no concern). +// Split out from validateEncodeOptionsData to stay under cyclop; +// codex P2 v21 #904 added the three S3/SQS exclusion guards. +func validateEncodeOptionsUnsupportedFeatures(opts EncodeOptions) error { + if opts.S3IncludeIncompleteUploads && opts.Adapters.S3 { + // The S3 reverse encoder skips _incomplete_uploads/ payload + // directories; a dump that included them would silently lose + // those records (codex P2 v21 #904 L326). + return errors.WithStack(ErrEncodeUnsupportedS3IncompleteUploads) + } + if opts.S3IncludeOrphans && opts.Adapters.S3 { + // The S3 reverse encoder skips _orphans/ payload directories; + // same silent-data-loss pattern (codex P2 v21 #904 L326). + return errors.WithStack(ErrEncodeUnsupportedS3Orphans) + } + if opts.PreserveSQSVisibility && opts.Adapters.SQS { + // The SQS reverse encoder unconditionally zeroes the + // visibility fields; a dump that preserved them would lose + // the state on restore (codex P2 v21 #904 L473). + return errors.WithStack(ErrEncodeUnsupportedSQSPreserveVisibility) + } return nil } diff --git a/internal/backup/encode_snapshot_test.go b/internal/backup/encode_snapshot_test.go index 6cd20b99a..59e157087 100644 --- a/internal/backup/encode_snapshot_test.go +++ b/internal/backup/encode_snapshot_test.go @@ -366,6 +366,117 @@ func TestEncodeSnapshotManifestFloorOptOut(t *testing.T) { } } +// TestEncodeSnapshotRejectsUnsupportedFeatures pins codex P2 v21 #904: +// three manifest-derived flags that the per-adapter encoders cannot +// honor today must fail-closed before any bytes are written. Each +// guard fires only when the corresponding adapter is enabled — +// orthogonal callers (e.g., S3IncludeIncompleteUploads=true while +// encoding only Redis) are unaffected. Pattern matches the +// DynamoDBBundleJSONL guard from v8. +func TestEncodeSnapshotRejectsUnsupportedFeatures(t *testing.T) { + t.Parallel() + cases := []struct { + name string + opts EncodeOptions + wantErr error + }{ + { + name: "S3 include_incomplete_uploads", + opts: EncodeOptions{ + Adapters: AdapterSet{S3: true}, + S3IncludeIncompleteUploads: true, + }, + wantErr: ErrEncodeUnsupportedS3IncompleteUploads, + }, + { + name: "S3 include_orphans", + opts: EncodeOptions{ + Adapters: AdapterSet{S3: true}, + S3IncludeOrphans: true, + }, + wantErr: ErrEncodeUnsupportedS3Orphans, + }, + { + name: "SQS preserve_visibility", + opts: EncodeOptions{ + Adapters: AdapterSet{SQS: true}, + PreserveSQSVisibility: true, + }, + wantErr: ErrEncodeUnsupportedSQSPreserveVisibility, + }, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + t.Parallel() + in := t.TempDir() + opts := tc.opts + opts.InputRoot = in + opts.LastCommitTS = 1 + opts.AllowMissingManifest = true + var buf bytes.Buffer + _, err := EncodeSnapshot(opts, &buf) + if err == nil { + t.Fatalf("EncodeSnapshot accepted unsupported feature; want %v", tc.wantErr) + } + if !errors.Is(err, tc.wantErr) { + t.Errorf("err = %v, want errors.Is %v", err, tc.wantErr) + } + if buf.Len() != 0 { + t.Errorf("buf.Len = %d, want 0 (no bytes should be written on rejection)", buf.Len()) + } + }) + } +} + +// TestEncodeSnapshotUnsupportedFeaturesGatedByAdapter pins that each +// of the three v21 guards fires ONLY when its corresponding adapter +// is enabled. A library caller that inherits the manifest flag but +// disables the affected adapter is unaffected — mirrors the JSONL +// guard's "DDB not in scope" exemption. +func TestEncodeSnapshotUnsupportedFeaturesGatedByAdapter(t *testing.T) { + t.Parallel() + cases := []struct { + name string + opts EncodeOptions + }{ + { + name: "S3IncludeIncompleteUploads with S3 disabled", + opts: EncodeOptions{ + Adapters: AdapterSet{SQS: true}, // not S3 + S3IncludeIncompleteUploads: true, + }, + }, + { + name: "S3IncludeOrphans with S3 disabled", + opts: EncodeOptions{ + Adapters: AdapterSet{SQS: true}, // not S3 + S3IncludeOrphans: true, + }, + }, + { + name: "PreserveSQSVisibility with SQS disabled", + opts: EncodeOptions{ + Adapters: AdapterSet{Redis: true}, // not SQS + PreserveSQSVisibility: true, + }, + }, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + t.Parallel() + in := t.TempDir() + opts := tc.opts + opts.InputRoot = in + opts.LastCommitTS = 1 + opts.AllowMissingManifest = true + var buf bytes.Buffer + if _, err := EncodeSnapshot(opts, &buf); err != nil { + t.Errorf("EncodeSnapshot rejected unsupported flag when its adapter was out of scope: %v", err) + } + }) + } +} + // TestEncodeSnapshotRejectsDynamoDBJSONLLayout pins codex P2 v7 #904: // the DynamoDB reverse encoder does not support the JSONL bundle // layout, so a caller that threads DynamoDBBundleJSONL=true must be From 90f49b67acac535abdae6fe0132c6f0b1f9cf24d Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 17:39:13 +0900 Subject: [PATCH 24/35] backup: #904 v24 - claude v23 mandatory doc fix Claude v23 verdict: the classifyEncodeError godoc lists 5 sentinel sources but the switch now covers 8 (v23 added the three unsupported-feature sentinels: ErrEncodeUnsupportedS3IncompleteUploads, ErrEncodeUnsupportedS3Orphans, ErrEncodeUnsupportedSQSPreserveVisibility). A runbook author reading the comment to enumerate exit-2 sentinels would find 5 and not understand where the extra 3 came from. Doc updates: - classifyEncodeError godoc: added "unsupported manifest exclusion flags" to the summary clause; added three bullet entries to the "Sources of each sentinel" list, each pointing at validateEncodeOptionsUnsupportedFeatures with the codex P2 v21 citation. - validateEncodeOptionsData godoc (also flagged by claude v23 as non-blocking): summary now mentions the three guards transitively enforced via validateEncodeOptionsUnsupportedFeatures. No behavior change. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 15 +++++++++++---- internal/backup/encode_snapshot.go | 5 ++++- 2 files changed, 15 insertions(+), 5 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 0cea69299..a5364452b 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -88,16 +88,23 @@ func run(argv []string, logger *slog.Logger) (int, error) { // classifyEncodeError maps the encodeOne return value to a CLI exit // code. Data-correctness sentinels (HLC ceiling regression, JSONL -// layout, adapter rejecting input-tree contents, self-test mismatch, -// corrupt manifest) → exit 2; everything else → exit 1. Runbooks -// branch on exit status to triage bad-dump-data vs operator typos, -// so this mapping is part of the CLI contract. +// layout, unsupported manifest exclusion flags, adapter rejecting +// input-tree contents, self-test mismatch, corrupt manifest) → +// exit 2; everything else → exit 1. Runbooks branch on exit status +// to triage bad-dump-data vs operator typos, so this mapping is +// part of the CLI contract. // // Sources of each sentinel: // - ErrSelfTestLowerLastCommitTS: CLI resolveLastCommitTS + library // validateEncodeOptionsData (codex P2 v2 #904) // - ErrEncodeUnsupportedDynamoDBLayout: validateEncodeOptionsData // (codex P2 v7 #904) +// - ErrEncodeUnsupportedS3IncompleteUploads: validateEncodeOptionsUnsupportedFeatures +// (codex P2 v21 #904) +// - ErrEncodeUnsupportedS3Orphans: validateEncodeOptionsUnsupportedFeatures +// (codex P2 v21 #904) +// - ErrEncodeUnsupportedSQSPreserveVisibility: validateEncodeOptionsUnsupportedFeatures +// (codex P2 v21 #904) // - ErrEncodeAdapterData: runAdapterEncoders mark on adapter // rejection (codex P2 v9 #904) // - errSelfTestMismatch: writeAndPublish self-test branch diff --git a/internal/backup/encode_snapshot.go b/internal/backup/encode_snapshot.go index eb73ecc9c..dfc38c436 100644 --- a/internal/backup/encode_snapshot.go +++ b/internal/backup/encode_snapshot.go @@ -318,7 +318,10 @@ func checkInputRoot(opts EncodeOptions) error { } // validateEncodeOptionsData covers the data-correctness pre-conditions: -// HLC ceiling floor and DynamoDB JSONL guard. Kept separate from the +// HLC ceiling floor, DynamoDB JSONL guard, and (via +// validateEncodeOptionsUnsupportedFeatures) the three manifest +// exclusion guards added by codex P2 v21 #904 (S3 incomplete uploads, +// S3 orphans, SQS preserve-visibility). Kept separate from the // nil/empty-args checks so each function stays cyclop-clean. func validateEncodeOptionsData(opts EncodeOptions) error { if opts.ManifestLastCommitTS > 0 && opts.LastCommitTS < opts.ManifestLastCommitTS { From a9fe00b08ee68e6af665264e8a861c9e969d9f5c Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 21:50:39 +0900 Subject: [PATCH 25/35] backup: #904 v25 - fsync parent dir after rename + sidecar write (codex P2 v24) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex P2 v24: the publish path fsyncs the temp file before rename, but neither the post-rename .fsm path nor the post-write sidecar path fsyncs the parent directory. On filesystems where rename durability requires syncing the containing directory, a power loss or host crash immediately after a successful encode can lose the new directory entry (or resurrect the old one) even though the CLI returned success. The repo already has this pattern in internal/encryption/sidecar.go (syncDir) and internal/raftengine/etcd/persistence.go; codex flagged its absence on this publish path. Fix: new fsyncParentDir(path) helper that opens filepath.Dir(path) read-only and calls Sync. Called after two write paths: - writeAndPublish: after os.Rename(tempPath, cfg.outputPath) succeeds. Sole production caller is encodeOne. - writeSidecar: after the sidecar's f.Sync() + Close. Sole production caller is encodeOne. Both error returns are bare wraps (not data-correctness sentinels), so classifyEncodeError routes them to exit-1 — the correct exit class for "operator environment can't durable-publish" (NFS, FUSE mounts that reject directory fsync, ENOSPC on parent, etc.). A successful publish on a normal filesystem is unchanged. The helper is local to package main rather than importing internal/encryption to avoid a one-direction dependency for a 6-line file-open + Sync. Mirrors syncDir's contract: read-only open + Sync, error bubbled up wrapped with the directory path. ## Caller audit per CLAUDE.md semantic-change rule - writeAndPublish: success path now does one extra fsync call. Same return contract (result, error). Sole caller is encodeOne. New failure mode: fsync error → encodeOne returns wrapped error → run() classifies as exit-1 (no matching data-correctness sentinel). - writeSidecar: success path now does one extra fsync call. Same return contract (error). Sole caller is encodeOne (success and self-test mismatch branches). New failure mode is the same as the existing pre-existing Sync()/Close() failures: returned wrapped. - fsyncParentDir: new package-private helper. No production callers outside main.go; no test seam needed (the success path runs on every CI test that exercises encodeOne). ## Test coverage Every existing test that exercises a successful encode + publish runs the new fsync call. Failure-injection for fsync would require a heavy seam (override os.Open + File.Sync for one test); not added since the success path is already covered and the failure surface matches the existing Sync/Close failure paths already covered. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 38 +++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index a5364452b..bbe429750 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -437,9 +437,41 @@ func writeAndPublish(cfg *config, encodeOpts backup.EncodeOptions, mismatchPath return result, errors.Wrap(err, "rename tmp -> output") } publishedTempPath = "" // rename succeeded; defer no-ops + // fsync the parent dir so the rename's new directory entry is + // durable. Without this, a power loss / host crash immediately + // after a successful encode can lose the new entry (or + // resurrect the old one) on filesystems where rename durability + // requires syncing the containing directory. Mirrors the repo + // pattern used by internal/encryption/sidecar.go + + // internal/raftengine/etcd/persistence.go (codex P2 v24 #904). + if err := fsyncParentDir(cfg.outputPath); err != nil { + return result, errors.Wrap(err, "fsync output dir after rename") + } return result, nil } +// fsyncParentDir opens the parent directory of path read-only and +// calls fsync on its file descriptor. On most POSIX filesystems this +// is what makes os.Rename durable. Errors other than path-traversal +// (which means the operator passed something weird like "" — already +// rejected upstream) bubble up so the caller can surface them. +// +// Mirrors syncDir in internal/encryption/sidecar.go and the etcd +// raftengine persistence helper; kept local here so the CLI binary +// doesn't depend on internal/encryption for a 6-line helper. +func fsyncParentDir(path string) error { + dir := filepath.Dir(path) + f, err := os.Open(dir) //nolint:gosec // dir is derived from operator-supplied --output path + if err != nil { + return errors.Wrapf(err, "open parent dir %q", dir) + } + defer func() { _ = f.Close() }() + if err := f.Sync(); err != nil { + return errors.Wrapf(err, "fsync parent dir %q", dir) + } + return nil +} + // encodeToTempFile creates tempPath, runs EncodeSnapshot into it, // fsync+close. Caller is responsible for the os.Remove cleanup on error. // The temp file is created mode 0600 so the on-disk .fsm is not @@ -547,5 +579,11 @@ func writeSidecar(cfg *config, m backup.Manifest, effectiveTS uint64, overridden if err := f.Close(); err != nil { return errors.WithStack(err) } + // fsync the parent dir so the new sidecar's directory entry is + // durable alongside its bytes. Mirrors the rename path + // (codex P2 v24 #904). + if err := fsyncParentDir(sidecarPath); err != nil { + return errors.Wrap(err, "fsync sidecar parent dir") + } return nil } From 836646d2a2cf0206c2482ee2a1677f21f474ca77 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 22:00:50 +0900 Subject: [PATCH 26/35] backup: #904 v26 - no-follow opens for sidecar + mismatch.txt (codex P2 v25) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex P2 v25 #904: the CLI's .encode_info.json writer used os.OpenFile with O_CREATE|O_WRONLY|O_TRUNC and no O_NOFOLLOW. If an attacker (or a confused shared-host config) pre-creates a symlink at that deterministic path pointing at a sensitive file the encoding user can write, the open follows the link and the truncate-and-write lands on the target — exactly the sidecar-clobber attack the in-package adapter writers already defend against via internal/backup/open_nofollow_unix.go. The mismatch.txt writer at the same operator-supplied directory has the identical threat model: deterministic name, attacker can pre-place a symlink. Fixed for symmetry. ## Fix New exported wrapper backup.OpenSidecarFile(path) delegates to the existing unexported per-platform openSidecarFile. Single non-build-tagged file so the compiler picks up the right platform-specific impl (unix: O_NOFOLLOW + O_NONBLOCK + Nlink check + regular-file check; Windows: Lstat-then-OpenFile; other: fallback Lstat). writeSidecar in cmd/elastickv-snapshot-encode/main.go switches to backup.OpenSidecarFile (was os.OpenFile + O_CREATE|O_WRONLY|O_TRUNC). mismatch.txt write extracted into a new writeMismatchTxt helper that uses the same primitive. mismatchTxtPerm constant removed — both writes now get the 0o600 mode from inside openSidecarFile. The const for the temp .fsm path (encodeInfoFilePerm) stays because that write uses a different codepath (random suffix means no pre-existing symlink risk). ## Pinned by TestCLISidecarWriteRefusesSymlinkTarget (skipped on Windows): - Pre-places a "victim" file with a sentinel body in a separate temp dir. - Symlinks .encode_info.json to that victim. - Runs the CLI with a fresh manifest (encode would otherwise succeed) and asserts: - run() returns non-nil error and exits exitUserErr (not exitDataErr — an operator-env failure, not a data-correctness rejection). - The victim file is byte-equal to its pre-write contents (the OpenSidecarFile open did NOT follow the symlink and the truncate-write never reached the victim). ## Caller audit per CLAUDE.md semantic-change rule - writeSidecar: success path now opens via backup.OpenSidecarFile. Sole caller is encodeOne. On a symlinked sidecar path, returns wrapped ELOOP error → classifyEncodeError routes to exit-1 (no matching data-correctness sentinel). On a hard-linked or non- regular-file path, returns a wrapped error from the helper's own fstat guards. Same return contract; same caller; same on-disk effect on the success path (mode 0o600, atomic-replace). - writeMismatchTxt: new helper, sole caller is the self-test mismatch branch of writeAndPublish (was inline os.WriteFile). Returns wrapped error. The caller previously logged-and-continued on the os.WriteFile error; preserved that behavior (logger.Warn + drop through to the mismatch-error return). - backup.OpenSidecarFile: new exported function, simple wrapper. No production callers outside this CLI; the package backup in-package callers continue to call openSidecarFile directly. Tests + lint green (incl. gofmt on the perm-const block). --- cmd/elastickv-snapshot-encode/main.go | 50 +++++++++++++++++++--- cmd/elastickv-snapshot-encode/main_test.go | 49 +++++++++++++++++++++ internal/backup/open_sidecar_export.go | 21 +++++++++ 3 files changed, 114 insertions(+), 6 deletions(-) create mode 100644 internal/backup/open_sidecar_export.go diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index bbe429750..5dae08a22 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -50,9 +50,13 @@ const ( // 8-hex/4-byte form was flagged as collision-prone in highly // concurrent CI environments (gemini medium #904); 8 bytes is the // same width crypto/rand.Read pads cryptographic nonces to. - tempSuffixHexLen = 16 - tempSuffixByteLen = tempSuffixHexLen / 2 - mismatchTxtPerm = 0o600 + tempSuffixHexLen = 16 + tempSuffixByteLen = tempSuffixHexLen / 2 + // mismatchTxtPerm + sidecar perm constants were removed in v25: + // both writes now go through backup.OpenSidecarFile which fixes + // the mode at 0o600 internally (codex P2 v25 #904 — no-follow + // open required different syscall semantics than os.OpenFile + + // O_TRUNC, so the perm now lives in the helper). encodeInfoFilePerm = 0o600 ) @@ -352,6 +356,31 @@ func buildEncodeOptions(cfg *config, effectiveTS uint64, manifest backup.Manifes return encodeOpts } +// writeMismatchTxt writes the self-test mismatch report to mismatchPath +// using the same no-follow/no-clobber discipline as the sidecar +// writer: an attacker pre-placing a symlink at +// .mismatch.txt could otherwise redirect the +// truncate-and-write into a target of their choosing (codex P2 v25 +// #904 — extending the sidecar guard to the sibling deterministic +// write path). On open failure the caller (writeAndPublish) logs at +// warn level and continues; the failure does NOT block the +// errSelfTestMismatch return so the mismatch error remains the +// dominant signal. +func writeMismatchTxt(mismatchPath string, body []byte) error { + f, err := backup.OpenSidecarFile(mismatchPath) + if err != nil { + return errors.Wrap(err, "open mismatch.txt") + } + if _, werr := f.Write(body); werr != nil { + _ = f.Close() + return errors.Wrap(werr, "write mismatch.txt body") + } + if cerr := f.Close(); cerr != nil { + return errors.Wrap(cerr, "close mismatch.txt") + } + return nil +} + // removeStaleOutputFSM removes outputPath ONLY when it exists as a // regular file or a symlink. Both shapes satisfy the "no // restore-visible FSM after self-test mismatch" contract: removing a @@ -413,7 +442,7 @@ func writeAndPublish(cfg *config, encodeOpts backup.EncodeOptions, mismatchPath return result, err } if cfg.selfTest && !result.SelfTestMatched { - if werr := os.WriteFile(mismatchPath, result.SelfTestMismatchTxt, mismatchTxtPerm); werr != nil { + if werr := writeMismatchTxt(mismatchPath, result.SelfTestMismatchTxt); werr != nil { logger.Warn("write mismatch.txt", "err", werr) } // Remove the stale .fsm if one exists from a prior @@ -564,9 +593,18 @@ func writeSidecar(cfg *config, m backup.Manifest, effectiveTS uint64, overridden // 0o600 keeps ENCODE_INFO.json (which includes the source path, // cluster_id, and SHA-256 of the .fsm) from leaking to non-owner // users on multi-user backup hosts (claude v4 #904). - f, err := os.OpenFile(sidecarPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, encodeInfoFilePerm) //nolint:gosec // operator-supplied path + // + // backup.OpenSidecarFile refuses to follow a symlink at the + // sidecar path, refuses to truncate a hard-linked or + // non-regular file there, and (on unix) refuses to block on a + // reader-less FIFO — all the clobber-attack vectors the adapter + // dump writers already defend against. Without these guards an + // attacker pre-placing a symlink at .encode_info.json + // could redirect the truncate-and-write into a target of their + // choosing (codex P2 v25 #904). + f, err := backup.OpenSidecarFile(sidecarPath) if err != nil { - return errors.WithStack(err) + return errors.Wrap(err, "open sidecar") } if err := backup.WriteEncodeInfo(f, info); err != nil { _ = f.Close() diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index 99af1dccb..c3473a244 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -906,6 +906,55 @@ func TestParseLastCommitTS(t *testing.T) { } } +// TestCLISidecarWriteRefusesSymlinkTarget pins codex P2 v25 #904: the +// CLI's encode_info sidecar writer must not follow a pre-existing +// symlink at .encode_info.json. An attacker (or a confused +// shared-host config) could plant a symlink pointing at a sensitive +// file the encoding user can write; the prior os.OpenFile + O_TRUNC +// path would have truncated that target and then written the JSON +// blob to it. backup.OpenSidecarFile refuses the open with ELOOP on +// unix. Skipped on Windows where symlink semantics differ and the +// Windows variant of OpenSidecarFile uses the Lstat-then-OpenFile +// guard. +func TestCLISidecarWriteRefusesSymlinkTarget(t *testing.T) { + if isWindows { + t.Skip("symlink-refusal semantics differ on Windows") + } + t.Parallel() + in := t.TempDir() + emitMinimalManifest(t, in, 100) + + outDir := t.TempDir() + out := filepath.Join(outDir, "out.fsm") + sidecarPath := backup.EncodeInfoSidecarPath(out) + // Pre-plant the attacker's victim file and the symlink. + victimDir := t.TempDir() + victim := filepath.Join(victimDir, "victim.json") + const victimBody = "VICTIM CONTENTS — must survive sidecar write" + if err := os.WriteFile(victim, []byte(victimBody), 0o600); err != nil { + t.Fatalf("WriteFile victim: %v", err) + } + if err := os.Symlink(victim, sidecarPath); err != nil { + t.Fatalf("Symlink at sidecar path: %v", err) + } + + code, err := run([]string{"--input", in, "--output", out}, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want sidecar open to fail on symlinked path") + } + if code != exitUserErr { + t.Errorf("exit code = %d, want %d (operator-env error, not data-correctness)", code, exitUserErr) + } + // Victim file MUST be untouched. + got, rerr := os.ReadFile(victim) + if rerr != nil { + t.Fatalf("read victim: %v", rerr) + } + if string(got) != victimBody { + t.Errorf("victim mutated; OpenSidecarFile followed the symlink (codex P2 v25 regression)") + } +} + // TestCLIPublishesFsmAndSidecarMode0600 pins claude v4 #904: the // produced .fsm and ENCODE_INFO.json are created with mode 0o600 so a // multi-user backup host does not get a world-readable dataset. The diff --git a/internal/backup/open_sidecar_export.go b/internal/backup/open_sidecar_export.go new file mode 100644 index 000000000..b476e87bc --- /dev/null +++ b/internal/backup/open_sidecar_export.go @@ -0,0 +1,21 @@ +package backup + +import "os" + +// OpenSidecarFile is the exported wrapper around the per-platform +// openSidecarFile. It opens path for write while refusing symlink, +// hard-link, FIFO, socket, and other non-regular-file clobber +// attacks via the platform-appropriate primitives (O_NOFOLLOW + +// O_NONBLOCK + Nlink check on unix; Lstat-then-OpenFile on Windows; +// a stricter Lstat-then-OpenFile fallback on other platforms). +// +// Use this whenever a writer creates or replaces a "sidecar" style +// file at a deterministic path inside an operator-supplied +// directory — the path is predictable to an attacker who can pre- +// create the entry, so the open MUST refuse to follow a symlink or +// truncate a hard-linked / non-regular file (codex P2 v25 #904 +// extended this from in-package adapter writers to the +// cmd/elastickv-snapshot-encode CLI's ENCODE_INFO.json writer). +func OpenSidecarFile(path string) (*os.File, error) { + return openSidecarFile(path) +} From 029bb349965bb43ea499caecad0801c8f827b362 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 22:11:06 +0900 Subject: [PATCH 27/35] backup: #904 v27 - validate selected adapters against manifest (codex P2 v26) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Codex P2 v26: silent partial / stale encodes The CLI built EncodeOptions from cfg.adapters without consulting manifest.Adapters. Two silent-data-correctness scenarios: - Scenario A (truncated dump): MANIFEST.json says `adapters.sqs` (non-nil scope) but the on-disk `sqs/` subdir was lost. The SQS reverse encoder no-ops on a missing top-level subdir, so the CLI publishes a header-only or partial `.fsm` without flagging it. - Scenario B (stale subdir): MANIFEST.json omits one adapter (nil scope), but a stale subdir for that adapter still exists on disk from a prior decode. The default `--adapter dynamodb,s3,redis,sqs` enables all four, so the encoder picks up the stale subtree even though the manifest says the producer didn't dump it. Both scenarios are silent integrity failures the encoder is supposed to fail closed against. ## Fix New errAdapterNotInManifest sentinel + new validateAdaptersAgainstManifest(cfg.adapters, manifest, inputPath) helper called from encodeOne immediately after readInputManifest. For each enabled adapter, the helper requires BOTH: 1. manifest.Adapters. is non-nil (producer dumped this adapter). 2. // exists and is a directory. Either failure returns errAdapterNotInManifest wrapped with the specific adapter name and reason. classifyEncodeError routes the sentinel to exit 2 (data-correctness, like the manifest-floor and manifest-corruption sentinels). Helper is split into validateAdaptersAgainstManifest + manifestAdapterField + checkOneAdapterScope so each function stays under the cyclop bound and the per-adapter table is readable. ## Pinned by - TestCLIRejectsAdapterNotInManifest: manifest lists ONLY SQS, user runs with default `--adapter` (enables all four). Asserts exit 2 and no .fsm published. - TestCLIRejectsAdapterListedButSubdirMissing: manifest lists SQS with a non-empty Queues scope, but sqs/ subdir is absent on disk. Asserts exit 2 and no .fsm published. ## Test fixture update — emitMinimalManifest Pre-v27 emitMinimalManifest produced a manifest with `Adapters = &backup.Adapters{}` (all per-adapter pointers nil), and 18 CLI tests relied on this minimal shape. The new validation would reject every one of them because the default `--adapter` enables all four adapters but the manifest claims none. emitMinimalManifest now populates all four adapter scopes as `&Adapter{}` AND creates an empty subdir per adapter under outRoot. The encoder treats each empty subdir as no-op (same as "no records to encode"), so existing tests continue to exercise their specific rejection paths without tripping the new guard. Tests that need to construct adapter-scope mismatches explicitly bypass emitMinimalManifest and write a custom manifest (the two new tests above use this pattern). ## Caller audit per CLAUDE.md semantic-change rule - encodeOne: gained one new validation call right after readInputManifest. Sole caller is run(). Failure mode adds one new sentinel to the error chain; classifyEncodeError routes to exit 2; on success no behavior change. - validateAdaptersAgainstManifest, manifestAdapterField, checkOneAdapterScope: new helpers. Sole production caller is encodeOne. No external callers. - emitMinimalManifest (test fixture): produces a richer baseline manifest + on-disk skeleton. All 18 existing CLI tests pass unchanged (they exercise their respective rejection paths before the adapter-scope check fires, or they intentionally encode empty trees). Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 107 ++++++++++++++++++++- cmd/elastickv-snapshot-encode/main_test.go | 100 ++++++++++++++++++- 2 files changed, 205 insertions(+), 2 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 5dae08a22..cfc48db57 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -124,13 +124,105 @@ func classifyEncodeError(err error) int { errors.Is(err, backup.ErrEncodeAdapterData), errors.Is(err, errSelfTestMismatch), errors.Is(err, backup.ErrInvalidManifest), - errors.Is(err, backup.ErrUnsupportedFormatVersion): + errors.Is(err, backup.ErrUnsupportedFormatVersion), + errors.Is(err, errAdapterNotInManifest): return exitDataErr default: return exitUserErr } } +// validateAdaptersAgainstManifest intersects the user-selected +// AdapterSet with manifest.Adapters and stats each enabled adapter's +// top-level subdir under inputPath. Two failure modes — both +// data-correctness, both routed to exit 2: +// +// - Manifest lists no scope for an enabled adapter (nil pointer in +// manifest.Adapters.). A truncated/wrong manifest combined with +// the default `--adapter dynamodb,s3,redis,sqs` would otherwise +// pick up a stale on-disk subdir for an adapter the producer did +// not dump (codex P2 v26 #904 scenario B). +// - Manifest lists a scope for an enabled adapter but the on-disk +// subdir is absent. The adapter encoder's "missing top-level +// subdir → no-op" behavior would otherwise publish a header-only +// or partial .fsm without a hard error (codex P2 v26 #904 +// scenario A). +// +// A nil manifest.Adapters is treated as "manifest has no scopes for +// any adapter" — every enabled adapter trips the scenario-B guard. +// Older manifests that omit the Adapters block deliberately are +// expected to also omit on-disk subdirs and pass `--adapter` set to +// only what they DO contain; that case is operator-driven and +// surfaces the same fail-closed error here. +func validateAdaptersAgainstManifest(selected backup.AdapterSet, m backup.Manifest, inputPath string) error { + checks := []struct { + name string + selected bool + scope *backup.Adapter + subdir string + }{ + {"dynamodb", selected.DynamoDB, manifestAdapterField(m.Adapters, "dynamodb"), "dynamodb"}, + {"s3", selected.S3, manifestAdapterField(m.Adapters, "s3"), "s3"}, + {"redis", selected.Redis, manifestAdapterField(m.Adapters, "redis"), "redis"}, + {"sqs", selected.SQS, manifestAdapterField(m.Adapters, "sqs"), "sqs"}, + } + for _, c := range checks { + if err := checkOneAdapterScope(c.name, c.selected, c.scope, filepath.Join(inputPath, c.subdir)); err != nil { + return err + } + } + return nil +} + +// manifestAdapterField returns the *Adapter for one adapter name from +// m.Adapters, or nil if m.Adapters or the specific adapter pointer is +// nil. Centralized so validateAdaptersAgainstManifest's table stays +// readable. +func manifestAdapterField(a *backup.Adapters, name string) *backup.Adapter { + if a == nil { + return nil + } + switch name { + case "dynamodb": + return a.DynamoDB + case "s3": + return a.S3 + case "redis": + return a.Redis + case "sqs": + return a.SQS + default: + return nil + } +} + +// checkOneAdapterScope is the per-adapter half of +// validateAdaptersAgainstManifest. Selected adapters MUST have a +// manifest scope AND a present on-disk subdir; not-selected adapters +// are unchecked (the operator chose to skip them). +func checkOneAdapterScope(name string, selected bool, scope *backup.Adapter, subdirPath string) error { + if !selected { + return nil + } + if scope == nil { + return errors.Wrapf(errAdapterNotInManifest, + "adapter %q selected but MANIFEST.json has no scope for it (use --adapter to restrict, or re-dump including this adapter)", + name) + } + info, err := os.Stat(subdirPath) + if err != nil { + return errors.Wrapf(errAdapterNotInManifest, + "adapter %q listed in MANIFEST.json but on-disk subdir %s is missing (stat: %v)", + name, subdirPath, err) + } + if !info.IsDir() { + return errors.Wrapf(errAdapterNotInManifest, + "adapter %q listed in MANIFEST.json but %s is not a directory (mode=%s)", + name, subdirPath, info.Mode()) + } + return nil +} + func parseFlags(argv []string) (*config, error) { fs := flag.NewFlagSet("elastickv-snapshot-encode", flag.ContinueOnError) fs.SetOutput(io.Discard) @@ -263,11 +355,24 @@ func applyAdapterName(name string, s *backup.AdapterSet) error { // to exit-2 without coupling to the encoder's mismatch.txt format. var errSelfTestMismatch = errors.New("backup: --self-test diff against --input") +// errAdapterNotInManifest is returned by validateAdaptersAgainstManifest +// when the user has enabled an adapter that the manifest doesn't list, +// or when the manifest lists an adapter whose top-level subdir is +// missing on disk. Both are silent-data-loss scenarios per codex P2 +// v26 #904: the per-adapter encoders no-op on missing subdirs, so a +// truncated dump or a stale-dir-with-default-`--adapter-all` would +// publish a header-only/partial .fsm. classifyEncodeError routes this +// to exit 2 (data-correctness). +var errAdapterNotInManifest = errors.New("encode: adapter scope mismatch between MANIFEST.json and --adapter / on-disk tree") + func encodeOne(cfg *config, logger *slog.Logger) error { manifest, err := readInputManifest(cfg.inputPath) if err != nil { return err } + if err := validateAdaptersAgainstManifest(cfg.adapters, manifest, cfg.inputPath); err != nil { + return err + } effectiveTS, overridden, err := resolveLastCommitTS(cfg, manifest.LastCommitTS) if err != nil { return err diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index c3473a244..f28deae42 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -22,11 +22,28 @@ var isWindows = runtime.GOOS == "windows" // emitMinimalManifest writes a minimal valid MANIFEST.json under outRoot // with the given lastCommitTS. Used by every CLI test as the producer- // side artifact the encoder will consume. +// +// Populates manifest.Adapters with ALL FOUR adapter scopes (each as +// a non-nil &Adapter{} so the codex P2 v26 #904 validation +// — validateAdaptersAgainstManifest — sees a manifest that claims +// every adapter was dumped. Also creates an empty subdir for each +// adapter under outRoot so the same validation's on-disk-stat guard +// passes (the encoder treats an empty subdir as no-op, matching the +// "no records to encode" semantics). +// +// Tests that need to assert rejection on specific adapter-scope +// scenarios (e.g. manifest missing one adapter) can mutate the +// manifest or filesystem after this returns. func emitMinimalManifest(t *testing.T, outRoot string, lastCommitTS uint64) { t.Helper() m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) m.LastCommitTS = lastCommitTS - m.Adapters = &backup.Adapters{} + m.Adapters = &backup.Adapters{ + DynamoDB: &backup.Adapter{}, + S3: &backup.Adapter{}, + Redis: &backup.Adapter{}, + SQS: &backup.Adapter{}, + } m.Exclusions = &backup.Exclusions{} f, err := os.Create(filepath.Join(outRoot, "MANIFEST.json")) if err != nil { @@ -38,6 +55,11 @@ func emitMinimalManifest(t *testing.T, outRoot string, lastCommitTS uint64) { if err := f.Close(); err != nil { t.Fatalf("close: %v", err) } + for _, sub := range []string{"dynamodb", "s3", "redis", "sqs"} { + if mkErr := os.MkdirAll(filepath.Join(outRoot, sub), 0o755); mkErr != nil { + t.Fatalf("MkdirAll %s: %v", sub, mkErr) + } + } } func quietLogger() *slog.Logger { @@ -185,6 +207,82 @@ func TestCLIRejectsUnsupportedManifestExclusions(t *testing.T) { } } +// TestCLIRejectsAdapterNotInManifest pins codex P2 v26 #904 scenario B: +// the user enables an adapter (default `--adapter` is all four) but +// MANIFEST.json doesn't list a scope for it. The prior code would +// pick up a stale on-disk subdir for an unlisted adapter; the new +// guard rejects with exit 2. +func TestCLIRejectsAdapterNotInManifest(t *testing.T) { + t.Parallel() + in := t.TempDir() + // Manifest lists ONLY SQS (no DynamoDB / S3 / Redis), but the + // default `--adapter dynamodb,s3,redis,sqs` enables all four. + m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) + m.LastCommitTS = 100 + m.Adapters = &backup.Adapters{SQS: &backup.Adapter{}} + m.Exclusions = &backup.Exclusions{} + f, ferr := os.Create(filepath.Join(in, "MANIFEST.json")) + if ferr != nil { + t.Fatalf("create MANIFEST.json: %v", ferr) + } + if werr := backup.WriteManifest(f, m); werr != nil { + t.Fatalf("WriteManifest: %v", werr) + } + if cerr := f.Close(); cerr != nil { + t.Fatalf("close: %v", cerr) + } + // Operator forgot to restrict --adapter; encoder must NOT silently + // pick up the (absent) dynamodb/s3/redis subtrees just because + // the CLI default enables them. + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{"--input", in, "--output", out}, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want adapter-scope rejection") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d (data-correctness)", code, exitDataErr) + } + if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { + t.Errorf(".fsm should not be published on adapter-scope rejection") + } +} + +// TestCLIRejectsAdapterListedButSubdirMissing pins codex P2 v26 #904 +// scenario A: the manifest lists an adapter (non-nil scope) but the +// matching on-disk subdir is missing. The per-adapter encoder would +// no-op silently and publish a partial/header-only .fsm; the new +// guard rejects up front with exit 2. +func TestCLIRejectsAdapterListedButSubdirMissing(t *testing.T) { + t.Parallel() + in := t.TempDir() + // Manifest lists SQS only; do NOT create sqs/ on disk. + m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) + m.LastCommitTS = 100 + m.Adapters = &backup.Adapters{SQS: &backup.Adapter{Queues: []string{"q1"}}} + m.Exclusions = &backup.Exclusions{} + f, ferr := os.Create(filepath.Join(in, "MANIFEST.json")) + if ferr != nil { + t.Fatalf("create MANIFEST.json: %v", ferr) + } + if werr := backup.WriteManifest(f, m); werr != nil { + t.Fatalf("WriteManifest: %v", werr) + } + if cerr := f.Close(); cerr != nil { + t.Fatalf("close: %v", cerr) + } + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{"--input", in, "--output", out, "--adapter", "sqs"}, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want truncated-dump rejection") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d (data-correctness)", code, exitDataErr) + } + if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { + t.Errorf(".fsm should not be published on adapter-scope rejection") + } +} + // TestCLIRejectsLowerLastCommitTSOverride is the fail-closed pin per // parent §"MVCC re-encoding": T < manifest.last_commit_ts → exit 2 // (data-correctness failure, not flag-parse error). From bbfb4d35cbf49ffc9ffe9609713a72f74ba51d37 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 22:18:27 +0900 Subject: [PATCH 28/35] backup: #904 v28 - allow absent subdirs for empty adapter scopes (codex P2 v27) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Codex P2 v27: v27's check rejects legitimate empty adapter dumps v27's validateAdaptersAgainstManifest required EVERY enabled adapter to have both a non-nil manifest scope AND an existing on-disk subdir. That's correct for adapters with concrete tables / buckets / databases / queues, but rejects a legitimate case: When the producer dumped a snapshot with adapter X enabled but X had no records, populateAdapterScopes writes a non-nil empty Adapter{} into manifest.Adapters.X. The decoder's per-adapter finalizers do NOT create the top-level subdir in that case (no records to flush → no directory creation). The encoder must accept this shape — a Redis-only dump re-encoded with the default --adapter set (which enables all four) would otherwise be rejected on the dynamodb/s3/sqs adapters even though the manifest correctly declares they're empty. ## Fix Two-tier scope check in checkOneAdapterScope: - scope == nil → fail (producer did not dump this adapter). - scope is empty Adapter{} (no entries in any of Tables / Buckets / Databases / Queues) → on-disk subdir is optional; pass. - scope has concrete entries → on-disk subdir required. New isEmptyAdapterScope helper detects the empty case; the error message on the non-empty path now says "with non-empty scope" so operators can tell which check fired. ## Pinned by TestCLIAcceptsEmptyAdapterScopeWithoutSubdir (new): manifest lists all four adapters with empty Adapter{} scopes, no on-disk subdirs, default --adapter (enables all four). Asserts exit 0 and a header-only .fsm published. Both v26 tests continue to pass for the right reason: - TestCLIRejectsAdapterNotInManifest: still trips on the manifest.Adapters. == nil case (DynamoDB / S3 / Redis nil with default --adapter all). - TestCLIRejectsAdapterListedButSubdirMissing: uses Queues:["q1"] (non-empty scope), so the new non-empty branch fires the subdir stat → fail. ## Caller audit per CLAUDE.md semantic-change rule - checkOneAdapterScope: gained a new pass-through branch for empty scopes. Existing nil-scope and non-empty-scope-with-subdir paths preserved. Sole caller is validateAdaptersAgainstManifest; no external callers. - isEmptyAdapterScope: new pure helper. No callers outside checkOneAdapterScope. - validateAdaptersAgainstManifest / errAdapterNotInManifest / classifyEncodeError: unchanged. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 32 +++++++++++++-- cmd/elastickv-snapshot-encode/main_test.go | 45 ++++++++++++++++++++++ 2 files changed, 74 insertions(+), 3 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index cfc48db57..e5c589f3d 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -198,8 +198,17 @@ func manifestAdapterField(a *backup.Adapters, name string) *backup.Adapter { // checkOneAdapterScope is the per-adapter half of // validateAdaptersAgainstManifest. Selected adapters MUST have a -// manifest scope AND a present on-disk subdir; not-selected adapters -// are unchecked (the operator chose to skip them). +// manifest scope AND, when that scope is non-empty, a present on-disk +// subdir; not-selected adapters are unchecked (the operator chose to +// skip them). +// +// An empty Adapter{} (non-nil but no tables / buckets / databases / +// queues) means "the producer ran with this adapter enabled but the +// adapter had no records to dump." The decoder finalizers do not +// create the top-level subdir in that case, so requiring on-disk +// presence would reject valid Redis-only / header-only / empty dumps +// (codex P2 v27 #904). Empty scope → subdir is optional; only the +// scope=nil case still fails closed. func checkOneAdapterScope(name string, selected bool, scope *backup.Adapter, subdirPath string) error { if !selected { return nil @@ -209,10 +218,15 @@ func checkOneAdapterScope(name string, selected bool, scope *backup.Adapter, sub "adapter %q selected but MANIFEST.json has no scope for it (use --adapter to restrict, or re-dump including this adapter)", name) } + if isEmptyAdapterScope(scope) { + // Producer dumped this adapter with no records; the absence + // of the on-disk subdir is valid (codex P2 v27 #904). + return nil + } info, err := os.Stat(subdirPath) if err != nil { return errors.Wrapf(errAdapterNotInManifest, - "adapter %q listed in MANIFEST.json but on-disk subdir %s is missing (stat: %v)", + "adapter %q listed in MANIFEST.json with non-empty scope but on-disk subdir %s is missing (stat: %v)", name, subdirPath, err) } if !info.IsDir() { @@ -223,6 +237,18 @@ func checkOneAdapterScope(name string, selected bool, scope *backup.Adapter, sub return nil } +// isEmptyAdapterScope reports whether scope has no concrete +// tables / buckets / databases / queues. A non-nil but empty +// Adapter{} is the "producer dumped this adapter, no records" +// signal — distinct from the nil pointer that means "producer did +// not enable this adapter at all" (codex P2 v27 #904). +func isEmptyAdapterScope(scope *backup.Adapter) bool { + return len(scope.Tables) == 0 && + len(scope.Buckets) == 0 && + len(scope.Databases) == 0 && + len(scope.Queues) == 0 +} + func parseFlags(argv []string) (*config, error) { fs := flag.NewFlagSet("elastickv-snapshot-encode", flag.ContinueOnError) fs.SetOutput(io.Discard) diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index f28deae42..8f230d3db 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -247,6 +247,51 @@ func TestCLIRejectsAdapterNotInManifest(t *testing.T) { } } +// TestCLIAcceptsEmptyAdapterScopeWithoutSubdir pins codex P2 v27 +// #904: when the producer ran with an adapter enabled but the +// adapter had no records, populateAdapterScopes writes a non-nil +// empty Adapter{} into the manifest but the decoder does NOT +// create the top-level subdir. The encoder must accept this shape +// — requiring the subdir would reject valid Redis-only or +// header-only dumps re-encoded with the default --adapter set. +func TestCLIAcceptsEmptyAdapterScopeWithoutSubdir(t *testing.T) { + t.Parallel() + in := t.TempDir() + // Manifest lists every adapter with an EMPTY scope (no tables / + // buckets / databases / queues) and no on-disk subdirs. + m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) + m.LastCommitTS = 100 + m.Adapters = &backup.Adapters{ + DynamoDB: &backup.Adapter{}, + S3: &backup.Adapter{}, + Redis: &backup.Adapter{}, + SQS: &backup.Adapter{}, + } + m.Exclusions = &backup.Exclusions{} + f, ferr := os.Create(filepath.Join(in, "MANIFEST.json")) + if ferr != nil { + t.Fatalf("create MANIFEST.json: %v", ferr) + } + if werr := backup.WriteManifest(f, m); werr != nil { + t.Fatalf("WriteManifest: %v", werr) + } + if cerr := f.Close(); cerr != nil { + t.Fatalf("close: %v", cerr) + } + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{"--input", in, "--output", out}, quietLogger()) + if err != nil { + t.Fatalf("run failed: code=%d err=%v (codex P2 v27 regression — empty scope should not require subdir)", code, err) + } + if code != exitSuccess { + t.Errorf("exit code = %d, want %d", code, exitSuccess) + } + // Header-only .fsm should be published. + if _, statErr := os.Stat(out); statErr != nil { + t.Errorf(".fsm not published at %s: %v", out, statErr) + } +} + // TestCLIRejectsAdapterListedButSubdirMissing pins codex P2 v26 #904 // scenario A: the manifest lists an adapter (non-nil scope) but the // matching on-disk subdir is missing. The per-adapter encoder would From eeb2554254fd7f8de1efaea7a2ec4ebb462e69f7 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 22:29:25 +0900 Subject: [PATCH 29/35] backup: #904 v29 - reject stale subdirs for empty adapter scopes (codex P2 v28) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Codex P2 v28: v28 fix had an inverse asymmetry v28 made empty-scope adapters tolerant of missing on-disk subdirs (legitimate "producer dumped this adapter with no records" case). That introduced the inverse gap: empty scope + stale populated subdir was now accepted, so a partially-cleaned input directory where MANIFEST.json says `s3: {}` but `s3/` still has bucket artifacts from a prior decode would publish those stale records that the manifest doesn't declare. ## Fix Two-level scope decision in checkOneAdapterScope (new helper checkSubdirAbsentOrEmpty handles the empty-scope branch): - scope == nil → fail - empty scope, subdir absent or empty → pass - empty scope, subdir non-empty (≥1 entry) → fail (NEW, codex v28) - non-empty scope, subdir missing or not-a-dir → fail - non-empty scope, subdir is a directory → pass The empty-scope-non-empty-subdir error message names the count of stray entries and suggests "clean the directory or re-dump" so the operator's recovery path is obvious. ## Pinned by TestCLIRejectsEmptyScopeWithStaleSubdir (new): manifest declares all four adapters with empty scopes, but a stale `sqs/stale-queue/` exists on disk. CLI run with `--adapter sqs` → exit 2, no .fsm published. All three prior adapter-scope tests still pass for the right reason: - TestCLIAcceptsEmptyAdapterScopeWithoutSubdir (v28): empty scope, no subdirs → exit 0, header-only .fsm. - TestCLIRejectsAdapterNotInManifest (v27): nil scope for unlisted adapter → exit 2. - TestCLIRejectsAdapterListedButSubdirMissing (v27): non-empty scope, subdir missing → exit 2. ## emitMinimalManifest fixture update The pre-v29 emitMinimalManifest set every adapter scope to an empty Adapter{} regardless of on-disk state, so writeSQSFixture + emitMinimalManifest would now produce an empty SQS scope + populated `sqs/` subdir — exactly the v28 case that fails. emitMinimalManifest now scans each adapter subdir; when the subdir has any entries it stamps a placeholder scope entry of the right kind (Tables / Buckets / Databases / Queues) via the new scopeForPopulatedSubdir helper. This mirrors what the decoder's populateAdapterScopes would produce for a real round-trip — the encoder doesn't consult scope content (it walks the on-disk tree); the placeholder just keeps v29's empty-scope-vs-non-empty-subdir guard happy. ## Caller audit per CLAUDE.md semantic-change rule - checkOneAdapterScope: empty-scope branch delegates to checkSubdirAbsentOrEmpty. Existing nil-scope and non-empty-scope branches preserved. Sole caller is validateAdaptersAgainstManifest. - checkSubdirAbsentOrEmpty: new helper; no callers outside checkOneAdapterScope. - emitMinimalManifest (test fixture): scope-population logic is now data-driven. All 18 existing tests pass because the helper produces the same shape the encoder/decoder canonical round-trip would produce. The two new adapter-scope tests (v28 + v29) bypass the helper to construct specific scenarios. - scopeForPopulatedSubdir (test helper): new; only callers are emitMinimalManifest. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 59 +++++++++--- cmd/elastickv-snapshot-encode/main_test.go | 103 ++++++++++++++++++--- 2 files changed, 135 insertions(+), 27 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index e5c589f3d..b44314d95 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -198,17 +198,20 @@ func manifestAdapterField(a *backup.Adapters, name string) *backup.Adapter { // checkOneAdapterScope is the per-adapter half of // validateAdaptersAgainstManifest. Selected adapters MUST have a -// manifest scope AND, when that scope is non-empty, a present on-disk -// subdir; not-selected adapters are unchecked (the operator chose to -// skip them). +// manifest scope; the on-disk subdir requirements depend on whether +// the scope is empty or has concrete entries. Not-selected adapters +// are unchecked (the operator chose to skip them). // -// An empty Adapter{} (non-nil but no tables / buckets / databases / -// queues) means "the producer ran with this adapter enabled but the -// adapter had no records to dump." The decoder finalizers do not -// create the top-level subdir in that case, so requiring on-disk -// presence would reject valid Redis-only / header-only / empty dumps -// (codex P2 v27 #904). Empty scope → subdir is optional; only the -// scope=nil case still fails closed. +// Decision matrix (codex P2 v26 + v27 + v28 #904): +// +// - scope == nil → fail (producer did not dump this adapter). +// - empty scope, subdir absent or empty → pass (no records in dump). +// - empty scope, subdir non-empty → fail (stale data on disk +// not declared in manifest — input directory was reused or +// partially cleaned). +// - non-empty scope, subdir missing or not-a-dir → fail (truncated +// dump). +// - non-empty scope, subdir is a directory → pass. func checkOneAdapterScope(name string, selected bool, scope *backup.Adapter, subdirPath string) error { if !selected { return nil @@ -219,9 +222,7 @@ func checkOneAdapterScope(name string, selected bool, scope *backup.Adapter, sub name) } if isEmptyAdapterScope(scope) { - // Producer dumped this adapter with no records; the absence - // of the on-disk subdir is valid (codex P2 v27 #904). - return nil + return checkSubdirAbsentOrEmpty(name, subdirPath) } info, err := os.Stat(subdirPath) if err != nil { @@ -237,6 +238,38 @@ func checkOneAdapterScope(name string, selected bool, scope *backup.Adapter, sub return nil } +// checkSubdirAbsentOrEmpty fails closed when an empty-scope adapter +// has a non-empty on-disk subdir. The empty scope says the producer +// dumped no records; a populated subdir would otherwise be encoded by +// cfg.adapters (default `--adapter all`) and silently publish stale +// data the manifest doesn't claim (codex P2 v28 #904 — the inverse of +// v27's truncated-dump scenario). +func checkSubdirAbsentOrEmpty(name, subdirPath string) error { + info, err := os.Stat(subdirPath) + if err != nil { + if os.IsNotExist(err) { + return nil + } + return errors.Wrapf(errAdapterNotInManifest, + "adapter %q: stat %s: %v", name, subdirPath, err) + } + if !info.IsDir() { + return errors.Wrapf(errAdapterNotInManifest, + "adapter %q: %s is not a directory (mode=%s)", name, subdirPath, info.Mode()) + } + entries, err := os.ReadDir(subdirPath) + if err != nil { + return errors.Wrapf(errAdapterNotInManifest, + "adapter %q: readdir %s: %v", name, subdirPath, err) + } + if len(entries) != 0 { + return errors.Wrapf(errAdapterNotInManifest, + "adapter %q has empty scope in MANIFEST.json but %s contains %d entries (stale dump artifacts; clean the directory or re-dump)", + name, subdirPath, len(entries)) + } + return nil +} + // isEmptyAdapterScope reports whether scope has no concrete // tables / buckets / databases / queues. A non-nil but empty // Adapter{} is the "producer dumped this adapter, no records" diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index 8f230d3db..9e67287bb 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -23,26 +23,28 @@ var isWindows = runtime.GOOS == "windows" // with the given lastCommitTS. Used by every CLI test as the producer- // side artifact the encoder will consume. // -// Populates manifest.Adapters with ALL FOUR adapter scopes (each as -// a non-nil &Adapter{} so the codex P2 v26 #904 validation -// — validateAdaptersAgainstManifest — sees a manifest that claims -// every adapter was dumped. Also creates an empty subdir for each -// adapter under outRoot so the same validation's on-disk-stat guard -// passes (the encoder treats an empty subdir as no-op, matching the -// "no records to encode" semantics). +// Populates manifest.Adapters with a non-nil &Adapter{} per adapter, +// matching what the decoder's populateAdapterScopes would emit. When +// the corresponding on-disk subdir is already populated (e.g. a +// writeSQSFixture call ran before emitMinimalManifest), this helper +// scans the subdir and stamps a placeholder scope entry of the right +// kind (Tables / Buckets / Databases / Queues) so v29's adapter-scope +// validation sees a manifest that matches the on-disk shape (codex +// P2 v28 #904 — empty scope + non-empty subdir was the bug). // -// Tests that need to assert rejection on specific adapter-scope -// scenarios (e.g. manifest missing one adapter) can mutate the -// manifest or filesystem after this returns. +// Then creates empty subdirs for any adapter not already populated so +// validateAdaptersAgainstManifest's empty-scope-empty-subdir branch +// passes. Tests that need a specific adapter-scope mismatch bypass +// this helper and write a custom manifest. func emitMinimalManifest(t *testing.T, outRoot string, lastCommitTS uint64) { t.Helper() m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) m.LastCommitTS = lastCommitTS m.Adapters = &backup.Adapters{ - DynamoDB: &backup.Adapter{}, - S3: &backup.Adapter{}, - Redis: &backup.Adapter{}, - SQS: &backup.Adapter{}, + DynamoDB: scopeForPopulatedSubdir(t, outRoot, "dynamodb", "tables"), + S3: scopeForPopulatedSubdir(t, outRoot, "s3", "buckets"), + Redis: scopeForPopulatedSubdir(t, outRoot, "redis", "databases"), + SQS: scopeForPopulatedSubdir(t, outRoot, "sqs", "queues"), } m.Exclusions = &backup.Exclusions{} f, err := os.Create(filepath.Join(outRoot, "MANIFEST.json")) @@ -62,6 +64,31 @@ func emitMinimalManifest(t *testing.T, outRoot string, lastCommitTS uint64) { } } +// scopeForPopulatedSubdir returns an empty Adapter{} when the named +// subdir under outRoot is absent or empty, and an Adapter{} with a +// single placeholder entry of the appropriate kind when the subdir +// already has fixture content. The encoder doesn't consult the scope +// content (it walks the on-disk subtree); the placeholder just keeps +// v29's empty-scope-vs-non-empty-subdir guard happy. +func scopeForPopulatedSubdir(t *testing.T, outRoot, sub, scopeKind string) *backup.Adapter { + t.Helper() + entries, err := os.ReadDir(filepath.Join(outRoot, sub)) + if err != nil || len(entries) == 0 { + return &backup.Adapter{} + } + switch scopeKind { + case "tables": + return &backup.Adapter{Tables: []string{"placeholder"}} + case "buckets": + return &backup.Adapter{Buckets: []string{"placeholder"}} + case "databases": + return &backup.Adapter{Databases: []uint32{0}} + case "queues": + return &backup.Adapter{Queues: []string{"placeholder"}} + } + return &backup.Adapter{} +} + func quietLogger() *slog.Logger { return slog.New(slog.NewTextHandler(io.Discard, nil)) } @@ -292,6 +319,54 @@ func TestCLIAcceptsEmptyAdapterScopeWithoutSubdir(t *testing.T) { } } +// TestCLIRejectsEmptyScopeWithStaleSubdir pins codex P2 v28 #904: the +// inverse of v27. When MANIFEST.json says an adapter has an empty +// scope (Adapter{}, no records dumped) but a stale on-disk subdir is +// still populated (e.g., the input directory was reused or only +// partially cleaned), the encoder would otherwise pick up the stale +// subtree from cfg.adapters and publish records the manifest does +// NOT declare. Fail closed. +func TestCLIRejectsEmptyScopeWithStaleSubdir(t *testing.T) { + t.Parallel() + in := t.TempDir() + // Manifest declares SQS with EMPTY scope, but write a stale + // sqs/ subdir with content. + m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) + m.LastCommitTS = 100 + m.Adapters = &backup.Adapters{ + DynamoDB: &backup.Adapter{}, + S3: &backup.Adapter{}, + Redis: &backup.Adapter{}, + SQS: &backup.Adapter{}, // declared empty + } + m.Exclusions = &backup.Exclusions{} + f, ferr := os.Create(filepath.Join(in, "MANIFEST.json")) + if ferr != nil { + t.Fatalf("create MANIFEST.json: %v", ferr) + } + if werr := backup.WriteManifest(f, m); werr != nil { + t.Fatalf("WriteManifest: %v", werr) + } + if cerr := f.Close(); cerr != nil { + t.Fatalf("close: %v", cerr) + } + // Stale sqs/ tree from a prior decode the operator didn't clean. + if err := os.MkdirAll(filepath.Join(in, "sqs", "stale-queue"), 0o755); err != nil { + t.Fatalf("MkdirAll stale sqs subdir: %v", err) + } + out := filepath.Join(t.TempDir(), "out.fsm") + code, err := run([]string{"--input", in, "--output", out, "--adapter", "sqs"}, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want stale-subdir rejection") + } + if code != exitDataErr { + t.Errorf("exit code = %d, want %d (data-correctness)", code, exitDataErr) + } + if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { + t.Errorf(".fsm should not be published on stale-subdir rejection") + } +} + // TestCLIRejectsAdapterListedButSubdirMissing pins codex P2 v26 #904 // scenario A: the manifest lists an adapter (non-nil scope) but the // matching on-disk subdir is missing. The per-adapter encoder would From 38f001e42300df19e694f8f0aa29d06b44db1849 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 22:40:16 +0900 Subject: [PATCH 30/35] backup: #904 v30 - retract v28/v29 scope-content heuristics (codex P1 v29) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Codex P1 v29: deferred scope enumeration breaks v28/v29 elastickv-snapshot-decode's populateAdapterScopes explicitly DEFERS scope enumeration: it always writes a non-nil &Adapter{} (all scope arrays empty) for every enabled adapter, regardless of whether the adapter actually emitted files. So the manifest's scope CONTENT cannot distinguish: a. "this adapter had no records" (legitimate empty case) b. "this adapter had records but the decoder didn't enumerate them" (the actual common case for every real round-trip) v27 added a check that required non-empty-scope adapters to have present subdirs — inert in practice because the decoder never writes non-empty scopes. v28 made empty scopes tolerate missing subdirs — correct for case (a) but accidentally enabled case (b) without the subdir-stat doing anything useful. v29 added an empty-scope + non-empty-subdir = fail rule — which REJECTS every real round-trip in case (b). The v28/v29 tests passed only because emitMinimalManifest fabricated placeholder scope entries (Tables: ["placeholder"], etc.) that the real decoder does NOT write. Codex called this out as a P1 — the heuristic was unsound and broke production round-trips. ## Fix checkOneAdapterScope reverts to v27's pure nil-pointer check: - scope == nil → fail (producer did NOT enable this adapter; any on-disk subdir is stale and would otherwise be encoded under --adapter all). This is codex P2 v26 scenario B and is the ONLY sound check the manifest alone can support. - scope != nil → pass (producer enabled the adapter; trust the contract regardless of subdir presence/contents). Scenario A (truncated dump where the on-disk subdir was lost but scope was non-nil) needs SHA / record-count verification at the producer side — the manifest's deferred scope can't surface it. Documented as future work in the function godoc. Removed: - isEmptyAdapterScope (helper) - checkSubdirAbsentOrEmpty (helper) - subdir-stat / readdir branches in checkOneAdapterScope - TestCLIAcceptsEmptyAdapterScopeWithoutSubdir (v28, no longer meaningful — empty scope always passes) - TestCLIRejectsEmptyScopeWithStaleSubdir (v29, was rejecting legitimate real-decode round-trips) - TestCLIRejectsAdapterListedButSubdirMissing (v27 scenario A test, used a fabricated non-empty manifest scope that the real decoder never produces — no longer reachable from real round-trips) - scopeForPopulatedSubdir test helper (was fabricating placeholders the real decoder doesn't write) emitMinimalManifest reverts to the realistic decoder shape: a non-nil &Adapter{} per enabled adapter, all scope arrays empty, no fabricated placeholders. No on-disk subdir creation — checkOneAdapterScope no longer reads the subdirs. Added: - TestCLIAcceptsEmptyAdapterScopeRoundTrip (replaces v28/v29 tests): realistic decoder output — empty &Adapter{} scope + populated sqs/ subdir (via writeSQSFixture) — must encode successfully. This is the case v29 was rejecting. ## Caller audit per CLAUDE.md semantic-change rule - checkOneAdapterScope: removed empty-scope and non-empty-scope branches; only nil-scope rejection remains. Sole caller is validateAdaptersAgainstManifest. Strictly less strict than v29, so every input that passed v29 still passes v30; some inputs v29 rejected (real round-trips) now pass. - subdirPath parameter kept in the signature (unused, with `_` blank) to avoid churning the call site for a future SHA-based check. - emitMinimalManifest: no longer creates on-disk subdirs. All existing CLI tests pass because they either don't depend on pre-existing subdirs or create them via their own fixture helpers (writeSQSFixture, etc.). - TestCLIRejectsAdapterNotInManifest (v27 scenario B): still passes; this was the v27 check that v30 keeps. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 97 +++-------- cmd/elastickv-snapshot-encode/main_test.go | 193 ++++----------------- 2 files changed, 56 insertions(+), 234 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index b44314d95..127bbf9fc 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -197,22 +197,33 @@ func manifestAdapterField(a *backup.Adapters, name string) *backup.Adapter { } // checkOneAdapterScope is the per-adapter half of -// validateAdaptersAgainstManifest. Selected adapters MUST have a -// manifest scope; the on-disk subdir requirements depend on whether -// the scope is empty or has concrete entries. Not-selected adapters -// are unchecked (the operator chose to skip them). +// validateAdaptersAgainstManifest. The decoder's populateAdapterScopes +// explicitly defers scope enumeration and writes `&Adapter{}` (empty) +// for every enabled adapter regardless of whether records were +// dumped, so the manifest's scope CONTENT cannot distinguish +// "this adapter had no records" from "this adapter had records but +// the decoder didn't enumerate them" (codex P1 v29 #904 corrected +// v27/v28/v29's over-eager subdir stat-and-readdir checks). // -// Decision matrix (codex P2 v26 + v27 + v28 #904): +// The only sound nil/non-nil signal is the per-adapter POINTER +// (`manifest.Adapters.`): // -// - scope == nil → fail (producer did not dump this adapter). -// - empty scope, subdir absent or empty → pass (no records in dump). -// - empty scope, subdir non-empty → fail (stale data on disk -// not declared in manifest — input directory was reused or -// partially cleaned). -// - non-empty scope, subdir missing or not-a-dir → fail (truncated -// dump). -// - non-empty scope, subdir is a directory → pass. -func checkOneAdapterScope(name string, selected bool, scope *backup.Adapter, subdirPath string) error { +// - scope == nil → fail (producer did NOT enable this adapter; any +// on-disk subdir is stale and would otherwise be encoded under +// the default `--adapter all`, codex P2 v26 #904 scenario B). +// - scope != nil → pass (producer enabled the adapter; trust the +// manifest contract regardless of the on-disk subdir's +// presence/contents). +// +// Detecting truncated dumps (codex P2 v26 #904 scenario A: scope +// non-nil but on-disk subdir lost) needs SHA / record-count +// verification at the producer side; the manifest alone cannot +// surface it. Tracked as future work. +// +// subdirPath is intentionally unused now but kept in the signature so +// a future check that pairs the manifest with a SHA index doesn't +// need a call-site refactor. +func checkOneAdapterScope(name string, selected bool, scope *backup.Adapter, _ string) error { if !selected { return nil } @@ -221,67 +232,9 @@ func checkOneAdapterScope(name string, selected bool, scope *backup.Adapter, sub "adapter %q selected but MANIFEST.json has no scope for it (use --adapter to restrict, or re-dump including this adapter)", name) } - if isEmptyAdapterScope(scope) { - return checkSubdirAbsentOrEmpty(name, subdirPath) - } - info, err := os.Stat(subdirPath) - if err != nil { - return errors.Wrapf(errAdapterNotInManifest, - "adapter %q listed in MANIFEST.json with non-empty scope but on-disk subdir %s is missing (stat: %v)", - name, subdirPath, err) - } - if !info.IsDir() { - return errors.Wrapf(errAdapterNotInManifest, - "adapter %q listed in MANIFEST.json but %s is not a directory (mode=%s)", - name, subdirPath, info.Mode()) - } - return nil -} - -// checkSubdirAbsentOrEmpty fails closed when an empty-scope adapter -// has a non-empty on-disk subdir. The empty scope says the producer -// dumped no records; a populated subdir would otherwise be encoded by -// cfg.adapters (default `--adapter all`) and silently publish stale -// data the manifest doesn't claim (codex P2 v28 #904 — the inverse of -// v27's truncated-dump scenario). -func checkSubdirAbsentOrEmpty(name, subdirPath string) error { - info, err := os.Stat(subdirPath) - if err != nil { - if os.IsNotExist(err) { - return nil - } - return errors.Wrapf(errAdapterNotInManifest, - "adapter %q: stat %s: %v", name, subdirPath, err) - } - if !info.IsDir() { - return errors.Wrapf(errAdapterNotInManifest, - "adapter %q: %s is not a directory (mode=%s)", name, subdirPath, info.Mode()) - } - entries, err := os.ReadDir(subdirPath) - if err != nil { - return errors.Wrapf(errAdapterNotInManifest, - "adapter %q: readdir %s: %v", name, subdirPath, err) - } - if len(entries) != 0 { - return errors.Wrapf(errAdapterNotInManifest, - "adapter %q has empty scope in MANIFEST.json but %s contains %d entries (stale dump artifacts; clean the directory or re-dump)", - name, subdirPath, len(entries)) - } return nil } -// isEmptyAdapterScope reports whether scope has no concrete -// tables / buckets / databases / queues. A non-nil but empty -// Adapter{} is the "producer dumped this adapter, no records" -// signal — distinct from the nil pointer that means "producer did -// not enable this adapter at all" (codex P2 v27 #904). -func isEmptyAdapterScope(scope *backup.Adapter) bool { - return len(scope.Tables) == 0 && - len(scope.Buckets) == 0 && - len(scope.Databases) == 0 && - len(scope.Queues) == 0 -} - func parseFlags(argv []string) (*config, error) { fs := flag.NewFlagSet("elastickv-snapshot-encode", flag.ContinueOnError) fs.SetOutput(io.Discard) diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index 9e67287bb..3271ff1d6 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -23,28 +23,26 @@ var isWindows = runtime.GOOS == "windows" // with the given lastCommitTS. Used by every CLI test as the producer- // side artifact the encoder will consume. // -// Populates manifest.Adapters with a non-nil &Adapter{} per adapter, -// matching what the decoder's populateAdapterScopes would emit. When -// the corresponding on-disk subdir is already populated (e.g. a -// writeSQSFixture call ran before emitMinimalManifest), this helper -// scans the subdir and stamps a placeholder scope entry of the right -// kind (Tables / Buckets / Databases / Queues) so v29's adapter-scope -// validation sees a manifest that matches the on-disk shape (codex -// P2 v28 #904 — empty scope + non-empty subdir was the bug). +// Mirrors what elastickv-snapshot-decode's populateAdapterScopes +// produces: a non-nil &Adapter{} per enabled adapter, with all scope +// arrays empty (scope enumeration is explicitly deferred — codex P1 +// v29 #904 corrected the earlier placeholder-scope fixture that +// fabricated entries the real decoder never writes). The v30 +// checkOneAdapterScope reads only the per-adapter pointer, never the +// scope content, so the empty &Adapter{} is sufficient. // -// Then creates empty subdirs for any adapter not already populated so -// validateAdaptersAgainstManifest's empty-scope-empty-subdir branch -// passes. Tests that need a specific adapter-scope mismatch bypass -// this helper and write a custom manifest. +// Tests that need a specific adapter-scope mismatch (e.g. a nil +// pointer for one adapter to exercise the v27 nil-scope rejection) +// bypass this helper and write a custom manifest. func emitMinimalManifest(t *testing.T, outRoot string, lastCommitTS uint64) { t.Helper() m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) m.LastCommitTS = lastCommitTS m.Adapters = &backup.Adapters{ - DynamoDB: scopeForPopulatedSubdir(t, outRoot, "dynamodb", "tables"), - S3: scopeForPopulatedSubdir(t, outRoot, "s3", "buckets"), - Redis: scopeForPopulatedSubdir(t, outRoot, "redis", "databases"), - SQS: scopeForPopulatedSubdir(t, outRoot, "sqs", "queues"), + DynamoDB: &backup.Adapter{}, + S3: &backup.Adapter{}, + Redis: &backup.Adapter{}, + SQS: &backup.Adapter{}, } m.Exclusions = &backup.Exclusions{} f, err := os.Create(filepath.Join(outRoot, "MANIFEST.json")) @@ -57,36 +55,6 @@ func emitMinimalManifest(t *testing.T, outRoot string, lastCommitTS uint64) { if err := f.Close(); err != nil { t.Fatalf("close: %v", err) } - for _, sub := range []string{"dynamodb", "s3", "redis", "sqs"} { - if mkErr := os.MkdirAll(filepath.Join(outRoot, sub), 0o755); mkErr != nil { - t.Fatalf("MkdirAll %s: %v", sub, mkErr) - } - } -} - -// scopeForPopulatedSubdir returns an empty Adapter{} when the named -// subdir under outRoot is absent or empty, and an Adapter{} with a -// single placeholder entry of the appropriate kind when the subdir -// already has fixture content. The encoder doesn't consult the scope -// content (it walks the on-disk subtree); the placeholder just keeps -// v29's empty-scope-vs-non-empty-subdir guard happy. -func scopeForPopulatedSubdir(t *testing.T, outRoot, sub, scopeKind string) *backup.Adapter { - t.Helper() - entries, err := os.ReadDir(filepath.Join(outRoot, sub)) - if err != nil || len(entries) == 0 { - return &backup.Adapter{} - } - switch scopeKind { - case "tables": - return &backup.Adapter{Tables: []string{"placeholder"}} - case "buckets": - return &backup.Adapter{Buckets: []string{"placeholder"}} - case "databases": - return &backup.Adapter{Databases: []uint32{0}} - case "queues": - return &backup.Adapter{Queues: []string{"placeholder"}} - } - return &backup.Adapter{} } func quietLogger() *slog.Logger { @@ -274,135 +242,36 @@ func TestCLIRejectsAdapterNotInManifest(t *testing.T) { } } -// TestCLIAcceptsEmptyAdapterScopeWithoutSubdir pins codex P2 v27 -// #904: when the producer ran with an adapter enabled but the -// adapter had no records, populateAdapterScopes writes a non-nil -// empty Adapter{} into the manifest but the decoder does NOT -// create the top-level subdir. The encoder must accept this shape -// — requiring the subdir would reject valid Redis-only or -// header-only dumps re-encoded with the default --adapter set. -func TestCLIAcceptsEmptyAdapterScopeWithoutSubdir(t *testing.T) { +// TestCLIAcceptsEmptyAdapterScopeRoundTrip pins codex P1 v29 #904 +// reality: elastickv-snapshot-decode's populateAdapterScopes always +// writes an empty &Adapter{} for every enabled adapter (scope +// enumeration is deferred). A real decoded dump with files under +// dynamodb/ / s3/ / redis/ / sqs/ MUST be re-encodable through the +// CLI even though the manifest's per-adapter scope arrays are empty +// — v28/v29's stat-and-readdir checks were rejecting these. +func TestCLIAcceptsEmptyAdapterScopeRoundTrip(t *testing.T) { t.Parallel() in := t.TempDir() - // Manifest lists every adapter with an EMPTY scope (no tables / - // buckets / databases / queues) and no on-disk subdirs. - m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) - m.LastCommitTS = 100 - m.Adapters = &backup.Adapters{ - DynamoDB: &backup.Adapter{}, - S3: &backup.Adapter{}, - Redis: &backup.Adapter{}, - SQS: &backup.Adapter{}, - } - m.Exclusions = &backup.Exclusions{} - f, ferr := os.Create(filepath.Join(in, "MANIFEST.json")) - if ferr != nil { - t.Fatalf("create MANIFEST.json: %v", ferr) - } - if werr := backup.WriteManifest(f, m); werr != nil { - t.Fatalf("WriteManifest: %v", werr) - } - if cerr := f.Close(); cerr != nil { - t.Fatalf("close: %v", cerr) - } + // Pre-populate sqs/ with a real fixture (simulates a normal + // decoder run that emitted SQS records). + writeSQSFixture(t, in) + // Manifest mirrors the decoder: every adapter scope is an empty + // &Adapter{} regardless of on-disk content. + emitMinimalManifest(t, in, 100) + out := filepath.Join(t.TempDir(), "out.fsm") - code, err := run([]string{"--input", in, "--output", out}, quietLogger()) + code, err := run([]string{"--input", in, "--output", out, "--adapter", "sqs"}, quietLogger()) if err != nil { - t.Fatalf("run failed: code=%d err=%v (codex P2 v27 regression — empty scope should not require subdir)", code, err) + t.Fatalf("run failed: code=%d err=%v (codex P1 v29 regression — empty scope must coexist with populated subdir)", code, err) } if code != exitSuccess { t.Errorf("exit code = %d, want %d", code, exitSuccess) } - // Header-only .fsm should be published. if _, statErr := os.Stat(out); statErr != nil { t.Errorf(".fsm not published at %s: %v", out, statErr) } } -// TestCLIRejectsEmptyScopeWithStaleSubdir pins codex P2 v28 #904: the -// inverse of v27. When MANIFEST.json says an adapter has an empty -// scope (Adapter{}, no records dumped) but a stale on-disk subdir is -// still populated (e.g., the input directory was reused or only -// partially cleaned), the encoder would otherwise pick up the stale -// subtree from cfg.adapters and publish records the manifest does -// NOT declare. Fail closed. -func TestCLIRejectsEmptyScopeWithStaleSubdir(t *testing.T) { - t.Parallel() - in := t.TempDir() - // Manifest declares SQS with EMPTY scope, but write a stale - // sqs/ subdir with content. - m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) - m.LastCommitTS = 100 - m.Adapters = &backup.Adapters{ - DynamoDB: &backup.Adapter{}, - S3: &backup.Adapter{}, - Redis: &backup.Adapter{}, - SQS: &backup.Adapter{}, // declared empty - } - m.Exclusions = &backup.Exclusions{} - f, ferr := os.Create(filepath.Join(in, "MANIFEST.json")) - if ferr != nil { - t.Fatalf("create MANIFEST.json: %v", ferr) - } - if werr := backup.WriteManifest(f, m); werr != nil { - t.Fatalf("WriteManifest: %v", werr) - } - if cerr := f.Close(); cerr != nil { - t.Fatalf("close: %v", cerr) - } - // Stale sqs/ tree from a prior decode the operator didn't clean. - if err := os.MkdirAll(filepath.Join(in, "sqs", "stale-queue"), 0o755); err != nil { - t.Fatalf("MkdirAll stale sqs subdir: %v", err) - } - out := filepath.Join(t.TempDir(), "out.fsm") - code, err := run([]string{"--input", in, "--output", out, "--adapter", "sqs"}, quietLogger()) - if err == nil { - t.Fatalf("run succeeded; want stale-subdir rejection") - } - if code != exitDataErr { - t.Errorf("exit code = %d, want %d (data-correctness)", code, exitDataErr) - } - if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { - t.Errorf(".fsm should not be published on stale-subdir rejection") - } -} - -// TestCLIRejectsAdapterListedButSubdirMissing pins codex P2 v26 #904 -// scenario A: the manifest lists an adapter (non-nil scope) but the -// matching on-disk subdir is missing. The per-adapter encoder would -// no-op silently and publish a partial/header-only .fsm; the new -// guard rejects up front with exit 2. -func TestCLIRejectsAdapterListedButSubdirMissing(t *testing.T) { - t.Parallel() - in := t.TempDir() - // Manifest lists SQS only; do NOT create sqs/ on disk. - m := backup.NewPhase0SnapshotManifest(time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)) - m.LastCommitTS = 100 - m.Adapters = &backup.Adapters{SQS: &backup.Adapter{Queues: []string{"q1"}}} - m.Exclusions = &backup.Exclusions{} - f, ferr := os.Create(filepath.Join(in, "MANIFEST.json")) - if ferr != nil { - t.Fatalf("create MANIFEST.json: %v", ferr) - } - if werr := backup.WriteManifest(f, m); werr != nil { - t.Fatalf("WriteManifest: %v", werr) - } - if cerr := f.Close(); cerr != nil { - t.Fatalf("close: %v", cerr) - } - out := filepath.Join(t.TempDir(), "out.fsm") - code, err := run([]string{"--input", in, "--output", out, "--adapter", "sqs"}, quietLogger()) - if err == nil { - t.Fatalf("run succeeded; want truncated-dump rejection") - } - if code != exitDataErr { - t.Errorf("exit code = %d, want %d (data-correctness)", code, exitDataErr) - } - if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { - t.Errorf(".fsm should not be published on adapter-scope rejection") - } -} - // TestCLIRejectsLowerLastCommitTSOverride is the fail-closed pin per // parent §"MVCC re-encoding": T < manifest.last_commit_ts → exit 2 // (data-correctness failure, not flag-parse error). From a2ed548f4be738f1b6e8e4b2c6e7fce7c6ff1cc9 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 22:47:17 +0900 Subject: [PATCH 31/35] backup: #904 v31 - doc-only: align three godocs with v30's nil-only scope check MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Claude v30 flagged that v30's behavior change (retracting v27/v28/v29's subdir stat-and-readdir branches) left three docs out of sync: 1. validateAdaptersAgainstManifest godoc still said "stats each enabled adapter's top-level subdir" and listed two failure modes (scenarios A + B) — but v30 only enforces scenario B. 2. errAdapterNotInManifest var comment said the sentinel fires when "the manifest lists an adapter whose top-level subdir is missing on disk" (scenario A) — that path was removed in v30. 3. classifyEncodeError's "Sources of each sentinel" list omitted errAdapterNotInManifest entirely (the switch arm was added in v27 but the doc list was never updated — pre-existing v27 gap that claude v30 flagged as worth closing). Fix: - validateAdaptersAgainstManifest godoc rewritten: describes the single nil-pointer check, explicitly notes scenario A is deferred to a future SHA / record-count manifest field per the v29 retract. - errAdapterNotInManifest var comment trimmed to the nil-scope case only, with a forward reference to the v30 retract for context. - classifyEncodeError summary clause adds "adapter scope mismatch with manifest"; the Sources list gains the errAdapterNotInManifest bullet with the codex P2 v26 / P1 v29 citations. No behavior change. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 73 +++++++++++++++------------ 1 file changed, 42 insertions(+), 31 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 127bbf9fc..0ea5f2ee1 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -92,11 +92,11 @@ func run(argv []string, logger *slog.Logger) (int, error) { // classifyEncodeError maps the encodeOne return value to a CLI exit // code. Data-correctness sentinels (HLC ceiling regression, JSONL -// layout, unsupported manifest exclusion flags, adapter rejecting -// input-tree contents, self-test mismatch, corrupt manifest) → -// exit 2; everything else → exit 1. Runbooks branch on exit status -// to triage bad-dump-data vs operator typos, so this mapping is -// part of the CLI contract. +// layout, unsupported manifest exclusion flags, adapter scope +// mismatch with manifest, adapter rejecting input-tree contents, +// self-test mismatch, corrupt manifest) → exit 2; everything else +// → exit 1. Runbooks branch on exit status to triage bad-dump-data +// vs operator typos, so this mapping is part of the CLI contract. // // Sources of each sentinel: // - ErrSelfTestLowerLastCommitTS: CLI resolveLastCommitTS + library @@ -114,6 +114,10 @@ func run(argv []string, logger *slog.Logger) (int, error) { // - errSelfTestMismatch: writeAndPublish self-test branch // - ErrInvalidManifest / ErrUnsupportedFormatVersion: readInputManifest // surfacing backup.ReadManifest sentinels (codex P2 v14 #904) +// - errAdapterNotInManifest: validateAdaptersAgainstManifest when +// a selected adapter has a nil manifest scope pointer (codex P2 +// v26 #904 scenario B; retracted v27 scenario A in v30 per codex +// P1 v29 #904) func classifyEncodeError(err error) int { switch { case errors.Is(err, backup.ErrSelfTestLowerLastCommitTS), @@ -132,28 +136,30 @@ func classifyEncodeError(err error) int { } } -// validateAdaptersAgainstManifest intersects the user-selected -// AdapterSet with manifest.Adapters and stats each enabled adapter's -// top-level subdir under inputPath. Two failure modes — both -// data-correctness, both routed to exit 2: +// validateAdaptersAgainstManifest checks each enabled adapter +// against the nil/non-nil manifest scope pointer (manifest.Adapters.). +// One failure mode, routed to exit 2: // -// - Manifest lists no scope for an enabled adapter (nil pointer in -// manifest.Adapters.). A truncated/wrong manifest combined with -// the default `--adapter dynamodb,s3,redis,sqs` would otherwise -// pick up a stale on-disk subdir for an adapter the producer did -// not dump (codex P2 v26 #904 scenario B). -// - Manifest lists a scope for an enabled adapter but the on-disk -// subdir is absent. The adapter encoder's "missing top-level -// subdir → no-op" behavior would otherwise publish a header-only -// or partial .fsm without a hard error (codex P2 v26 #904 -// scenario A). +// - Manifest lists no scope (nil pointer) for an enabled adapter +// (codex P2 v26 #904 scenario B). A truncated/wrong manifest +// combined with the default `--adapter dynamodb,s3,redis,sqs` +// would otherwise pick up a stale on-disk subdir for an adapter +// the producer did not dump. +// +// Scenario A (non-nil scope but on-disk subdir missing) cannot be +// detected from the manifest alone because the decoder's +// populateAdapterScopes defers scope enumeration and always writes +// an empty &Adapter{} regardless of record count (codex P1 v29 #904 +// pulled the v27/v28/v29 stat-and-readdir checks). Future work: +// add a SHA / record-count manifest field so scenario A becomes +// detectable. See checkOneAdapterScope's doc for the full per-shape +// decision matrix. // // A nil manifest.Adapters is treated as "manifest has no scopes for -// any adapter" — every enabled adapter trips the scenario-B guard. -// Older manifests that omit the Adapters block deliberately are -// expected to also omit on-disk subdirs and pass `--adapter` set to -// only what they DO contain; that case is operator-driven and -// surfaces the same fail-closed error here. +// any adapter" — every enabled adapter trips the guard. Older +// manifests that omit the Adapters block deliberately are expected +// to pass `--adapter` set to only what they DO contain; that case +// is operator-driven and surfaces the same fail-closed error here. func validateAdaptersAgainstManifest(selected backup.AdapterSet, m backup.Manifest, inputPath string) error { checks := []struct { name string @@ -368,13 +374,18 @@ func applyAdapterName(name string, s *backup.AdapterSet) error { var errSelfTestMismatch = errors.New("backup: --self-test diff against --input") // errAdapterNotInManifest is returned by validateAdaptersAgainstManifest -// when the user has enabled an adapter that the manifest doesn't list, -// or when the manifest lists an adapter whose top-level subdir is -// missing on disk. Both are silent-data-loss scenarios per codex P2 -// v26 #904: the per-adapter encoders no-op on missing subdirs, so a -// truncated dump or a stale-dir-with-default-`--adapter-all` would -// publish a header-only/partial .fsm. classifyEncodeError routes this -// to exit 2 (data-correctness). +// when the user has enabled an adapter that the manifest doesn't list +// (nil pointer in manifest.Adapters.). This is the codex P2 v26 +// #904 scenario B: a stale on-disk subdir for an adapter the producer +// did not dump would otherwise be encoded under the default +// `--adapter dynamodb,s3,redis,sqs`. classifyEncodeError routes the +// sentinel to exit 2 (data-correctness). +// +// The earlier v27/v28/v29 attempts to also detect a missing on-disk +// subdir under a non-nil scope (codex P2 v26 scenario A) were retracted +// in v30 once codex P1 v29 #904 clarified that the decoder defers +// scope enumeration; the manifest can no longer distinguish "no +// records dumped" from "records dumped but scope not enumerated." var errAdapterNotInManifest = errors.New("encode: adapter scope mismatch between MANIFEST.json and --adapter / on-disk tree") func encodeOne(cfg *config, logger *slog.Logger) error { From 6f2040858f67023c367dfa499d955d6bff3f6fdf Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 23:05:07 +0900 Subject: [PATCH 32/35] backup: #904 v32 - rollback orphan FSM + chmod existing sidecar (codex/claude P2 v31) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two non-blocking P2 observations claude flagged on its v31 review, both pre-existing but real correctness/security issues. Folding both in. ## Codex/claude P2 v31 (1/2): writeSidecar failure leaves orphan .fsm The encodeOne success path runs writeAndPublish (renames temp .fsm → output) BEFORE writeSidecar. If writeSidecar fails after the rename, the CLI returns non-zero but a restore-visible .fsm exists without its provenance metadata — exactly the orphan state v17 closed for the adapter-error path, but un-closed for the sidecar-write path. Fix: new rollbackOrphanFSMAndSidecar(outputPath, logger) helper. On sidecar-write failure in the success path (publishErr == nil), the caller calls this helper before returning: - os.Remove(outputPath) — drops the just-published .fsm so the restore path sees nothing where the orphan was. - os.Remove(sidecarPath) — drops any partially-written sidecar bytes (OpenSidecarFile may have already Truncate(0)'d before WriteEncodeInfo failed). Both removes log-and-continue on non-ErrNotExist failures so the caller's primary sidecar-write error remains the dominant signal. A prior successful encode at the same output path is unrecoverable — writeAndPublish's os.Rename already overwrote it before writeSidecar ran. The rollback brings the state to "no .fsm, no sidecar," which is the same end state as "encode never ran" — the cleanest consistent outcome without filesystem transactions. Self-test-mismatch path is unchanged: the existing v15 removeStaleOutputFSM already handles that case, and writeSidecar's failure on that path already logger.Warn's-and-continues. Pinned end-to-end by extending TestCLISidecarWriteRefusesSymlinkTarget: - Existing assertions: victim file at the symlink target survives byte-equal; exit 1; symlink itself unlinked. - New v32 assertion: .fsm is also removed (the rollback fired after the .fsm was published but before encodeOne returned). ## Codex/claude P2 v31 (2/2): existing sidecar perm not tightened on re-encode openSidecarFile (unix) opens with O_CREATE + 0o600. The kernel applies the mode arg ONLY when creating a new file; if the path already exists (e.g. an older encoder wrote it at 0o644), the pre-existing perm is preserved. The sidecar carries the source path, cluster_id, and SHA256 of the .fsm — leaving it world-readable on a multi-user backup host is the same leak claude v4 fixed for the create case. Fix: after the regular-file + Nlink checks + Truncate(0), call f.Chmod(sidecarFileMode) to enforce the mode on the open descriptor. New named const sidecarFileMode = 0o600 so a future widening of the flag-arg mode forces the chmod call to be touched too. Windows variant unchanged (different perm model; the openSidecarFile file is already platform-split for this reason). Pinned by TestOpenSidecarFileEnforcesOwnerOnlyMode (unix-only via build tag): writes a pre-existing 0o644 file, calls OpenSidecarFile, asserts the resulting descriptor's Stat reports perm 0o600. ## Caller audit per CLAUDE.md semantic-change rule - encodeOne success-path branch: rollback called only when publishErr == nil AND writeSidecar fails. Self-test-mismatch path unchanged. Sole caller of encodeOne is run(); error return semantics unchanged. - rollbackOrphanFSMAndSidecar: new helper, sole caller is the success-path sidecar-failure branch in encodeOne. No external callers. - openSidecarFile (unix): success path now has one extra Chmod call. All in-package callers (encode_redis.go's multiple KEYMAP / TTL sidecar writers) are unaffected — they all already expect a 0o600 file. CLI's writeSidecar / writeMismatchTxt (via the exported OpenSidecarFile wrapper) are likewise strictly-tighter-perm only. - sidecarFileMode const: new named const, only consumer is the Chmod call in openSidecarFile (unix). Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 37 ++++++++++++++++++++++ cmd/elastickv-snapshot-encode/main_test.go | 11 ++++++- internal/backup/open_nofollow_unix.go | 16 ++++++++++ internal/backup/open_nofollow_unix_test.go | 37 ++++++++++++++++++++++ 4 files changed, 100 insertions(+), 1 deletion(-) create mode 100644 internal/backup/open_nofollow_unix_test.go diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 0ea5f2ee1..0a2ca744c 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -426,6 +426,15 @@ func encodeOne(cfg *config, logger *slog.Logger) error { // Surface the sidecar-write failure only if encode itself // succeeded; on mismatch the mismatch error takes priority. if publishErr == nil { + // .fsm was just renamed into place by writeAndPublish + // but the sidecar write failed → we have an orphan + // .fsm without its matching provenance metadata. + // Roll both back to a consistent absent state so the + // operator doesn't see a "successful" restore + // artifact missing its sidecar (claude/codex P2 v31 + // observation on PR #904 — the design contract is + // that .fsm + sidecar move together). + rollbackOrphanFSMAndSidecar(cfg.outputPath, logger) return errors.Wrap(serr, "write encode_info sidecar") } logger.Warn("write encode_info sidecar on mismatch", "err", serr) @@ -484,6 +493,34 @@ func buildEncodeOptions(cfg *config, effectiveTS uint64, manifest backup.Manifes return encodeOpts } +// rollbackOrphanFSMAndSidecar removes both the just-published +// .fsm and the partial .encode_info.json after a +// sidecar-write failure on the encode success path. The pair was +// supposed to move together (the .fsm describes the data the sidecar +// records the provenance for); if the sidecar didn't land, the +// operator must not see a "successful" .fsm without its matching +// provenance metadata (claude / codex P2 v31 observation on PR #904). +// +// A prior successful encode at the same output path is unrecoverable +// — writeAndPublish's os.Rename already overwrote it before +// writeSidecar ran. The rollback brings the state to "no .fsm, no +// sidecar at this path", which is the same end state as "encode +// never ran." That's the cleanest consistent outcome the CLI can +// produce without filesystem transactions. +// +// Both os.Remove calls log-and-continue on non-ErrNotExist failures +// so the caller's primary sidecar-write error remains the dominant +// signal. +func rollbackOrphanFSMAndSidecar(outputPath string, logger *slog.Logger) { + if rerr := os.Remove(outputPath); rerr != nil && !errors.Is(rerr, os.ErrNotExist) { + logger.Warn("rollback orphan .fsm after sidecar failure", "err", rerr) + } + sidecarPath := backup.EncodeInfoSidecarPath(outputPath) + if srerr := os.Remove(sidecarPath); srerr != nil && !errors.Is(srerr, os.ErrNotExist) { + logger.Warn("rollback partial sidecar after write failure", "err", srerr) + } +} + // writeMismatchTxt writes the self-test mismatch report to mismatchPath // using the same no-follow/no-clobber discipline as the sidecar // writer: an attacker pre-placing a symlink at diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index 3271ff1d6..eac9af904 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -1032,7 +1032,9 @@ func TestCLISidecarWriteRefusesSymlinkTarget(t *testing.T) { if code != exitUserErr { t.Errorf("exit code = %d, want %d (operator-env error, not data-correctness)", code, exitUserErr) } - // Victim file MUST be untouched. + // Victim file MUST be untouched (the no-follow open refused to + // resolve the symlink; the v32 rollback's os.Remove on + // sidecarPath operates on the symlink itself, not the target). got, rerr := os.ReadFile(victim) if rerr != nil { t.Fatalf("read victim: %v", rerr) @@ -1040,6 +1042,13 @@ func TestCLISidecarWriteRefusesSymlinkTarget(t *testing.T) { if string(got) != victimBody { t.Errorf("victim mutated; OpenSidecarFile followed the symlink (codex P2 v25 regression)") } + // .fsm at outputPath MUST be removed by v32's rollback. The + // sidecar-write failure happened AFTER writeAndPublish renamed + // the .fsm into place, so without the rollback we'd have an + // orphan .fsm visible to the restore path. + if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { + t.Errorf(".fsm at %s should be removed by v32 rollback after sidecar failure", out) + } } // TestCLIPublishesFsmAndSidecarMode0600 pins claude v4 #904: the diff --git a/internal/backup/open_nofollow_unix.go b/internal/backup/open_nofollow_unix.go index bdfb1db75..cf72b759c 100644 --- a/internal/backup/open_nofollow_unix.go +++ b/internal/backup/open_nofollow_unix.go @@ -98,5 +98,21 @@ func openSidecarFile(path string) (*os.File, error) { _ = f.Close() return nil, cockroachdberr.WithStack(err) } + // Enforce 0o600 on the descriptor. The flags-arg mode (0o600 + // above) is applied by the kernel ONLY on file creation; if + // path already existed, its pre-existing perms are kept. An + // older encoder writing 0o644 would otherwise leave the + // sidecar's source path / cluster ID / SHA256 world-readable + // after re-encode (claude / codex P2 v31 observation on PR #904). + if err := f.Chmod(sidecarFileMode); err != nil { + _ = f.Close() + return nil, cockroachdberr.WithStack(err) + } return f, nil } + +// sidecarFileMode is the file mode openSidecarFile enforces — owner +// read/write only. Pulled into a named const so the truncate-then- +// chmod step here matches the OpenFile flag-arg mode above; a future +// edit that widens one must touch both. +const sidecarFileMode os.FileMode = 0o600 diff --git a/internal/backup/open_nofollow_unix_test.go b/internal/backup/open_nofollow_unix_test.go new file mode 100644 index 000000000..7e7eddc77 --- /dev/null +++ b/internal/backup/open_nofollow_unix_test.go @@ -0,0 +1,37 @@ +//go:build unix + +package backup + +import ( + "os" + "path/filepath" + "testing" +) + +// TestOpenSidecarFileEnforcesOwnerOnlyMode pins claude / codex P2 v31 +// observation on PR #904: an older encoder may have written the +// sidecar at 0o644; OpenFile's mode arg only applies on CREATE, so +// re-opening for re-encode would preserve the wider perms. The +// post-Truncate Chmod restores 0o600 on every successful open. +func TestOpenSidecarFileEnforcesOwnerOnlyMode(t *testing.T) { + t.Parallel() + dir := t.TempDir() + path := filepath.Join(dir, "sidecar.json") + // Pre-existing sidecar with wider perms (simulating an older + // encoder). + if err := os.WriteFile(path, []byte("prior"), 0o644); err != nil { //nolint:gosec // test simulates legacy permissive sidecar + t.Fatalf("WriteFile: %v", err) + } + f, err := OpenSidecarFile(path) + if err != nil { + t.Fatalf("OpenSidecarFile: %v", err) + } + t.Cleanup(func() { _ = f.Close() }) + info, err := f.Stat() + if err != nil { + t.Fatalf("Stat: %v", err) + } + if got := info.Mode().Perm(); got != 0o600 { + t.Errorf("perm = %o, want 0o600 (Chmod after Truncate must tighten existing-file perms)", got) + } +} From 5adca7f679605e416bc4135d70b333c96bd80690 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 23:44:43 +0900 Subject: [PATCH 33/35] backup: #904 v33 - three codex P2 v30/v31/v32 fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex's review window came back online after the rate-limit pause and posted three P2 findings against v30, v31, and v32 that I had missed (the inline comments were attached to PR reviews 4410223887, 4410270684, 4410442057 but the review BODIES were just the standard "Codex Review" header — the actual findings were in the per-review inline comments). ## Codex P2 v31 (L640): S3 encoder silently skips non-directory bucket entries S3RecordEncoder.Encode does `if !ent.IsDir() { continue }` for top-level entries under `s3/`. When a partially-truncated dump leaves a regular file or symlink where a bucket directory should be, the encoder publishes a partial .fsm with the affected bucket silently omitted. The deferred-enumeration empty S3 scope in the manifest cannot otherwise flag the missing data. Fix: encode_s3.go's loop now distinguishes: - Reserved-prefix entries ("_*") → continue (forward compat with `_incomplete_uploads/` and `_orphans/`, which are handled by ErrEncodeUnsupportedS3IncompleteUploads / ErrEncodeUnsupportedS3Orphans). - Any other non-directory entry → return ErrS3EncodeNotRegular with the path and mode (matches the existing _bucket.json non- regular guard). Pinned by TestS3EncodeRejectsNonDirectoryBucketEntry and TestS3EncodeIgnoresReservedPrefixEntry. ## Codex P2 v32 (L642): rollback FSM when parent-dir fsync fails v25 added fsyncParentDir after os.Rename in writeAndPublish to make the rename durable. If fsync fails (e.g., on a filesystem that rejects directory fsync, or transient I/O error), writeAndPublish returned the error but the renamed .fsm stayed at --output — the operator sees a "successful" .fsm that may not survive a crash. Fix: extract the rename + fsync + rollback block into a new publishAndFsync helper that os.Remove's the .fsm on fsync failure (same rollback semantics as the v32 sidecar-failure path). publishAndFsync also brings writeAndPublish under the cyclop bound. No new test — the failure path is hard to drive deterministically (would need fsync syscall injection). The success path is exercised by every existing CLI round-trip test; the rollback branch is inspectable by code review against the matching v32 sidecar rollback pattern. ## Codex P2 v32 (L520): rollback must not delete operator-owned sidecar entries v32's rollbackOrphanFSMAndSidecar called os.Remove on the sidecar path unconditionally. When the sidecar path was a pre-existing non-regular entry that OpenSidecarFile correctly refused to clobber (operator-placed symlink, FIFO, empty directory, etc.), the rollback would destructively unlink it merely because the encode failed. Fix: rollbackOrphanFSMAndSidecar now os.Lstats the sidecar path and only os.Removes when info.Mode().IsRegular(). Symlinks (operator- placed), FIFOs, directories, and other non-regular entries are logged-and-skipped. Pinned end-to-end by extending TestCLISidecarWriteRefusesSymlinkTarget: - Existing assertions: victim file at the symlink target survives byte-equal; .fsm at --output is removed by v32 rollback. - NEW v33 assertion: the operator-placed symlink at the sidecar path is PRESERVED (os.Lstat + ModeSymlink check). The assertions are extracted into assertSymlinkSidecarRollbackInvariants to keep the test under the cyclop bound. ## Caller audit per CLAUDE.md semantic-change rule - S3RecordEncoder.Encode: was "silent skip on non-directory top- level entries"; now "fail closed with ErrS3EncodeNotRegular unless name starts with `_`". Sole effect on legitimate dumps is no behavior change (all canonical bucket dirs are directories). Malformed dumps that previously silently half-encoded now fail closed; classifyEncodeError routes via the v9 errors.Mark + ErrEncodeAdapterData layer to exit 2. - publishAndFsync (new helper): sole caller is writeAndPublish. Behavior change ONLY on fsync failure (was: leave .fsm in place; now: roll back). Success path unchanged. - rollbackOrphanFSMAndSidecar: sidecar removal is now gated on IsRegular(). Was destructive on operator-owned non-regular entries; now leaves them alone. Sole caller is encodeOne's success-path sidecar-failure branch. - assertSymlinkSidecarRollbackInvariants (test helper): no production callers; only used by TestCLISidecarWriteRefusesSymlinkTarget. Tests + lint green (incl. gofmt on the new helper's comment block). --- cmd/elastickv-snapshot-encode/main.go | 55 +++++++++++++++++++--- cmd/elastickv-snapshot-encode/main_test.go | 34 ++++++++++--- internal/backup/encode_s3.go | 29 +++++++++++- internal/backup/encode_s3_test.go | 48 +++++++++++++++++++ 4 files changed, 151 insertions(+), 15 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 0a2ca744c..7163ceb6d 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -516,6 +516,29 @@ func rollbackOrphanFSMAndSidecar(outputPath string, logger *slog.Logger) { logger.Warn("rollback orphan .fsm after sidecar failure", "err", rerr) } sidecarPath := backup.EncodeInfoSidecarPath(outputPath) + // Only remove a sidecar known to be a regular file — i.e., one + // this run (or a prior encoder) created. A pre-existing + // non-regular entry (operator-placed symlink, FIFO, directory, + // etc.) is what OpenSidecarFile correctly refuses to clobber + // when the rollback was triggered; os.Remove'ing the + // operator's filesystem object merely because the encode + // failed would be a destructive side effect (codex P2 v32 + // #904). Symlinks the operator may have placed are also + // preserved here — OpenSidecarFile's O_NOFOLLOW failure means + // the symlink was never followed and no truncate happened, so + // the link is unchanged and is the operator's to manage. + info, err := os.Lstat(sidecarPath) + if err != nil { + if !errors.Is(err, os.ErrNotExist) { + logger.Warn("stat sidecar for rollback", "err", err) + } + return + } + if !info.Mode().IsRegular() { + logger.Warn("skip sidecar rollback: --output sidecar path is not a regular file", + "path", sidecarPath, "mode", info.Mode()) + return + } if srerr := os.Remove(sidecarPath); srerr != nil && !errors.Is(srerr, os.ErrNotExist) { logger.Warn("rollback partial sidecar after write failure", "err", srerr) } @@ -627,21 +650,41 @@ func writeAndPublish(cfg *config, encodeOpts backup.EncodeOptions, mismatchPath removeStaleOutputFSM(cfg.outputPath, logger) return result, errors.Wrap(errSelfTestMismatch, "self-test diff (see "+mismatchPath+")") } - if err := os.Rename(tempPath, cfg.outputPath); err != nil { - return result, errors.Wrap(err, "rename tmp -> output") + if perr := publishAndFsync(tempPath, cfg.outputPath, logger); perr != nil { + return result, perr } publishedTempPath = "" // rename succeeded; defer no-ops + return result, nil +} + +// publishAndFsync renames tempPath → outputPath and then fsyncs the +// parent directory. If the fsync fails, the just-renamed .fsm is +// removed so the operator does not see a non-durable "successful" +// .fsm (codex P2 v24 #904 added the fsync; codex P2 v32 #904 added +// the rollback). Split out of writeAndPublish to keep that function +// under the cyclop bound. +func publishAndFsync(tempPath, outputPath string, logger *slog.Logger) error { + if err := os.Rename(tempPath, outputPath); err != nil { + return errors.Wrap(err, "rename tmp -> output") + } // fsync the parent dir so the rename's new directory entry is // durable. Without this, a power loss / host crash immediately // after a successful encode can lose the new entry (or // resurrect the old one) on filesystems where rename durability // requires syncing the containing directory. Mirrors the repo // pattern used by internal/encryption/sidecar.go + - // internal/raftengine/etcd/persistence.go (codex P2 v24 #904). - if err := fsyncParentDir(cfg.outputPath); err != nil { - return result, errors.Wrap(err, "fsync output dir after rename") + // internal/raftengine/etcd/persistence.go. + if err := fsyncParentDir(outputPath); err != nil { + // Roll back so the operator doesn't see a non-durable + // "successful" .fsm; restoring the consistent absent state + // is the same outcome encodeOne enforces on sidecar-write + // failures (codex P2 v32 #904). + if rerr := os.Remove(outputPath); rerr != nil && !errors.Is(rerr, os.ErrNotExist) { + logger.Warn("rollback orphan .fsm after parent-dir fsync failure", "err", rerr) + } + return errors.Wrap(err, "fsync output dir after rename") } - return result, nil + return nil } // fsyncParentDir opens the parent directory of path read-only and diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index eac9af904..f368ad9ae 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -1032,9 +1032,25 @@ func TestCLISidecarWriteRefusesSymlinkTarget(t *testing.T) { if code != exitUserErr { t.Errorf("exit code = %d, want %d (operator-env error, not data-correctness)", code, exitUserErr) } - // Victim file MUST be untouched (the no-follow open refused to - // resolve the symlink; the v32 rollback's os.Remove on - // sidecarPath operates on the symlink itself, not the target). + assertSymlinkSidecarRollbackInvariants(t, victim, victimBody, out, sidecarPath) +} + +// assertSymlinkSidecarRollbackInvariants checks all three post- +// failure invariants for TestCLISidecarWriteRefusesSymlinkTarget: +// +// (1) the symlink target file is unchanged (no-follow open never +// resolved the link), +// (2) the .fsm at --output is removed (v32 rollback fired after +// the .fsm was renamed into place but before encodeOne +// returned), and +// (3) the operator-placed symlink at the sidecar path is +// preserved (v33 rollback only removes regular files; +// non-regular sidecar entries are operator-owned). +// +// Extracted into a helper to keep the test body under the cyclop +// bound. +func assertSymlinkSidecarRollbackInvariants(t *testing.T, victim string, victimBody, out, sidecarPath string) { + t.Helper() got, rerr := os.ReadFile(victim) if rerr != nil { t.Fatalf("read victim: %v", rerr) @@ -1042,13 +1058,17 @@ func TestCLISidecarWriteRefusesSymlinkTarget(t *testing.T) { if string(got) != victimBody { t.Errorf("victim mutated; OpenSidecarFile followed the symlink (codex P2 v25 regression)") } - // .fsm at outputPath MUST be removed by v32's rollback. The - // sidecar-write failure happened AFTER writeAndPublish renamed - // the .fsm into place, so without the rollback we'd have an - // orphan .fsm visible to the restore path. if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { t.Errorf(".fsm at %s should be removed by v32 rollback after sidecar failure", out) } + linkInfo, lerr := os.Lstat(sidecarPath) + if lerr != nil { + t.Errorf("operator-placed symlink at sidecar path was removed by rollback (codex P2 v32 regression): %v", lerr) + return + } + if linkInfo.Mode()&os.ModeSymlink == 0 { + t.Errorf("sidecar path mode = %s; expected symlink preserved by v33 rollback", linkInfo.Mode()) + } } // TestCLIPublishesFsmAndSidecarMode0600 pins claude v4 #904: the diff --git a/internal/backup/encode_s3.go b/internal/backup/encode_s3.go index b0b77dce4..ad63f221b 100644 --- a/internal/backup/encode_s3.go +++ b/internal/backup/encode_s3.go @@ -5,6 +5,7 @@ import ( "encoding/json" "os" "path/filepath" + "strings" "github.com/bootjp/elastickv/internal/s3keys" "github.com/cockroachdb/errors" @@ -83,10 +84,34 @@ func (e *S3RecordEncoder) Encode(b *snapshotBuilder) error { return err } for _, ent := range entries { + name := ent.Name() if !ent.IsDir() { - continue + // Top-level entries under s3/ must be bucket directories. + // A regular file or symlink here means the dump is + // malformed or partially truncated — silently skipping + // would let the encoder publish a partial .fsm with + // the affected bucket omitted (codex P2 v32 #904; the + // manifest's empty S3 scope from populateAdapterScopes + // cannot otherwise distinguish missing bucket from + // dumped-empty bucket). + // + // Reserved-prefix entries that start with "_" (e.g. + // _incomplete_uploads, _orphans) are handled by their + // own dedicated paths and are NOT top-level buckets; + // the fail-closed should not catch them. Today the + // reverse encoder doesn't emit those subtrees at all + // (covered by ErrEncodeUnsupportedS3IncompleteUploads / + // ErrEncodeUnsupportedS3Orphans) so any "_*" entry here + // would have been rejected upstream — but skip them + // here too for forward compat. + if strings.HasPrefix(name, "_") { + continue + } + return errors.Wrapf(ErrS3EncodeNotRegular, + "s3/%s is not a directory (mode=%s); top-level entries under s3/ must be bucket directories", + name, ent.Type()) } - if err := e.encodeBucket(b, root, ent.Name()); err != nil { + if err := e.encodeBucket(b, root, name); err != nil { return err } } diff --git a/internal/backup/encode_s3_test.go b/internal/backup/encode_s3_test.go index 840bd66b0..13487d389 100644 --- a/internal/backup/encode_s3_test.go +++ b/internal/backup/encode_s3_test.go @@ -137,6 +137,54 @@ func TestS3EncodeMissingDirIsNoop(t *testing.T) { } } +// TestS3EncodeRejectsNonDirectoryBucketEntry pins codex P2 v32 #904: +// when an entry directly under s3/ is a regular file or symlink +// rather than a bucket directory, the encoder must fail closed with +// ErrS3EncodeNotRegular rather than silently skipping (which would +// publish a partial .fsm with the affected bucket omitted; the +// manifest's deferred-enumeration empty S3 scope cannot otherwise +// flag the missing data). Reserved-prefix `_*` entries (e.g. +// `_incomplete_uploads`) are explicitly tolerated because they're +// handled by dedicated paths. +func TestS3EncodeRejectsNonDirectoryBucketEntry(t *testing.T) { + t.Parallel() + in := t.TempDir() + if err := os.MkdirAll(filepath.Join(in, "s3"), 0o755); err != nil { + t.Fatalf("mkdir: %v", err) + } + // Plant a regular file where a bucket directory should be. + if err := os.WriteFile(filepath.Join(in, "s3", "stray.txt"), []byte("oops"), 0o600); err != nil { + t.Fatalf("WriteFile: %v", err) + } + b := newSnapshotBuilder(s3EncTS) + err := NewS3RecordEncoder(in).Encode(b) + if !errors.Is(err, ErrS3EncodeNotRegular) { + t.Fatalf("Encode err = %v, want errors.Is ErrS3EncodeNotRegular", err) + } +} + +// TestS3EncodeIgnoresReservedPrefixEntry pins that codex P2 v32's +// fail-closed for non-directory top-level entries does NOT fire on +// reserved-prefix entries (those starting with "_"). The reverse +// encoder's unsupported-features guard handles those subtrees +// separately via ErrEncodeUnsupportedS3IncompleteUploads / +// ErrEncodeUnsupportedS3Orphans. +func TestS3EncodeIgnoresReservedPrefixEntry(t *testing.T) { + t.Parallel() + in := t.TempDir() + if err := os.MkdirAll(filepath.Join(in, "s3"), 0o755); err != nil { + t.Fatalf("mkdir: %v", err) + } + // Reserved-prefix file (e.g., a marker the operator left). + if err := os.WriteFile(filepath.Join(in, "s3", "_marker"), []byte("x"), 0o600); err != nil { + t.Fatalf("WriteFile: %v", err) + } + b := newSnapshotBuilder(s3EncTS) + if err := NewS3RecordEncoder(in).Encode(b); err != nil { + t.Errorf("Encode err = %v, want nil (reserved-prefix entries should be skipped)", err) + } +} + // TestS3EncodeRejectsNonRegularBucketMeta pins the pre-open guard: a // _bucket.json that is a directory is refused with ErrS3EncodeNotRegular. func TestS3EncodeRejectsNonRegularBucketMeta(t *testing.T) { From 03a9883479e80b234c28b05994675c9aa1d17dbb Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 23:59:59 +0900 Subject: [PATCH 34/35] backup: #904 v34 - preserve hard-linked sidecars + CodeRabbit nits (codex P2 v33) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Codex P2 v33: rollback destroys hard-linked sidecars v32's rollbackOrphanFSMAndSidecar gated sidecar removal on info.Mode().IsRegular(). A pre-existing hard link at the sidecar path passes IsRegular(), so the rollback would os.Remove it — destroying an operator-owned link that OpenSidecarFile's Nlink > 1 guard correctly refused to clobber. In a shared backup directory this lets a failed encode unlink files outside the dump tree. ## Fix Track whether OpenSidecarFile actually succeeded on this run. writeSidecar's signature changes from `error` to `(truncated bool, err error)`: - OpenSidecarFile fails (symlink ELOOP, hard-link Nlink>1, FIFO ENXIO, non-regular file, etc.) → return (false, err). Caller MUST NOT remove the sidecar path; the pre-existing entry is operator-owned and was never touched. - OpenSidecarFile succeeds (truncated and chmod'd a fresh or single-link regular file) → return (true, err) on any subsequent failure. Caller rollback removes the partial bytes. rollbackOrphanFSMAndSidecar gains a sidecarTruncated bool parameter. The Lstat + IsRegular check from v33 is removed — the truncated flag is a precise signal of "this run owns the bytes," not a stat-time heuristic that can race with operator changes. ## Pinned by TestCLISidecarWriteRefusesHardLinkTarget (unix-only): pre-plants a victim file with a sentinel body, os.Link's the sidecar path to it, runs the CLI, asserts: - Exit code = exitUserErr. - Victim contents are byte-equal preserved (Nlink guard refused truncation). - Hard link at sidecar path is preserved (sidecarTruncated=false → rollback skips it). - .fsm at outputPath is removed (rollback's FSM branch always fires on sidecar-write failure with successful publish). The v33 TestCLISidecarWriteRefusesSymlinkTarget continues to pass: OpenSidecarFile ELOOP → sidecarTruncated=false → symlink survives. ## CodeRabbit nits folded in 1. `internal/backup/open_nofollow_unix.go:61`: `os.OpenFile(..., flag, 0o600)` now uses the `sidecarFileMode` const (was literal). Keeps the OpenFile mode and the post-Truncate Chmod mode synchronized — a future widening of one is mechanically caught. 2. `internal/backup/open_nofollow_unix_test.go:20-24`: TestOpenSidecarFileEnforcesOwnerOnlyMode now `os.Stat`s the seeded 0o644 file and `t.Skip`s if the environment refused the permissive mode (restrictive umask / FS). Without this, the test could pass spuriously on environments where WriteFile silently produces 0o600. ## Caller audit per CLAUDE.md semantic-change rule - writeSidecar return signature changed from `error` to `(bool, error)`. Sole caller is encodeOne (success-path branch); updated in lock-step to capture both values. The bool flows into rollbackOrphanFSMAndSidecar via a new parameter. - rollbackOrphanFSMAndSidecar gained a sidecarTruncated bool parameter. Sole caller is encodeOne's success-path sidecar-failure branch; updated in lock-step. v33's Lstat + IsRegular check removed (replaced by the precise truncated flag), so non-regular paths are no longer destroyed even by paths the v33 check would have accidentally caught. - openSidecarFile (unix) OpenFile mode arg changed from 0o600 literal to sidecarFileMode const. Same numeric value; no behavior change. All callers (in-package adapter writers + the exported OpenSidecarFile) are unaffected. Tests + lint green. --- cmd/elastickv-snapshot-encode/main.go | 87 ++++++++++++---------- cmd/elastickv-snapshot-encode/main_test.go | 59 +++++++++++++++ internal/backup/open_nofollow_unix.go | 2 +- internal/backup/open_nofollow_unix_test.go | 11 +++ 4 files changed, 118 insertions(+), 41 deletions(-) diff --git a/cmd/elastickv-snapshot-encode/main.go b/cmd/elastickv-snapshot-encode/main.go index 7163ceb6d..d3905d769 100644 --- a/cmd/elastickv-snapshot-encode/main.go +++ b/cmd/elastickv-snapshot-encode/main.go @@ -422,19 +422,21 @@ func encodeOne(cfg *config, logger *slog.Logger) error { // when the encode itself errored before any result was populated // (publishErr != nil && !errSelfTestMismatch) (codex P2 v6 #904). if publishErr == nil || errors.Is(publishErr, errSelfTestMismatch) { - if serr := writeSidecar(cfg, manifest, effectiveTS, overridden, result); serr != nil { + sidecarTruncated, serr := writeSidecar(cfg, manifest, effectiveTS, overridden, result) + if serr != nil { // Surface the sidecar-write failure only if encode itself // succeeded; on mismatch the mismatch error takes priority. if publishErr == nil { // .fsm was just renamed into place by writeAndPublish // but the sidecar write failed → we have an orphan // .fsm without its matching provenance metadata. - // Roll both back to a consistent absent state so the - // operator doesn't see a "successful" restore - // artifact missing its sidecar (claude/codex P2 v31 - // observation on PR #904 — the design contract is - // that .fsm + sidecar move together). - rollbackOrphanFSMAndSidecar(cfg.outputPath, logger) + // Roll back to a consistent absent state. The sidecar + // rollback is gated on sidecarTruncated so we don't + // remove an operator-owned pre-existing entry that + // OpenSidecarFile refused to clobber (claude/codex + // P2 v31 added the rollback; codex P2 v33 #904 added + // the truncation gate for hard-linked sidecars). + rollbackOrphanFSMAndSidecar(cfg.outputPath, sidecarTruncated, logger) return errors.Wrap(serr, "write encode_info sidecar") } logger.Warn("write encode_info sidecar on mismatch", "err", serr) @@ -511,34 +513,25 @@ func buildEncodeOptions(cfg *config, effectiveTS uint64, manifest backup.Manifes // Both os.Remove calls log-and-continue on non-ErrNotExist failures // so the caller's primary sidecar-write error remains the dominant // signal. -func rollbackOrphanFSMAndSidecar(outputPath string, logger *slog.Logger) { +// rollbackOrphanFSMAndSidecar reverts an encode that succeeded in +// publishing the .fsm but failed in writing the sidecar. Always +// removes the just-renamed .fsm at outputPath. The sidecar at +// EncodeInfoSidecarPath(outputPath) is removed ONLY when +// sidecarTruncated is true — i.e., backup.OpenSidecarFile succeeded +// and either created a fresh file or truncated an existing +// single-link regular file. When sidecarTruncated is false the +// existing entry is operator-owned (symlink, hard link with +// Nlink > 1, FIFO, directory, etc.) that OpenSidecarFile refused to +// clobber, so this rollback must NOT destroy it either (codex P2 +// v32 #904 / codex P2 v33 #904). +func rollbackOrphanFSMAndSidecar(outputPath string, sidecarTruncated bool, logger *slog.Logger) { if rerr := os.Remove(outputPath); rerr != nil && !errors.Is(rerr, os.ErrNotExist) { logger.Warn("rollback orphan .fsm after sidecar failure", "err", rerr) } - sidecarPath := backup.EncodeInfoSidecarPath(outputPath) - // Only remove a sidecar known to be a regular file — i.e., one - // this run (or a prior encoder) created. A pre-existing - // non-regular entry (operator-placed symlink, FIFO, directory, - // etc.) is what OpenSidecarFile correctly refuses to clobber - // when the rollback was triggered; os.Remove'ing the - // operator's filesystem object merely because the encode - // failed would be a destructive side effect (codex P2 v32 - // #904). Symlinks the operator may have placed are also - // preserved here — OpenSidecarFile's O_NOFOLLOW failure means - // the symlink was never followed and no truncate happened, so - // the link is unchanged and is the operator's to manage. - info, err := os.Lstat(sidecarPath) - if err != nil { - if !errors.Is(err, os.ErrNotExist) { - logger.Warn("stat sidecar for rollback", "err", err) - } - return - } - if !info.Mode().IsRegular() { - logger.Warn("skip sidecar rollback: --output sidecar path is not a regular file", - "path", sidecarPath, "mode", info.Mode()) + if !sidecarTruncated { return } + sidecarPath := backup.EncodeInfoSidecarPath(outputPath) if srerr := os.Remove(sidecarPath); srerr != nil && !errors.Is(srerr, os.ErrNotExist) { logger.Warn("rollback partial sidecar after write failure", "err", srerr) } @@ -780,9 +773,17 @@ func tempOutputPath(output string) (string, error) { return output + ".tmp-" + hex.EncodeToString(buf), nil } -// writeSidecar emits ENCODE_INFO.json next to the published .fsm. -// Path-derived per gemini medium v2 #896. -func writeSidecar(cfg *config, m backup.Manifest, effectiveTS uint64, overridden bool, result backup.EncodeResult) error { +// writeSidecar emits ENCODE_INFO.json next to the published .fsm +// (path-derived per gemini medium v2 #896). Returns (truncated, err): +// truncated is true iff backup.OpenSidecarFile succeeded — i.e., the +// existing path was truncated by THIS run (or a fresh file was +// created). When truncated is false, the caller MUST NOT roll back +// the sidecar path: any pre-existing entry there is operator-owned +// and OpenSidecarFile correctly refused to clobber it (codex P2 v33 +// #904 — hard-linked sidecars in particular pass IsRegular but were +// refused via Nlink>1; v32's IsRegular-only rollback gate would +// have destroyed those). +func writeSidecar(cfg *config, m backup.Manifest, effectiveTS uint64, overridden bool, result backup.EncodeResult) (bool, error) { info := backup.NewEncodeInfo(time.Now()) info.EncoderVersion = version info.InputRoot = cfg.inputPath @@ -812,24 +813,30 @@ func writeSidecar(cfg *config, m backup.Manifest, effectiveTS uint64, overridden // choosing (codex P2 v25 #904). f, err := backup.OpenSidecarFile(sidecarPath) if err != nil { - return errors.Wrap(err, "open sidecar") - } + // Pre-existing entry refused (symlink/hard-link/non-regular). + // Caller must NOT rollback the sidecar path; the file there + // is operator-owned and was never touched by this run. + return false, errors.Wrap(err, "open sidecar") + } + // From this point, f points at a truncated (zero-length) regular + // file owned by this run. Any subsequent failure leaves partial + // bytes (or empty) on disk — the caller's rollback removes them. if err := backup.WriteEncodeInfo(f, info); err != nil { _ = f.Close() - return errors.Wrap(err, "WriteEncodeInfo") + return true, errors.Wrap(err, "WriteEncodeInfo") } if err := f.Sync(); err != nil { _ = f.Close() - return errors.WithStack(err) + return true, errors.WithStack(err) } if err := f.Close(); err != nil { - return errors.WithStack(err) + return true, errors.WithStack(err) } // fsync the parent dir so the new sidecar's directory entry is // durable alongside its bytes. Mirrors the rename path // (codex P2 v24 #904). if err := fsyncParentDir(sidecarPath); err != nil { - return errors.Wrap(err, "fsync sidecar parent dir") + return true, errors.Wrap(err, "fsync sidecar parent dir") } - return nil + return true, nil } diff --git a/cmd/elastickv-snapshot-encode/main_test.go b/cmd/elastickv-snapshot-encode/main_test.go index f368ad9ae..b97cb5761 100644 --- a/cmd/elastickv-snapshot-encode/main_test.go +++ b/cmd/elastickv-snapshot-encode/main_test.go @@ -1071,6 +1071,65 @@ func assertSymlinkSidecarRollbackInvariants(t *testing.T, victim string, victimB } } +// TestCLISidecarWriteRefusesHardLinkTarget pins codex P2 v33 #904: +// when the sidecar path is a hard link to an operator-owned file, +// OpenSidecarFile refuses to truncate (Nlink > 1 guard); the v32 +// rollback's IsRegular()-only gate would have unlinked the hard +// link anyway, destroying operator state. v34 routes the rollback +// through a sidecarTruncated bool so a refused OpenSidecarFile +// keeps the operator's entry intact. Unix-only (Windows hard-link +// semantics differ). +func TestCLISidecarWriteRefusesHardLinkTarget(t *testing.T) { + if isWindows { + t.Skip("hard-link refusal semantics differ on Windows") + } + t.Parallel() + in := t.TempDir() + emitMinimalManifest(t, in, 100) + + outDir := t.TempDir() + out := filepath.Join(outDir, "out.fsm") + sidecarPath := backup.EncodeInfoSidecarPath(out) + // Pre-plant a victim file and hard-link the sidecar path to it. + victimDir := t.TempDir() + victim := filepath.Join(victimDir, "victim.json") + const victimBody = "VICTIM CONTENTS — must survive hard-link rejection" + if err := os.WriteFile(victim, []byte(victimBody), 0o600); err != nil { + t.Fatalf("WriteFile victim: %v", err) + } + if err := os.Link(victim, sidecarPath); err != nil { + t.Fatalf("Link victim → sidecar path: %v", err) + } + + code, err := run([]string{"--input", in, "--output", out}, quietLogger()) + if err == nil { + t.Fatalf("run succeeded; want sidecar open to fail on hard-linked path") + } + if code != exitUserErr { + t.Errorf("exit code = %d, want %d (operator-env error)", code, exitUserErr) + } + // Victim contents MUST be preserved (OpenSidecarFile's Nlink + // guard refused to truncate; v34 rollback's sidecarTruncated=false + // branch then skipped the os.Remove). + got, rerr := os.ReadFile(victim) + if rerr != nil { + t.Fatalf("read victim: %v", rerr) + } + if string(got) != victimBody { + t.Errorf("victim mutated; OpenSidecarFile truncated through hard link (codex P2 v33 regression)") + } + // The hard link at the sidecar path MUST also survive — the + // rollback no longer os.Removes it (codex P2 v33 fix). + if _, statErr := os.Lstat(sidecarPath); statErr != nil { + t.Errorf("hard link at sidecar path was removed by rollback (codex P2 v33 regression): %v", statErr) + } + // .fsm at outputPath MUST be removed (rollback's FSM branch + // always fires when the encode succeeded but sidecar failed). + if _, statErr := os.Stat(out); !os.IsNotExist(statErr) { + t.Errorf(".fsm at %s should be removed by rollback after sidecar failure", out) + } +} + // TestCLIPublishesFsmAndSidecarMode0600 pins claude v4 #904: the // produced .fsm and ENCODE_INFO.json are created with mode 0o600 so a // multi-user backup host does not get a world-readable dataset. The diff --git a/internal/backup/open_nofollow_unix.go b/internal/backup/open_nofollow_unix.go index cf72b759c..9a0d06279 100644 --- a/internal/backup/open_nofollow_unix.go +++ b/internal/backup/open_nofollow_unix.go @@ -58,7 +58,7 @@ func refuseHardLink(info os.FileInfo, path string) error { func openSidecarFile(path string) (*os.File, error) { // Note: NO O_TRUNC here — we truncate after the link-count check. const flag = os.O_WRONLY | os.O_CREATE | syscall.O_NOFOLLOW | syscall.O_NONBLOCK - f, err := os.OpenFile(path, flag, 0o600) //nolint:gosec,mnd // path is composed from output-root + fixed file name; 0600 is the standard owner-only mode + f, err := os.OpenFile(path, flag, sidecarFileMode) //nolint:gosec // path is composed from output-root + fixed file name; sidecarFileMode is the standard owner-only mode used here and by the post-Truncate Chmod below if err != nil { if errors.Is(err, syscall.ELOOP) { return nil, cockroachdberr.WithStack(cockroachdberr.Wrapf(err, diff --git a/internal/backup/open_nofollow_unix_test.go b/internal/backup/open_nofollow_unix_test.go index 7e7eddc77..cc1efde08 100644 --- a/internal/backup/open_nofollow_unix_test.go +++ b/internal/backup/open_nofollow_unix_test.go @@ -22,6 +22,17 @@ func TestOpenSidecarFileEnforcesOwnerOnlyMode(t *testing.T) { if err := os.WriteFile(path, []byte("prior"), 0o644); err != nil { //nolint:gosec // test simulates legacy permissive sidecar t.Fatalf("WriteFile: %v", err) } + // Verify the environment actually honored the broader seed mode; + // a restrictive umask or stricter FS could silently produce 0o600 + // and the test would pass even if the chmod-enforcement logic + // regressed (CodeRabbit nit on PR #904). + seedInfo, err := os.Stat(path) + if err != nil { + t.Fatalf("Stat seeded file: %v", err) + } + if seedInfo.Mode().Perm()&0o077 == 0 { + t.Skipf("environment refused permissive seed mode (got 0o%o); test cannot exercise chmod-enforcement", seedInfo.Mode().Perm()) + } f, err := OpenSidecarFile(path) if err != nil { t.Fatalf("OpenSidecarFile: %v", err) From 365744e178c6a9cd560d10c99e399d13aae6105f Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Wed, 3 Jun 2026 00:01:41 +0900 Subject: [PATCH 35/35] backup: #904 v35 - fix codex citation round-number (claude v33 nit) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Claude v33 caught: encode_s3.go:93 attributed the non-directory bucket guard to "codex P2 v32 #904" but the finding was actually flagged in codex's review of commit a2ed548f (v31 commit). v32 was when codex flagged the fsync + sidecar-rollback issues. The attribution round-number was off by one. Doc-only one-character fix (v32 → v31). No behavior change; lint green. --- internal/backup/encode_s3.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/internal/backup/encode_s3.go b/internal/backup/encode_s3.go index ad63f221b..4e537963b 100644 --- a/internal/backup/encode_s3.go +++ b/internal/backup/encode_s3.go @@ -90,7 +90,7 @@ func (e *S3RecordEncoder) Encode(b *snapshotBuilder) error { // A regular file or symlink here means the dump is // malformed or partially truncated — silently skipping // would let the encoder publish a partial .fsm with - // the affected bucket omitted (codex P2 v32 #904; the + // the affected bucket omitted (codex P2 v31 #904; the // manifest's empty S3 scope from populateAdapterScopes // cannot otherwise distinguish missing bucket from // dumped-empty bucket).