SQS standard SendMessageBatch: option-2 dedup for consumer-interleave / attribute-change edges (follow-up to #923) #931

@bootjp

Follow-up to PR #923 (SQS standard SendMessageBatch double-send fix). #923 eliminated the leader-churn double-send by reusing a stable per-entry identity across the in-process retry, and is claude-approved. Two P2 edge cases that codex raised on #923 are intentionally deferred to this issue, because closing them properly requires the full option-2 dedup machinery (the same shape used for DynamoDB in #920), which is a substantially larger, gated change than the double-send fix.

P2-1 — retry can clobber a concurrent consumer mutation

When attempt 1 commits but returns a retryable conflict (leader churn), the retry re-PUTs the same data/vis/by-age keys, while the OCC transaction's read set covers only [meta, gen]. A concurrent write to the message keys is therefore not detected as a conflict. If a consumer receives or deletes the message during the retry backoff (its own OCC txn mutating those message keys), the retry's second PUT can overwrite the rotated CurrentReceiptToken/VisibleAtMillis or recreate a deleted record. The window is narrower than the original double-send (a consumer must race the retry on a now-visible message) and it affects one message rather than every batch entry, but it is a real interleaving that the simple stable-key fix does not fence.
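The interleaving can be reproduced in a toy model. The sketch below is not the real storage engine: `Store`, `Commit`, the key names, and the `token=...` values are all hypothetical, and only the receive case is modelled (a delete would need tombstones). It assumes a conflict check that rejects a commit when any key in the read set was written after the transaction's start timestamp. In this toy model the fenced retry would also conflict with attempt 1's own write, so the fence on its own only turns the clobber into a visible conflict.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict is returned when a key in the read set changed after startTS.
var ErrConflict = errors.New("occ: read-write conflict")

// versioned is one stored value plus the timestamp of its last write.
type versioned struct {
	value    string
	commitTS uint64
}

// Store is a toy single-node OCC key-value store (hypothetical).
type Store struct {
	data map[string]versioned
	ts   uint64
}

func NewStore() *Store { return &Store{data: map[string]versioned{}} }

// Begin returns the start timestamp for a new transaction.
func (s *Store) Begin() uint64 { return s.ts }

// Commit applies writes only if no key in readKeys was written after startTS.
func (s *Store) Commit(startTS uint64, readKeys []string, writes map[string]string) error {
	for _, k := range readKeys {
		if v, ok := s.data[k]; ok && v.commitTS > startTS {
			return ErrConflict
		}
	}
	s.ts++
	for k, v := range writes {
		s.data[k] = versioned{value: v, commitTS: s.ts}
	}
	return nil
}

// retryAfterConsumer replays the interleaving: attempt 1 commits, a consumer
// rotates the receipt token, then the send retry re-PUTs the original record.
// It returns the final value of the message data key and the retry's error.
func retryAfterConsumer(fenceDataKey bool) (string, error) {
	s := NewStore()
	const dataKey = "data/q1/msg-1"
	sendWrites := map[string]string{dataKey: "token=none"}

	sendStart := s.Begin() // start timestamp reused by the retry
	_ = s.Commit(sendStart, []string{"meta/q1", "gen/q1"}, sendWrites) // attempt 1 lands

	// A consumer receives the message during the retry backoff.
	consumerStart := s.Begin()
	_ = s.Commit(consumerStart, []string{dataKey}, map[string]string{dataKey: "token=r1"})

	readKeys := []string{"meta/q1", "gen/q1"}
	if fenceDataKey {
		readKeys = append(readKeys, dataKey)
	}
	err := s.Commit(sendStart, readKeys, sendWrites) // the retry
	return s.data[dataKey].value, err
}

func main() {
	for _, fenced := range []bool{false, true} {
		v, err := retryAfterConsumer(fenced)
		fmt.Printf("fenced=%v final=%q err=%v\n", fenced, v, err)
	}
}
```

With the narrow read set the retry commits and the consumer's rotated token is lost; with the data key in the read set the retry is rejected and the consumer's write survives.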

P2-2 — committed entry reported Failed[] after retry-time revalidation

If attempt 1 commits and a concurrent SetQueueAttributes tightens a limit (e.g. lowers MaximumMessageSize) before the retry, validateBatchEntry (which now correctly runs on every attempt) rejects the entry into Failed[] even though it is already in the queue. The client sees an inconsistent view: the message is stored but reported as failed. This stays within the standard-queue at-least-once contract and is documented in docs/design/2026_06_03_partial_dynamodb_onephase_dedup.md (S3/SQS section, "Residual edge").
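A minimal sketch of that sequence, under stated assumptions: the `queue` and `entry` types and the `validate` method are hypothetical stand-ins (the real validateBatchEntry signature is not shown in this issue), and the sizes 64, 32, and 16 are arbitrary.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is a simplified batch entry (hypothetical shape).
type entry struct {
	ID   string
	Body string
}

// queue holds the one attribute this sketch cares about plus the stored IDs.
type queue struct {
	maxMessageSize int
	stored         map[string]bool
}

// validate stands in for validateBatchEntry: it checks the entry against the
// queue attributes as they are at the moment of the call.
func (q *queue) validate(e entry) error {
	if len(e.Body) > q.maxMessageSize {
		return fmt.Errorf("entry %s: %d bytes exceeds MaximumMessageSize %d",
			e.ID, len(e.Body), q.maxMessageSize)
	}
	return nil
}

// replay runs attempt 1, a concurrent attribute change, then the retry.
// It reports whether the entry is stored and whether the retry would put it
// into Failed[].
func replay() (stored bool, reportedFailed bool) {
	q := &queue{maxMessageSize: 64, stored: map[string]bool{}}
	e := entry{ID: "e1", Body: strings.Repeat("x", 32)}

	// Attempt 1: validation passes and the write commits, but the caller
	// sees a retryable error (leader churn) and will try again.
	if q.validate(e) == nil {
		q.stored[e.ID] = true
	}

	// A concurrent SetQueueAttributes lowers the limit during the backoff.
	q.maxMessageSize = 16

	// Retry: validation runs again against the new limit and rejects.
	reportedFailed = q.validate(e) != nil
	return q.stored[e.ID], reportedFailed
}

func main() {
	stored, failed := replay()
	fmt.Printf("stored=%v reportedFailed=%v\n", stored, failed)
}
```

Both flags end up true, which is the "stored, but reported failed" view described above.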

Proper fix (option-2 dedup for SQS batch)

Mirror the DynamoDB approach (#920):

  • the adapter allocates commitTS locally (Clock().NextFenced()), gated on coordinator.IsLeader() (leader-issued ts);
  • on retry, it reuses the write set carrying PrevCommitTS and fences the message data keys in ReadKeys (stable StartTS);
  • the FSM exact-ts probe (dedupProbeOnePhase) no-ops the apply if attempt 1's primary key landed at PrevCommitTS, so the retry does NOT re-write: a consumer mutation is preserved, and a committed entry is reported as success (cached results) rather than failed;
  • all of this sits behind a default-off gate (R5 ship-reader-before-writer), like the Redis/DynamoDB dedup.
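The probe step in the list above can be sketched as follows. This is an illustration only, not the dedupProbeOnePhase implementation: the `fsm`, `request`, `landedAt`, and `apply` names, the field layout, and the timestamps 10, 11, and 12 are assumptions made for the example, and the read-key fence, the leader check, and the feature gate are left out.

```go
package main

import "fmt"

// version is one committed value of a key.
type version struct {
	commitTS uint64
	value    string
}

// fsm is a toy MVCC state machine; each key keeps its versions in apply order.
type fsm struct {
	versions map[string][]version
}

func newFSM() *fsm { return &fsm{versions: map[string][]version{}} }

// request is a one-phase write. PrevCommitTS is 0 on the first attempt and
// carries the first attempt's commitTS on a retry.
type request struct {
	PrimaryKey   string
	Writes       map[string]string
	CommitTS     uint64
	PrevCommitTS uint64
}

// landedAt is the exact-ts probe: does key have a version at exactly ts?
func (f *fsm) landedAt(key string, ts uint64) bool {
	for _, v := range f.versions[key] {
		if v.commitTS == ts {
			return true
		}
	}
	return false
}

// apply writes the request unless the probe shows an earlier attempt landed.
// It reports whether it actually wrote.
func (f *fsm) apply(r request) bool {
	if r.PrevCommitTS != 0 && f.landedAt(r.PrimaryKey, r.PrevCommitTS) {
		return false // attempt 1 is already durable, so do not re-write
	}
	for k, v := range r.Writes {
		f.versions[k] = append(f.versions[k], version{commitTS: r.CommitTS, value: v})
	}
	return true
}

// latest returns the most recently applied value of key, or "" if absent.
func (f *fsm) latest(key string) string {
	vs := f.versions[key]
	if len(vs) == 0 {
		return ""
	}
	return vs[len(vs)-1].value
}

// replay runs the send, an optional consumer receive, then the send retry.
func replay(attempt1Lands bool) (retryWrote bool, final string) {
	const key = "data/q1/msg-1"
	f := newFSM()
	send := request{PrimaryKey: key, Writes: map[string]string{key: "token=none"}, CommitTS: 10}
	if attempt1Lands {
		f.apply(send)
		// A consumer rotates the receipt token before the retry arrives.
		f.apply(request{PrimaryKey: key, Writes: map[string]string{key: "token=r1"}, CommitTS: 11})
	}
	retry := send
	retry.PrevCommitTS, retry.CommitTS = 10, 12
	retryWrote = f.apply(retry)
	return retryWrote, f.latest(key)
}

func main() {
	for _, landed := range []bool{true, false} {
		wrote, final := replay(landed)
		fmt.Printf("attempt1Lands=%v retryWrote=%v final=%q\n", landed, wrote, final)
	}
}
```

When attempt 1 landed, the retry is a no-op and the consumer's token survives, so the caller can report success from cached results. When attempt 1 never landed, the probe finds nothing and the retry writes normally.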

This is a gated feature, not a hotfix, so it is tracked separately from the double-send correctness fix.
