fix(outbox): pg_notify failure is a soft failure, not a tx rollback (C2)#192
Open
jiashuoz wants to merge 1 commit into
Open
fix(outbox): pg_notify failure is a soft failure, not a tx rollback (C2)#192jiashuoz wants to merge 1 commit into
jiashuoz wants to merge 1 commit into
Conversation
`writeOutboxRow`'s pg_notify call returned its error, propagating out
of PublishTx → out of the caller's WithTx → rolling back the entire
trigger transaction. The inline comment said the opposite of what
the code did:
"NOTIFY queue overflow is the only realistic error … Treat as
soft failure: log and let the tx commit."
In the relay path the impact is catastrophic: when the NOTIFY queue
backs up (max_notify_queue_pages defaults to 1024 × 8KB = 8MB, NOT
"8GB" as the comment claimed), every inbound message rolls back —
both the webhook_events row AND the messages row. The relay logs
"failed to record inbound message" and SMTP returns a 4xx to the
MTA. The MTA retries into the same broken state. Inbound mail piles
up undelivered with no recovery path until ops clears the queue.
In the publishPendingApproval / publishRejected paths the impact is
milder (each runs its own standalone WithTx wrapping only the outbox
row), but the event is still silently dropped because that tx rolls
back.
The webhook_events INSERT having succeeded is enough for at-least-
once: the worker has a 1-second fallback poll that picks up rows
regardless of whether NOTIFY fired. NOTIFY is a latency optimization
(sub-100ms wake vs. up-to-1s poll wake), not a correctness primitive.
Fix: log the pg_notify error and return nil. Matches what the
original comment promised; expanded comment explains the trade-off
and points at this audit finding (C2).
Tests:
- TestWriteOutboxRow_PgNotifyFailureIsSoft — extends the existing
fakeExec to fail selectively on pg_notify; asserts writeOutboxRow
returns nil (so the caller's tx commits) and both Exec calls were
still attempted.
- TestWriteOutboxRow_InsertFailureStillPropagates — asserts the
INSERT-failure path STILL propagates the error (we only softened
pg_notify, not all Exec errors), and pg_notify is never called
after an INSERT failure.
- Existing happy-path test continues to pass unchanged.
PublishBestEffortTx (publishApproved) was already correct — it logs
and returns regardless of writeOutboxRow's outcome because SES has
already accepted the send. Only PublishTx had the bug.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Second fix from the audit. `writeOutboxRow`'s pg_notify call returned its error, propagating out of `PublishTx` → out of the caller's `WithTx` → rolling back the entire trigger transaction. The inline comment said the opposite of what the code did:
Blast radius
internal/relay/server.go:377-389)internal/agent/api.go:313)internal/agent/api.go:348)Also: the comment's claim about queue size was wrong. `max_notify_queue_pages` defaults to 1024 × 8KB = 8MB, not the "8GB" the comment said. Under sustained high-throughput writes from many concurrent transactions, 8MB is reachable.
The fix
```go
if _, nerr := exec.Exec(ctx, `SELECT pg_notify('webhook_events_new', '')`); nerr != nil {
log.Printf("[outbox] pg_notify webhook_events_new failed (event=%s, type=%s): %v — worker fallback poll will recover",
e.ID, e.Type, nerr)
}
return nil
```
The webhook_events INSERT having succeeded is enough for at-least-once: the worker has a 1-second fallback poll that picks up the row regardless of whether NOTIFY fired. NOTIFY is a latency optimization (sub-100ms wake vs. up-to-1s poll wake), not a correctness primitive.
Tests
Three tests, two new + one untouched-but-relevant happy path:
Plus the integration suite (`-tags integration`) still passes.
Verification
Other findings
C1 merged in #191. C3-C5 (Critical), H1-H8 (High), plus M/L items tracked separately.
🤖 Generated with Claude Code