Skip to content

fix(outbox): pg_notify failure is a soft failure, not a tx rollback (C2)#192

Open
jiashuoz wants to merge 1 commit into
mainfrom
fix/c2-pg-notify-soft-failure
Open

fix(outbox): pg_notify failure is a soft failure, not a tx rollback (C2)#192
jiashuoz wants to merge 1 commit into
mainfrom
fix/c2-pg-notify-soft-failure

Conversation

@jiashuoz
Copy link
Copy Markdown
Member

@jiashuoz jiashuoz commented Jun 5, 2026

Summary

Second fix from the audit. `writeOutboxRow`'s pg_notify call returned its error, propagating out of `PublishTx` → out of the caller's `WithTx` → rolling back the entire trigger transaction. The inline comment said the opposite of what the code did:

"NOTIFY queue overflow is the only realistic error … Treat as soft failure: log and let the tx commit."

Blast radius

Caller Pre-fix behavior on pg_notify failure
relay (internal/relay/server.go:377-389) Catastrophic. The outbox `PublishTx` is in the same tx as `CreateInboundMessageInTx`. NOTIFY fails → both rows roll back → SMTP returns 4xx → MTA retries into the same broken state. Inbound mail piles up undelivered.
publishPendingApproval (internal/agent/api.go:313) Event silently dropped. The standalone outbox tx rolls back; the underlying pending-approval state was already committed in a different tx. User gets a successful API response but no `email.pending_approval` event fires.
publishRejected (internal/agent/api.go:348) Same as pending: event silently dropped.
publishApproved (uses `PublishBestEffortTx`) Already correct — best-effort variant logs and returns regardless.

Also: the comment's claim about queue size was wrong. `max_notify_queue_pages` defaults to 1024 × 8KB = 8MB, not the "8GB" the comment said. Under sustained high-throughput writes from many concurrent transactions, 8MB is reachable.

The fix

```go
if _, nerr := exec.Exec(ctx, `SELECT pg_notify('webhook_events_new', '')`); nerr != nil {
log.Printf("[outbox] pg_notify webhook_events_new failed (event=%s, type=%s): %v — worker fallback poll will recover",
e.ID, e.Type, nerr)
}
return nil
```

The webhook_events INSERT having succeeded is enough for at-least-once: the worker has a 1-second fallback poll that picks up the row regardless of whether NOTIFY fired. NOTIFY is a latency optimization (sub-100ms wake vs. up-to-1s poll wake), not a correctness primitive.

Tests

Three tests, two new + one untouched-but-relevant happy path:

Test What it pins
`TestWriteOutboxRow_PgNotifyFailureIsSoft` (new) Extends `fakeExec` to fail selectively on pg_notify; asserts `writeOutboxRow` returns nil so caller's tx commits. Both Exec calls (INSERT + NOTIFY) are still attempted.
`TestWriteOutboxRow_InsertFailureStillPropagates` (new) The INSERT-failure path STILL propagates — we only softened pg_notify. Also pins that pg_notify is never called after an INSERT failure.
`TestWriteOutboxRow_HappyPath_TwoExecCalls` (existing) Continues to pass: both Exec calls fire on the happy path.

Plus the integration suite (`-tags integration`) still passes.

Verification

  • `go test ./internal/webhookpub/` — passes
  • `go test ./internal/webhookpub/ -tags integration` — passes
  • Manually verified the log line is emitted with event ID + type for diagnosability

Other findings

C1 merged in #191. C3-C5 (Critical), H1-H8 (High), plus M/L items tracked separately.

🤖 Generated with Claude Code

`writeOutboxRow`'s pg_notify call returned its error, propagating out
of PublishTx → out of the caller's WithTx → rolling back the entire
trigger transaction. The inline comment said the opposite of what
the code did:

    "NOTIFY queue overflow is the only realistic error … Treat as
     soft failure: log and let the tx commit."

In the relay path the impact is catastrophic: when the NOTIFY queue
backs up (max_notify_queue_pages defaults to 1024 × 8KB = 8MB, NOT
"8GB" as the comment claimed), every inbound message rolls back —
both the webhook_events row AND the messages row. The relay logs
"failed to record inbound message" and SMTP returns a 4xx to the
MTA. The MTA retries into the same broken state. Inbound mail piles
up undelivered with no recovery path until ops clears the queue.

In the publishPendingApproval / publishRejected paths the impact is
milder (each runs its own standalone WithTx wrapping only the outbox
row), but the event is still silently dropped because that tx rolls
back.

The webhook_events INSERT having succeeded is enough for at-least-
once: the worker has a 1-second fallback poll that picks up rows
regardless of whether NOTIFY fired. NOTIFY is a latency optimization
(sub-100ms wake vs. up-to-1s poll wake), not a correctness primitive.

Fix: log the pg_notify error and return nil. Matches what the
original comment promised; expanded comment explains the trade-off
and points at this audit finding (C2).

Tests:

- TestWriteOutboxRow_PgNotifyFailureIsSoft — extends the existing
  fakeExec to fail selectively on pg_notify; asserts writeOutboxRow
  returns nil (so the caller's tx commits) and both Exec calls were
  still attempted.
- TestWriteOutboxRow_InsertFailureStillPropagates — asserts the
  INSERT-failure path STILL propagates the error (we only softened
  pg_notify, not all Exec errors), and pg_notify is never called
  after an INSERT failure.
- Existing happy-path test continues to pass unchanged.

PublishBestEffortTx (publishApproved) was already correct — it logs
and returns regardless of writeOutboxRow's outcome because SES has
already accepted the send. Only PublishTx had the bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant