fix(nilcc-api): serialize multi-replica payment poller #556

Merged: jcabrero merged 2 commits into main from fix/repeated-txhash-warning on Apr 13, 2026

jcabrero commented Apr 13, 2026

Summary

When nilcc-api runs with desired_count=2 (prod), both replicas poll the burn contract against a single shared block_cursors row with no locking. This branch fixes the race in two stages and also hardens startup/shutdown in main.ts:

  • Treat duplicate payment inserts as idempotent no-ops. When two replicas race on the same event, the loser fails the payments.tx_hash unique constraint. Previously that bubbled up as a WARN and stalled the poller cursor, causing repeated rescans. Now PaymentService.processEvent catches the unique-constraint violation (via the existing isUniqueConstraint helper), logs INFO Payment already processed: <txHash>, and returns null. Defense-in-depth only — kept as a safety net for manual cursor rollbacks, restart replays, and any future caller of processEvent outside the poller.

  • Serialize the poller on the cursor row. Root-cause fix. doPoll now runs inside a single transaction that begins with SELECT last_processed_block FROM block_cursors WHERE id = $1 FOR UPDATE. Replicas serialize inside Postgres: the waiter blocks until the holder commits, then reads the already-advanced cursor and polls only the new range — no duplicate getLogs, no duplicate processEvent, and the read-modify-write of the cursor is atomic so it cannot regress. PaymentPoller.start() seeds the cursor row via INSERT ... ON CONFLICT DO NOTHING so FOR UPDATE always has a row to lock.

  • Startup/shutdown hardening in main.ts. The poller now starts after SIGTERM/SIGINT handlers are installed, and a start() rejection triggers shutdown() so a poller init failure cannot leave the process up with a half-constructed poller. shutdown() has a reentrancy guard to prevent concurrent signals from double-closing servers.

github-actions Bot commented Apr 13, 2026

Coverage Report for nilcc-api

| Status | Category | Percentage | Covered / Total |
|---|---|---|---|
| 🔵 | Lines | 80.73% | 5160 / 6391 |
| 🔵 | Statements | 80.73% | 5160 / 6391 |
| 🔵 | Functions | 83.71% | 257 / 307 |
| 🔵 | Branches | 84.65% | 513 / 606 |

File Coverage (Changed Files)

| File | Stmts | Branches | Functions | Lines | Uncovered Lines |
|---|---|---|---|---|---|
| nilcc-api/src/main.ts | 0% | 0% | 0% | 0% | 1-98 |
| nilcc-api/src/payment/payment-poller.ts | 8.78% | 100% | 0% | 8.78% | 15-19, 22-46, 49-54, 57-70, 73-184 |
| nilcc-api/src/payment/payment.service.ts | 85.04% | 76.47% | 100% | 85.04% | 25-26, 65-67, 79-83, 119-124 |
Generated in workflow #207 for commit c8e5b00 by the Vitest Coverage Report Action

With desired_count=2 both replicas share a single block_cursors row and
both poll the burn contract. When they race on the same event, one
replica wins the transaction and the other fails the payments.tx_hash
unique constraint. The QueryFailedError was bubbling up to the poller
as a WARN plus a refusal to advance the cursor, causing every
subsequent poll to rescan the same block range.

Catch the unique-constraint violation inside PaymentService.processEvent
via the existing isUniqueConstraint helper, log at INFO ("Payment
already processed: <txHash>"), and return null. The poller sees a
successful no-op and advances the cursor normally; the row has already
been committed by the winning replica.
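
The catch-and-return-null behaviour described above can be sketched roughly as follows. This is an illustration only, not the code from payment.service.ts: the `BurnEvent` and `Payment` shapes, the injected `insert` and `log` parameters, and `makeInMemoryInsert` are hypothetical, and `isUniqueConstraint` here is a stand-in for the existing helper, assumed to check for Postgres SQLSTATE 23505 (unique_violation).

```typescript
// Hypothetical shapes for illustration; the real nilcc-api types may differ.
interface BurnEvent { txHash: string; amount: string }
interface Payment { txHash: string; amount: string }
type InsertPayment = (event: BurnEvent) => Promise<Payment>;

// Stand-in for the existing isUniqueConstraint helper (its real body is not shown in this PR).
// Assumption: Postgres reports unique violations as SQLSTATE 23505, either on the
// error itself or on a wrapped driverError.
function isUniqueConstraint(err: unknown): boolean {
  const e = err as { code?: string; driverError?: { code?: string } } | null | undefined;
  return e?.code === "23505" || e?.driverError?.code === "23505";
}

// Treat a duplicate insert as an idempotent no-op; rethrow anything else.
async function processEvent(
  insert: InsertPayment,
  event: BurnEvent,
  log: (msg: string) => void,
): Promise<Payment | null> {
  try {
    return await insert(event);
  } catch (err) {
    if (isUniqueConstraint(err)) {
      log(`Payment already processed: ${event.txHash}`);
      return null; // the winning replica already committed this row
    }
    throw err; // unrelated failures must still surface to the poller
  }
}

// In-memory stand-in for the payments table with a unique tx_hash column.
function makeInMemoryInsert(): InsertPayment {
  const seen = new Set<string>();
  return async (event) => {
    if (seen.has(event.txHash)) {
      throw Object.assign(new Error("duplicate key value violates unique constraint"), { code: "23505" });
    }
    seen.add(event.txHash);
    return { txHash: event.txHash, amount: event.amount };
  };
}
```

Returning null rather than throwing is what lets the caller distinguish "nothing to do" from a real failure, so only genuine errors hold the cursor back.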

This is a defense-in-depth fix only — the follow-up commit eliminates
the race entirely by serializing the poller on the cursor row. Keeping
the catch guards against manual cursor rollbacks, restart replays, and
any future caller of processEvent outside the poller.

With desired_count=2 both replicas were polling the burn contract
independently, racing on the shared block_cursors row. That produced
duplicate getLogs calls, duplicate processEvent invocations, and
non-monotonic cursor writes that caused repeated block rescans.
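
One interleaving that would produce this kind of regression is sketched below as a deterministic simulation. It is an assumption about the failure mode, not a trace from production: the block numbers are made up, and the premise that replica B observes an older chain head than replica A (for example from a lagging RPC node) is hypothetical.

```typescript
// The shared block_cursors row, modelled as a plain mutable value.
interface CursorStore { value: number }

interface RaceResult {
  ranges: Array<[number, number]>; // block ranges each replica scanned
  history: number[];               // cursor value after each write
}

// Two unserialized read-modify-write cycles, interleaved by hand.
function simulateUnserializedRace(): RaceResult {
  const store: CursorStore = { value: 100 };
  const ranges: Array<[number, number]> = [];
  const history: number[] = [];

  // Both replicas read the cursor before either one writes it back.
  const readByA = store.value;
  const readByB = store.value;

  // Replica A sees head 120 (assumed), scans 101..120 and writes 120.
  ranges.push([readByA + 1, 120]);
  store.value = 120;
  history.push(store.value);

  // Replica B saw an older head 110 (assumed), rescans 101..110 and writes 110.
  ranges.push([readByB + 1, 110]);
  store.value = 110;
  history.push(store.value);

  return { ranges, history };
}
```

Both replicas scan blocks 101 to 110, and the final write moves the cursor from 120 back to 110, so the next poll rescans blocks that were already processed.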

Wrap the entire doPoll body in a single transaction that begins with
SELECT last_processed_block FROM block_cursors WHERE id = $1 FOR UPDATE.
The row lock serializes replicas inside Postgres: the waiter blocks
until the holder commits, then reads the already-advanced cursor and
polls only the new range — no wasted RPC, no duplicate processing, and
the read-modify-write of the cursor is atomic so it can no longer
regress. PaymentPoller.start() seeds the cursor row via INSERT ... ON
CONFLICT DO NOTHING so FOR UPDATE always has a row to lock.
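
A minimal sketch of the locking pattern described above, written against a small hypothetical `Tx` / `RunInTransaction` abstraction rather than the project's actual database layer. The `CURSOR_ID` value, the `handleRange` callback, and the function names `seedCursor` and `pollOnce` are illustrative assumptions; only the SELECT ... FOR UPDATE statement and the INSERT ... ON CONFLICT DO NOTHING seeding come from this commit message.

```typescript
// Hypothetical minimal transaction abstraction; the real code may use an ORM instead.
interface Tx {
  query(sql: string, params: unknown[]): Promise<Array<Record<string, unknown>>>;
}
type RunInTransaction = <T>(work: (tx: Tx) => Promise<T>) => Promise<T>;

const CURSOR_ID = "payment-poller"; // assumed cursor id, for illustration only

// Seed the cursor row once so FOR UPDATE always has a row to lock.
async function seedCursor(run: RunInTransaction, startBlock: number): Promise<void> {
  await run((tx) =>
    tx.query(
      "INSERT INTO block_cursors (id, last_processed_block) VALUES ($1, $2) ON CONFLICT DO NOTHING",
      [CURSOR_ID, startBlock],
    ),
  );
}

// One poll: lock the cursor row, process only blocks after it, advance it, commit.
async function pollOnce(
  run: RunInTransaction,
  latestBlock: number,
  handleRange: (from: number, to: number) => Promise<void>,
): Promise<number> {
  return run(async (tx) => {
    // A second replica blocks on this statement until the first one commits.
    const rows = await tx.query(
      "SELECT last_processed_block FROM block_cursors WHERE id = $1 FOR UPDATE",
      [CURSOR_ID],
    );
    if (rows.length === 0) throw new Error("cursor row missing; seedCursor must run first");
    const last = Number(rows[0].last_processed_block);
    if (last >= latestBlock) return last; // nothing new; never move the cursor backwards
    await handleRange(last + 1, latestBlock);
    await tx.query(
      "UPDATE block_cursors SET last_processed_block = $2 WHERE id = $1",
      [CURSOR_ID, latestBlock],
    );
    return latestBlock;
  });
}
```

Because the range is processed inside the same transaction that holds the row lock, the lock is held for the duration of the RPC calls. That is the trade-off this approach accepts in exchange for strict serialization.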

Also harden the startup path in main.ts: move paymentPoller.start()
after the SIGTERM/SIGINT handlers are installed and trigger shutdown if
start() rejects, so a poller init failure cannot leave the process up
with signal handlers pointing at a half-constructed poller. Add a
reentrancy guard to shutdown() so concurrent signals don't double-close
the servers.
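
The startup ordering and the reentrancy guard could look roughly like the sketch below. It is not the actual main.ts: the `Closable` and `Poller` interfaces (including a `stop()` method on the poller), the injected `exit` and `onSignal` callbacks, and the exit codes are assumptions made so the example stays self-contained.

```typescript
// Hypothetical interfaces for illustration; the real servers and poller may differ.
interface Closable { close(): Promise<void> }
interface Poller { start(): Promise<void>; stop(): Promise<void> }
type Shutdown = (code?: number) => Promise<void>;

// Build a shutdown function that runs its body at most once.
function createShutdown(servers: Closable[], poller: Poller, exit: (code: number) => void): Shutdown {
  let shuttingDown = false; // reentrancy guard
  return async function shutdown(code = 0): Promise<void> {
    if (shuttingDown) return; // a second signal must not double-close the servers
    shuttingDown = true;
    await poller.stop();
    await Promise.all(servers.map((s) => s.close()));
    exit(code);
  };
}

// Install signal handlers first, then start the poller; a failed start shuts down.
async function boot(
  poller: Poller,
  shutdown: Shutdown,
  onSignal: (signal: string, handler: () => void) => void,
): Promise<void> {
  onSignal("SIGTERM", () => void shutdown(0));
  onSignal("SIGINT", () => void shutdown(0));
  try {
    await poller.start();
  } catch (_err) {
    await shutdown(1); // do not leave the process up with a half-constructed poller
  }
}
```

In a Node.js entry point, `onSignal` could be wired as `(sig, handler) => process.once(sig, handler)` and `exit` as `process.exit`. Note that with this guard a second caller returns immediately instead of waiting for the in-flight shutdown to finish.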

Verified with two nilcc-api replicas in docker-compose against shared
Postgres and anvil: ranges alternate cleanly between replicas with no
overlap, cursor advances monotonically, and burn events are processed
exactly once.
jcabrero changed the title from "fix: silence repeated tx hash warning on multi-replica deployments" to "fix(nilcc-api): serialize multi-replica payment poller" on Apr 13, 2026
jcabrero force-pushed the fix/repeated-txhash-warning branch from 001ef2e to c8e5b00 on April 13, 2026 14:10
jcabrero merged commit 9d29645 into main on Apr 13, 2026
3 checks passed
jcabrero deleted the fix/repeated-txhash-warning branch on April 13, 2026 14:17