fix(nilcc-api): serialize multi-replica payment poller #556
Merged
With desired_count=2 both replicas share a single block_cursors row and
both poll the burn contract. When they race on the same event, one
replica wins the transaction and the other fails the payments.tx_hash
unique constraint. The resulting QueryFailedError bubbled up to the
poller, which logged a WARN and refused to advance the cursor, so every
subsequent poll rescanned the same block range.

Catch the unique-constraint violation inside PaymentService.processEvent
via the existing isUniqueConstraint helper, log at INFO ("Payment
already processed: <txHash>"), and return null. The poller sees a
successful no-op and advances the cursor normally; the row has already
been committed by the winning replica.

This is a defense-in-depth fix only: the follow-up commit eliminates
the race entirely by serializing the poller on the cursor row. Keeping
the catch guards against manual cursor rollbacks, restart replays, and
any future caller of processEvent outside the poller.
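
The idempotent-insert behavior described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: the in-memory store, the UniqueConstraintError class, the event shape, and this processEvent signature are all hypothetical stand-ins, and the isUniqueConstraint shown here only imitates the project's existing helper, whose implementation is not part of this text.

```typescript
// Hypothetical stand-in for the database error raised on a duplicate tx_hash.
class UniqueConstraintError extends Error {
  constructor(constraint: string) {
    super(`duplicate key violates unique constraint ${constraint}`);
    this.name = "UniqueConstraintError";
  }
}

// Hypothetical event and row shapes, reduced to what the sketch needs.
interface BurnEvent {
  txHash: string;
  amount: string;
}
interface Payment {
  txHash: string;
  amount: string;
}

// In-memory stand-in for the payments table with a unique tx_hash column.
class InMemoryPayments {
  private readonly rows = new Map<string, Payment>();

  insert(payment: Payment): Payment {
    if (this.rows.has(payment.txHash)) {
      throw new UniqueConstraintError("payments.tx_hash");
    }
    this.rows.set(payment.txHash, payment);
    return payment;
  }

  count(): number {
    return this.rows.size;
  }
}

// Imitation of the existing helper: recognizes only the stand-in error above.
function isUniqueConstraint(e: unknown): boolean {
  return e instanceof Error && e.name === "UniqueConstraintError";
}

// Returns the new payment, or null when another caller already stored it.
// Any other error is rethrown so the caller can still refuse to advance.
function processEvent(
  store: InMemoryPayments,
  event: BurnEvent,
  logInfo: (msg: string) => void,
): Payment | null {
  try {
    return store.insert({ txHash: event.txHash, amount: event.amount });
  } catch (e) {
    if (isUniqueConstraint(e)) {
      logInfo(`Payment already processed: ${event.txHash}`);
      return null;
    }
    throw e;
  }
}
```

Calling processEvent twice with the same txHash stores one row; the second call logs at INFO and returns null instead of throwing, which is what lets the poller treat the duplicate as a successful no-op.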
With desired_count=2 both replicas were polling the burn contract independently, racing on the shared block_cursors row. That produced duplicate getLogs calls, duplicate processEvent invocations, and non-monotonic cursor writes that caused repeated block rescans.

Wrap the entire doPoll body in a single transaction that begins with SELECT last_processed_block FROM block_cursors WHERE id = $1 FOR UPDATE. The row lock serializes replicas inside Postgres: the waiter blocks until the holder commits, then reads the already-advanced cursor and polls only the new range. There is no wasted RPC and no duplicate processing, and the read-modify-write of the cursor is atomic, so it can no longer regress. PaymentPoller.start() seeds the cursor row via INSERT ... ON CONFLICT DO NOTHING so FOR UPDATE always has a row to lock.

Also harden the startup path in main.ts: move paymentPoller.start() after the SIGTERM/SIGINT handlers are installed and trigger shutdown if start() rejects, so a poller init failure cannot leave the process up with signal handlers pointing at a half-constructed poller. Add a reentrancy guard to shutdown() so concurrent signals don't double-close the servers.

Verified with two nilcc-api replicas in docker-compose against shared Postgres and anvil: ranges alternate cleanly between replicas with no overlap, the cursor advances monotonically, and burn events are processed exactly once.
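
To illustrate why holding the cursor row lock for the whole poll removes the race, here is a small in-memory simulation. It is a sketch under stated assumptions, not the PR's implementation: LockedCursor, withRowLock, PolledRange, and this doPoll signature are hypothetical, and a promise chain plays the role that the transaction with SELECT ... FOR UPDATE plays in Postgres.

```typescript
// In-memory stand-in for one block_cursors row guarded by a row lock.
// withRowLock imitates: BEGIN; SELECT ... FOR UPDATE; <body>; COMMIT.
class LockedCursor {
  private lastProcessedBlock: number;
  private tail: Promise<void> = Promise.resolve();

  constructor(initial: number) {
    this.lastProcessedBlock = initial;
  }

  // Runs body while holding the lock. Later callers wait for earlier ones
  // and read the cursor only after the earlier holder has committed.
  withRowLock<T>(
    body: (last: number, advance: (to: number) => void) => Promise<T>,
  ): Promise<T> {
    const run = this.tail.then(async () => {
      let staged = this.lastProcessedBlock;
      const result = await body(staged, (to) => {
        staged = to;
      });
      // Commit: the new cursor becomes visible only if body did not throw.
      this.lastProcessedBlock = staged;
      return result;
    });
    // A failed body acts like a rollback: the lock is still released.
    this.tail = run.then(() => undefined, () => undefined);
    return run;
  }

  peek(): number {
    return this.lastProcessedBlock;
  }
}

interface PolledRange {
  replica: string;
  from: number;
  to: number;
}

// One poll iteration: lock, read cursor, handle (last, head], advance.
async function doPoll(
  replica: string,
  cursor: LockedCursor,
  getHead: () => Promise<number>,
  polled: PolledRange[],
): Promise<void> {
  await cursor.withRowLock(async (last, advance) => {
    const head = await getHead();
    if (head <= last) return; // nothing new since the last commit
    // The real poller would fetch logs and process events for this range.
    polled.push({ replica, from: last + 1, to: head });
    advance(head);
  });
}
```

When two simulated replicas call doPoll at the same time, the second one starts its body only after the first has committed, so it sees the advanced cursor and either polls a strictly newer range or does nothing.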
001ef2e to c8e5b00
Summary
When nilcc-api runs with desired_count=2 (prod), both replicas poll the burn contract against a single shared block_cursors row with no locking. This branch fixes the race in two stages and hardens startup and shutdown:

1. Treat duplicate payment inserts as idempotent no-ops. When two replicas race on the same event, the loser fails the payments.tx_hash unique constraint. Previously that bubbled up as a WARN and stalled the poller cursor, causing repeated rescans. Now PaymentService.processEvent catches the unique-constraint violation (via the existing isUniqueConstraint helper), logs INFO "Payment already processed: <txHash>", and returns null. This is defense-in-depth only, kept as a safety net for manual cursor rollbacks, restart replays, and any future caller of processEvent outside the poller.

2. Serialize the poller on the cursor row. This is the root-cause fix. doPoll now runs inside a single transaction that begins with SELECT last_processed_block FROM block_cursors WHERE id = $1 FOR UPDATE. Replicas serialize inside Postgres: the waiter blocks until the holder commits, then reads the already-advanced cursor and polls only the new range. There is no duplicate getLogs and no duplicate processEvent, and the read-modify-write of the cursor is atomic, so it cannot regress. PaymentPoller.start() seeds the cursor row via INSERT ... ON CONFLICT DO NOTHING so FOR UPDATE always has a row to lock.

3. Startup/shutdown hardening in main.ts. The poller now starts after the SIGTERM/SIGINT handlers are installed, and a start() rejection triggers shutdown(), so a poller init failure cannot leave the process up with a half-constructed poller. shutdown() has a reentrancy guard to prevent concurrent signals from double-closing servers.
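
The startup and shutdown ordering in the last item can be sketched as below. This is one possible shape, not the code in main.ts: makeShutdown, installThenStart, Closeable, and the onSignal callback are hypothetical names, and caching the in-flight promise is just one way to build a reentrancy guard (a boolean flag would also work).

```typescript
// Hypothetical minimal shape for anything that needs closing on shutdown.
interface Closeable {
  close: () => Promise<void>;
}

// Builds a shutdown function that runs its body at most once. Repeated or
// concurrent calls get the same in-flight promise instead of closing again.
function makeShutdown(servers: Closeable[]): () => Promise<void> {
  let inFlight: Promise<void> | null = null;
  return () => {
    if (inFlight === null) {
      inFlight = (async () => {
        for (const server of servers) {
          await server.close();
        }
      })();
    }
    return inFlight;
  };
}

// Installs signal handlers first, then starts the poller. If start()
// rejects, shutdown runs so the process is not left half-initialized.
// In a Node.js entry point, onSignal could be (s, h) => process.on(s, h).
async function installThenStart(
  onSignal: (signal: "SIGTERM" | "SIGINT", handler: () => void) => void,
  poller: { start: () => Promise<void> },
  shutdown: () => Promise<void>,
): Promise<void> {
  onSignal("SIGTERM", () => void shutdown());
  onSignal("SIGINT", () => void shutdown());
  try {
    await poller.start();
  } catch {
    await shutdown();
  }
}
```

With this shape, two signals arriving back to back both call shutdown(), but each server's close() is invoked only once because the second call reuses the first call's promise.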