Exponential backoff in runInLoop on failure #2959
Merged
Previously, any error thrown inside a runInLoop callback would propagate all the way out of the worker module, the dynamic import in index.ts would reject, and the process would sleep 250ms and exit(1) for PM2 to restart. During a real outage (e.g. Postgres at max_connections), every runInLoop-based worker (inserter, scanner, scanner2, rater, monitor, cleanup, ...) crashes and restarts ~4x per second, each restart reopening its Knex pool against an already-saturated DB and losing in-flight Promise.all work. The system has no way to slow down.

This change wraps func() in try/catch, logs the error (still loudly, via console.error, with a consecutive-failure count), and sleeps with exponential backoff (250ms base, doubling, capped at 30s) plus jitter before retrying. On success, the failure counter resets.

The existing lastRun:<APP_NAME> health beacon is only refreshed on success, so a worker stuck in the failure path will still trip HEALTH_TIMEOUT-based alerting. Workers that fail for non-transient reasons (bad code, missing env) therefore still surface; they just don't take the DB down with them on the way out.

Co-authored-by: Cursor <cursoragent@cursor.com>
howardchung
reviewed
May 18, 2026
howardchung
reviewed
May 18, 2026
howardchung
reviewed
May 18, 2026
After sustained failure the worker should still exit so PM2 can do a clean-slate restart (fresh Knex pool, cleared in-memory state), rather than retrying at the 30s backoff cap forever.

Threshold: 10 consecutive failures. With the existing backoff curve (0.5s, 1s, 2s, 4s, 8s, 16s, 30s, 30s, ...) that is 2 to 2.5 minutes of sustained failure before exiting: well past any transient DB blip, but short enough that a genuinely-broken worker still surfaces via crash + restart instead of silently retrying forever.

Co-authored-by: Cursor <cursoragent@cursor.com>
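The 2 to 2.5 minute figure can be checked with a little arithmetic. The TypeScript sketch below is not the code from this PR. It assumes a jitter-free delay of 250ms × 2^n after the nth consecutive failure (which reproduces the 0.5s, 1s, 2s, ... curve quoted above) and ignores the time spent inside the callback itself. The helper names are hypothetical.

```typescript
// Assumed constants, taken from the numbers quoted in the commit messages.
const BASE_DELAY_MS = 250;
const MAX_DELAY_MS = 30_000;
const MAX_CONSECUTIVE_FAILURES = 10;

// Jitter-free delay slept after the nth consecutive failure (n starts at 1).
const delayAfterFailure = (n: number): number =>
  Math.min(BASE_DELAY_MS * 2 ** n, MAX_DELAY_MS);

// Delays slept after failures 1..count, in order.
function schedule(count: number): number[] {
  return Array.from({ length: count }, (_, i) => delayAfterFailure(i + 1));
}

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

// If the worker exits on the 10th failure without sleeping, 9 delays are slept.
const lowerBoundMs = sum(schedule(MAX_CONSECUTIVE_FAILURES - 1));
// If it also sleeps after the 10th failure, all 10 delays count.
const upperBoundMs = sum(schedule(MAX_CONSECUTIVE_FAILURES));

console.log(schedule(MAX_CONSECUTIVE_FAILURES).map((ms) => ms / 1000));
console.log(lowerBoundMs / 1000, upperBoundMs / 1000);
```

Under these assumptions the two totals are 121.5s and 151.5s, which brackets the quoted range. Jitter and slow-failing callbacks would push the real figure somewhat higher.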
0042fea to
910a926
howardchung
approved these changes
May 18, 2026
Previously, any error thrown inside a runInLoop callback would propagate
all the way out of the worker module, the dynamic import in index.ts
would reject, and the process would sleep 250ms and exit(1) for PM2 to
restart. During a real outage (e.g. Postgres at max_connections), every
runInLoop-based worker (inserter, scanner, scanner2, rater, monitor,
cleanup, ...) crashes and restarts ~4x per second, each restart
reopening its Knex pool against an already-saturated DB and losing
in-flight Promise.all work. The system has no way to slow down.

This change wraps func() in try/catch, logs the error (still loudly,
via console.error, with a consecutive-failure count), and sleeps with
exponential backoff (250ms base, doubling, capped at 30s) plus jitter
before retrying. On success, the failure counter resets.
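A minimal TypeScript sketch of the delay computation described above. It is not the code merged in this PR: the 20% jitter fraction, the injectable random source, and the function and constant names are assumptions for illustration, since the description only says "plus jitter". The exponent is chosen so that the first failure waits 0.5s, matching the curve quoted in the commit message.

```typescript
// Assumed constants mirroring the numbers in the description.
const BASE_DELAY_MS = 250;
const MAX_DELAY_MS = 30_000;

/**
 * Delay before the next retry after `consecutiveFailures` failures in a row.
 * Doubles on every failure and is capped at MAX_DELAY_MS before jitter.
 * `jitterFraction` and `random` are hypothetical parameters.
 */
function backoffDelayMs(
  consecutiveFailures: number,
  jitterFraction = 0.2,
  random: () => number = Math.random,
): number {
  const exponential = BASE_DELAY_MS * 2 ** consecutiveFailures;
  const capped = Math.min(exponential, MAX_DELAY_MS);
  // Add up to jitterFraction of the delay so workers do not retry in lockstep.
  return Math.round(capped + capped * jitterFraction * random());
}

// Example: delays for the first eight consecutive failures.
for (let n = 1; n <= 8; n++) {
  console.log(n, backoffDelayMs(n));
}
```

Adding the jitter after the cap means the actual sleep can exceed 30s by up to the jitter fraction. Applying the cap last would be an equally reasonable choice if a hard upper bound matters.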
After MAX_CONSECUTIVE_FAILURES (10) in a row, the worker exits so PM2
can do a clean-slate restart (fresh Knex pool, cleared in-memory state)
rather than retrying at the 30s backoff cap forever. With the backoff
curve this is ~2 minutes of sustained failure before giving up: past
any transient DB blip, but short enough that a genuinely-wedged worker
still surfaces via crash + restart instead of silently retrying. Mirrors
the inserter watchdog's existing process.exit(1) pattern.

The existing lastRun:<APP_NAME> health beacon is only refreshed on
success, so a worker stuck in the failure path will still trip
HEALTH_TIMEOUT-based alerting. Workers that fail for non-transient
reasons (bad code, missing env) therefore still surface; they just
don't take the DB down with them on the way out.
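The overall control flow (reset on success, beacon only on success, exit after too many failures in a row) might look roughly like the TypeScript sketch below. This is an illustration under stated assumptions, not the merged implementation: runInLoopSketch, LoopDeps, and every member of it are hypothetical names, and the sleep, beacon, exit, and backoff policy are injected so the loop can be exercised without Redis, PM2, or real timers.

```typescript
const MAX_CONSECUTIVE_FAILURES = 10;

// Hypothetical dependency bundle; none of these names come from the PR.
interface LoopDeps {
  sleep: (ms: number) => Promise<void>;
  markHealthy: () => Promise<void>; // stand-in for refreshing lastRun:<APP_NAME>
  exit: (code: number) => void; // stand-in for process.exit
  delayFor: (consecutiveFailures: number) => number; // backoff policy
}

async function runInLoopSketch(
  func: () => Promise<void>,
  deps: LoopDeps,
  maxIterations = Infinity, // lets a test stop the otherwise endless loop
): Promise<void> {
  let consecutiveFailures = 0;
  for (let i = 0; i < maxIterations; i++) {
    try {
      await func();
      consecutiveFailures = 0; // success resets the counter
      await deps.markHealthy(); // beacon is refreshed on success only
    } catch (err) {
      consecutiveFailures += 1;
      console.error(`runInLoop failure #${consecutiveFailures}`, err);
      if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
        deps.exit(1); // give up so the supervisor can restart from a clean slate
        return;
      }
      await deps.sleep(deps.delayFor(consecutiveFailures));
    }
  }
}
```

In this sketch a failure to refresh the beacon is treated like any other failure and also backs off. Whether the real worker does the same is not stated in the description.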