Exponential backoff in runInLoop on failure #2959

Merged
howardchung merged 3 commits into odota:master from ff137:feat/run-in-loop-backoff on May 18, 2026

Conversation

Contributor

@ff137 ff137 commented May 17, 2026

Previously, any error thrown inside a runInLoop callback would propagate
all the way out of the worker module: the dynamic import in index.ts
would reject, and the process would sleep 250ms and exit(1) for PM2 to
restart. During a real outage (e.g. Postgres at max_connections), every
runInLoop-based worker (inserter, scanner, scanner2, rater, monitor,
cleanup, ...) would crash and restart ~4x per second, each restart
reopening its Knex pool against an already-saturated DB and losing
in-flight Promise.all work. The system had no way to slow down.

This change wraps func() in try/catch, logs the error (still loudly,
via console.error, with a consecutive-failure count), and sleeps with
exponential backoff (250ms base, doubling, capped at 30s) plus jitter
before retrying. On success, the failure counter resets.
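
As a rough illustration of the behaviour described above, here is a minimal sketch rather than the code in this PR. The names `backoffDelayMs`, `runInLoopSketch`, `shouldContinue`, and `JITTER_FRACTION` are hypothetical. The jitter formula is an assumption, since the description does not say how jitter is computed. The first delay is assumed to be 500ms (250ms doubled once), to line up with the 0.5s, 1s, 2s, ... curve quoted later in this thread.

```typescript
// Base and cap values are taken from the PR description.
const BASE_DELAY_MS = 250;
const MAX_DELAY_MS = 30000;
// Assumption: up to 10% additive jitter. The PR does not state the jitter formula.
const JITTER_FRACTION = 0.1;

/** Delay before the next retry, given the number of failures in a row (>= 1). */
function backoffDelayMs(
  consecutiveFailures: number,
  random: () => number = Math.random,
): number {
  // 250ms * 2^n: the first failure waits 500ms, the second 1s, and so on.
  const exponential = BASE_DELAY_MS * Math.pow(2, consecutiveFailures);
  const capped = Math.min(exponential, MAX_DELAY_MS);
  // Jitter keeps many workers from retrying in lockstep.
  return capped + capped * JITTER_FRACTION * random();
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Loop that backs off on failure and resets the counter on success. */
async function runInLoopSketch(
  func: () => Promise<void>,
  shouldContinue: () => boolean, // hypothetical stop condition, for testability
): Promise<void> {
  let consecutiveFailures = 0;
  while (shouldContinue()) {
    try {
      await func();
      consecutiveFailures = 0; // success resets the failure counter
    } catch (err) {
      consecutiveFailures += 1;
      console.error(`[runInLoop] consecutive failure #${consecutiveFailures}:`, err);
      await sleep(backoffDelayMs(consecutiveFailures));
    }
  }
}
```

Passing `random` as a parameter is only there so the delay can be checked deterministically, for example `backoffDelayMs(3, () => 0)` for the jitter-free value.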

After MAX_CONSECUTIVE_FAILURES (10) in a row, the worker exits so PM2
can do a clean-slate restart (fresh Knex pool, cleared in-memory state)
rather than retrying at the 30s backoff cap forever. With the backoff
curve this is ~2 minutes of sustained failure before giving up — past
any transient DB blip, but short enough that a genuinely-wedged worker
still surfaces via crash + restart instead of silently retrying. Mirrors
the inserter watchdog's existing process.exit(1) pattern.
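
The "~2 minutes" figure can be sanity-checked with a small calculation. This is a sketch under assumptions: it ignores jitter and the time each failing call itself takes, it assumes the first sleep is 500ms (250ms doubled once), and `totalBackoffMs` is a hypothetical helper, not part of the PR. Whether the worker sleeps once more before exiting on the tenth failure is not stated here, so both cases are shown.

```typescript
const BASE_DELAY_MS = 250;
const MAX_DELAY_MS = 30000;
const MAX_CONSECUTIVE_FAILURES = 10;

/** Sum of the jitter-free backoff sleeps taken after failures 1..sleepCount. */
function totalBackoffMs(sleepCount: number): number {
  let total = 0;
  for (let failure = 1; failure <= sleepCount; failure++) {
    total += Math.min(BASE_DELAY_MS * Math.pow(2, failure), MAX_DELAY_MS);
  }
  return total;
}

// Exit immediately on the 10th failure: only 9 sleeps happened.
const exitWithoutFinalSleep = totalBackoffMs(MAX_CONSECUTIVE_FAILURES - 1);
// Sleep once more before exiting: 10 sleeps happened.
const exitAfterFinalSleep = totalBackoffMs(MAX_CONSECUTIVE_FAILURES);

console.log(exitWithoutFinalSleep / 1000, exitAfterFinalSleep / 1000); // 121.5 151.5
```

Under these assumptions the worker gives up after roughly 2 to 2.5 minutes of sleeping, plus however long the failing calls themselves take.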

The existing lastRun:<APP_NAME> health beacon is only refreshed on
success, so a worker stuck in the failure path will still trip
HEALTH_TIMEOUT-based alerting. Workers that fail for non-transient
reasons (bad code, missing env) therefore still surface; they just
don't take the DB down with them on the way out.
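
To illustrate why a beacon that is refreshed only on success still exposes a stuck worker, here is a minimal sketch. Everything in it is a stand-in: the `Map` replaces whatever store actually holds `lastRun:<APP_NAME>`, the `HEALTH_TIMEOUT_MS` value is invented (the description names HEALTH_TIMEOUT but not its value), and `recordIteration` and `isStale` are hypothetical helpers.

```typescript
// Hypothetical in-memory stand-in for the store holding lastRun:<APP_NAME>.
const beacons = new Map<string, number>();
// Assumed value, for illustration only.
const HEALTH_TIMEOUT_MS = 60000;

/** Refresh the beacon only when the iteration succeeded. */
function recordIteration(appName: string, succeeded: boolean, nowMs: number): void {
  if (succeeded) {
    beacons.set(`lastRun:${appName}`, nowMs);
  }
  // On failure the beacon is left untouched, so it keeps ageing.
}

/** True when the worker has not succeeded within the timeout window. */
function isStale(appName: string, nowMs: number): boolean {
  const lastRun = beacons.get(`lastRun:${appName}`);
  return lastRun === undefined || nowMs - lastRun > HEALTH_TIMEOUT_MS;
}

recordIteration("scanner", true, 0);       // one success at t=0
recordIteration("scanner", false, 30000);  // failures do not refresh the beacon
recordIteration("scanner", false, 90000);
console.log(isStale("scanner", 90000)); // true
```

A worker that keeps backing off never moves its beacon forward, so timeout-based alerting fires just as it would for a worker that had crashed outright.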

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread svc/store/queue.ts
After sustained failure, the worker should still exit so PM2 can do a clean-slate restart (fresh Knex pool, cleared in-memory state) rather than retrying at the 30s backoff cap forever.

Threshold: 10 consecutive failures. With the existing backoff curve (0.5s, 1s, 2s, 4s, 8s, 16s, 30s, 30s, ...), that is 2 to 2.5 minutes of sustained failure before exiting: well past any transient DB blip, but short enough that a genuinely broken worker still surfaces via crash + restart instead of silently retrying forever.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ff137 ff137 force-pushed the feat/run-in-loop-backoff branch from 0042fea to 910a926 Compare May 18, 2026 08:50
@howardchung howardchung merged commit 4adf547 into odota:master May 18, 2026
1 check passed
@ff137 ff137 deleted the feat/run-in-loop-backoff branch May 19, 2026 09:04