Exponential backoff in runInLoop on failure by ff137 · Pull Request #2959 · odota/core

ff137 · 2026-05-17T18:42:07Z

Previously any error thrown inside a runInLoop callback would propagate
all the way out of the worker module, the dynamic import in index.ts
would reject, and the process would sleep 250ms and exit(1) for PM2 to
restart. During a real outage (e.g. Postgres at max_connections), every
runInLoop-based worker (inserter, scanner, scanner2, rater, monitor,
cleanup, ...) crashes and restarts ~4x per second, each restart
reopening its Knex pool against an already-saturated DB and losing
in-flight Promise.all work. The system has no way to slow down.

This change wraps func() in try/catch, logs the error (still loudly,
via console.error, with a consecutive-failure count), and sleeps with
exponential backoff (250ms base, doubling, capped at 30s) plus jitter
before retrying. On success, the failure counter resets.
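
For illustration, the retry policy described above could look roughly like the sketch below. This is not the code merged in svc/store/queue.ts. The names `backoffDelayMs` and `runOnce` and the injected `sleep` and `random` parameters are hypothetical. The jitter range (up to 25% extra) is an assumption, since the description does not state it. The sketch also assumes doubling starts at the first failure, so the first retry waits about 500ms.

```typescript
const BASE_DELAY_MS = 250;
const MAX_DELAY_MS = 30000;

// Delay before the next retry after `failures` consecutive failures (1-based).
// Assumption: doubling starts at the first failure, so 1 -> 500ms, 2 -> 1s, ...
function backoffDelayMs(
  failures: number,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(BASE_DELAY_MS * Math.pow(2, failures), MAX_DELAY_MS);
  // Assumed jitter: add up to 25% on top so workers do not retry in lockstep.
  const jitter = exponential * 0.25 * random();
  return Math.round(exponential + jitter);
}

// One loop iteration: run func and return the updated consecutive-failure count.
async function runOnce(
  func: () => Promise<void>,
  failures: number,
  sleep: (ms: number) => Promise<void>,
): Promise<number> {
  try {
    await func();
    return 0; // success resets the counter
  } catch (err) {
    const next = failures + 1;
    // Still logged loudly, with the consecutive-failure count.
    console.error(`[runInLoop] consecutive failure #${next}:`, err);
    await sleep(backoffDelayMs(next));
    return next;
  }
}
```

Injecting `sleep` and `random` keeps the sketch deterministic under test. A real loop would pass a `setTimeout`-based sleep and leave `random` at its default.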

After MAX_CONSECUTIVE_FAILURES (10) in a row, the worker exits so PM2
can do a clean-slate restart (fresh Knex pool, cleared in-memory state)
rather than retrying at the 30s backoff cap forever. With the backoff
curve this is ~2 minutes of sustained failure before giving up — past
any transient DB blip, but short enough that a genuinely-wedged worker
still surfaces via crash + restart instead of silently retrying. Mirrors
the inserter watchdog's existing process.exit(1) pattern.
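
A minimal sketch of the give-up rule, under stated assumptions: `FailureTracker` and `handleFailure` are hypothetical names, and `exit` is injected so the example can run without killing the process (the description says the real code calls process.exit(1)). Whether the worker sleeps once more before exiting on the tenth failure is not stated in the description; this sketch exits immediately.

```typescript
const MAX_CONSECUTIVE_FAILURES = 10;

class FailureTracker {
  private consecutive = 0;

  // Any successful iteration clears the streak.
  recordSuccess(): void {
    this.consecutive = 0;
  }

  // Returns "exit" once the threshold is reached, otherwise "retry".
  recordFailure(): "retry" | "exit" {
    this.consecutive += 1;
    return this.consecutive >= MAX_CONSECUTIVE_FAILURES ? "exit" : "retry";
  }

  count(): number {
    return this.consecutive;
  }
}

// Returns true if the loop should keep going, false if exit was requested.
function handleFailure(
  tracker: FailureTracker,
  exit: (code: number) => void,
): boolean {
  if (tracker.recordFailure() === "exit") {
    exit(1); // hand over to the process manager for a clean-slate restart
    return false;
  }
  return true;
}
```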

The existing lastRun:<APP_NAME> health beacon is only refreshed on
success, so a worker stuck in the failure path will still trip
HEALTH_TIMEOUT-based alerting. Workers that fail for non-transient
reasons (bad code, missing env) therefore still surface; they just
don't take the DB down with them on the way out.
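
To illustrate why a worker stuck in the failure path still trips alerting, here is a small sketch. The in-memory `beacons` object stands in for wherever the lastRun:<APP_NAME> key is actually stored, the 60s `HEALTH_TIMEOUT_MS` value is made up, and `markSuccess`, `isUnhealthy`, and `iterate` are hypothetical names, not the project's API.

```typescript
// Hypothetical timeout; the real HEALTH_TIMEOUT value is not given in the description.
const HEALTH_TIMEOUT_MS = 60000;

// In-memory stand-in for the store holding the lastRun:<APP_NAME> beacons.
const beacons: Record<string, number> = {};

function beaconKey(appName: string): string {
  return `lastRun:${appName}`;
}

// Called only after a successful iteration; the failure path never touches it.
function markSuccess(appName: string, nowMs: number): void {
  beacons[beaconKey(appName)] = nowMs;
}

// A monitor treats a missing or old beacon as unhealthy.
function isUnhealthy(appName: string, nowMs: number): boolean {
  const last = beacons[beaconKey(appName)];
  return last === undefined || nowMs - last > HEALTH_TIMEOUT_MS;
}

// One iteration with an injected clock; returns true on success.
function iterate(appName: string, func: () => void, nowMs: number): boolean {
  try {
    func();
    markSuccess(appName, nowMs);
    return true;
  } catch (err) {
    // Backoff would happen here; the beacon is deliberately left stale.
    return false;
  }
}
```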

After sustained failure the worker should still exit so PM2 can do a clean-slate restart (fresh Knex pool, cleared in-memory state), rather than retrying at the 30s backoff cap forever.

Threshold: 10 consecutive failures. With the existing backoff curve (0.5s, 1s, 2s, 4s, 8s, 16s, 30s, 30s, ...) that is 2 to 2.5 minutes of sustained failure before exiting: well past any transient DB blip, but short enough that a genuinely broken worker still surfaces via crash + restart instead of silently retrying forever.

Co-authored-by: Cursor <cursoragent@cursor.com>
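
The 2 to 2.5 minute figure can be checked with a little arithmetic. The sketch below ignores jitter and uses hypothetical helper names. It reads the range as depending on whether the tenth failure sleeps before the worker exits, which is an interpretation rather than something the commit message states.

```typescript
// Un-jittered delay in seconds after the n-th consecutive failure (1-based):
// 0.5, 1, 2, 4, 8, 16, then capped at 30.
function delaySeconds(n: number): number {
  return Math.min(0.25 * Math.pow(2, n), 30);
}

// Total time spent sleeping across the first `sleeps` failures.
function totalSeconds(sleeps: number): number {
  let total = 0;
  for (let n = 1; n <= sleeps; n++) {
    total += delaySeconds(n);
  }
  return total;
}

console.log(totalSeconds(9));  // 121.5 (about 2 minutes: exit right at the 10th failure)
console.log(totalSeconds(10)); // 151.5 (about 2.5 minutes: one more sleep before exiting)
```

Jitter and the run time of each failing attempt would push the real figure somewhat higher.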

howardchung reviewed May 18, 2026

Comment thread svc/store/queue.ts

howardchung reviewed May 18, 2026

Comment thread svc/store/queue.ts

howardchung reviewed May 18, 2026

Comment thread svc/store/queue.ts

ff137 force-pushed the feat/run-in-loop-backoff branch from 0042fea to 910a926 on May 18, 2026 08:50

🎨 Update docstring comment

48a2976

howardchung approved these changes May 18, 2026

howardchung merged commit 4adf547 into odota:master May 18, 2026
1 check passed

ff137 deleted the feat/run-in-loop-backoff branch May 19, 2026 09:04

howardchung merged 3 commits into odota:master from ff137:feat/run-in-loop-backoff
