
Rewrite of worker pool with multi-task dispatch per worker#26

Open
tcely wants to merge 4 commits into kikkia:master from tcely:tcely-worker-pool

Conversation

@tcely
Contributor

@tcely tcely commented Jan 6, 2026

PR Type

Enhancement


Description

  • Refactor worker pool to process multiple tasks per worker lifecycle

  • Implement per-worker message limits with automatic worker rotation

  • Add robust error handling, timeouts, and recovery mechanisms

  • Introduce deque-based task queue with pluggable implementations

  • Add comprehensive worker lifecycle management and in-flight task tracking


Diagram Walkthrough

flowchart LR
  A["Task Queue<br/>Deque-based"] -->|dispatch| B["Idle Worker<br/>with Budget"]
  B -->|postMessage| C["Worker Process"]
  C -->|message event| D["Task Resolution<br/>resolve/reject"]
  D -->|releaseWorker| E{Budget<br/>Remaining?}
  E -->|Yes| F["Return to Idle Pool"]
  E -->|No| G["Retire & Replace"]
  F -->|fillWorkers| H["Create New Worker<br/>if needed"]
  C -->|error/crash| I["Reject Task<br/>& Retire Worker"]
  I -->|scheduleRefill| H

File Walkthrough

Relevant files
Enhancement
taskQueueDeque.ts
Add pluggable deque-based task queue implementation           

src/taskQueueDeque.ts

  • New file implementing pluggable deque-based task queue
  • Supports two implementations: @alg/deque and @korkje/deque
  • Adapters provide unified TaskQueue interface with push(), shift(), and
    length
  • Environment variable TASK_QUEUE_DEQUE_IMPL controls which
    implementation is used
+62/-0   
types.ts
Update types for worker limits and task tracking                 

src/types.ts

  • Replace WorkerWithStatus with WorkerWithLimit interface tracking
    message budget
  • Add TaskQueue interface for queue abstraction
  • Add InFlight type to track in-flight tasks with timeout IDs
  • Add SafeCallOptions type for error handling and logging configuration
+30/-3   
utils.ts
Add safe function call utility with error handling             

src/utils.ts

  • Add safeCall() function for safe function invocation with error
    handling
  • Implement looksLikeSafeCallOptions() type guard for options validation
  • Support optional logging, error callbacks, and label-based error
    tracking
  • Preserve caller's this context using Reflect.apply()
+65/-1   
workerPool.ts
Refactor worker pool with budgets and lifecycle management

src/workerPool.ts

  • Complete refactor of worker pool with multi-task dispatch per worker
  • Implement per-worker message budgets with MESSAGES_LIMIT configuration
  • Add comprehensive in-flight task tracking with timeout protection
  • Implement worker lifecycle management: idle pool, retirement, and
    replacement
  • Add recovery mechanism with exponential backoff for pool failures
  • Implement permanent message/error/crash handlers on workers
  • Add task age validation and queue draining on fatal errors
  • Replace simple dispatch loop with fillWorkers() and dispatch()
    functions
  • Track idle workers using both Set and Stack for efficient access
  • Add helper functions: releaseWorker(), setIdle(), setInFlight(),
    clearInFlight(), takeIdleWorker()
+427/-30
Documentation
README.md
Add comprehensive environment variables documentation       

src/README.md

  • New documentation file documenting all environment variables
  • Document TASK_QUEUE_DEQUE_IMPL for queue implementation selection
  • Document MAX_THREADS for worker pool concurrency control
  • Document MESSAGES_LIMIT for per-worker message budget configuration
  • Document other existing variables: API_TOKEN, HOST, PORT, HOME,
    XDG_CACHE_HOME, cache size limits
+90/-0   

@tcely tcely force-pushed the tcely-worker-pool branch from f073ba1 to 2d7bc89 on January 7, 2026 at 21:33
@kikkia
Owner

kikkia commented Jan 13, 2026

Hey there, I took a look a while back and have been thinking on it: what is the benefit here vs. the current implementation? The current implementation is simple and easily understandable.

Since these jobs are only used for the first request on a new script, it's rare that they happen. The public instance only sees a couple of those per week despite serving about 200-300k deciphers a day.

Unless there is some big improvement or bug that I have not seen or experienced at my scale, then I could be open to it, but I think the simplicity and ease of understanding is preferable.

@tcely
Contributor Author

tcely commented Jan 13, 2026

I have also seen very few actual new players since I put up a server and pointed yt-dlp at it.

I'd very much like to see dispatch use a loop to assign more than one task, rather than calling dispatch again recursively.
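The loop idea above can be sketched roughly like this (names such as `dispatchLoop`, `takeIdleWorker`, and `assign` are illustrative stand-ins, not the PR's exact API):

```typescript
// Sketch: assign tasks to every available idle worker in one pass,
// instead of handing out one task per recursive dispatch() call.
function dispatchLoop<T>(
  queue: { shift(): T | undefined; length: number },
  takeIdleWorker: () => object | undefined,
  assign: (worker: object, task: T) => void,
): void {
  while (queue.length > 0) {
    const worker = takeIdleWorker();
    if (worker === undefined) break; // no idle capacity; wait for a release
    const task = queue.shift()!; // length > 0 guarantees an element
    assign(worker, task);
  }
}
```

The loop drains as much of the queue as current idle capacity allows, then returns; the next worker release re-enters it.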


A benefit of the new worker-pool implementation is robustness and liveness under edge cases that can otherwise lead to stuck requests, silent leaks, or long-lived degraded behavior. In other words: it’s about “never getting wedged.”

Concretely, the new code adds protections that the current simple model doesn’t provide:

1) Hard liveness guarantees (prevents “hung forever” requests)

New behavior

  • Each in-flight task gets an explicit timeout (IN_FLIGHT_TIMEOUT_MS = 60m). If a worker never responds (crash, deadlock, event loop stall, etc.), the task is rejected and the worker is retired.
  • Tasks sitting in the queue too long are rejected (MAX_TASK_AGE_MS = 30m) rather than being served extremely late.
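The timeout mechanics can be sketched like this (a minimal illustration; `armTimeout`, `settle`, and the `InFlightEntry` shape are stand-ins, not the PR's exact API, though the PR's `InFlight` type does track timeout IDs):

```typescript
// Sketch of the in-flight timeout idea: if the worker never posts back,
// the task is rejected and the worker retired.
const IN_FLIGHT_TIMEOUT_MS = 60 * 60 * 1000; // 60 minutes, as above

type InFlightEntry = {
  reject: (err: Error) => void;
  timeoutId: ReturnType<typeof setTimeout>;
};

function armTimeout(
  reject: (err: Error) => void,
  retire: () => void,
  ms: number = IN_FLIGHT_TIMEOUT_MS,
): InFlightEntry {
  const timeoutId = setTimeout(() => {
    reject(new Error("in-flight task timed out"));
    retire(); // the worker may be wedged; terminate and replace it
  }, ms);
  return { reject, timeoutId };
}

// A normal worker response must clear the timer before settling.
function settle(entry: InFlightEntry): void {
  clearTimeout(entry.timeoutId);
}
```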

Current behavior risk

  • If a worker gets into a bad state and never posts back, that request can hang indefinitely, and the pool can slowly degrade (depending on how many workers get wedged).

Even if “new script” events are rare, the impact of a single hung request can be high (clients waiting forever, connection accumulation, etc.).

2) Worker recycling via message budgets (prevents long-lived memory/GC creep)

New behavior

  • Workers have a configurable message budget (MESSAGES_LIMIT, default 10_000). When exhausted, the worker is terminated and replaced.
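The budget mechanism boils down to a counter per worker (a minimal sketch; the `messagesRemaining` field name and `consumeBudget` helper are guesses, though `WorkerWithLimit` is the PR's interface name):

```typescript
// Sketch: per-worker message budget. The PR reads the limit from the
// MESSAGES_LIMIT environment variable; 10_000 is the documented default.
const MESSAGES_LIMIT = 10_000;

interface WorkerWithLimit {
  messagesRemaining: number;
}

// Returns true if the worker still has budget and may return to the
// idle pool; false means it should be terminated and replaced.
function consumeBudget(worker: WorkerWithLimit): boolean {
  worker.messagesRemaining -= 1;
  return worker.messagesRemaining > 0;
}
```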

Why it matters even if first-script is rare

  • These workers execute fairly heavy JavaScript parsing / transforms and may retain memory unintentionally (library/global caches, V8 isolates, fragmentation, etc.). Recycling prevents “one bad long-lived isolate” behavior.
  • This is a common production tactic for long-running worker processes.

If scripts are rarely processed at your scale, the budget will rarely tick down, so the overhead is near zero, yet it still protects against "one worker that got large or leaky over months."

For very small scale situations, the limit can be significantly lowered.

3) Persistent handlers + in-flight tracking = fewer “state bugs”

New behavior

  • Permanent message, messageerror, and error handlers are attached once per worker.
  • In-flight tasks are tracked by worker in a WeakMap, and “stray messages” cause worker retirement (protects correctness).

Current behavior risk

  • The old code attaches a new message handler per task and removes it on response. That’s simple, but if anything goes wrong (double message, exception before remove, unexpected payload), it’s easier to end up in weird states or leak handlers.
  • The new code trades simplicity for a stronger correctness model: “worker has at most 1 in-flight task, tracked explicitly.”

4) Safer task settle (prevents crashes caused by userland reject/resolve)

New behavior

  • safeCall() is used to invoke resolve/reject defensively. If user code throws during resolve/reject, the pool keeps working.
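The core of the idea is tiny (a reduced sketch; the PR's `safeCall()` in src/utils.ts additionally supports logging options, error callbacks, and `Reflect.apply()` for `this` preservation):

```typescript
// Sketch of the defensive-settle idea: a throwing resolve/reject
// callback must not disrupt the pool's own bookkeeping.
function safeCall<T>(fn: (...args: unknown[]) => T, ...args: unknown[]): T | undefined {
  try {
    return fn(...args);
  } catch (err) {
    console.error("safeCall: callback threw", err);
    return undefined; // swallow the error; the pool keeps working
  }
}
```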

Current behavior risk

  • If a consumer of execInPool() throws from a resolve path (it happens more than is often expected), the worker-pool logic can be disrupted.

5) Backoff-based recovery scaffolding

The scheduleRefillAndDispatch() path wraps refill/dispatch in a try/catch and has backoff retry. This is effectively a “pool self-heals” loop if internal bookkeeping gets into a bad state.

I agree this is heavier than needed for normal operation, but it's specifically targeted at the hard-to-debug "pool got into a broken bookkeeping state" incidents.
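The self-healing loop amounts to retrying the refill/dispatch step with growing delays (a sketch; the function name, attempt count, and delays are illustrative, not the PR's exact values):

```typescript
// Sketch: retry a refill/dispatch step with exponential backoff if the
// pool's internal bookkeeping throws.
async function scheduleWithBackoff(
  step: () => void,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      step();
      return; // pool bookkeeping is healthy again
    } catch {
      const delay = baseDelayMs * 2 ** attempt; // 100ms, 200ms, 400ms, ...
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```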

6) Deque selection is mostly about avoiding pathological array behavior

This PR adds TaskQueue + createTaskQueue() with selectable deque backend.

  • In your workload, queue depth probably stays tiny, so it won’t matter.
  • But it prevents worst-case O(n) behavior from shift/unshift patterns if queue grows (e.g., transient spikes, upstream slowdown, worker crash causing backlog).
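For reference, the `TaskQueue` surface is just `push()`, `shift()`, and `length`; a plain-array fallback makes the O(n) concern concrete (sketch only; the real `createTaskQueue()` selects `@alg/deque` or `@korkje/deque` via `TASK_QUEUE_DEQUE_IMPL`):

```typescript
// Sketch of the TaskQueue abstraction with a naive array backing.
// Array#shift is O(n) in the worst case; a real deque makes it O(1),
// which matters only if the queue ever grows large.
interface TaskQueue<T> {
  push(item: T): void;
  shift(): T | undefined;
  readonly length: number;
}

function createArrayTaskQueue<T>(): TaskQueue<T> {
  const items: T[] = [];
  return {
    push: (item) => { items.push(item); },
    shift: () => items.shift(), // the O(n) operation a deque avoids
    get length() { return items.length; },
  };
}
```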

Tradeoff: You’re right about simplicity

The current implementation is extremely easy to reason about. The new one is more complex and has more moving parts (idle stack/set, in-flight maps, retire sets, timers, recovery state).

If your primary concern is maintainability and you haven’t seen worker wedging / memory creep, a reasonable compromise would be:

  • Keep the old structure, but cherry-pick two high-value safety features:
    1. In-flight timeout per task (prevents indefinite hangs)
    2. Worker message budget recycling (prevents long-lived isolate creep)
  • Optionally keep safeCall() around (very small surface area, good safety win).

That would retain simplicity while still addressing the biggest “production footguns.”

Bottom line

At your scale, you likely won’t see a throughput improvement because this path is cold most of the time. The benefit is mainly:

  • bounded hangs (timeouts),
  • bounded queue staleness (max age),
  • bounded worker lifetime (message budgets),
  • fewer “pool stuck forever” failure modes (explicit tracking + recovery).

If those failure modes aren't a concern for your deployment, I'd recommend a reduced-scope version (a loop in dispatch instead of recursion, plus timeouts and budget recycling) to preserve readability while still getting the reliability wins.

