fix(daemon): resolve silent exit due to self-PID collision and event loop drain #1508

Open
cpucoinio wants to merge 1 commit into ruvnet:main from cpucoinio:fix/daemon-silent-exit-self-pid-eventloop

@cpucoinio

Summary

Fixes two bugs that together make the background daemon non-functional: it reports "started" but then exits silently with no log output. The first bug prevents any worker from being scheduled; the second makes the process exit on any machine under moderate CPU load.

Root Causes

Bug 1 — Self-PID collision (parent writes PID file before child initializes)

startBackgroundDaemon() writes daemon.pid with the child's PID immediately after spawning. When the child starts and calls WorkerDaemon.start(), checkExistingDaemon() reads that file, finds its own PID alive via process.kill(pid, 0), and returns early — treating itself as a duplicate. No workers are ever scheduled.

Fix: Added a self-PID guard in checkExistingDaemon(): if the PID in the file equals process.pid, clear the stale file and return null.
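
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the function signature, the `pidFile` and `selfPid` parameters, and the `isProcessAlive` helper are assumptions made for the example, and the real `checkExistingDaemon()` may be structured differently.

```javascript
const fs = require('node:fs');

// Hypothetical helper: signal 0 delivers nothing, it only probes the PID.
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but is owned by another user.
    return err.code === 'EPERM';
  }
}

// Hypothetical stand-in for checkExistingDaemon(): returns the PID of another
// live daemon, or null when startup should proceed.
function checkExistingDaemon(pidFile, selfPid = process.pid) {
  if (!fs.existsSync(pidFile)) return null;

  const pid = Number.parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10);

  // Self-PID guard: the parent wrote our own PID before we initialized,
  // so this entry is not a duplicate daemon.
  if (pid === selfPid) {
    fs.rmSync(pidFile, { force: true });
    return null;
  }

  if (!Number.isNaN(pid) && isProcessAlive(pid)) return pid;

  // Unparseable or dead PID: treat the file as stale.
  fs.rmSync(pidFile, { force: true });
  return null;
}
```

Without the `pid === selfPid` branch, the child would fall through to the liveness probe, find itself alive, and report a duplicate.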

Bug 2 — Event loop drain when all workers are resource-deferred

When system CPU load exceeds maxCpuLoad, all workers are pushed to pendingWorkers. The only retry is scheduled via setTimeout(..., 30_000).unref(). The .unref() means this timer doesn't keep the Node.js event loop alive — so if all workers are deferred simultaneously (common on a busy dev machine), the process exits silently.

Fix: Removed .unref() from the backoff retry timer in processPendingWorkers().
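
To illustrate why `.unref()` matters here, the sketch below schedules the same retry timer both ways and inspects it with `hasRef()`. The `scheduleRetry` function and its `keepAlive` option are hypothetical names used only for this example; the actual change simply drops the `.unref()` call.

```javascript
// Delay taken from the description above; the constant name is illustrative.
const RETRY_DELAY_MS = 30_000;

// Hypothetical scheduler for the backoff retry.
function scheduleRetry(callback, { keepAlive = true, delayMs = RETRY_DELAY_MS } = {}) {
  const timer = setTimeout(callback, delayMs);
  // An unref'd timer still fires while the process is alive, but it does not
  // by itself keep the event loop (and therefore the process) running.
  if (!keepAlive) timer.unref();
  return timer;
}

const oldBehaviour = scheduleRetry(() => {}, { keepAlive: false });
const newBehaviour = scheduleRetry(() => {});

console.log(oldBehaviour.hasRef()); // false: Node.js may exit before the retry
console.log(newBehaviour.hasRef()); // true: the pending retry keeps the daemon alive

// Cancel both so this demo exits immediately.
clearTimeout(oldBehaviour);
clearTimeout(newBehaviour);
```

If the unref'd timer is the only handle left, as happens when every worker is deferred, the process ends as soon as the current tick finishes.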

Impact

Both bugs compound: the self-PID issue prevents workers from ever being scheduled, so there are no active timers. Even if that were fixed, the .unref()'d retry timer means any machine where loadavg()[0] exceeds maxCpuLoad (the default cpuCount × 0.8) at startup will also silently exit. On a 12-core machine doing normal dev work, this threshold is routinely exceeded.
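
As a rough check of the threshold arithmetic, the sketch below applies the `cpuCount × 0.8` default to a 12-core machine, where the limit works out to about 9.6. The `shouldDefer` and `defaultMaxCpuLoad` names are assumptions for illustration and do not come from the codebase.

```javascript
const os = require('node:os');

// Default limit as described above: 80% of the core count.
function defaultMaxCpuLoad(cpuCount) {
  return cpuCount * 0.8;
}

// Hypothetical predicate: defer workers when the 1-minute load average
// exceeds the limit.
function shouldDefer(load1m, cpuCount, maxCpuLoad = defaultMaxCpuLoad(cpuCount)) {
  return load1m > maxCpuLoad;
}

// 12 cores give a limit of roughly 9.6.
console.log(shouldDefer(10.5, 12)); // true: workers would be deferred
console.log(shouldDefer(4.0, 12));  // false: workers run immediately

// Current machine. Note that os.loadavg() always returns [0, 0, 0] on Windows.
console.log(shouldDefer(os.loadavg()[0], os.cpus().length));
```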

Test plan

  • Run claude-flow daemon start and verify daemon status shows ● RUNNING after 3–5 seconds
  • Verify on a machine with loadavg > cpuCount × 0.8 (busy machine) — daemon stays running
  • Verify daemon.pid file is created and contains a live PID
  • Verify daemon stop cleanly stops the process

From your friends at CPUcoin.io. Enjoy!

…loop drain

Bug 1 — checkExistingDaemon() self-PID false positive:
startBackgroundDaemon() writes daemon.pid with the child's PID before the
child process finishes initializing. When the child calls WorkerDaemon.start(),
checkExistingDaemon() reads the file, finds its own PID alive via
process.kill(pid, 0), and returns early as if a duplicate daemon is running.
No workers are ever scheduled, no timers are active, and the process exits.

Fix: added a self-PID guard — if pid === process.pid, clear the file and
return null so initialization proceeds normally.

Bug 2 — .unref() on backoff retry drains the event loop:
When all workers are deferred due to high CPU load (loadavg > maxCpuLoad,
the default cpuCount × 0.8), the only pending timer uses .unref(). This
allows Node.js to exit even though the daemon is supposed to be running.
On a busy dev machine this threshold is routinely exceeded at startup.

Fix: removed .unref() from the processPendingWorkers() backoff timer so
the event loop stays alive while workers are waiting for resources.

These bugs compound: Bug 1 means no workers are scheduled (no timers at
all), and Bug 2 means even a correct startup silently exits on a loaded
machine. Both are required for reliable daemon operation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>