fix(daemon): resolve silent exit due to self-PID collision and event loop drain #1508
Open
cpucoinio wants to merge 1 commit into ruvnet:main from
…loop drain
Bug 1: checkExistingDaemon() self-PID false positive
startBackgroundDaemon() writes daemon.pid with the child's PID before the child process finishes initializing. When the child calls WorkerDaemon.start(), checkExistingDaemon() reads the file, finds its own PID alive via process.kill(pid, 0), and returns early as if a duplicate daemon were running. No workers are ever scheduled, no timers are active, and the process exits.
Fix: added a self-PID guard. If pid === process.pid, clear the file and return null so initialization proceeds normally.
Bug 2: .unref() on backoff retry drains the event loop
When all workers are deferred due to high CPU load (loadavg > maxCpuLoad, the default cpuCount × 0.8), the only pending timer uses .unref(). This allows Node.js to exit even though the daemon is supposed to be running. On a busy dev machine this threshold is routinely exceeded at startup.
Fix: removed .unref() from the processPendingWorkers() backoff timer so the event loop stays alive while workers are waiting for resources.
These bugs compound: Bug 1 means no workers are scheduled (no timers at all), and Bug 2 means even a correct startup silently exits on a loaded machine. Both fixes are required for reliable daemon operation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
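The self-PID guard described under Bug 1 could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from this PR: the function signature (a pidFile path plus an injectable selfPid), the isProcessAlive helper, and the handling of unparsable or stale files are all hypothetical. Only the process.kill(pid, 0) liveness probe and the "own PID means clear the file and return null" rule come from the description above.

```javascript
const fs = require('node:fs');

// Hypothetical helper: true if a process with this PID currently exists.
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0); // signal 0 only checks for existence, nothing is sent
    return true;
  } catch (err) {
    // EPERM: the process exists but we may not signal it. ESRCH: no such process.
    return err.code === 'EPERM';
  }
}

// Sketch of the guard. Returns the PID of another live daemon, or null
// when startup should proceed. selfPid is injectable only to ease testing.
function checkExistingDaemon(pidFile, selfPid = process.pid) {
  if (!fs.existsSync(pidFile)) return null;

  const pid = Number.parseInt(fs.readFileSync(pidFile, 'utf8').trim(), 10);

  // Assumption: an unparsable or non-positive PID is treated as a stale file.
  if (!Number.isInteger(pid) || pid <= 0) {
    fs.rmSync(pidFile, { force: true });
    return null;
  }

  // Self-PID guard: the parent wrote our own PID, so this is not a duplicate.
  if (pid === selfPid) {
    fs.rmSync(pidFile, { force: true });
    return null;
  }

  if (isProcessAlive(pid)) return pid; // a different daemon really is running

  fs.rmSync(pidFile, { force: true }); // recorded process is gone: stale file
  return null;
}
```

The self-PID comparison runs before the liveness probe, because the probe always succeeds for the calling process and would otherwise report a duplicate.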
Summary
Fixes two bugs that combine to make the background daemon completely non-functional on any machine under moderate CPU load. The daemon would report "started" but immediately exit silently with no log output.
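The silent exit can be reproduced outside claude-flow. The following is a minimal sketch, not the daemon's actual code: it spawns two throwaway Node.js child processes whose only pending work is a single timer, one with .unref() and one without. The helper name timerFiresBeforeExit and the 200 ms delay are assumptions chosen for illustration (the backoff timer in this PR uses 30 seconds).

```javascript
const { spawnSync } = require('node:child_process');

// Runs a tiny script in a child Node.js process and reports whether its
// timer callback ran before the child exited.
function timerFiresBeforeExit(useUnref) {
  const script = `
    const t = setTimeout(() => console.log('retry fired'), 200);
    ${useUnref ? 't.unref();' : ''}
  `;
  const result = spawnSync(process.execPath, ['-e', script], { encoding: 'utf8' });
  return result.stdout.includes('retry fired');
}

console.log('with .unref():   ', timerFiresBeforeExit(true));
console.log('without .unref():', timerFiresBeforeExit(false));
```

With .unref() the child exits before the callback runs and prints nothing, so the first call returns false. Without it, the timer keeps the event loop alive until it fires and the second call returns true.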
Root Causes
Bug 1: Self-PID collision (parent writes PID file before child initializes)
startBackgroundDaemon() writes daemon.pid with the child's PID immediately after spawning. When the child starts and calls WorkerDaemon.start(), checkExistingDaemon() reads that file, finds its own PID alive via process.kill(pid, 0), and returns early, treating itself as a duplicate. No workers are ever scheduled.
Fix: Added a self-PID guard in checkExistingDaemon(): if the PID in the file equals process.pid, clear the stale file and return null.
Bug 2: Event loop drain when all workers are resource-deferred
When system CPU load exceeds maxCpuLoad, all workers are pushed to pendingWorkers. The only retry is scheduled via setTimeout(..., 30_000).unref(). Because of the .unref(), this timer does not keep the Node.js event loop alive, so if all workers are deferred simultaneously (common on a busy dev machine), the process exits silently.
Fix: Removed .unref() from the backoff retry timer in processPendingWorkers().
Impact
Both bugs compound: the self-PID issue prevents workers from ever being scheduled, so there are no active timers. Even if that were fixed, the .unref()'d retry timer means any machine where loadavg()[0] exceeds maxCpuLoad (the default cpuCount × 0.8) at startup will also silently exit. On a 12-core machine, where that default works out to 9.6, normal dev work routinely exceeds the threshold.
Test plan
- Run claude-flow daemon start and verify daemon status shows ● RUNNING after 3–5 seconds
- With loadavg > cpuCount × 0.8 (busy machine), the daemon stays running
- daemon.pid file is created and contains a live PID
- daemon stop cleanly stops the process
From your friends at CPUcoin.io. Enjoy!