feat(browse): clean-drain disconnect detection + shutdown postmortem#1630
feat(browse): clean-drain disconnect detection + shutdown postmortem#1630odominguez7 wants to merge 5 commits into
Conversation
After proc.unref() in startServer(), the spawned server's stdout/stderr write to FDs that nothing reads. Every console.log/error from the detached server is lost, so a 'why did the daemon die' question has no log to consult. Open the log file in append mode, point the child's stdio at the fd, stash the path on the proc handle so the startup-failure path can read the tail when the pipe is no longer attached. Tail read is bounded to the last 64KB via seek-from-end so a multi-failed-startup log doesn't get fully read into memory. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…n-drain vs crash
The launched-mode browser.on('disconnected') handler treated every
Chromium disappear as a crash. When the user closes their last tab,
Chromium naturally quits because it has nothing left to display and
fires 'disconnected'. The old code printed 'FATAL: Chromium process
crashed' and process.exit(1) on that path, killing the sidebar-agent
and any in-flight PTY work.
Smoking gun in the daemon log:
[browse] Tab closed (id=1, remaining=0)
[browse] FATAL: Chromium process crashed or was killed. Server exiting.
classifyDisconnect({pagesSize, mode}) is a pure helper that splits by
pages.size: zero pages means clean drain (browser-empty-{launched,rehead},
exit 0, no FATAL); pages still live means real crash (chromium-crash-*,
exit 1, FATAL preserved). Wired into both the launched-mode handler and
the post-rehead handler — they collapse from two near-duplicate inline
blocks into one shared classification.
onDisconnect signature gains an optional exitCode parameter so the
classification's exit code flows through to the server's wiring instead
of being decorative. The existing browser-disconnect call site from
the headed-mode page-close chain omits exitCode and gets the default
2 (preserves the user-close convention).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…very exit path
Every shutdown — idle-timeout, parent-exit, SIGINT/SIGTERM,
browser-disconnect, the new browser-empty-* clean drain, chromium-crash-*,
and explicit /stop — now writes <stateDir>/last-shutdown.json with
{ts, pid, reason, mode, uptimeSeconds, detail}. Written atomically via
tmp-then-rename (${outPath}.${pid}.tmp) so two concurrent browse
instances sharing a state dir cannot interleave JSON bytes.
The shutdownReasonRecorded flag gates the fallback 'shutdown-called'
reason in buildFetchHandler.shutdown() — specific callers record their
own more-specific reason first; the fallback only fires when nothing
else did. Most-recent-wins overwrite is intentional (a SIGINT that
arrives after a browser-disconnect IS the actual cause of teardown).
onDisconnect dispatcher uses the exit code passed by classifyDisconnect
(0 for clean drain, 1 for crash) or defaults to 2 for the existing
user-close convention from the page-close chain.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ELOG (v1.43.1.0) 5-test suite pins all four (mode × pagesSize) combinations of classifyDisconnect plus the reason-tag suffix invariant. If a future refactor changes the reason format or breaks the clean-drain split, CI catches it before the next user's sidebar dies. Cross-model adversarial pass via /codex challenge surfaced 14 findings; the 5 must-fix and should-fix items (exit-code wiring being decorative, postmortem exitCode internal consistency, atomic write, bounded tail read, exit-code pass-through refactor) landed before sign-off. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Restore VERSION, CHANGELOG.md, and package.json to upstream/main values so this PR doesn't claim a release slot. Maintainer can bump and write the entry when integrating into the next wave. Keeps all functional changes (classifyDisconnect helper, recordShutdownReason postmortem, stdio tee, 5-test regression suite). Only undoes the version claim.
|
This needs a rebase against the now-merged launch-hardening wave #1629 before the clean-disconnect part is safe to review. Current This PR's new Can you rebase and keep the postmortem / stdio-tee pieces on top of the existing |
Summary
The browse daemon's launched-mode
disconnectedhandler treats every Chromium disappear as a crash. When a user closes their last tab in headed mode, Chromium naturally quits because it has nothing left to display — and the daemon logsFATAL: Chromium process crashed,process.exit(1)s, and tears down the sidebar-agent along with any in-flight PTY work.The smoking gun shows up cleanly in any developer's daemon log:
Same event, two stories. The exit happens microseconds after the close — and detached server stdio went to FDs nothing read after
proc.unref(), so the daemon's own messages disappeared into the void. No log to consult, no postmortem to read.This PR ships the postmortem instrumentation that surfaced the bug AND the fix.
Three changes:
classifyDisconnect()inbrowse/src/browser-manager.ts— pure helper, splitsdisconnectedevents bypages.size. Zero pages → clean drain (browser-empty-*, exit 0, no FATAL log). Pages still live → real crash (chromium-crash-*, exit 1, FATAL preserved). Wired into both the launched-mode handler (line 291) and the post-rehead handler (line 1393). Collapses two near-duplicate inline blocks into one shared classification.recordShutdownReason()inbrowse/src/server.ts— writes<stateDir>/last-shutdown.jsonatomically (.pid.tmp+fs.renameSync) on every exit path. Payload:{ts, pid, reason, mode, uptimeSeconds, detail}. Specific callers (idle-timeout, parent-exit, SIGINT/SIGTERM, browser-disconnect, browser-empty-, chromium-crash-) record their tagged reason before callingactiveShutdown; a fallbackshutdown-calledreason fires only if nothing else recorded one. Most-recent-wins overwrite is intentional — gated byshutdownReasonRecordedflag so the fallback doesn't clobber specifics.stdio tee in
browse/src/cli.ts— opens<stateDir>/browse-server.login append mode, points the spawned server's stdio at the fd. Detached server'sconsole.log/console.errornow surviveproc.unref(). Startup-failure path reads the last 64KB viafs.openSync+fs.statSync+ seek-from-end +fs.readSync+fs.closeSyncinfinally, so a multi-failed-startup log doesn't get fully read into memory.Plus:
BrowserManager.onDisconnectsignature gains an optionalexitCode?: 0 | 1 | 2parameter so the classification's exit code flows through to the server's wiring instead of being decorative (the original draft had this as a bug — exit codes claimed but not delivered because the wiring always calledactiveShutdown(2)).Test plan
bun test browse/test/browser-manager-classify-disconnect.test.ts— 5 new tests, all green. Covers launched/rehead × pagesSize 0/N + reason-tag suffix invariant.bun test browse/test/browser-manager-{unit,custom-chromium}.test.ts browse/test/server-embedder-terminal-port.test.ts browse/test/build-command-response.test.ts— 26 tests, all green. No regressions in neighboring suites that exercise BrowserManager or the server shutdown path.bun run build— clean compile across all binaries + node-compat bundle.codex exec --reasoning-effort high(OpenAI Codex CLI). Surfaced 14 findings on the first pass. Addressed the 5 must-fix / should-fix items (exit-code wiring being decorative, postmortemexitCodeinternal consistency, atomic write, bounded tail read, exit-code pass-through refactor) before requesting handshake. Codex sign-off on the final state: "the fixes close the known gaps without introducing new risk and tests pass."bun teston this branch and onupstream/main(git checkout upstream/main --detach) both produce 258 failures, identical set. None of the failing files are in this PR's diff. The failures cluster on Bun child-process PATH inheritance in sandboxed shells (env-driven, not regression).Deferred findings (for transparency)
These are real Codex findings I'm explicitly NOT addressing in this PR — happy to discuss if you'd rather see any of them in this round:
disconnectedcould fire beforepage.on('close')drains the pages map. Observed-failure log shows the close drained first; speculative.pagesSize === 0(e.g., crash before first tab) misclassifies as clean drain.wirePageEventscreates the initial tab immediately — degenerate case.recordShutdownReasonreads module-levelconfig.stateFile. Same pre-existing pattern that v1.42.1.0'sownsTerminalAgentwork flagged as a follow-up.browse-server.log— future improvement.(proc as any).__logPathstash — relies on Bun proc being extensible. Stable API.Files
browse/src/browser-manager.ts— newclassifyDisconnecthelper + onDisconnect signature change + handler refactorsbrowse/src/server.ts— newrecordShutdownReason+ reason tagging across all exit paths + onDisconnect dispatcher uses passed exit codebrowse/src/cli.ts— stdio tee + bounded tail read on startup failurebrowse/test/browser-manager-classify-disconnect.test.ts— new 5-test suiteVersioning
VERSION, CHANGELOG.md, and package.json are intentionally NOT bumped in this PR — deferring to maintainer integration so this branch doesn't claim a release slot. The functional commits stand on their own; the 5th commit on this branch (
chore: defer VERSION + CHANGELOG bump to maintainer integration) restores those files toupstream/mainvalues.