
bug(extension+mcp-server): WS bridge goes zombie after long idle; one-sided keepalive cannot detect it (timeouts -32001 until manual restart) #51

Description

@apireno

Symptom

After DOMShell has been idle for hours or days (no MCP traffic, Claude Desktop not actively driving the browser), the next domshell_execute call from a client fails with MCP error -32001 (Request timed out). The container is healthy, the WS socket on port 9876 still reads ESTABLISHED in netstat, and the side panel says "connected (authenticated)", but commands sent over the WS never get a response. Only fix observed: restart both the thv proxy and the Chrome extension; neither alone reliably recovers.

Reproduced live on 2026-06-12 against @apireno/domshell@2.0.3 running thv-managed (thv list workload domshell-mcp-server, Up 3 days). Container log shows ~10 "MCP session initialized → sending SESSION_START to extension" entries with no acknowledgement back from the extension between them. After a manual restart of the extension plus a proxy kick, the log shows "Extension disconnected" → "Extension connected (authenticated)" and everything works again; the next pwd returns in ~58 ms.

Root cause

The keepalive in src/background/index.ts:696-708 is one-sided:

chrome.alarms.onAlarm.addListener((alarm) => {
  if (alarm.name !== KEEPALIVE_ALARM) return;
  // The mere act of this listener firing wakes the service worker.
  if (ws?.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: "pong" }));   // one-way send
  }
  if (wsEnabled && wsToken && (!ws || ws.readyState !== WebSocket.OPEN) && !wsReconnectTimer) {
    wsConnect();
  }
});

The extension sends a "pong" and considers itself healthy as long as ws.readyState is WebSocket.OPEN. It never confirms that anything actually comes back from the server. The server side (mcp-server/index.ts) is equally passive: it only sends EXECUTE messages, never heartbeats.

When the underlying TCP pipe goes "zombie" (TCP layer ESTABLISHED but WS framing or JS handler wedged — common during MV3 service-worker suspension cycles + macOS network sleep transitions), ws.send() succeeds at the TCP layer (acked by the kernel) but never reaches the server's WS event handler. Both sides keep WebSocket.OPEN true. The keepalive sees "OPEN" and skips reconnect. The user has to manually intervene.

Why the chrome.alarms keepalive isn't enough

The 24-second alarm DOES successfully wake the service worker (verified by the listener firing). But the alarm only checks ws.readyState, which is a JS-level state that doesn't reflect whether the WS framing layer is actually functional. The alarm is preventing SW suspension correctly; it's just doing the wrong liveness check.

Proposed fix (two-sided)

  1. Server-side periodic PING: mcp-server/index.ts sends { "type": "PING", "id": <timestamp> } over the WS every ~20 seconds, tracked per connection.
  2. Extension-side PONG reply + last-seen tracking: in the existing onmessage handler, recognize PING, reply with PONG, and update lastInboundAt = Date.now().
  3. Extension-side liveness check in the alarm: in the alarm listener, if wsEnabled && (Date.now() - lastInboundAt) > 60_000, force-close the WS (ws.close()) and rely on the existing onclose reconnect path. With PINGs every ~20 seconds, the 60s window tolerates 2 missed server PINGs.
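
The extension half of the proposal (steps 2 and 3) can be sketched as two small functions. This is a minimal sketch, not the actual src/background/index.ts code: handleInbound, checkLiveness, SocketLike, and LinkState are hypothetical names, and two details are assumptions rather than part of the proposal: lastInboundAt is refreshed on every inbound message (not only PING), and the PONG echoes the PING's id. The clock is passed in so the logic can be exercised without a real socket.

```typescript
// Hypothetical names throughout. Only the PING/PONG message types,
// lastInboundAt, wsEnabled, and the 60_000 ms threshold come from the proposal.
type ServerMessage = { type: string; id?: number };

// Minimal slice of the WebSocket API that this sketch needs.
interface SocketLike {
  send(data: string): void;
  close(): void;
}

interface LinkState {
  lastInboundAt: number; // ms timestamp of the most recent inbound message
}

const STALE_AFTER_MS = 60_000;

// Step 2: call from the existing onmessage handler (assumes JSON text frames).
// Returns true when the message was a PING and needs no further handling.
function handleInbound(raw: string, ws: SocketLike, state: LinkState, now: number): boolean {
  const msg = JSON.parse(raw) as ServerMessage;
  state.lastInboundAt = now; // assumption: any inbound traffic counts as proof of life
  if (msg.type !== "PING") return false;
  ws.send(JSON.stringify({ type: "PONG", id: msg.id })); // assumption: echo the id
  return true;
}

// Step 3: call from the alarm listener.
// Returns true when the link was judged stale and the socket was closed.
function checkLiveness(ws: SocketLike | null, wsEnabled: boolean, state: LinkState, now: number): boolean {
  if (!wsEnabled || ws === null) return false;
  if (now - state.lastInboundAt <= STALE_AFTER_MS) return false;
  ws.close(); // the existing onclose path is expected to schedule the reconnect
  return true;
}
```

With the 24-second alarm and the 60-second threshold, a dead link would be noticed somewhere between 60 and 84 seconds after the last inbound message. lastInboundAt would presumably also need resetting whenever a new socket opens, otherwise a fresh connection could be judged stale on the very next alarm.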

This catches all the failure modes:

  • Server-side TCP alive but extension JS dead → lastInboundAt stops advancing, alarm reconnects
  • Server died and TCP RST not received → lastInboundAt stops advancing, alarm reconnects
  • Network change (e.g. WiFi handoff) → lastInboundAt stops advancing, alarm reconnects
  • macOS sleep/wake cycles dropping the WS without notification → lastInboundAt stops advancing, alarm reconnects
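
Every case above relies on the server emitting PINGs on a steady cadence (step 1 of the proposal). A minimal sketch of that server half follows, written against a one-method socket interface so it does not depend on the real mcp-server/index.ts. PingTracker, Sendable, and ConnState are hypothetical names, and what exactly gets tracked per connection (the last PING id and the last PONG time) is an assumption; the proposal only says "track per-connection".

```typescript
// Hypothetical helper; not the actual mcp-server/index.ts code.
interface Sendable {
  send(data: string): void;
}

interface ConnState {
  lastPingId: number | null; // id of the most recent PING sent on this connection
  lastPongAt: number | null; // when the extension last answered (assumed bookkeeping)
}

class PingTracker<S extends Sendable> {
  private conns = new Map<S, ConnState>();

  add(socket: S): void {
    this.conns.set(socket, { lastPingId: null, lastPongAt: null });
  }

  remove(socket: S): void {
    this.conns.delete(socket);
  }

  // Send one PING to every tracked connection, using the timestamp as the id.
  pingAll(now: number): void {
    this.conns.forEach((st, socket) => {
      st.lastPingId = now;
      socket.send(JSON.stringify({ type: "PING", id: now }));
    });
  }

  // Record a PONG. Returns false for untracked sockets or ids that do not
  // match the most recent PING.
  onPong(socket: S, id: number, now: number): boolean {
    const st = this.conns.get(socket);
    if (!st || st.lastPingId !== id) return false;
    st.lastPongAt = now;
    return true;
  }

  stateOf(socket: S): ConnState | undefined {
    return this.conns.get(socket);
  }
}

// Assumed wiring, roughly every 20 seconds:
//   setInterval(() => tracker.pingAll(Date.now()), 20_000);
```

Recording lastPongAt is optional for the fix as proposed, since the reconnect decision lives in the extension, but it would give the server log a direct signal of when the extension stopped answering.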

Alternative: WebSocket protocol-level ping/pong frames

The ws library on the server has built-in ping/pong frame support that Chrome's WS implementation responds to automatically. Cleaner protocol-wise but:

  • Doesn't necessarily wake the suspended SW (auto-pong happens at the C++ layer, not the JS event loop)
  • Doesn't update lastInboundAt in our JS state

So WS-level pings would need to be paired with an SW-waking mechanism anyway. The application-level PING approach above gives both liveness AND SW wakeup in one channel.
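
For comparison, the protocol-level alternative would look roughly like the sketch below on the server. FrameSocket is a hypothetical structural interface covering only the ping(), terminate(), and on("pong") methods that a ws server-side socket provides, and FrameHeartbeat is an illustrative name. Note what the sketch cannot do: it tells the server whether the browser's network stack still answers ping frames, but nothing in it runs extension JS or touches lastInboundAt.

```typescript
// Structural subset of a ws server-side socket; hypothetical interface name.
interface FrameSocket {
  ping(): void;
  terminate(): void;
  on(event: "pong", listener: () => void): void;
}

class FrameHeartbeat {
  // true = a pong frame arrived since the last sweep
  private alive = new Map<FrameSocket, boolean>();

  track(socket: FrameSocket): void {
    this.alive.set(socket, true);
    socket.on("pong", () => {
      // Ignore late pongs from sockets that were already dropped.
      if (this.alive.has(socket)) this.alive.set(socket, true);
    });
  }

  // Run once per interval. Sockets that never answered the previous ping
  // frame are terminated; the rest get a new ping. Returns the number dropped.
  sweep(): number {
    let terminated = 0;
    this.alive.forEach((wasAlive, socket) => {
      if (!wasAlive) {
        this.alive.delete(socket);
        socket.terminate();
        terminated++;
        return;
      }
      this.alive.set(socket, false);
      socket.ping();
    });
    return terminated;
  }
}
```

This lets the server drop a peer whose network stack has gone silent, but because the pong is produced below the JS layer, a wedged or suspended service worker would still look alive to it.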

Related

This issue joins the kernel-side queue for the next CWS-justifying extension release.

Sequencing

Both the server-side PING and extension-side PONG/check need to land together. Server-only changes don't help (extension would ignore unknown PING type today); extension-only changes don't help (nothing to detect on inbound). Bundle into a single release: DOMShell extension v1.3.2 + MCP server 2.0.4.

Workaround until then: if the side panel shows "connected" but commands hang, restart the extension via chrome://extensions/ → DOMShell → toggle off+on. Because neither restart alone recovered reliably in the reproduction above, also kick the thv proxy (thv restart domshell-mcp-server) to flush the dead socket on the server side.
