Skip to content

Fix premature disconnect force-fail of a live client tool call#322175

Draft
roblourens wants to merge 1 commit into
mainfrom
agents/implement-user-reports-that-agent-host-session-w-111e782c
Draft

Fix premature disconnect force-fail of a live client tool call#322175
roblourens wants to merge 1 commit into
mainfrom
agents/implement-user-reports-that-agent-host-session-w-111e782c

Conversation

@roblourens

Copy link
Copy Markdown
Member

Problem

A user's agent-host session appeared hung for ~9 hours on a runTests client tool call. Auto-approve was off, so a confirmation prompt was expected — but the user never saw any confirmation UI, and the session just sat there until they manually aborted.

The tool call was actually force-failed almost immediately (~18ms after it started) as if the owning client had disconnected — even though the client was connected the whole time and went on to run the tests successfully (199/199 passed ~2 min later). That premature failure destroyed the tool call before its real confirmation prompt could be shown, which is what made the session look stuck.

How disconnect handling was supposed to work

When a client tool call is pending, ProtocolServerHandler protects against the owning client going away by arming a grace-period timeout (_startClientToolCallDisconnectTimeout). If the client doesn't come back within CLIENT_TOOL_CALL_DISCONNECT_TIMEOUT (30s), the call is force-failed via _completeDisconnectedClientToolCalls so it can't hang forever. Two paths arm this timeout:

  1. _handleClientDisconnected — from a transport's onClose, when a connected client that owns a pending call disconnects.
  2. _checkOrphanedClientToolCalls — on every ChatToolCallStart/ChatToolCallReady broadcast, to catch calls issued for an already-gone client (no onClose fires for those).

The timeout decides a client is gone from two signals:

  • Arm condition: IClientRecord.connection === undefined.
  • Delay: max(0, TIMEOUT - (now - lastSeenAt)) — a "residual window" so a client that disconnected a while ago gets the remaining grace, not a fresh 30s. If it reconnects and re-subscribes in time, _clearClientToolCallDisconnectTimeout cancels the timer.

What was missing

The whole scheme assumes lastSeenAt reflects recent liveness and that connection === undefined reliably means gone. Neither held:

  1. lastSeenAt was never refreshed from ordinary traffic. It was only stamped at handshake, reconnect, and disconnect — so for a long-lived, chatty client it was stale by minutes or hours. When the orphan check armed the timeout, now - lastSeenAt was huge, so the delay collapsed to 0 and the timeout fired immediately instead of granting any grace.

  2. connection is a single pointer per clientId. When a clientId is reused across overlapping transports (a window reconnects before its previous transport's onClose is observed), the pointer only tracks the most-recent transport. Closing that transport clears connection even though another transport for the same logical client is still live. So connection === undefined did not imply the client was gone.

Together these turned the "grace window" into an instant kill for exactly the live client it was meant to protect: the orphan check saw connection === undefined, armed with a zero delay derived from a stale lastSeenAt, and force-failed a perfectly healthy tool call.

Fix

  • Track liveness from real traffic: _markClientSeen bumps lastSeenAt on every inbound frame for an established client (not just at handshake/disconnect).
  • Re-verify at fire time: when the timeout fires, _isClientGraceExpired checks whether any frame arrived within the grace window (recency-based, not the connection pointer) before force-failing. If the client is still alive, re-arm — but only while it still owns a pending tool call, so an active client never leaves a perpetual timer ticking.

This keeps the original guarantee (a genuinely-gone client's pending calls still fail after the grace window) while no longer killing a client that is demonstrably alive. Fixing this is necessary and sufficient to restore the confirmation UI in the incident.

Tests

Adds two regression tests to protocolServerHandler.test.ts:

  • A live client on a second transport keeps its tool call Streaming across multiple grace windows despite connection === undefined (this fails without the fix).
  • A client that goes silent on every transport still has its tool call failed after the grace window (proves the disconnect path still works).

Existing ProtocolServerHandler suite (55 tests) unchanged and green; 57 passing total. npm run typecheck-client passes.

A separate, secondary hang mechanism on the SDK permission-deferred path was also investigated. It is not included here: fixing the force-fail above is sufficient to prevent the incident, and this change is intentionally scoped to the root cause.

(Written by Copilot)

A client tool call could be force-failed as "disconnected" while the
owning client was in fact still connected and actively sending frames,
destroying the tool call (and its pending confirmation) before the user
ever saw the confirmation prompt.

The disconnect-grace machinery decided a client was gone from two
signals: `IClientRecord.connection === undefined` (arm condition) and
`lastSeenAt` (grace-window delay). Two gaps made it misfire:

- `lastSeenAt` was only updated at handshake/disconnect, never on
  ordinary inbound frames, so for a long-lived chatty client it was
  stale by minutes. The arm delay `max(0, TIMEOUT - elapsed)` then
  collapsed to 0 and the timeout fired immediately.
- `connection` is a single pointer per clientId. When a clientId is
  reused across overlapping transports, closing the most-recent
  transport clears `connection` even though another transport for the
  same client is still live, so `connection === undefined` does not
  imply the client is gone.

Fix: track liveness from real traffic by bumping `lastSeenAt` on every
inbound frame, and re-verify liveness when the timeout fires
(recency-based, not connection-based) before force-failing. If the
client is still alive, re-arm only while it still owns a pending tool
call so an active client never leaves a perpetual timer running.

Adds regression tests: a live client on a second transport survives
despite `connection === undefined`, and a silent owner still fails after
the grace window.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 19, 2026 22:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a failure mode in the agent-host protocol server where a pending client-owned tool call could be force-failed immediately (as if the client disconnected) even though the client was still alive on another overlapping transport. The change makes disconnect-grace decisions rely on recent inbound traffic rather than the single connection pointer, preventing premature tool-call failure that can suppress expected confirmation UI.

Changes:

  • Refresh per-client liveness (lastSeenAt) from every inbound frame for established clients.
  • Re-check liveness when the disconnect-grace timeout fires and only force-fail when the client has been silent for the full grace window; otherwise re-arm while a pending tool call still exists.
  • Add regression tests covering overlapping transports (live-but-untracked connection) and the silent-client failure path.
Show a summary per file
File Description
src/vs/platform/agentHost/node/protocolServerHandler.ts Updates disconnect-grace logic to use traffic-based liveness and re-verify on timeout fire before force-failing client tool calls.
src/vs/platform/agentHost/test/node/protocolServerHandler.test.ts Adds regression tests ensuring pending client tool calls aren’t force-failed while the client is still sending frames on another transport, and still fail after silence.

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants