Fix premature disconnect force-fail of a live client tool call#322175
Draft
roblourens wants to merge 1 commit into
Draft
Fix premature disconnect force-fail of a live client tool call#322175roblourens wants to merge 1 commit into
roblourens wants to merge 1 commit into
Conversation
A client tool call could be force-failed as "disconnected" while the owning client was in fact still connected and actively sending frames, destroying the tool call (and its pending confirmation) before the user ever saw the confirmation prompt. The disconnect-grace machinery decided a client was gone from two signals: `IClientRecord.connection === undefined` (arm condition) and `lastSeenAt` (grace-window delay). Two gaps made it misfire: - `lastSeenAt` was only updated at handshake/disconnect, never on ordinary inbound frames, so for a long-lived chatty client it was stale by minutes. The arm delay `max(0, TIMEOUT - elapsed)` then collapsed to 0 and the timeout fired immediately. - `connection` is a single pointer per clientId. When a clientId is reused across overlapping transports, closing the most-recent transport clears `connection` even though another transport for the same client is still live, so `connection === undefined` does not imply the client is gone. Fix: track liveness from real traffic by bumping `lastSeenAt` on every inbound frame, and re-verify liveness when the timeout fires (recency-based, not connection-based) before force-failing. If the client is still alive, re-arm only while it still owns a pending tool call so an active client never leaves a perpetual timer running. Adds regression tests: a live client on a second transport survives despite `connection === undefined`, and a silent owner still fails after the grace window. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes a failure mode in the agent-host protocol server where a pending client-owned tool call could be force-failed immediately (as if the client disconnected) even though the client was still alive on another overlapping transport. The change makes disconnect-grace decisions rely on recent inbound traffic rather than the single connection pointer, preventing premature tool-call failure that can suppress expected confirmation UI.
Changes:
- Refresh per-client liveness (
lastSeenAt) from every inbound frame for established clients. - Re-check liveness when the disconnect-grace timeout fires and only force-fail when the client has been silent for the full grace window; otherwise re-arm while a pending tool call still exists.
- Add regression tests covering overlapping transports (live-but-untracked connection) and the silent-client failure path.
Show a summary per file
| File | Description |
|---|---|
| src/vs/platform/agentHost/node/protocolServerHandler.ts | Updates disconnect-grace logic to use traffic-based liveness and re-verify on timeout fire before force-failing client tool calls. |
| src/vs/platform/agentHost/test/node/protocolServerHandler.test.ts | Adds regression tests ensuring pending client tool calls aren’t force-failed while the client is still sending frames on another transport, and still fail after silence. |
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 0
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
A user's agent-host session appeared hung for ~9 hours on a
runTestsclient tool call. Auto-approve was off, so a confirmation prompt was expected — but the user never saw any confirmation UI, and the session just sat there until they manually aborted.The tool call was actually force-failed almost immediately (~18ms after it started) as if the owning client had disconnected — even though the client was connected the whole time and went on to run the tests successfully (199/199 passed ~2 min later). That premature failure destroyed the tool call before its real confirmation prompt could be shown, which is what made the session look stuck.
How disconnect handling was supposed to work
When a client tool call is pending,
ProtocolServerHandlerprotects against the owning client going away by arming a grace-period timeout (_startClientToolCallDisconnectTimeout). If the client doesn't come back withinCLIENT_TOOL_CALL_DISCONNECT_TIMEOUT(30s), the call is force-failed via_completeDisconnectedClientToolCallsso it can't hang forever. Two paths arm this timeout:_handleClientDisconnected— from a transport'sonClose, when a connected client that owns a pending call disconnects._checkOrphanedClientToolCalls— on everyChatToolCallStart/ChatToolCallReadybroadcast, to catch calls issued for an already-gone client (noonClosefires for those).The timeout decides a client is gone from two signals:
IClientRecord.connection === undefined.max(0, TIMEOUT - (now - lastSeenAt))— a "residual window" so a client that disconnected a while ago gets the remaining grace, not a fresh 30s. If it reconnects and re-subscribes in time,_clearClientToolCallDisconnectTimeoutcancels the timer.What was missing
The whole scheme assumes
lastSeenAtreflects recent liveness and thatconnection === undefinedreliably means gone. Neither held:lastSeenAtwas never refreshed from ordinary traffic. It was only stamped at handshake, reconnect, and disconnect — so for a long-lived, chatty client it was stale by minutes or hours. When the orphan check armed the timeout,now - lastSeenAtwas huge, so the delay collapsed to 0 and the timeout fired immediately instead of granting any grace.connectionis a single pointer perclientId. When aclientIdis reused across overlapping transports (a window reconnects before its previous transport'sonCloseis observed), the pointer only tracks the most-recent transport. Closing that transport clearsconnectioneven though another transport for the same logical client is still live. Soconnection === undefineddid not imply the client was gone.Together these turned the "grace window" into an instant kill for exactly the live client it was meant to protect: the orphan check saw
connection === undefined, armed with a zero delay derived from a stalelastSeenAt, and force-failed a perfectly healthy tool call.Fix
_markClientSeenbumpslastSeenAton every inbound frame for an established client (not just at handshake/disconnect)._isClientGraceExpiredchecks whether any frame arrived within the grace window (recency-based, not theconnectionpointer) before force-failing. If the client is still alive, re-arm — but only while it still owns a pending tool call, so an active client never leaves a perpetual timer ticking.This keeps the original guarantee (a genuinely-gone client's pending calls still fail after the grace window) while no longer killing a client that is demonstrably alive. Fixing this is necessary and sufficient to restore the confirmation UI in the incident.
Tests
Adds two regression tests to
protocolServerHandler.test.ts:Streamingacross multiple grace windows despiteconnection === undefined(this fails without the fix).Existing ProtocolServerHandler suite (55 tests) unchanged and green; 57 passing total.
npm run typecheck-clientpasses.(Written by Copilot)