Skip to content

SCAL-308282 Track polling and first response latency#143

Open
kanwarbajwa wants to merge 1 commit into
mainfrom
SCAL-308282_FirstVisibleResponseLatency
Open

SCAL-308282 Track polling and first response latency#143
kanwarbajwa wants to merge 1 commit into
mainfrom
SCAL-308282_FirstVisibleResponseLatency

Conversation

@kanwarbajwa

Copy link
Copy Markdown
Contributor

Summary

Instrument the analysis polling bridge so we can measure backend readiness, client polling behavior, and sessions that never get polled.

What Changed

  • add shared analysis metric helpers for session counters, poll wait, first buffered update, first poll delay, first non-empty response, never-polled sessions, and storage errors
  • persist timing metadata in the ConversationStorageServer Durable Object and expose it through the storage client so one-shot lifecycle metrics are emitted once per response
  • record request-scoped analysis counters and poll-wait histograms in the MCP server analysis tool methods
  • add focused specs for the conversation storage timing state and the analysis tool metrics flow

Notes

  • first-buffered, first-poll, first-non-empty, and never-polled metrics are emitted from the storage Durable Object because it owns the cross-request conversation state
  • local design docs were updated for our internal reference but are not part of this commit

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive metrics tracking for analysis sessions, including session creation, message delivery, and polling performance. It adds a metrics state to the ConversationStorageServer to track timing events like the first buffered update and first non-empty response, and updates the MCPServer to record these metrics during tool execution. Feedback identifies an opportunity to refactor duplicated message retrieval logic in the storage server into a private helper method to improve maintainability.

Comment thread src/servers/conversation-storage-server.ts Outdated
@kanwarbajwa kanwarbajwa force-pushed the SCAL-308282_FirstVisibleResponseLatency branch 3 times, most recently from 85306dd to aa8366a Compare May 8, 2026 20:18
Comment thread src/metrics/runtime/analysis-metrics.ts Outdated
Comment thread src/servers/mcp-server.ts Outdated
Comment thread src/servers/conversation-storage-server.ts Outdated
@kanwarbajwa kanwarbajwa force-pushed the SCAL-308282_FirstVisibleResponseLatency branch 2 times, most recently from 2b05c6e to c06d6bc Compare May 12, 2026 02:37
Instrument the analysis polling bridge so we can measure backend readiness,
client polling behavior, and sessions that never get polled.

- add shared analysis metric helpers for session counters, poll wait,
  first buffered update, first poll delay, first non-empty response,
  never-polled sessions, and storage errors
- persist timing metadata in the ConversationStorageServer Durable Object
  and expose it through the storage client so one-shot lifecycle metrics
  are emitted once per response
- record request-scoped analysis counters and poll-wait histograms in the
  MCP server analysis tool methods
- add focused specs for the conversation storage timing state and the
  analysis tool metrics flow

- first-buffered, first-poll, first-non-empty, and never-polled metrics are
  emitted from the storage Durable Object because it owns the cross-request
  conversation state
- local design docs were updated for our internal reference but are not part
  of this commit

- npm run lint
- npm test -- --coverage.enabled=false
@kanwarbajwa kanwarbajwa force-pushed the SCAL-308282_FirstVisibleResponseLatency branch from c06d6bc to 4e1020f Compare May 12, 2026 03:53
return new Response("Not Found", { status: 404 });
}
} catch (err) {
await this.recordStorageErrorMetric(this.toStorageOperation(operation));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: prefer to do this after the console.error line, for two reasons:

  1. Logging error in console is always reliable and we should do it first before other more complex error handling which can introduce further failure modes resulting in non-logging
  2. This is async, so I want to log the error immediately first and foremost

}
}

private toStorageOperation(operation: string): StreamStorageOperation {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be static

case "messages":
return STREAM_STORAGE_OPERATIONS.messages;
default:
return STREAM_STORAGE_OPERATIONS.unknown;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we are never recording STREAM_STORAGE_OPERATIONS.alarm, can we remove it?

* conversationId, to ensure no users can access each other's conversations.
* The route path uses the conversation id. Isolation between users happens at the
* Durable Object id level by hashing the access token and combining it with the
* conversation id when constructing the stub name.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we change these header comments? We are still using the storageId concept from my last change, there should be no change here.

const body =
((await request
.json()
.catch(() => ({}))) as Partial<ConversationMetricsState>) ?? {};

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's update the top-level header comment of this class to mention that we are passing metrics data in the request body

const [existingIsDone, metricsState] = await Promise.all([
this.state.storage.get<boolean>(IS_DONE_KEY),
this.getConversationMetricsState(),
]);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just call getConversationStateSnapshot() here and do away with getConversationMetricsState(). It feels unclean to be fetching 2 different keys in separate transactions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants