SCAL-308282 Track polling and first response latency#143
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive metrics tracking for analysis sessions, including session creation, message delivery, and polling performance. It adds a metrics state to the ConversationStorageServer to track timing events like the first buffered update and first non-empty response, and updates the MCPServer to record these metrics during tool execution. Feedback identifies an opportunity to refactor duplicated message retrieval logic in the storage server into a private helper method to improve maintainability.
85306dd to
aa8366a
Compare
2b05c6e to
c06d6bc
Compare
Instrument the analysis polling bridge so we can measure backend readiness, client polling behavior, and sessions that never get polled. - add shared analysis metric helpers for session counters, poll wait, first buffered update, first poll delay, first non-empty response, never-polled sessions, and storage errors - persist timing metadata in the ConversationStorageServer Durable Object and expose it through the storage client so one-shot lifecycle metrics are emitted once per response - record request-scoped analysis counters and poll-wait histograms in the MCP server analysis tool methods - add focused specs for the conversation storage timing state and the analysis tool metrics flow - first-buffered, first-poll, first-non-empty, and never-polled metrics are emitted from the storage Durable Object because it owns the cross-request conversation state - local design docs were updated for our internal reference but are not part of this commit - npm run lint - npm test -- --coverage.enabled=false
c06d6bc to
4e1020f
Compare
| return new Response("Not Found", { status: 404 }); | ||
| } | ||
| } catch (err) { | ||
| await this.recordStorageErrorMetric(this.toStorageOperation(operation)); |
There was a problem hiding this comment.
Nit: prefer to do this after the console.error line, for two reasons:
- Logging error in console is always reliable and we should do it first before other more complex error handling which can introduce further failure modes resulting in non-logging
- This is async, so I want to log the error immediately first and foremost
| } | ||
| } | ||
|
|
||
| private toStorageOperation(operation: string): StreamStorageOperation { |
| case "messages": | ||
| return STREAM_STORAGE_OPERATIONS.messages; | ||
| default: | ||
| return STREAM_STORAGE_OPERATIONS.unknown; |
There was a problem hiding this comment.
It looks like we are never recording STREAM_STORAGE_OPERATIONS.alarm, can we remove it?
| * conversationId, to ensure no users can access each other's conversations. | ||
| * The route path uses the conversation id. Isolation between users happens at the | ||
| * Durable Object id level by hashing the access token and combining it with the | ||
| * conversation id when constructing the stub name. |
There was a problem hiding this comment.
Why did we change these header comments? We are still using the storageId concept from my last change, there should be no change here.
| const body = | ||
| ((await request | ||
| .json() | ||
| .catch(() => ({}))) as Partial<ConversationMetricsState>) ?? {}; |
There was a problem hiding this comment.
Let's update the top-level header comment of this class to mention that we are passing metrics data in the request body
| const [existingIsDone, metricsState] = await Promise.all([ | ||
| this.state.storage.get<boolean>(IS_DONE_KEY), | ||
| this.getConversationMetricsState(), | ||
| ]); |
There was a problem hiding this comment.
Let's just call getConversationStateSnapshot() here and do away with getConversationMetricsState(). It feels unclean to be fetching 2 different keys in separate transactions.
Summary
Instrument the analysis polling bridge so we can measure backend readiness, client polling behavior, and sessions that never get polled.
What Changed
Notes