You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The Claude Agent SDK SDKResultMessage.usage object carries four token counters (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens), plus a per-model modelUsage map (Anthropic SDK docs). The executor's terminal Claude Agent SDK execution completed log line at src/core/executor.ts:439 reads only the two cache counters at src/core/executor.ts:445 and src/core/executor.ts:446; input_tokens, output_tokens, and the entire modelUsage breakdown are dropped on the floor. The companion buildExecutionResult at src/core/executor.ts:456 propagates only success, durationMs, costUsd, numTurns into ExecutionResult (interface at src/types.ts:82), so even if a caller wanted to read tokens, the pipeline-level type strips them. The downstream pipeline.completed event at src/core/pipeline.ts:482 carries the same four fields plus pipeline_wall_clock_ms, and markExecutionCompleted at src/orchestrator/history.ts:159 persists cost_usd, duration_ms, num_turns against the executions row (table at src/db/migrations/001_initial.sql:10, cost column at src/db/migrations/001_initial.sql:25) — no token columns exist anywhere in the schema. The official Anthropic guidance explicitly says "Track these [cache tokens] separately from input_tokens to understand caching savings" (cost-tracking docs), but the current logging records the cache numerator without the input_tokens denominator that makes a hit-ratio computable.
The asymmetry is starker still against the Bedrock single-turn adaptor in src/ai/llm-client.ts, which already tracks both inputTokens and outputTokens cleanly through a typed usage shape (src/ai/llm-client.ts:151-152, accumulator at src/ai/llm-client.ts:606-608) — the multi-turn SDK path is the only LLM call site in the codebase that leaves token information unobserved. The docs/operate/observability.md:33 field table documents cost_usd as the sole agent-side LLM telemetry field; tokens are not mentioned, mirroring their absence from the logs. The PROMPT_CACHE_LAYOUT=cacheable rollout (see executor inline comment at src/core/executor.ts:434-438) is justified specifically by cache savings, yet operators flipping it cannot quantify the savings ratio without the missing input_tokens baseline.
The remediation extends the existing pattern verbatim. Add four fields (inputTokens, outputTokens, plus the existing two cache counters) to the executor's terminal log object at src/core/executor.ts:439-450, add a fifth structured field carrying a stable summary of modelUsage (a small array of { model, inputTokens, outputTokens, cacheReadInputTokens, cacheCreationInputTokens, costUSD }), thread the four scalar counters through ExecutionResult (at src/types.ts:82) and buildExecutionResult (at src/core/executor.ts:456) so they reach pipeline.completed at src/core/pipeline.ts:482, and add four nullable INTEGER columns + one JSONB column to executions in a new 015_executions_tokens.sql migration so the markExecutionCompleted UPDATE at src/orchestrator/history.ts:159-174 can persist them for FR-014-style aggregates. Also extend the .strict() schema home at src/core/log-fields.ts:18-37 so the new fields are pinned by the existing field-drift test and won't silently regress.
Diagram
flowchart LR
SDK["Claude Agent SDK<br/>query iterator"]
Result["SDKResultMessage.usage<br/>+ modelUsage map"]
SDK --> Result
Result -.->|"logged today"| Cache["cache_read_input_tokens<br/>cache_creation_input_tokens"]
Result -.->|"dropped today"| Drop1["input_tokens"]
Result -.->|"dropped today"| Drop2["output_tokens"]
Result -.->|"dropped today"| Drop3["modelUsage per-model<br/>inputTokens, outputTokens,<br/>cacheReadInputTokens, costUSD"]
Cache --> Log["src/core/executor.ts line 439<br/>Claude Agent SDK execution completed"]
Log --> Pipeline["pipeline.completed<br/>src/core/pipeline.ts line 482<br/>cost_usd + num_turns only"]
Pipeline --> DB["executions table<br/>cost_usd, duration_ms,<br/>num_turns only"]
classDef today fill:#196f3d;color:#ffffff
classDef gap fill:#7d3c98;color:#ffffff
classDef code fill:#1a5276;color:#ffffff
class Cache,Log today
class Drop1,Drop2,Drop3 gap
class SDK,Result,Pipeline,DB code
Loading
Rationale
Cost without tokens is opaque. The current cost_usd log + DB column tells an operator how much a single execution was billed, but not whether that bill came from a 500 KB prompt running for two turns or a 5 KB prompt running for fifty turns. Two pathologies that look identical in cost_usd look dramatically different in input_tokens / output_tokens: an oversized prompt-build is an input_tokens problem (fix at src/core/prompt-builder.ts), a runaway tool-call loop is an output_tokens + num_turns problem (fix at the agent prompt). The Anthropic docs themselves call out that the SDK price model is a client-side estimate that can drift from authoritative billing (cost-tracking docs warning box); operators that want to reconcile against the real bill need the underlying token counts, not just the estimate.
Cache observability is the load-bearing case. Issue #134 closed the prompt-cache-stability hole by stripping per-job cwd from the SDK system prompt, and the inline comment at src/core/executor.ts:434-438 explicitly says operators flipping PROMPT_CACHE_LAYOUT=cacheable should watch for non-zero cache_read_input_tokens on the second+ run of the same shape. But "non-zero cache_read" is not enough to assert that caching is working: a job with 100 KB of prompt and 1 KB of cache read is wasting the cache; a job with 100 KB of prompt and 95 KB of cache read is using it correctly. The ratio cache_read_input_tokens / (input_tokens + cache_read_input_tokens + cache_creation_input_tokens) is the metric that matters, and input_tokens is the missing denominator. Persisting tokens to executions unblocks the per-installation, per-workflow, per-day SQL aggregates that the existing dispatch_stats.ts queries already do for cost.
The modelUsage gap matters because the codebase already runs more than one model per execution path: the default agent runs Opus 4.7 (config.model default), while the Bedrock-side intent classification in src/ai/llm-client.ts runs a separate (typically Haiku) model. The Anthropic docs flag this exact scenario as the reason modelUsage exists: "useful when you run multiple models (for example, Haiku for subagents and Opus for the main agent) and want to see where tokens are going" (cost-tracking docs). The shape is also the canonical input to the upstream OpenTelemetry GenAI semantic conventions, which define gen_ai.client.token.usage as a histogram filterable by gen_ai.token.type (input/output) and a per-model dimension (OTel GenAI metrics) — wiring this pattern in now keeps the door open for a future OTel exporter without retrofit.
References
Internal:
src/core/executor.ts:439#executeAgent — Claude Agent SDK execution completed log line; the chokepoint where input_tokens / output_tokens / modelUsage are read from result.usage and dropped.
src/core/executor.ts:445 — cacheReadInputTokens field, the proven precedent for adding sibling token fields.
src/core/pipeline.ts:482 — pipeline.completed event; downstream consumer that would benefit from token fields.
src/orchestrator/history.ts:159#markExecutionCompleted — the UPDATE that persists cost_usd / duration_ms / num_turns; natural site for new token columns.
src/orchestrator/history.ts:161 — the result parameter shape, missing token fields.
src/db/migrations/001_initial.sql:10 — CREATE TABLE executions; no token columns exist.
src/db/migrations/001_initial.sql:25 — cost_usd NUMERIC(10,6); the precedent column to mirror.
src/core/log-fields.ts:18#CORE_PIPELINE_LOG_EVENTS — the strict-schema home that pins pipeline.completed against drift.
src/ai/llm-client.ts:151 — already-typed inputTokens on the Bedrock single-turn adaptor; precedent for the same TypeScript field shape on the multi-turn path.
docs/operate/observability.md:33 — cost_usd field documentation; the table that would gain the new token rows.
External:
Track cost and usage — Claude Agent SDK docs — authoritative spec for SDKResultMessage.usage (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens) and modelUsage per-model breakdown; the docs explicitly recommend tracking cache tokens separately from input_tokens to compute caching savings.
Extend the executor's terminal log at src/core/executor.ts:439-450 to include inputTokens: result?.usage?.input_tokens, outputTokens: result?.usage?.output_tokens, and a compact modelUsage array derived from result?.modelUsage (one entry per model with inputTokens, outputTokens, cacheReadInputTokens, cacheCreationInputTokens, costUSD). Keep the existing two cache fields untouched so log-shipper consumers don't break.
Add inputTokens?, outputTokens?, cacheReadInputTokens?, cacheCreationInputTokens?, and modelUsage?: ReadonlyArray<{ model: string; inputTokens: number; outputTokens: number; cacheReadInputTokens: number; cacheCreationInputTokens: number; costUsd: number }> to ExecutionResult at src/types.ts:82. Populate them in buildExecutionResult at src/core/executor.ts:456, mirroring the existing costUsd / numTurns conditional-assignment pattern that respects exactOptionalPropertyTypes.
Thread the four scalar token counters into the pipeline.completed log at src/core/pipeline.ts:482 so the field set on the terminal line matches the executor line. Don't include the full modelUsage array in the terminal line: it's already on the executor line, and re-emitting it doubles log bytes for no consumer benefit.
Add 015_executions_tokens.sql adding four nullable INTEGER columns (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens) and one nullable JSONB column (model_usage) to the executions table. Extend markExecutionCompleted at src/orchestrator/history.ts:159 to write them. Existing FR-014 dispatch_stats queries continue to work unchanged (the new columns are additive).
Extend src/core/log-fields.ts:29#PipelineStageLogSchema-style strict shape coverage so a new PipelineCompletedLogSchema pins the seven structured fields (event, success, durationMs, costUsd, numTurns, pipeline_wall_clock_ms, plus the four token integers); the co-located round-trip test catches field-name drift the same way it does for the existing schema.
Update docs/operate/observability.md — extend the ## Common log fields table around line 33 with the new token rows, and add a short subsection under ## Core pipeline log fields explaining cache hit-ratio interpretation (cache_read / (input + cache_read + cache_creation)).
This run evaluated the agent-side execution telemetry surface (src/core/executor.ts, src/core/pipeline.ts, src/types.ts, src/ai/llm-client.ts, src/orchestrator/history.ts, src/db/migrations/001_initial.sql, src/core/log-fields.ts) against the canonical observability conventions established in docs/operate/observability.md and the upstream Anthropic SDK + OpenTelemetry GenAI cost-tracking specifications. Cross-checked against the ten prior area: observability issues (#164, #166, #170, #172, #174, #175, #177, #181, #184, #187) to ensure no material overlap — none of them touch usage, modelUsage, or any token field. The asymmetry against the already-instrumented Bedrock single-turn path in src/ai/llm-client.ts was the decisive corroborating signal. The MCP-tool-call-duration angle and the per-handler token-cost aggregate angle (workflow_runs join) are natural follow-ups; both are deliberately scoped out here to keep this finding small and reviewable.
Generated by the scheduled research action on 2026-05-30
Finding
The Claude Agent SDK
SDKResultMessage.usageobject carries four token counters (input_tokens,output_tokens,cache_read_input_tokens,cache_creation_input_tokens), plus a per-modelmodelUsagemap (Anthropic SDK docs). The executor's terminalClaude Agent SDK execution completedlog line atsrc/core/executor.ts:439reads only the two cache counters atsrc/core/executor.ts:445andsrc/core/executor.ts:446;input_tokens,output_tokens, and the entiremodelUsagebreakdown are dropped on the floor. The companionbuildExecutionResultatsrc/core/executor.ts:456propagates onlysuccess,durationMs,costUsd,numTurnsintoExecutionResult(interface atsrc/types.ts:82), so even if a caller wanted to read tokens, the pipeline-level type strips them. The downstreampipeline.completedevent atsrc/core/pipeline.ts:482carries the same four fields pluspipeline_wall_clock_ms, andmarkExecutionCompletedatsrc/orchestrator/history.ts:159persistscost_usd,duration_ms,num_turnsagainst theexecutionsrow (table atsrc/db/migrations/001_initial.sql:10, cost column atsrc/db/migrations/001_initial.sql:25) — no token columns exist anywhere in the schema. The official Anthropic guidance explicitly says "Track these [cache tokens] separately frominput_tokensto understand caching savings" (cost-tracking docs), but the current logging records the cache numerator without theinput_tokensdenominator that makes a hit-ratio computable.The asymmetry is starker still against the Bedrock single-turn adaptor in
src/ai/llm-client.ts, which already tracks bothinputTokensandoutputTokenscleanly through a typedusageshape (src/ai/llm-client.ts:151-152, accumulator atsrc/ai/llm-client.ts:606-608) — the multi-turn SDK path is the only LLM call site in the codebase that leaves token information unobserved. Thedocs/operate/observability.md:33field table documentscost_usdas the sole agent-side LLM telemetry field; tokens are not mentioned, mirroring their absence from the logs. The PROMPT_CACHE_LAYOUT=cacheable rollout (see executor inline comment atsrc/core/executor.ts:434-438) is justified specifically by cache savings, yet operators flipping it cannot quantify the savings ratio without the missinginput_tokensbaseline.The remediation extends the existing pattern verbatim. Add four fields (
inputTokens,outputTokens, plus the existing two cache counters) to the executor's terminal log object atsrc/core/executor.ts:439-450, add a fifth structured field carrying a stable summary ofmodelUsage(a small array of{ model, inputTokens, outputTokens, cacheReadInputTokens, cacheCreationInputTokens, costUSD }), thread the four scalar counters throughExecutionResult(atsrc/types.ts:82) andbuildExecutionResult(atsrc/core/executor.ts:456) so they reachpipeline.completedatsrc/core/pipeline.ts:482, and add four nullable INTEGER columns + one JSONB column toexecutionsin a new015_executions_tokens.sqlmigration so the markExecutionCompleted UPDATE atsrc/orchestrator/history.ts:159-174can persist them for FR-014-style aggregates. Also extend the.strict()schema home atsrc/core/log-fields.ts:18-37so the new fields are pinned by the existing field-drift test and won't silently regress.Diagram
flowchart LR SDK["Claude Agent SDK<br/>query iterator"] Result["SDKResultMessage.usage<br/>+ modelUsage map"] SDK --> Result Result -.->|"logged today"| Cache["cache_read_input_tokens<br/>cache_creation_input_tokens"] Result -.->|"dropped today"| Drop1["input_tokens"] Result -.->|"dropped today"| Drop2["output_tokens"] Result -.->|"dropped today"| Drop3["modelUsage per-model<br/>inputTokens, outputTokens,<br/>cacheReadInputTokens, costUSD"] Cache --> Log["src/core/executor.ts line 439<br/>Claude Agent SDK execution completed"] Log --> Pipeline["pipeline.completed<br/>src/core/pipeline.ts line 482<br/>cost_usd + num_turns only"] Pipeline --> DB["executions table<br/>cost_usd, duration_ms,<br/>num_turns only"] classDef today fill:#196f3d;color:#ffffff classDef gap fill:#7d3c98;color:#ffffff classDef code fill:#1a5276;color:#ffffff class Cache,Log today class Drop1,Drop2,Drop3 gap class SDK,Result,Pipeline,DB codeRationale
Cost without tokens is opaque. The current
cost_usdlog + DB column tells an operator how much a single execution was billed, but not whether that bill came from a 500 KB prompt running for two turns or a 5 KB prompt running for fifty turns. Two pathologies that look identical incost_usdlook dramatically different ininput_tokens/output_tokens: an oversized prompt-build is aninput_tokensproblem (fix atsrc/core/prompt-builder.ts), a runaway tool-call loop is anoutput_tokens+num_turnsproblem (fix at the agent prompt). The Anthropic docs themselves call out that the SDK price model is a client-side estimate that can drift from authoritative billing (cost-tracking docs warning box); operators that want to reconcile against the real bill need the underlying token counts, not just the estimate.Cache observability is the load-bearing case. Issue #134 closed the prompt-cache-stability hole by stripping per-job
cwdfrom the SDK system prompt, and the inline comment atsrc/core/executor.ts:434-438explicitly says operators flippingPROMPT_CACHE_LAYOUT=cacheableshould watch for non-zerocache_read_input_tokenson the second+ run of the same shape. But "non-zero cache_read" is not enough to assert that caching is working: a job with 100 KB of prompt and 1 KB of cache read is wasting the cache; a job with 100 KB of prompt and 95 KB of cache read is using it correctly. The ratiocache_read_input_tokens / (input_tokens + cache_read_input_tokens + cache_creation_input_tokens)is the metric that matters, andinput_tokensis the missing denominator. Persisting tokens toexecutionsunblocks the per-installation, per-workflow, per-day SQL aggregates that the existingdispatch_stats.tsqueries already do for cost.The
modelUsagegap matters because the codebase already runs more than one model per execution path: the default agent runs Opus 4.7 (config.modeldefault), while the Bedrock-side intent classification insrc/ai/llm-client.tsruns a separate (typically Haiku) model. The Anthropic docs flag this exact scenario as the reasonmodelUsageexists: "useful when you run multiple models (for example, Haiku for subagents and Opus for the main agent) and want to see where tokens are going" (cost-tracking docs). The shape is also the canonical input to the upstream OpenTelemetry GenAI semantic conventions, which definegen_ai.client.token.usageas a histogram filterable bygen_ai.token.type(input/output) and a per-model dimension (OTel GenAI metrics) — wiring this pattern in now keeps the door open for a future OTel exporter without retrofit.References
Internal:
src/core/executor.ts:439#executeAgent—Claude Agent SDK execution completedlog line; the chokepoint whereinput_tokens/output_tokens/modelUsageare read fromresult.usageand dropped.src/core/executor.ts:445—cacheReadInputTokensfield, the proven precedent for adding sibling token fields.src/core/executor.ts:446—cacheCreationInputTokensfield, ditto.src/core/executor.ts:456#buildExecutionResult— the type-narrow constructor that stripsusagebefore returningExecutionResult.src/types.ts:82—ExecutionResultinterface; missinginputTokens,outputTokens,cacheReadInputTokens,cacheCreationInputTokens,modelUsage.src/core/pipeline.ts:482—pipeline.completedevent; downstream consumer that would benefit from token fields.src/orchestrator/history.ts:159#markExecutionCompleted— the UPDATE that persistscost_usd/duration_ms/num_turns; natural site for new token columns.src/orchestrator/history.ts:161— theresultparameter shape, missing token fields.src/db/migrations/001_initial.sql:10—CREATE TABLE executions; no token columns exist.src/db/migrations/001_initial.sql:25—cost_usd NUMERIC(10,6); the precedent column to mirror.src/core/log-fields.ts:18#CORE_PIPELINE_LOG_EVENTS— the strict-schema home that pinspipeline.completedagainst drift.src/ai/llm-client.ts:151— already-typedinputTokenson the Bedrock single-turn adaptor; precedent for the same TypeScript field shape on the multi-turn path.docs/operate/observability.md:33—cost_usdfield documentation; the table that would gain the new token rows.External:
SDKResultMessage.usage(input_tokens,output_tokens,cache_read_input_tokens,cache_creation_input_tokens) andmodelUsageper-model breakdown; the docs explicitly recommend tracking cache tokens separately frominput_tokensto compute caching savings.gen_ai.client.token.usageas a histogram filterable bygen_ai.token.type(input/output) and by model, the upstream observability shape this finding aligns to.gen_ai.usage.input_tokensandgen_ai.usage.output_tokensas the named attributes every LLM client SHOULD emit.Suggested Next Steps
src/core/executor.ts:439-450to includeinputTokens: result?.usage?.input_tokens,outputTokens: result?.usage?.output_tokens, and a compactmodelUsagearray derived fromresult?.modelUsage(one entry per model withinputTokens,outputTokens,cacheReadInputTokens,cacheCreationInputTokens,costUSD). Keep the existing two cache fields untouched so log-shipper consumers don't break.inputTokens?,outputTokens?,cacheReadInputTokens?,cacheCreationInputTokens?, andmodelUsage?: ReadonlyArray<{ model: string; inputTokens: number; outputTokens: number; cacheReadInputTokens: number; cacheCreationInputTokens: number; costUsd: number }>toExecutionResultatsrc/types.ts:82. Populate them inbuildExecutionResultatsrc/core/executor.ts:456, mirroring the existingcostUsd/numTurnsconditional-assignment pattern that respectsexactOptionalPropertyTypes.pipeline.completedlog atsrc/core/pipeline.ts:482so the field set on the terminal line matches the executor line. Don't include the fullmodelUsagearray in the terminal line: it's already on the executor line, and re-emitting it doubles log bytes for no consumer benefit.015_executions_tokens.sqladding four nullableINTEGERcolumns (input_tokens,output_tokens,cache_read_input_tokens,cache_creation_input_tokens) and one nullableJSONBcolumn (model_usage) to theexecutionstable. ExtendmarkExecutionCompletedatsrc/orchestrator/history.ts:159to write them. Existing FR-014 dispatch_stats queries continue to work unchanged (the new columns are additive).src/core/log-fields.ts:29#PipelineStageLogSchema-style strict shape coverage so a newPipelineCompletedLogSchemapins the seven structured fields (event,success,durationMs,costUsd,numTurns,pipeline_wall_clock_ms, plus the four token integers); the co-located round-trip test catches field-name drift the same way it does for the existing schema.docs/operate/observability.md— extend the## Common log fieldstable around line 33 with the new token rows, and add a short subsection under## Core pipeline log fieldsexplaining cache hit-ratio interpretation (cache_read / (input + cache_read + cache_creation)).sum(input_tokens) / sum(input_tokens + cache_read_input_tokens)regression — a sudden drop in cache-read share is the signature of a prompt-cache stability regression (the failure mode perf(agent-sdk): claude_code preset embeds per-job cwd, breaking system-prompt cache across every dispatched run #134 closed).Areas Evaluated
This run evaluated the agent-side execution telemetry surface (
src/core/executor.ts,src/core/pipeline.ts,src/types.ts,src/ai/llm-client.ts,src/orchestrator/history.ts,src/db/migrations/001_initial.sql,src/core/log-fields.ts) against the canonical observability conventions established indocs/operate/observability.mdand the upstream Anthropic SDK + OpenTelemetry GenAI cost-tracking specifications. Cross-checked against the ten priorarea: observabilityissues (#164, #166, #170, #172, #174, #175, #177, #181, #184, #187) to ensure no material overlap — none of them touchusage,modelUsage, or any token field. The asymmetry against the already-instrumented Bedrock single-turn path insrc/ai/llm-client.tswas the decisive corroborating signal. The MCP-tool-call-duration angle and the per-handler token-cost aggregate angle (workflow_runs join) are natural follow-ups; both are deliberately scoped out here to keep this finding small and reviewable.Generated by the scheduled research action on 2026-05-30