Skip to content

feat(observability): log input/output tokens + modelUsage from SDKResultMessage; persist to executions table #192

@chrisleekr

Description

@chrisleekr

Finding

The Claude Agent SDK SDKResultMessage.usage object carries four token counters (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens), plus a per-model modelUsage map (Anthropic SDK docs). The executor's terminal Claude Agent SDK execution completed log line at src/core/executor.ts:439 reads only the two cache counters at src/core/executor.ts:445 and src/core/executor.ts:446; input_tokens, output_tokens, and the entire modelUsage breakdown are dropped on the floor. The companion buildExecutionResult at src/core/executor.ts:456 propagates only success, durationMs, costUsd, numTurns into ExecutionResult (interface at src/types.ts:82), so even if a caller wanted to read tokens, the pipeline-level type strips them. The downstream pipeline.completed event at src/core/pipeline.ts:482 carries the same four fields plus pipeline_wall_clock_ms, and markExecutionCompleted at src/orchestrator/history.ts:159 persists cost_usd, duration_ms, num_turns against the executions row (table at src/db/migrations/001_initial.sql:10, cost column at src/db/migrations/001_initial.sql:25) — no token columns exist anywhere in the schema. The official Anthropic guidance explicitly says "Track these [cache tokens] separately from input_tokens to understand caching savings" (cost-tracking docs), but the current logging records the cache numerator without the input_tokens denominator that makes a hit-ratio computable.

The asymmetry is starker still against the Bedrock single-turn adaptor in src/ai/llm-client.ts, which already tracks both inputTokens and outputTokens cleanly through a typed usage shape (src/ai/llm-client.ts:151-152, accumulator at src/ai/llm-client.ts:606-608) — the multi-turn SDK path is the only LLM call site in the codebase that leaves token information unobserved. The docs/operate/observability.md:33 field table documents cost_usd as the sole agent-side LLM telemetry field; tokens are not mentioned, mirroring their absence from the logs. The PROMPT_CACHE_LAYOUT=cacheable rollout (see executor inline comment at src/core/executor.ts:434-438) is justified specifically by cache savings, yet operators flipping it cannot quantify the savings ratio without the missing input_tokens baseline.

The remediation extends the existing pattern verbatim. Add four fields (inputTokens, outputTokens, plus the existing two cache counters) to the executor's terminal log object at src/core/executor.ts:439-450, add a fifth structured field carrying a stable summary of modelUsage (a small array of { model, inputTokens, outputTokens, cacheReadInputTokens, cacheCreationInputTokens, costUSD }), thread the four scalar counters through ExecutionResult (at src/types.ts:82) and buildExecutionResult (at src/core/executor.ts:456) so they reach pipeline.completed at src/core/pipeline.ts:482, and add four nullable INTEGER columns + one JSONB column to executions in a new 015_executions_tokens.sql migration so the markExecutionCompleted UPDATE at src/orchestrator/history.ts:159-174 can persist them for FR-014-style aggregates. Also extend the .strict() schema home at src/core/log-fields.ts:18-37 so the new fields are pinned by the existing field-drift test and won't silently regress.

Diagram

flowchart LR
    SDK["Claude Agent SDK<br/>query iterator"]
    Result["SDKResultMessage.usage<br/>+ modelUsage map"]

    SDK --> Result

    Result -.->|"logged today"| Cache["cache_read_input_tokens<br/>cache_creation_input_tokens"]
    Result -.->|"dropped today"| Drop1["input_tokens"]
    Result -.->|"dropped today"| Drop2["output_tokens"]
    Result -.->|"dropped today"| Drop3["modelUsage per-model<br/>inputTokens, outputTokens,<br/>cacheReadInputTokens, costUSD"]

    Cache --> Log["src/core/executor.ts line 439<br/>Claude Agent SDK execution completed"]
    Log --> Pipeline["pipeline.completed<br/>src/core/pipeline.ts line 482<br/>cost_usd + num_turns only"]
    Pipeline --> DB["executions table<br/>cost_usd, duration_ms,<br/>num_turns only"]

    classDef today fill:#196f3d;color:#ffffff
    classDef gap fill:#7d3c98;color:#ffffff
    classDef code fill:#1a5276;color:#ffffff

    class Cache,Log today
    class Drop1,Drop2,Drop3 gap
    class SDK,Result,Pipeline,DB code
Loading

Rationale

Cost without tokens is opaque. The current cost_usd log + DB column tells an operator how much a single execution was billed, but not whether that bill came from a 500 KB prompt running for two turns or a 5 KB prompt running for fifty turns. Two pathologies that look identical in cost_usd look dramatically different in input_tokens / output_tokens: an oversized prompt-build is an input_tokens problem (fix at src/core/prompt-builder.ts), a runaway tool-call loop is an output_tokens + num_turns problem (fix at the agent prompt). The Anthropic docs themselves call out that the SDK price model is a client-side estimate that can drift from authoritative billing (cost-tracking docs warning box); operators that want to reconcile against the real bill need the underlying token counts, not just the estimate.

Cache observability is the load-bearing case. Issue #134 closed the prompt-cache-stability hole by stripping per-job cwd from the SDK system prompt, and the inline comment at src/core/executor.ts:434-438 explicitly says operators flipping PROMPT_CACHE_LAYOUT=cacheable should watch for non-zero cache_read_input_tokens on the second+ run of the same shape. But "non-zero cache_read" is not enough to assert that caching is working: a job with 100 KB of prompt and 1 KB of cache read is wasting the cache; a job with 100 KB of prompt and 95 KB of cache read is using it correctly. The ratio cache_read_input_tokens / (input_tokens + cache_read_input_tokens + cache_creation_input_tokens) is the metric that matters, and input_tokens is the missing denominator. Persisting tokens to executions unblocks the per-installation, per-workflow, per-day SQL aggregates that the existing dispatch_stats.ts queries already do for cost.

The modelUsage gap matters because the codebase already runs more than one model per execution path: the default agent runs Opus 4.7 (config.model default), while the Bedrock-side intent classification in src/ai/llm-client.ts runs a separate (typically Haiku) model. The Anthropic docs flag this exact scenario as the reason modelUsage exists: "useful when you run multiple models (for example, Haiku for subagents and Opus for the main agent) and want to see where tokens are going" (cost-tracking docs). The shape is also the canonical input to the upstream OpenTelemetry GenAI semantic conventions, which define gen_ai.client.token.usage as a histogram filterable by gen_ai.token.type (input/output) and a per-model dimension (OTel GenAI metrics) — wiring this pattern in now keeps the door open for a future OTel exporter without retrofit.

References

Internal:

  • src/core/executor.ts:439#executeAgentClaude Agent SDK execution completed log line; the chokepoint where input_tokens / output_tokens / modelUsage are read from result.usage and dropped.
  • src/core/executor.ts:445cacheReadInputTokens field, the proven precedent for adding sibling token fields.
  • src/core/executor.ts:446cacheCreationInputTokens field, ditto.
  • src/core/executor.ts:456#buildExecutionResult — the type-narrow constructor that strips usage before returning ExecutionResult.
  • src/types.ts:82ExecutionResult interface; missing inputTokens, outputTokens, cacheReadInputTokens, cacheCreationInputTokens, modelUsage.
  • src/core/pipeline.ts:482pipeline.completed event; downstream consumer that would benefit from token fields.
  • src/orchestrator/history.ts:159#markExecutionCompleted — the UPDATE that persists cost_usd / duration_ms / num_turns; natural site for new token columns.
  • src/orchestrator/history.ts:161 — the result parameter shape, missing token fields.
  • src/db/migrations/001_initial.sql:10CREATE TABLE executions; no token columns exist.
  • src/db/migrations/001_initial.sql:25cost_usd NUMERIC(10,6); the precedent column to mirror.
  • src/core/log-fields.ts:18#CORE_PIPELINE_LOG_EVENTS — the strict-schema home that pins pipeline.completed against drift.
  • src/ai/llm-client.ts:151 — already-typed inputTokens on the Bedrock single-turn adaptor; precedent for the same TypeScript field shape on the multi-turn path.
  • docs/operate/observability.md:33cost_usd field documentation; the table that would gain the new token rows.

External:

Suggested Next Steps

  1. Extend the executor's terminal log at src/core/executor.ts:439-450 to include inputTokens: result?.usage?.input_tokens, outputTokens: result?.usage?.output_tokens, and a compact modelUsage array derived from result?.modelUsage (one entry per model with inputTokens, outputTokens, cacheReadInputTokens, cacheCreationInputTokens, costUSD). Keep the existing two cache fields untouched so log-shipper consumers don't break.
  2. Add inputTokens?, outputTokens?, cacheReadInputTokens?, cacheCreationInputTokens?, and modelUsage?: ReadonlyArray<{ model: string; inputTokens: number; outputTokens: number; cacheReadInputTokens: number; cacheCreationInputTokens: number; costUsd: number }> to ExecutionResult at src/types.ts:82. Populate them in buildExecutionResult at src/core/executor.ts:456, mirroring the existing costUsd / numTurns conditional-assignment pattern that respects exactOptionalPropertyTypes.
  3. Thread the four scalar token counters into the pipeline.completed log at src/core/pipeline.ts:482 so the field set on the terminal line matches the executor line. Don't include the full modelUsage array in the terminal line: it's already on the executor line, and re-emitting it doubles log bytes for no consumer benefit.
  4. Add 015_executions_tokens.sql adding four nullable INTEGER columns (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens) and one nullable JSONB column (model_usage) to the executions table. Extend markExecutionCompleted at src/orchestrator/history.ts:159 to write them. Existing FR-014 dispatch_stats queries continue to work unchanged (the new columns are additive).
  5. Extend src/core/log-fields.ts:29#PipelineStageLogSchema-style strict shape coverage so a new PipelineCompletedLogSchema pins the seven structured fields (event, success, durationMs, costUsd, numTurns, pipeline_wall_clock_ms, plus the four token integers); the co-located round-trip test catches field-name drift the same way it does for the existing schema.
  6. Update docs/operate/observability.md — extend the ## Common log fields table around line 33 with the new token rows, and add a short subsection under ## Core pipeline log fields explaining cache hit-ratio interpretation (cache_read / (input + cache_read + cache_creation)).
  7. Add one alert recipe to the existing alerts section: per-installation sum(input_tokens) / sum(input_tokens + cache_read_input_tokens) regression — a sudden drop in cache-read share is the signature of a prompt-cache stability regression (the failure mode perf(agent-sdk): claude_code preset embeds per-job cwd, breaking system-prompt cache across every dispatched run #134 closed).

Areas Evaluated

This run evaluated the agent-side execution telemetry surface (src/core/executor.ts, src/core/pipeline.ts, src/types.ts, src/ai/llm-client.ts, src/orchestrator/history.ts, src/db/migrations/001_initial.sql, src/core/log-fields.ts) against the canonical observability conventions established in docs/operate/observability.md and the upstream Anthropic SDK + OpenTelemetry GenAI cost-tracking specifications. Cross-checked against the ten prior area: observability issues (#164, #166, #170, #172, #174, #175, #177, #181, #184, #187) to ensure no material overlap — none of them touch usage, modelUsage, or any token field. The asymmetry against the already-instrumented Bedrock single-turn path in src/ai/llm-client.ts was the decisive corroborating signal. The MCP-tool-call-duration angle and the per-handler token-cost aggregate angle (workflow_runs join) are natural follow-ups; both are deliberately scoped out here to keep this finding small and reviewable.

Generated by the scheduled research action on 2026-05-30

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions