feat(observability): log input/output tokens + modelUsage from SDKResultMessage; persist to executions table

## Finding

The Claude Agent SDK `SDKResultMessage.usage` object carries four token counters (`input_tokens`, `output_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`), plus a per-model `modelUsage` map ([Anthropic SDK docs](https://code.claude.com/docs/en/agent-sdk/cost-tracking)). The executor's terminal `Claude Agent SDK execution completed` log line at `src/core/executor.ts:439` reads only the two cache counters at `src/core/executor.ts:445` and `src/core/executor.ts:446`; `input_tokens`, `output_tokens`, and the entire `modelUsage` breakdown are dropped on the floor. The companion `buildExecutionResult` at `src/core/executor.ts:456` propagates only `success`, `durationMs`, `costUsd`, `numTurns` into `ExecutionResult` (interface at `src/types.ts:82`), so even if a caller wanted to read tokens, the pipeline-level type strips them. The downstream `pipeline.completed` event at `src/core/pipeline.ts:482` carries the same four fields plus `pipeline_wall_clock_ms`, and `markExecutionCompleted` at `src/orchestrator/history.ts:159` persists `cost_usd`, `duration_ms`, `num_turns` against the `executions` row (table at `src/db/migrations/001_initial.sql:10`, cost column at `src/db/migrations/001_initial.sql:25`) — no token columns exist anywhere in the schema. The official Anthropic guidance explicitly says "Track these [cache tokens] separately from `input_tokens` to understand caching savings" ([cost-tracking docs](https://code.claude.com/docs/en/agent-sdk/cost-tracking)), but the current logging records the cache numerator without the `input_tokens` denominator that makes a hit-ratio computable.

The asymmetry is starker still against the Bedrock single-turn adaptor in `src/ai/llm-client.ts`, which already tracks both `inputTokens` and `outputTokens` cleanly through a typed `usage` shape (`src/ai/llm-client.ts:151-152`, accumulator at `src/ai/llm-client.ts:606-608`) — the multi-turn SDK path is the only LLM call site in the codebase that leaves token information unobserved. The `docs/operate/observability.md:33` field table documents `cost_usd` as the sole agent-side LLM telemetry field; tokens are not mentioned, mirroring their absence from the logs. The PROMPT_CACHE_LAYOUT=cacheable rollout (see executor inline comment at `src/core/executor.ts:434-438`) is justified specifically by cache savings, yet operators flipping it cannot quantify the savings ratio without the missing `input_tokens` baseline.

The remediation extends the existing pattern verbatim. Add four fields (`inputTokens`, `outputTokens`, plus the existing two cache counters) to the executor's terminal log object at `src/core/executor.ts:439-450`, add a fifth structured field carrying a stable summary of `modelUsage` (a small array of `{ model, inputTokens, outputTokens, cacheReadInputTokens, cacheCreationInputTokens, costUSD }`), thread the four scalar counters through `ExecutionResult` (at `src/types.ts:82`) and `buildExecutionResult` (at `src/core/executor.ts:456`) so they reach `pipeline.completed` at `src/core/pipeline.ts:482`, and add four nullable INTEGER columns + one JSONB column to `executions` in a new `015_executions_tokens.sql` migration so the markExecutionCompleted UPDATE at `src/orchestrator/history.ts:159-174` can persist them for FR-014-style aggregates. Also extend the `.strict()` schema home at `src/core/log-fields.ts:18-37` so the new fields are pinned by the existing field-drift test and won't silently regress.

## Diagram

```mermaid
flowchart LR
 SDK["Claude Agent SDK query iterator"]
 Result["SDKResultMessage.usage + modelUsage map"]

 SDK --> Result

 Result -.->|"logged today"| Cache["cache_read_input_tokens cache_creation_input_tokens"]
 Result -.->|"dropped today"| Drop1["input_tokens"]
 Result -.->|"dropped today"| Drop2["output_tokens"]
 Result -.->|"dropped today"| Drop3["modelUsage per-model inputTokens, outputTokens, cacheReadInputTokens, costUSD"]

 Cache --> Log["src/core/executor.ts line 439 Claude Agent SDK execution completed"]
 Log --> Pipeline["pipeline.completed src/core/pipeline.ts line 482 cost_usd + num_turns only"]
 Pipeline --> DB["executions table cost_usd, duration_ms, num_turns only"]

 classDef today fill:#196f3d;color:#ffffff
 classDef gap fill:#7d3c98;color:#ffffff
 classDef code fill:#1a5276;color:#ffffff

 class Cache,Log today
 class Drop1,Drop2,Drop3 gap
 class SDK,Result,Pipeline,DB code
```

## Rationale

Cost without tokens is opaque. The current `cost_usd` log + DB column tells an operator how much a single execution was billed, but not whether that bill came from a 500 KB prompt running for two turns or a 5 KB prompt running for fifty turns. Two pathologies that look identical in `cost_usd` look dramatically different in `input_tokens` / `output_tokens`: an oversized prompt-build is an `input_tokens` problem (fix at `src/core/prompt-builder.ts`), a runaway tool-call loop is an `output_tokens` + `num_turns` problem (fix at the agent prompt). The Anthropic docs themselves call out that the SDK price model is a client-side estimate that can drift from authoritative billing ([cost-tracking docs warning box](https://code.claude.com/docs/en/agent-sdk/cost-tracking)); operators that want to reconcile against the real bill need the underlying token counts, not just the estimate.

Cache observability is the load-bearing case. Issue #134 closed the prompt-cache-stability hole by stripping per-job `cwd` from the SDK system prompt, and the inline comment at `src/core/executor.ts:434-438` explicitly says operators flipping `PROMPT_CACHE_LAYOUT=cacheable` should watch for non-zero `cache_read_input_tokens` on the second+ run of the same shape. But "non-zero cache_read" is not enough to assert that caching is working: a job with 100 KB of prompt and 1 KB of cache read is wasting the cache; a job with 100 KB of prompt and 95 KB of cache read is using it correctly. The ratio `cache_read_input_tokens / (input_tokens + cache_read_input_tokens + cache_creation_input_tokens)` is the metric that matters, and `input_tokens` is the missing denominator. Persisting tokens to `executions` unblocks the per-installation, per-workflow, per-day SQL aggregates that the existing `dispatch_stats.ts` queries already do for cost.

The `modelUsage` gap matters because the codebase already runs more than one model per execution path: the default agent runs Opus 4.7 (`config.model` default), while the Bedrock-side intent classification in `src/ai/llm-client.ts` runs a separate (typically Haiku) model. The Anthropic docs flag this exact scenario as the reason `modelUsage` exists: "useful when you run multiple models (for example, Haiku for subagents and Opus for the main agent) and want to see where tokens are going" ([cost-tracking docs](https://code.claude.com/docs/en/agent-sdk/cost-tracking)). The shape is also the canonical input to the upstream OpenTelemetry GenAI semantic conventions, which define `gen_ai.client.token.usage` as a histogram filterable by `gen_ai.token.type` (input/output) and a per-model dimension ([OTel GenAI metrics](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/)) — wiring this pattern in now keeps the door open for a future OTel exporter without retrofit.

## References

**Internal**:

- `src/core/executor.ts:439#executeAgent` — `Claude Agent SDK execution completed` log line; the chokepoint where `input_tokens` / `output_tokens` / `modelUsage` are read from `result.usage` and dropped.
- `src/core/executor.ts:445` — `cacheReadInputTokens` field, the proven precedent for adding sibling token fields.
- `src/core/executor.ts:446` — `cacheCreationInputTokens` field, ditto.
- `src/core/executor.ts:456#buildExecutionResult` — the type-narrow constructor that strips `usage` before returning `ExecutionResult`.
- `src/types.ts:82` — `ExecutionResult` interface; missing `inputTokens`, `outputTokens`, `cacheReadInputTokens`, `cacheCreationInputTokens`, `modelUsage`.
- `src/core/pipeline.ts:482` — `pipeline.completed` event; downstream consumer that would benefit from token fields.
- `src/orchestrator/history.ts:159#markExecutionCompleted` — the UPDATE that persists `cost_usd` / `duration_ms` / `num_turns`; natural site for new token columns.
- `src/orchestrator/history.ts:161` — the `result` parameter shape, missing token fields.
- `src/db/migrations/001_initial.sql:10` — `CREATE TABLE executions`; no token columns exist.
- `src/db/migrations/001_initial.sql:25` — `cost_usd NUMERIC(10,6)`; the precedent column to mirror.
- `src/core/log-fields.ts:18#CORE_PIPELINE_LOG_EVENTS` — the strict-schema home that pins `pipeline.completed` against drift.
- `src/ai/llm-client.ts:151` — already-typed `inputTokens` on the Bedrock single-turn adaptor; precedent for the same TypeScript field shape on the multi-turn path.
- `docs/operate/observability.md:33` — `cost_usd` field documentation; the table that would gain the new token rows.

**External**:

- [Track cost and usage — Claude Agent SDK docs](https://code.claude.com/docs/en/agent-sdk/cost-tracking) — authoritative spec for `SDKResultMessage.usage` (`input_tokens`, `output_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`) and `modelUsage` per-model breakdown; the docs explicitly recommend tracking cache tokens separately from `input_tokens` to compute caching savings.
- [Semantic conventions for generative AI metrics — OpenTelemetry](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/) — defines `gen_ai.client.token.usage` as a histogram filterable by `gen_ai.token.type` (input/output) and by model, the upstream observability shape this finding aligns to.
- [Semantic conventions for generative AI systems — OpenTelemetry](https://opentelemetry.io/docs/specs/semconv/gen-ai/) — the umbrella spec that frames `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens` as the named attributes every LLM client SHOULD emit.

## Suggested Next Steps

1. Extend the executor's terminal log at `src/core/executor.ts:439-450` to include `inputTokens: result?.usage?.input_tokens`, `outputTokens: result?.usage?.output_tokens`, and a compact `modelUsage` array derived from `result?.modelUsage` (one entry per model with `inputTokens`, `outputTokens`, `cacheReadInputTokens`, `cacheCreationInputTokens`, `costUSD`). Keep the existing two cache fields untouched so log-shipper consumers don't break.
2. Add `inputTokens?`, `outputTokens?`, `cacheReadInputTokens?`, `cacheCreationInputTokens?`, and `modelUsage?: ReadonlyArray<{ model: string; inputTokens: number; outputTokens: number; cacheReadInputTokens: number; cacheCreationInputTokens: number; costUsd: number }>` to `ExecutionResult` at `src/types.ts:82`. Populate them in `buildExecutionResult` at `src/core/executor.ts:456`, mirroring the existing `costUsd` / `numTurns` conditional-assignment pattern that respects `exactOptionalPropertyTypes`.
3. Thread the four scalar token counters into the `pipeline.completed` log at `src/core/pipeline.ts:482` so the field set on the terminal line matches the executor line. Don't include the full `modelUsage` array in the terminal line: it's already on the executor line, and re-emitting it doubles log bytes for no consumer benefit.
4. Add `015_executions_tokens.sql` adding four nullable `INTEGER` columns (`input_tokens`, `output_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`) and one nullable `JSONB` column (`model_usage`) to the `executions` table. Extend `markExecutionCompleted` at `src/orchestrator/history.ts:159` to write them. Existing FR-014 dispatch_stats queries continue to work unchanged (the new columns are additive).
5. Extend `src/core/log-fields.ts:29#PipelineStageLogSchema`-style strict shape coverage so a new `PipelineCompletedLogSchema` pins the seven structured fields (`event`, `success`, `durationMs`, `costUsd`, `numTurns`, `pipeline_wall_clock_ms`, plus the four token integers); the co-located round-trip test catches field-name drift the same way it does for the existing schema.
6. Update `docs/operate/observability.md` — extend the `## Common log fields` table around line 33 with the new token rows, and add a short subsection under `## Core pipeline log fields` explaining cache hit-ratio interpretation (`cache_read / (input + cache_read + cache_creation)`).
7. Add one alert recipe to the existing alerts section: per-installation `sum(input_tokens) / sum(input_tokens + cache_read_input_tokens)` regression — a sudden drop in cache-read share is the signature of a prompt-cache stability regression (the failure mode #134 closed).

## Areas Evaluated

This run evaluated the agent-side execution telemetry surface (`src/core/executor.ts`, `src/core/pipeline.ts`, `src/types.ts`, `src/ai/llm-client.ts`, `src/orchestrator/history.ts`, `src/db/migrations/001_initial.sql`, `src/core/log-fields.ts`) against the canonical observability conventions established in `docs/operate/observability.md` and the upstream Anthropic SDK + OpenTelemetry GenAI cost-tracking specifications. Cross-checked against the ten prior `area: observability` issues (#164, #166, #170, #172, #174, #175, #177, #181, #184, #187) to ensure no material overlap — none of them touch `usage`, `modelUsage`, or any token field. The asymmetry against the already-instrumented Bedrock single-turn path in `src/ai/llm-client.ts` was the decisive corroborating signal. The MCP-tool-call-duration angle and the per-handler token-cost aggregate angle (workflow_runs join) are natural follow-ups; both are deliberately scoped out here to keep this finding small and reviewable.

*Generated by the scheduled research action on 2026-05-30*

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(observability): log input/output tokens + modelUsage from SDKResultMessage; persist to executions table #192

Finding

Diagram

Rationale

References

Suggested Next Steps

Areas Evaluated

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

feat(observability): log input/output tokens + modelUsage from SDKResultMessage; persist to executions table #192

Description

Finding

Diagram

Rationale

References

Suggested Next Steps

Areas Evaluated

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions