Aar tracks token usage end-to-end: from the raw provider response through session accumulation, cost estimation, visual display, and hard budget enforcement. This page documents every stage of that pipeline.
See also:
configuration.mdfor the fullAgentConfigreference, andproviders.mdfor provider setup and credentials.
Every time a provider returns a response (streamed or not), Aar captures the token
counts reported by that provider, attaches them to a ProviderMeta event, and
accumulates them on the current Session. The session totals drive two independent
checks — a token-count budget and a cost-in-USD limit — both evaluated after each
provider call. Estimated cost is loaded from agent/core/pricing.json (shipped
with the package), optionally merged with ~/.aar/pricing.json (user override),
using longest-prefix model-name matching. All four
transports (CLI chat, inline TUI, full-screen TUI, web) surface these counts to
the user; the TUIs additionally show colour-coded warnings as limits approach.
Local and Ollama models that don't match any pricing-table entry report $0.00.
provider.stream()
│
├─ StreamDelta(done=False, text="…") ← no token data
├─ StreamDelta(done=False, text="…") ← no token data
│ …
└─ StreamDelta(done=True, meta=ProviderMeta(usage={…}))
│
_consume_stream() ← in provider_runner.py
│
returns ProviderResponse(meta=ProviderMeta(…))
│
run_loop stamps duration_ms on meta
│
emits ProviderMeta as event
│
session.accumulate(meta)
│
budget / cost enforcement check
Each StreamDelta carries plain text. Token counts are present only on the
final delta where done=True. _consume_stream() extracts delta.meta from
that terminal delta and returns it as part of ProviderResponse.meta. run_loop
then stamps duration_ms on the meta object before emitting it as a named event.
provider.complete()
│
└─ ProviderResponse(meta=ProviderMeta(usage={…}))
│
run_loop stamps duration_ms on meta
│
emits ProviderMeta as event
│
session.accumulate(meta)
│
budget / cost enforcement check
provider.complete() returns a fully-populated ProviderResponse directly. The
remainder of the pipeline (emit → accumulate → enforce) is identical to the
streaming path.
ProviderMeta.usage is a plain dict[str, int] with at minimum:
| Key | Meaning |
|---|---|
input_tokens |
Tokens in the prompt |
output_tokens |
Tokens in the completion |
Providers may include additional keys (e.g. cache breakdowns). An empty dict
({}) means the provider did not report counts for this call (see §3 for the
Ollama caveat).
| Provider | Non-streaming | Streaming |
|---|---|---|
| Anthropic | usage block in the response body |
Collected at the message_stop SSE event; attached to the final StreamDelta(done=True) |
| OpenAI | usage block in the response body |
Requested via stream_options: {include_usage: true}; trailing usage chunk on final delta |
| Ollama | prompt_eval_count / eval_count in response body |
Same fields on the final done: true chunk; attached to the final StreamDelta(done=True) |
| Generic | usage block in response body if present |
usage from SSE chunks if present; attached to the final StreamDelta(done=True) |
Anthropic — Usage is reliably present in every response. When
prompt caching is enabled ("prompt_caching": true
in the provider’s extra config), the response includes two additional fields:
cache_read_input_tokens and cache_creation_input_tokens. Aar captures these as
cache_read_tokens and cache_write_tokens in ProviderMeta.usage. Cost
estimation uses the cache_read_per_million and cache_write_per_million pricing
fields to compute accurate costs when caching is active.
OpenAI — Aar explicitly opts in to usage reporting on streamed responses by
sending stream_options: {include_usage: true}. Without this, OpenAI omits usage
from streaming responses entirely.
Ollama — Reports prompt_eval_count (input) and eval_count (output). These
fields are only present when the model runtime performs actual evaluation. If the
entire prompt is served from KV-cache and the runtime elides the fields, usage
will be {} (empty dict). In that case token counts are treated as zero for that
turn — no budget is deducted and no display line is rendered.
Generic — Best-effort: the adapter forwards whatever usage key is present in
the response body or SSE chunks. If the endpoint does not include usage data,
usage will be {}.
A dim, right-aligned usage line is printed after each assistant response:
(150in / 80out)
- Rendered only when
ProviderMeta.usageis a non-empty dict. - There is no config option to suppress it in
chatmode.
A per-turn token line (dim, right-aligned) is appended after each assistant
response block, using the same Xin / Yout format.
Configuration:
"tui": {
"layout": {
"token_usage": {
"visible": true
}
}
}Set "visible": false to hide the per-turn token line entirely. When hidden, the
session totals are still accumulated — only the display is suppressed.
This transport has two independent token surfaces:
The header permanently displays cumulative session totals:
aar running tokens: 1 240in / 620out $0.0182
- Updated after every
ProviderMetaevent fires. - Not affected by
token_usage.visible— it is always shown. - Cost is displayed alongside tokens when a matching pricing entry exists.
During a streaming response the state field in the header transitions through
the following stages:
idle
│
│ first StreamDelta received
▼
streaming… ← header state field changes to "streaming…"
│
│ StreamChunk(finished=True) fires (stream closed)
▼
running ← state reverts to current AgentState label
│
│ ProviderMeta event fires
▼
running ← header token/cost counts snap to new cumulative total
The streaming… label is a visual hint only; it does not map to an AgentState
value.
The same per-turn Xin / Yout line as tui mode. Also controlled by
token_usage.visible. Hiding it does not affect the header bar counts.
Session holds cumulative counters for the lifetime of a single agent.run()
call:
| Field | Type | Description |
|---|---|---|
total_input_tokens |
int |
Sum of all input_tokens across every step |
total_output_tokens |
int |
Sum of all output_tokens across every step |
total_cost |
float |
Estimated USD cost accumulated across all steps |
Derived property:
session.total_tokens == session.total_input_tokens + session.total_output_tokens
Accumulation timing: totals are updated immediately after every provider call, before the budget/cost enforcement checks run. This means enforcement always sees the up-to-date totals.
Reset behaviour: the Session object is created fresh at the start of each
agent.run() call. Limits therefore apply per-run, not across the lifetime of the
Agent instance.
Session restore: when resuming a persisted session (e.g. via aar chat --resume
or the ACP session/load endpoint), token and cost tallies (total_input_tokens,
total_output_tokens, total_cost) are restored from the saved state. This means
budget enforcement continues from where the previous run left off rather than
resetting to zero.
Pricing is loaded from two JSON files merged in order (later overrides earlier):
agent/core/pricing.json— shipped with the package; the built-in baseline.~/.aar/pricing.json— optional user override; loaded if the file exists.
The format is a flat JSON object whose keys are model-name prefixes and values hold four price fields (all USD per 1 million tokens):
| Field | Meaning |
|---|---|
input_per_million |
Standard input token price |
output_per_million |
Standard output token price |
cache_read_per_million |
Cache-read discount price |
cache_write_per_million |
Cache-write surcharge price |
Keys starting with _ (e.g. "_comment") are ignored and may be used as inline
comments. Example:
{
"_comment": "USD per 1M tokens. Keys are model-name prefixes.",
"claude-sonnet-4": { "input_per_million": 3.0, "output_per_million": 15.0, "cache_read_per_million": 0.30, "cache_write_per_million": 3.75 },
"gemma4": { "input_per_million": 0.05, "output_per_million": 0.10, "cache_read_per_million": 0.0, "cache_write_per_million": 0.0 }
}Built-in entries as of the initial release include: claude-sonnet-4, claude-opus-4,
claude-3-5-haiku, claude-3-5-sonnet, gpt-4o, gpt-4o-mini, gpt-4.1,
gpt-4.1-mini, gpt-4.1-nano, o3, o3-mini, o4-mini.
aar init writes ~/.aar/pricing.template.json — a copy of the full built-in
pricing table — as a starting point for customisation. Copy it to
~/.aar/pricing.json and edit as needed.
To invalidate the in-process pricing cache (useful in tests or after editing the
file at runtime), call reload_pricing_table() from agent.core.tokens.
get_pricing(model) iterates the pricing table sorted longest prefix first,
returning the first entry whose key is a prefix of the supplied model string. This
means a model name like claude-sonnet-4-20250514 matches the claude-sonnet-4
key correctly, and a hypothetical claude-sonnet-4-5-turbo would match
claude-sonnet-4 (not a shorter key) if no more-specific key existed.
- Prices are approximate and reflect public pricing pages as of mid-2025.
- No caching discounts, batching tiers, or enterprise pricing are modelled.
- Future price changes require editing
agent/core/pricing.jsonor adding overrides to~/.aar/pricing.json. - Cache token costs (
cache_read_tokens,cache_write_tokens) are included in the calculation when reported by the provider but are not yet broken out in the default display.
Models whose name does not match any prefix in the pricing table are assigned
None pricing. Cost is displayed as $0.00 and total_cost remains 0.0.
cost_limit enforcement is effectively disabled for these models (the limit is
never reached).
Two fractional thresholds control when the counters switch to a warning style:
| Threshold | Default | Applies to |
|---|---|---|
token_warning_threshold |
0.8 |
session.total_tokens >= token_budget * threshold |
cost_warning_threshold |
0.8 |
session.total_cost >= cost_limit * threshold |
When a threshold is crossed:
- Inline TUI (
tui): the per-turn token line switches tousage_warning_style(theme key, defaultbold red). - Full-screen TUI (
tui --fixed): the header token/cost counter switches totokens_warning_style(header styles key, defaultbold red).
Warnings are visual only. The agent continues running normally until the hard
limit is reached. Thresholds have no effect when the corresponding limit is 0
(unlimited).
Both checks run inside run_loop after each provider call and session
accumulation step, before the next iteration begins.
if token_budget > 0 and session.total_tokens >= token_budget:
emit ErrorEvent(message="Token budget exceeded (X/Y)", recoverable=False)
set AgentState.BUDGET_EXCEEDED
return # loop terminates
if cost_limit > 0.0 and session.total_cost >= cost_limit:
emit ErrorEvent(message="Cost limit exceeded ($X/$Y)", recoverable=False)
set AgentState.BUDGET_EXCEEDED
return # loop terminates
| Event / state | Value |
|---|---|
AgentState |
BUDGET_EXCEEDED |
ErrorEvent.message |
"Token budget exceeded (X/Y)" or cost variant |
ErrorEvent.recoverable |
False |
The loop returns immediately after setting the state. No further tool calls or provider calls are made. The session totals at the point of termination are preserved and remain readable.
All fields live on AgentConfig. See configuration.md for
the full config schema.
| Field | Type | Default | Meaning |
|---|---|---|---|
token_budget |
int |
0 |
Maximum total tokens per run. 0 = unlimited. |
cost_limit |
float |
0.0 |
Maximum estimated USD cost per run. 0.0 = unlimited. |
token_warning_threshold |
float |
0.8 |
Fraction of token_budget at which warning style activates. |
cost_warning_threshold |
float |
0.8 |
Fraction of cost_limit at which warning style activates. |
streaming |
bool |
False |
Enable streaming path. Recommended for interactive transports. |
These keys live under tui.layout (inline TUI) or tui_fixed.header_styles
(full-screen TUI) in the JSON config. Theme-level style keys are set under
tui.theme or tui_fixed.theme.
| Config path | Transport | Default | Effect |
|---|---|---|---|
tui.layout.token_usage.visible |
tui |
true |
Show/hide the per-turn Xin / Yout line in the scrollable TUI. |
tui.layout.token_usage.visible |
tui --fixed |
true |
Show/hide the per-turn body token line (header is unaffected). |
tui.theme.usage_style |
tui |
dim |
Rich style applied to the token line under normal conditions. |
tui.theme.usage_warning_style |
tui |
bold red |
Rich style applied to the token line when a warning threshold fires. |
tui_fixed.header_styles.tokens_style |
tui --fixed |
dim |
Rich style for the header token/cost counter under normal conditions. |
tui_fixed.header_styles.tokens_warning_style |
tui --fixed |
bold red |
Rich style for the header counter when a warning threshold fires. |
Style values use Rich markup syntax (e.g.
"bold red","dim cyan","italic on dark_blue"). Seethemes.mdfor the full theme reference.