chore(weave): Add db and server support for cached tokens #6507
andrewtruong wants to merge 13 commits into master from
Conversation
Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=68df4ca0de199e1eae21559fe58b3c0af6cb9bca
Codecov Report ❌ Patch coverage is
HiveMind Sessions: 4 sessions · 25h 44m · $29
🔴 filter_out_current_costs ignores cache cost fields, causing updated cache pricing to never be seeded
The filter_out_current_costs function at weave/trace_server/costs/insert_costs.py:116-150 determines whether a cost entry already exists in the DB by comparing only prompt_token_cost (mapped from cost["input"]), completion_token_cost (mapped from cost["output"]), and effective_date. It does not compare the new cache_read_input or cache_creation_input fields. Similarly, get_current_costs at weave/trace_server/costs/insert_costs.py:22-39 only queries llm_id, prompt_token_cost, completion_token_cost, effective_date — it doesn't fetch cache cost columns at all.
This means if cost_checkpoint.json is updated to add cache pricing for a model that already has matching prompt/completion costs and effective_date in the database, the entry will be incorrectly filtered out as a duplicate, and the new cache costs will never be inserted.
(Refers to lines 130-143)
Prompt for agents
In weave/trace_server/costs/insert_costs.py, update get_current_costs (lines 22-39) to also SELECT cache_read_input_token_cost and cache_creation_input_token_cost from llm_token_prices. Then update filter_out_current_costs (lines 116-150) to unpack those additional columns from the current_costs tuples and include them in the comparison at lines 132-135. Add two additional math.isclose checks: one comparing cache_read_input_token_cost with cost.get('cache_read_input', 0) and another comparing cache_creation_input_token_cost with cost.get('cache_creation_input', 0).
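A minimal sketch of the extended duplicate check the prompt describes, using illustrative dict rows and field names rather than the actual weave tuples and schema:

```python
import math

def is_duplicate(row: dict, cost: dict) -> bool:
    """Decide whether a checkpoint entry already exists in the DB.

    `row` stands in for a record from llm_token_prices and `cost` for an
    entry in cost_checkpoint.json; names here are illustrative.
    """
    return (
        math.isclose(row["prompt_token_cost"], cost["input"])
        and math.isclose(row["completion_token_cost"], cost["output"])
        # Without the two checks below, an entry that only changes cache
        # pricing is treated as a duplicate and never inserted.
        and math.isclose(
            row.get("cache_read_input_token_cost", 0),
            cost.get("cache_read_input", 0),
        )
        and math.isclose(
            row.get("cache_creation_input_token_cost", 0),
            cost.get("cache_creation_input", 0),
        )
        and row["effective_date"] == cost["effective_date"]
    )
```

Note that `math.isclose(0.0, x)` is False for any nonzero `x` under the default tolerances, so a row without cache columns will not mask a checkpoint entry that adds them.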
Cost metrics are computed post-query by multiplying token counts by prices from llm_token_prices:

```
- input_cost: input_tokens * prompt_token_cost
```
🟡 total_cost docstring not updated to reflect inclusion of cache costs
The UsageMetric docstring at weave/trace_server/trace_server_interface.py:3063-3071 states total_cost: input_cost + output_cost, but the actual implementation in _compute_costs_for_buckets (weave/trace_server/clickhouse_trace_server_batched.py:1137-1146) now computes total_cost = input_cost + output_cost + cache_read_total + cache_creation_total. Users relying on the documented formula will have incorrect expectations about what total_cost includes.
(Refers to lines 3067-3071)
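For reference, the formula the implementation now computes, as an illustrative Python sketch (not the actual `_compute_costs_for_buckets` body):

```python
def total_cost(input_cost: float, output_cost: float,
               cache_read_total: float, cache_creation_total: float) -> float:
    # What the code computes today; the docstring still documents only
    # input_cost + output_cost.
    return input_cost + output_cost + cache_read_total + cache_creation_total
```

The docstring should be updated to this four-term sum so downstream users don't under-count cache spend.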
```
'"prompt_tokens_total_cost":', toString(prompt_tokens * prompt_token_cost), ',',
'"cache_read_input_token_cost":', toString(cache_read_input_token_cost), ',',
'"cache_creation_input_token_cost":', toString(cache_creation_input_token_cost), ',',
'"prompt_tokens_total_cost":', toString((prompt_tokens - cache_read_input_tokens) * prompt_token_cost), ',',
```
🔴 prompt_tokens_total_cost double-charges cache_creation_input_tokens in ClickHouse SQL
The prompt_tokens_total_cost formula subtracts cache_read_input_tokens from prompt_tokens but does not subtract cache_creation_input_tokens. Since providers like Anthropic include both cache-read and cache-creation tokens in the total input_tokens count, cache_creation_input_tokens are double-charged: once at the regular prompt rate (included in prompt_tokens_total_cost) and again at the cache-creation rate (in cache_creation_input_tokens_total_cost). The comment in the SQLite path (sqlite_trace_server.py:875-876) confirms the intent: "Subtract cached tokens: they are billed at the cache rate, not the regular input rate" — but only one of the two cache token types is subtracted.
```diff
- '"prompt_tokens_total_cost":', toString((prompt_tokens - cache_read_input_tokens) * prompt_token_cost), ',',
+ '"prompt_tokens_total_cost":', toString((prompt_tokens - cache_read_input_tokens - cache_creation_input_tokens) * prompt_token_cost), ',',
```
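A quick arithmetic check of the overcharge, using made-up token counts and a made-up rate:

```python
import math

# Anthropic-style accounting: prompt_tokens already includes both
# cache-read and cache-creation tokens.
prompt_tokens = 1000
cache_read_input_tokens = 600
cache_creation_input_tokens = 300

prompt_token_cost = 3e-6  # $/token, illustrative

# Buggy: only cache reads are carved out, so the 300 creation tokens are
# billed at the prompt rate here AND at the creation rate elsewhere.
buggy = (prompt_tokens - cache_read_input_tokens) * prompt_token_cost
# Fixed: both cache token classes are excluded from the prompt rate.
fixed = (
    prompt_tokens - cache_read_input_tokens - cache_creation_input_tokens
) * prompt_token_cost

overcharge = buggy - fixed  # the 300 creation tokens billed twice
```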
```
"prompt_tokens_total_cost": (
    prompt_tokens - cache_read_input_tokens
)
* prompt_cost,
```
🔴 prompt_tokens_total_cost double-charges cache_creation_input_tokens in SQLite path
Same issue as the ClickHouse SQL path: the SQLite cost calculation at sqlite_trace_server.py:877-880 computes prompt_tokens_total_cost as (prompt_tokens - cache_read_input_tokens) * prompt_cost, but fails to also subtract cache_creation_input_tokens. This causes cache-creation tokens to be billed at both the regular prompt rate and the cache-creation rate.
```diff
- "prompt_tokens_total_cost": (
-     prompt_tokens - cache_read_input_tokens
- )
- * prompt_cost,
+ "prompt_tokens_total_cost": (
+     prompt_tokens
+     - cache_read_input_tokens
+     - cache_creation_input_tokens
+ )
+ * prompt_cost,
```
https://coreweave.atlassian.net/browse/WB-32599
Relevant wiring to add cached-token support to the backend.