feat(cost): per-modality token breakdown in Python SDK (F3 Track C)#596
feat(cost): per-modality token breakdown in Python SDK (F3 Track C)#596john-weiler wants to merge 2 commits into
Conversation
Extends LlmMetrics with optional num_image_input_tokens / num_audio_input_tokens /
num_audio_output_tokens fields and threads them through the full ingestion path:
LlmMetrics model → logger.add_llm_span() → base_handler → span_params.
Extraction is implemented in two integrations:
LangChain (primary path):
- _extract_gemini_modality_breakdown() checks two surfaces:
1. message.usage_metadata.input_token_details / output_token_details
(langchain-google-genai >= 2.x with LangChain Core UsageMetadata)
2. message.response_metadata.prompt_tokens_details / candidates_tokens_details
(raw Gemini API ModalityTokenCount list forwarded by the adapter)
- Both sync and async handlers updated.
galileo-adk (native Gemini SDK path):
- _extract_usage_metadata() extended to walk prompt_tokens_details /
candidates_tokens_details (ModalityTokenCount list with .modality enum
or string) and bucket into image/audio counts.
- span_manager.end_llm() carries the 3 new fields through to end_node().
All other integrations (crewai, openai_agents, otel) unchanged — they go
through the OpenAI-compat path which drops modality breakdown (per RFC).
Tests: 4 new TestParseLlmResult cases + 6 new TestExtractUsageMetadata
ADK cases; all existing tests pass.
Part of F3 multimodal token/cost support.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #596 +/- ##
==========================================
- Coverage 83.31% 83.13% -0.19%
==========================================
Files 125 125
Lines 10659 10771 +112
==========================================
+ Hits 8881 8954 +73
- Misses 1778 1817 +39 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
|
||
| # Per-modality breakdown — only present on native Gemini SDK responses. | ||
| prompt_details = getattr(usage, "prompt_tokens_details", None) | ||
| candidates_details = getattr(usage, "candidates_tokens_details", None) | ||
| if prompt_details or candidates_details: | ||
| image_in = 0 | ||
| audio_in = 0 | ||
| audio_out = 0 | ||
| has_prompt = bool(prompt_details) | ||
| has_candidates = bool(candidates_details) | ||
| for entry in prompt_details or []: | ||
| modality_attr = getattr(entry, "modality", None) |
There was a problem hiding this comment.
_extract_usage_metadata duplicates the Gemini modality breakdown in ::_extract_gemini_modality_breakdown, should we extract a shared helper so both paths use the same logic?
Want Baz to fix this for you? Activate Fixer
| def test_gemini_modality_from_usage_metadata_input_token_details(self) -> None: | ||
| """Gemini native path: per-modality breakdown from usage_metadata.input/output_token_details.""" | ||
| ai_message = AIMessage(content="hello") | ||
| ai_message.usage_metadata = { | ||
| "input_tokens": 110, |
There was a problem hiding this comment.
These new Gemini modality tests don't follow the AGENTS.md # Given/# When/# Then pytest style, should we add those comments to each case so the new coverage matches the repo's canonical pattern?
Want Baz to fix this for you? Activate Fixer You can also update your AI coding guidelines based on this comment by apply pr to [branch name]
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests/test_langchain.py around lines 1300-1304, update the new Gemini modality test(s)
that were added in this hunk to follow the repository’s canonical pytest style from
AGENTS.md. For each new test method in that range (e.g.,
`test_gemini_modality_from_usage_metadata_input_token_details`,
`test_gemini_modality_from_response_metadata_prompt_tokens_details`,
`test_no_gemini_modality_for_text_only_response`, and
`test_gemini_modality_input_token_details_takes_precedence`), add sentence-case `#
Given: ...`, `# When: ...`, and `# Then: ...` comments describing the setup, the call to
`parse_llm_result`, and the assertions. Keep the existing logic the same—only
add/adjust the comment structure so it matches the earlier token-usage tests in the same
file/class.
There was a problem hiding this comment.
Commit bee7b89 addressed this comment by adding # Given, # When, and # Then comments to the new Gemini modality tests in tests/test_langchain.py. The new cases now match the repo’s canonical pytest style without changing the test logic.
| response_meta = getattr(msg, "response_metadata", None) | ||
| if isinstance(response_meta, dict): | ||
| prompt_details = response_meta.get("prompt_tokens_details") or [] | ||
| candidates_details = response_meta.get("candidates_tokens_details") or [] |
There was a problem hiding this comment.
AIMessage.response_metadata only reads top-level token detail keys; can we also read response_metadata['usage_metadata'] so prompt_details / candidates_details (and image_input_tokens / audio_*) aren’t left empty, with a top-level fallback if needed?
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
src/galileo/handlers/langchain/utils.py around lines 58-117, in the helper
_extract_gemini_modality_breakdown, fix the Gemini response_metadata parsing to support
the adapter’s nested structure. Right now it only reads
response_metadata['prompt_tokens_details'] and ['candidates_tokens_details'] at the top
level, which leaves prompt_details/candidates_details empty when the real adapter nests
them under response_metadata['usage_metadata']. Refactor by first checking
response_metadata for usage_metadata and extracting
prompt_tokens_details/candidates_tokens_details from there, falling back to the current
top-level keys if the nested block is missing, then keep the rest of the modality token
assignment logic unchanged.
There was a problem hiding this comment.
Commit bee7b89 addressed this comment by teaching _extract_gemini_modality_breakdown to read nested response_metadata['usage_metadata'] in addition to the existing top-level token detail lists. It now falls back to the top-level keys when nested details are absent, so Gemini modality counts are no longer left empty when the adapter uses the nested structure.
| if image_in is None and audio_in is None and audio_out is None: | ||
| return None, None, None | ||
| return image_in, audio_in, audio_out |
There was a problem hiding this comment.
_extract_gemini_modality_breakdown() currently returns (None, None, None) for detail lists that only contain TEXT, which makes LlmMetrics.to_dict() serialize null where ADK emits 0; should we return zeroes when the detail lists are present but no counted modalities are found?
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
src/galileo/handlers/langchain/utils.py around lines 58-117, update the
_extract_gemini_modality_breakdown() logic so it doesn’t return (None, None, None)
solely because AUDIO/IMAGE counts weren’t found. Track whether the modality “detail
lists” were actually present (prompt_tokens_details and/or candidates_tokens_details
from message.response_metadata, and usage_metadata
input_token_details/output_token_details if applicable); if those lists are present but
contain no AUDIO/IMAGE entries (only TEXT or other modalities), return (0, 0, 0) with
the corresponding fields set to 0 instead of None. Keep (None, None, None) only when
neither surface provides any modality breakdown data at all. Ensure this matches the ADK
behavior and preserves null-vs-zero meaning for downstream traces.
There was a problem hiding this comment.
Commit bee7b89 addressed this comment by changing _extract_gemini_modality_breakdown() to track whether modality detail lists exist and return 0 values when they do but contain no AUDIO/IMAGE entries. It now reserves (None, None, None) for the case where no breakdown data is present at all, preserving the null-vs-zero distinction the comment asked for.
| # Per-modality breakdown — None means "counted as text in the flat totals". | ||
| # Only populated for providers that return modality-level token counts (e.g. Gemini native). | ||
| num_image_input_tokens: None | Unset | int = UNSET | ||
| num_audio_input_tokens: None | Unset | int = UNSET | ||
| num_audio_output_tokens: None | Unset | int = UNSET |
There was a problem hiding this comment.
src/galileo/resources/ is regenerated from openapi.yaml, but LlmMetrics in the spec doesn't include the new modality fields, so the next regen will drop them and break src/galileo/logger/logger.py:1247-1255; should we move this into the OpenAPI source/generator input instead of patching the generated file?
Want Baz to fix this for you? Activate Fixer You can also update your AI coding guidelines based on this comment by apply pr to [branch name]
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
src/galileo/resources/models/llm_metrics.py around lines 30-34 and in the additions to
`LlmMetrics.to_dict()` and `LlmMetrics.from_dict()` (new fields
`num_image_input_tokens`, `num_audio_input_tokens`, `num_audio_output_tokens`), don’t
keep these changes in the generated client boundary. Instead, update the OpenAPI source
schema that generates `LlmMetrics` (likely openapi.yaml) to add these three properties
(type int, nullable/optional using the same conventions as the other token fields) so
regeneration won’t delete them. After updating the OpenAPI schema, remove the manual
edits from the generated llm_metrics.py and regenerate via
scripts/auto-generate-api-client.sh, then run the failing logger tests
(src/galileo/logger/logger.py around 1247-1255) to ensure there are no unexpected-kwarg
errors.
There was a problem hiding this comment.
We will regenerate once API changes are in ...
There was a problem hiding this comment.
One clarification on the premise: regeneration won't actually break logger.py:1247-1255. That code constructs galileo_core.schemas.logging.span.LlmMetrics (a pydantic model with model_config extra='allow'), not the edited attrs class in src/galileo/resources/models/llm_metrics.py. I verified the modality kwargs are accepted and serialized via model_dump() purely through extra='allow' — so the write/ingest path does not depend on this generated-file edit at all. The edit only affects the read-side models (LlmSpan, extended_llm_span_record, partial_extended_llm_span_record) that call LlmMetrics.from_dict(). So the regen concern is real for the read path, but the generated-file change is functionally unnecessary for emitting the new fields.
| def test_returns_empty_for_no_response(self, observer: GalileoObserver) -> None: | ||
| assert observer._extract_usage_metadata(None) == {} | ||
|
|
||
| def test_returns_empty_for_no_usage_metadata(self, observer: GalileoObserver) -> None: | ||
| response = MagicMock() |
There was a problem hiding this comment.
Can we add # Given: .../# When: .../# Then: ... comments to each TestExtractUsageMetadata case so it matches AGENTS.md and the earlier tests in this file?
Want Baz to fix this for you? Activate Fixer You can also update your AI coding guidelines based on this comment by apply pr to [branch name]
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
galileo-adk/tests/test_observer.py around lines 290-372 within the
`TestExtractUsageMetadata` test class, update each test method (e.g.,
`test_returns_empty_for_no_response`, `test_returns_empty_for_no_usage_metadata`,
`test_flat_tokens_no_modality`, `test_audio_and_image_modality_breakdown`,
`test_no_audio_in_candidates_returns_zero_audio_out`,
`test_modality_as_string_instead_of_enum`) to follow AGENTS.md’s canonical pytest
style. Add non-empty sentence-case `# Given: ...`, `# When: ...`, and `# Then: ...`
comments for the setup, the call to `observer._extract_usage_metadata(...)`, and the
assertions. Make the comments consistent with the existing earlier tests in this file,
and ensure every test in this new block includes them.
There was a problem hiding this comment.
Commit bee7b89 addressed this comment by adding # Given:, # When:, and # Then: comments to each TestExtractUsageMetadata test in the new block. The updated tests now match the canonical pytest style used earlier in the file.
| if part_type == "text": | ||
| text = part.get("text") if isinstance(part, dict) else getattr(part, "text", "") | ||
| blocks.append(TextContentBlock(text=text or "", index=idx)) | ||
| elif part_type == "image_url": | ||
| image_url = part.get("image_url") if isinstance(part, dict) else getattr(part, "image_url", {}) |
There was a problem hiding this comment.
This branch re-implements the type→content block conversion already in galileo/utils/serialization._convert_langchain_content_block, so should we reuse that helper and only materialize TextContentBlock/DataContentBlock here?
Want Baz to fix this for you? Activate Fixer
| # When content is a list of OpenAI content parts (multimodal), convert to | ||
| # DataContentBlock/TextContentBlock so the UI renders the actual media. | ||
| if isinstance(content, list): | ||
| blocks = _openai_content_parts_to_blocks(content) | ||
| if blocks is not None: | ||
| return LoggedMessage(content=blocks, role=MessageRole(role)) |
There was a problem hiding this comment.
content is list returns LoggedMessage(content=blocks, role=...) without tool_calls/tool_call_id, so mixed array-content tool messages lose linkage compared to the Message(...) path — should we forward those fields here too?
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
src/galileo/openai/extractors.py around lines 327-332 inside convert_to_galileo_message,
the branch handling multimodal content lists (`if isinstance(content, list)`) returns
`LoggedMessage(content=blocks, role=MessageRole(role))` but drops `galileo_tool_calls`
and `tool_call_id` that were computed just above. Refactor this return so the
LoggedMessage includes the same tool linkage as the normal Message path by forwarding
`tool_calls=galileo_tool_calls` and `tool_call_id=tool_call_id` (or construct a Message
with those fields and then wrap it as a LoggedMessage if LoggedMessage doesn’t accept
them directly). Ensure the role value remains consistent with the existing
`MessageRole(role)` conversion.
|
/astra review |
There was a problem hiding this comment.
⚠️ This review was generated by an AI agent (Astra) and may contain mistakes. Please verify all suggestions independently.
Verdict: request_changes — The core per-modality token threading is sound, but the bundled OpenAI multimodal-message path introduces a tool-linkage regression and ships with effectively no test coverage.
General Comments
- 🟡 minor (design): The new multimodal path makes
convert_to_galileo_messagereturn aLoggedMessagewhosecontentis a list ofTextContentBlock/DataContentBlock. These messages are then placed intoLlmSpan.input/output, whose declared type (galileo_coreMessage.content) isstr | list[ContentPart]. When the span is serialized viamodel_dump(), pydantic emits aPydanticSerializationUnexpectedValuewarning ("Expectedstr") for every multimodal message. I verified the block data does survive serialization, so this is not data loss, but it means aUserWarningis logged on every multimodal OpenAI call — noisy for a telemetry path that is meant to observe without interfering. Consider whethergalileo_coreneeds a content-block-aware span type, or suppress/handle the warning at the boundary.
Follow-ups
Suggested follow-up work that could be tracked as Shortcut stories:
galileo-adk/src/galileo_adk/observer.py:503-557:_extract_usage_metadata(ADK) and_extract_gemini_modality_breakdown(LangChain utils) implement the same Gemini per-modality walk with subtly different null-vs-zero semantics: ADK only setsimage_input_tokens/audio_input_tokenswhenprompt_tokens_detailsis present andaudio_output_tokenswhencandidates_tokens_detailsis present (otherwise the key is absent → None downstream), while the LangChain path always returns a 0-filled 3-tuple once any detail data exists. Consider extracting a shared helper so both paths agree on the null/zero contract (matches reviewer comment r3337251851).src/galileo/openai/extractors.py:221-267:_openai_content_parts_to_blocksre-implementstype→content-block conversion that already exists ingalileo/utils/serialization._convert_langchain_content_block. Consider reusing/sharing that helper to avoid drift (matches reviewer comment r3373853343).
| if isinstance(content, list): | ||
| blocks = _openai_content_parts_to_blocks(content) | ||
| if blocks is not None: | ||
| return LoggedMessage(content=blocks, role=MessageRole(role)) |
There was a problem hiding this comment.
🟠 major (bug): This branch returns a LoggedMessage built only from content and role, discarding the galileo_tool_calls and tool_call_id that were just computed above (lines 302-325). For a tool-role message with array (multimodal) content, or an assistant message that carries both array content and tool calls, the tool-call linkage is silently dropped — whereas the non-list Message(...) path below preserves it. I confirmed this: convert_to_galileo_message({'role':'tool','tool_call_id':'call_123','content':[{'type':'text','text':'result'}]}) returns a LoggedMessage with tool_call_id=None. LoggedMessage inherits these fields from Message, so forward them.
| if isinstance(content, list): | |
| blocks = _openai_content_parts_to_blocks(content) | |
| if blocks is not None: | |
| return LoggedMessage(content=blocks, role=MessageRole(role)) | |
| if isinstance(content, list): | |
| blocks = _openai_content_parts_to_blocks(content) | |
| if blocks is not None: | |
| return LoggedMessage( | |
| content=blocks, | |
| role=MessageRole(role), | |
| tool_calls=galileo_tool_calls, | |
| tool_call_id=tool_call_id, | |
| ) |
| # Surface 1: usage_metadata.input_token_details / output_token_details | ||
| usage_meta = getattr(msg, "usage_metadata", None) | ||
| if isinstance(usage_meta, dict): | ||
| input_details = usage_meta.get("input_token_details") or {} | ||
| output_details = usage_meta.get("output_token_details") or {} | ||
| if isinstance(input_details, dict) and input_details: | ||
| has_detail_data = True | ||
| if "audio" in input_details: | ||
| audio_in = input_details["audio"] | ||
| if "image" in input_details: | ||
| image_in = input_details["image"] | ||
| if isinstance(output_details, dict) and output_details: | ||
| has_detail_data = True | ||
| if "audio" in output_details: | ||
| audio_out = output_details["audio"] |
There was a problem hiding this comment.
🟡 minor (question): Surface 1 fires for any provider whose AIMessage.usage_metadata populates input_token_details/output_token_details — not just Gemini. Anthropic (cache_read/cache_creation) and OpenAI (cache_read/audio/reasoning) via LangChain both populate these. Because a non-empty input_details sets has_detail_data=True even when it contains no audio/image key, the final return coerces missing modalities to 0. Net effect: every Anthropic/OpenAI text call through the LangChain handler now emits image_input_tokens=0 and audio_input_tokens=0 (and, when only output_token_details is present, input fields are still forced to 0). Is that intended? It contradicts the PR's stated "OpenAI-compat path drops breakdown per RFC", and turns the previously-absent fields into 0 for all providers. If only Gemini should be in scope, gate on the provider; otherwise the gemini-specific naming/docstring is misleading.
There was a problem hiding this comment.
🟠 major (testing): The new _openai_content_parts_to_blocks helper and the LoggedMessage multimodal branch in convert_to_galileo_message have no unit tests (codecov reports ~15% patch coverage on this file). This is new, non-trivial logic with several easy-to-get-wrong paths: data-URI parsing with a fallback that base64-encodes the entire URI on parse failure, dict-vs-object access for each part type, the None-on-unrecognized-part fallback to str(content), and the tool-linkage issue flagged separately. Please add tests covering: text/image_url(data URI + plain URL)/input_audio parts, an unrecognized part type (falls back to Message), and a tool/assistant message with array content (tool linkage preserved).
fercor-cisco
left a comment
There was a problem hiding this comment.
Please check the Astra feedback and test failures.
| status_code=status_code, | ||
| image_input_tokens=usage.get("image_input_tokens"), | ||
| audio_input_tokens=usage.get("audio_input_tokens"), | ||
| audio_output_tokens=usage.get("audio_output_tokens"), |
There was a problem hiding this comment.
image_output_tokens missing here and many other spots
User description
Summary
LlmMetricswithnum_image_input_tokens,num_audio_input_tokens,num_audio_output_tokens(optional, forward-compat safe)logger.add_llm_span()→base_handler→span_params_extract_gemini_modality_breakdown()checks three surfaces (priority order):message.usage_metadata.input_token_details/output_token_details(langchain-google-genai ≥ 2.x with LangChain CoreUsageMetadata)message.response_metadata.prompt_tokens_details/candidates_tokens_details(raw Gemini API list shape)message.response_metadata["usage_metadata"]nested dict — defensive fallback for providers that nest usage underresponse_metadata(added in response to review comment; covers forward-compat if upstreamlangchain-google-genaiadds modality info there)ChatGenerationor rawAIMessageinterchangeably_extract_usage_metadata()walks native SDKprompt_tokens_details/candidates_tokens_details(handles.modalityas enum or string);span_manager.end_llm()carries 3 new fields through toend_node()Known limitation — langchain-google-genai 4.2.4
Discovered during local E2E testing:
langchain-google-genai4.2.4 (latest) does not surface Gemini's per-modality breakdown on theAIMessage. The wrapper at chat_models.py:1263 hardcodes:…reading
prompt_token_countandcached_content_token_countfrom the Gemini response but droppingprompt_tokens_details(which contains the modality breakdown). Verified end-to-end:prompt_tokens_detailsreturned to consumergoogle-genaiSDK directly[(TEXT, 10), (AUDIO, 8)]✓langchain-google-genai→AIMessage.usage_metadata.input_token_details{"cache_read": 0}(modality stripped)langchain-google-genai→AIMessage.response_metadatafinish_reason,model_name,safety_ratings,model_providerUntil
langchain-google-genaiis patched upstream, the LangChain Gemini path will not produce per-modality cost data. The extractor in this PR is correct and will start producing data automatically once the upstream surfaces it. Per-modality cost continues to work for:galileo-adk(usesgoogle-genaiSDK directly — verified breakdown reaches the extractor)logger.add_llm_span(audio_input_tokens=…, …)callsPath forward: file an upstream PR against
langchain-google-genaito threadprompt_tokens_detailsthroughinput_token_details. Estimated ~10 lines.Test plan
TestParseLlmResultcases (Surface 1, Surface 2, text-only, precedence)TestParseLlmResultcases (Surface 3 nested-usage, Surface 1 wins over Surface 3)TestExtractUsageMetadataADK cases covering audio+image, text-only, empty candidates, enum-vs-string modalitygoogle-genaidirect +logger.add_llm_span(audio_input_tokens=…)Depends on runners
feat/f3-multimodal-cost-foundation(orbit) + runners PR #2480.🤖 Generated with Claude Code
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Enable per-modality token accounting across Gemini-native paths by extending
LLMEndResult,LlmMetrics, and the span/logging stack so everyGalileoLoggerspan can persist image/audio usage from both LangChain handlers and the ADK observer. Ensure the new modality inputs flow throughSpanManager/GalileoBaseHandlerlogging and validate the extraction logic via LangChain and ADK observer tests while also rendering OpenAI multimodal responses as content blocks.contentparts intoTextContentBlock/DataContentBlockpayloads so the UI shows imported images or audio, returning aLoggedMessagewhenever the parts are understood.Modified files (1)
Latest Contributors(2)
GalileoCallback,GalileoAsyncCallback, and ADK observer metadata, thread them throughLLMEndResult,LlmMetrics,GalileoLogger,SpanManager, andGalileoBaseHandler, and cover the extraction surfaces with the new LangChain and ADK observer tests.Modified files (10)
Latest Contributors(2)