Open
Labels: intelligence (Layer: Summaries, insights, action items) · maintainer (Lane: High-risk, cross-system changes) · p1 (Priority: Critical, score 22-29)
Description
Context
A BQ export of `users/{uid}/llm_usage` shows the conversation processing pipeline consuming 62% of total LLM spend (Feb 6 data from @mon).
| Feature | % of Calls | % of Cost | Model |
|---|---|---|---|
| conv_action_items | 4.9% | 24.1% | gpt-5.1 |
| conversation_processing (legacy umbrella) | 7.0% | 18.5% | gpt-5.1 |
| conv_structure | 4.9% | 17.9% | gpt-5.1 |
| conv_apps | 9.8% | 13.5% | gpt-5.1 + mini |
gpt-5.1 accounts for 23.4% of calls but 86.4% of cost. The conversation pipeline alone accounts for 74% of total LLM cost.
Root Cause
Every non-discarded conversation triggers 5-6 LLM calls. The same full transcript is sent to gpt-5.1 three separate times (a sketch follows the table):
| Step | Feature | Model | % of spend |
|---|---|---|---|
| 1 | conv_discard | gpt-4.1-mini | cheap |
| 2 | conv_structure | gpt-5.1 | 15.5% |
| 3 | conv_action_items | gpt-5.1 | 19.8% |
| 4 | conv_folder | gpt-4.1-mini | cheap |
| 5 | conv_apps (suggest) | gpt-4.1-mini | cheap |
| 6 | conv_apps (execute) | gpt-5.1 | 12.2% |
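To make the repetition concrete, here is a minimal sketch of the flow above, assuming a single entry point like the `process_conversation` the issue references; `llm_call`, the prompt strings, and the conversation shape are placeholders, not the project's actual code.

```python
# Illustrative sketch of the per-conversation pipeline; llm_call and the
# prompt strings are placeholders, not the project's real client or prompts.

def llm_call(model: str, prompt: str) -> str:
    """Placeholder for the real LLM client call."""
    raise NotImplementedError

def process_conversation(conversation: dict, force_process: bool = False) -> dict | None:
    transcript = conversation["transcript"]

    # 1. conv_discard (gpt-4.1-mini): cheap keep/drop decision
    if llm_call("gpt-4.1-mini", f"Keep or discard?\n{transcript}") == "discard":
        return None

    # 2. conv_structure (gpt-5.1): full transcript, first time
    structure = llm_call("gpt-5.1", f"Produce a structured overview:\n{transcript}")

    # 3. conv_action_items (gpt-5.1): full transcript, second time
    action_items = llm_call("gpt-5.1", f"Extract action items:\n{transcript}")

    # 4. conv_folder (gpt-4.1-mini): cheap, works off the structured overview
    folder = llm_call("gpt-4.1-mini", f"Pick a folder for:\n{structure}")

    # 5. conv_apps suggest (gpt-4.1-mini): cheap app selection
    app_prompt = llm_call("gpt-4.1-mini", f"Suggest an app for:\n{structure}")

    # 6. conv_apps execute (gpt-5.1): full transcript, third time
    app_result = llm_call("gpt-5.1", f"{app_prompt}\n{transcript}")

    return {"structure": structure, "action_items": action_items,
            "folder": folder, "app_result": app_result}
```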
Additional waste paths
- `sync.py:606` and `postprocess_conversation.py:115` both call `process_conversation(force_process=True)` on already-processed conversations → 2-3x LLM cost
- `_trigger_apps()` runs the suggestion LLM even when the user has a preferred app set
- `extract_action_items()` fetches 50 recent items for dedup context on every call
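A rough sketch of the guards these waste paths suggest (tracked by #4639, #4640, and #4641). Apart from `process_conversation`, `_trigger_apps`, and `extract_action_items`, every name, field, and signature below is an assumption.

```python
# Sketches of possible guards; signatures and fields are assumed, not the
# project's actual ones.

def process_conversation(conversation: dict, force_process: bool = False):
    # sync.py / postprocess_conversation.py currently force a full reprocess;
    # returning early for already-processed conversations avoids the 2-3x cost (#4641).
    if conversation.get("processed") and not force_process:
        return conversation.get("results")
    # ... run the full pipeline ...

def _trigger_apps(conversation: dict, user_prefs: dict):
    # Skip the suggestion LLM entirely when the user already pinned an app (#4639).
    preferred = user_prefs.get("preferred_app")
    if preferred is not None:
        return preferred  # execute the pinned app directly, no suggestion call
    # ... otherwise run the (cheap) suggestion model, then execute ...

DEDUP_CONTEXT_LIMIT = 10  # was 50; see #4640

def extract_action_items(conversation: dict, recent_items: list) -> list:
    # Cap the dedup context instead of fetching 50 prior items on every call.
    dedup_context = recent_items[:DEDUP_CONTEXT_LIMIT]
    # ... extract items, deduplicating against dedup_context ...
    return []
```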
Key Findings (from deep analysis)
- Prompt caching is the safest win — restructure the structure + action_items calls so they share the transcript as a common prefix, letting OpenAI's automatic prompt caching handle it. No quality risk. See #4672 (Enable 24h prompt cache retention on gpt-5.1 calls — fix ordering + add params); a sketch follows this list.
- Merging structure + action_items into one call is risky — the structure prompt encourages compression ("condense into summary") while the action_items prompt demands expansion ("Read the ENTIRE conversation"). That instruction conflict needs an A/B eval on ~100 conversations before shipping.
- Merging with app execute is not viable — `app.memory_prompt` is untrusted user-generated text, and mixing it with strict schema extraction risks prompt injection.
- conv_apps needs the full transcript — it is the main summarization step, and the structured overview is not sufficient.
- `conversations_to_string` double-summarizes — it includes both the overview AND the app result, feeding redundant content to the daily summary LLM calls.
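A minimal sketch of the prompt-caching restructuring from the first finding, assuming the official OpenAI Python SDK: both gpt-5.1 calls open with an identical transcript prefix so automatic prompt caching can reuse it, and only the trailing task message differs. Prompt wording is illustrative, and the retention parameter that #4672 adds is not shown.

```python
from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()

def _shared_prefix(transcript: str) -> list[dict]:
    # Identical leading messages in both calls -> a cache-friendly common prefix.
    return [
        {"role": "system", "content": "You analyze recorded conversations."},
        {"role": "user", "content": f"Conversation transcript:\n{transcript}"},
    ]

def _run_step(transcript: str, task: str) -> str:
    # Only this trailing task message differs between the two calls.
    messages = _shared_prefix(transcript) + [{"role": "user", "content": task}]
    resp = client.chat.completions.create(model="gpt-5.1", messages=messages)
    return resp.choices[0].message.content

def structure_and_action_items(transcript: str) -> tuple[str, str]:
    # Two calls, one shared transcript prefix.
    structure = _run_step(transcript, "Produce a structured overview of the conversation.")
    action_items = _run_step(transcript, "Read the entire conversation and list every action item.")
    return structure, action_items
```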
Sub-issues
| # | Issue | Est. Savings | State | PR |
|---|---|---|---|---|
| #4636 | Merge conv_structure + conv_action_items into single call | ~15-20% | OPEN — needs A/B eval | PR #4689 (open) |
| #4639 | Skip conv_apps suggestion when preferred app set | ~2-3% | OPEN | PR #4683 (open) |
| #4640 | Cap action item dedup context 50 → 10 | ~1-2% | OPEN | PR #4684 (open) |
| #4641 | Avoid full reprocess in sync/postprocess paths | ~5-10% | OPEN | — |
| — | — | — | CLOSED — no model downgrades | — |
| — | — | — | CLOSED — needs full transcript | — |
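Since #4636 is gated on quality evidence (see #4656 below), here is a rough sketch of the A/B comparison on ~100 conversations; `llm_call`, the prompt text, and `score_outputs` are placeholders for whatever client and judging rubric the eval actually uses.

```python
import random

def llm_call(model: str, prompt: str) -> str:
    """Placeholder for the real LLM client."""
    raise NotImplementedError

def score_outputs(split: dict, merged: str) -> float:
    """Placeholder for whatever rubric or LLM judge the eval uses."""
    raise NotImplementedError

def ab_eval(conversations: list[dict], n: int = 100) -> list[dict]:
    # Rough harness for #4636 / #4656: compare the current split prompts against
    # a merged prompt on a sample of conversations before shipping the merge.
    sample = random.sample(conversations, k=min(n, len(conversations)))
    results = []
    for conv in sample:
        transcript = conv["transcript"]
        split = {
            "structure": llm_call("gpt-5.1", f"Produce a structured overview:\n{transcript}"),
            "action_items": llm_call("gpt-5.1", f"Extract action items:\n{transcript}"),
        }
        merged = llm_call("gpt-5.1", f"Produce the overview AND the action items:\n{transcript}")
        results.append({"id": conv.get("id"), "score": score_outputs(split, merged)})
    return results
```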
Related (not sub-issues, but addressing the same pipeline)
| # | Title | State | PR |
|---|---|---|---|
| #4672 | Enable 24h prompt cache retention on gpt-5.1 (safe win) | OPEN | PR #4674 (open) |
| #4655 | Fix double-summarization in conversations_to_string | OPEN | PR #4682 (open, replaces closed #4663) |
| #4656 | Evaluate merged prompt quality with A/B comparison | OPEN | depends on #4636 |
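For #4655, a hedged sketch of the double-summarization fix in `conversations_to_string`; the field names are assumptions about the conversation payload, not the actual schema.

```python
def conversations_to_string(conversations: list[dict]) -> str:
    # Per the finding above, the current output concatenates both the structured
    # overview AND the app result, so daily-summary calls pay for redundant text.
    # Sketch: feed only one representation per conversation (field names assumed).
    parts = []
    for conv in conversations:
        text = conv.get("structured_overview") or conv.get("app_result") or ""
        if text:
            parts.append(text)
    return "\n\n".join(parts)
```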
Expected Outcome
Target: 25-30% reduction in conversation pipeline LLM costs without degrading output quality.
Data Source
BQ table: based-hardware:llm_usage.raw
Last updated: 2026-02-09 by @Chen — reflects new PRs from beastoin (#4682, #4683, #4684), closed sub-issues, and current state