Optimize conversation processing LLM costs (62% of total spend) #4635

@beastoin

Context

BQ export of `users/{uid}/llm_usage` shows the conversation processing pipeline consuming 62% of total LLM spend (updated Feb 6 data from @mon).

| Feature | % of Calls | % of Cost | Model |
|---|---|---|---|
| conv_action_items | 4.9% | 24.1% | gpt-5.1 |
| conversation_processing (legacy umbrella) | 7.0% | 18.5% | gpt-5.1 |
| conv_structure | 4.9% | 17.9% | gpt-5.1 |
| conv_apps | 9.8% | 13.5% | gpt-5.1 + mini |

gpt-5.1 accounts for 23.4% of calls but 86.4% of cost, and the four conversation-pipeline features above sum to 74% of total LLM cost.

Root Cause

Every non-discarded conversation triggers 5-6 LLM calls, and the same full transcript is sent to gpt-5.1 three separate times (steps 2, 3, and 6):

| Step | Feature | Model | % of spend |
|---|---|---|---|
| 1 | conv_discard | gpt-4.1-mini | cheap |
| 2 | conv_structure | gpt-5.1 | 15.5% |
| 3 | conv_action_items | gpt-5.1 | 19.8% |
| 4 | conv_folder | gpt-4.1-mini | cheap |
| 5 | conv_apps (suggest) | gpt-4.1-mini | cheap |
| 6 | conv_apps (execute) | gpt-5.1 | 12.2% |
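
For illustration, a minimal sketch of this call pattern (prompt texts, helper names, and control flow are assumptions, not the actual pipeline code). The point is that steps 2, 3, and 6 each re-send, and re-pay for, the same transcript as fresh input tokens:

```python
# Hypothetical sketch of the per-conversation pipeline described above;
# prompts and names are illustrative stand-ins, not the real codebase.
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    "discard": "Should this conversation be discarded? Answer yes/no.",
    "structure": "Condense this conversation into a structured summary.",
    "action_items": "Read the ENTIRE conversation and extract action items.",
    "folder": "Pick the best folder for this conversation.",
    "suggest_app": "Suggest the most relevant app for this conversation.",
    "execute_app": "Run the selected app's prompt on this conversation.",
}

def run_step(model: str, step: str, transcript: str) -> str:
    # Each step is an independent request: the full transcript is
    # re-sent (and re-billed) as fresh input tokens every time.
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPTS[step]},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content

def process_conversation(transcript: str):
    if run_step("gpt-4.1-mini", "discard", transcript).startswith("yes"):  # step 1
        return None
    structure = run_step("gpt-5.1", "structure", transcript)               # step 2
    actions = run_step("gpt-5.1", "action_items", transcript)              # step 3
    folder = run_step("gpt-4.1-mini", "folder", transcript)                # step 4
    app = run_step("gpt-4.1-mini", "suggest_app", transcript)              # step 5
    result = run_step("gpt-5.1", "execute_app", transcript)                # step 6
    return structure, actions, folder, app, result
```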

Additional waste paths

- `sync.py:606` and `postprocess_conversation.py:115` both call `process_conversation(force_process=True)` on already-processed conversations → 2-3x LLM cost
- `_trigger_apps()` runs the suggestion LLM even when the user already has a preferred app set
- `extract_action_items()` fetches 50 recent items for dedup context on every call (guards for all three paths are sketched after this list)
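
A minimal sketch of the three guards, assuming hypothetical model and function names rather than the actual codebase:

```python
# Hypothetical guards for the three waste paths above; every name here
# is an illustrative stand-in, not the real models or functions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Conversation:
    transcript: str
    processed: bool = False

def suggest_app_via_llm(conv: Conversation) -> str:
    return "summarizer"  # stands in for the gpt-4.1-mini suggestion call

def execute_app(app_id: str, conv: Conversation) -> str:
    return f"{app_id} result"  # stands in for the gpt-5.1 execute call

def maybe_process(conv: Conversation, force: bool = False) -> Conversation:
    # sync/postprocess fix: skip the 5-6 call pipeline when the
    # conversation already has results, instead of force-reprocessing.
    if conv.processed and not force:
        return conv
    conv.processed = True  # ...run the full pipeline here...
    return conv

def trigger_apps(conv: Conversation, preferred_app: Optional[str]) -> str:
    # Preferred-app fix: pay for the suggestion LLM call only when the
    # user has not already pinned an app.
    app_id = preferred_app or suggest_app_via_llm(conv)
    return execute_app(app_id, conv)

DEDUP_CONTEXT_LIMIT = 10  # was 50: caps dedup context fetched per call
```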

Key Findings (from deep analysis)

1. Prompt caching is the safest win — restructure the structure + action_items calls to share the transcript as a common prefix, letting OpenAI's automatic caching handle it (sketched after this list). No quality risk. See Enable 24h prompt cache retention on gpt-5.1 calls (fix ordering + add params) #4672.
2. Merging structure + action_items into one call is risky — the structure prompt encourages compression ("condense into summary") while action_items demands expansion ("Read the ENTIRE conversation"). Instruction conflict. Needs an A/B eval on ~100 conversations before shipping.
3. Merging with app execute is not viable — `app.memory_prompt` is untrusted user-generated text; mixing it with strict schema extraction risks prompt injection.
4. conv_apps needs the full transcript — it's the main summarization step; the structured overview is not sufficient.
5. `conversations_to_string` double-summarizes — it includes both the overview AND the app result, feeding redundant content into daily-summary LLM calls.
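
A sketch of the finding-1 restructuring, assuming OpenAI's automatic prompt caching matches on an identical leading prefix (roughly 1024+ tokens): put the long shared transcript first and the short per-step instructions last, so the second gpt-5.1 call can reuse the cached prefix. The `prompt_cache_key` argument is an assumption to verify against the current API and #4672:

```python
# Sketch of sharing the transcript as a cacheable common prefix across
# the structure and action_items calls. Everything beyond model/messages
# is an assumption; check parameter names against the API docs and #4672.
from openai import OpenAI

client = OpenAI()

def cached_step(transcript: str, step_instructions: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5.1",
        messages=[
            # Shared prefix: byte-identical across both calls, so the
            # second call should hit the prompt cache.
            {"role": "system", "content": "You analyze one conversation transcript."},
            {"role": "user", "content": transcript},
            # Variable suffix: the only part that differs per step.
            {"role": "user", "content": step_instructions},
        ],
        prompt_cache_key="conv-pipeline",  # assumed param for cache routing
    )
    return resp.choices[0].message.content

transcript = "<full conversation transcript>"
structure = cached_step(transcript, "Condense this conversation into a structured summary.")
actions = cached_step(transcript, "Read the ENTIRE conversation and extract action items.")
```

Putting the instructions before the transcript would give each call a different prefix and defeat the cache, which appears to be the ordering problem #4672 fixes.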

Sub-issues

| # | Issue | Est. Savings | State | PR |
|---|---|---|---|---|
| #4636 | Merge conv_structure + conv_action_items into single call | ~15-20% | OPEN — needs A/B eval | PR #4689 (open) |
| #4639 | Skip conv_apps suggestion when preferred app set | ~2-3% | OPEN | PR #4683 (open) |
| #4640 | Cap action item dedup context 50 → 10 | ~1-2% | OPEN | PR #4684 (open) |
| #4641 | Avoid full reprocess in sync/postprocess paths | ~5-10% | OPEN | — |
| #4637 | Downgrade action_items to gpt-4.1-mini | — | CLOSED — no model downgrades | PR #4686 (closed) |
| #4638 | Use structured overview for conv_apps | — | CLOSED — needs full transcript | — |

Related (not sub-issues, but they address the same pipeline)

| # | Title | State | PR |
|---|---|---|---|
| #4672 | Enable 24h prompt cache retention on gpt-5.1 (safe win) | OPEN | PR #4674 (open) |
| #4655 | Fix double-summarization in conversations_to_string | OPEN | PR #4682 (open, replaces closed #4663) |
| #4656 | Evaluate merged prompt quality with A/B comparison | OPEN | depends on #4636 |

Expected Outcome

Target: 25-30% reduction in conversation-pipeline LLM costs without degrading output quality. The open sub-issue estimates (~15-20% + ~2-3% + ~1-2% + ~5-10%) sum to roughly 23-35%, bracketing that target.

Data Source

BQ table: `based-hardware:llm_usage.raw`
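
For reference, a hedged sketch of reproducing the per-feature breakdown from that table; the `feature`, `model`, and `cost_usd` column names are guesses at the schema:

```python
# Hypothetical reproduction of the cost breakdown; column names
# (feature, model, cost_usd) are guesses, adjust to the real schema.
from google.cloud import bigquery

client = bigquery.Client(project="based-hardware")

QUERY = """
SELECT
  feature,
  model,
  COUNT(*) / SUM(COUNT(*)) OVER () AS pct_calls,
  SUM(cost_usd) / SUM(SUM(cost_usd)) OVER () AS pct_cost
FROM `based-hardware.llm_usage.raw`
GROUP BY feature, model
ORDER BY pct_cost DESC
"""

for row in client.query(QUERY).result():
    print(f"{row.feature:<45} {row.model:<15} "
          f"calls {row.pct_calls:6.1%}  cost {row.pct_cost:6.1%}")
```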


Last updated: 2026-02-09 by @Chen — reflects new PRs from beastoin (#4682, #4683, #4684), closed sub-issues, and current state

Labels: intelligence (Layer: Summaries, insights, action items), maintainer (Lane: High-risk, cross-system changes), p1 (Priority: Critical, score 22-29)
