Optimize conversation processing LLM costs (62% of total spend) #4635

@beastoin

Context

BQ export of `users/{uid}/llm_usage` shows the conversation processing pipeline consuming 62% of total LLM spend (updated Feb 6 data from @mon).

| Feature | % of Calls | % of Cost | Model |
|---|---|---|---|
| conv_action_items | 4.9% | 24.1% | gpt-5.1 |
| conversation_processing (legacy umbrella) | 7.0% | 18.5% | gpt-5.1 |
| conv_structure | 4.9% | 17.9% | gpt-5.1 |
| conv_apps | 9.8% | 13.5% | gpt-5.1 + mini |

gpt-5.1 accounts for 23.4% of calls but 86.4% of cost, and the four conversation-pipeline features above sum to 74% of total LLM cost.

Root Cause

Every non-discarded conversation triggers 5-6 LLM calls, and the same full transcript is sent to gpt-5.1 three separate times (steps 2, 3, and 6):

| Step | Feature | Model | % of spend |
|---|---|---|---|
| 1 | conv_discard | gpt-4.1-mini | cheap |
| 2 | conv_structure | gpt-5.1 | 15.5% |
| 3 | conv_action_items | gpt-5.1 | 19.8% |
| 4 | conv_folder | gpt-4.1-mini | cheap |
| 5 | conv_apps (suggest) | gpt-4.1-mini | cheap |
| 6 | conv_apps (execute) | gpt-5.1 | 12.2% |
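
For illustration, a minimal sketch of this call pattern (prompt texts, helper names, and control flow are assumptions, not the actual pipeline code). The point is that steps 2, 3, and 6 each re-send, and re-pay for, the same transcript as fresh input tokens:

```python
# Hypothetical sketch of the per-conversation pipeline described above;
# prompts and names are illustrative stand-ins, not the real codebase.
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    "discard": "Should this conversation be discarded? Answer yes/no.",
    "structure": "Condense this conversation into a structured summary.",
    "action_items": "Read the ENTIRE conversation and extract action items.",
    "folder": "Pick the best folder for this conversation.",
    "suggest_app": "Suggest the most relevant app for this conversation.",
    "execute_app": "Run the selected app's prompt on this conversation.",
}

def run_step(model: str, step: str, transcript: str) -> str:
    # Each step is an independent request: the full transcript is
    # re-sent (and re-billed) as fresh input tokens every time.
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPTS[step]},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content

def process_conversation(transcript: str):
    if run_step("gpt-4.1-mini", "discard", transcript).startswith("yes"):  # step 1
        return None
    structure = run_step("gpt-5.1", "structure", transcript)               # step 2
    actions = run_step("gpt-5.1", "action_items", transcript)              # step 3
    folder = run_step("gpt-4.1-mini", "folder", transcript)                # step 4
    app = run_step("gpt-4.1-mini", "suggest_app", transcript)              # step 5
    result = run_step("gpt-5.1", "execute_app", transcript)                # step 6
    return structure, actions, folder, app, result
```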

Additional waste paths

- `sync.py:606` and `postprocess_conversation.py:115` both call `process_conversation(force_process=True)` on already-processed conversations → 2-3x LLM cost
- `_trigger_apps()` runs the suggestion LLM even when the user already has a preferred app set
- `extract_action_items()` fetches 50 recent items for dedup context on every call (guards for all three paths are sketched after this list)
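
A minimal sketch of the three guards, assuming hypothetical model and function names rather than the actual codebase:

```python
# Hypothetical guards for the three waste paths above; every name here
# is an illustrative stand-in, not the real models or functions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Conversation:
    transcript: str
    processed: bool = False

def suggest_app_via_llm(conv: Conversation) -> str:
    return "summarizer"  # stands in for the gpt-4.1-mini suggestion call

def execute_app(app_id: str, conv: Conversation) -> str:
    return f"{app_id} result"  # stands in for the gpt-5.1 execute call

def maybe_process(conv: Conversation, force: bool = False) -> Conversation:
    # sync/postprocess fix: skip the 5-6 call pipeline when the
    # conversation already has results, instead of force-reprocessing.
    if conv.processed and not force:
        return conv
    conv.processed = True  # ...run the full pipeline here...
    return conv

def trigger_apps(conv: Conversation, preferred_app: Optional[str]) -> str:
    # Preferred-app fix: pay for the suggestion LLM call only when the
    # user has not already pinned an app.
    app_id = preferred_app or suggest_app_via_llm(conv)
    return execute_app(app_id, conv)

DEDUP_CONTEXT_LIMIT = 10  # was 50: caps dedup context fetched per call
```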

Key Findings (from deep analysis)

1. Prompt caching is the safest win — restructure the structure + action_items calls to share the transcript as a common prefix, letting OpenAI's automatic caching handle it (sketched after this list). No quality risk. See Enable 24h prompt cache retention on gpt-5.1 calls (fix ordering + add params) #4672.
2. Merging structure + action_items into one call is risky — the structure prompt encourages compression ("condense into summary") while action_items demands expansion ("Read the ENTIRE conversation"). Instruction conflict. Needs an A/B eval on ~100 conversations before shipping.
3. Merging with app execute is not viable — `app.memory_prompt` is untrusted user-generated text; mixing it with strict schema extraction risks prompt injection.
4. conv_apps needs the full transcript — it's the main summarization step; the structured overview is not sufficient.
5. `conversations_to_string` double-summarizes — it includes both the overview AND the app result, feeding redundant content into daily-summary LLM calls.
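
A sketch of the finding-1 restructuring, assuming OpenAI's automatic prompt caching matches on an identical leading prefix (roughly 1024+ tokens): put the long shared transcript first and the short per-step instructions last, so the second gpt-5.1 call can reuse the cached prefix. The `prompt_cache_key` argument is an assumption to verify against the current API and #4672:

```python
# Sketch of sharing the transcript as a cacheable common prefix across
# the structure and action_items calls. Everything beyond model/messages
# is an assumption; check parameter names against the API docs and #4672.
from openai import OpenAI

client = OpenAI()

def cached_step(transcript: str, step_instructions: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5.1",
        messages=[
            # Shared prefix: byte-identical across both calls, so the
            # second call should hit the prompt cache.
            {"role": "system", "content": "You analyze one conversation transcript."},
            {"role": "user", "content": transcript},
            # Variable suffix: the only part that differs per step.
            {"role": "user", "content": step_instructions},
        ],
        prompt_cache_key="conv-pipeline",  # assumed param for cache routing
    )
    return resp.choices[0].message.content

transcript = "<full conversation transcript>"
structure = cached_step(transcript, "Condense this conversation into a structured summary.")
actions = cached_step(transcript, "Read the ENTIRE conversation and extract action items.")
```

Putting the instructions before the transcript would give each call a different prefix and defeat the cache, which appears to be the ordering problem #4672 fixes.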

Sub-issues

| # | Issue | Est. Savings | State | PR |
|---|---|---|---|---|
| #4636 | Merge conv_structure + conv_action_items into single call | ~15-20% | OPEN — needs A/B eval | PR #4689 (open) |
| #4639 | Skip conv_apps suggestion when preferred app set | ~2-3% | OPEN | PR #4683 (open) |
| #4640 | Cap action item dedup context 50 → 10 | ~1-2% | OPEN | PR #4684 (open) |
| #4641 | Avoid full reprocess in sync/postprocess paths | ~5-10% | OPEN | — |
| #4637 | Downgrade action_items to gpt-4.1-mini | — | CLOSED — no model downgrades | PR #4686 (closed) |
| #4638 | Use structured overview for conv_apps | — | CLOSED — needs full transcript | — |

Related (not sub-issues, but they address the same pipeline)

| # | Title | State | PR |
|---|---|---|---|
| #4672 | Enable 24h prompt cache retention on gpt-5.1 (safe win) | OPEN | PR #4674 (open) |
| #4655 | Fix double-summarization in conversations_to_string | OPEN | PR #4682 (open, replaces closed #4663) |
| #4656 | Evaluate merged prompt quality with A/B comparison | OPEN | depends on #4636 |

Expected Outcome

Target: 25-30% reduction in conversation-pipeline LLM costs without degrading output quality. The open sub-issue estimates (~15-20% + ~2-3% + ~1-2% + ~5-10%) sum to roughly 23-35%, bracketing that target.

Data Source

BQ table: `based-hardware:llm_usage.raw`
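
For reference, a hedged sketch of reproducing the per-feature breakdown from that table; the `feature`, `model`, and `cost_usd` column names are guesses at the schema:

```python
# Hypothetical reproduction of the cost breakdown; column names
# (feature, model, cost_usd) are guesses, adjust to the real schema.
from google.cloud import bigquery

client = bigquery.Client(project="based-hardware")

QUERY = """
SELECT
  feature,
  model,
  COUNT(*) / SUM(COUNT(*)) OVER () AS pct_calls,
  SUM(cost_usd) / SUM(SUM(cost_usd)) OVER () AS pct_cost
FROM `based-hardware.llm_usage.raw`
GROUP BY feature, model
ORDER BY pct_cost DESC
"""

for row in client.query(QUERY).result():
    print(f"{row.feature:<45} {row.model:<15} "
          f"calls {row.pct_calls:6.1%}  cost {row.pct_cost:6.1%}")
```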


Last updated: 2026-02-09 by @Chen — reflects new PRs from beastoin (#4682, #4683, #4684), closed sub-issues, and current state

Labels: intelligence (Layer: Summaries, insights, action items), maintainer (Lane: High-risk, cross-system changes), p1 (Priority: Critical, score 22-29)
