
feat: enable OpenAI prompt caching for conversation processing #4664

Merged
beastoin merged 3 commits into main from fix/prompt-caching-4654 on Feb 8, 2026

Conversation


beastoin commented Feb 7, 2026

Summary

  • Adds shared _build_conversation_context() helper that produces byte-identical context strings for the same inputs
  • Restructures get_transcript_structure() and extract_action_items() to use two system messages: shared context prefix + task-specific instructions
  • This enables OpenAI's automatic prompt caching — when both functions are called sequentially (which they always are), the second call reuses the cached KV computation from the first, saving up to 50% on input tokens
  • Unifies calendar context to always include meeting_link (was previously missing in extract_action_items)

Closes #4654

How it works

OpenAI automatically caches the KV computation for message prefixes that are byte-identical across API calls. By moving the conversation content (transcript + photos + calendar) into a dedicated first system message, both get_transcript_structure and extract_action_items share the same prefix. The second call gets a cache hit on the expensive transcript tokens.

Before: [instructions + context] × 2 calls = full price both times
After:  [context prefix] + [instructions] × 2 calls = second call cached
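
In code, the pattern looks roughly like the sketch below. _build_conversation_context is the helper introduced in this PR, but its signature, field formatting, sample data, and the user prompt here are illustrative assumptions rather than the actual implementation:

# Minimal sketch of the two-system-message pattern; not the PR's code.
from openai import OpenAI

client = OpenAI()


def _build_conversation_context(transcript: str, calendar: dict | None, photos: list[str]) -> str:
    # Must be deterministic: identical inputs must produce a byte-identical string,
    # otherwise the cached prefix never matches across the two calls.
    parts = [f"Transcript:\n{transcript}"]
    if calendar:
        parts.append(
            "Calendar event:\n"
            f"  title: {calendar.get('title', '')}\n"
            f"  meeting_link: {calendar.get('meeting_link', '')}"
        )
    if photos:
        parts.append("Photos:\n" + "\n".join(photos))
    return "\n\n".join(parts)


def _run(context: str, instructions: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": context},       # shared prefix, cache-eligible
            {"role": "system", "content": instructions},  # task-specific, differs per call
            {"role": "user", "content": "Process the conversation above."},
        ],
    )
    return response.choices[0].message.content


transcript = "Speaker 0: Let's ship the caching change Friday.\nSpeaker 1: I'll prepare the deploy."
calendar_event = {"title": "Release sync", "meeting_link": "https://meet.example.com/abc"}
photo_descriptions = ["Whiteboard: rollout checklist"]

context = _build_conversation_context(transcript, calendar_event, photo_descriptions)
structure = _run(context, "Summarize the conversation into a structured overview.")
action_items = _run(context, "Extract action items as a concise bullet list.")

Because the first system message is byte-identical in both calls, the second call can reuse the cached prefix once it crosses OpenAI's 1,024-token minimum for automatic caching.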

Changes

  • backend/utils/llm/conversation_processing.py — new _build_conversation_context() helper, refactored both prompt functions
  • backend/tests/unit/test_prompt_caching.py — 13 unit tests for determinism, calendar field coverage, ordering
  • backend/test.sh — registered new test file

Test plan

  • bash backend/test.sh — all tests pass (13 new + existing suite)
  • Determinism verified: same inputs produce byte-identical context strings (see the sketch after this list)
  • Calendar meeting_link now included in both functions (was missing in action items)
  • Production A/B: monitor OpenAI dashboard for cache hit rate after deploy
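
A rough sketch of the kind of determinism check described above; the test name, fixture data, import path, and helper signature are assumptions, not the actual contents of test_prompt_caching.py:

# Illustrative sketch only; not the PR's actual test file.
from utils.llm.conversation_processing import _build_conversation_context


def test_context_is_byte_identical_for_same_inputs():
    transcript = "Speaker 0: Let's sync on the roadmap tomorrow."
    calendar = {"title": "Roadmap sync", "meeting_link": "https://meet.example.com/abc"}
    photos = ["Whiteboard photo: Q3 milestones"]

    first = _build_conversation_context(transcript, calendar, photos)
    second = _build_conversation_context(transcript, calendar, photos)

    # Byte-identical output is what makes the cached prefix reusable across calls.
    assert first == second
    assert first.encode("utf-8") == second.encode("utf-8")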

by AI for @beastoin

beastoin and others added 3 commits February 7, 2026 15:26
Add _build_conversation_context() shared helper and restructure
get_transcript_structure() + extract_action_items() to use two system
messages. The first (context) message is byte-identical across both
calls, enabling OpenAI's automatic prompt caching for up to 50% input
token savings. Also unifies calendar context to always include
meeting_link.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
13 tests covering _build_conversation_context() determinism, calendar
field inclusion (meeting_link, notes, participants), ordering
guarantees, and edge cases (empty inputs, missing fields).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist bot left a comment


Code Review

This pull request introduces a smart optimization for OpenAI prompt caching by refactoring how conversation context is built. The new _build_conversation_context helper and the two-system-message approach are excellent changes that should improve performance and reduce costs. The added unit tests are thorough and cover many edge cases. I found one critical issue with how the calendar context string is constructed, which could lead to non-deterministic output and defeat the purpose of caching. My review includes a suggested fix for this.


beastoin commented Feb 8, 2026

Can you run integration tests before and after to make sure the new changes help with prompt caching?


beastoin commented Feb 8, 2026

Integration test results — prompt caching verified:

======================================================================
INTEGRATION TEST: OpenAI Prompt Caching Verification
======================================================================

--- NEW pattern (two system messages) using gpt-4.1-mini ---
Call 1 (structure):    1089 input tokens, cached: 1024
Call 2 (action items): 1093 input tokens, cached: 1024

--- OLD pattern (single system message) using gpt-4.1-mini ---
Call 3 (old structure):    1085 input tokens, cached: 0
Call 4 (old action items): 1089 input tokens, cached: 0

======================================================================
RESULTS
======================================================================
NEW pattern — Call 2: 1024/1093 tokens cached (93.7% of input, 50% discount)
OLD pattern — Call 4: 0 cached tokens

PASS: Two-message pattern enables cache hits. Old pattern does not.

What this means:

  • The shared context prefix (transcript + calendar + photos) is cached after the first LLM call
  • The second call (action items) reuses 1024 cached tokens at 50% discount
  • Old code (single message, instructions + context inline) gets 0 cache hits because the prefixes differ between calls
  • In production with full prompts (~3000+ tokens each), the savings will be even larger since more of the prefix gets cached

Both calls produce correct outputs — structure gives a summary, action items gives a bullet list.
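
For anyone reproducing this check, a minimal sketch of how those cached-token counts can be read from the OpenAI Python SDK (the messages argument is assumed to carry the two-system-message payload described above; the model matches the test run):

# Sketch for reading cache usage from a response; not the integration test itself.
from openai import OpenAI

client = OpenAI()


def report_cache_usage(messages: list[dict]) -> None:
    response = client.chat.completions.create(model="gpt-4.1-mini", messages=messages)
    usage = response.usage
    details = usage.prompt_tokens_details
    cached = (details.cached_tokens or 0) if details else 0
    print(f"input tokens: {usage.prompt_tokens}, cached: {cached}")

Called twice with the same context prefix, the first call should report cached: 0 and the second a non-zero count once the shared prefix exceeds OpenAI's 1,024-token caching minimum, mirroring the numbers above.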


by AI for @beastoin


beastoin commented Feb 8, 2026

@beastoin Reviewed the diff; no issues found, and the shared context builder / prompt split looks consistent between the structure and action-item paths. I ran bash backend/test.sh and it passed (only the existing Pydantic V2 deprecation warning from models/memories.py). Can you confirm you're ready to merge?


by AI for @beastoin


beastoin commented Feb 8, 2026

Prompt cache baseline (Feb 1-7, pre-deploy):

Overall cache hit rate: 22.0%.

Daily trend: 19.9-26.1% (Feb 7 highest at 26.1%, could be natural variance).

By model:

  • gpt-4.1-mini (63% of input tokens): 26.4% cache hit — highest volume, most to gain from even small improvements.
  • gpt-5.1 (34% of input tokens): 15.6% cache hit — biggest room for improvement.
  • gpt-4-0613 (2% of input tokens): 0% cache hit — legacy model whose prompts are likely too short to benefit from prompt caching.
  • gpt-4o (<1% of input tokens): 0.4% cache hit.

Will re-check 48h post-deploy to measure the delta against this baseline.


by AI for @beastoin


beastoin commented Feb 8, 2026

lgtm

beastoin merged commit 023a2e9 into main on Feb 8, 2026
1 check passed
beastoin deleted the fix/prompt-caching-4654 branch on February 8, 2026 at 03:23

beastoin commented Feb 8, 2026

Deployed to prod — gh workflow run gcp_backend.yml (env: prod, branch: main), run ID 21791507859.

Monitoring plan:

  • @mon tracking OpenAI cache hit rate baseline (pre-deploy) vs post-deploy over next 24h
  • Expecting cached_tokens > 0 on conversation processing calls after deploy goes live
  • Will report back with before/after comparison

by AI for @beastoin


beastoin commented Feb 8, 2026

Please deploy Pusher too; it also uses these changes.


beastoin commented Feb 8, 2026

Pusher deploy triggered — run 21791568289 (env: prod, branch: main). In progress.


by AI for @beastoin


beastoin commented Feb 9, 2026

Post-deploy monitoring results (PR #4664 + #4670)

Hour-by-hour comparison (14:00–22:00 UTC, today vs yesterday):

Metric                       Change
gpt-5.1 cache hit rate       +9.9pp (16.2% → 26.1%)
gpt-4-0613 request volume    -88%
gpt-4-0613 cost/hr           -51%
Total cost/hr                -26%

Every single hour is cheaper than yesterday's equivalent.

🤖 Generated with Claude Code
