
feat: enable OpenAI prompt caching for conversation processing #4664

Merged
beastoin merged 3 commits into main from fix/prompt-caching-4654 on Feb 8, 2026

Conversation


beastoin commented Feb 7, 2026

Summary

  • Adds shared _build_conversation_context() helper that produces byte-identical context strings for the same inputs
  • Restructures get_transcript_structure() and extract_action_items() to use two system messages: shared context prefix + task-specific instructions
  • This enables OpenAI's automatic prompt caching — when both functions are called sequentially (which they always are), the second call reuses the cached KV computation from the first, saving up to 50% on input tokens
  • Unifies calendar context to always include meeting_link (was previously missing in extract_action_items)

Closes #4654

How it works

OpenAI automatically caches the KV computation for message prefixes that are byte-identical across API calls. By moving the conversation content (transcript + photos + calendar) into a dedicated first system message, both get_transcript_structure and extract_action_items share the same prefix. The second call gets a cache hit on the expensive transcript tokens.

Before: [instructions + context] × 2 calls = full price both times
After:  [context prefix] + [instructions] × 2 calls = second call cached
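
In code, the pattern looks roughly like the sketch below. _build_conversation_context is the helper introduced in this PR, but its signature, field formatting, sample data, and the user prompt here are illustrative assumptions rather than the actual implementation:

# Minimal sketch of the two-system-message pattern; not the PR's code.
from openai import OpenAI

client = OpenAI()


def _build_conversation_context(transcript: str, calendar: dict | None, photos: list[str]) -> str:
    # Must be deterministic: identical inputs must produce a byte-identical string,
    # otherwise the cached prefix never matches across the two calls.
    parts = [f"Transcript:\n{transcript}"]
    if calendar:
        parts.append(
            "Calendar event:\n"
            f"  title: {calendar.get('title', '')}\n"
            f"  meeting_link: {calendar.get('meeting_link', '')}"
        )
    if photos:
        parts.append("Photos:\n" + "\n".join(photos))
    return "\n\n".join(parts)


def _run(context: str, instructions: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": context},       # shared prefix, cache-eligible
            {"role": "system", "content": instructions},  # task-specific, differs per call
            {"role": "user", "content": "Process the conversation above."},
        ],
    )
    return response.choices[0].message.content


transcript = "Speaker 0: Let's ship the caching change Friday.\nSpeaker 1: I'll prepare the deploy."
calendar_event = {"title": "Release sync", "meeting_link": "https://meet.example.com/abc"}
photo_descriptions = ["Whiteboard: rollout checklist"]

context = _build_conversation_context(transcript, calendar_event, photo_descriptions)
structure = _run(context, "Summarize the conversation into a structured overview.")
action_items = _run(context, "Extract action items as a concise bullet list.")

Because the first system message is byte-identical in both calls, the second call can reuse the cached prefix once it crosses OpenAI's 1,024-token minimum for automatic caching.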

Changes

  • backend/utils/llm/conversation_processing.py — new _build_conversation_context() helper, refactored both prompt functions
  • backend/tests/unit/test_prompt_caching.py — 13 unit tests for determinism, calendar field coverage, ordering
  • backend/test.sh — registered new test file

Test plan

  • bash backend/test.sh — all tests pass (13 new + existing suite)
  • Determinism verified: same inputs produce byte-identical context strings (see the sketch after this list)
  • Calendar meeting_link now included in both functions (was missing in action items)
  • Production A/B: monitor OpenAI dashboard for cache hit rate after deploy
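
A rough sketch of the kind of determinism check described above; the test name, fixture data, import path, and helper signature are assumptions, not the actual contents of test_prompt_caching.py:

# Illustrative sketch only; not the PR's actual test file.
from utils.llm.conversation_processing import _build_conversation_context


def test_context_is_byte_identical_for_same_inputs():
    transcript = "Speaker 0: Let's sync on the roadmap tomorrow."
    calendar = {"title": "Roadmap sync", "meeting_link": "https://meet.example.com/abc"}
    photos = ["Whiteboard photo: Q3 milestones"]

    first = _build_conversation_context(transcript, calendar, photos)
    second = _build_conversation_context(transcript, calendar, photos)

    # Byte-identical output is what makes the cached prefix reusable across calls.
    assert first == second
    assert first.encode("utf-8") == second.encode("utf-8")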

by AI for @beastoin

beastoin and others added 3 commits February 7, 2026 15:26
Add _build_conversation_context() shared helper and restructure
get_transcript_structure() + extract_action_items() to use two system
messages. The first (context) message is byte-identical across both
calls, enabling OpenAI's automatic prompt caching for up to 50% input
token savings. Also unifies calendar context to always include
meeting_link.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
13 tests covering _build_conversation_context() determinism, calendar
field inclusion (meeting_link, notes, participants), ordering
guarantees, and edge cases (empty inputs, missing fields).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist bot left a comment


Code Review

This pull request introduces a smart optimization for OpenAI prompt caching by refactoring how conversation context is built. The new _build_conversation_context helper and the two-system-message approach are excellent changes that should improve performance and reduce costs. The added unit tests are thorough and cover many edge cases. I found one critical issue with how the calendar context string is constructed, which could lead to non-deterministic output and defeat the purpose of caching. My review includes a suggested fix for this.


beastoin commented Feb 8, 2026

Can you run integration tests before and after to make sure the new changes help with prompt caching?


beastoin commented Feb 8, 2026

Integration test results — prompt caching verified:

======================================================================
INTEGRATION TEST: OpenAI Prompt Caching Verification
======================================================================

--- NEW pattern (two system messages) using gpt-4.1-mini ---
Call 1 (structure):    1089 input tokens, cached: 1024
Call 2 (action items): 1093 input tokens, cached: 1024

--- OLD pattern (single system message) using gpt-4.1-mini ---
Call 3 (old structure):    1085 input tokens, cached: 0
Call 4 (old action items): 1089 input tokens, cached: 0

======================================================================
RESULTS
======================================================================
NEW pattern — Call 2: 1024/1093 tokens cached (93.7% of input, 50% discount)
OLD pattern — Call 4: 0 cached tokens

PASS: Two-message pattern enables cache hits. Old pattern does not.

What this means:

  • The shared context prefix (transcript + calendar + photos) is cached after the first LLM call
  • The second call (action items) reuses 1024 cached tokens at 50% discount
  • Old code (single message, instructions + context inline) gets 0 cache hits because the prefixes differ between calls
  • In production with full prompts (~3000+ tokens each), the savings will be even larger since more of the prefix gets cached

Both calls produce correct outputs — structure gives a summary, action items gives a bullet list.
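
For anyone reproducing this check, a minimal sketch of how those cached-token counts can be read from the OpenAI Python SDK (the messages argument is assumed to carry the two-system-message payload described above; the model matches the test run):

# Sketch for reading cache usage from a response; not the integration test itself.
from openai import OpenAI

client = OpenAI()


def report_cache_usage(messages: list[dict]) -> None:
    response = client.chat.completions.create(model="gpt-4.1-mini", messages=messages)
    usage = response.usage
    details = usage.prompt_tokens_details
    cached = (details.cached_tokens or 0) if details else 0
    print(f"input tokens: {usage.prompt_tokens}, cached: {cached}")

Called twice with the same context prefix, the first call should report cached: 0 and the second a non-zero count once the shared prefix exceeds OpenAI's 1,024-token caching minimum, mirroring the numbers above.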


by AI for @beastoin


beastoin commented Feb 8, 2026

@beastoin Reviewed the diff; no issues found, and the shared context builder / prompt split looks consistent between the structure and action-item paths. I ran bash backend/test.sh and it passed (only the existing Pydantic V2 deprecation warning from models/memories.py). Can you confirm you're ready to merge?


by AI for @beastoin


beastoin commented Feb 8, 2026

Prompt cache baseline (Feb 1-7, pre-deploy):

Overall cache hit rate: 22.0%.

Daily trend: 19.9-26.1% (Feb 7 highest at 26.1%, could be natural variance).

By model:

  • gpt-4.1-mini (63% of input tokens): 26.4% cache hit — highest volume, most to gain from even small improvements.
  • gpt-5.1 (34% of input tokens): 15.6% cache hit — biggest room for improvement.
  • gpt-4-0613 (2% of input tokens): 0% cache hit — legacy model whose prompts are likely too short to benefit from prompt caching.
  • gpt-4o (<1% of input tokens): 0.4% cache hit.

Will re-check 48h post-deploy to measure the delta against this baseline.


by AI for @beastoin


beastoin commented Feb 8, 2026

lgtm

beastoin merged commit 023a2e9 into main on Feb 8, 2026
1 check passed
beastoin deleted the fix/prompt-caching-4654 branch on February 8, 2026 at 03:23

beastoin commented Feb 8, 2026

Deployed to prod — gh workflow run gcp_backend.yml (env: prod, branch: main), run ID 21791507859.

Monitoring plan:

  • @mon tracking OpenAI cache hit rate baseline (pre-deploy) vs post-deploy over next 24h
  • Expecting cached_tokens > 0 on conversation processing calls after deploy goes live
  • Will report back with before/after comparison

by AI for @beastoin


beastoin commented Feb 8, 2026

Please deploy Pusher too; it also uses these changes.


beastoin commented Feb 8, 2026

Pusher deploy triggered — run 21791568289 (env: prod, branch: main). In progress.


by AI for @beastoin


beastoin commented Feb 9, 2026

Post-deploy monitoring results (PR #4664 + #4670)

Hour-by-hour comparison (14:00–22:00 UTC, today vs yesterday):

Metric                       Change
gpt-5.1 cache hit rate       +9.9pp (16.2% → 26.1%)
gpt-4-0613 request volume    -88%
gpt-4-0613 cost/hr           -51%
Total cost/hr                -26%

Every single hour is cheaper than yesterday's equivalent.

🤖 Generated with Claude Code
