fix: swap prompt message order to restore gpt-5.1 caching (#4654) by beastoin · Pull Request #4670 · BasedHardware/omi

beastoin · 2026-02-08T07:54:07Z

Summary

Swaps the two system messages in get_transcript_structure and extract_action_items — static instructions now come first, dynamic conversation context second
Moves existing_items_context (dynamic per-user action items) from the instructions message to the context message, keeping the instruction prefix more static
Moves language_code from instructions to context message, making the instruction prefix fully static across ALL languages
Adds 5 regression tests for message ordering, two-message structure, existing_items_context and language_code placement
The previous order (PR feat: enable OpenAI prompt caching for conversation processing #4664) put per-conversation dynamic content first, which broke OpenAI's prefix-based cross-conversation caching for gpt-5.1
Root cause: OpenAI routes requests to cache hosts by hashing the first ~256 tokens — unique transcripts scattered requests across machines, destroying cache reuse

Context

Post-deploy monitoring showed gpt-5.1 cache hit rate dropping from ~18% to 6-10% after PR #4664 shipped. gpt-4.1-mini (unaffected by that PR) went up independently.

Key insight from Codex review: extract_action_items has ~2800 tokens of static instructions — well above OpenAI's 1024-token caching threshold. Putting these first restores cross-conversation prefix caching across thousands of calls/hour.

Integration test results (live OpenAI API, gpt-5.1)

Scenario	Cache Hit Rate	Detail
Same user, same conversation	97.8%	1792/1832 tokens cached on identical repeat call
Same user, cross conversation	87.7%	Instruction prefix cached across different transcripts (same language)
Cross user (en/es/ja)	43.9%	Instruction prefix shared across different languages
A/B: static vs dynamic prefix	88.8% vs 0.0%	`language_code` in context = 4736 MORE cached tokens across languages

What changed

backend/utils/llm/conversation_processing.py — swapped ChatPromptTemplate.from_messages order; moved existing_items_context and language_code to context message
backend/tests/unit/test_prompt_caching.py — 5 regression tests (message ordering + dynamic field placement)
backend/tests/integration/test_prompt_caching_integration.py — 4 live API tests: same-user/same-conv, same-user/cross-conv, cross-user/cross-lang, A/B comparison

Test plan

All 19 prompt caching unit tests pass (13 original + 5 new + 1 updated)
All 40 backend unit tests pass
Integration: 97.8% same-conversation, 87.7% cross-conversation, 88.8% cross-language cache hits
Codex reviewer: approved (PR_APPROVED_LGTM)
Codex tester: approved (TESTS_APPROVED)
Post-deploy: monitor gpt-5.1 cache hit rate recovery (expect return to ~18%+ baseline)

🤖 Generated with Claude Code

…caching (#4654) Put static instructions first, dynamic conversation context second. The previous order (dynamic content first) broke OpenAI prefix-based caching because every request started with a unique transcript, scattering requests across cache hosts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request aims to improve OpenAI prompt caching by reordering system messages to have a static instruction prefix. The changes in extract_action_items and get_transcript_structure correctly swap the message order. However, a critical issue exists in both implementations: the instruction prompts, intended to be static, still contain dynamic placeholders ({existing_items_context} in one, and {started_at} and {tz} in the other). This makes the prefix dynamic and defeats the entire purpose of the change. My review includes critical feedback on how to address this by separating the dynamic content from the static instruction prompts to truly enable cross-conversation caching.

gemini-code-assist · 2026-02-08T07:55:43Z

backend/utils/llm/conversation_processing.py

-    context_message = 'Content:\n{conversation_context}'
-
-    # Second system message: task-specific instructions
+    # First system message: task-specific instructions (static prefix enables cross-conversation caching)


The goal of this PR is to create a static prefix for prompts to improve caching. However, instructions_text is not static because it includes the {existing_items_context} placeholder, which is populated with dynamic content (recent action items). This defeats the purpose of reordering the messages for extract_action_items. To fix this, the dynamic existing_items_context should be moved out of the first system message. It could be part of a new, separate system message placed after the static instructions and before the main conversation context.

gemini-code-assist · 2026-02-08T07:55:43Z

backend/utils/llm/conversation_processing.py

-    context_message = 'Content:\n{conversation_context}'
-
-    # Second system message: task-specific instructions
+    # First system message: task-specific instructions (static prefix enables cross-conversation caching)


Similar to the issue in extract_action_items, the instructions_text for this function is not static. It includes {started_at} and {tz} placeholders, which are dynamic for each conversation. This prevents cross-conversation caching and defeats the purpose of this change. To achieve a static prefix, the sentence containing these dynamic placeholders should be moved out of the first system message, for example by appending it to the context_message.

… caching (#4654) Moves the dynamic existing action items data from the static instructions message to the conversation context message. This keeps the instruction prefix more static across calls, improving cross-conversation cache hits for the extract_action_items function. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds 4 tests verifying: - Static instructions come before dynamic context in both functions - Both functions use exactly two system messages - existing_items_context is in context message, not instructions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…4654) Validates cross-conversation cache hits with production-length instructions (>1024 tokens). Confirms correct order (instructions-first) produces 87.7% cache hit rate vs 0% with wrong order (content-first). Requires OPENAI_API_KEY; skipped automatically when not set. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… prefix caching (#4654) Addresses Gemini review: {language_code} at ~30 tokens in broke the static prefix for all non-English calls. Moving it to context_message makes the first ~1500 tokens fully static across ALL languages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

#4654) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Tests same-user/same-conversation (97.8%), same-user/cross-conversation (87.7%), and cross-user/cross-language (88.8% vs 0% with old approach). Validates the language_code move from instructions to context message. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

beastoin

lgtm

beastoin · 2026-02-08T13:29:08Z

Post-merge smoke test: Prompt prefix caching verification

Ran 10 varied conversation transcripts through the OpenAI API (gpt-5.1) using the exact same prompt structure as production (including real PydanticOutputParser format_instructions), measuring cached_tokens from the API response.

`extract_action_items` (3,326 token static prefix)

#	Topic	Input	Output	Cached	Cache%	Time
1	Tech standup	3,447	40	2,560	74.3%	0.95s
2	Doctor visit	3,444	91	3,328	96.6%	1.18s
3	Home renovation	3,448	81	3,328	96.5%	0.99s
4	Travel planning	3,457	41	3,328	96.3%	0.81s
5	Parent-teacher	3,459	76	2,560	74.0%	0.94s
6	Investor meeting	3,466	79	3,328	96.0%	0.94s
7	Wedding planning	3,463	143	3,328	96.1%	1.74s
8	Fitness coaching	3,466	49	3,328	96.0%	0.86s
9	Book club	3,480	93	3,328	95.6%	1.04s
10	Car breakdown	3,474	116	3,328	95.8%	1.40s

Calls 2-10: 93.7% cache hit rate

`get_transcript_structure` (1,761 token static prefix)

Prefix = 800 tokens instruction text + 994 tokens Structured pydantic schema (format_instructions).

#	Topic	Input	Output	Cached	Cache%	Time
1	Tech standup	1,821	244	0	0.0%	2.64s
2	Doctor visit	1,820	143	1,664	91.4%	2.76s
3	Home renovation	1,817	195	1,664	91.6%	2.33s
4	Travel planning	1,812	165	1,664	91.8%	1.91s
5	Parent-teacher	1,808	139	1,664	92.0%	1.91s
6	Investor meeting	1,819	192	1,664	91.5%	3.15s
7	Wedding planning	1,813	213	0	0.0%	2.25s
8	Fitness coaching	1,814	135	1,664	91.7%	1.55s
9	Book club	1,818	193	1,664	91.5%	3.17s
10	Car breakdown	1,813	148	1,664	91.8%	1.42s

Calls 2-10: 81.5% cache hit rate

Summary

Function	Static prefix tokens	Cache hit (calls 2-10)
`extract_action_items`	3,326	93.7%
`get_transcript_structure`	1,761	81.5%

Both functions are caching effectively. The static-instructions-first, dynamic-content-second message ordering from this PR enables OpenAI's automatic prefix caching across all conversation topics.

Minor note: get_transcript_structure has language_code in the instruction text (not the dynamic context message), so its cache is per-language rather than shared across languages. Low priority since most users stay on one language.

beastoin · 2026-02-09T00:02:02Z

Post-deploy monitoring results

Hour-by-hour comparison (14:00–22:00 UTC, today vs yesterday):

Metric	Change
gpt-5.1 cache hit rate	+9.9pp (16.2% → 26.1%)
gpt-4-0613 request volume	-88%
gpt-4-0613 cost/hr	-51%
Total cost/hr	-26%

Every single hour is cheaper than yesterday's equivalent. Message ordering fix is confirmed working in production.

🤖 Generated with Claude Code

beastoin and others added 2 commits February 8, 2026 07:53

test: update prompt caching test docstring for new message order (#4654)

8cec443

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist bot reviewed Feb 8, 2026

View reviewed changes

beastoin and others added 6 commits February 8, 2026 07:58

test: add regression test for language_code not in instructions prefix (

89a51f8

#4654) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

beastoin commented Feb 8, 2026

View reviewed changes

beastoin merged commit 2d89b74 into main Feb 8, 2026
1 check passed

beastoin deleted the fix/prompt-caching-message-order branch February 8, 2026 08:33

beastoin mentioned this pull request Feb 8, 2026

feat: enable 24h prompt cache retention + routing keys for gpt-5.1 (#4672) #4674

Open

4 tasks

beastoin mentioned this pull request Feb 8, 2026

Enable 24h prompt cache retention on gpt-5.1 calls (fix ordering + add params) #4672

Open

4 tasks

beastoin mentioned this pull request Feb 9, 2026

feat: enable OpenAI prompt caching for conversation processing #4664

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: swap prompt message order to restore gpt-5.1 caching (#4654)#4670

fix: swap prompt message order to restore gpt-5.1 caching (#4654)#4670
beastoin merged 8 commits intomainfrom
fix/prompt-caching-message-order

beastoin commented Feb 8, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Feb 8, 2026

Uh oh!

gemini-code-assist bot Feb 8, 2026

Uh oh!

beastoin left a comment

Uh oh!

Uh oh!

beastoin commented Feb 8, 2026 •

edited

Loading

Uh oh!

beastoin commented Feb 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

beastoin commented Feb 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Integration test results (live OpenAI API, gpt-5.1)

What changed

Test plan

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 8, 2026

Choose a reason for hiding this comment

Uh oh!

beastoin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

beastoin commented Feb 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Post-merge smoke test: Prompt prefix caching verification

extract_action_items (3,326 token static prefix)

get_transcript_structure (1,761 token static prefix)

Summary

Uh oh!

beastoin commented Feb 9, 2026

Post-deploy monitoring results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

beastoin commented Feb 8, 2026 •

edited

Loading

beastoin commented Feb 8, 2026 •

edited

Loading

`extract_action_items` (3,326 token static prefix)

`get_transcript_structure` (1,761 token static prefix)