feat: enable 24h prompt cache retention + routing keys for gpt-5.1 (#4672) by beastoin · Pull Request #4674 · BasedHardware/omi

beastoin · 2026-02-08T08:42:22Z

Summary

Adds prompt_cache_retention: "24h" to all three gpt-5.1 ChatOpenAI clients via extra_body, extending cache from ~5-10min in-memory to 24h SSD-backed retention
Adds per-function prompt_cache_key routing hints to improve cache hit rates by routing similar requests to the same cache host
Part B of issue Enable 24h prompt cache retention on gpt-5.1 calls (fix ordering + add params) #4672 (Part A — message ordering fix — was shipped in PR fix: swap prompt message order to restore gpt-5.1 caching (#4654) #4670)

What changed

`backend/utils/llm/clients.py`

Added extra_body={"prompt_cache_retention": "24h"} to llm_medium_experiment, llm_agent, and llm_agent_stream
Note: must use extra_body not model_kwargs — the OpenAI SDK doesn't accept prompt_cache_retention as a direct kwarg

`backend/utils/llm/conversation_processing.py`

extract_action_items() → .bind(prompt_cache_key="omi-extract-actions")
get_transcript_structure() → .bind(prompt_cache_key="omi-transcript-structure")
get_reprocess_transcript_structure() → .bind(prompt_cache_key="omi-transcript-structure")
get_app_result() → .invoke(prompt, prompt_cache_key="omi-app-result")
generate_summary_with_prompt() → .invoke(full_prompt, prompt_cache_key="omi-daily-summary")

`backend/tests/unit/test_process_conversation_usage_context.py`

Updated regex patterns to handle .bind() and extra invoke kwargs

`backend/tests/integration/test_prompt_caching_integration.py`

Added 5 new live API tests for retention and cache key params

Integration test results (live gpt-5.1 API)

Test	Cache Hit Rate
`prompt_cache_retention="24h"` accepted	PASS
24h retention + cross-conversation	87.7%
`prompt_cache_key` accepted	PASS
Cache key + cross-conversation	87.7%
Combined (retention + key)	91.4%

How it works

prompt_cache_retention: "24h": OpenAI keeps cached prefixes on SSD for 24h instead of default ~5-10min in-memory. 90% discount on cached input tokens, zero write surcharge.
prompt_cache_key: Combined with the prefix hash to route requests to machines holding the relevant cache. Each function gets a unique key so different instruction prefixes don't collide.

Expected impact

PR fix: swap prompt message order to restore gpt-5.1 caching (#4654) #4670 (Part A) restored cache hits from ~6-10% to ~18%+ by fixing message ordering
This PR (Part B) should push cache hits to 50-80% via 24h retention + routing
Combined: estimated 25-45% reduction in gpt-5.1 input costs (~55% of total OpenAI spend)

Test plan

All 228 backend unit tests pass
9 integration tests pass (4 message ordering + 5 retention/key)
Verified extra_body is the correct LangChain mechanism (not model_kwargs)
Post-deploy: monitor gpt-5.1 cached_tokens / prompt_tokens ratio — expect 50%+

Closes #4672

🤖 Generated with Claude Code

Adds model_kwargs={"prompt_cache_retention": "24h"} to all three gpt-5.1 ChatOpenAI instances (llm_medium_experiment, llm_agent, llm_agent_stream). Extends cache from default ~5-10min in-memory to 24h SSD-backed retention with 90% input token discount. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds .bind(prompt_cache_key=...) to gpt-5.1 call sites: - omi-extract-actions for extract_action_items() - omi-transcript-structure for get_transcript_structure() and reprocess - omi-app-result for get_app_result() - omi-daily-summary for generate_summary_with_prompt() Routes similar requests to the same cache host for better hit rates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…4672) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request enables 24-hour prompt cache retention for gpt-5.1 models and adds prompt_cache_key routing hints to several LLM calls to improve cache hit rates. The changes in clients.py correctly configure the cache retention, and the test adjustments in test_process_conversation_usage_context.py accommodate the new calling patterns. My review focuses on the implementation of the cache keys in conversation_processing.py and I have one suggestion to improve consistency.

gemini-code-assist · 2026-02-08T08:43:39Z

backend/utils/llm/conversation_processing.py


    prompt = ChatPromptTemplate.from_messages([('system', prompt_text)])
-    chain = prompt | llm_medium_experiment | parser
+    chain = prompt | llm_medium_experiment.bind(prompt_cache_key="omi-transcript-structure") | parser


The prompt_cache_key "omi-transcript-structure" is also used in get_transcript_structure. However, the system prompts for get_transcript_structure and get_reprocess_transcript_structure are different. According to the pull request description, 'Each function gets a unique key so different instruction prefixes don't collide.' Using the same key for functions with different prompt prefixes could be confusing and might not be optimal for caching. To align with the stated goal of this PR and improve clarity, consider using a unique key for this function, for example omi-reprocess-transcript-structure.

Suggested change

chain = prompt | llm_medium_experiment.bind(prompt_cache_key="omi-transcript-structure") | parser

chain = prompt | llm_medium_experiment.bind(prompt_cache_key="omi-reprocess-transcript-structure") | parser

#4672) The OpenAI SDK doesn't accept prompt_cache_retention as a direct kwarg — it must be passed via extra_body. LangChain's ChatOpenAI supports extra_body as a native field. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…he_key (#4672) Validates with live gpt-5.1 API: - prompt_cache_retention="24h" accepted via extra_body (87.7% cache hits) - prompt_cache_key routing hints accepted (87.7% cache hits) - Combined retention + key (91.4% cache hits) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

beastoin · 2026-02-09T02:59:43Z

Smoke Test Results

Environment: Local backend (dev environment, based-hardware-dev Firestore + Redis)
Branch: feat/prompt-cache-retention
Backend: uvicorn main:app on port 8788

Results

#	Test	Status
1	Health check: `GET /v1/conversations` → HTTP 200	PASS
2	`llm_medium_experiment`: model=gpt-5.1, `extra_body={"prompt_cache_retention": "24h"}`	PASS
3	`llm_agent`: model=gpt-5.1, `extra_body={"prompt_cache_retention": "24h"}`	PASS
4	`llm_agent_stream`: model=gpt-5.1, `extra_body={"prompt_cache_retention": "24h"}`	PASS
5	`prompt_cache_retention` NOT in `model_kwargs` (SDK rejection fix)	PASS
6	`get_transcript_structure` has `prompt_cache_key` routing	PASS
7	`extract_action_items` has `prompt_cache_key` routing	PASS
8	2 distinct `prompt_cache_key` values: `omi-transcript-structure`, `omi-extract-actions`	PASS
9	18 unit tests pass	PASS
10	9 integration tests pass (live OpenAI API, 42.65s)	PASS

Integration Test Detail (live OpenAI API)

Test	Status
`test_same_function_same_transcript_full_cache`	PASS
`test_same_language_different_transcripts`	PASS
`test_different_languages_share_instruction_cache`	PASS
`test_cross_user_vs_language_in_instructions`	PASS
`test_24h_retention_accepted_by_api`	PASS
`test_24h_retention_cache_hits`	PASS
`test_cache_key_accepted_by_api`	PASS
`test_same_key_cross_conversation_cache`	PASS
`test_retention_and_key_combined`	PASS

All 27 tests passed (18 unit + 9 integration). Ready to merge.

Merged _agent_cache_kwargs (prompt_cache_key routing from main) with extra_body prompt_cache_retention from this branch. Updated test that incorrectly asserted prompt_cache_retention should not exist — extra_body is the correct mechanism (not model_kwargs). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

beastoin and others added 3 commits February 8, 2026 08:41

test: update model check regex to handle .bind() cache key pattern (#…

103afde

…4672) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist bot reviewed Feb 8, 2026

View reviewed changes

beastoin and others added 2 commits February 8, 2026 09:07

beastoin mentioned this pull request Feb 8, 2026

Enable 24h prompt cache retention on gpt-5.1 calls (fix ordering + add params) #4672

Open

4 tasks

beastoin mentioned this pull request Feb 9, 2026

Optimize conversation processing LLM costs (62% of total spend) #4635

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: enable 24h prompt cache retention + routing keys for gpt-5.1 (#4672)#4674

feat: enable 24h prompt cache retention + routing keys for gpt-5.1 (#4672)#4674
beastoin wants to merge 6 commits intomainfrom
feat/prompt-cache-retention

beastoin commented Feb 8, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Feb 8, 2026

Uh oh!

beastoin commented Feb 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	chain = prompt \| llm_medium_experiment.bind(prompt_cache_key="omi-transcript-structure") \| parser
	chain = prompt \| llm_medium_experiment.bind(prompt_cache_key="omi-reprocess-transcript-structure") \| parser

Conversation

beastoin commented Feb 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

backend/utils/llm/clients.py

backend/utils/llm/conversation_processing.py

backend/tests/unit/test_process_conversation_usage_context.py

backend/tests/integration/test_prompt_caching_integration.py

Integration test results (live gpt-5.1 API)

How it works

Expected impact

Test plan

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 8, 2026

Choose a reason for hiding this comment

Uh oh!

beastoin commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Smoke Test Results

Results

Integration Test Detail (live OpenAI API)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

beastoin commented Feb 8, 2026 •

edited

Loading

`backend/utils/llm/clients.py`

`backend/utils/llm/conversation_processing.py`

`backend/tests/unit/test_process_conversation_usage_context.py`

`backend/tests/integration/test_prompt_caching_integration.py`

beastoin commented Feb 9, 2026 •

edited

Loading