Description
Problem
`backend/utils/llms/memory.py:get_prompt_data()` fetches up to 1000 memories from Firestore on every LLM call that needs user context. This causes three issues (a simplified sketch follows the list):
- Token waste: All 1000 memories are serialized into a single string and injected into prompts, regardless of relevance to the current conversation. Most memories are irrelevant noise.
- Cache-breaking: The full memory string is injected into prompts as dynamic content. Any change to a user's memories (new memory added, memory updated) changes the entire string, invalidating any cached prefix that includes it.
- Redundant DB reads: The same 1000 memories are fetched repeatedly for the same user within minutes (no application-level caching). The code even has a `# TODO: cache this` comment at line 50.
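An illustrative sketch of the pattern described above (not the actual code in `memory.py`; the helper name and return shape are assumptions):

```python
# Illustrative sketch of the problem, not the actual backend code.
def get_prompt_memories_current(uid: str) -> str:
    # 1. Every call re-reads up to 1000 documents from Firestore (no cache).
    memories = memories_db.get_memories(uid, limit=1000)

    # 2. All of them are flattened into one big string, relevant or not.
    memory_text = "\n".join(f"- {m.content}" for m in memories)

    # 3. That string is injected into the prompt as dynamic content; any change
    #    to any memory changes these bytes and breaks the cached prompt prefix.
    return memory_text
```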
Affected Call Sites
Memory injection via `get_prompt_memories(uid)` is used in:

- `utils/llm/chat.py` — 6 call sites (chat/QA, mostly `llm_mini`)
- `utils/llm/memories.py` — 3 call sites (memory extraction, `llm_mini`)
- `utils/llm/external_integrations.py` — 2 call sites (daily summary uses `llm_medium_experiment`/gpt-5.1, conversation summary uses `llm_mini`)
- `utils/llm/proactive_notification.py` — 1 call site (`llm_mini`)
Note: The primary gpt-5.1 cost drivers (`get_transcript_structure` and `extract_action_items` in `conversation_processing.py`) do NOT use memory injection. This optimization primarily benefits the daily summary (gpt-5.1), chat/QA, and proactive notifications.
Proposed Fix (Three Parts)
Part A: Top-K relevant memories instead of all 1000
```python
def get_prompt_memories(uid: str, context: str = None, k: int = 50) -> tuple[str, str]:
    """Fetch top-K relevant memories instead of all 1000."""
    if context:
        # Use embedding similarity to find relevant memories
        relevant = retrieve_top_k_by_relevance(uid, context, k=k)
    else:
        # Fallback: most recent + highest-importance memories
        relevant = get_top_k_by_recency_and_importance(uid, k=k)
    ...
```
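Neither `retrieve_top_k_by_relevance` nor `get_top_k_by_recency_and_importance` exists yet; both names are part of this proposal. As a sketch of the embedding-similarity path, a pure ranking helper could look like the following (the `rank_by_relevance` name and the assumption that each memory is stored with a precomputed embedding are hypothetical):

```python
import numpy as np

def rank_by_relevance(memories: list[tuple[list[float], object]],
                      query_embedding: list[float], k: int = 50) -> list:
    """Return the k memories most similar to the query embedding (cosine similarity).

    Pure ranking helper: `memories` are (embedding, memory) pairs, so no DB or
    embedding-model access happens here. `retrieve_top_k_by_relevance(uid, context, k)`
    would embed the conversation context, load the user's memories with their
    stored embeddings, and delegate to this function.
    """
    q = np.asarray(query_embedding, dtype=np.float32)
    q /= np.linalg.norm(q) + 1e-9  # normalize so the dot product is cosine similarity

    scored = []
    for embedding, memory in memories:
        e = np.asarray(embedding, dtype=np.float32)
        e /= np.linalg.norm(e) + 1e-9
        scored.append((float(q @ e), memory))

    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [memory for _, memory in scored[:k]]
```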
Part B: Deterministic versioned memory packs
```python
import hashlib

def build_memory_pack(uid: str, memories: list) -> tuple[str, str]:
    """Build deterministic, cache-friendly memory text with version hash."""
    sorted_memories = sorted(memories, key=lambda m: m.id)  # Deterministic order
    text = "\n".join(f"- {m.content}" for m in sorted_memories)
    version = hashlib.md5(text.encode()).hexdigest()[:8]
    return text, version
```

The version hash enables cache key routing: `prompt_cache_key=f"omi-chat-{uid}-{version}"`.
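A hedged sketch of how the pack and version could be wired into a chat call (the prompt layout and `answer_chat` helper are illustrative; `prompt_cache_key` is assumed to be accepted as a request parameter by a recent OpenAI SDK, otherwise it can be sent via `extra_body`):

```python
# Sketch only: prompt layout and wiring are illustrative, not existing code.
from openai import OpenAI

client = OpenAI()

def answer_chat(uid: str, question: str, memories: list) -> str:
    memory_text, version = build_memory_pack(uid, memories)

    # Static instructions first, stable memory pack next, dynamic question last,
    # so the cacheable prefix stays byte-identical until `version` changes.
    system_prompt = (
        "You are Omi, the user's personal assistant.\n\n"
        f"User memories (pack {version}):\n{memory_text}"
    )

    response = client.chat.completions.create(
        model="gpt-5.1",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        # Routes requests for the same user/pack toward the same cache.
        prompt_cache_key=f"omi-chat-{uid}-{version}",
    )
    return response.choices[0].message.content
```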
Part C: Application-level cache (address the TODO at line 50)
```python
# Redis or in-memory cache with 5-10 min TTL
@cache(ttl=300)
def get_prompt_data(uid: str) -> tuple:
    existing_memories = memories_db.get_memories(uid, limit=1000)
    ...
```

Memories change infrequently (new ones added per conversation, not per second), so a 5-min TTL eliminates redundant DB reads during burst processing.
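The `@cache(ttl=300)` decorator above is shorthand; nothing in the stdlib provides it. One concrete option, assuming the `cachetools` package (not currently a dependency) and a single-process deployment, is a TTL cache keyed by uid:

```python
# Sketch assuming `cachetools`; a multi-process deployment would need Redis instead.
import threading

from cachetools import TTLCache, cached

_memory_cache = TTLCache(maxsize=10_000, ttl=300)  # 5-minute TTL
_memory_cache_lock = threading.Lock()

@cached(_memory_cache, lock=_memory_cache_lock)
def get_prompt_data(uid: str) -> tuple:
    existing_memories = memories_db.get_memories(uid, limit=1000)
    ...
```

Write paths that add or update a user's memories could also evict that user's entry proactively, so the next prompt sees fresh data instead of waiting for the TTL to expire.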
Impact
- Token reduction: 1000 → 50 memories = ~95% fewer memory tokens per call
- Cache improvement: Stable memory packs enable prefix caching for per-user prompts
- DB load reduction: Application cache eliminates redundant Firestore reads
- Primarily benefits daily summary (gpt-5.1) and chat/QA volume
Files to Change
- `backend/utils/llms/memory.py` — implement top-K, versioning, application cache
- `backend/utils/llm/chat.py` — pass conversation context to `get_prompt_memories` for relevance filtering
- `backend/utils/llm/external_integrations.py` — same
- `backend/utils/llm/proactive_notification.py` — same
References
- Research: `~/rnd/prompt-cache-hit-best-practices.md` by @geni (Section 2)
- OpenAI Prompt Caching Guide
Test Plan
- `backend/test.sh` — all tests pass
- Verify top-K returns relevant memories (not a random subset)
- Verify deterministic ordering produces byte-identical output for the same inputs (see the sketch below)
- Verify application cache TTL and invalidation behavior
- A/B test: compare chat quality with 50 vs 1000 memories
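A hedged pytest-style example of the determinism check (the `Memory` stand-in and its fields are assumptions for the sketch, and `build_memory_pack` from Part B is assumed to be importable):

```python
# Sketch of the byte-identical-output check; `Memory` is a stand-in dataclass,
# not the real model used by backend/utils/llms/memory.py.
from dataclasses import dataclass

@dataclass
class Memory:
    id: str
    content: str

def test_memory_pack_is_deterministic():
    memories = [Memory("b", "likes hiking"), Memory("a", "lives in Berlin")]
    shuffled = list(reversed(memories))

    text_1, version_1 = build_memory_pack("uid-123", memories)
    text_2, version_2 = build_memory_pack("uid-123", shuffled)

    # The same memories in any input order must yield byte-identical text and
    # the same version hash, i.e. a stable prompt_cache_key.
    assert text_1 == text_2
    assert version_1 == version_2
```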