
Optimize memory injection: versioned top-K packs with application-level caching #4673

@beastoin

Description

Problem

backend/utils/llms/memory.py:get_prompt_data() fetches up to 1000 memories from Firestore on every LLM call that needs user context. This causes three issues:

  1. Token waste: All 1000 memories are serialized into a single string and injected into prompts, regardless of relevance to the current conversation. Most memories are irrelevant noise.

  2. Cache-breaking: The full memory string is injected into prompts as dynamic content. Any change to a user's memories (new memory added, memory updated) changes the entire string, invalidating any cached prefix that includes it.

  3. Redundant DB reads: The same 1000 memories are fetched repeatedly for the same user within minutes (no application-level caching). The code even has a # TODO: cache this comment at line 50.

Affected Call Sites

Memory injection via get_prompt_memories(uid) is used in:

  • utils/llm/chat.py — 6 call sites (chat/QA, mostly llm_mini)
  • utils/llm/memories.py — 3 call sites (memory extraction, llm_mini)
  • utils/llm/external_integrations.py — 2 call sites (daily summary uses llm_medium_experiment/gpt-5.1, conversation summary uses llm_mini)
  • utils/llm/proactive_notification.py — 1 call site (llm_mini)

Note: The primary gpt-5.1 cost drivers (get_transcript_structure and extract_action_items in conversation_processing.py) do NOT use memory injection. This optimization primarily benefits daily summary (gpt-5.1), chat/QA, and proactive notifications.

Proposed Fix (Three Parts)

Part A: Top-K relevant memories instead of all 1000

def get_prompt_memories(uid: str, context: str | None = None, k: int = 50) -> tuple[str, str]:
    """Fetch top-K relevant memories instead of all 1000."""
    if context:
        # Use embedding similarity to find relevant memories
        relevant = retrieve_top_k_by_relevance(uid, context, k=k)
    else:
        # Fallback: most recent + highest-importance memories
        relevant = get_top_k_by_recency_and_importance(uid, k=k)
    ...
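
For the relevance branch, a minimal sketch of the ranking step, assuming each memory already carries a stored embedding and the query embedding is computed upstream; the helper name, the `embedding` field, and the use of numpy are assumptions, not existing code:

import numpy as np

def top_k_by_cosine(query_vec, memories: list, k: int = 50) -> list:
    """Rank memories by cosine similarity between the query embedding and each
    memory's stored embedding, then keep the k best matches."""
    q = np.asarray(query_vec, dtype=float)

    def score(m) -> float:
        v = np.asarray(m.embedding, dtype=float)
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))

    return sorted(memories, key=score, reverse=True)[:k]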

Part B: Deterministic versioned memory packs

import hashlib

def build_memory_pack(uid: str, memories: list) -> tuple[str, str]:
    """Build deterministic, cache-friendly memory text with version hash."""
    sorted_memories = sorted(memories, key=lambda m: m.id)  # Deterministic order
    text = "\n".join(f"- {m.content}" for m in sorted_memories)
    version = hashlib.md5(text.encode()).hexdigest()[:8]
    return text, version

The version hash enables cache key routing: prompt_cache_key=f"omi-chat-{uid}-{version}".
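
As an illustration of that wiring, one hypothetical call-site shape; the function name and prompt layout below are placeholders, not code from chat.py:

def build_chat_prompt(uid: str, user_message: str) -> tuple[list[dict], str]:
    # Stable per-user prefix: the memory pack bytes only change when its version does.
    memories = get_top_k_by_recency_and_importance(uid, k=50)   # Part A fallback path
    memory_text, version = build_memory_pack(uid, memories)     # Part B
    messages = [
        {"role": "system", "content": f"Known facts about the user:\n{memory_text}"},
        {"role": "user", "content": user_message},
    ]
    # Returned alongside the messages so the caller can pass it as prompt_cache_key,
    # routing identical prefixes to the same provider-side cache bucket.
    return messages, f"omi-chat-{uid}-{version}"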

Part C: Application-level cache (address the TODO at line 50)

# Placeholder decorator: Redis or an in-memory cache with a 5-10 min TTL (see the sketch below)
@cache(ttl=300)
def get_prompt_data(uid: str) -> tuple:
    existing_memories = memories_db.get_memories(uid, limit=1000)
    ...

Memories change infrequently (new ones added per conversation, not per second), so a 5-min TTL eliminates redundant DB reads during burst processing.
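
If Redis is overkill, a hand-rolled in-process TTL decorator covers the single-key-per-user case; this is a sketch under that assumption, standing in for the @cache(ttl=300) placeholder above:

import time
from functools import wraps

def ttl_cache(ttl: float = 300):
    """Cache results per positional-argument tuple, expiring entries after ttl seconds."""
    def decorator(fn):
        store: dict[tuple, tuple[float, object]] = {}

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl:
                return hit[1]  # fresh entry: skip the Firestore read
            value = fn(*args)
            store[args] = (now, value)
            return value

        wrapper.invalidate = lambda *args: store.pop(args, None)  # call on new memory writes
        return wrapper
    return decorator

Decorating get_prompt_data with @ttl_cache(ttl=300) gives the 5-minute behavior described above, and the write path can call get_prompt_data.invalidate(uid) so new memories show up before the TTL lapses.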

Impact

  • Token reduction: 1000 → 50 memories = ~95% fewer memory tokens per call
  • Cache improvement: Stable memory packs enable prefix caching for per-user prompts
  • DB load reduction: Application cache eliminates redundant Firestore reads
  • Primarily benefits daily summary (gpt-5.1) and chat/QA volume

Files to Change

  • backend/utils/llms/memory.py — implement top-K, versioning, application cache
  • backend/utils/llm/chat.py — pass conversation context to get_prompt_memories for relevance filtering
  • backend/utils/llm/external_integrations.py — same
  • backend/utils/llm/proactive_notification.py — same

Test Plan

  • backend/test.sh — all tests pass
  • Verify top-K returns relevant memories (not random subset)
  • Verify deterministic ordering produces byte-identical output for same inputs (see the sketch after this list)
  • Verify application cache TTL and invalidation behavior
  • A/B test: compare chat quality with 50 vs 1000 memories
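
A hedged pytest sketch for the determinism check above; the Memory shape and the import path are assumptions about where the final code lands:

from collections import namedtuple

from utils.llms.memory import build_memory_pack  # assumed final location

Memory = namedtuple("Memory", ["id", "content"])

def test_memory_pack_is_order_independent():
    a = [Memory("m1", "lives in Lisbon"), Memory("m2", "prefers tea")]
    b = list(reversed(a))
    assert build_memory_pack("uid-123", a) == build_memory_pack("uid-123", b)

def test_version_changes_when_memories_change():
    _, v1 = build_memory_pack("uid-123", [Memory("m1", "lives in Lisbon")])
    _, v2 = build_memory_pack("uid-123", [Memory("m1", "lives in Lisbon"), Memory("m2", "prefers tea")])
    assert v1 != v2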

Labels

  • intelligence (Layer: Summaries, insights, action items)
  • maintainer (Lane: High-risk, cross-system changes)
  • p1 (Priority: Critical, score 22-29)
