Description
Problem
`backend/utils/llms/memory.py:get_prompt_data()` fetches up to 1000 memories from Firestore on every LLM call that needs user context. This causes three issues (a simplified sketch follows the list):
- Token waste: All 1000 memories are serialized into a single string and injected into prompts, regardless of relevance to the current conversation. Most memories are irrelevant noise.
- Cache-breaking: The full memory string is injected into prompts as dynamic content. Any change to a user's memories (new memory added, memory updated) changes the entire string, invalidating any cached prefix that includes it.
- Redundant DB reads: The same 1000 memories are fetched repeatedly for the same user within minutes (no application-level caching). The code even has a `# TODO: cache this` comment at line 50.
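An illustrative sketch of the pattern described above (not the actual code in `memory.py`; the helper name and return shape are assumptions):

```python
# Illustrative sketch of the problem, not the actual backend code.
def get_prompt_memories_current(uid: str) -> str:
    # 1. Every call re-reads up to 1000 documents from Firestore (no cache).
    memories = memories_db.get_memories(uid, limit=1000)

    # 2. All of them are flattened into one big string, relevant or not.
    memory_text = "\n".join(f"- {m.content}" for m in memories)

    # 3. That string is injected into the prompt as dynamic content; any change
    #    to any memory changes these bytes and breaks the cached prompt prefix.
    return memory_text
```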
Affected Call Sites
Memory injection via `get_prompt_memories(uid)` is used in:

- `utils/llm/chat.py` — 6 call sites (chat/QA, mostly `llm_mini`)
- `utils/llm/memories.py` — 3 call sites (memory extraction, `llm_mini`)
- `utils/llm/external_integrations.py` — 2 call sites (daily summary uses `llm_medium_experiment`/gpt-5.1, conversation summary uses `llm_mini`)
- `utils/llm/proactive_notification.py` — 1 call site (`llm_mini`)
Note: The primary gpt-5.1 cost drivers (`get_transcript_structure` and `extract_action_items` in `conversation_processing.py`) do NOT use memory injection. This optimization primarily benefits the daily summary (gpt-5.1), chat/QA, and proactive notifications.
Proposed Fix (Three Parts)
Part A: Top-K relevant memories instead of all 1000
```python
def get_prompt_memories(uid: str, context: str = None, k: int = 50) -> tuple[str, str]:
    """Fetch top-K relevant memories instead of all 1000."""
    if context:
        # Use embedding similarity to find relevant memories
        relevant = retrieve_top_k_by_relevance(uid, context, k=k)
    else:
        # Fallback: most recent + highest-importance memories
        relevant = get_top_k_by_recency_and_importance(uid, k=k)
    ...
```
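Neither `retrieve_top_k_by_relevance` nor `get_top_k_by_recency_and_importance` exists yet; both names are part of this proposal. As a sketch of the embedding-similarity path, a pure ranking helper could look like the following (the `rank_by_relevance` name and the assumption that each memory is stored with a precomputed embedding are hypothetical):

```python
import numpy as np

def rank_by_relevance(memories: list[tuple[list[float], object]],
                      query_embedding: list[float], k: int = 50) -> list:
    """Return the k memories most similar to the query embedding (cosine similarity).

    Pure ranking helper: `memories` are (embedding, memory) pairs, so no DB or
    embedding-model access happens here. `retrieve_top_k_by_relevance(uid, context, k)`
    would embed the conversation context, load the user's memories with their
    stored embeddings, and delegate to this function.
    """
    q = np.asarray(query_embedding, dtype=np.float32)
    q /= np.linalg.norm(q) + 1e-9  # normalize so the dot product is cosine similarity

    scored = []
    for embedding, memory in memories:
        e = np.asarray(embedding, dtype=np.float32)
        e /= np.linalg.norm(e) + 1e-9
        scored.append((float(q @ e), memory))

    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [memory for _, memory in scored[:k]]
```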
Part B: Deterministic versioned memory packs
```python
import hashlib

def build_memory_pack(uid: str, memories: list) -> tuple[str, str]:
    """Build deterministic, cache-friendly memory text with version hash."""
    sorted_memories = sorted(memories, key=lambda m: m.id)  # Deterministic order
    text = "\n".join(f"- {m.content}" for m in sorted_memories)
    version = hashlib.md5(text.encode()).hexdigest()[:8]
    return text, version
```

The version hash enables cache key routing: `prompt_cache_key=f"omi-chat-{uid}-{version}"`.
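A hedged sketch of how the pack and version could be wired into a chat call (the prompt layout and `answer_chat` helper are illustrative; `prompt_cache_key` is assumed to be accepted as a request parameter by a recent OpenAI SDK, otherwise it can be sent via `extra_body`):

```python
# Sketch only: prompt layout and wiring are illustrative, not existing code.
from openai import OpenAI

client = OpenAI()

def answer_chat(uid: str, question: str, memories: list) -> str:
    memory_text, version = build_memory_pack(uid, memories)

    # Static instructions first, stable memory pack next, dynamic question last,
    # so the cacheable prefix stays byte-identical until `version` changes.
    system_prompt = (
        "You are Omi, the user's personal assistant.\n\n"
        f"User memories (pack {version}):\n{memory_text}"
    )

    response = client.chat.completions.create(
        model="gpt-5.1",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        # Routes requests for the same user/pack toward the same cache.
        prompt_cache_key=f"omi-chat-{uid}-{version}",
    )
    return response.choices[0].message.content
```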
Part C: Application-level cache (address the TODO at line 50)
```python
# Redis or in-memory cache with 5-10 min TTL
@cache(ttl=300)
def get_prompt_data(uid: str) -> tuple:
    existing_memories = memories_db.get_memories(uid, limit=1000)
    ...
```

Memories change infrequently (new ones added per conversation, not per second), so a 5-min TTL eliminates redundant DB reads during burst processing.
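The `@cache(ttl=300)` decorator above is shorthand; nothing in the stdlib provides it. One concrete option, assuming the `cachetools` package (not currently a dependency) and a single-process deployment, is a TTL cache keyed by uid:

```python
# Sketch assuming `cachetools`; a multi-process deployment would need Redis instead.
import threading

from cachetools import TTLCache, cached

_memory_cache = TTLCache(maxsize=10_000, ttl=300)  # 5-minute TTL
_memory_cache_lock = threading.Lock()

@cached(_memory_cache, lock=_memory_cache_lock)
def get_prompt_data(uid: str) -> tuple:
    existing_memories = memories_db.get_memories(uid, limit=1000)
    ...
```

Write paths that add or update a user's memories could also evict that user's entry proactively, so the next prompt sees fresh data instead of waiting for the TTL to expire.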
Impact
- Token reduction: 1000 → 50 memories = ~95% fewer memory tokens per call
- Cache improvement: Stable memory packs enable prefix caching for per-user prompts
- DB load reduction: Application cache eliminates redundant Firestore reads
- Primarily benefits daily summary (gpt-5.1) and chat/QA volume
Files to Change
- `backend/utils/llms/memory.py` — implement top-K, versioning, application cache
- `backend/utils/llm/chat.py` — pass conversation context to `get_prompt_memories` for relevance filtering
- `backend/utils/llm/external_integrations.py` — same
- `backend/utils/llm/proactive_notification.py` — same
References
- Research: `~/rnd/prompt-cache-hit-best-practices.md` by @geni (Section 2)
- OpenAI Prompt Caching Guide
Test Plan
- `backend/test.sh` — all tests pass
- Verify top-K returns relevant memories (not a random subset)
- Verify deterministic ordering produces byte-identical output for the same inputs (see the sketch below)
- Verify application cache TTL and invalidation behavior
- A/B test: compare chat quality with 50 vs 1000 memories
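A hedged pytest-style example of the determinism check (the `Memory` stand-in and its fields are assumptions for the sketch, and `build_memory_pack` from Part B is assumed to be importable):

```python
# Sketch of the byte-identical-output check; `Memory` is a stand-in dataclass,
# not the real model used by backend/utils/llms/memory.py.
from dataclasses import dataclass

@dataclass
class Memory:
    id: str
    content: str

def test_memory_pack_is_deterministic():
    memories = [Memory("b", "likes hiking"), Memory("a", "lives in Berlin")]
    shuffled = list(reversed(memories))

    text_1, version_1 = build_memory_pack("uid-123", memories)
    text_2, version_2 = build_memory_pack("uid-123", shuffled)

    # The same memories in any input order must yield byte-identical text and
    # the same version hash, i.e. a stable prompt_cache_key.
    assert text_1 == text_2
    assert version_1 == version_2
```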