A research-style lab exploring why AI memory retrieval fails.
The project tests semantic retrieval, recency, importance scoring, project detection, hybrid ranking, and query-type-aware retrieval policies on a small synthetic memory dataset.
This is not a SaaS product. It is an experiment-driven repository for understanding retrieval behavior in agent memory systems.
- Semantic similarity is useful but not sufficient for memory retrieval.
- Recent context stabilizes ambiguous references.
- Summary memory can lose factual details while still scoring highly.
- Query type matters: factual, planning, and conceptual queries need different ranking policies.
- Hybrid retrieval improves results but can still suppress sparse factual memories.
Latest harness run (python src/evaluate_retrieval.py):
| Metric | Value |
|---|---|
| Average Hit@5 | 0.83 |
| Average Recall@5 | 0.66 |
| Stage | Avg Recall@5 | Notes |
|---|---|---|
| Hardcoded MeOS active_project boost | 0.51 | Cross-project queries leaked or missed expected memories |
| Dynamic project detection | 0.66 | Current src/evaluate_retrieval.py (semantic + importance + recency + query-based project) |
| Adaptive retrieval (factual query) | m041 in top 5 | src/adaptive_retrieval_test.py on q002; not yet wired into the evaluation harness |
Weakest case in the standard evaluator: q002, a factual naming query (Recall@5 = 0). The adaptive policy recovers it in isolation but is not yet wired into the harness. For the full per-query breakdown, run python src/evaluate_retrieval.py.
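For readers unfamiliar with the two metrics, the sketch below shows one common way to compute Hit@k and Recall@k for a single query. It assumes the usual definitions (Hit@k is 1 if any expected memory appears in the top k; Recall@k is the fraction of expected memories found in the top k). The function names and the sample ids are illustrative and may not match what src/evaluate_retrieval.py does internally.

```python
from typing import Sequence, Set


def hit_at_k(ranked_ids: Sequence[str], expected_ids: Set[str], k: int = 5) -> float:
    """Return 1.0 if at least one expected memory is in the top k, else 0.0."""
    return 1.0 if any(mid in expected_ids for mid in ranked_ids[:k]) else 0.0


def recall_at_k(ranked_ids: Sequence[str], expected_ids: Set[str], k: int = 5) -> float:
    """Return the fraction of expected memories that appear in the top k."""
    if not expected_ids:
        return 0.0
    found = len(set(ranked_ids[:k]) & expected_ids)
    return found / len(expected_ids)


# Hypothetical ranking for one query: one of two expected memories is in the top 5.
ranked = ["a3", "a7", "a1", "a8", "a2", "a9"]
expected = {"a1", "a9"}

print(hit_at_k(ranked, expected))     # 1.0
print(recall_at_k(ranked, expected))  # 0.5
```

Averaging these per-query values over all evaluation queries would give figures comparable to the Average Hit@5 and Average Recall@5 rows above.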
query
↓
query type detection
↓
semantic + keyword + importance + recency + project signals
↓
ranked memory candidates
↓
evaluation: Hit@5 / Recall@5

- semantic retrieval
- recent conversational context
- summary memory
- importance-based ranking
- recency scoring
- dynamic project detection
- hybrid retrieval (semantic + keyword)
- query-type detection
- adaptive retrieval policies
- evaluation harness (evaluation_queries.json)
- failure cases in memory retrieval
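The signals listed above can be combined into a single ranking score in many ways. The following is a minimal sketch of a weighted sum, not the repository's implementation: the weights, the half-life, and the helper names (`keyword_overlap`, `recency_score`, `combined_score`, `rank`) are all assumptions chosen for illustration.

```python
import re
from datetime import datetime

# Hypothetical weights; the values used by the lab's scripts are not shown here.
WEIGHTS = {"semantic": 0.5, "keyword": 0.2, "importance": 0.1,
           "recency": 0.1, "project": 0.1}


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of query tokens that also occur in the memory text."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    t = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q & t) / len(q) if q else 0.0


def recency_score(timestamp: datetime, now: datetime, half_life_days: float = 30.0) -> float:
    """Exponential decay: a memory loses half its recency every half_life_days."""
    age_days = max((now - timestamp).total_seconds() / 86400, 0.0)
    return 0.5 ** (age_days / half_life_days)


def combined_score(signals: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-signal scores, each assumed to lie in [0, 1]."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)


def rank(candidates: dict, k: int = 5) -> list:
    """candidates maps memory id -> signals dict; returns the top-k ids."""
    ordered = sorted(candidates.items(),
                     key=lambda item: combined_score(item[1]), reverse=True)
    return [mid for mid, _ in ordered[:k]]


# Hypothetical candidates: a dense chunk with high similarity and a sparse fact
# with full keyword overlap but lower similarity.
candidates = {
    "dense_chunk": {"semantic": 0.9, "keyword": 0.0, "importance": 0.5,
                    "recency": 0.5, "project": 1.0},
    "sparse_fact": {"semantic": 0.4, "keyword": 1.0, "importance": 0.5,
                    "recency": 0.5, "project": 1.0},
}
print(rank(candidates))  # ['dense_chunk', 'sparse_fact']
```

With these illustrative weights the dense chunk still outranks the sparse fact, which mirrors the observation that hybrid ranking alone can suppress sparse factual memories.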
Nine scripted experiments from pure semantic retrieval through adaptive query-type policies.
| # | Topic | One-line outcome |
|---|---|---|
| 1 | Semantic retrieval | Works for direct queries; drifts on broad concepts |
| 2 | Ambiguous reference | Semantic similarity alone cannot resolve "the system" |
| 3 | Recent context | Short-term state stabilizes implicit references |
| 4 | Summary memory | High similarity without the needed fact |
| 5 | Importance ranking | Helps when many memories are close; weak if all scores are high |
| 6 | Active project boost | Focuses MeOS queries; hardcoded bias hurt cross-project eval |
| 7 | Hybrid retrieval | Partial factual improvement; dense chunks still dominate |
| 8 | Query type detection | Factual, planning, and conceptual queries need different weights |
| 9 | Adaptive policy | m041 in top 5 for factual q002 (demo script only) |
Full write-ups: notes/observations.md (observations 1–9), notes/failures.md (failure cases 1–5).
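Experiments 8 and 9 depend on classifying the query and then switching ranking weights. The sketch below shows the general idea with a rule-based classifier and per-type weight tables. The cue words, weight values, and function names are assumptions made for this example; src/query_type_test.py and src/adaptive_retrieval_test.py may use different rules.

```python
import re

# Hypothetical cue patterns; real queries may need a richer classifier.
FACTUAL = re.compile(r"\b(what is|what was|who|when|which|name)\b")
PLANNING = re.compile(r"\b(next|plan|should|roadmap|todo)\b")

# Hypothetical per-type weights: factual queries lean on keyword overlap,
# planning queries on recency and importance, conceptual queries on similarity.
POLICIES = {
    "factual":    {"semantic": 0.3, "keyword": 0.4, "importance": 0.1, "recency": 0.1, "project": 0.1},
    "planning":   {"semantic": 0.3, "keyword": 0.1, "importance": 0.2, "recency": 0.3, "project": 0.1},
    "conceptual": {"semantic": 0.6, "keyword": 0.1, "importance": 0.1, "recency": 0.1, "project": 0.1},
}


def detect_query_type(query: str) -> str:
    """Classify a query as factual, planning, or conceptual using cue words."""
    q = query.lower()
    if FACTUAL.search(q):
        return "factual"
    if PLANNING.search(q):
        return "planning"
    return "conceptual"


def adaptive_score(query: str, signals: dict) -> float:
    """Score one candidate with the weights selected by the query type."""
    weights = POLICIES[detect_query_type(query)]
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())


# Hypothetical candidates, as in a dense-chunk versus sparse-fact comparison.
dense = {"semantic": 0.9, "keyword": 0.0, "importance": 0.5, "recency": 0.5, "project": 1.0}
sparse = {"semantic": 0.4, "keyword": 1.0, "importance": 0.5, "recency": 0.5, "project": 1.0}

factual_query = "What is the name of the project?"
print(detect_query_type(factual_query))  # factual
print(adaptive_score(factual_query, sparse) > adaptive_score(factual_query, dense))  # True
```

Under the factual policy the sparse memory overtakes the dense chunk, while the conceptual policy would keep the dense chunk on top. The trade-off is that a misclassified query gets the wrong weights, so the classifier's accuracy bounds the benefit.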
Run all scripts from the repository root (paths such as data/conversations.json are relative).
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
copy .env.example .env

Set OPENAI_API_KEY in .env, then run the evaluation harness:

python src/evaluate_retrieval.py

The lab uses a small synthetic memory dataset stored in:

data/conversations.json

Evaluation queries live in:

data/evaluation_queries.json

Each memory includes: id, text, category, source, tags, importance, timestamp.
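As a rough illustration of the memory record, the sketch below parses records with the seven fields listed above into a dataclass. It assumes the file is a JSON array of objects, that `tags` is a list of strings, that `importance` is numeric, and that `timestamp` is a string; none of these types are confirmed by the text, and the sample record is invented for the example.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Memory:
    id: str
    text: str
    category: str
    source: str
    tags: List[str]      # assumed: list of strings
    importance: float    # assumed: numeric score
    timestamp: str       # assumed: string, e.g. ISO 8601


def parse_memories(raw_json: str) -> List[Memory]:
    """Parse a JSON array of memory objects; raises TypeError on unknown or missing fields."""
    return [Memory(**item) for item in json.loads(raw_json)]


# Hypothetical record; values are invented for illustration only.
sample = """[
  {"id": "x001", "text": "Example memory text.", "category": "note",
   "source": "chat", "tags": ["example"], "importance": 0.7,
   "timestamp": "2024-01-01T00:00:00"}
]"""

memories = parse_memories(sample)
print(memories[0].id, memories[0].tags)  # x001 ['example']
```

To load the real dataset under the same assumptions, read data/conversations.json from the repository root and pass its contents to `parse_memories`.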
| Script | Purpose |
|---|---|
| src/retrieval_test.py | Pure semantic retrieval (cosine similarity) |
| src/summary_test.py | Single compressed summary vs. multiple queries |
| src/ranking_test.py | Semantic similarity + importance |
| src/recency_ranking_test.py | Adds recency to the ranking mix |
| src/evaluate_retrieval.py | Hit@5 and Recall@5 over evaluation_queries.json |
| src/hybrid_retrieval_test.py | Semantic + keyword overlap + multi-signal ranking |
| src/query_type_test.py | Demo: factual / planning / conceptual classification |
| src/adaptive_retrieval_test.py | Query-type-aware ranking weights |
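The first script in the table ranks memories by cosine similarity between embeddings. The following dependency-free sketch shows that ranking step on toy two-dimensional vectors. In the real scripts the vectors would come from an embedding model; the function names and ids here are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float],
                       memory_vecs: Dict[str, Sequence[float]],
                       k: int = 5) -> List[Tuple[str, float]]:
    """Return the k memories most similar to the query, best first."""
    scored = [(mid, cosine_similarity(query_vec, vec))
              for mid, vec in memory_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real embeddings.
memory_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = rank_by_similarity([1.0, 0.0], memory_vecs, k=2)
print([(mid, round(score, 3)) for mid, score in top])  # [('a', 1.0), ('c', 0.707)]
```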
- Compare ranking weight grids systematically.
- Integrate adaptive policy into evaluate_retrieval.py.
- Test noisy context injection.
- Test whether too many retrieved memories reduce answer quality.
- Add metadata / entity index for factual lookup (q002).
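One possible shape for the metadata / entity index mentioned in the last item is an inverted index from tags to memory ids, consulted before or alongside semantic ranking. This is a speculative sketch of that idea, not a planned design. It assumes each memory is a dict with `id` and `tags` fields, and the sample records are invented.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_tag_index(memories: Iterable[dict]) -> Dict[str, Set[str]]:
    """Map each lowercased tag to the set of memory ids that carry it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for memory in memories:
        for tag in memory["tags"]:
            index[tag.lower()].add(memory["id"])
    return index


def lookup(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return ids of memories whose tags match any token in the query."""
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    hits: Set[str] = set()
    for token in tokens:
        hits |= index.get(token, set())
    return hits


# Hypothetical records; ids and tags are invented for illustration.
memories = [
    {"id": "x001", "tags": ["naming", "project"]},
    {"id": "x002", "tags": ["roadmap"]},
]
index = build_tag_index(memories)
print(sorted(lookup(index, "What was the naming decision?")))  # ['x001']
```

Exact-match lookups like this would not replace semantic ranking, but they could guarantee that a sparse factual memory is at least among the candidates.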
Detailed write-ups:
- notes/observations.md — Observations 1–9 (Finding / Implication)
- notes/failures.md — Failures 1–5 (Query / Expected / Actual / Cause / Possible Fix)