yigitsokel1/agent-memory-lab


Agent Memory Lab

A research-style lab exploring why AI memory retrieval fails.

The project tests semantic retrieval, recency, importance scoring, project detection, hybrid ranking, and query-type-aware retrieval policies on a small synthetic memory dataset.

This is not a SaaS product. It is an experiment-driven repository for understanding retrieval behavior in agent memory systems.

Key findings

  • Semantic similarity is useful but not sufficient for memory retrieval.
  • Recent context stabilizes ambiguous references.
  • Summary memory can lose factual details while still scoring highly.
  • Query type matters: factual, planning, and conceptual queries need different ranking policies.
  • Hybrid retrieval improves results but can still suppress sparse factual memories.

Results snapshot

Latest harness run (python src/evaluate_retrieval.py):

| Metric | Value |
| --- | --- |
| Average Hit@5 | 0.83 |
| Average Recall@5 | 0.66 |

| Stage | Avg Recall@5 | Notes |
| --- | --- | --- |
| Hardcoded MeOS active_project boost | 0.51 | Cross-project queries leaked or missed expected memories |
| Dynamic project detection | 0.66 | Current src/evaluate_retrieval.py (semantic + importance + recency + query-based project) |
| Adaptive retrieval (factual query) | m041 in top 5 | src/adaptive_retrieval_test.py on q002; not yet wired into the evaluation harness |

The weakest case in the standard evaluator is q002, a factual naming query (Recall@5 = 0). The adaptive policy addresses it in isolation. For the full per-query breakdown, run python src/evaluate_retrieval.py.
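For readers unfamiliar with the two metrics, here is a minimal sketch of how Hit@5 and Recall@5 are commonly computed. It assumes the usual definitions (Hit@k is 1 if any expected memory appears in the top k; Recall@k is the fraction of expected memories found there). The function names and the example ranking are illustrative and may differ from what src/evaluate_retrieval.py does.

```python
def hit_at_k(ranked_ids, expected_ids, k=5):
    """Return 1.0 if any expected memory appears in the top k, else 0.0."""
    top_k = set(ranked_ids[:k])
    return 1.0 if top_k & set(expected_ids) else 0.0


def recall_at_k(ranked_ids, expected_ids, k=5):
    """Return the fraction of expected memories found in the top k."""
    expected = set(expected_ids)
    if not expected:
        return 0.0
    return len(expected & set(ranked_ids[:k])) / len(expected)


# Hypothetical ranking for a query whose expected memories are m041 and m007.
ranked = ["m012", "m007", "m033", "m019", "m002", "m041"]
expected = ["m041", "m007"]

print(hit_at_k(ranked, expected))     # 1.0 (m007 is in the top 5)
print(recall_at_k(ranked, expected))  # 0.5 (m041 sits at rank 6)
```

A query can therefore count as a hit while still having low recall, which is why the averages for the two metrics diverge.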

Retrieval pipeline

query
  ↓
query type detection
  ↓
semantic + keyword + importance + recency + project signals
  ↓
ranked memory candidates
  ↓
evaluation: Hit@5 / Recall@5
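
The scoring step of the pipeline above can be sketched as a weighted sum of signals. This is an illustration under assumptions, not the repository's implementation: the weights, the 30-day recency half-life, the [0, 1] importance scale, and all function names are hypothetical, and the toy vectors stand in for real embeddings.

```python
import math
from datetime import datetime, timezone


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_overlap(query, text):
    """Fraction of query tokens that also occur in the memory text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def recency_score(timestamp, now, half_life_days=30.0):
    """Exponential decay: 1.0 for a new memory, 0.5 after one half-life."""
    age_days = max((now - timestamp).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


# Illustrative weights only; the values used in the lab are not stated here.
WEIGHTS = {"semantic": 0.5, "keyword": 0.2, "importance": 0.15,
           "recency": 0.1, "project": 0.05}


def score_memory(query, query_vec, memory, memory_vec, now, detected_project=None):
    """Combine the five signals into one ranking score."""
    signals = {
        "semantic": cosine_similarity(query_vec, memory_vec),
        "keyword": keyword_overlap(query, memory["text"]),
        "importance": memory["importance"],  # assumed to lie in [0, 1]
        "recency": recency_score(memory["timestamp"], now),
        "project": 1.0 if detected_project in memory["tags"] else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())


# Toy memory and toy 2-d "embeddings", purely for demonstration.
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
memory = {
    "text": "MeOS kickoff notes",
    "importance": 0.8,
    "timestamp": datetime(2025, 1, 1, tzinfo=timezone.utc),
    "tags": ["MeOS"],
}
score = score_memory("what does MeOS use", [1.0, 0.0], memory, [1.0, 0.0],
                     now, detected_project="MeOS")
print(round(score, 3))
```

Keeping every signal in the range [0, 1] makes the weights directly comparable, which helps when reasoning about why one signal drowns out another.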

What we test

  • semantic retrieval
  • recent conversational context
  • summary memory
  • importance-based ranking
  • recency scoring
  • dynamic project detection
  • hybrid retrieval (semantic + keyword)
  • query-type detection
  • adaptive retrieval policies
  • evaluation harness (evaluation_queries.json)
  • failure cases in memory retrieval
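
Of the items above, dynamic project detection is the easiest to misread, so here is one minimal way it could work: look for a known project name in the query instead of hardcoding an active project. The function name, the token-matching rule, and the project name "AgentLab" are assumptions for illustration; the repository may detect projects differently.

```python
def detect_project(query, known_projects):
    """Return the first known project named in the query, or None."""
    # Strip common punctuation so "MeOS?" still matches "MeOS".
    tokens = {token.strip(".,?!:;\"'()").lower() for token in query.split()}
    for project in known_projects:
        if project.lower() in tokens:
            return project
    return None


known = ["MeOS", "AgentLab"]  # "AgentLab" is a made-up second project

print(detect_project("What did we decide for MeOS storage?", known))  # MeOS
print(detect_project("How does memory ranking work?", known))         # None
```

Returning None when no project is named means no project boost is applied, which avoids the cross-project leakage that a hardcoded boost causes.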

Experiments

Nine scripted experiments from pure semantic retrieval through adaptive query-type policies.

| # | Topic | One-line outcome |
| --- | --- | --- |
| 1 | Semantic retrieval | Works for direct queries; drifts on broad concepts |
| 2 | Ambiguous reference | Semantic similarity alone cannot resolve "the system" |
| 3 | Recent context | Short-term state stabilizes implicit references |
| 4 | Summary memory | High similarity without the needed fact |
| 5 | Importance ranking | Helps when many memories are close; weak if all scores are high |
| 6 | Active project boost | Focuses MeOS queries; hardcoded bias hurt cross-project eval |
| 7 | Hybrid retrieval | Partial factual improvement; dense chunks still dominate |
| 8 | Query type detection | Factual, planning, and conceptual queries need different weights |
| 9 | Adaptive policy | m041 in top 5 for factual q002 (demo script only) |

Full write-ups: notes/observations.md (observations 1–9), notes/failures.md (failure cases 1–5).
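
Experiments 8 and 9 can be illustrated with a small sketch: classify the query, then pick a weight profile for that type. The cue words, weight values, and signal scores below are hypothetical and chosen only to show how a sparse factual memory can outrank a dense chunk under a factual policy; they are not taken from src/query_type_test.py or src/adaptive_retrieval_test.py.

```python
import re

# Hypothetical cue words; word boundaries avoid matching "plan" in "explanation".
PLANNING_PATTERN = re.compile(r"\b(plan|next|roadmap|should|todo|prioritize)\b")
FACTUAL_PATTERN = re.compile(r"\b(who|when|where|which|what is|what was|name|called)\b")


def detect_query_type(query):
    """Classify a query as planning, factual, or conceptual (the fallback)."""
    q = query.lower()
    if PLANNING_PATTERN.search(q):
        return "planning"
    if FACTUAL_PATTERN.search(q):
        return "factual"
    return "conceptual"


# Illustrative weight profiles, one per query type.
POLICY_WEIGHTS = {
    "factual": {"semantic": 0.3, "keyword": 0.5, "importance": 0.1, "recency": 0.1},
    "planning": {"semantic": 0.4, "keyword": 0.1, "importance": 0.2, "recency": 0.3},
    "conceptual": {"semantic": 0.7, "keyword": 0.1, "importance": 0.15, "recency": 0.05},
}


def adaptive_score(query, signals):
    """Weight the signals using the profile for the detected query type."""
    weights = POLICY_WEIGHTS[detect_query_type(query)]
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)


# Made-up signal values: a short factual memory vs. a long, dense chunk.
sparse_fact = {"semantic": 0.35, "keyword": 0.9, "importance": 0.5, "recency": 0.5}
dense_chunk = {"semantic": 0.8, "keyword": 0.2, "importance": 0.5, "recency": 0.5}

factual_q = "What is the name of the project?"
conceptual_q = "Why does semantic drift happen?"

print(detect_query_type(factual_q), detect_query_type(conceptual_q))
print(adaptive_score(factual_q, sparse_fact) > adaptive_score(factual_q, dense_chunk))
print(adaptive_score(conceptual_q, sparse_fact) > adaptive_score(conceptual_q, dense_chunk))
```

With these numbers the sparse fact wins for the factual query and loses for the conceptual one, which is the behavior a query-type-aware policy is meant to produce.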

How to run

Run all scripts from the repository root (paths such as data/conversations.json are relative).

python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
copy .env.example .env

Set OPENAI_API_KEY in .env, then run the evaluation harness:

python src/evaluate_retrieval.py

Dataset

The lab uses a small synthetic memory dataset stored in:

data/conversations.json

Evaluation queries live in:

data/evaluation_queries.json

Each memory includes: id, text, category, source, tags, importance, timestamp.
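
As a rough guide to the record shape, the fields listed above could be modeled like this. The field types, the ISO 8601 timestamp format, the 0 to 1 importance scale, and every sample value are assumptions; check data/conversations.json for the actual encoding.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Memory:
    id: str
    text: str
    category: str
    source: str
    tags: list[str]
    importance: float    # assumed numeric, e.g. 0.0 to 1.0
    timestamp: datetime  # assumed to be stored as an ISO 8601 string


def parse_memory(raw):
    """Build a Memory from one decoded JSON object."""
    return Memory(
        id=raw["id"],
        text=raw["text"],
        category=raw["category"],
        source=raw["source"],
        tags=list(raw["tags"]),
        importance=float(raw["importance"]),
        timestamp=datetime.fromisoformat(raw["timestamp"]),
    )


# Hypothetical record; the values are invented for illustration.
sample = json.loads("""
{
  "id": "m001",
  "text": "Example memory text.",
  "category": "decision",
  "source": "chat",
  "tags": ["MeOS"],
  "importance": 0.7,
  "timestamp": "2025-01-15T10:30:00"
}
""")

memory = parse_memory(sample)
print(memory.id, memory.tags, memory.timestamp.date())
```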

Scripts

| Script | Purpose |
| --- | --- |
| src/retrieval_test.py | Pure semantic retrieval (cosine similarity) |
| src/summary_test.py | Single compressed summary vs. multiple queries |
| src/ranking_test.py | Semantic similarity + importance |
| src/recency_ranking_test.py | Adds recency to the ranking mix |
| src/evaluate_retrieval.py | Hit@5 and Recall@5 over evaluation_queries.json |
| src/hybrid_retrieval_test.py | Semantic + keyword overlap + multi-signal ranking |
| src/query_type_test.py | Demo: factual / planning / conceptual classification |
| src/adaptive_retrieval_test.py | Query-type-aware ranking weights |

Next experiments

  • Compare ranking weight grids systematically.
  • Integrate adaptive policy into evaluate_retrieval.py.
  • Test noisy context injection.
  • Test whether too many retrieved memories reduce answer quality.
  • Add metadata / entity index for factual lookup (q002).

Notes

Detailed write-ups: notes/observations.md (observations 1–9) and notes/failures.md (failure cases 1–5).
