A graph-first long-term memory system for AI agents.
74.2% on LongMemEval (n=500, GPT-4o judge, standard methodology) -- no benchmark-specific tuning, no hardcoded answers.
Supermemory 81.6% ████████████████████████████████████████
Us 74.2% █████████████████████████████████████
Zep 71.2% ███████████████████████████████████
Mem0 49.0% ████████████████████████
A memory system that lets AI agents remember conversations across sessions. Built on Cognee (knowledge graph + relational storage), with typed retrieval strategies, temporal reasoning, and a rigorous evaluation methodology.
This is not a wrapper around vector search. The system uses graph traversal as the primary retrieval mechanism, with BM25 keyword search and raw conversation fallback as supplementary sources. Each query is classified by type (knowledge update, temporal reasoning, multi-session counting, preference, etc.) and routed to a specialized retrieval and answering pipeline.
| Category | Score | What It Tests |
|---|---|---|
| Single-session user (SSU) | 98.6% | Remembering what the user said in a session |
| Knowledge update (KU) | 80.8% | Tracking facts that change over time |
| Multi-session (MS) | 69.2% | Counting and aggregating across conversations |
| Temporal reasoning (TR) | 66.9% | Ordering events, computing durations |
| Single-session assistant (SSA) | 82.1% | Recalling what the assistant said |
| Single-session preference (SSP) | 40.0% | Inferring user preferences for recommendations |
| Overall | 74.2% | |
Evaluated on LongMemEval (500 questions, 6 categories, LLM-as-judge with GPT-4o).
Conversation Turn
|
v
[Extraction] ── GPT-4o-mini via OpenRouter
| Extracts typed facts: EntityState, Preference,
| Event, Fact, Constraint, Decision, Commitment
v
[Graph Store] ── Cognee (Kuzu graph + SQLite)
| Supersession: new facts replace old ones
| Entity-attribute indexing
| BM25 full-text search (FTS5)
v
[Query Classification] ── LLM-based (GPT-4o)
| 6 types: temporal, count, knowledge_update,
| preference, assistant, default
v
[Retrieval] ── Multi-strategy for count/KU/default families:
| BM25 primary + entity-centric + temporal timeline
| + raw observation fallback
| Single-strategy for temporal/assistant/preference:
| Specialized pipelines with type-specific logic
v
[Answer Generation] ── 2-prompt ensemble (family + default)
| Tiebreaker voting for count disagreements
| Focused extraction LLM for temporal ordering
| Typed reducers for inventory/schedule queries
v
[Final Gate] ── Abstention detection, numeric arbitration
|
v
Answer
Graph-first retrieval. Most memory systems embed conversations and do cosine similarity. We extract structured facts into a knowledge graph and traverse entity relationships. This handles multi-hop queries ("What restaurant did Sarah recommend when we discussed Italian food?") that vector search misses.
Typed query classification. "How many plants did I buy?" and "How many days ago did I buy plants?" require completely different retrieval strategies. The system classifies queries into 6 types and routes each to a specialized pipeline.
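As a rough sketch of the routing idea, here is a keyword-rule stand-in for the classifier. The real system uses GPT-4o for this step, not regexes, and these rules are illustrative only; the six type names mirror the ones listed above.

```python
import re

def classify(query: str) -> str:
    """Keyword-heuristic stand-in for the LLM query classifier (illustrative)."""
    q = query.lower()
    if re.search(r"\bhow many days\b|\bhow long ago\b|\bwhich came first\b", q):
        return "temporal"
    if re.search(r"\bhow many\b|\bcount\b", q):
        return "count"
    if re.search(r"\bnow\b|\bcurrent\b|\blatest\b", q):
        return "knowledge_update"
    if re.search(r"\brecommend\b|\bprefer\b", q):
        return "preference"
    if re.search(r"\bdid you (say|suggest|tell)\b", q):
        return "assistant"
    return "default"

print(classify("How many plants did I buy?"))           # count
print(classify("How many days ago did I buy plants?"))  # temporal
```

Note the rule ordering: the temporal patterns are checked before the generic "how many", which is exactly the distinction the two example queries above require.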
Honest evaluation. We benchmark on LongMemEval with n=500 examples, the standard GPT-4o judge, and per-category failure analysis. No benchmark-specific code, no hardcoded entity names, no memorized answers. We previously had 18 handwritten reducers for specific benchmark questions -- we replaced them all with a single generic LLM-based reducer and the score barely moved (73.2% -> 74.2%), strong evidence that the architecture generalizes.
Documented failures. We track what doesn't work. Temporal reasoning (66.9%) is our weakest area because conversation timestamps represent when things were discussed, not when events happened. Multi-session counting (69.2%) struggles when items are scattered across many conversations. We publish these numbers because memory systems that only report top-line accuracy are hiding their weaknesses.
We went from 0% to 74.2% over 10 optimization cycles. Here's what we learned.
The trajectory:
| Score | What Changed |
|---|---|
| 0% | No memory (control) |
| 37.4% | Graph extraction + typed schemas |
| 49.0% | Added vector embeddings |
| 45.8% | BM25 + GPT-4o answering |
| 66.2% | 8 cycles of retrieval + prompt fixes |
| 73.2% | Selective multi-strategy retrieval |
| 74.2% | Generic typed reducers + code quality hardening |
What worked:
- GPT-4o as the answer model was the single highest-impact change (~+25pp)
- Two-stage retrieval: find the domain first, then the specific answer
- Selective multi-strategy: broader retrieval for categories that benefit, specialized paths for those that don't
- 2-prompt ensemble: run both a family-specific and a generic prompt, arbitrate when they disagree
- Typed MS reducers: deterministic inventory/schedule classification before LLM counting
What didn't work:
- Removing query classification from retrieval (-1.6pp, TR/KU/SSA regressed)
- BM25 status filtering (-10.7pp on SSA, nodes that SSA queries need were filtered out)
- Fuzzy attribute matching for supersession (too aggressive, false matches)
- Time-window filtering at query time (session dates are not event dates)
The system uses generic, domain-agnostic reducers -- no hardcoded entity names, no benchmark-specific routing. The architecture generalizes to any conversation domain.
# Clone and set up
git clone https://github.com/illegalcall/kairn.git
cd kairn
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Set your API key
export OPENROUTER_API_KEY=your_key_here
export EXTRACTOR_PROVIDER=openrouter
export ANSWER_MODEL=openai/gpt-4o
# Run the evaluation
PYTHONPATH=src python eval/parallel_benchmark.py \
--variant I --judge llm --workers 8

src/
memory/
engine.py # Core memory engine (retrieval + answering)
schemas.py # Typed memory schemas + supersession
conflicts.py # Conflict detection layer
graph_store.py # Kuzu + SQLite storage
extractor.py # LLM-based fact extraction
embedder.py # Embedding utilities
storage/
db.py # SQLite schema + initialization
observations.py # Raw conversation storage
turns.py # Conversation turn tracking
node_index.py # BM25 full-text search
interface.py # ContextPack data model
eval/
parallel_benchmark.py # Parallel LongMemEval runner (12 workers)
longmemeval_runner.py # Single-example evaluation harness
beam_adapter.py # BEAM benchmark adapter
judge.py # LLM-as-judge scoring
tests/ # Unit and regression tests
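To make the BM25 layer concrete: SQLite's FTS5 extension ships a built-in `bm25()` ranking function, so keyword search over node text needs no external search engine. The sketch below is a hypothetical miniature of what `node_index.py` does; the table name, column names, and sample rows are illustrative.

```python
import sqlite3

# An FTS5 virtual table over node text, queried with MATCH and
# ordered by FTS5's built-in bm25() relevance score (lower = better).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE node_fts USING fts5(node_id, text)")
conn.executemany(
    "INSERT INTO node_fts (node_id, text) VALUES (?, ?)",
    [
        ("n1", "Sarah recommended an Italian restaurant downtown"),
        ("n2", "The user bought three succulent plants in March"),
        ("n3", "The assistant suggested a hiking trail near Boulder"),
    ],
)
rows = conn.execute(
    "SELECT node_id FROM node_fts WHERE node_fts MATCH ? ORDER BY bm25(node_fts) LIMIT 5",
    ("italian restaurant",),
).fetchall()
print(rows)  # [('n1',)]
```

Space-separated terms in an FTS5 MATCH query are implicitly ANDed, so only nodes containing both "italian" and "restaurant" are returned.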
The evaluation uses LongMemEval, which tests whether a memory system can answer questions about information scattered across multiple conversation sessions.
Each example has:
- A haystack of 5-25 conversation sessions (~115K tokens)
- A question that requires remembering specific details
- A gold answer
The system ingests all sessions, then answers the question. An LLM judge (GPT-4o) compares the prediction to the gold answer using question-type-specific prompts from the original LongMemEval paper.
We run all 500 examples with 12 parallel workers. Each worker gets its own isolated storage directory to prevent cross-contamination.
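The isolation scheme amounts to giving every worker its own storage root. The directory layout below is hypothetical, but the idea matches the description: separate graph and SQLite paths per worker so parallel runs never share state.

```python
import tempfile
from pathlib import Path

def worker_storage_dir(base: Path, worker_id: int) -> Path:
    """Create an isolated storage root for one eval worker
    (subdirectory names are illustrative)."""
    root = base / f"worker_{worker_id:02d}"
    (root / "graph").mkdir(parents=True, exist_ok=True)
    (root / "sqlite").mkdir(parents=True, exist_ok=True)
    return root

base = Path(tempfile.mkdtemp())
dirs = [worker_storage_dir(base, i) for i in range(12)]
print(len(set(dirs)))  # 12 distinct roots, one per worker
```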
As of March 2026, published LongMemEval scores:
| System | Score | Judge | Open Source | Notes |
|---|---|---|---|---|
| Chronos | 95.6% | Opus (non-standard) | No code | Paper only, Papers with Code listing |
| Mastra | 94.9% | GPT-5-mini (non-standard) | Partial | 84.2% with GPT-4o |
| Supermemory | 81.6% | GPT-4o (standard) | Partial | Core engine is proprietary |
| Kairn | 74.2% | GPT-4o (standard) | Yes | Fully open, generic architecture |
| Zep | 71.2% | GPT-4o | Yes | Oracle mode |
| Mem0 | 49.0% | GPT-5-mini (non-standard) | Yes | Third-party measured (arXiv 2603.04814) |
Note: Most top systems use non-standard judges, making their scores not directly comparable. Our 74.2% uses the standard GPT-4o judge specified in the original LongMemEval paper.
- Temporal reasoning (66.9%): Conversation timestamps represent when events were discussed, not when they happened. Ordering queries ("which came first?") fail when both events are discussed in the same session. Fixing this requires event-time extraction at ingestion, which is a research problem.
- SSP (40.0%): Preference inference is dominated by LLM judge stochasticity. The same predictions score differently across runs.
- Cost: Each evaluation example requires multiple LLM calls (extraction + classification + answering). Full 500-example eval costs ~$15-20 in API credits.
- No MCP server yet: The system runs as a Python library. MCP server integration is planned.
MIT