A graph-first long-term memory system for AI agents.
74.2% on LongMemEval (n=500, GPT-4o judge, standard methodology) -- no benchmark-specific tuning, no hardcoded answers.
Supermemory 81.6% ████████████████████████████████████████
Us 74.2% █████████████████████████████████████
Zep 71.2% ███████████████████████████████████
Mem0 49.0% ████████████████████████
A memory system that lets AI agents remember conversations across sessions. Built on Cognee (knowledge graph + relational storage), with typed retrieval strategies, temporal reasoning, and a rigorous evaluation methodology.
This is not a wrapper around vector search. The system uses graph traversal as the primary retrieval mechanism, with BM25 keyword search and raw conversation fallback as supplementary sources. Each query is classified by type (knowledge update, temporal reasoning, multi-session counting, preference, etc.) and routed to a specialized retrieval and answering pipeline.
| Category | Score | What It Tests |
|---|---|---|
| Single-session user (SSU) | 98.6% | Remembering what the user said in a session |
| Knowledge update (KU) | 80.8% | Tracking facts that change over time |
| Multi-session (MS) | 69.2% | Counting and aggregating across conversations |
| Temporal reasoning (TR) | 66.9% | Ordering events, computing durations |
| Single-session assistant (SSA) | 82.1% | Recalling what the assistant said |
| Single-session preference (SSP) | 40.0% | Inferring user preferences for recommendations |
| Overall | 74.2% | |
Evaluated on LongMemEval (500 questions, 6 categories, LLM-as-judge with GPT-4o).
Conversation Turn
|
v
[Extraction] ── GPT-4o-mini via OpenRouter
| Extracts typed facts: EntityState, Preference,
| Event, Fact, Constraint, Decision, Commitment
v
[Graph Store] ── Cognee (Kuzu graph + SQLite)
| Supersession: new facts replace old ones
| Entity-attribute indexing
| BM25 full-text search (FTS5)
v
[Query Classification] ── LLM-based (GPT-4o)
| 6 types: temporal, count, knowledge_update,
| preference, assistant, default
v
[Retrieval] ── Multi-strategy for count/KU/default families:
| BM25 primary + entity-centric + temporal timeline
| + raw observation fallback
| Single-strategy for temporal/assistant/preference:
| Specialized pipelines with type-specific logic
v
[Answer Generation] ── 2-prompt ensemble (family + default)
| Tiebreaker voting for count disagreements
| Focused extraction LLM for temporal ordering
| Typed reducers for inventory/schedule queries
v
[Final Gate] ── Abstention detection, numeric arbitration
|
v
Answer
Graph-first retrieval. Most memory systems embed conversations and do cosine similarity. We extract structured facts into a knowledge graph and traverse entity relationships. This handles multi-hop queries ("What restaurant did Sarah recommend when we discussed Italian food?") that vector search misses.
Typed query classification. "How many plants did I buy?" and "How many days ago did I buy plants?" require completely different retrieval strategies. The system classifies queries into 6 types and routes each to a specialized pipeline.
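As a rough sketch of the routing idea, here is a keyword-rule stand-in for the classifier. The real system uses GPT-4o for this step, not regexes, and these rules are illustrative only; the six type names mirror the ones listed above.

```python
import re

def classify(query: str) -> str:
    """Keyword-heuristic stand-in for the LLM query classifier (illustrative)."""
    q = query.lower()
    if re.search(r"\bhow many days\b|\bhow long ago\b|\bwhich came first\b", q):
        return "temporal"
    if re.search(r"\bhow many\b|\bcount\b", q):
        return "count"
    if re.search(r"\bnow\b|\bcurrent\b|\blatest\b", q):
        return "knowledge_update"
    if re.search(r"\brecommend\b|\bprefer\b", q):
        return "preference"
    if re.search(r"\bdid you (say|suggest|tell)\b", q):
        return "assistant"
    return "default"

print(classify("How many plants did I buy?"))           # count
print(classify("How many days ago did I buy plants?"))  # temporal
```

Note the rule ordering: the temporal patterns are checked before the generic "how many", which is exactly the distinction the two example queries above require.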
Honest evaluation. We benchmark on LongMemEval with n=500 examples, the standard GPT-4o judge, and per-category failure analysis. No benchmark-specific code, no hardcoded entity names, no memorized answers. We previously had 18 handwritten reducers for specific benchmark questions -- we replaced them all with a single generic LLM-based reducer and the score barely moved (73.2% -> 74.2%), strong evidence that the architecture generalizes.
Documented failures. We track what doesn't work. Temporal reasoning (66.9%) is our weakest area because conversation timestamps represent when things were discussed, not when events happened. Multi-session counting (69.2%) struggles when items are scattered across many conversations. We publish these numbers because memory systems that only report top-line accuracy are hiding their weaknesses.
We went from 0% to 74.2% over 10 optimization cycles. Here's what we learned.
The trajectory:
| Score | What Changed |
|---|---|
| 0% | No memory (control) |
| 37.4% | Graph extraction + typed schemas |
| 49.0% | Added vector embeddings |
| 45.8% | BM25 + GPT-4o answering |
| 66.2% | 8 cycles of retrieval + prompt fixes |
| 73.2% | Selective multi-strategy retrieval |
| 74.2% | Generic typed reducers + code quality hardening |
What worked:
- GPT-4o as the answer model was the single highest-impact change (~+25pp)
- Two-stage retrieval: find the domain first, then the specific answer
- Selective multi-strategy: broader retrieval for categories that benefit, specialized paths for those that don't
- 2-prompt ensemble: run both a family-specific and a generic prompt, arbitrate when they disagree
- Typed MS reducers: deterministic inventory/schedule classification before LLM counting
What didn't work:
- Removing query classification from retrieval (-1.6pp, TR/KU/SSA regressed)
- BM25 status filtering (-10.7pp on SSA, nodes that SSA queries need were filtered out)
- Fuzzy attribute matching for supersession (too aggressive, false matches)
- Time-window filtering at query time (session dates are not event dates)
The system uses generic, domain-agnostic reducers -- no hardcoded entity names, no benchmark-specific routing. The architecture generalizes to any conversation domain.
# Clone and set up
git clone https://github.com/illegalcall/kairn.git
cd kairn
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Set your API key
export OPENROUTER_API_KEY=your_key_here
export EXTRACTOR_PROVIDER=openrouter
export ANSWER_MODEL=openai/gpt-4o
# Run the evaluation
PYTHONPATH=src python eval/parallel_benchmark.py \
--variant I --judge llm --workers 8

src/
memory/
engine.py # Core memory engine (retrieval + answering)
schemas.py # Typed memory schemas + supersession
conflicts.py # Conflict detection layer
graph_store.py # Kuzu + SQLite storage
extractor.py # LLM-based fact extraction
embedder.py # Embedding utilities
storage/
db.py # SQLite schema + initialization
observations.py # Raw conversation storage
turns.py # Conversation turn tracking
node_index.py # BM25 full-text search
interface.py # ContextPack data model
eval/
parallel_benchmark.py # Parallel LongMemEval runner (12 workers)
longmemeval_runner.py # Single-example evaluation harness
beam_adapter.py # BEAM benchmark adapter
judge.py # LLM-as-judge scoring
tests/ # Unit and regression tests
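To make the BM25 layer concrete: SQLite's FTS5 extension ships a built-in `bm25()` ranking function, so keyword search over node text needs no external search engine. The sketch below is a hypothetical miniature of what `node_index.py` does; the table name, column names, and sample rows are illustrative.

```python
import sqlite3

# An FTS5 virtual table over node text, queried with MATCH and
# ordered by FTS5's built-in bm25() relevance score (lower = better).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE node_fts USING fts5(node_id, text)")
conn.executemany(
    "INSERT INTO node_fts (node_id, text) VALUES (?, ?)",
    [
        ("n1", "Sarah recommended an Italian restaurant downtown"),
        ("n2", "The user bought three succulent plants in March"),
        ("n3", "The assistant suggested a hiking trail near Boulder"),
    ],
)
rows = conn.execute(
    "SELECT node_id FROM node_fts WHERE node_fts MATCH ? ORDER BY bm25(node_fts) LIMIT 5",
    ("italian restaurant",),
).fetchall()
print(rows)  # [('n1',)]
```

Space-separated terms in an FTS5 MATCH query are implicitly ANDed, so only nodes containing both "italian" and "restaurant" are returned.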
The evaluation uses LongMemEval, which tests whether a memory system can answer questions about information scattered across multiple conversation sessions.
Each example has:
- A haystack of 5-25 conversation sessions (~115K tokens)
- A question that requires remembering specific details
- A gold answer
The system ingests all sessions, then answers the question. An LLM judge (GPT-4o) compares the prediction to the gold answer using question-type-specific prompts from the original LongMemEval paper.
We run all 500 examples with 12 parallel workers. Each worker gets its own isolated storage directory to prevent cross-contamination.
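The isolation scheme amounts to giving every worker its own storage root. The directory layout below is hypothetical, but the idea matches the description: separate graph and SQLite paths per worker so parallel runs never share state.

```python
import tempfile
from pathlib import Path

def worker_storage_dir(base: Path, worker_id: int) -> Path:
    """Create an isolated storage root for one eval worker
    (subdirectory names are illustrative)."""
    root = base / f"worker_{worker_id:02d}"
    (root / "graph").mkdir(parents=True, exist_ok=True)
    (root / "sqlite").mkdir(parents=True, exist_ok=True)
    return root

base = Path(tempfile.mkdtemp())
dirs = [worker_storage_dir(base, i) for i in range(12)]
print(len(set(dirs)))  # 12 distinct roots, one per worker
```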
As of March 2026, published LongMemEval scores:
| System | Score | Judge | Open Source | Notes |
|---|---|---|---|---|
| Chronos | 95.6% | Opus (non-standard) | No code | Paper only, Papers with Code listing |
| Mastra | 94.9% | GPT-5-mini (non-standard) | Partial | 84.2% with GPT-4o |
| Supermemory | 81.6% | GPT-4o (standard) | Partial | Core engine is proprietary |
| Kairn | 74.2% | GPT-4o (standard) | Yes | Fully open, generic architecture |
| Zep | 71.2% | GPT-4o | Yes | Oracle mode |
| Mem0 | 49.0% | GPT-5-mini (non-standard) | Yes | Third-party measured (arXiv 2603.04814) |
Note: Most top systems use non-standard judges, making their scores not directly comparable. Our 74.2% uses the standard GPT-4o judge specified in the original LongMemEval paper.
- Temporal reasoning (66.9%): Conversation timestamps represent when events were discussed, not when they happened. Ordering queries ("which came first?") fail when both events are discussed in the same session. Fixing this requires event-time extraction at ingestion, which is a research problem.
- SSP (40.0%): Preference inference is dominated by LLM judge stochasticity. The same predictions score differently across runs.
- Cost: Each evaluation example requires multiple LLM calls (extraction + classification + answering). Full 500-example eval costs ~$15-20 in API credits.
- No MCP server yet: The system runs as a Python library. MCP server integration is planned.
MIT