Kairn

A graph-first long-term memory system for AI agents.

74.2% on LongMemEval (n=500, GPT-4o judge, standard methodology) -- no benchmark-specific tuning, no hardcoded answers.

Supermemory  81.6%  ████████████████████████████████████████
Kairn        74.2%  █████████████████████████████████████
Zep          71.2%  ███████████████████████████████████
Mem0         49.0%  ████████████████████████

What This Is

A memory system that lets AI agents remember conversations across sessions. Built on Cognee (knowledge graph + relational storage), with typed retrieval strategies, temporal reasoning, and a rigorous evaluation methodology.

This is not a wrapper around vector search. The system uses graph traversal as the primary retrieval mechanism, with BM25 keyword search and raw conversation fallback as supplementary sources. Each query is classified by type (knowledge update, temporal reasoning, multi-session counting, preference, etc.) and routed to a specialized retrieval and answering pipeline.
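The BM25 layer mentioned above is a full-text index over extracted facts (the architecture section below names SQLite FTS5). A minimal sketch of that idea, with an illustrative table and made-up fact strings rather than Kairn's actual schema:

```python
import sqlite3

# Toy BM25 keyword search over extracted facts using SQLite FTS5.
# Table layout and contents are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE facts USING fts5(entity, content)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("Sarah", "Sarah recommended Trattoria Roma for Italian food"),
        ("user", "user bought three succulent plants on Tuesday"),
        ("user", "user adopted a cat named Miso"),
    ],
)

def bm25_search(query: str, k: int = 3) -> list[str]:
    """Return the top-k fact strings ranked by FTS5's built-in bm25()."""
    rows = conn.execute(
        "SELECT content FROM facts WHERE facts MATCH ? ORDER BY bm25(facts) LIMIT ?",
        (query, k),
    ).fetchall()
    return [r[0] for r in rows]

print(bm25_search("plants"))
```

Note that FTS5's bm25() returns lower-is-better scores, so ascending order gives the best matches first.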

Key Results

Category                          Score   What It Tests
Single-session user (SSU)         98.6%   Remembering what the user said in a session
Knowledge update (KU)             80.8%   Tracking facts that change over time
Multi-session (MS)                69.2%   Counting and aggregating across conversations
Temporal reasoning (TR)           66.9%   Ordering events, computing durations
Single-session assistant (SSA)    82.1%   Recalling what the assistant said
Single-session preference (SSP)   40.0%   Inferring user preferences for recommendations
Overall                           74.2%

Evaluated on LongMemEval (500 questions, 6 categories, LLM-as-judge with GPT-4o).

Architecture

Conversation Turn
       |
       v
  [Extraction]  ── GPT-4o-mini via OpenRouter
       |             Extracts typed facts: EntityState, Preference,
       |             Event, Fact, Constraint, Decision, Commitment
       v
  [Graph Store]  ── Cognee (Kuzu graph + SQLite)
       |             Supersession: new facts replace old ones
       |             Entity-attribute indexing
       |             BM25 full-text search (FTS5)
       v
  [Query Classification]  ── LLM-based (GPT-4o)
       |                      6 types: temporal, count, knowledge_update,
       |                      preference, assistant, default
       v
  [Retrieval]  ── Multi-strategy for count/KU/default families:
       |           BM25 primary + entity-centric + temporal timeline
       |           + raw observation fallback
       |         Single-strategy for temporal/assistant/preference:
       |           Specialized pipelines with type-specific logic
       v
  [Answer Generation]  ── 2-prompt ensemble (family + default)
       |                   Tiebreaker voting for count disagreements
       |                   Focused extraction LLM for temporal ordering
       |                   Typed reducers for inventory/schedule queries
       v
  [Final Gate]  ── Abstention detection, numeric arbitration
       |
       v
     Answer

What Makes This Different

Graph-first retrieval. Most memory systems embed conversations and do cosine similarity. We extract structured facts into a knowledge graph and traverse entity relationships. This handles multi-hop queries ("What restaurant did Sarah recommend when we discussed Italian food?") that vector search misses.
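A toy version of that multi-hop traversal, using a hand-built adjacency list (entities, relations, and values are invented for illustration; the real system traverses a Kuzu graph):

```python
# Made-up knowledge graph: entity -> [(relation, neighbor), ...]
graph = {
    "Sarah": [("recommended", "Trattoria Roma")],
    "Trattoria Roma": [("cuisine", "Italian"), ("city", "Portland")],
}

def two_hop(entity: str) -> list[tuple[str, str, str]]:
    """Collect (neighbor, relation, value) triples reachable in two hops."""
    results = []
    for _rel, mid in graph.get(entity, []):
        for rel2, leaf in graph.get(mid, []):
            results.append((mid, rel2, leaf))
    return results

print(two_hop("Sarah"))
```

A query like "the Italian restaurant Sarah recommended" is answered by filtering these triples for cuisine == "Italian" -- something a single cosine-similarity lookup over flat embeddings cannot do reliably.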

Typed query classification. "How many plants did I buy?" and "How many days ago did I buy plants?" require completely different retrieval strategies. The system classifies queries into 6 types and routes each to a specialized pipeline.
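To make the routing concrete, here is a deliberately crude keyword stand-in for the classifier. The real system uses an LLM (GPT-4o); these rules are invented purely to show why the two plant questions land in different pipelines:

```python
def classify(question: str) -> str:
    """Crude keyword stand-in for the LLM query classifier, illustrating
    how near-identical questions route to different retrieval families."""
    q = question.lower()
    if "how many days" in q or "which came first" in q:
        return "temporal"      # durations and ordering
    if "how many" in q:
        return "count"         # multi-session aggregation
    if "recommend" in q:
        return "preference"    # recommendation-style inference
    return "default"

print(classify("How many plants did I buy?"))           # count
print(classify("How many days ago did I buy plants?"))  # temporal
```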

Honest evaluation. We benchmark on LongMemEval with n=500 examples, the standard GPT-4o judge, and per-category failure analysis. No benchmark-specific code, no hardcoded entity names, no memorized answers. We previously had 18 handwritten reducers for specific benchmark questions -- we replaced them all with a single generic LLM-based reducer and the score actually improved (73.2% -> 74.2%), strong evidence that the architecture generalizes.

Documented failures. We track what doesn't work. Temporal reasoning (66.9%) is our weakest area because conversation timestamps represent when things were discussed, not when events happened. Multi-session counting (69.2%) struggles when items are scattered across many conversations. We publish these numbers because memory systems that only report top-line accuracy are hiding their weaknesses.

The Optimization Journey

We went from 0% to 74.2% over 10 optimization cycles. Here's what we learned.

The trajectory:

Score   What Changed
0%      No memory (control)
37.4%   Graph extraction + typed schemas
49.0%   Added vector embeddings
45.8%   BM25 + GPT-4o answering
66.2%   8 cycles of retrieval + prompt fixes
73.2%   Selective multi-strategy retrieval
74.2%   Generic typed reducers + code quality hardening

What worked:

  • GPT-4o as the answer model was the single highest-impact change (~+25pp)
  • Two-stage retrieval: find the domain first, then the specific answer
  • Selective multi-strategy: broader retrieval for categories that benefit, specialized paths for those that don't
  • 2-prompt ensemble: run both a family-specific and a generic prompt, arbitrate when they disagree
  • Typed MS reducers: deterministic inventory/schedule classification before LLM counting

What didn't work:

  • Removing query classification from retrieval (-1.6pp, TR/KU/SSA regressed)
  • BM25 status filtering (-10.7pp on SSA, nodes that SSA queries need were filtered out)
  • Fuzzy attribute matching for supersession (too aggressive, false matches)
  • Time-window filtering at query time (session dates are not event dates)

The system uses generic, domain-agnostic reducers -- no hardcoded entity names, no benchmark-specific routing. The architecture generalizes to any conversation domain.
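As a sketch of what "generic reducer" means here -- assuming a hypothetical event shape, not Kairn's actual fact types -- an inventory reducer folds acquire/remove events scattered across sessions into a net count with no entity-specific logic:

```python
def reduce_inventory(events: list[tuple[str, int]]) -> int:
    """Domain-agnostic inventory fold: sum signed quantities from
    (kind, count) events, regardless of what the items are."""
    delta = {"acquired": 1, "removed": -1}
    return sum(delta[kind] * n for kind, n in events)

# e.g. plants mentioned across three sessions:
events = [("acquired", 2), ("acquired", 3), ("removed", 1)]
print(reduce_inventory(events))  # -> 4
```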

Quick Start

# Clone and set up
git clone https://github.com/illegalcall/kairn.git
cd kairn
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Set your API key
export OPENROUTER_API_KEY=your_key_here
export EXTRACTOR_PROVIDER=openrouter
export ANSWER_MODEL=openai/gpt-4o

# Run the evaluation
PYTHONPATH=src python eval/parallel_benchmark.py \
  --variant I --judge llm --workers 8

Project Structure

src/
  memory/
    engine.py          # Core memory engine (retrieval + answering)
    schemas.py         # Typed memory schemas + supersession
    conflicts.py       # Conflict detection layer
    graph_store.py     # Kuzu + SQLite storage
    extractor.py       # LLM-based fact extraction
    embedder.py        # Embedding utilities
  storage/
    db.py              # SQLite schema + initialization
    observations.py    # Raw conversation storage
    turns.py           # Conversation turn tracking
    node_index.py      # BM25 full-text search
  interface.py         # ContextPack data model
eval/
  parallel_benchmark.py  # Parallel LongMemEval runner (12 workers)
  longmemeval_runner.py  # Single-example evaluation harness
  beam_adapter.py        # BEAM benchmark adapter
  judge.py               # LLM-as-judge scoring
tests/                   # Unit and regression tests

How the Eval Works

The evaluation uses LongMemEval, which tests whether a memory system can answer questions about information scattered across multiple conversation sessions.

Each example has:

  • A haystack of 5-25 conversation sessions (~115K tokens)
  • A question that requires remembering specific details
  • A gold answer

The system ingests all sessions, then answers the question. An LLM judge (GPT-4o) compares the prediction to the gold answer using question-type-specific prompts from the original LongMemEval paper.

We run all 500 examples with 12 parallel workers. Each worker gets its own isolated storage directory to prevent cross-contamination.
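The isolation scheme can be sketched as follows, with a thread-pool stand-in for the real harness (run_example is hypothetical; the point is only the per-worker storage directory):

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_example(example_id: int, base_dir: str) -> str:
    """Stand-in for the per-example harness: each worker writes to its
    own directory, so no two examples share graph or SQLite state."""
    workdir = Path(base_dir) / f"worker_{example_id}"
    workdir.mkdir(parents=True, exist_ok=True)
    (workdir / "memory.db").touch()  # stand-in for per-worker storage files
    return str(workdir)

with tempfile.TemporaryDirectory() as base:
    with ThreadPoolExecutor(max_workers=4) as pool:
        dirs = list(pool.map(lambda i: run_example(i, base), range(8)))

print(len(set(dirs)))  # 8 distinct, isolated directories
```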

Competitive Landscape

As of March 2026, published LongMemEval scores:

System       Score   Judge                       Open Source   Notes
Chronos      95.6%   Opus (non-standard)         No code       Paper only, PwC
Mastra       94.9%   GPT-5-mini (non-standard)   Partial       84.2% with GPT-4o
Supermemory  81.6%   GPT-4o (standard)           Partial       Core engine is proprietary
Kairn        74.2%   GPT-4o (standard)           Yes           Fully open, generic architecture
Zep          71.2%   GPT-4o                      Yes           Oracle mode
Mem0         49.0%   GPT-5-mini (non-standard)   Yes           Third-party measured (arXiv 2603.04814)

Note: Most top systems use non-standard judges, making their scores not directly comparable. Our 74.2% uses the standard GPT-4o judge specified in the original LongMemEval paper.

Known Limitations

  • Temporal reasoning (66.9%): Conversation timestamps represent when events were discussed, not when they happened. Ordering queries ("which came first?") fail when both events are discussed in the same session. Fixing this requires event-time extraction at ingestion, which is a research problem.
  • SSP (40.0%): Preference inference is dominated by LLM judge stochasticity. The same predictions score differently across runs.
  • Cost: Each evaluation example requires multiple LLM calls (extraction + classification + answering). Full 500-example eval costs ~$15-20 in API credits.
  • No MCP server yet: The system runs as a Python library. MCP server integration is planned.

License

MIT
