ClickHouse · zlareb1 · Jun 5, 2026 · Jun 5, 2026 · Jun 5, 2026 · Jun 18, 2026
diff --git a/loom/README.md b/loom/README.md
@@ -0,0 +1,67 @@
+# Benchmarking Loom on LongMemEval
+
+[Loom](https://github.com/ClickHouse/loom) is a ClickHouse-backed memory service.
+This integration plugs it into LongMemEval at the **indexing + retrieval** stages,
+reads with the official reader prompt, and grades QA accuracy with the repo's
+semantic judge (`src/evaluation/evaluate_qa.py`).
+
+| stage | who does it |
+|-------|-------------|
+| indexing + retrieval | **Loom** (`loom/run_loom.py` — ingest via `memory.set_from_messages`, retrieve via `memory.search`) |
+| reading (answer generation) | the official `src/generation/run_generation.py` prompt, replicated in `run_loom.py` (facts variant, step-by-step) |
+| judging | `src/evaluation/evaluate_qa.py` — semantic judge (gpt-4o), per question type |
+
+Only the ingest+retrieve stage is Loom's; the reader is the standard official one.
+(The official `src/retrieval/run_retrieval.py` is built around in-process
+retrievers — BM25 / Contriever / Stella / GTE over a flat corpus — and has no
+hook for an external memory *service*, which is why this adapter exists.)
+
+## Prerequisites
+
+1. A running Loom server and a bearer token with write access. See the
+   [Loom repo](https://github.com/ClickHouse/loom) for `make dev` and
+   `mint-token`.
+2. `OPENAI_API_KEY` in the environment (used by the reader model, default
+   `gpt-4o`, and by the judge).
+3. Install the adapter dependency (everything else is already in the repo's requirements):
+
+   ```bash
+   pip install -r loom/requirements.txt
+   ```
+
+4. The dataset (LongMemEval-S, the variant other systems report on):
+
+   ```bash
+   wget https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json -O data/longmemeval_s_cleaned.json
+   ```
+
+## Run
+
+```bash
+export LOOM_TOKEN="<your loom token>"
+export OPENAI_API_KEY="<your key>"
+
+# 1) Ingest + retrieve with Loom, generate answers with the official reader,
+#    and write a hypotheses file.
+python loom/run_loom.py \
+  --base-url http://127.0.0.1:7777 \
+  --dataset data/longmemeval_s_cleaned.json \
+  --out loom/loom_hyp.jsonl \
+  --limit 40 --shuffle   # omit --limit for the full 500; --shuffle gives a mixed sample
+
+# 2) Grade QA accuracy (gpt-4o judge, per question type).
+python src/evaluation/evaluate_qa.py gpt-4o loom/loom_hyp.jsonl data/longmemeval_s_cleaned.json
+```
+
+`run_loom.py` prints **evidence-session recall@k** (Loom's own retrieval metric);
+`evaluate_qa.py` prints the **QA accuracy** (overall + per question type).
+
+## Notes
+
+- `--search-mode rrf` (default) lets Loom's query planner self-route; it never
+  sees the gold `question_type`.
+- Indexing is one `set_from_messages` call per session (the natural unit), run
+  concurrently (`--ingest-concurrency`) because each call does LLM extraction
+  server-side; a LongMemEval-S question has ~50 sessions.
+- `run_loom.py` creates a fresh namespace per question, so questions don't leak
+  into each other.
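
Note on `loom/README.md` (outside the patch): the README says `run_loom.py` prints evidence-session recall@k but does not show how it is computed. The sketch below is one plausible way to score it, not the adapter's actual code. It assumes each retrieved memory can be mapped back to the session it came from, and that the gold sessions come from the dataset's `answer_session_ids` field; the function names `session_recall_at_k` and `aggregate` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def session_recall_at_k(
    retrieved_session_ids: Sequence[str],
    gold_session_ids: Sequence[str],
    k: int,
) -> Optional[Dict[str, bool]]:
    """Score one question; returns None when there is no gold session."""
    gold = set(gold_session_ids)
    if not gold:
        return None
    # Assumption: retrieved_session_ids is ordered best-first, one id per memory.
    top_k = set(retrieved_session_ids[:k])
    return {
        "any": bool(gold & top_k),  # at least one evidence session retrieved
        "all": gold <= top_k,       # every evidence session retrieved
    }


def aggregate(scores: List[Optional[Dict[str, bool]]]) -> Dict[str, float]:
    """Average the per-question flags, skipping unscored questions."""
    scored = [s for s in scores if s is not None]
    if not scored:
        return {"any": 0.0, "all": 0.0}
    n = len(scored)
    return {
        "any": sum(s["any"] for s in scored) / n,
        "all": sum(s["all"] for s in scored) / n,
    }
```

The `any` flag corresponds to "evidence session present in top-k" and the `all` flag to "every gold session present in top-k"; reporting both separates questions that need a single session from multi-session questions that need all of them.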
diff --git a/loom/RESULTS.md b/loom/RESULTS.md
@@ -0,0 +1,123 @@
+# Loom on LongMemEval-S — Benchmark Results
+
+[Loom](https://github.com/ClickHouse/loom) is a ClickHouse-backed memory service.
+This benchmark plugs Loom into LongMemEval-S at the **indexing + retrieval** stages;
+the *reader* (answerer) and *judge* (grader) are LLMs — the standard measurement
+apparatus, not part of Loom. The sections below report every dimension a memory
+service is judged on: accuracy (across reader/judge choices), retrieval recall,
+latency, token efficiency, and the HyDE fallback rate.
+
+## Setup
+
+- **Dataset:** LongMemEval-S, 500 questions (491 answered; the other 9 were dropped due to reader API timeouts).
+- **Indexing + retrieval:** Loom — `memory.set_from_messages` per session, then
+  `memory.search` at `top_k=200`, `search_mode=rrf`, no reranker (product default).
+- **Embeddings:** OpenAI `text-embedding-3-small`. **Extraction:** `gpt-4o-mini`.
+- **Reader / judge:** OpenAI `gpt-4o` and `gpt-5` (varied below to show their effect).
+- Single-node Loom + ClickHouse.
+
+## 1. Accuracy — reader × judge
+
+The end-to-end score depends as much on the **reader** and **judge** as on the memory.
+Full-500, identical Loom retrieval, varying only the answerer and grader:
+
+| reader ↓ \ judge → | gpt-4o judge | gpt-5 judge |
+|---|---|---|
+| **gpt-4o reader** | 84.9% | 82.2% |
+| **gpt-5 reader** | **92.9%** | **88.2%** |
+
+- **Reader effect:** gpt-4o → gpt-5 on the *same* retrieval = **+6 to +8pt**. The memory is
+  identical; the answerer is the lever.
+- **Judge effect:** gpt-5 judge is **stricter** (~3–5pt lower), almost entirely on the
+  open-ended single-session-preference rubric.
+- **gpt-5 reader + gpt-5 judge:** **88.2%** (independently reproduced at 88.4% on a second
+  full-500 run). This is the headline.
+- **Judge adjudication:** a blind 3-rater re-grade of the 23 questions where the gpt-4o and
+  gpt-5 judges disagreed found **18 cases of gpt-5 over-strictness** (mostly the preference
+  rubric) and **5 genuine errors**. Restoring those 18 (about 3.6pt) implies an accuracy near
+  **~92%**. The headline reported here stays the un-adjudicated **88.2%**.
+
+### Per-category (gpt-5 reader + gpt-5 judge)
+
+| Category | Accuracy |
+|---|---|
+| single-session-user | 98.6% |
+| single-session-assistant | 96.4% |
+| temporal-reasoning | 89.3% |
+| knowledge-update | 87.0% |
+| multi-session | 83.5% |
+| single-session-preference | 70.0% |
+
+## 2. Retrieval recall (Loom's own quality, reader-independent)
+
+| Recall metric | Loom |
+|---|---|
+| Evidence session present in top-k | **99.6%** |
+| *Every* gold session present in top-k | 97.1% |
+| Gold answer string present in a retrieved excerpt | 48.1% |
+
+Recall@200 is 99.6% — Loom surfaces a memory from the gold evidence session on nearly every
+question. This is *why* accuracy is reader/judge-dominated: the facts are in the context; the
+score is what the reader makes of them.
+
+## 3. Latency — by retrieval budget
+
+Loom's default retrieval runs LLM-in-loop work on the read path (query planning, plus a HyDE
+recall-rescue on a weak top hit). Paired A/B — one ingest, the same 99 questions, gpt-5
+reader+judge, only the retrieval budget differs:
+
+| retrieval budget | accuracy | fact recall | search p50 | search p95 |
+|---|---|---|---|---|
+| default (LLM-in-loop) | 88.9% | 47/99 | ~1,000 ms | ~5,200 ms |
+| **`fast`** (pure vector) | **90.9%** | 47/99 | **~140 ms** | ~620 ms |
+
+The `fast` budget **holds accuracy** (90.9% vs 88.9%, within n=99 noise) and **recall**
+(identical; differs on 0 questions) at **~7× lower p50 latency**. On this workload (recall
+already 99.6%) the LLM-in-loop work does not change *what* is retrieved, so it adds latency
+without benefit, and `--retrieval-budget fast` is the latency-optimal setting for high-recall
+QA workloads like this one. (Floor for a simple well-matched query is ~290 ms; the default
+path's p50 ranges ~1.0–1.9s depending on query mix and load.)
+
+## 4. Token efficiency — by top_k
+
+Context handed to the reader (median, ~4 chars/token), measured on a populated namespace:
+
+| top_k | memories served | ~tokens |
+|---|---|---|
+| 20 | 20 | ~1,927 |
+| 50 | ~48 | ~4,177 |
+| 200 | ~119–188 | ~11,290 |
+
+Token cost is a **recall/cost knob**: the 88–92% accuracy above uses `top_k=200`. Smaller `k`
+serves far less context but lowers recall and accuracy — the low token count and the high
+accuracy do not co-exist at the same `k`.
+
+## 5. HyDE fallback
+
+The HyDE recall-rescue (an LLM that rewrites a weak query to an answer-shape and re-searches)
+**fired on ~10% of queries** and, in a 60-question A/B, **changed which answer was retrieved on
+0 of them** — it fires partly on abstention/preference questions it cannot help. On a
+high-recall workload there is little to rescue, so it is mostly latency; it is left enabled
+(a knob, not removed) because it can help paraphrase-heavy or sparse-memory workloads.
+
+## How to read these numbers
+
+- **Accuracy is reader/judge-dominated, not retrieval-dominated** (recall@200 = 99.6%): the
+  facts are in the retrieved context; the score is what the reader and judge make of them.
+- **Latency and token figures are Loom's own operational measurements on this hardware.**
+  They are setup-specific (harness, hardware, read path, and tokenizer all affect them), so
+  treat them as Loom-vs-Loom (e.g. the budget comparison above), not as a portable ranking.
+
+## Reproduce
+
+```bash
+# Accuracy (gpt-5 reader + gpt-5 judge) + retrieval metrics:
+python loom/run_loom.py --base-url http://127.0.0.1:7777 \
+  --dataset data/longmemeval_s_cleaned.json \
+  --out loom/hyp.jsonl --metrics-out loom/metrics.json \
+  --top-k 200 --answer-model gpt-5 --ingest-concurrency 8 --measure-latency
+python src/evaluation/evaluate_qa.py gpt-5 loom/hyp.jsonl data/longmemeval_s_cleaned.json
+
+# Latency-optimal (fast retrieval budget):
+python loom/run_loom.py ... --retrieval-budget fast --measure-latency
+```
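
Note on `loom/RESULTS.md` (outside the patch): the reader effect, judge effect, and adjudicated accuracy quoted in section 1 can be re-derived from the reader × judge table. The sketch below only redoes that arithmetic with figures copied from the table. It rests on two assumptions that the results file does not state: that the 18 over-strict verdicts all belong to the gpt-5 reader + gpt-5 judge run, and that the denominator is the 491 answered questions. The helper names are hypothetical.

```python
# Accuracy in percent, copied from the reader x judge table; keys are (reader, judge).
ACCURACY = {
    ("gpt-4o", "gpt-4o"): 84.9,
    ("gpt-4o", "gpt-5"): 82.2,
    ("gpt-5", "gpt-4o"): 92.9,
    ("gpt-5", "gpt-5"): 88.2,
}


def reader_effect(judge: str) -> float:
    """Points gained by swapping the gpt-4o reader for gpt-5 under a fixed judge."""
    return round(ACCURACY[("gpt-5", judge)] - ACCURACY[("gpt-4o", judge)], 1)


def judge_effect(reader: str) -> float:
    """Points lost by swapping the gpt-4o judge for gpt-5 under a fixed reader."""
    return round(ACCURACY[(reader, "gpt-5")] - ACCURACY[(reader, "gpt-4o")], 1)


def adjudicated_accuracy(headline_pct: float, overturned: int, answered: int) -> float:
    """Headline accuracy after counting `overturned` strict verdicts as correct.

    Assumption: every overturned verdict was scored wrong in the headline run.
    """
    return round(headline_pct + 100.0 * overturned / answered, 1)


if __name__ == "__main__":
    print("reader effect:", reader_effect("gpt-4o"), reader_effect("gpt-5"))
    print("judge effect:", judge_effect("gpt-4o"), judge_effect("gpt-5"))
    print("adjudicated:", adjudicated_accuracy(88.2, overturned=18, answered=491))
```

Under those assumptions the reader effect comes out at +8.0 and +6.0 points, the judge effect at -2.7 and -4.7 points, and the adjudicated figure at 91.9%, consistent with the "+6 to +8pt", "~3–5pt lower", and "~92%" statements in the results.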
diff --git a/loom/requirements.txt b/loom/requirements.txt
@@ -0,0 +1,3 @@
+# Extra dependency for the Loom adapter (loom/run_loom.py).
+# The official reader/judge deps are already in the repo's requirements.
+httpx>=0.27