They look like they understand your question.
They actually just return whatever's most popular in their database.
A student named Mira would do the same on her French exam.
She'd score 80% by always picking "C". She doesn't speak French.
This is a 30-second test that catches them.
The Colab runs the actual library on a synthetic bench (60 seconds, no install).
The Playground lets you pick a strategy with sliders and watch the gate verdict update live in your browser.
The Case study shows the same gate working on a peer-reviewed BEIR benchmark.
```bash
pip install git+https://github.com/spalsh-spec/falsify-eval
```
Free. Open source. Runs on your laptop. Works on any search system.
Built for search engines, recommendation systems, the retrieval side of RAG.
Not built for the part of ChatGPT that writes paragraphs — that's a different problem we haven't built a test for.
30-second demo · The Mira test · How it works · Three surfaces · Preprint
Imagine a student named Mira who never studied. She noticed that on past exams, "C" is the most common correct answer. So she writes C every time and scores 80%. She looks smart on paper. She has zero actual knowledge — she gamed the pattern.
A retrieval or ranking system can do the same thing. If the most popular document in a corpus happens to be relevant for most queries, a system that always returns that popular document will score well on aggregate metrics — without using the query at all. (This is not a hypothetical: see the CS01 NFCorpus case study where this exact predictor scores nDCG@10 = 0.066 on a published BEIR benchmark while ignoring every query.)
The published number looks great. It does not mean what you think it means.
falsify-eval is a Mira-check for retrieval and ranking systems. It compares your system's score against four "fake students" — four null distributions, including one (Null D, the marginal-matched random) that is original to this work and that the previous standard nulls miss. If your system can't beat all four by a calibrated margin, the gate fails.
→ Case studies (real numbers, two public benchmarks):
- CS01 — NFCorpus (323 BEIR queries, dense relevance ~38 docs/query)
- CS02 — SciFact (300 BEIR queries, sparse relevance ~1.1 docs/query)
Across both: Mira and popularity-only fail at Δ_D ≈ 0; BM25 and dense MiniLM pass at Δ_D = +0.14 to +0.73. Each is reproducible in 5 minutes on an M1 laptop. Joint finding: graded metrics (nDCG) on dense-relevance benchmarks can flatten the gate, so pair them with single-gold strict metrics (recall@K against top-1).
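To make the graded-versus-strict distinction concrete, here is a minimal sketch of both metric styles. It is an illustration only, not the library's code: the function names are hypothetical, and the nDCG variant assumes linear gain (some toolkits use exponential gain instead).

```python
import math

def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(retrieved, relevance, k):
    """Graded metric: `relevance` maps doc id -> graded label (0 if absent)."""
    gains = [relevance.get(doc, 0) for doc in retrieved[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def strict_recall_at_k(retrieved, top1_gold, k):
    """Strict metric: credit only if the single best document is in the top K."""
    return 1.0 if top1_gold in retrieved[:k] else 0.0

# Toy query with several relevant documents (dense relevance).
relevance = {"d1": 2, "d2": 1, "d3": 1}
retrieved = ["d2", "d3", "x9"]  # misses the best document d1

graded = ndcg_at_k(retrieved, relevance, k=3)
strict = strict_recall_at_k(retrieved, "d1", k=3)
print(f"nDCG@3 = {graded:.3f}, strict recall@3 = {strict}")
```

In this toy case the graded metric still awards partial credit for the two weakly relevant documents, while the strict metric awards none. That partial credit is the kind of effect that can narrow the gap between a real system and a null on dense-relevance benchmarks.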
```bash
pip install git+https://github.com/spalsh-spec/falsify-eval
python -c "from falsify_eval.demo import run; run()"
```

Three systems graded on a 50-query synthetic bench:

```text
═══ constant_predictor (deliberately broken) ═══
  real mean nDCG@5 = 0.20
  Δ_A (gold-permuted)    = +0.000 ✗
  Δ_B (uniform random)   = +0.001 ✗
  Δ_C (random retrieval) = +0.18  ✓
  Δ_D (marginal-matched) = +0.000 ✗   ← the gate that catches Mira
  GATE: ✗ FAIL (correctly rejected)

═══ mock_engine (plausible retrieval, 70% top-1) ═══
  real mean nDCG@5 = 0.62
  Δ across all 4 nulls ≥ +0.40 ✓
  GATE: ✓ PASS (correctly accepted)

═══ oracle (perfect top-1) ═══
  real mean nDCG@5 = 1.00
  GATE: ✓ PASS by maximum margin
```
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {
  'fontFamily': 'Garamond, EB Garamond, Georgia, serif',
  'primaryColor': '#f3eee5',
  'primaryTextColor': '#1c1611',
  'primaryBorderColor': '#9c4a1a',
  'lineColor': '#9d8147',
  'tertiaryColor': '#faf6ed',
  'tertiaryBorderColor': '#d4c8b2',
  'edgeLabelBackground': '#f3eee5'
}}}%%
flowchart LR
    R([your retriever]) -->|top-K per query| S[real score]
    G([gold labels]) --> S
    G -->|permute π| A[Null A · label-permuted]
    G -->|iid uniform| B[Null B · uniform random]
    P([item pool]) -->|sample K| C[Null C · random retrieval]
    G -->|sample by class freq| D[Null D · marginal-matched ★]
    S --> Δ{Δ ≥ τ on<br/>all four?}
    A --> Δ
    B --> Δ
    C --> Δ
    D --> Δ
    Δ -->|yes| PASS([✓ PASS])
    Δ -->|no| FAIL([✗ FAIL])
    classDef ok fill:#eef3e8,stroke:#3d7a4a,color:#1a3d22,stroke-width:1.5px;
    classDef fail fill:#f7e9e3,stroke:#9c4a1a,color:#5a1c0c,stroke-width:1.5px;
    classDef novel fill:#fef9e7,stroke:#9d8147,color:#5a4720,stroke-width:2px;
    classDef gate fill:#f3eee5,stroke:#1c1611,color:#1c1611,stroke-width:2px;
    class PASS ok
    class FAIL fail
    class D novel
    class Δ gate
```
| Null | What it tests | Catches |
|---|---|---|
| A — gold-permuted | bijection π over class labels | systems that learned label distribution shape, not relevance |
| B — uniform random | iid uniform draw of gold per query | systems that exploit class-prior assumption |
| C — random retrieval | replace engine output with K random items from pool | systems that score by retrieval coverage, not ranking quality |
| D — marginal-matched ★ | iid draw of gold from the empirical class frequency | predictors matched to the gold marginal — new in this work |
Null D is the load-bearing contribution. It correctly rejects the constant-most-frequent predictor, which Nulls A and B can pass as a false positive. (Definition 1 of the preprint.)
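The Null D row can be illustrated with a short, self-contained sketch. This is not the library's implementation: the function names, the top-1 metric, and the toy bench are assumptions chosen to mirror the Mira story (80% of gold answers are "C"). The only idea taken from the table is that gold labels are redrawn iid from their empirical frequency.

```python
import random

def top1_hit(retrieved, gold):
    """Strict metric: 1.0 if the first retrieved item equals the gold item."""
    return 1.0 if retrieved and retrieved[0] == gold else 0.0

def mean_score(retrieved_lists, gold_list):
    hits = [top1_hit(r, g) for r, g in zip(retrieved_lists, gold_list)]
    return sum(hits) / len(hits)

def marginal_matched_golds(gold_list, rng):
    """Null D sketch: redraw each gold iid from the empirical gold frequencies."""
    return [rng.choice(gold_list) for _ in gold_list]

def delta_d(retrieved_lists, gold_list, n_trials=50, seed=0):
    """Real mean score minus the mean score under the marginal-matched null."""
    rng = random.Random(seed)
    real = mean_score(retrieved_lists, gold_list)
    null_scores = [
        mean_score(retrieved_lists, marginal_matched_golds(gold_list, rng))
        for _ in range(n_trials)
    ]
    return real - sum(null_scores) / n_trials

# Toy bench: "C" is the gold answer for 80 of 100 queries.
gold = ["C"] * 80 + ["A"] * 10 + ["B"] * 10
mira = [["C"] for _ in gold]   # always answers "C", ignores the query
oracle = [[g] for g in gold]   # always returns the true gold

print("Mira   real =", mean_score(mira, gold), " delta_D =", round(delta_d(mira, gold), 3))
print("Oracle real =", mean_score(oracle, gold), " delta_D =", round(delta_d(oracle, gold), 3))
```

Mira's raw score is 0.80, but her margin over the marginal-matched null stays near zero because the null reproduces the same popularity pattern she exploits. The oracle keeps a clear positive margin, since redrawn labels stop matching its query-specific answers.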
```python
# 1. Library
from falsify_eval import four_null_gate

result = four_null_gate(
    retrieved_lists, gold_list, rel_list, my_metric,
    item_pool=corpus_ids, k=5, n_trials=50, tau=0.05,
    progress=True,  # stderr per-stage timing
)
assert result["gate_passes"]
```

```bash
# 2. CLI on JSONL benches (no Python knowledge needed)
falsify-eval grade --input bench.jsonl --metric ndcg@5 --pool corpus.txt
falsify-eval doctor              # end-to-end install verification
falsify-eval quickstart ./demo   # writes a sample bench + pool
```

```jsonc
// 3. MCP server: Claude Code, Cursor, any MCP-compatible client
{
  "mcpServers": {
    "falsify-eval": {
      "command": "python",
      "args": ["-m", "falsify_eval.mcp_server"]
    }
  }
}
```

Claude can then call `grade_retrieval` directly on any retrieval pipeline output you give it, with no glue code and no separate scoring service.
A non-exhaustive list of failure modes the gate flags:
| Broken predictor | Δ_A | Δ_B | Δ_C | Δ_D | Gate |
|---|---|---|---|---|---|
| Constant most-frequent class | ≈ 0 | ≈ 0 | + | ≈ 0 | ✗ |
| Marginal-matched random | ≈ 0 | + | + | ≈ 0 | ✗ |
| Popularity-only ranker (no query feature) | + | + | + | small | ✗ |
| Lexical-match-only on bag-of-words | + | + | + | + | ✓ |
| Full retriever (BM25 / dense / hybrid) | + | + | + | + | ✓ |
| Full retriever on drifted corpus | varies | varies | varies | varies | ✗ via verify_state |
The first three score well on bare aggregate metrics (nDCG, MRR, recall@K). The standard reporting practice publishes those numbers. The four-null gate rejects them.
A passing gate is necessary for credible reporting, not sufficient. It does not prove:
- the engine learned the actual relevance signal (only that it learned something beyond the four trivial null classes)
- the engine generalises beyond the evaluation set
- per-feature contribution claims are significant (handled separately by `bootstrap_ci`, `paired_permutation_p`, `cohens_d_paired`)
- the bench developer didn't overfit query phrasing to engine behaviour
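The per-feature significance checks named in the list above are separate from the gate. As a rough illustration of what a paired bootstrap interval computes, here is an independent sketch. The function name, signature, and defaults are assumptions for illustration and should not be read as the library's `bootstrap_ci` API.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-query difference (A minus B)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample queries with replacement and record the mean difference each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# A consistent per-query gain: the interval collapses onto the gain itself.
steady_lo, steady_hi = paired_bootstrap_ci([0.6] * 30, [0.5] * 30)

# Gains and losses that cancel out: the interval straddles zero.
noisy_lo, noisy_hi = paired_bootstrap_ci([1.0, 0.0] * 20, [0.5] * 40)

print("steady:", (round(steady_lo, 3), round(steady_hi, 3)))
print("noisy: ", (round(noisy_lo, 3), round(noisy_hi, 3)))
```

An interval that contains zero is a signal that the claimed per-feature lift has not been separated from bench noise, regardless of whether the overall gate passes.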
The library is calibrated for retrieval and ranking evaluation — search, recommendation top-K, RAG retrieval-side, classification-as-retrieval. It is not yet generalised to LLM free-text generation, summarisation, or open-ended QA. Those domains need their own null distributions and are planned for v0.3+.
```python
from falsify_eval import four_null_gate

# Replace this with whatever your retriever returns. The library doesn't
# care if it's BM25, FAISS, Pinecone, Weaviate, Vespa, or a homegrown
# bag-of-words. It grades the OUTPUT, not the engine.
def my_rag_retriever(query: str) -> list[str]:
    """Return top-K document IDs for a query."""
    ...

retrieved = [my_rag_retriever(q) for q in queries]

def recall_at_5(r, g, _rel):
    return 1.0 if g in r[:5] else 0.0

res = four_null_gate(
    retrieved, gold, [3]*len(gold), recall_at_5,
    item_pool=pool, k=5, n_trials=100, tau=0.05, seed=2026,
)
print("GATE:", "PASS" if res["gate_passes"] else "FAIL", res["deltas"])
```

A complete Claude-API worked example with a 50-query bench is in `examples/llm_rag_validation.py`. To adapt it to GPT-4 / Llama / Mistral / Gemini, swap the API call inside `my_rag_retriever`. The gate is identical.
The gate calls your metric_fn exactly N × (1 + 4 × n_trials) times.
| Metric cost / call | N=500, n_trials=50 |
|---|---|
| In-memory check (~1 µs) | 0.1 s |
| Embedding lookup (~1 ms) | 1.7 min |
| LLM-judge call (~200 ms) | ~5.6 hours |
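The rows above follow directly from the call-count formula. A small sketch that reproduces the arithmetic (the helper names are hypothetical and not part of the library):

```python
def metric_calls(n_queries: int, n_trials: int, n_nulls: int = 4) -> int:
    """Calls to metric_fn: one real pass plus n_trials passes per null."""
    return n_queries * (1 + n_nulls * n_trials)

def estimated_seconds(n_queries: int, n_trials: int, seconds_per_call: float) -> float:
    """Wall-clock estimate if metric_fn dominates and runs serially."""
    return metric_calls(n_queries, n_trials) * seconds_per_call

calls = metric_calls(500, 50)
print("calls:", calls)
for label, cost in [("in-memory", 1e-6), ("embedding lookup", 1e-3), ("LLM judge", 0.2)]:
    print(f"{label:>16}: {estimated_seconds(500, 50, cost):,.1f} s")
```

For N=500 and n_trials=50 this gives 100,500 calls, which is where the 0.1 s, 1.7 min, and roughly 5.6 hour figures come from. The estimate assumes the metric dominates runtime and that nothing is cached or parallelised.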
If your run is taking hours, your metric is the bottleneck — not the gate (which finishes N=5,000 × pool=100k × n_trials=50 in under 2 seconds with a fast metric). Pass progress=True to see per-stage timing on stderr. Three options to speed up: (1) drop n_trials from 50 → 20 — statistically defensible; (2) cache metric_fn calls; (3) parallelise the four nulls with multiprocessing — pure CPU, no shared state.
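Option (2) can be as simple as a memoising wrapper around your metric. The sketch below assumes the metric takes `(retrieved, gold, rel)` per query, as `recall_at_5` does elsewhere in this README, and that `gold` and `rel` are hashable. `memoize_metric` is a hypothetical helper, not a library function.

```python
def memoize_metric(metric_fn):
    """Wrap metric_fn so repeated (retrieved, gold, rel) inputs are computed once."""
    cache = {}

    def wrapper(retrieved, gold, rel):
        # Lists are unhashable, so the retrieved list is frozen into a tuple.
        key = (tuple(retrieved), gold, rel)
        if key not in cache:
            cache[key] = metric_fn(retrieved, gold, rel)
        return cache[key]

    wrapper.cache = cache  # exposed so you can inspect the hit rate
    return wrapper

call_log = []

def slow_recall_at_5(retrieved, gold, _rel):
    call_log.append(gold)  # stands in for an expensive call
    return 1.0 if gold in retrieved[:5] else 0.0

fast_metric = memoize_metric(slow_recall_at_5)
first = fast_metric(["a", "b", "c"], "a", 3)
second = fast_metric(["a", "b", "c"], "a", 3)  # served from the cache
print(first, second, "underlying calls:", len(call_log))
```

How much this helps depends on how often the nulls regenerate an input the metric has already seen; with few distinct gold labels the hit rate can be high, with many it may be negligible.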
| Capability | DVC | MLflow | W&B | Ragas | TruLens | falsify-eval |
|---|---|---|---|---|---|---|
| Vendor-free | ✓ | ✓ | ✗ | ✓ | partial | ✓ |
| Pure-text human-readable lock | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Couples artifact hash + verified score | ✗ | ✗ | partial | ✗ | partial | ✓ |
| Falsification gate (CI-enforceable) | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Marginal-matched null ★ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Positive-control self-validation | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
The tools above solve different problems (versioning, tracking, observability). They complement falsify-eval; they don't replace it.
Where it actually runs
Pure Python ≥ 3.10 + numpy ≥ 1.24. No GPUs, no native extensions, no internet at runtime.
| Environment | One-liner |
|---|---|
| Local laptop | pip install git+https://github.com/spalsh-spec/falsify-eval |
| Google Colab | !pip install git+https://github.com/spalsh-spec/falsify-eval |
| Kaggle / Sagemaker / Databricks | same as Colab |
| GitHub Actions | add the pip install line to your run: block |
| Docker (any base image with Python ≥ 3.10) | RUN pip install git+... |
| AWS Lambda / Cloud Functions | bundle as a layer; the wheel is < 50 KB |
| Air-gapped / offline | clone the repo to a USB stick; install from local path |
The library is intentionally minimal so the audit surface is small and the deployment surface is large. No network calls, no telemetry, no opinions about your runtime.
What the gate proves (Proposition 1)
If the four-null gate PASSes (Δ ≥ τ on all four nulls) at N_trials = 50, τ = 0.05, then with Bonferroni-corrected confidence ≥ 0.95:
- The engine is not equivalent to a label-permutation-invariant ranker (rejected by G_A).
- The engine is not achieving its score solely via the uniform-class-prior assumption (rejected by G_B).
- The engine is not equivalent to a uniform-random retriever (rejected by G_C).
- The engine is not equivalent to a gold-marginal-matched predictor (rejected by G_D — new in this work).
The full proof is in PREPRINT.md, §3.
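As a reading aid, one standard way to arrive at a joint confidence of 0.95 over four statements is the union (Bonferroni) bound. The notation below is an assumption for illustration: E_i denotes the event that the conclusion for null i is wrong, and each is assumed to be controlled at level α/4 with α = 0.05. The preprint's proof may differ in its details.

```latex
\Pr\Big[\bigcup_{i \in \{A,B,C,D\}} E_i\Big]
  \;\le\; \sum_{i \in \{A,B,C,D\}} \Pr[E_i]
  \;\le\; 4 \cdot \frac{\alpha}{4}
  \;=\; \alpha \;=\; 0.05
```

Under that assumption, all four statements hold simultaneously with probability at least 1 − α = 0.95.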
Why we built it
Most retrieval-system papers report a single aggregate metric (nDCG@k, MRR) and call it a contribution. Three failure modes make this practice unsafe at any benchmark size and dangerous on small ones:
1. Null-distribution silence. A learned ranker can absorb gold-label distribution shape without learning underlying query–document relevance. A constant predictor matched to the empirical class marginal can score non-trivially without using the query at all.
2. Corpus drift between commits. ALTER TABLE migrations and feedback-loop side effects mutate runtime artifacts without changing source code. A "score-neutral" annotation can be true about the source diff while false about the runnable system.
3. Small-sample claims masquerading as significance. A +0.02 metric gain on N < 50 queries usually sits inside the bench's noise floor.
The four-null gate addresses (1). The integrity-check state lock (lock_state / verify_state) addresses (2). The statistical-reporting helpers (bootstrap_ci, paired_permutation_p, cohens_d_paired, power_n_required) address (3). All in <1,000 lines of Python with numpy as the only runtime dependency.
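To illustrate the idea behind failure mode (2), the sketch below couples an artifact hash with a score in a plain-text lock and re-checks it later. It is a generic illustration built on `hashlib`; `write_lock`, `check_lock`, and the lock format are hypothetical and are not the signatures or file format of `lock_state` / `verify_state`.

```python
import hashlib
import os
import tempfile
from pathlib import Path

def artifact_digest(path):
    """SHA-256 of a file, read in chunks so large artifacts are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_lock(artifact_path, score, lock_path):
    """Record the artifact hash and the score it produced as readable text."""
    text = f"sha256={artifact_digest(artifact_path)}\nscore={score:.6f}\n"
    Path(lock_path).write_text(text, encoding="utf-8")

def check_lock(artifact_path, lock_path):
    """True only if the artifact still hashes to the value in the lock."""
    lines = Path(lock_path).read_text(encoding="utf-8").splitlines()
    fields = dict(line.split("=", 1) for line in lines if line)
    return fields["sha256"] == artifact_digest(artifact_path)

# Demo: the source code never changes, but the runtime artifact does.
workdir = tempfile.mkdtemp()
index_path = os.path.join(workdir, "index.bin")
lock_path = os.path.join(workdir, "state.lock")

Path(index_path).write_bytes(b"original index contents")
write_lock(index_path, score=0.62, lock_path=lock_path)
ok_before = check_lock(index_path, lock_path)

Path(index_path).write_bytes(b"index mutated by a migration")
ok_after = check_lock(index_path, lock_path)

print("before drift:", ok_before, "| after drift:", ok_after)
```

The point of pairing the hash with the score is that a reported number is only meaningful for the exact artifact that produced it; once the check fails, the score needs to be re-measured rather than carried forward.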
- `PREPRINT.md`: Calibrated Falsification Harnesses for Retrieval Evaluation (v7, with N=10,000 validation, broken-predictor suite, sensitivity grid, soundness proposition).
- `SUPPLEMENTARY.md`: extended tables, ablations, bench-size calibration curve.
Submission to arXiv is pending. The DOI will be added to CITATION.cff on acceptance. In the interim, the markdown is the canonical source; both files are immutable for v0.1.0 (verifiable via lock_state against the v0.1.0 tag).
```bibtex
@article{sharma2026calibrated,
  title         = {Calibrated Falsification Harnesses for Retrieval Evaluation},
  author        = {Sharma, Sparsh},
  year          = {2026},
  eprint        = {<arxiv-id-when-published>},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR}
}
```

Vāk-Kaṇaja is the Sanskrit / Pāṇinian retrieval engine built alongside falsify-eval. It is the first retriever (to my knowledge) adversarially verified by the four-null gate via cross-falsification, and the first to wire the 6 classical Pramāṇas of Nyāya / Mīmāṃsā into a retrieval engine as a router, detecting the query's epistemological type (Pratyakṣa, Anumāna, Upamāna, Arthāpatti, Anupalabdhi, Śabda) and routing evidence channels accordingly.
It also implements an Anupalabdhi (non-perception) confidence floor: when the corpus does not contain the answer, the engine returns "corpus does not contain this knowledge" as a positive verdict instead of leaking weak chunks. This pairs naturally with falsify-eval's Null A, since silent failure is the mode that load-bearing AI-safety arguments rely on assuming away.
The engine ships with a calibrated negative result: bench expansion N=21 → N=141 falsified the lift from the novel rerankers (Poincaré, topological persistence, fractal affinity), which now ship at production weight 0 and are documented as opt-in research components. The 3-channel φ-RRF baseline is the production default. This is the falsify-eval discipline applied to the authoring engine — same calibration that earned three clean rounds of adversarial review on this library.
Public release imminent at github.com/spalsh-spec/vak-kanaja, Apache 2.0, under the Bhardwaj & Sons brand. Priority announcement dated 2026-05-08.
- v0.1.6.11: current. 91 tests passing on a fresh clone (Mayank-battery 31 + property-based 15 + scipy cross-check 11 + smoke 8 + validation 9 + CLI stdin 4 + Windows-encoding 3 + shell-mangled paths 6 + sundry 4); ~10 s on M1. CI matrix green on Ubuntu × {3.10, 3.11, 3.12} and macOS × {3.10, 3.11, 3.12}.
- v0.1.6.11: publish-workflow version-sync guard hardened. It previously tried to `import falsify_eval` before the package was installed and failed at the version-check step; it now reads `__version__` and `pyproject.toml`'s `version` directly via grep/sed so the tag, source files, and built artefact are cross-checked three ways without requiring an install.
- v0.1.6.10: distribution + arXiv build prep (infrastructure-only, no gate behaviour change). Added `.github/workflows/publish.yml` for OIDC trusted publishing to PyPI on every `v*` tag push; added `tools/build_arxiv.sh` for converting `PREPRINT.md` to an arXiv-submittable LaTeX bundle via pandoc; added `[tool.mutmut]` config + `docs/MUTATION_TESTING.md` documenting the deferred status; added `[project.optional-dependencies] dev` bucket pinning `mutmut`, `build`, and `twine`.
- v0.1.6.9: added CS03 case-study scaffold (`case_studies/cs03_aikosh_rag/`) for the AIKosh internal RAG integration (Jasmeet Singh, in flight); added Tested-platforms log to README; renumbered v0.2 case studies (CS03 = AIKosh, CS04 = FiQA, CS05 = Quora).
- v0.1.6.8: empirical equivariance certificate. PREPRINT §5.9 + property tests proving the gate is strongly equivariant under order-preserving label-set bijections and Null C / `real_mean` are exactly equivariant under arbitrary bijections.
- v0.1.6.7: declared `hypothesis>=6.0` as a test dep so CI installs it. (Caught by CI matrix the moment v0.1.6.6 landed.)
- v0.1.6.6: Hypothesis property-based test suite for the four-null gate: 13 universally-true properties (algebraic, deterministic, metric, gate-semantics, validation), each fuzzed against ~80 random benches per CI run.
- v0.1.6.5: cross-platform path-mangling hint. When `--input my-bench\bench.jsonl` is copy-pasted into zsh and the backslash gets eaten, the CLI now suggests the corrected forward-slash path instead of a bare `FileNotFoundError`.
- v0.1.6.4: Windows console UTF-8 / ASCII output hardening (closes Jasmeet's cp1252 `UnicodeEncodeError` on the `Δ` glyph). Reconfigure stdout to UTF-8 with `errors='replace'` at CLI entry, with auto-fallback to ASCII glyphs (`Δ`→`d`, `τ`→`tau`, `✓`→`[ok]`) when the post-reconfigure stream still can't encode them. Also `--ascii` flag and `FALSIFY_ASCII=1` env var.
- v0.1.6.3: public priority announcement of companion engine Vāk-Kaṇaja.
- v0.1.6.2: Mayank round-3 polish, negative-seed validation in `_validate_inputs`.
- v0.1.6.1: Mayank round-2, CLI `--input -` now reads from stdin (was `FileNotFoundError: '-'`).
- v0.1.6: bonferroni helper, scipy cross-check tests, property-based tests, CS02 SciFact case study, PREPRINT scope-honesty rewrite, AI/retrieval conflation strike across surfaces.
- v0.1.5.2: added `progress=True` flag to `four_null_gate` after Mayank's 5-hour AIKosh silent-run incident.
- v0.1.5.1: closed `null_a` defect class for tuple / dataclass labels.
- v0.1.5: fixed all 14 defects from the Mayank Singh adversarial battery; full credit in `CHANGELOG.md`.
- v0.2 (next): PyPI publish; case studies CS03 (AIKosh internal RAG, scaffolded, see `case_studies/cs03_aikosh_rag/`), CS04 (FiQA) and CS05 (Quora) for metric-sensitivity triangulation; broken-predictor zoo as a public artifact; `label_order_seed` parameter to break dependency on adversarial label ordering (see PREPRINT §5.9).
- v0.3+ (planned): extension to LLM free-text and summarisation; pre-registration tooling. (Not yet shipped; do not claim coverage.)
External-verification log. Each entry is a real run by a real person who is not the package author, dated, with the exact version they ran. New entries go at the top.
| Date | Tester | OS | Python | Shell | Version | Notes |
|---|---|---|---|---|---|---|
| 2026-05-08 | Jasmeet Singh (AIKosh) | Windows 10 (19045) | 3.14.3 | PowerShell | 0.1.6.7 | install / upgrade 0.1.6.2→0.1.6.7 / doctor / quickstart / grade all clean; original cp1252 defect closed. CS03 integration with AIKosh's internal RAG retriever in flight. |
| 2026-05-07 | Mayank Singh | macOS 14 (M1) | 3.12 | zsh | 0.1.5 → 0.1.6.2 | adversarial 14-defect battery; all closed. |
Issues and PRs welcome. The reference implementation is intentionally minimal; the goal is for the protocol to be small enough that adopters audit the entire library before depending on it.
A house of standards. Released by Bhardwaj & Sons under Apache 2.0.
The methodology is free, public, and citable so it can become a standard rather than a product.