falsify-eval — four nulls, one gate, zero inflation

Some search engines pretend to be smart.

They look like they understand your question.
They actually just return whatever's most popular in their database.

A student named Mira would do the same on her French exam.
She'd score 80% by always picking "C". She doesn't speak French.

This is a 30-second test that catches them.

→ Try it without installing anything

Open in Colab · Play with sliders · Real-data case study

The Colab runs the actual library on a synthetic bench (60 seconds, no install).
The Playground lets you pick a strategy with sliders and watch the gate verdict update live in your browser.
The Case study shows the same gate working on a peer-reviewed BEIR benchmark.

→ Or install and run locally

pip install git+https://github.com/spalsh-spec/falsify-eval

Free. Open source. Runs on your laptop. Works on any search system.

Built for search engines, recommendation systems, the retrieval side of RAG.
Not built for the part of ChatGPT that writes paragraphs — that's a different problem we haven't built a test for.



30-second demo · The Mira test · How it works · Three surfaces · Preprint


The Mira test

Imagine a student named Mira who never studied. She noticed that on past exams, "C" is the most common correct answer. So she writes C every time and scores 80%. She looks smart on paper. She has zero actual knowledge — she gamed the pattern.

A retrieval or ranking system can do the same thing. If the most popular document in a corpus happens to be relevant for most queries, a system that always returns that popular document will score well on aggregate metrics — without using the query at all. (This is not a hypothetical: see the CS01 NFCorpus case study where this exact predictor scores nDCG@10 = 0.066 on a published BEIR benchmark while ignoring every query.)

The published number looks great. It does not mean what you think it means.

falsify-eval is a Mira-check for retrieval and ranking systems. It compares your system's score against four "fake students" — four null distributions, including one (Null D, the marginal-matched random) that is original to this work and that the previous standard nulls miss. If your system can't beat all four by a calibrated margin, the gate fails.

Case studies (real numbers, two public benchmarks):

Across both: Mira and popularity-only fail at Δ_D ≈ 0; BM25 and dense MiniLM pass at Δ_D = +0.14 to +0.73. Each is reproducible in 5 minutes on an M1 laptop. Joint finding: graded metrics (nDCG) on dense-relevance benchmarks can flatten the gate, so pair them with single-gold strict metrics (recall@K against top-1).
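To make the joint finding concrete, here is a minimal sketch (not taken from the library) of how a graded metric and a strict single-gold metric can disagree on a single query. The function names, the document IDs, and the linear-gain form of nDCG are assumptions chosen for illustration only.

```python
import math

def ndcg_at_k(retrieved, graded_rel, k):
    """Graded nDCG@k with linear gain (an assumed variant, for illustration)."""
    dcg = sum(
        graded_rel.get(doc, 0) / math.log2(rank + 2)
        for rank, doc in enumerate(retrieved[:k])
    )
    ideal = sorted(graded_rel.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def strict_recall_at_k(retrieved, top1_gold, k):
    """1.0 only if the single best document appears in the top k."""
    return 1.0 if top1_gold in retrieved[:k] else 0.0

# Hypothetical query: one highly relevant document and three weakly relevant ones.
graded_rel = {"d1": 3, "d2": 1, "d3": 1, "d4": 1}
top1_gold = "d1"

# A query-blind list that happens to contain the weakly relevant documents.
popular_list = ["d2", "d3", "d4", "d8", "d9"]

graded = ndcg_at_k(popular_list, graded_rel, k=5)
strict = strict_recall_at_k(popular_list, top1_gold, k=5)
print(f"nDCG@5 = {graded:.3f}, strict recall@5 = {strict:.1f}")
```

On a benchmark where many documents carry some relevance, the graded score stays well above zero even though the list misses the top-1 document entirely, which is why the strict metric is a useful companion.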


30-second demo

pip install git+https://github.com/spalsh-spec/falsify-eval
python -c "from falsify_eval.demo import run; run()"

Three systems graded on a 50-query synthetic bench:

═══ constant_predictor (deliberately broken) ═══
  real mean nDCG@5         = 0.20
    Δ_A (gold-permuted)    = +0.000  ✗
    Δ_B (uniform random)   = +0.001  ✗
    Δ_C (random retrieval) = +0.18   ✓
    Δ_D (marginal-matched) = +0.000  ✗  ← the gate that catches Mira
  GATE: ✗ FAIL  (correctly rejected)

═══ mock_engine (plausible retrieval, 70% top-1) ═══
  real mean nDCG@5         = 0.62
    Δ across all 4 nulls   ≥ +0.40   ✓
  GATE: ✓ PASS  (correctly accepted)

═══ oracle (perfect top-1) ═══
  real mean nDCG@5         = 1.00
  GATE: ✓ PASS by maximum margin

How it works

%%{init: {'theme': 'base', 'themeVariables': {
    'fontFamily': 'Garamond, EB Garamond, Georgia, serif',
    'primaryColor': '#f3eee5',
    'primaryTextColor': '#1c1611',
    'primaryBorderColor': '#9c4a1a',
    'lineColor': '#9d8147',
    'tertiaryColor': '#faf6ed',
    'tertiaryBorderColor': '#d4c8b2',
    'edgeLabelBackground': '#f3eee5'
}}}%%
flowchart LR
    R([your retriever]) -->|top-K per query| S[real score]
    G([gold labels]) --> S
    G -->|permute π| A[Null A · label-permuted]
    G -->|iid uniform| B[Null B · uniform random]
    P([item pool]) -->|sample K| C[Null C · random retrieval]
    G -->|sample by class freq| D[Null D · marginal-matched ★]
    S --> Δ{Δ ≥ τ on<br/>all four?}
    A --> Δ
    B --> Δ
    C --> Δ
    D --> Δ
    Δ -->|yes| PASS([✓ PASS])
    Δ -->|no| FAIL([✗ FAIL])
    classDef ok    fill:#eef3e8,stroke:#3d7a4a,color:#1a3d22,stroke-width:1.5px;
    classDef fail  fill:#f7e9e3,stroke:#9c4a1a,color:#5a1c0c,stroke-width:1.5px;
    classDef novel fill:#fef9e7,stroke:#9d8147,color:#5a4720,stroke-width:2px;
    classDef gate  fill:#f3eee5,stroke:#1c1611,color:#1c1611,stroke-width:2px;
    class PASS ok
    class FAIL fail
    class D novel
    class Δ gate
| Null | What it tests | Catches |
|---|---|---|
| A — gold-permuted | bijection π over class labels | systems that learned label distribution shape, not relevance |
| B — uniform random | iid uniform draw of gold per query | systems that exploit class-prior assumption |
| C — random retrieval | replace engine output with K random items from pool | systems that score by retrieval coverage, not ranking quality |
| D — marginal-matched | iid draw of gold from the empirical class frequency | predictors matched to the gold marginal — new in this work |

Null D is the load-bearing contribution. It correctly rejects the constant-most-frequent predictor, which Nulls A and B can wrongly pass. (Definition 1 of the preprint.)
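The four nulls can be sketched in a few lines of plain Python. This is one reading of the table above, not the library's implementation: the function names, the hit@1 metric, the skewed toy bench, and the interpretation of Null A as a random relabelling of the label set are all assumptions. It shows a Mira-style constant predictor on a bench where 80% of the gold labels are "C".

```python
import random

def hit_at_1(retrieved, gold):
    return 1.0 if retrieved and retrieved[0] == gold else 0.0

def mean_score(retrieved_lists, gold_list):
    pairs = zip(retrieved_lists, gold_list)
    return sum(hit_at_1(r, g) for r, g in pairs) / len(gold_list)

def sketch_deltas(retrieved_lists, gold_list, item_pool, k=1, n_trials=200, seed=0):
    """Return (real score, {null name: real score minus mean null score})."""
    rng = random.Random(seed)
    labels = sorted(set(gold_list))
    n = len(gold_list)
    real = mean_score(retrieved_lists, gold_list)
    totals = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0}
    for _ in range(n_trials):
        # Null A: relabel gold through a random bijection over the label set.
        shuffled = labels[:]
        rng.shuffle(shuffled)
        pi = dict(zip(labels, shuffled))
        totals["A"] += mean_score(retrieved_lists, [pi[g] for g in gold_list])
        # Null B: gold drawn iid uniformly over the label set.
        uniform_gold = [rng.choice(labels) for _ in range(n)]
        totals["B"] += mean_score(retrieved_lists, uniform_gold)
        # Null C: engine output replaced by k random items from the pool.
        random_lists = [rng.sample(item_pool, k) for _ in range(n)]
        totals["C"] += mean_score(random_lists, gold_list)
        # Null D: gold drawn iid from the empirical class frequency.
        marginal_gold = rng.choices(gold_list, k=n)
        totals["D"] += mean_score(retrieved_lists, marginal_gold)
    deltas = {name: real - total / n_trials for name, total in totals.items()}
    return real, deltas

# Toy bench: 100 queries, gold is "C" for 80 of them.
gold = ["C"] * 80 + ["A"] * 5 + ["B"] * 5 + ["D"] * 5 + ["E"] * 5
pool = ["A", "B", "C", "D", "E"]
mira = [["C"] for _ in gold]  # always answers "C", never reads the query

tau = 0.05
real, deltas = sketch_deltas(mira, gold, pool)
gate_passes = all(delta >= tau for delta in deltas.values())
print(f"real = {real:.2f}")
for name, delta in deltas.items():
    print(f"  delta_{name} = {delta:+.3f}")
print("GATE:", "PASS" if gate_passes else "FAIL")
```

On this skewed bench the constant predictor clears Nulls A, B, and C by a wide margin, and only Null D brings the margin down to roughly zero, so the gate fails on D alone.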


Three surfaces

# 1. Library
from falsify_eval import four_null_gate

result = four_null_gate(
    retrieved_lists, gold_list, rel_list, my_metric,
    item_pool=corpus_ids, k=5, n_trials=50, tau=0.05,
    progress=True,                      # stderr per-stage timing
)
assert result["gate_passes"]

# 2. CLI on JSONL benches — no Python knowledge needed
falsify-eval grade --input bench.jsonl --metric ndcg@5 --pool corpus.txt
falsify-eval doctor                     # end-to-end install verification
falsify-eval quickstart ./demo          # writes a sample bench + pool

// 3. MCP server — Claude Code, Cursor, any MCP-compatible client
{
  "mcpServers": {
    "falsify-eval": {
      "command": "python",
      "args": ["-m", "falsify_eval.mcp_server"]
    }
  }
}

Claude can then call grade_retrieval directly on any retrieval pipeline output you give it — no glue code, no separate scoring service.


What it catches

A non-exhaustive list of failure modes the gate flags:

| Broken predictor | Δ_A | Δ_B | Δ_C | Δ_D | Gate |
|---|---|---|---|---|---|
| Constant most-frequent class | ≈ 0 | ≈ 0 | + | ≈ 0 | |
| Marginal-matched random | ≈ 0 | + | + | ≈ 0 | |
| Popularity-only ranker (no query feature) | + | + | + | small | |
| Lexical-match-only on bag-of-words | + | + | + | + | |
| Full retriever (BM25 / dense / hybrid) | + | + | + | + | |
| Full retriever on drifted corpus | varies | varies | varies | varies | ✗ via verify_state |

The first three score well on bare aggregate metrics (nDCG, MRR, recall@K). The standard reporting practice publishes those numbers. The four-null gate rejects them.


What the gate does NOT prove

A passing gate is necessary for credible reporting, not sufficient. It does not prove:

  • the engine learned the actual relevance signal (only that it learned something beyond the four trivial null classes)
  • the engine generalises beyond the evaluation set
  • per-feature contribution claims are significant (handled separately by bootstrap_ci, paired_permutation_p, cohens_d_paired)
  • the bench developer didn't overfit query phrasing to engine behaviour
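For the third item in the list above, the following is a minimal standard-library sketch of a paired sign-flip permutation test and a bootstrap confidence interval on per-query score differences. It is not the library's `bootstrap_ci` or `paired_permutation_p` (their signatures are not shown here); the function names, defaults, and synthetic scores are assumptions for illustration.

```python
import random

def paired_sign_flip_p(scores_a, scores_b, n_perm=2000, seed=0):
    """Two-sided p-value for the mean paired difference under random sign flips."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

def bootstrap_mean_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic example: 30 queries, a small average gain buried in per-query noise.
rng = random.Random(1)
baseline = [rng.random() for _ in range(30)]
variant = [b + rng.gauss(0.02, 0.2) for b in baseline]

p_value = paired_sign_flip_p(variant, baseline)
ci_low, ci_high = bootstrap_mean_ci([v - b for v, b in zip(variant, baseline)])
print(f"p = {p_value:.3f}, 95% CI for mean gain = [{ci_low:+.3f}, {ci_high:+.3f}]")
```

A claimed per-feature gain is only worth reporting when the interval excludes zero and the p-value survives whatever multiple-comparison correction the study uses.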

The library is calibrated for retrieval and ranking evaluation — search, recommendation top-K, RAG retrieval-side, classification-as-retrieval. It is not yet generalised to LLM free-text generation, summarisation, or open-ended QA. Those domains need their own null distributions and are planned for v0.3+.


Validating an LLM-RAG pipeline

from falsify_eval import four_null_gate

# Replace this with whatever your retriever returns. The library doesn't
# care if it's BM25, FAISS, Pinecone, Weaviate, Vespa, or a homegrown
# bag-of-words. It grades the OUTPUT, not the engine.
def my_rag_retriever(query: str) -> list[str]:
    """Return top-K document IDs for a query."""
    ...

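# Assumed to be defined by your bench: `queries` (list of query strings),
# `gold` (one gold document ID per query), `pool` (all candidate document IDs).
retrieved = [my_rag_retriever(q) for q in queries]

def recall_at_5(r, g, _rel):
    """1.0 if the gold ID appears in the top 5 retrieved IDs, else 0.0."""
    return 1.0 if g in r[:5] else 0.0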

res = four_null_gate(
    retrieved, gold, [3]*len(gold), recall_at_5,
    item_pool=pool, k=5, n_trials=100, tau=0.05, seed=2026,
)
print("GATE:", "PASS" if res["gate_passes"] else "FAIL", res["deltas"])

A complete Claude-API worked example with a 50-query bench is in examples/llm_rag_validation.py. To adapt it to GPT-4 / Llama / Mistral / Gemini: swap the API call inside my_rag_retriever. The gate is identical.


Why is my run taking so long?

The gate calls your metric_fn exactly N × (1 + 4 × n_trials) times.

| Metric cost / call | N=500, n_trials=50 |
|---|---|
| In-memory check (~1 µs) | 0.1 s |
| Embedding lookup (~1 ms) | 1.7 min |
| LLM-judge call (~200 ms) | ~5.6 hours |

If your run is taking hours, your metric is the bottleneck — not the gate (which finishes N=5,000 × pool=100k × n_trials=50 in under 2 seconds with a fast metric). Pass progress=True to see per-stage timing on stderr. Three options to speed up: (1) drop n_trials from 50 → 20 — statistically defensible; (2) cache metric_fn calls; (3) parallelise the four nulls with multiprocessing — pure CPU, no shared state.
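The call-count formula and the caching option can be sketched as follows. The helper names are hypothetical and not part of the library; the wrapper assumes the metric is a pure function with the `(retrieved, gold, rel)` argument order used elsewhere in this README, and that `gold` and `rel` are hashable.

```python
from functools import lru_cache

def metric_calls(n_queries, n_trials, n_nulls=4):
    """Call count from the formula above: N x (1 + 4 x n_trials)."""
    return n_queries * (1 + n_nulls * n_trials)

def estimated_seconds(n_queries, n_trials, seconds_per_call):
    return metric_calls(n_queries, n_trials) * seconds_per_call

def cached_metric(metric_fn):
    """Memoise a pure metric so repeated (retrieved, gold, rel) triples are free."""
    @lru_cache(maxsize=None)
    def _cached(retrieved_key, gold, rel):
        return metric_fn(list(retrieved_key), gold, rel)

    def wrapper(retrieved, gold, rel):
        # Lists are unhashable, so the retrieved list is keyed as a tuple.
        return _cached(tuple(retrieved), gold, rel)

    wrapper.cache_info = _cached.cache_info
    return wrapper

# Stand-in for an expensive metric; the counter records real invocations.
calls = {"n": 0}

def slow_recall_at_5(retrieved, gold, _rel):
    calls["n"] += 1
    return 1.0 if gold in retrieved[:5] else 0.0

fast = cached_metric(slow_recall_at_5)
for _ in range(3):
    fast(["d1", "d2"], "d2", 3)

print(metric_calls(500, 50))                       # 100500
print(metric_calls(500, 20))                       # 40500
print(estimated_seconds(500, 50, 0.2) / 3600)      # hours for a 200 ms judge
print(calls["n"])                                  # 1
```

Caching pays off most when the null draws reuse the same retrieved list against a small set of gold labels, since many triples then repeat across trials.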


How this compares

Capability DVC MLflow W&B Ragas TruLens falsify-eval
Vendor-free partial
Pure-text human-readable lock
Couples artifact hash + verified score partial partial
Falsification gate (CI-enforceable)
Marginal-matched null
Positive-control self-validation

The tools above solve different problems (versioning, tracking, observability). They complement falsify-eval; they don't replace it.


Where it actually runs

Pure Python ≥ 3.10 + numpy ≥ 1.24. No GPUs, no native extensions, no internet at runtime.

| Environment | One-liner |
|---|---|
| Local laptop | pip install git+https://github.com/spalsh-spec/falsify-eval |
| Google Colab | !pip install git+https://github.com/spalsh-spec/falsify-eval |
| Kaggle / Sagemaker / Databricks | same as Colab |
| GitHub Actions | add the pip install line to your run: block |
| Docker (any base image with Python ≥ 3.10) | RUN pip install git+... |
| AWS Lambda / Cloud Functions | bundle as a layer; the wheel is < 50 KB |
| Air-gapped / offline | clone the repo to a USB stick; install from local path |

The library is intentionally minimal so the audit surface is small and the deployment surface is large. No network calls, no telemetry, no opinions about your runtime.

What the gate proves (Proposition 1)

If the four-null gate PASSes (Δ ≥ τ on all four nulls) at n_trials = 50, τ = 0.05, then with Bonferroni-corrected confidence ≥ 0.95:

  • The engine is not equivalent to a label-permutation-invariant ranker (rejected by G_A).
  • The engine is not achieving its score solely via the uniform-class-prior assumption (rejected by G_B).
  • The engine is not equivalent to a uniform-random retriever (rejected by G_C).
  • The engine is not equivalent to a gold-marginal-matched predictor (rejected by G_D — new in this work).

The full proof is in PREPRINT.md, §3.

Why we built it

Most retrieval-system papers report a single aggregate metric (nDCG@k, MRR) and call it a contribution. Three failure modes make this practice unsafe at any benchmark size and dangerous on small ones:

  1. Null-distribution silence. A learned ranker can absorb gold-label distribution shape without learning underlying query–document relevance. A constant predictor matched to the empirical class marginal can score non-trivially without using the query at all.
  2. Corpus drift between commits. ALTER TABLE migrations and feedback-loop side effects mutate runtime artifacts without changing source code. A "score-neutral" annotation can be true about the source diff while false about the runnable system.
  3. Small-sample claims masquerading as significance. A +0.02 metric gain on N < 50 queries usually sits inside the bench's noise floor.

The four-null gate addresses (1). The integrity-check state lock (lock_state / verify_state) addresses (2). The statistical-reporting helpers (bootstrap_ci, paired_permutation_p, cohens_d_paired, power_n_required) address (3). All in <1,000 lines of Python with numpy as the only runtime dependency.
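The idea behind the state lock in (2) can be sketched with the standard library alone. This is not the library's `lock_state` / `verify_state` (their signatures are not shown here); the function names, the JSON lock layout, and the file names are hypothetical. The sketch records a SHA-256 digest per runtime artifact next to the score that was measured on it, then reports any artifact whose bytes have since changed.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path):
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def sketch_lock(artifact_paths, score, lock_path):
    """Write a human-readable lock coupling artifact hashes to a score."""
    record = {
        "score": score,
        "artifacts": {str(p): sha256_of(p) for p in artifact_paths},
    }
    Path(lock_path).write_text(json.dumps(record, indent=2, sort_keys=True))
    return record

def sketch_verify(lock_path):
    """Return the artifact paths whose current hash differs from the lock."""
    record = json.loads(Path(lock_path).read_text())
    return [
        path for path, digest in record["artifacts"].items()
        if sha256_of(path) != digest
    ]

with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.db"
    lock = Path(tmp) / "state.lock.json"
    corpus.write_bytes(b"version 1 of the corpus")
    sketch_lock([corpus], score=0.62, lock_path=lock)
    drift_before = sketch_verify(lock)   # nothing has changed yet
    corpus.write_bytes(b"version 2, mutated by a migration")
    drift_after = sketch_verify(lock)    # the artifact no longer matches

print(drift_before, [Path(p).name for p in drift_after])
```

The source diff between the two states is empty in this example, yet the verification step still reports the drift, which is the failure mode described in (2).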


Preprint

  • PREPRINT.md: Calibrated Falsification Harnesses for Retrieval Evaluation (v7, with N=10,000 validation, broken-predictor suite, sensitivity grid, soundness proposition).
  • SUPPLEMENTARY.md: extended tables, ablations, bench-size calibration curve.

Submission to arXiv is pending. The DOI will be added to CITATION.cff on acceptance. In the interim, the markdown is the canonical source; both files are immutable for v0.1.0 (verifiable via lock_state against the v0.1.0 tag).

@article{sharma2026calibrated,
  title  = {Calibrated Falsification Harnesses for Retrieval Evaluation},
  author = {Sharma, Sparsh},
  year   = {2026},
  eprint = {<arxiv-id-when-published>},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR}
}

Companion engine — Vāk-Kaṇaja (public release imminent)

Vāk-Kaṇaja is the Sanskrit / Pāṇinian retrieval engine built alongside falsify-eval. It is the first retriever (to my knowledge) adversarially verified by the four-null gate via cross-falsification, and the first to wire the 6 classical Pramāṇas of Nyāya / Mīmāṃsā into a retrieval engine as a router — detecting the query's epistemological type (Pratyakṣa, Anumāna, Upamāna, Arthāpatti, Anupalabdhi, Śabda) and routing evidence channels accordingly.

It also implements an Anupalabdhi (non-perception) confidence floor: when the corpus does not contain the answer, the engine returns "corpus does not contain this knowledge" as a positive verdict, refusing to leak weak chunks. This pairs naturally with falsify-eval's Null A, since it addresses the silent-failure mode that load-bearing AI-safety arguments rely on assuming away.

The engine ships with a calibrated negative result: bench expansion N=21 → N=141 falsified the lift from the novel rerankers (Poincaré, topological persistence, fractal affinity), which now ship at production weight 0 and are documented as opt-in research components. The 3-channel φ-RRF baseline is the production default. This is the falsify-eval discipline applied to the authoring engine — same calibration that earned three clean rounds of adversarial review on this library.

Public release imminent at github.com/spalsh-spec/vak-kanaja, Apache 2.0, under the Bhardwaj & Sons brand. Priority announcement dated 2026-05-08.


Status

  • v0.1.6.11 — current. 91 tests passing on a fresh clone (Mayank-battery 31 + property-based 15 + scipy cross-check 11 + smoke 8 + validation 9 + CLI stdin 4 + Windows-encoding 3 + shell-mangled paths 6 + sundry 4); ~10 s on M1. CI matrix green on Ubuntu × {3.10, 3.11, 3.12} and macOS × {3.10, 3.11, 3.12}.
  • v0.1.6.11 — publish-workflow version-sync guard hardened: previously tried to import falsify_eval before the package was installed and failed at the version-check step; now reads __version__ and pyproject.toml's version directly via grep/sed so the tag, source files, and built artefact are cross-checked three ways without requiring an install.
  • v0.1.6.10 — distribution + arXiv build prep (infrastructure-only, no gate behaviour change): added .github/workflows/publish.yml for OIDC trusted publishing to PyPI on every v* tag push; added tools/build_arxiv.sh for converting PREPRINT.md to an arXiv-submittable LaTeX bundle via pandoc; added [tool.mutmut] config + docs/MUTATION_TESTING.md documenting the deferred status; added [project.optional-dependencies] dev bucket pinning mutmut, build, and twine.
  • v0.1.6.9 — added CS03 case-study scaffold (case_studies/cs03_aikosh_rag/) for the AIKosh internal RAG integration (Jasmeet Singh, in flight); added Tested-platforms log to README; renumbered v0.2 case studies (CS03 = AIKosh, CS04 = FiQA, CS05 = Quora).
  • v0.1.6.8 — empirical equivariance certificate: PREPRINT §5.9 + property tests proving the gate is strongly equivariant under order-preserving label-set bijections and Null C / real_mean are exactly equivariant under arbitrary bijections.
  • v0.1.6.7 — declared hypothesis>=6.0 as a test dep so CI installs it. (Caught by CI matrix the moment v0.1.6.6 landed.)
  • v0.1.6.6 — Hypothesis property-based test suite for the four-null gate: 13 universally-true properties (algebraic, deterministic, metric, gate-semantics, validation), each fuzzed against ~80 random benches per CI run.
  • v0.1.6.5 — cross-platform path-mangling hint: when --input my-bench\bench.jsonl is copy-pasted into zsh and the backslash gets eaten, the CLI now suggests the corrected forward-slash path instead of a bare FileNotFoundError.
  • v0.1.6.4 — Windows console UTF-8 / ASCII output hardening (closes Jasmeet's cp1252 UnicodeEncodeError on the Δ glyph): reconfigure stdout to UTF-8 with errors='replace' at CLI entry, with auto-fallback to ASCII glyphs (Δ→d, τ→tau, ✓→[ok]) when the post-reconfigure stream still can't encode them. Also --ascii flag and FALSIFY_ASCII=1 env var.
  • v0.1.6.3 — public priority announcement of companion engine Vāk-Kaṇaja.
  • v0.1.6.2 — Mayank round-3 polish: negative-seed validation in _validate_inputs.
  • v0.1.6.1 — Mayank round-2: CLI --input - now reads from stdin (was FileNotFoundError: '-').
  • v0.1.6 — bonferroni helper, scipy cross-check tests, property-based tests, CS02 SciFact case study, PREPRINT scope-honesty rewrite, AI/retrieval conflation strike across surfaces.
  • v0.1.5.2 — added progress=True flag to four_null_gate after Mayank's 5-hour AIKosh silent-run incident.
  • v0.1.5.1 — closed null_a defect class for tuple / dataclass labels.
  • v0.1.5 — fixed all 14 defects from the Mayank Singh adversarial battery; full credit in CHANGELOG.md.
  • v0.2 (next) — PyPI publish; case studies CS03 (AIKosh internal RAG, scaffolded — see case_studies/cs03_aikosh_rag/), CS04 (FiQA) and CS05 (Quora) for metric-sensitivity triangulation; broken-predictor zoo as a public artifact; label_order_seed parameter to break dependency on adversarial label ordering (see PREPRINT §5.9).
  • v0.3+ (planned) — extension to LLM free-text and summarisation; pre-registration tooling. (Not yet shipped — do not claim coverage.)

Tested platforms

External-verification log. Each entry is a real run by a real person who is not the package author, dated, with the exact version they ran. New entries go at the top.

| Date | Tester | OS | Python | Shell | Version | Notes |
|---|---|---|---|---|---|---|
| 2026-05-08 | Jasmeet Singh (AIKosh) | Windows 10 (19045) | 3.14.3 | PowerShell | 0.1.6.7 | install / upgrade 0.1.6.2→0.1.6.7 / doctor / quickstart / grade all clean; original cp1252 defect closed. CS03 integration with AIKosh's internal RAG retriever in flight. |
| 2026-05-07 | Mayank Singh | macOS 14 (M1) | 3.12 | zsh | 0.1.5 → 0.1.6.2 | adversarial 14-defect battery; all closed. |

Issues and PRs welcome. The reference implementation is intentionally minimal; the goal is for the protocol to be small enough that adopters audit the entire library before depending on it.


A house of standards. Released by Bhardwaj & Sons under Apache 2.0.
The methodology is free, public, and citable so it can become a standard rather than a product.

About

Calibrated falsification harness for retrieval & ranking. Four-null gate (incl. gold-marginal-matched random, novel) + SHA-256/git-commit integrity lock. Catches predictors that look right but aren't.
