Benchmark for long-list entity extraction from semi-structured documents under layout and OCR noise, inspired by recurring patterns observed in real-world claims documents.
This benchmark was developed at Kay.ai.
```bash
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
python -m pip install -r benchmarks/requirements.txt
python -m playwright install chromium

# Set API keys (only needed for OCR/evaluation runs)
cp .env.example .env

# Generate the complete benchmark dataset
python benchmarks/generate_claims_benchmark.py
```

Convenience targets are provided via the repository root Makefile:
```bash
make help

# Create venv + install deps + install Playwright Chromium
make setup

# Generate synthetic benchmark dataset (PDF/HTML/JSON)
make generate

# Build the paper
make paper
```

See benchmarks/README.md for benchmark documentation.
- Version: see `VERSION`.
- Citation metadata: see `CITATION.cff`.
- 80 benchmark instances across 4 difficulty tiers × 2 formats
- 2,700 base claims across all instances (some instances include additional rows due to `large_doc` and `duplicates`)
- 7 problem types testing real-world complexity (all implemented)
- 2 document formats (detailed and table views)
- Ground truth annotations in JSON format
- OCR-processed PDFs simulating production scenarios
| Code | Meaning |
|---|---|
| `page_breaks` | A single incident/row is split across PDF pages (content continues on the next page). |
| `multi_row` | Key fields (especially descriptions) span multiple lines/rows instead of being single-line. |
| `duplicates` | Duplicate incidents are inserted (exact repeats) to test deduplication and counting. |
| `large_doc` | Document is much longer than normal (many more incidents/pages). |
| `multiple_tables` | Adds additional irrelevant tables/sections mixed in with the main claims content. |
| `multi_column` | Uses a multi-column layout in parts of the document to stress reading order. |
| `merged_cells` | Uses merged table cells (e.g. rowspan/colspan) to make table structure harder. |
| Tier | Claims/PDF | Instances | Formats | Problems |
|---|---|---|---|---|
| Easy | 10 | 15×2 = 30 | Detailed + Table | 1-2 |
| Medium | 25 | 12×2 = 24 | Detailed + Table | 3-4 |
| Hard | 50 | 8×2 = 16 | Detailed + Table | 5-6 |
| Extreme | 100 | 5×2 = 10 | Detailed + Table | All 7 |
Note: these are nominal sizes; the released dataset includes additional rows from `duplicates` and `large_doc`. In the current release, ground-truth incident counts per document are 10–11 (easy), 25–27 (medium), 55 (hard), and 500 (extreme).
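The 2,700 base-claim figure follows directly from the nominal tier sizes in the table above, as a quick sanity check confirms:

```python
# Nominal sizes from the tier table: (claims per PDF, instance count).
tiers = {
    "easy": (10, 30),
    "medium": (25, 24),
    "hard": (50, 16),
    "extreme": (100, 10),
}
base_claims = sum(claims * instances for claims, instances in tiers.values())
print(base_claims)  # 300 + 600 + 800 + 1000 = 2700
```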
- Detailed: Incident sections with line items and financial breakdowns
- Table: Compact tabular format with all claims in rows
Using the synchronized benchmark snapshot from this repository, we highlight two local extraction regimes:
| Regime | Overall weighted micro F1 | Extreme-tier weighted micro F1 | Representative extreme F1 |
|---|---|---|---|
| Full-context one-shot | 27.4% | 5.9% | 5.7% (extreme_100_001_detailed) |
| Auto-chunked (longlistbench) | 84.8% | 81.7% | 63.3% (extreme_100_001_detailed) |
The local one-shot regime remains strong on easy documents (97.2%), but drops to 74.6% on medium, 44.4% on hard, and 5.9% on extreme. The simplified local auto-chunked regime reaches 97.3% weighted F1 on easy, 96.5% on medium, 87.7% on hard, 71.0% on detailed documents overall, and 95.9% on table documents overall. Chunking therefore mitigates the catastrophic long-context failure mode, but the simpler released baseline still leaves substantial residual errors, especially on long detailed documents.
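The scores above are weighted micro F1 over field-level matches. The released scorer computes this internally; the sketch below is a simplification that assumes per-field true positives, false positives, and false negatives have already been pooled across documents.

```python
def micro_f1(tp, fp, fn):
    """Micro F1 from pooled true-positive/false-positive/false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```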
For development and testing, see benchmarks/synthetic/README.md for the synthetic data generator.
Optional: install a pre-commit hook to quickly sanity-check that the paper compiles:
```bash
# From the repository root
cp pre-commit .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
```

The hook runs a fast LaTeX compile (`make quick`) in the paper directory; in strict mode it can prevent the commit if compilation fails.
By default, the hook is best-effort and will skip (or warn) when dependencies are missing. To make paper compilation failures block commits, set:
```bash
export STRICT_PAPER_COMPILE=1
```

Manually invoking the hook:
```bash
# Test the hook without committing
.git/hooks/pre-commit
```

Alternatively, run the same check from your virtualenv:
```bash
source .venv/bin/activate
make -C paper quick
```

Note: you can skip the hook for a specific commit using:

```bash
git commit --no-verify
```

- LaTeX distribution (TeX Live, MacTeX, or similar)
- `pdflatex` and `biber` must be available in your PATH