Benchmark for long-list entity extraction from semi-structured documents under layout and OCR noise, inspired by recurring patterns observed in real-world claims documents.
This benchmark was developed at Kay.ai.
```bash
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
python -m pip install -r benchmarks/requirements.txt
python -m playwright install chromium

# Set API keys (only needed for OCR/evaluation runs)
cp .env.example .env

# Generate the complete benchmark dataset
python benchmarks/generate_claims_benchmark.py
```

Convenience targets are provided via the repository root Makefile:
```bash
make help

# Create venv + install deps + install Playwright Chromium
make setup

# Generate synthetic benchmark dataset (PDF/HTML/JSON)
make generate

# Build the paper
make paper
```

See benchmarks/README.md for benchmark documentation.
- Version: see `VERSION`.
- Citation metadata: see `CITATION.cff`.
- 80 benchmark instances across 4 difficulty tiers × 2 formats
- 2,700 base claims across all instances (some instances include additional rows due to `large_doc` and `duplicates`)
- 7 problem types testing real-world complexity (all implemented)
- 2 document formats (detailed and table views)
- Ground truth annotations in JSON format
- OCR-processed PDFs simulating production scenarios
| Code | Meaning |
|---|---|
| `page_breaks` | A single incident/row is split across PDF pages (content continues on the next page). |
| `multi_row` | Key fields (especially descriptions) span multiple lines/rows instead of being single-line. |
| `duplicates` | Duplicate incidents are inserted (exact repeats) to test deduplication and counting. |
| `large_doc` | Document is much longer than normal (many more incidents/pages). |
| `multiple_tables` | Adds additional irrelevant tables/sections mixed in with the main claims content. |
| `multi_column` | Uses a multi-column layout in parts of the document to stress reading order. |
| `merged_cells` | Uses merged table cells (e.g. rowspan/colspan) to make table structure harder. |
| Tier | Claims/PDF | Instances | Formats | Problems |
|---|---|---|---|---|
| Easy | 10 | 15×2 = 30 | Detailed + Table | 1-2 |
| Medium | 25 | 12×2 = 24 | Detailed + Table | 3-4 |
| Hard | 50 | 8×2 = 16 | Detailed + Table | 5-6 |
| Extreme | 100 | 5×2 = 10 | Detailed + Table | All 7 |
Note: these are nominal sizes; the released dataset includes additional rows from `duplicates` and `large_doc`. In the current release, ground-truth incident counts per document are 10–11 (easy), 25–27 (medium), 55 (hard), and 500 (extreme).
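The 2,700 base-claim figure follows directly from the nominal tier sizes in the table above, as a quick sanity check confirms:

```python
# Nominal sizes from the tier table: (claims per PDF, instance count).
tiers = {
    "easy": (10, 30),
    "medium": (25, 24),
    "hard": (50, 16),
    "extreme": (100, 10),
}
base_claims = sum(claims * instances for claims, instances in tiers.values())
print(base_claims)  # 300 + 600 + 800 + 1000 = 2700
```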
- Detailed: Incident sections with line items and financial breakdowns
- Table: Compact tabular format with all claims in rows
Using the synchronized benchmark snapshot from this repository, we highlight two local extraction regimes:
| Regime | Overall weighted micro F1 | Extreme-tier weighted micro F1 | Representative extreme F1 |
|---|---|---|---|
| Full-context one-shot | 27.4% | 5.9% | 5.7% (extreme_100_001_detailed) |
| Auto-chunked (longlistbench) | 84.8% | 81.7% | 63.3% (extreme_100_001_detailed) |
The local one-shot regime remains strong on easy documents (97.2%), but drops to 74.6% on medium, 44.4% on hard, and 5.9% on extreme. The simplified local auto-chunked regime reaches 97.3% weighted F1 on easy, 96.5% on medium, 87.7% on hard, 71.0% on detailed documents overall, and 95.9% on table documents overall. Chunking therefore mitigates the catastrophic long-context failure mode, but the simpler released baseline still leaves substantial residual errors, especially on long detailed documents.
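The scores above are weighted micro F1 over field-level matches. The released scorer computes this internally; the sketch below is a simplification that assumes per-field true positives, false positives, and false negatives have already been pooled across documents.

```python
def micro_f1(tp, fp, fn):
    """Micro F1 from pooled true-positive/false-positive/false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```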
For development and testing, see benchmarks/synthetic/README.md for the synthetic data generator.
Optional: install a pre-commit hook to quickly sanity-check that the paper compiles:
```bash
# From the repository root
cp pre-commit .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
```

The hook runs a fast LaTeX compile (`make quick`) in the paper directory; in strict mode it can prevent the commit if compilation fails.
By default, the hook is best-effort and will skip (or warn) when dependencies are missing. To make paper compilation failures block commits, set:
```bash
export STRICT_PAPER_COMPILE=1
```

Manually invoking the hook:
```bash
# Test the hook without committing
.git/hooks/pre-commit
```

Alternatively, run the same check from your virtualenv:
```bash
source .venv/bin/activate
make -C paper quick
```

Note: you can skip the hook for a specific commit using:

```bash
git commit --no-verify
```

- LaTeX distribution (TeX Live, MacTeX, or similar)
- `pdflatex` and `biber` must be available in your PATH