LongListBench

Benchmark for long-list entity extraction from semi-structured documents under layout and OCR noise, inspired by recurring patterns observed in real-world claims documents.

This benchmark was developed at Kay.ai.

Quick Start

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
python -m pip install -r benchmarks/requirements.txt
python -m playwright install chromium

# Configure API keys (only needed for OCR/evaluation runs)
cp .env.example .env  # then edit .env to add your keys

# Generate the complete benchmark dataset
python benchmarks/generate_claims_benchmark.py

Reproducibility

Convenience targets are provided via the repository root Makefile:

make help

# Create venv + install deps + install Playwright Chromium
make setup

# Generate synthetic benchmark dataset (PDF/HTML/JSON)
make generate

# Build the paper
make paper

See benchmarks/README.md for benchmark documentation.

Versioning and Citation

  • Version: see VERSION.
  • Citation metadata: see CITATION.cff.

Benchmark Overview

  • 80 benchmark instances across 4 difficulty tiers × 2 formats
  • 2,700 base claims across all instances (some instances include additional rows due to large_doc and duplicates)
  • 7 problem types testing real-world complexity (all implemented)
  • 2 document formats (detailed and table views)
  • Ground truth annotations in JSON format
  • OCR-processed PDFs simulating production scenarios
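Ground-truth annotations are shipped as JSON. As an illustrative sketch only — the field names (`instance_id`, `incidents`, `claim_number`, and so on) are assumptions, not the benchmark's actual schema — a record can be loaded and its incidents counted like this:

```python
import json

# Hypothetical ground-truth record; the released schema may differ.
ground_truth = json.loads("""
{
  "instance_id": "easy_10_001_table",
  "tier": "easy",
  "format": "table",
  "incidents": [
    {"claim_number": "CLM-0001", "date_of_loss": "2023-01-15", "paid": 1250.00},
    {"claim_number": "CLM-0002", "date_of_loss": "2023-02-03", "paid": 980.50}
  ]
}
""")

incident_count = len(ground_truth["incidents"])
print(incident_count)  # → 2 ground-truth incidents in this toy instance
```

Counting `incidents` per file is how the per-document totals in the tier table below can be verified against the released data.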

Problem Types

Code             Meaning
page_breaks      A single incident/row is split across PDF pages (content continues on the next page).
multi_row        Key fields (especially descriptions) span multiple lines/rows instead of being single-line.
duplicates       Duplicate incidents are inserted (exact repeats) to test deduplication and counting.
large_doc        Document is much longer than normal (many more incidents/pages).
multiple_tables  Adds additional irrelevant tables/sections mixed in with the main claims content.
multi_column     Uses a multi-column layout in parts of the document to stress reading order.
merged_cells     Uses merged table cells (e.g. rowspan/colspan) to make table structure harder.
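Because the `duplicates` problem type inserts exact repeats, a scorer or extractor needs a deduplication step. A minimal sketch, keying each extracted incident on its full field tuple (the field names here are illustrative, not the benchmark's schema):

```python
# Collapse exact-repeat incidents, as inserted by the `duplicates` problem type.
extracted = [
    {"claim_number": "CLM-0001", "paid": 1250.00},
    {"claim_number": "CLM-0002", "paid": 980.50},
    {"claim_number": "CLM-0001", "paid": 1250.00},  # exact duplicate row
]

seen = set()
unique = []
for incident in extracted:
    key = tuple(sorted(incident.items()))  # hashable fingerprint of all fields
    if key not in seen:
        seen.add(key)
        unique.append(incident)

print(len(unique))  # → 2 after collapsing the exact repeat
```

Note this only handles exact repeats; near-duplicates (e.g. OCR-perturbed copies) would need fuzzier matching.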

Difficulty Tiers

Tier     Claims/PDF  Instances   Formats           Problems
Easy     10          15×2 = 30   Detailed + Table  1-2
Medium   25          12×2 = 24   Detailed + Table  3-4
Hard     50          8×2 = 16    Detailed + Table  5-6
Extreme  100         5×2 = 10    Detailed + Table  All 7

Note: these are nominal sizes; the released dataset includes additional rows from duplicates and large_doc. In the current release, ground-truth incident counts per document are 10–11 (easy), 25–27 (medium), 55 (hard), and 500 (extreme).
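The nominal sizes in the tier table are consistent with the headline dataset figures: summing instances × claims per tier recovers the 2,700 base claims and 80 instances quoted above.

```python
# Sanity-check the tier table: (instances, nominal claims per PDF) per tier.
tiers = {
    "easy":    (30, 10),
    "medium":  (24, 25),
    "hard":    (16, 50),
    "extreme": (10, 100),
}

total_instances = sum(instances for instances, _ in tiers.values())
total_claims = sum(instances * claims for instances, claims in tiers.values())
print(total_instances, total_claims)  # → 80 2700
```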

Document Formats

  • Detailed: Incident sections with line items and financial breakdowns
  • Table: Compact tabular format with all claims in rows

Verified Gemini 2.5 Baseline

Using the synchronized benchmark snapshot from this repository, we highlight two local extraction regimes:

Regime                        Overall weighted micro F1  Extreme-tier weighted micro F1  Representative extreme F1
Full-context one-shot         27.4%                      5.9%                            5.7% (extreme_100_001_detailed)
Auto-chunked (longlistbench)  84.8%                      81.7%                           63.3% (extreme_100_001_detailed)

The local one-shot regime remains strong on easy documents (97.2%), but drops to 74.6% on medium, 44.4% on hard, and 5.9% on extreme. The simplified local auto-chunked regime reaches 97.3% weighted F1 on easy, 96.5% on medium, 87.7% on hard, 71.0% on detailed documents overall, and 95.9% on table documents overall. Chunking therefore mitigates the catastrophic long-context failure mode, but the simpler released baseline still leaves substantial residual errors, especially on long detailed documents.
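The auto-chunked regime's mechanism can be sketched as splitting the document into overlapping chunks, extracting per chunk, and merging the results. This is a hypothetical illustration of the idea, not the repository's actual chunker; the chunk size, overlap, and function names are assumptions:

```python
# Split a long document's lines into overlapping chunks so that an incident
# straddling a chunk boundary still appears whole in at least one chunk.
def chunk(lines, size=40, overlap=5):
    step = size - overlap
    return [lines[i:i + size] for i in range(0, max(len(lines) - overlap, 1), step)]

doc = [f"incident line {i}" for i in range(100)]
chunks = chunk(doc)
print(len(chunks), len(chunks[0]))  # → 3 40
```

Per-chunk extractions would then be merged with deduplication (to absorb incidents repeated in the overlap), which is why chunking avoids the one-shot regime's long-context collapse at the cost of some residual boundary errors.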

Development

For development and testing, see benchmarks/synthetic/README.md for the synthetic data generator.

Development Setup

Installing the Pre-Commit Hook

Optional: install a pre-commit hook to quickly sanity-check that the paper compiles:

# From the repository root
cp pre-commit .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit

The hook runs a fast LaTeX compile (make quick) in the paper directory; in strict mode it can prevent the commit if compilation fails.

By default, the hook is best-effort and will skip (or warn) when dependencies are missing. To make paper compilation failures block commits, set:

export STRICT_PAPER_COMPILE=1

Manually invoking the hook:

# Test the hook without committing
.git/hooks/pre-commit

Alternatively, run the same check from your virtualenv:

source .venv/bin/activate
make -C paper quick

Note: You can skip the hook for a specific commit using:

git commit --no-verify

Requirements

  • LaTeX distribution (TeX Live, MacTeX, or similar)
  • pdflatex and biber must be available in your PATH
