harness: Add Vidore V3 benchmark and BEIR metrics support by jioffe502 · Pull Request #1378 · NVIDIA/NeMo-Retriever

jioffe502 · 2026-02-05T18:12:45Z

Description

Adds Vidore V3 benchmark support and BEIR evaluation metrics to the test harness.

Changes

- Add Vidore V3 dataset configurations with HuggingFace integration for ground truth
- Add dataset groups feature for running multiple datasets (e.g., --dataset=vidore)
- Add optional BEIR metrics (NDCG, MAP, Precision) for recall evaluation

Results: https://docs.google.com/spreadsheets/d/137poeB7CmDE7AmaiLalOM1qZZvDlCDRkBfMD6IjBBAs/edit?gid=0#gid=0

Dependent on #1305

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

- Add 8 Vidore V3 dataset configurations (finance_en, industrial, computer_science, pharmaceuticals, hr, energy, physics, finance_fr)
- Add vidore_load_ground_truth() using HuggingFace datasets API
- Add vidore_recall() evaluator with PDF-only matching
- Add extract_page_as_image, extract_method, image_elements_modality config options to support Vidore's OCR-based page image retrieval
- Add datasets>=2.0.0 dependency for HuggingFace qrels loading

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

- Add dataset_groups section to test_configs.yaml with vidore, vidore_english, vidore_quick groups
- Add expand_dataset_names() in config.py to handle group expansion
- Add --list-datasets CLI option to show available datasets and groups
- Update README.md with dataset groups documentation

Usage:
    uv run nv-ingest-harness-run --list-datasets
    uv run nv-ingest-harness-run --case=e2e_recall --dataset=vidore
    uv run nv-ingest-harness-run --case=e2e_recall --dataset=vidore_quick

Note: test_configs.yaml includes temp test settings (vdb_backend: milvus, reranker_mode: none, modified vidore_quick) - revert after testing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
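The group expansion described in this commit could look roughly like the sketch below. It is not the harness's actual expand_dataset_names(): the signature, the comma-separated input handling, and the dataset names inside EXAMPLE_GROUPS are assumptions made for illustration, and the real group membership lives in test_configs.yaml.

```python
from typing import Dict, List


def expand_dataset_names(requested: str, groups: Dict[str, List[str]]) -> List[str]:
    """Expand a comma-separated --dataset value, replacing group names with members.

    Hypothetical signature; the real helper in config.py may differ.
    """
    expanded: List[str] = []
    for name in (part.strip() for part in requested.split(",")):
        if not name:
            continue
        # A group name expands to its members; anything else is kept as a dataset name.
        for dataset in groups.get(name, [name]):
            if dataset not in expanded:  # keep the first occurrence, preserve order
                expanded.append(dataset)
    return expanded


# Illustrative group definitions only (hypothetical dataset names).
EXAMPLE_GROUPS = {
    "vidore_quick": ["vidore_v3_hr"],
    "vidore_english": ["vidore_v3_hr", "vidore_v3_physics"],
}
```

Under these assumptions, passing a group name and a plain dataset name together yields one flat, de-duplicated list, so a dataset that belongs to two requested groups would only be run once.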

- Add optional BEIR evaluation (NDCG, MAP, Precision) to recall tests
- Configurable via enable_beir in test_configs.yaml or ENABLE_BEIR env var
- Add beir>=2.0.0 dependency to harness
- Add nvidia/llama-nemotron-embed-vl-1b-v2 to known embedding models

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
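For readers unfamiliar with the metrics named in this commit, here is a dependency-free sketch of two of them, Precision@k and NDCG@k, over graded relevance labels. It does not call the beir package and is not the harness's evaluation code; the function names are hypothetical, and the NDCG variant shown (linear gain, log2 rank discount) is one common convention that may differ from what the harness reports.

```python
import math
from typing import Dict, List


def precision_at_k(ranked: List[str], relevant: Dict[str, int], k: int) -> float:
    """Fraction of the top-k results whose relevance score is above zero."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if relevant.get(doc_id, 0) > 0)
    return hits / k


def ndcg_at_k(ranked: List[str], relevant: Dict[str, int], k: int) -> float:
    """NDCG@k with linear gain and a log2 rank discount."""
    if k <= 0:
        raise ValueError("k must be positive")

    def dcg(gains: List[int]) -> float:
        # enumerate() is zero-based, so rank 0 is discounted by log2(2) = 1.
        return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))

    actual = dcg([relevant.get(doc_id, 0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(relevant.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

Unlike plain recall, NDCG rewards placing relevant documents earlier in the ranking, which is presumably why it is useful alongside the existing recall numbers.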

- Add embed model fallback detection (dim=1024 warning) to e2e.py and recall.py
- Add Milvus collection vector dimension verification after ingestion
- Enable BEIR metrics by default for all Vidore V3 datasets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
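A dimension check of the kind this commit mentions might be sketched as follows. The helper name, its return value, and the logger name are hypothetical. The only detail taken from the commit message is that a dimension of 1024 is treated as a sign that a fallback embedding model was used. The expected dimension is a caller-supplied parameter here, because the PR does not state it.

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger("harness.sketch")  # hypothetical logger name

FALLBACK_DIM = 1024  # dimension named in the commit message as the fallback signal


def check_embedding_dim(vector: Sequence[float], expected_dim: int) -> Optional[str]:
    """Return a warning message when the observed dimension looks wrong, else None."""
    observed = len(vector)
    if observed == expected_dim:
        return None
    if observed == FALLBACK_DIM:
        message = f"embedding dim {observed} suggests a fallback model (expected {expected_dim})"
    else:
        message = f"unexpected embedding dim {observed} (expected {expected_dim})"
    logger.warning(message)
    return message
```

The same comparison could in principle be applied to the vector dimension recorded in the Milvus collection schema after ingestion, though how the harness reads that value is not shown in this PR.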

Aligns harness evaluation with the Vidore V3 notebook ground truth:

- vidore_load_ground_truth now builds full qrels with all relevant docs and graded relevance scores (1=partial, 2=high) instead of collapsing to single doc with binary relevance
- Dedup retrieved PDFs in recall scoring and BEIR metrics to avoid multiple chunks from the same PDF inflating top-k positions
- Add language_filter config for isolating English-only query evaluation
- Add warm-up sleep and TeeFile close guard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
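The two evaluation changes in this commit, graded qrels and de-duplication of retrieved PDFs, can be illustrated with a small sketch. Both helper names and the (query_id, doc_id, score) row shape are assumptions; the actual vidore_load_ground_truth() reads its data through the HuggingFace datasets API, which is not reproduced here.

```python
from typing import Dict, Iterable, List, Tuple


def build_qrels(rows: Iterable[Tuple[str, str, int]]) -> Dict[str, Dict[str, int]]:
    """Group (query_id, doc_id, score) rows into {query_id: {doc_id: score}}.

    Every relevant document is kept with its graded score (1=partial, 2=high)
    instead of reducing each query to a single binary-relevant document.
    """
    qrels: Dict[str, Dict[str, int]] = {}
    for query_id, doc_id, score in rows:
        if score > 0:
            per_query = qrels.setdefault(query_id, {})
            # If a pair appears twice, keep the higher grade.
            per_query[doc_id] = max(score, per_query.get(doc_id, 0))
    return qrels


def dedup_preserving_rank(retrieved: List[str]) -> List[str]:
    """Keep only the first (best-ranked) occurrence of each PDF identifier."""
    seen = set()
    unique: List[str] = []
    for pdf_id in retrieved:
        if pdf_id not in seen:
            seen.add(pdf_id)
            unique.append(pdf_id)
    return unique
```

Without the de-duplication step, several chunks from one PDF could occupy multiple top-k slots and push other relevant PDFs out of the cutoff, which would distort both recall and the rank-sensitive BEIR metrics.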

ChrisJar

Thanks for working on this!

jioffe502 requested a review from a team as a code owner February 5, 2026 18:12

jioffe502 requested review from ChrisJar, charlesbluca and drobison00 and removed request for drobison00 February 5, 2026 18:12

jioffe502 marked this pull request as draft February 5, 2026 18:13

jioffe502 and others added 4 commits February 23, 2026 19:56

jioffe502 force-pushed the vidore-v3-benchmark branch from f4b0c94 to e8af1a5 Compare February 23, 2026 20:20

jioffe502 changed the title ~~[DRAFT] harness: Add Vidore V3 benchmark and BEIR metrics support~~ harness: Add Vidore V3 benchmark and BEIR metrics support Feb 24, 2026

jioffe502 marked this pull request as ready for review February 24, 2026 22:59

Merge branch 'main' into vidore-v3-benchmark

197812b

ChrisJar approved these changes Feb 25, 2026


jioffe502 added 2 commits February 25, 2026 13:58

Merge branch 'main' into vidore-v3-benchmark

91fb44f

Merge branch 'main' into vidore-v3-benchmark

04c4298

jioffe502 wants to merge 8 commits into NVIDIA:main from jioffe502:vidore-v3-benchmark

2 participants
