
harness: Add Vidore V3 benchmark and BEIR metrics support #1378

Open
jioffe502 wants to merge 8 commits into NVIDIA:main from jioffe502:vidore-v3-benchmark


jioffe502 commented Feb 5, 2026

Description

Adds Vidore V3 benchmark support and BEIR evaluation metrics to the test harness.

Changes

  • Add Vidore V3 dataset configurations with HuggingFace integration for ground truth
  • Add dataset groups feature for running multiple datasets (e.g., --dataset=vidore)
  • Add optional BEIR metrics (NDCG, MAP, Precision) for recall evaluation

Results: https://docs.google.com/spreadsheets/d/137poeB7CmDE7AmaiLalOM1qZZvDlCDRkBfMD6IjBBAs/edit?gid=0#gid=0

Dependent on #1305

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

jioffe502 requested a review from a team as a code owner February 5, 2026 18:12
jioffe502 requested review from ChrisJar, charlesbluca and drobison00 and removed request for drobison00 February 5, 2026 18:12
jioffe502 marked this pull request as draft February 5, 2026 18:13
jioffe502 and others added 4 commits February 23, 2026 19:56
- Add 8 Vidore V3 dataset configurations (finance_en, industrial,
  computer_science, pharmaceuticals, hr, energy, physics, finance_fr)
- Add vidore_load_ground_truth() using HuggingFace datasets API
- Add vidore_recall() evaluator with PDF-only matching
- Add extract_page_as_image, extract_method, image_elements_modality
  config options to support Vidore's OCR-based page image retrieval
- Add datasets>=2.0.0 dependency for HuggingFace qrels loading

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

- Add dataset_groups section to test_configs.yaml with vidore, vidore_english, vidore_quick groups
- Add expand_dataset_names() in config.py to handle group expansion
- Add --list-datasets CLI option to show available datasets and groups
- Update README.md with dataset groups documentation

Usage:
  uv run nv-ingest-harness-run --list-datasets
  uv run nv-ingest-harness-run --case=e2e_recall --dataset=vidore
  uv run nv-ingest-harness-run --case=e2e_recall --dataset=vidore_quick

Note: test_configs.yaml includes temp test settings (vdb_backend: milvus,
reranker_mode: none, modified vidore_quick) - revert after testing.
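
The group expansion described in this commit can be sketched roughly as follows. This is a minimal illustration, not the harness code: the signature of `expand_dataset_names()`, the comma-separated `--dataset` syntax, and the group and dataset names in `EXAMPLE_GROUPS` are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical group table. In the harness this would come from the
# dataset_groups section of test_configs.yaml; these names are placeholders.
EXAMPLE_GROUPS: Dict[str, List[str]] = {
    "vidore_quick": ["vidore_v3_hr", "vidore_v3_energy"],
}


def expand_dataset_names(spec: str, groups: Dict[str, List[str]]) -> List[str]:
    """Expand a comma-separated dataset spec, replacing group names with their members.

    Names that are not groups pass through unchanged. Order is preserved and
    duplicates are dropped so that no dataset runs twice.
    """
    expanded: List[str] = []
    for name in (part.strip() for part in spec.split(",")):
        if not name:
            continue  # tolerate stray commas such as "a,,b"
        for member in groups.get(name, [name]):
            if member not in expanded:
                expanded.append(member)
    return expanded


datasets = expand_dataset_names("vidore_quick,my_dataset,vidore_v3_hr", EXAMPLE_GROUPS)
print(datasets)
```

Dropping duplicates while keeping first-seen order means a dataset named both directly and through a group is evaluated once.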

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add optional BEIR evaluation (NDCG, MAP, Precision) to recall tests
- Configurable via enable_beir in test_configs.yaml or ENABLE_BEIR env var
- Add beir>=2.0.0 dependency to harness
- Add nvidia/llama-nemotron-embed-vl-1b-v2 to known embedding models
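
One way the `enable_beir` / `ENABLE_BEIR` toggle could be resolved is sketched below. The helper name `resolve_enable_beir`, the set of accepted truthy strings, and the rule that the environment variable overrides the YAML value are assumptions for illustration; the commit message does not state the precedence.

```python
import os
from typing import Mapping, Optional

# Strings treated as "true" when read from the environment (an assumption).
_TRUTHY = {"1", "true", "yes", "on"}


def resolve_enable_beir(
    config: Mapping[str, object],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Decide whether BEIR metrics should run.

    Assumed precedence: a non-empty ENABLE_BEIR environment variable wins,
    otherwise the enable_beir key from the loaded config is used (default False).
    """
    env = os.environ if env is None else env
    raw = env.get("ENABLE_BEIR")
    if raw is not None and raw.strip():
        return raw.strip().lower() in _TRUTHY
    return bool(config.get("enable_beir", False))


# Config disables BEIR, but the environment switches it on for this run.
print(resolve_enable_beir({"enable_beir": False}, {"ENABLE_BEIR": "true"}))
```

Passing `env` explicitly keeps the function testable without mutating the process environment.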

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add embed model fallback detection (dim=1024 warning) to e2e.py and recall.py
- Add Milvus collection vector dimension verification after ingestion
- Enable BEIR metrics by default for all Vidore V3 datasets
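
A dimension check of the kind this commit describes might look like the sketch below. It assumes the harness knows the dimension it expects for the configured model and treats an observed dimension of 1024 as a sign that a fallback embedding model served the request; the function name, logger name, and the numbers in the usage line are hypothetical.

```python
import logging
from typing import Optional

logger = logging.getLogger("harness.embedding_check")  # hypothetical logger name

FALLBACK_DIM = 1024  # the dimension named in the commit message


def check_embedding_dim(observed_dim: int, expected_dim: Optional[int], model_name: str) -> bool:
    """Return True when the observed vector dimension is consistent with expectations.

    If no expected dimension is known, nothing can be verified, so the check passes.
    """
    if expected_dim is None or observed_dim == expected_dim:
        return True
    if observed_dim == FALLBACK_DIM:
        logger.warning(
            "Embedding dim is %d but %s should produce %d; a fallback model may be in use.",
            observed_dim, model_name, expected_dim,
        )
    else:
        logger.warning(
            "Embedding dim %d does not match the %d expected for %s.",
            observed_dim, expected_dim, model_name,
        )
    return False


# Illustrative values only; real dimensions depend on the deployed model.
ok = check_embedding_dim(observed_dim=1024, expected_dim=2048, model_name="example-embed-model")
print(ok)
```

The same comparison could be applied to the vector field of a Milvus collection after ingestion, which is the second check the commit mentions.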

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

Aligns harness evaluation with the Vidore V3 notebook ground truth:
- vidore_load_ground_truth now builds full qrels with all relevant docs
  and graded relevance scores (1=partial, 2=high) instead of collapsing
  to single doc with binary relevance
- Dedup retrieved PDFs in recall scoring and BEIR metrics to avoid
  multiple chunks from the same PDF inflating top-k positions
- Add language_filter config for isolating English-only query evaluation
- Add warm-up sleep and TeeFile close guard
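
To show why graded qrels and PDF-level dedup matter, here is a small self-contained sketch. The document names and relevance grades are made up, and the NDCG below uses linear gain with a log2 discount written by hand; it is not the BEIR implementation and may differ from the numbers the harness reports.

```python
import math
from typing import Dict, List


def dedup_preserving_rank(doc_ids: List[str]) -> List[str]:
    """Keep only the first (highest-ranked) hit for each PDF."""
    seen = set()
    unique: List[str] = []
    for doc_id in doc_ids:
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append(doc_id)
    return unique


def ndcg_at_k(ranked: List[str], rels: Dict[str, int], k: int) -> float:
    """NDCG@k with linear gain: each relevant doc contributes grade / log2(rank + 1)."""
    dcg = sum(rels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked[:k]))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Graded qrels for one query: 2 = highly relevant, 1 = partially relevant.
qrels = {"report_a.pdf": 2, "report_b.pdf": 1}

# Chunk-level hits mapped to their source PDF; report_a.pdf appears twice.
retrieved = ["report_a.pdf", "report_a.pdf", "report_c.pdf", "report_b.pdf"]

deduped = dedup_preserving_rank(retrieved)
print(deduped)
print(ndcg_at_k(deduped, qrels, k=3))    # score on unique PDFs
print(ndcg_at_k(retrieved, qrels, k=3))  # repeated PDF is counted twice
```

In this toy case the non-deduplicated ranking scores above 1.0 because the same highly relevant PDF is credited at two positions, which is the inflation the dedup step is meant to prevent.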

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
jioffe502 changed the title from "[DRAFT] harness: Add Vidore V3 benchmark and BEIR metrics support" to "harness: Add Vidore V3 benchmark and BEIR metrics support" Feb 24, 2026
jioffe502 marked this pull request as ready for review February 24, 2026 22:59

ChrisJar left a comment


Thanks for working on this!
