
YourBench: A Dynamic Benchmark Generation Framework


[GitHub] · [Dataset] · [Documentation] · [Paper]


Generate high-quality QA pairs and evaluation datasets from any source documents. YourBench transforms your PDFs, Word docs, and text files into structured benchmark datasets with configurable output formats. Appearing at COLM 2025. 100% free and open source.

Features

  • Document Ingestion – Parse PDFs, Word docs, HTML, and text files into standardized Markdown
  • Question Generation – Create single-hop and multi-hop questions with customizable schemas
  • Custom Output Schemas – Define your own Pydantic models for question/answer format
  • Multi-Model Support – Use different LLMs for different pipeline stages
  • HuggingFace Integration – Push datasets directly to the Hub or save locally
  • Quality Filtering – Citation scoring and deduplication built-in

Quick Start

Use uv to run the packaged CLI directly:

uvx --from yourbench yourbench run example/default_example/config.yaml --debug

The example config works out-of-the-box with env vars from .env (see .env.template).

Install locally if you prefer:

uv pip install yourbench
yourbench run example/default_example/config.yaml

Installation

Requires Python 3.12+.

# With uv (recommended)
uv pip install yourbench

# With pip
pip install yourbench

From source:

git clone https://github.com/huggingface/yourbench.git
cd yourbench
pip install -e .

Usage

Minimal config:

hf_configuration:
  hf_dataset_name: my-benchmark

model_list:
  - model_name: openai/gpt-4o-mini
    api_key: $OPENAI_API_KEY

pipeline:
  ingestion:
    source_documents_dir: ./my-documents
  summarization:
  chunking:
  single_shot_question_generation:
  prepare_lighteval:

Then run it:

yourbench run config.yaml
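Each key under pipeline enables one stage, and stages run in the order listed; leaving a stage's value empty enables it with default settings. As a rough illustration in plain Python (not YourBench internals — the dict below just mirrors the example config):

```python
# Sketch: a `pipeline` section read as an ordered list of stages.
# An empty YAML value (here, None) means "enabled with defaults".
config = {
    "pipeline": {
        "ingestion": {"source_documents_dir": "./my-documents"},
        "summarization": None,
        "chunking": None,
        "single_shot_question_generation": None,
        "prepare_lighteval": None,
    }
}

# Python dicts preserve insertion order, so this is the run order.
stages = list(config["pipeline"])
print(stages)
```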

With custom output schema:

pipeline:
  single_shot_question_generation:
    question_schema: ./my_schema.py  # Must export a DataFormat class

The schema file defines a Pydantic model named DataFormat:

# my_schema.py
from pydantic import BaseModel, Field

class DataFormat(BaseModel):
    question: str = Field(description="The question")
    answer: str = Field(description="The answer")
    difficulty: str = Field(description="easy, medium, or hard")
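To sanity-check a schema before a full run, you can exercise it directly — this is plain Pydantic usage, and the field values below are made up for illustration:

```python
from pydantic import BaseModel, Field, ValidationError

class DataFormat(BaseModel):
    question: str = Field(description="The question")
    answer: str = Field(description="The answer")
    difficulty: str = Field(description="easy, medium, or hard")

# A well-formed row validates cleanly.
row = DataFormat(
    question="What does YourBench generate?",
    answer="QA pairs and evaluation datasets.",
    difficulty="easy",
)

# A row with missing fields is rejected rather than silently emitted.
try:
    DataFormat(question="Incomplete row")
    rejected = False
except ValidationError:
    rejected = True
print(row.difficulty, rejected)
```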

Documentation

  • Configuration – Full config reference with all options
  • Custom Schemas – Define your own output formats
  • How It Works – Pipeline architecture and stages
  • FAQ – Common questions and troubleshooting
  • OpenAI-Compatible Models – Use vLLM, Ollama, etc.
  • Dataset Columns – Output field descriptions
  • Academic Paper – COLM 2025 submission

Try Online

No installation needed:

Example Configs

The example/ folder contains ready-to-use configurations:

  • default_example/ – Basic setup with sample documents
  • harry_potter_quizz/ – Generate quiz questions from books
  • custom_prompts_demo/ – Custom prompts for domain-specific questions
  • local_vllm_private_data/ – Use local models for private data
  • rich_pdf_extraction_with_gemini/ – LLM-based PDF extraction for charts/figures

Run any example:

yourbench run example/default_example/config.yaml

API Keys

Set in environment or .env file:

HF_TOKEN=hf_xxx              # For Hub upload
OPENAI_API_KEY=sk-xxx        # For OpenAI models

Use $VAR_NAME in config to reference environment variables.
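This is ordinary shell-style variable expansion. YourBench's own resolver may differ in details, but you can preview the behavior with the Python standard library:

```python
import os

# Simulate a key loaded from .env (value is a placeholder).
os.environ["OPENAI_API_KEY"] = "sk-demo"

# $VAR_NAME is replaced from the environment; unset variables are
# left as-is, which makes a missing key easy to spot in the config.
print(os.path.expandvars("$OPENAI_API_KEY"))   # expands to the key
print(os.path.expandvars("$UNSET_VAR_DEMO"))   # left unchanged
```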

Contributing

PRs welcome! Open an issue first for major changes.

📈 Progress

📜 License

Apache 2.0 – see LICENSE.

📚 Citation

@misc{shashidhar2025yourbencheasycustomevaluation,
      title={YourBench: Easy Custom Evaluation Sets for Everyone},
      author={Sumuk Shashidhar and Clémentine Fourrier and Alina Lozovskia and Thomas Wolf and Gokhan Tur and Dilek Hakkani-Tür},
      year={2025},
      eprint={2504.01833},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.01833}
}
