A clean, engineering-first evaluation framework for LLM benchmarks. It uses uv-managed environments, a single minimal model config, and produces one self-contained Markdown report per run against any OpenAI-compatible endpoint.
- `uv` as the default environment and dependency workflow
- Installable package (`hatchling` build backend) with an `llm-eval` console script
- Single model config file, with `${ENV_VAR}` placeholders for secrets
- 11 built-in tasks across code generation, math reasoning, multiple-choice knowledge, and instruction following
- OpenAI-compatible client with streaming, retries, and usage accounting
- One self-contained Markdown report per run (config + metrics + per-case results)
- Per-subject / per-domain accuracy breakdown for all knowledge tasks
- Optional thinking mode with configurable `reasoning_effort`
- Live progress bar with pass/fail counts during evaluation
- Quality tooling wired in: `ruff` (lint + format), `mypy` (type checking), `pytest`
Project layout:

    llm_eval/
      cli.py
      clients.py
      reporting.py
      runner.py
      settings.py
      tasks.py
      utils.py
      ifeval/               # vendored Google IFEval checkers + scoring glue
    configs/
      model.example.yaml    # committed template (no secrets)
    datasets/
      HumanEval.jsonl
      HumanEvalPlus.jsonl
      MBPP.jsonl
      MBPPPlus.jsonl
      GSM8K.jsonl
      AIME2025.jsonl
      AIME2026.jsonl
      GPQA.jsonl
      MMLUPro.jsonl
      IFEval.jsonl
      LiveCodeBench.jsonl
    scripts/
      fetch_datasets.py     # re-download / regenerate datasets from Hugging Face
    tests/
    pyproject.toml

Recommended:

    uv sync

Install with dev dependencies (pytest, ruff, mypy and type stubs):

    uv sync --extra dev

Copy the committed template to a local config (the `configs/` directory is git-ignored apart from the example, so your real keys never get committed):

    cp configs/model.example.yaml configs/model.yaml

Example config:

    base_url: ${OPENAI_BASE_URL:-https://api.openai.com/v1}
    api_key: ${OPENAI_API_KEY}
    model_name: gpt-4.1-mini
    workers: 10
    timeout_seconds: 120
    execution_timeout_seconds: 20
    thinking_enabled: false
    reasoning_effort:

Prefer `${ENV_VAR}` placeholders for secrets so the API key is never written to disk. Recognized fields:

| Field | Purpose |
|---|---|
| `base_url` | OpenAI-compatible endpoint |
| `api_key` | API key (use `${OPENAI_API_KEY}`) |
| `model_name` | Model id sent to the endpoint; also names the output directory |
| `workers` | Concurrent requests (default 10) |
| `timeout_seconds` | Per-request HTTP timeout (default 120) |
| `execution_timeout_seconds` | Cap for running generated code during grading (default 20) |
| `thinking_enabled` | `true`/`false` (also accepts `enabled`/`disabled`) |
| `reasoning_effort` | `low`/`medium`/`high`/`max`, sent only when thinking is enabled |
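
To illustrate how `${ENV_VAR}` and `${ENV_VAR:-default}` placeholders might be resolved when the config is loaded, here is a minimal sketch. The function names `expand_placeholders` and `parse_thinking_flag` are hypothetical, and the shell-style fallback rule (use the default when the variable is unset or empty) is an assumption based on the example config, not a description of the code in `settings.py`.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} and ${NAME:-default}; syntax assumed from the example config.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_placeholders(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${NAME} / ${NAME:-default} with values taken from the environment."""
    source = os.environ if env is None else env

    def substitute(match):
        name, default = match.group(1), match.group(2)
        if source.get(name):  # set and non-empty
            return source[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set")

    return _PLACEHOLDER.sub(substitute, value)


def parse_thinking_flag(raw: object) -> bool:
    """Accept true/false as well as enabled/disabled (hypothetical helper)."""
    if isinstance(raw, bool):
        return raw
    text = str(raw).strip().lower()
    if text in {"true", "enabled"}:
        return True
    if text in {"false", "disabled"}:
        return False
    raise ValueError(f"unrecognized thinking_enabled value: {raw!r}")
```

Failing loudly on a missing variable without a default is a deliberate choice in this sketch: an empty API key would otherwise only surface later as an authentication error from the endpoint.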
The output directory is fixed to `results/<model_name>/`.

    # run a benchmark
    uv run llm-eval run --config configs/model.yaml --task gpqa

    # equivalent module form
    uv run python -m llm_eval run --config configs/model.yaml --task gsm

    # list all available tasks
    uv run llm-eval run --list-tasks

`--config` defaults to `configs/model.yaml` and `--task` defaults to `humaneval`.

Per-task commands:

    # Code generation
    uv run llm-eval run --config configs/model.yaml --task humaneval
    uv run llm-eval run --config configs/model.yaml --task humanevalplus
    uv run llm-eval run --config configs/model.yaml --task mbpp
    uv run llm-eval run --config configs/model.yaml --task mbppplus
    uv run llm-eval run --config configs/model.yaml --task livecodebench

    # Math reasoning
    uv run llm-eval run --config configs/model.yaml --task gsm
    uv run llm-eval run --config configs/model.yaml --task aime2025
    uv run llm-eval run --config configs/model.yaml --task aime2026

    # Multiple-choice knowledge
    uv run llm-eval run --config configs/model.yaml --task gpqa
    uv run llm-eval run --config configs/model.yaml --task mmlu_pro

    # Instruction following
    uv run llm-eval run --config configs/model.yaml --task ifeval

When a run starts, the framework prints the selected task, model, dataset path, worker count, thinking mode, and output directory, then shows a live progress bar (passed / failed / last HTTP status) while cases are processed.

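
The combination of a `workers` setting and live pass/fail counts can be pictured as a thread pool whose results are tallied as they complete. The following is only a sketch under that assumption: `run_cases` and the `evaluate` callback are hypothetical names, and the single-line counter stands in for whatever progress bar `runner.py` actually draws.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Tuple


def run_cases(cases: Iterable, evaluate: Callable[[object], bool], workers: int = 10) -> Tuple[int, int]:
    """Evaluate cases concurrently and return (passed, failed)."""
    passed = failed = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(evaluate, case) for case in cases]
        for future in as_completed(futures):
            try:
                ok = bool(future.result())
            except Exception:
                # In this sketch an exception in one case counts as a failure
                # instead of aborting the whole run.
                ok = False
            if ok:
                passed += 1
            else:
                failed += 1
            # Rewrite the same terminal line to mimic a live counter.
            print(f"\rpassed={passed} failed={failed}", end="", flush=True)
    print()
    return passed, failed
```

Threads fit here because each case spends most of its time waiting on an HTTP response, so the GIL is not the bottleneck.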
For providers that support it, the client sends `reasoning_effort` and `extra_body={"thinking": {"type": "enabled"}}` when `thinking_enabled` is on. Equivalent SDK call:

    from openai import OpenAI

    client = OpenAI(api_key="...", base_url="https://api.deepseek.com")
    response = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "Hello"},
        ],
        reasoning_effort="high",
        extra_body={"thinking": {"type": "enabled"}},
    )

| Task | Dataset | Cases | Grading |
|---|---|---|---|
| `humaneval` | HumanEval | 164 | Execute generated code against unit tests |
| `humanevalplus` | HumanEval+ | 164 | Same, with extended tests and a numpy shim |
| `mbpp` | MBPP | 974 | Execute generated code against assertion tests |
| `mbppplus` | MBPP+ | 378 | EvalPlus extended tests, with a numpy shim |
| `livecodebench` | LiveCodeBench (lite) | 219 | Run as a stdin/stdout program against public test cases |
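
Grading by executing generated code against tests, capped by `execution_timeout_seconds`, could look roughly like the sketch below. It assumes the solution and its tests are plain Python source strings and runs them in a child interpreter; `run_with_tests` is a hypothetical name, and a child process is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_tests(solution: str, tests: str, timeout_seconds: float = 20.0) -> str:
    """Return 'passed', 'failed', or 'timeout' for one generated solution."""
    program = solution + "\n\n" + tests + "\n"
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "case.py"
        script.write_text(program, encoding="utf-8")
        try:
            completed = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,            # keep any files the code writes inside the temp dir
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    # A failing assert raises AssertionError, which exits with a non-zero code.
    return "passed" if completed.returncode == 0 else "failed"
```

For example, `run_with_tests("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")` reports `passed`, while an infinite loop is cut off once the timeout elapses.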
| Task | Dataset | Cases | Grading |
|---|---|---|---|
| `gsm` | GSM8K | 1319 | Exact match on the final `#### <number>` answer |
| `aime2025` | AIME 2025 (I+II) | 30 | Integer match on the `\boxed{}` answer |
| `aime2026` | AIME 2026 (I+II) | 30 | Integer match on the `\boxed{}` answer |
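
As a rough illustration of the two math grading rules, the sketch below pulls the final `#### <number>` value and the last `\boxed{}` integer out of a piece of text. The function names and the exact regular expressions are assumptions; the real graders in `tasks.py` may normalize answers differently.

```python
import re
from typing import Optional


def extract_gsm_answer(text: str) -> Optional[str]:
    """Return the number after the last '####' marker, with thousands commas removed."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def extract_boxed_integer(text: str) -> Optional[int]:
    """Return the integer inside the last \\boxed{...}, or None if there is none."""
    matches = re.findall(r"\\boxed\{\s*(-?\d+)\s*\}", text)
    return int(matches[-1]) if matches else None


def is_correct_aime(response: str, expected: int) -> bool:
    """Integer match: leading zeros such as 073 compare equal to 73."""
    return extract_boxed_integer(response) == expected
```

Taking the last match rather than the first is a choice made in this sketch, on the assumption that intermediate values may appear earlier in the reasoning.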
The MCQ tasks ask the model to output `\boxed{A/B/C/D}` and are graded by exact letter match. Each report includes a per-domain accuracy breakdown.
| Task | Dataset | Cases | Coverage |
|---|---|---|---|
| `gpqa` | GPQA-Diamond | 198 | PhD-level science (Physics, Chemistry, Biology) |
| `mmlu_pro` | MMLU-Pro | 420 | 14 subjects, 10-way choices (30 sampled per category) |
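
A per-domain breakdown is a grouped accuracy over graded cases. The sketch below assumes each case is reduced to a `(domain, predicted, expected)` tuple and that answer letters run from A to J to cover 10-way questions; both the record shape and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def extract_boxed_letter(text: str) -> Optional[str]:
    """Return the letter inside the last \\boxed{...}, or None if there is none."""
    matches = re.findall(r"\\boxed\{\s*([A-J])\s*\}", text)
    return matches[-1] if matches else None


def accuracy_by_domain(
    records: Iterable[Tuple[str, Optional[str], str]],
) -> Dict[str, float]:
    """Map each domain to correct / total, with domains sorted by name."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [correct, total]
    for domain, predicted, expected in records:
        totals[domain][1] += 1
        if predicted is not None and predicted == expected:
            totals[domain][0] += 1
    return {domain: correct / total for domain, (correct, total) in sorted(totals.items())}
```

A response with no boxed letter yields `None` and is counted as incorrect, so unparseable answers lower the domain's accuracy instead of being dropped.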
| Task | Dataset | Cases | Grading |
|---|---|---|---|
| `ifeval` | IFEval | 541 | Programmatic verification of each instruction (prompt-level strict accuracy) |
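
Prompt-level strict accuracy counts a prompt as passed only when every instruction attached to it is satisfied. A minimal sketch of that aggregation, assuming the per-instruction checker results are already available as booleans (the function name is hypothetical):

```python
from typing import Sequence


def prompt_level_strict_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of prompts whose instructions were all followed.

    Each inner sequence holds one boolean per instruction in a prompt.
    A prompt with an empty instruction list counts as passed here.
    """
    if not results:
        return 0.0
    passed = sum(1 for checks in results if all(checks))
    return passed / len(results)
```

With three prompts whose checks are `[True, True]`, `[True, False]`, and `[True]`, two of the three pass, so the metric is 2/3 even though four of the five individual instructions were followed.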
`livecodebench` keeps only stdin/stdout problems (AtCoder/Codeforces) and grades against the plaintext public test cases. `ifeval` vendors Google's instruction checkers under `llm_eval/ifeval/` and depends on `langdetect`. Regenerate any dataset with `uv run --with datasets python scripts/fetch_datasets.py <name>`.
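
Grading a stdin/stdout problem amounts to feeding each public input to the program and comparing what it prints. The sketch below assumes test cases are `(stdin, expected_stdout)` string pairs and compares whitespace-separated tokens, which is a common convention but not necessarily the rule this framework applies; `passes_public_tests` is a hypothetical name.

```python
import subprocess
import sys
from typing import Iterable, Tuple


def passes_public_tests(
    source: str,
    test_cases: Iterable[Tuple[str, str]],
    timeout_seconds: float = 20.0,
) -> bool:
    """Run a Python program once per test case and compare its stdout."""
    for stdin_text, expected in test_cases:
        try:
            completed = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            return False
        if completed.returncode != 0:
            return False
        # Token comparison ignores trailing newlines and spacing differences.
        if completed.stdout.split() != expected.split():
            return False
    return True
```

Stopping at the first failing case keeps a wrong solution from spending the full timeout on every remaining input.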
Each run writes a single self-contained Markdown report:

    results/<model_name>/<task>_report.md

It contains:
- Overview — task, model, dataset, workers, thinking mode
- Metrics — pass rate, wall clock, throughput, prompt/completion/total tokens
- Status counts — how many cases passed, failed, errored, etc.
- Accuracy by domain — per-subject breakdown (knowledge tasks only)
- Results — a per-case table with status, time, tokens, and a detail column
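
To make the report structure concrete, here is a small sketch that builds the report path and renders a metrics line plus a per-case table. The record keys (`id`, `status`, `seconds`, `tokens`) and the exact Markdown layout are invented for illustration and will differ from what `reporting.py` produces.

```python
from pathlib import Path
from typing import Dict, List


def report_path(model_name: str, task: str, root: str = "results") -> Path:
    """Build results/<model_name>/<task>_report.md."""
    return Path(root) / model_name / f"{task}_report.md"


def render_report(task: str, model: str, results: List[Dict]) -> str:
    """Render a minimal Markdown report (hypothetical layout)."""
    passed = sum(1 for r in results if r["status"] == "passed")
    pass_rate = passed / len(results) if results else 0.0
    lines = [
        f"# {task} report",
        "",
        f"- Model: {model}",
        f"- Pass rate: {pass_rate:.1%} ({passed}/{len(results)})",
        "",
        "| Case | Status | Time (s) | Tokens |",
        "|---|---|---|---|",
    ]
    for r in results:
        lines.append(f"| {r['id']} | {r['status']} | {r['seconds']:.2f} | {r['tokens']} |")
    return "\n".join(lines) + "\n"
```

Writing the file is then a matter of creating the parent directory with `path.parent.mkdir(parents=True, exist_ok=True)` and calling `path.write_text(...)`.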
All tooling is configured in `pyproject.toml` and runs through `uv`:

    uv run ruff check .     # lint
    uv run ruff format .    # auto-format
    uv run mypy             # type check
    uv run pytest           # tests