Miaoge-Ge/llm-eval-framework


LLM Eval Framework

Chinese documentation

A clean, engineering-first evaluation framework for LLM benchmarks. It uses uv-managed environments and a single minimal model config, and it produces one self-contained Markdown report per run against any OpenAI-compatible endpoint.

Highlights

  • uv as the default environment and dependency workflow
  • Installable package (hatchling build backend) with a llm-eval console script
  • Single model config file, with ${ENV_VAR} placeholders for secrets
  • 11 built-in tasks across code generation, math reasoning, multiple-choice knowledge, and instruction following
  • OpenAI-compatible client with streaming, retries, and usage accounting
  • One self-contained Markdown report per run (config + metrics + per-case results)
  • Per-subject / per-domain accuracy breakdown for all knowledge tasks
  • Optional thinking mode with configurable reasoning_effort
  • Live progress bar with pass/fail counts during evaluation
  • Quality tooling wired in: ruff (lint + format), mypy (type checking), pytest

Project structure

llm_eval/
  cli.py
  clients.py
  reporting.py
  runner.py
  settings.py
  tasks.py
  utils.py
  ifeval/              # vendored Google IFEval checkers + scoring glue
configs/
  model.example.yaml   # committed template (no secrets)
datasets/
  HumanEval.jsonl
  HumanEvalPlus.jsonl
  MBPP.jsonl
  MBPPPlus.jsonl
  GSM8K.jsonl
  AIME2025.jsonl
  AIME2026.jsonl
  GPQA.jsonl
  MMLUPro.jsonl
  IFEval.jsonl
  LiveCodeBench.jsonl
scripts/
  fetch_datasets.py    # re-download / regenerate datasets from Hugging Face
tests/
pyproject.toml

Install

Recommended:

uv sync

Install with dev dependencies (pytest, ruff, mypy and type stubs):

uv sync --extra dev

Model config

Copy the committed template to a local config (the configs/ directory is git-ignored apart from the example, so your real keys never get committed):

cp configs/model.example.yaml configs/model.yaml

configs/model.example.yaml:

base_url: ${OPENAI_BASE_URL:-https://api.openai.com/v1}
api_key: ${OPENAI_API_KEY}
model_name: gpt-4.1-mini
workers: 10
timeout_seconds: 120
execution_timeout_seconds: 20
thinking_enabled: false
reasoning_effort:

Prefer ${ENV_VAR} placeholders for secrets so the API key is never written to disk. Recognized fields:

| Field | Purpose |
| --- | --- |
| base_url | OpenAI-compatible endpoint |
| api_key | API key (use ${OPENAI_API_KEY}) |
| model_name | Model id sent to the endpoint; also names the output directory |
| workers | Concurrent requests (default 10) |
| timeout_seconds | Per-request HTTP timeout in seconds (default 120) |
| execution_timeout_seconds | Cap, in seconds, for running generated code during grading (default 20) |
| thinking_enabled | true/false (also accepts enabled/disabled) |
| reasoning_effort | low/medium/high/max, sent only when thinking is enabled |
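The ${ENV_VAR} and ${ENV_VAR:-default} placeholders shown in the example config can be resolved with a small substitution pass. The following is a minimal sketch, not the framework's actual loader: the function name `expand_placeholders` is hypothetical, and raising an error for an unset variable with no default is an assumption.

```python
from __future__ import annotations

import os
import re

# Matches ${NAME} and ${NAME:-default}, the two forms used in the example config.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_placeholders(value: str, env: dict[str, str] | None = None) -> str:
    """Replace ${VAR} / ${VAR:-default} with values taken from the environment.

    Hypothetical helper: how the framework itself handles a missing variable
    is not documented, so this sketch simply raises KeyError.
    """
    env = dict(os.environ) if env is None else env

    def _substitute(match: re.Match[str]) -> str:
        name, default = match.group(1), match.group(2)
        if env.get(name):  # variable is set and non-empty
            return env[name]
        if default is not None:  # fall back to the inline default
            return default
        raise KeyError(f"environment variable {name} is not set")

    return _PLACEHOLDER.sub(_substitute, value)
```

With this approach the key lives only in the process environment, so `configs/model.yaml` can keep the literal text `${OPENAI_API_KEY}`.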

The output directory is fixed to results/<model_name>/.

CLI

# run a benchmark
uv run llm-eval run --config configs/model.yaml --task gpqa

# equivalent module form
uv run python -m llm_eval run --config configs/model.yaml --task gsm

# list all available tasks
uv run llm-eval run --list-tasks

--config defaults to configs/model.yaml and --task defaults to humaneval.

Run every benchmark

# Code generation
uv run llm-eval run --config configs/model.yaml --task humaneval
uv run llm-eval run --config configs/model.yaml --task humanevalplus
uv run llm-eval run --config configs/model.yaml --task mbpp
uv run llm-eval run --config configs/model.yaml --task mbppplus
uv run llm-eval run --config configs/model.yaml --task livecodebench

# Math reasoning
uv run llm-eval run --config configs/model.yaml --task gsm
uv run llm-eval run --config configs/model.yaml --task aime2025
uv run llm-eval run --config configs/model.yaml --task aime2026

# Multiple-choice knowledge
uv run llm-eval run --config configs/model.yaml --task gpqa
uv run llm-eval run --config configs/model.yaml --task mmlu_pro

# Instruction following
uv run llm-eval run --config configs/model.yaml --task ifeval
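To run the whole suite in one go, the commands above can be driven from a short script. This is a convenience sketch rather than part of the project: `build_command` and `run_all` are hypothetical names, and it assumes the CLI exits with a non-zero status when a run fails.

```python
import subprocess

# Task names as listed in the commands above.
TASKS = [
    "humaneval", "humanevalplus", "mbpp", "mbppplus", "livecodebench",
    "gsm", "aime2025", "aime2026",
    "gpqa", "mmlu_pro",
    "ifeval",
]


def build_command(task: str, config: str = "configs/model.yaml") -> list[str]:
    """Return the argv list for one benchmark run."""
    return ["uv", "run", "llm-eval", "run", "--config", config, "--task", task]


def run_all(tasks: list[str] = TASKS, config: str = "configs/model.yaml") -> list[str]:
    """Run each task in sequence and return the tasks whose command failed."""
    failed = []
    for task in tasks:
        # check=False so one failing benchmark does not stop the rest.
        result = subprocess.run(build_command(task, config), check=False)
        if result.returncode != 0:
            failed.append(task)
    return failed
```

Calling `run_all()` from the repository root runs the tasks one after another; each run still writes its own report.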

When a run starts, the framework prints the selected task, model, dataset path, worker count, thinking mode, and output directory, then shows a live progress bar (passed / failed / last HTTP status) while cases are processed.
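The combination of a worker count and live pass/fail counts can be pictured as a thread pool that updates a status line whenever a case finishes. The sketch below illustrates that pattern only; `evaluate_concurrently` and `evaluate_case` are hypothetical names, and the real runner may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Tuple


def evaluate_concurrently(
    cases: Iterable[object],
    evaluate_case: Callable[[object], bool],
    workers: int = 10,
) -> Tuple[int, int]:
    """Evaluate cases on a thread pool and return (passed, failed).

    `evaluate_case` is a hypothetical callback that returns True when a case
    passes; an exception raised by it is counted as a failure here.
    """
    passed = failed = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(evaluate_case, case) for case in cases]
        for done, future in enumerate(as_completed(futures), start=1):
            try:
                ok = bool(future.result())
            except Exception:
                ok = False
            if ok:
                passed += 1
            else:
                failed += 1
            # Rewrite the same terminal line to act as a simple progress bar.
            print(f"\r{done}/{len(futures)} passed={passed} failed={failed}", end="", flush=True)
    print()
    return passed, failed
```

Threads suit this workload because each case spends most of its time waiting on an HTTP response.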

Thinking mode

For providers that support it, the client sends reasoning_effort and extra_body={"thinking": {"type": "enabled"}} when thinking_enabled is on. Equivalent SDK call:

from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Hello"},
    ],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
)
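The config-side rules (thinking_enabled accepts true/false or enabled/disabled, and reasoning_effort is sent only when thinking is enabled) could be turned into request arguments roughly as follows. Both function names are hypothetical, and omitting the thinking fields entirely when thinking is off is an assumption.

```python
from __future__ import annotations


def normalize_thinking(value: object) -> bool:
    """Map true/false or enabled/disabled (any case) to a bool."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in {"true", "enabled"}:
        return True
    if text in {"false", "disabled"}:
        return False
    raise ValueError(f"unsupported thinking_enabled value: {value!r}")


def build_request_kwargs(
    model_name: str,
    messages: list[dict[str, str]],
    thinking_enabled: object = False,
    reasoning_effort: str | None = None,
) -> dict[str, object]:
    """Build keyword arguments for client.chat.completions.create(...)."""
    kwargs: dict[str, object] = {"model": model_name, "messages": messages}
    if normalize_thinking(thinking_enabled):
        kwargs["extra_body"] = {"thinking": {"type": "enabled"}}
        if reasoning_effort:  # only sent when thinking is enabled
            kwargs["reasoning_effort"] = reasoning_effort
    return kwargs
```

The resulting dictionary can be passed as `client.chat.completions.create(**kwargs)`.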

Supported tasks

Code generation

| Task | Dataset | Cases | Grading |
| --- | --- | --- | --- |
| humaneval | HumanEval | 164 | Execute generated code against unit tests |
| humanevalplus | HumanEval+ | 164 | Same, with extended tests and a numpy shim |
| mbpp | MBPP | 974 | Execute generated code against assertion tests |
| mbppplus | MBPP+ | 378 | EvalPlus extended tests, with a numpy shim |
| livecodebench | LiveCodeBench (lite) | 219 | Run as a stdin/stdout program against public test cases |
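Execution-based grading with a time cap can be sketched as running the candidate and its tests in a child interpreter. This is an illustration under assumptions, not the framework's grader: `run_with_tests` and the returned status strings are hypothetical, and a subprocess alone is not a security sandbox.

```python
import os
import subprocess
import sys
import tempfile


def run_with_tests(solution_code: str, test_code: str, timeout_seconds: float = 20.0) -> str:
    """Run solution + tests in a separate Python process.

    Returns "passed", "failed", or "timeout" (hypothetical status names).
    The default mirrors execution_timeout_seconds from the config.
    """
    program = solution_code + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(program)
        try:
            completed = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising.
            return "timeout"
    # A failing assert makes the interpreter exit with a non-zero status.
    return "passed" if completed.returncode == 0 else "failed"
```

Running model-generated code is inherently risky, so a container or other isolation layer is advisable on top of the timeout.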

Math reasoning

| Task | Dataset | Cases | Grading |
| --- | --- | --- | --- |
| gsm | GSM8K | 1319 | Exact match on the final #### <number> answer |
| aime2025 | AIME 2025 (I+II) | 30 | Integer match on the \boxed{} answer |
| aime2026 | AIME 2026 (I+II) | 30 | Integer match on the \boxed{} answer |
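Both grading rules in the table come down to extracting a final answer with a regular expression. The sketch below shows one plausible way to do that; the function names, the choice to take the last match, and the comma stripping are assumptions rather than the framework's documented behavior.

```python
from __future__ import annotations

import re


def extract_gsm_answer(text: str) -> str | None:
    """Return the number after the last '####' marker, without thousands separators."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def extract_boxed_integer(text: str) -> int | None:
    r"""Return the integer inside the last \boxed{...}, or None if absent."""
    matches = re.findall(r"\\boxed\{\s*(-?\d+)\s*\}", text)
    return int(matches[-1]) if matches else None


def grade_gsm(response: str, reference: str) -> bool:
    """Exact string match between the extracted answer and the reference number."""
    answer = extract_gsm_answer(response)
    return answer is not None and answer == reference.strip().replace(",", "")


def grade_aime(response: str, reference: int) -> bool:
    """Integer match between the boxed answer and the reference."""
    return extract_boxed_integer(response) == reference
```

Taking the last match tolerates models that restate intermediate values before committing to a final answer.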

Multiple-choice knowledge

The MCQ tasks ask the model to output its chosen option letter inside \boxed{} (A/B/C/D for GPQA; MMLU-Pro has 10-way choices) and grade by exact letter match. The report includes a per-domain accuracy breakdown.

| Task | Dataset | Cases | Coverage |
| --- | --- | --- | --- |
| gpqa | GPQA-Diamond | 198 | PhD-level science (Physics, Chemistry, Biology) |
| mmlu_pro | MMLU-Pro | 420 | 14 subjects, 10-way choices (30 sampled per category) |
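Letter extraction and the per-domain breakdown can be sketched in a few lines. The names below are hypothetical, and accepting letters A through J (to cover 10-way choices) is an assumption about how the options are labeled.

```python
from __future__ import annotations

import re
from collections import defaultdict
from typing import Iterable


def extract_choice(text: str) -> str | None:
    r"""Return the letter inside the last \boxed{...}, or None if absent."""
    matches = re.findall(r"\\boxed\{\s*([A-J])\s*\}", text)
    return matches[-1] if matches else None


def accuracy_by_domain(records: Iterable[tuple[str, str | None, str]]) -> dict[str, float]:
    """Compute accuracy per domain from (domain, predicted, expected) triples."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for domain, predicted, expected in records:
        totals[domain] += 1
        if predicted == expected:  # exact letter match; None never matches
            correct[domain] += 1
    return {domain: correct[domain] / totals[domain] for domain in totals}
```

A response with no boxed letter yields `None`, which counts as incorrect in the breakdown.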

Instruction following

| Task | Dataset | Cases | Grading |
| --- | --- | --- | --- |
| ifeval | IFEval | 541 | Programmatic verification of each instruction (prompt-level strict accuracy) |
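Prompt-level strict accuracy counts a prompt as passed only when every instruction attached to it is satisfied. A minimal sketch of the aggregation step, assuming the per-instruction checker results are already available as booleans (the function name is hypothetical):

```python
def prompt_level_strict_accuracy(results: list[list[bool]]) -> float:
    """Fraction of prompts whose instructions were all followed.

    `results` holds one inner list per prompt, with one boolean per
    instruction check for that prompt.
    """
    if not results:
        return 0.0
    passed = sum(1 for checks in results if checks and all(checks))
    return passed / len(results)
```

For example, with three prompts where only the second misses one of its instructions, the score is 2/3.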

livecodebench keeps only stdin/stdout problems (AtCoder/Codeforces) and grades against the plaintext public test cases. ifeval vendors Google's instruction checkers under llm_eval/ifeval/ and depends on langdetect. Regenerate any dataset with uv run --with datasets python scripts/fetch_datasets.py <name>.

Output

Each run writes a single self-contained Markdown report:

results/<model_name>/<task>_report.md

It contains:

  • Overview — task, model, dataset, workers, thinking mode
  • Metrics — pass rate, wall clock, throughput, prompt/completion/total tokens
  • Status counts — how many cases passed, failed, errored, etc.
  • Accuracy by domain — per-subject breakdown (knowledge tasks only)
  • Results — a per-case table with status, time, tokens, and a detail column
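The headline numbers in the Metrics and Status counts sections can be derived from the per-case statuses and the wall-clock time. The sketch below is illustrative only: the status strings are hypothetical, and defining throughput as cases per second is an assumption.

```python
from collections import Counter
from typing import Dict, List, Union


def summarize(statuses: List[str], wall_clock_seconds: float) -> Dict[str, Union[float, Dict[str, int]]]:
    """Aggregate per-case statuses into pass rate, throughput, and status counts."""
    counts = Counter(statuses)
    total = len(statuses)
    pass_rate = counts["passed"] / total if total else 0.0
    # Assumed definition: completed cases per second of wall-clock time.
    throughput = total / wall_clock_seconds if wall_clock_seconds > 0 else 0.0
    return {
        "pass_rate": pass_rate,
        "throughput": throughput,
        "status_counts": dict(counts),
    }
```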

Development

All tooling is configured in pyproject.toml and runs through uv:

uv run ruff check .    # lint
uv run ruff format .   # auto-format
uv run mypy            # type check
uv run pytest          # tests

About

A lightweight, configuration-driven evaluation framework for LLM code generation & reasoning tasks (MBPP, HumanEval, GSM8K). Supports multi-provider (DeepSeek, OpenAI, ZhipuAI) and concurrent execution.
