prompt-eval

Prompt regression testing for CI, runnable without API keys.

Thesis: prompts deserve unit tests. A prompt change should produce a small, reviewable report: inputs, rendered prompt, response, judge score, failures, and latency.

Run It In 30 Seconds

python -m pip install -e ".[dev]" && python examples/no_api_key_regression.py

The demo uses a deterministic local model function, so it can run in CI without secrets.
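
As a rough illustration of what a deterministic local model function can look like, here is a minimal sketch. It is not the function shipped in examples/no_api_key_regression.py; the name fake_llm, the keyword table, and the fallback format are all assumptions made for this example.

```python
import hashlib

# Hypothetical canned replies; the keywords and labels are illustrative only.
CANNED_REPLIES = {
    "refund": "category: billing",
    "password": "category: account",
}


def fake_llm(prompt: str) -> str:
    """Return a fixed reply chosen by keyword, so every run gives the same output."""
    lowered = prompt.lower()
    for keyword, reply in CANNED_REPLIES.items():
        if keyword in lowered:
            return reply
    # Unknown prompts map to a stable digest, which keeps the output repeatable.
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:8]
    return f"category: unknown ({digest})"


print(fake_llm("Classify: refund please"))
```

Because the function takes a prompt string and returns a string with no network call, it needs no secrets and gives identical results on every CI run.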

Why Care?

You want to know whether prompt version B actually regressed behavior.
You need no-key tests for prompt templates and scoring logic.
You want judges that are plain Python objects, not hidden service calls.
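
One way to frame the "did version B regress?" question is to compare pass rates over the same cases. The sketch below is standalone Python and does not use the prompt_eval API; pass_rate, regressed, and the tolerance parameter are hypothetical names chosen for illustration, and the result lists are made-up inputs.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of cases that passed; an empty run counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def regressed(baseline: Sequence[bool], candidate: Sequence[bool],
              tolerance: float = 0.0) -> bool:
    """True when the candidate pass rate falls below the baseline by more than tolerance."""
    return pass_rate(candidate) < pass_rate(baseline) - tolerance


# Illustrative per-case outcomes for the same four cases under two prompt versions.
baseline_results = [True, True, True, False]    # prompt version A
candidate_results = [True, False, True, False]  # prompt version B

# A CI script could pass this value to sys.exit to fail the build on regression.
exit_code = 1 if regressed(baseline_results, candidate_results) else 0
print(f"A={pass_rate(baseline_results):.2f} "
      f"B={pass_rate(candidate_results):.2f} exit_code={exit_code}")
```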

Architecture

flowchart LR
    EvalCase --> PromptTemplate
    PromptTemplate --> ModelFunction
    ModelFunction --> Response
    Response --> Judge
    Judge --> EvalSummary
    EvalSummary --> Reporter
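
The flow above can be sketched in plain Python to show how the stages hand data to each other. This is an assumption-laden illustration, not the library's implementation: CaseResult, render, run_case, and contains_judge are hypothetical names, and the pass threshold of 1.0 is chosen only for the example.

```python
import re
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CaseResult:
    """Hypothetical per-case record: what went in, what came out, and how it scored."""
    inputs: dict
    rendered_prompt: str
    response: str
    score: float
    passed: bool
    latency_s: float


def render(template: str, inputs: dict) -> str:
    """Replace each {{ name }} placeholder with the matching input value."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(inputs[m.group(1)]), template)


def contains_judge(response: str, expected: str) -> float:
    """Score 1.0 when the expected text appears in the response, ignoring case."""
    return 1.0 if expected.lower() in response.lower() else 0.0


def run_case(template: str, inputs: dict, expected: str,
             model: Callable[[str], str],
             judge: Callable[[str, str], float]) -> CaseResult:
    prompt = render(template, inputs)
    start = time.perf_counter()
    response = model(prompt)              # the model function is any str -> str callable
    latency = time.perf_counter() - start
    score = judge(response, expected)
    return CaseResult(inputs, prompt, response, score, score >= 1.0, latency)


result = run_case(
    "Classify: {{ ticket }}",
    {"ticket": "refund please"},
    "billing",
    lambda prompt: "category: billing",
    contains_judge,
)
print(result.rendered_prompt, result.passed)
```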

Example

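from prompt_eval import Contains, EvalCase, EvalRunner, PromptTemplate

# Template with one placeholder, filled from each case's inputs.
template = PromptTemplate("Classify: {{ ticket }}")

# Deterministic stand-in for a model call, so no API key is needed.
def llm(prompt):
    return "category: billing"

# One case: the template inputs, then the text the Contains judge looks for.
cases = [EvalCase({"ticket": "refund please"}, "billing")]
summary = EvalRunner(template, Contains(ignore_case=True), llm).run(cases)
assert summary.pass_rate == 1.0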

Built-In Judges

The built-in judges cover exact match, contains, fuzzy match, regex match, semantic similarity, LLM judge, and weighted composites of other judges.
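
To make the idea of judges as plain Python objects concrete, here is a standalone sketch of a fuzzy judge, a regex judge, and a weighted composite. The class names, the score(response, expected) method, and the weights are assumptions for illustration and may not match the signatures in prompt_eval; the fuzzy score uses difflib.SequenceMatcher from the standard library, which may differ from the library's own matching.

```python
import re
from difflib import SequenceMatcher


class FuzzyJudge:
    """Score is the similarity ratio between response and expected, from 0.0 to 1.0."""

    def score(self, response: str, expected: str) -> float:
        return SequenceMatcher(None, response.lower(), expected.lower()).ratio()


class RegexJudge:
    """Score is 1.0 when the pattern is found in the response, else 0.0."""

    def __init__(self, pattern: str):
        self.pattern = re.compile(pattern)

    def score(self, response: str, expected: str) -> float:
        return 1.0 if self.pattern.search(response) else 0.0


class WeightedJudge:
    """Weighted average of the scores from several (judge, weight) pairs."""

    def __init__(self, weighted_judges):
        self.weighted_judges = weighted_judges

    def score(self, response: str, expected: str) -> float:
        total = sum(weight for _, weight in self.weighted_judges)
        return sum(judge.score(response, expected) * weight
                   for judge, weight in self.weighted_judges) / total


# Illustrative weights: format check counts once, similarity counts three times.
judge = WeightedJudge([(RegexJudge(r"^category: \w+$"), 1.0), (FuzzyJudge(), 3.0)])
print(judge.score("category: billing", "category: billing"))
```

Since each judge is an ordinary object with one method, it can be unit-tested on its own, without a model or network call.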

Limitations

Local string judges are not substitutes for domain review.
LLM-as-judge is supported but should be calibrated and versioned.
This project tests prompt behavior; it does not manage datasets or model deployment.

Development

python -m pip install -e ".[dev]"
pytest
python scripts/ci_regression_demo.py

See ARCHITECTURE.md, TECHNICAL_ARTICLE.md, and RELEASE.md.
