Prompt regression testing for CI, runnable without API keys.
Thesis: prompts deserve unit tests. A prompt change should produce a small, reviewable report: inputs, rendered prompt, response, judge score, failures, and latency.
python -m pip install -e ".[dev]" && python examples/no_api_key_regression.py
The demo uses a deterministic local model function, so it can run in CI without secrets.
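To illustrate what a deterministic local model function could look like, here is a minimal sketch. The name `fake_llm` and its keyword rules are hypothetical and not taken from the project. The only property that matters is that the same prompt always produces the same response, with no network call.

```python
def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: a pure function of the prompt."""
    text = prompt.lower()
    # Illustrative keyword rules; the first matching rule wins.
    rules = [
        ("refund", "category: billing"),
        ("password", "category: account"),
    ]
    for keyword, response in rules:
        if keyword in text:
            return response
    # Fallback when no rule matches.
    return "category: other"
```

Because the function has no randomness and no I/O, a test that passes locally should pass identically in CI.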
- You want to know whether prompt version B actually regressed behavior.
- You need no-key tests for prompt templates and scoring logic.
- You want judges that are plain Python objects, not hidden service calls.
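One way to check whether prompt version B regressed is to run both versions over the same cases and compare pass rates. The sketch below uses plain Python only. `pass_rate`, `stub_model`, and the two templates are hypothetical, and it uses `str.format` placeholders (`{ticket}`) rather than the project's own template syntax.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[Dict[str, str], str]  # (template variables, expected substring)


def pass_rate(template: str, cases: List[Case], model: Callable[[str], str]) -> float:
    """Fraction of cases whose response contains the expected substring."""
    if not cases:
        raise ValueError("at least one case is required")
    passed = 0
    for variables, expected in cases:
        prompt = template.format(**variables)
        response = model(prompt)
        if expected.lower() in response.lower():
            passed += 1
    return passed / len(cases)


def stub_model(prompt: str) -> str:
    """Hypothetical deterministic model: answers only when asked for a category."""
    text = prompt.lower()
    if "category" in text and "refund" in text:
        return "category: billing"
    return "I am not sure."


cases = [({"ticket": "refund please"}, "billing")]
rate_a = pass_rate("Return the category for: {ticket}", cases, stub_model)
rate_b = pass_rate("Summarize: {ticket}", cases, stub_model)
regressed = rate_b < rate_a  # True here: version B no longer asks for a category
```

In a CI job, an assertion such as `assert not regressed` would fail the build when the new version scores lower on the shared cases.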
flowchart LR
EvalCase --> PromptTemplate
PromptTemplate --> ModelFunction
ModelFunction --> Response
Response --> Judge
Judge --> EvalSummary
EvalSummary --> Reporter
from prompt_eval import Contains, EvalCase, EvalRunner, PromptTemplate
template = PromptTemplate("Classify: {{ ticket }}")
llm = lambda prompt: "category: billing"
cases = [EvalCase({"ticket": "refund please"}, "billing")]
summary = EvalRunner(template, Contains(ignore_case=True), llm).run(cases)
assert summary.pass_rate == 1.0
Available judges: exact match, contains, fuzzy match, regex match, semantic similarity, LLM judge, and weighted composite.
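As a rough illustration of how a weighted composite judge can be built from plain Python objects, consider the sketch below. The classes `ContainsJudge`, `RegexJudge`, and `WeightedJudge` and their `score` method are assumptions made for this example. They are not the project's actual classes or signatures.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ContainsJudge:
    """Hypothetical judge: 1.0 if the expected text appears in the response."""
    ignore_case: bool = True

    def score(self, response: str, expected: str) -> float:
        if self.ignore_case:
            response, expected = response.lower(), expected.lower()
        return 1.0 if expected in response else 0.0


@dataclass
class RegexJudge:
    """Hypothetical judge: 1.0 if the pattern matches somewhere in the response."""
    pattern: str

    def score(self, response: str, expected: str) -> float:
        return 1.0 if re.search(self.pattern, response) else 0.0


@dataclass
class WeightedJudge:
    """Hypothetical composite: weighted average of the member judges' scores."""
    parts: List[Tuple[object, float]]

    def score(self, response: str, expected: str) -> float:
        total = sum(weight for _, weight in self.parts)
        weighted = sum(j.score(response, expected) * w for j, w in self.parts)
        return weighted / total


judge = WeightedJudge([(ContainsJudge(), 3.0), (RegexJudge(r"^category: \w+$"), 1.0)])
full = judge.score("category: billing", "billing")   # both judges pass
partial = judge.score("It is billing.", "billing")   # only the contains judge passes
```

With these weights, a response that names the right category in the wrong format scores 0.75 rather than failing outright, which keeps format drift visible without hiding a correct answer.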
- Local string judges are not substitutes for domain review.
- LLM-as-judge is supported but should be calibrated and versioned.
- This project tests prompt behavior; it does not manage datasets or model deployment.
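The calibration and versioning point above can be made concrete with a small sketch: pin the judge configuration to a version string, then measure how often its pass/fail verdicts agree with human labels. `JudgeConfig`, `agreement`, and the scores and labels below are all made up for illustration and are not part of the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JudgeConfig:
    """Hypothetical record pinning the judge so that results stay comparable."""
    name: str
    version: str      # bump whenever the judge prompt or judge model changes
    threshold: float  # scores at or above this value count as a pass


def agreement(config: JudgeConfig, judge_scores: List[float], human_labels: List[bool]) -> float:
    """Fraction of examples where the judge verdict matches the human label."""
    if not judge_scores or len(judge_scores) != len(human_labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    matches = sum(
        (score >= config.threshold) == label
        for score, label in zip(judge_scores, human_labels)
    )
    return matches / len(human_labels)


config = JudgeConfig(name="helpfulness", version="v1", threshold=0.5)
# Illustrative numbers only: four judge scores and four human pass/fail labels.
rate = agreement(config, [0.9, 0.2, 0.6, 0.4], [True, False, False, False])
```

Recording the agreement rate next to the version string makes it possible to tell whether a score change came from the prompt under test or from the judge itself.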
python -m pip install -e ".[dev]"
pytest
python scripts/ci_regression_demo.py
See ARCHITECTURE.md, TECHNICAL_ARTICLE.md, and RELEASE.md.