Eval framework for Claude Code skills, hooks, and prompts. Test at scale. Catch regressions across model versions. Ship with confidence.
pip install claude-eval
claude-eval initYou wrote a Claude Code skill. It works for your test prompts. You have no idea what happens at scale.
- Does it handle ambiguous inputs?
- Does it work on Sonnet AND Opus AND Haiku?
- When Anthropic ships a new model, does it break?
- What's the cost per invocation?
claude-eval answers all of these systematically.
pip install claude-eval
claude-eval init
# Creates tests/eval/ with sample evalsWrite your first eval (it's just YAML):
# tests/eval/security-audit.yaml
skill: security-audit
description: "Catches OWASP Top 10 in small Express apps"
models: [claude-haiku-4, claude-sonnet-4-6]
cases:
- id: sql-injection-in-login
fixture: fixtures/sqli-login.js
expects:
severity_contains: critical
mentions_line: 23
suggests_fix_with: parameterized
cost_limit: 0.10
- id: missing-csrf
fixture: fixtures/no-csrf.js
expects:
severity_contains: high
mentions_pattern: csrf
- id: clean-code
fixture: fixtures/clean.js
expects:
no_critical: true
no_high: trueRun:
$ claude-eval run
Running 3 cases Γ 2 models = 6 evaluations...
β security-audit/sql-injection-in-login (haiku) 4.2s $0.008
β security-audit/sql-injection-in-login (sonnet) 5.8s $0.030
β security-audit/missing-csrf (haiku) 3.1s $0.006
β security-audit/missing-csrf (sonnet) 6.4s $0.034
expected: mentions_pattern=csrf
got: "no security issues found"
β security-audit/clean-code (haiku) 2.8s $0.005
β security-audit/clean-code (sonnet) 4.1s $0.024
5/6 passed. Total cost: $0.107. Total time: 26.4s.expects:
# Text assertions
contains: "OWASP"
contains_any: ["SQL", "injection"]
contains_all: ["line 23", "parameterize"]
matches_regex: "severity.*critical"
# Structural assertions (for JSON-mode outputs)
json_path:
"$.severity": "critical"
"$.findings[*].line": [23, 47]
# Negative assertions
not_contains: "I cannot help"
no_critical: true
# Semantic assertions (uses a stronger model as judge)
semantic_match: "Identifies the SQL injection vulnerability in the login handler"
# Performance assertions
max_tokens: 2000
max_cost: 0.05
max_latency_ms: 8000
# Tool-use assertions
uses_tool: Read
tool_call_count: { Edit: ">=2" }Run the same eval N times to measure variance:
claude-eval run --repeat 5Output:
security-audit/sql-injection-in-login
haiku: 5/5 passed reliability: 1.00
sonnet: 4/5 passed reliability: 0.80 β flaky
Anything below 0.95 is flagged. This is the test that catches "works on my machine" prompts.
When Anthropic ships a new model, run:
claude-eval run --baseline claude-sonnet-4-6 --candidate claude-sonnet-4-8
ββββββββββββββββββββββββββββββββββββββββββββββββββββ
Skill β Baseline β Candidate β Delta
ββββββββββββββββββββββββββββββββββββββββββββββββββββ
security-audit β 92% β 96% β +4% β
refactor-deep β 88% β 87% β -1%
api-mock β 95% β 95% β 0%
i18n-translate β 78% β 91% β +13% β
changelog-gen β 99% β 98% β -1%
ββββββββββββββββββββββββββββββββββββββββββββββββββββ
Cost change: $0.42 β $0.51 (+21%)Now you know whether to upgrade your team's default model.
Evals support fixtures in any format Claude can read:
fixtures/
βββ code/ # Source files (.py, .ts, .go, etc.)
βββ repos/ # Whole tiny repos (multi-file)
βββ prompts/ # Specific user prompts
βββ diffs/ # Pre-generated diffs
βββ traces/ # Captured tool-use traces
claude-eval outputs JUnit XML, so it plugs into:
- GitHub Actions
- GitLab CI
- CircleCI
- Buildkite
- Any system that understands JUnit
GitHub Actions example:
- name: Claude Eval
run: claude-eval run --junit results.xml
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
- name: Publish results
uses: dorny/test-reporter@v1
with:
name: Claude Evals
path: results.xml
reporter: jest-junitDefault budget per run: $1.00. Configurable.
claude-eval run --budget 5.00
claude-eval run --cost-strategy=cheapest-passingcheapest-passing runs Haiku first, falls back to Sonnet only if Haiku fails. Cuts eval costs by ~60%.
We've used claude-eval to:
- π Migrate our prompts from Sonnet 4.5 β 4.6 β 4.7 with confidence
- π Catch regressions when a model update silently changed behavior
- π Compare custom skills to community baselines
- π° Pick the cheapest model that meets quality bar
MIT.
Star β to ship Claude Code skills with confidence.