kasimmj/claude-code-test-runner: 🧪 Evaluation framework for testing Claude Code skills at scale. Run regression suites across model versions.

Eval framework for Claude Code skills, hooks, and prompts. Test at scale. Catch regressions across model versions. Ship with confidence.

pip install claude-eval
claude-eval init

🧪 Why this matters

You wrote a Claude Code skill. It works for your test prompts. You have no idea what happens at scale.

Does it handle ambiguous inputs?
Does it work on Sonnet AND Opus AND Haiku?
When Anthropic ships a new model, does it break?
What's the cost per invocation?

claude-eval answers all of these systematically.

⚡ Quick Start

pip install claude-eval
claude-eval init
# Creates tests/eval/ with sample evals

Write your first eval (it's just YAML):

# tests/eval/security-audit.yaml
skill: security-audit
description: "Catches OWASP Top 10 in small Express apps"
models: [claude-haiku-4, claude-sonnet-4-6]

cases:
  - id: sql-injection-in-login
    fixture: fixtures/sqli-login.js
    expects:
      severity_contains: critical
      mentions_line: 23
      suggests_fix_with: parameterized
    cost_limit: 0.10

  - id: missing-csrf
    fixture: fixtures/no-csrf.js
    expects:
      severity_contains: high
      mentions_pattern: csrf

  - id: clean-code
    fixture: fixtures/clean.js
    expects:
      no_critical: true
      no_high: true

Run:

$ claude-eval run

Running 3 cases × 2 models = 6 evaluations...

  ✓ security-audit/sql-injection-in-login    (haiku)   4.2s   $0.008
  ✓ security-audit/sql-injection-in-login    (sonnet)  5.8s   $0.030
  ✓ security-audit/missing-csrf              (haiku)   3.1s   $0.006
  ✗ security-audit/missing-csrf              (sonnet)  6.4s   $0.034
      expected: mentions_pattern=csrf
      got:      "no security issues found"
  ✓ security-audit/clean-code                (haiku)   2.8s   $0.005
  ✓ security-audit/clean-code                (sonnet)  4.1s   $0.024

5/6 passed. Total cost: $0.107. Total time: 26.4s.
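
The run above is a plain cross product: every case is evaluated on every model in `models`, and the summary line adds up cost and time per evaluation. Here is a minimal Python sketch of that bookkeeping, using the figures from the sample output. `expand_matrix` and `summarize` are hypothetical names for illustration, not claude-eval's actual API.

```python
from itertools import product

def expand_matrix(case_ids, models):
    """Pair every case with every model (cases x models evaluations)."""
    return list(product(case_ids, models))

def summarize(results):
    """results: list of (passed, seconds, cost_usd) tuples, one per evaluation."""
    passed = sum(1 for ok, _, _ in results if ok)
    total_time = round(sum(sec for _, sec, _ in results), 1)
    total_cost = round(sum(cost for _, _, cost in results), 3)
    return passed, len(results), total_cost, total_time

cases = ["sql-injection-in-login", "missing-csrf", "clean-code"]
models = ["haiku", "sonnet"]
matrix = expand_matrix(cases, models)

# Figures copied from the sample run, in the same order as the output.
results = [
    (True, 4.2, 0.008), (True, 5.8, 0.030),
    (True, 3.1, 0.006), (False, 6.4, 0.034),
    (True, 2.8, 0.005), (True, 4.1, 0.024),
]
summary = summarize(results)
print(len(matrix), summary)
```

Rounding before reporting keeps floating-point noise out of the totals.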

📐 Eval Assertions

expects:
  # Text assertions
  contains: "OWASP"
  contains_any: ["SQL", "injection"]
  contains_all: ["line 23", "parameterize"]
  matches_regex: "severity.*critical"

  # Structural assertions (for JSON-mode outputs)
  json_path:
    "$.severity": "critical"
    "$.findings[*].line": [23, 47]

  # Negative assertions
  not_contains: "I cannot help"
  no_critical: true

  # Semantic assertions (uses a stronger model as judge)
  semantic_match: "Identifies the SQL injection vulnerability in the login handler"

  # Performance assertions
  max_tokens: 2000
  max_cost: 0.05
  max_latency_ms: 8000

  # Tool-use assertions
  uses_tool: Read
  tool_call_count: { Edit: ">=2" }
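
To make the text and negative assertions concrete, here is a minimal Python sketch of how such checks could be evaluated against a model's output. It is an illustration, not claude-eval's implementation: `check_text_assertions` is a hypothetical helper, and case-sensitive matching is an assumption.

```python
import re

def check_text_assertions(output: str, expects: dict) -> list:
    """Return descriptions of failed assertions (an empty list means pass).

    Assumption: matching is case-sensitive substring / regex search.
    """
    failures = []
    if "contains" in expects and expects["contains"] not in output:
        failures.append(f"contains={expects['contains']!r}")
    if "contains_any" in expects and not any(
        s in output for s in expects["contains_any"]
    ):
        failures.append(f"contains_any={expects['contains_any']!r}")
    if "contains_all" in expects:
        missing = [s for s in expects["contains_all"] if s not in output]
        if missing:
            failures.append(f"contains_all missing {missing!r}")
    if "matches_regex" in expects and not re.search(
        expects["matches_regex"], output
    ):
        failures.append(f"matches_regex={expects['matches_regex']!r}")
    if "not_contains" in expects and expects["not_contains"] in output:
        failures.append(f"not_contains={expects['not_contains']!r}")
    return failures

sample = "severity: critical. SQL injection on line 23; parameterize the query."
ok = check_text_assertions(sample, {
    "contains_any": ["SQL", "injection"],
    "contains_all": ["line 23", "parameterize"],
    "matches_regex": "severity.*critical",
    "not_contains": "I cannot help",
})
bad = check_text_assertions("no security issues found", {"contains": "OWASP"})
print(ok, bad)
```

Returning the list of failures rather than a bare boolean is what makes an `expected:` / `got:` report possible.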

🧬 Reliability Testing

Run the same eval N times to measure variance:

claude-eval run --repeat 5

Output:

security-audit/sql-injection-in-login
  haiku:  5/5 passed   reliability: 1.00
  sonnet: 4/5 passed   reliability: 0.80   ⚠ flaky

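Any reliability below 0.95 is flagged as flaky. With --repeat 5, a single failure drops the score to 0.80, so one miss is enough to trip the flag. This is the test that catches "works on my machine" prompts.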
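
The reliability score in that output is consistent with a simple pass rate, passes divided by repeats. A minimal sketch under that assumption; `reliability` is a hypothetical helper name, and the 0.95 default mirrors the threshold stated above.

```python
def reliability(results, threshold=0.95):
    """Return (score, flaky) for a list of pass/fail booleans from repeated runs."""
    if not results:
        raise ValueError("need at least one run")
    score = sum(results) / len(results)
    return score, score < threshold

haiku = reliability([True, True, True, True, True])
sonnet = reliability([True, True, False, True, True])
print(haiku, sonnet)
```

With only five repeats the score moves in steps of 0.20, so a higher `--repeat` gives a finer-grained picture of a flaky case.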

📊 Comparison Runs

When Anthropic ships a new model, run:

claude-eval run --baseline claude-sonnet-4-6 --candidate claude-sonnet-4-8

────────────────────────────────────────────────────
  Skill              │ Baseline │ Candidate │ Delta
────────────────────────────────────────────────────
  security-audit     │ 92%      │ 96%       │ +4%  ✓
  refactor-deep      │ 88%      │ 87%       │ -1%
  api-mock           │ 95%      │ 95%       │  0%
  i18n-translate     │ 78%      │ 91%       │ +13% ✓
  changelog-gen      │ 99%      │ 98%       │ -1%
────────────────────────────────────────────────────
  Cost change:       $0.42 → $0.51 (+21%)

Now you know whether to upgrade your team's default model.
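
For reading the table: the Delta column is a difference in pass rates (percentage points), while the cost line is a relative change. A minimal sketch of both calculations using the numbers above; `compare` and `relative_change` are hypothetical names, not part of claude-eval.

```python
def compare(baseline, candidate):
    """Return {skill: delta in percentage points} for skills in both runs."""
    return {s: candidate[s] - baseline[s] for s in baseline if s in candidate}

def relative_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

# Pass rates (percent) copied from the comparison table.
baseline = {"security-audit": 92, "refactor-deep": 88, "api-mock": 95,
            "i18n-translate": 78, "changelog-gen": 99}
candidate = {"security-audit": 96, "refactor-deep": 87, "api-mock": 95,
             "i18n-translate": 91, "changelog-gen": 98}

deltas = compare(baseline, candidate)
regressions = sorted(s for s, d in deltas.items() if d < 0)
cost_change = round(relative_change(0.42, 0.51))
print(deltas, regressions, cost_change)
```

Listing the regressed skills separately makes it easier to decide whether a one-point drop is noise or worth a `--repeat` run.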

🎯 Fixture Types

Evals support fixtures in any format Claude can read:

fixtures/
├── code/            # Source files (.py, .ts, .go, etc.)
├── repos/           # Whole tiny repos (multi-file)
├── prompts/         # Specific user prompts
├── diffs/           # Pre-generated diffs
└── traces/          # Captured tool-use traces

🤖 CI Integration

claude-eval outputs JUnit XML, so it plugs into:

GitHub Actions
GitLab CI
CircleCI
Buildkite
Any system that understands JUnit

GitHub Actions example:

- name: Claude Eval
  run: claude-eval run --junit results.xml
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

- name: Publish results
  uses: dorny/test-reporter@v1
  with:
    name: Claude Evals
    path: results.xml
    reporter: jest-junit
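
For orientation, this is roughly what a JUnit-style report for eval results looks like. The sketch builds one with Python's standard library. The element and attribute names follow the common JUnit XML layout; the exact shape claude-eval emits is an assumption, and `to_junit` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def to_junit(suite_name, results):
    """Serialize eval results to a JUnit-style XML string.

    results: list of dicts with keys id, model, seconds, passed, message.
    """
    failures = sum(1 for r in results if not r["passed"])
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for r in results:
        case = ET.SubElement(
            suite,
            "testcase",
            classname=suite_name,
            name=f"{r['id']} ({r['model']})",
            time=f"{r['seconds']:.1f}",
        )
        if not r["passed"]:
            # A <failure> child marks the test case as failed.
            failure = ET.SubElement(case, "failure", message=r["message"])
            failure.text = r["message"]
    return ET.tostring(suite, encoding="unicode")

report = to_junit("security-audit", [
    {"id": "missing-csrf", "model": "haiku", "seconds": 3.1,
     "passed": True, "message": ""},
    {"id": "missing-csrf", "model": "sonnet", "seconds": 6.4,
     "passed": False, "message": "expected: mentions_pattern=csrf"},
])
print(report)
```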

💰 Cost Control

Default budget per run: $1.00. Configurable.

claude-eval run --budget 5.00
claude-eval run --cost-strategy=cheapest-passing

cheapest-passing runs Haiku first and falls back to Sonnet only if Haiku fails, which cuts eval costs by ~60%.
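
The fallback logic can be sketched in a few lines of Python: try models from cheapest to most expensive and stop at the first pass. This is an illustration under assumptions, not the tool's code. `cheapest_passing`, the `run_case` callback, and the `hard-case` id are all hypothetical.

```python
def cheapest_passing(case_id, models, run_case):
    """Try models in the given order (cheapest first); stop at the first pass.

    run_case(case_id, model) must return (passed, cost_usd).
    Returns the list of attempts made as (model, passed, cost_usd).
    """
    attempts = []
    for model in models:
        passed, cost = run_case(case_id, model)
        attempts.append((model, passed, cost))
        if passed:
            break
    return attempts

def fake_run(case_id, model):
    """Stand-in for a real evaluation, with canned outcomes."""
    outcomes = {
        ("clean-code", "haiku"): (True, 0.005),
        ("hard-case", "haiku"): (False, 0.006),
        ("hard-case", "sonnet"): (True, 0.034),
    }
    return outcomes[(case_id, model)]

easy = cheapest_passing("clean-code", ["haiku", "sonnet"], fake_run)
hard = cheapest_passing("hard-case", ["haiku", "sonnet"], fake_run)
hard_cost = round(sum(cost for _, _, cost in hard), 3)
print(easy, hard, hard_cost)
```

A case that fails on the cheap model costs more than running the expensive model alone, so the saving depends on how often the cheap model passes.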

🌟 Real-World Use

We've used claude-eval to:

🔄 Migrate our prompts from Sonnet 4.5 → 4.6 → 4.7 with confidence
🐛 Catch regressions when a model update silently changed behavior
📊 Compare custom skills to community baselines
💰 Pick the cheapest model that meets our quality bar

📜 License

MIT.

Star ⭐ to ship Claude Code skills with confidence.
