kasimmj/claude-code-test-runner

claude-code-test-runner

Eval framework for Claude Code skills, hooks, and prompts. Test at scale. Catch regressions across model versions. Ship with confidence.


pip install claude-eval
claude-eval init


🧪 Why this matters

You wrote a Claude Code skill. It works for your test prompts. You have no idea what happens at scale.

  • Does it handle ambiguous inputs?
  • Does it work on Sonnet AND Opus AND Haiku?
  • When Anthropic ships a new model, does it break?
  • What's the cost per invocation?

claude-eval answers all of these systematically.


⚡ Quick Start

pip install claude-eval
claude-eval init
# Creates tests/eval/ with sample evals

Write your first eval (it's just YAML):

# tests/eval/security-audit.yaml
skill: security-audit
description: "Catches OWASP Top 10 in small Express apps"
models: [claude-haiku-4, claude-sonnet-4-6]

cases:
  - id: sql-injection-in-login
    fixture: fixtures/sqli-login.js
    expects:
      severity_contains: critical
      mentions_line: 23
      suggests_fix_with: parameterized
    cost_limit: 0.10

  - id: missing-csrf
    fixture: fixtures/no-csrf.js
    expects:
      severity_contains: high
      mentions_pattern: csrf

  - id: clean-code
    fixture: fixtures/clean.js
    expects:
      no_critical: true
      no_high: true

Run:

$ claude-eval run

Running 3 cases × 2 models = 6 evaluations...

  ✓ security-audit/sql-injection-in-login    (haiku)   4.2s   $0.008
  ✓ security-audit/sql-injection-in-login    (sonnet)  5.8s   $0.030
  ✓ security-audit/missing-csrf              (haiku)   3.1s   $0.006
  ✗ security-audit/missing-csrf              (sonnet)  6.4s   $0.034
      expected: mentions_pattern=csrf
      got:      "no security issues found"
  ✓ security-audit/clean-code                (haiku)   2.8s   $0.005
  ✓ security-audit/clean-code                (sonnet)  4.1s   $0.024

5/6 passed. Total cost: $0.107. Total time: 26.4s.
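The "3 cases × 2 models = 6 evaluations" line is a plain cross product of cases and models. A minimal sketch of that expansion follows; `expand_matrix` and the in-memory lists are hypothetical names used for illustration, not claude-eval's actual internals.

```python
from itertools import product

# Hypothetical in-memory view of the eval file above (not claude-eval's real data model).
case_ids = ["sql-injection-in-login", "missing-csrf", "clean-code"]
models = ["claude-haiku-4", "claude-sonnet-4-6"]


def expand_matrix(skill, case_ids, models):
    """Return one (label, model) pair per evaluation, iterating cases outermost."""
    return [
        (f"{skill}/{case_id}", model)
        for case_id, model in product(case_ids, models)
    ]


evaluations = expand_matrix("security-audit", case_ids, models)
print(f"Running {len(case_ids)} cases × {len(models)} models = {len(evaluations)} evaluations...")
```

Adding a model to the `models` list grows the run linearly in the number of cases, which is worth keeping in mind alongside the per-run budget.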

πŸ“ Eval Assertions

expects:
  # Text assertions
  contains: "OWASP"
  contains_any: ["SQL", "injection"]
  contains_all: ["line 23", "parameterize"]
  matches_regex: "severity.*critical"

  # Structural assertions (for JSON-mode outputs)
  json_path:
    "$.severity": "critical"
    "$.findings[*].line": [23, 47]

  # Negative assertions
  not_contains: "I cannot help"
  no_critical: true

  # Semantic assertions (uses a stronger model as judge)
  semantic_match: "Identifies the SQL injection vulnerability in the login handler"

  # Performance assertions
  max_tokens: 2000
  max_cost: 0.05
  max_latency_ms: 8000

  # Tool-use assertions
  uses_tool: Read
  tool_call_count: { Edit: ">=2" }
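To make the text and negative assertions above concrete, here is a rough sketch of how they could be checked against a model's output. It assumes case-sensitive substring matching and `re.search` semantics for `matches_regex`; the function name `check_text_assertions` is hypothetical, and the real tool may behave differently.

```python
import re


def check_text_assertions(output: str, expects: dict) -> list:
    """Return failure messages; an empty list means every check passed.

    Assumption: matching is case-sensitive and regexes may match anywhere.
    """
    failures = []
    if "contains" in expects and expects["contains"] not in output:
        failures.append(f"contains={expects['contains']!r}")
    if "contains_any" in expects and not any(s in output for s in expects["contains_any"]):
        failures.append(f"contains_any={expects['contains_any']!r}")
    if "contains_all" in expects:
        missing = [s for s in expects["contains_all"] if s not in output]
        if missing:
            failures.append(f"contains_all missing {missing!r}")
    if "matches_regex" in expects and not re.search(expects["matches_regex"], output):
        failures.append(f"matches_regex={expects['matches_regex']!r}")
    if "not_contains" in expects and expects["not_contains"] in output:
        failures.append(f"not_contains={expects['not_contains']!r}")
    return failures


sample_output = "severity: critical. SQL injection on line 23; parameterize the query (OWASP Top 10)."
sample_expects = {
    "contains": "OWASP",
    "contains_any": ["SQL", "injection"],
    "contains_all": ["line 23", "parameterize"],
    "matches_regex": "severity.*critical",
    "not_contains": "I cannot help",
}
print(check_text_assertions(sample_output, sample_expects))
```

Returning a list of failures rather than a single boolean makes it easy to print an `expected:` line for each assertion that did not hold.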

🧬 Reliability Testing

Run the same eval N times to measure variance:

claude-eval run --repeat 5

Output:

security-audit/sql-injection-in-login
  haiku:  5/5 passed   reliability: 1.00
  sonnet: 4/5 passed   reliability: 0.80   ⚠ flaky

Any case with a reliability score below 0.95 is flagged as flaky. This is the test that catches "works on my machine" prompts.
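The reliability score in the output above reads as the pass rate over the repeated runs. A minimal sketch under that assumption, with hypothetical helper names `reliability` and `is_flaky`:

```python
def reliability(results):
    """Fraction of repeated runs that passed (assumed definition of the score)."""
    if not results:
        raise ValueError("need at least one run")
    return sum(1 for passed in results if passed) / len(results)


def is_flaky(results, threshold=0.95):
    """Flag a case whose pass rate falls below the threshold (0.95 per the text)."""
    return reliability(results) < threshold


haiku_runs = [True, True, True, True, True]
sonnet_runs = [True, True, False, True, True]
print(f"haiku:  reliability {reliability(haiku_runs):.2f} flaky={is_flaky(haiku_runs)}")
print(f"sonnet: reliability {reliability(sonnet_runs):.2f} flaky={is_flaky(sonnet_runs)}")
```

With `--repeat 5`, a single failure already drops the score to 0.80, so any failure trips the 0.95 threshold; at least 20 repeats are needed before one failure can be tolerated.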


📊 Comparison Runs

When Anthropic ships a new model, run:

claude-eval run --baseline claude-sonnet-4-6 --candidate claude-sonnet-4-8

────────────────────────────────────────────────────
  Skill              │ Baseline │ Candidate │ Delta
────────────────────────────────────────────────────
  security-audit     │ 92%      │ 96%       │ +4%  ✓
  refactor-deep      │ 88%      │ 87%       │ -1%
  api-mock           │ 95%      │ 95%       │  0%
  i18n-translate     │ 78%      │ 91%       │ +13% ✓
  changelog-gen      │ 99%      │ 98%       │ -1%
────────────────────────────────────────────────────
  Cost change:       $0.42 → $0.51 (+21%)

Now you know whether to upgrade your team's default model.
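The arithmetic behind the comparison table can be reproduced in a few lines. In this sketch the Delta column is treated as a difference in percentage points and the cost change as a relative percentage; both readings are assumptions based on the sample output, and the function names are hypothetical.

```python
def pass_rate_deltas(baseline, candidate):
    """Per-skill change in pass rate, in percentage points (candidate minus baseline)."""
    return {skill: candidate[skill] - baseline[skill] for skill in baseline}


def cost_change_pct(baseline_cost, candidate_cost):
    """Relative cost change, rounded to a whole percent."""
    return round((candidate_cost - baseline_cost) / baseline_cost * 100)


# Pass rates (in percent) copied from the sample table above.
baseline = {"security-audit": 92, "refactor-deep": 88, "api-mock": 95,
            "i18n-translate": 78, "changelog-gen": 99}
candidate = {"security-audit": 96, "refactor-deep": 87, "api-mock": 95,
             "i18n-translate": 91, "changelog-gen": 98}

for skill, delta in pass_rate_deltas(baseline, candidate).items():
    print(f"{skill:<18} {delta:+d} pts")
print(f"Cost change: {cost_change_pct(0.42, 0.51):+d}%")
```

A one-point drop such as `refactor-deep` may sit within run-to-run noise, so pairing a comparison run with `--repeat` gives a firmer basis for an upgrade decision.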


🎯 Fixture Types

Evals support fixtures in any format Claude can read:

fixtures/
├── code/            # Source files (.py, .ts, .go, etc.)
├── repos/           # Whole tiny repos (multi-file)
├── prompts/         # Specific user prompts
├── diffs/           # Pre-generated diffs
└── traces/          # Captured tool-use traces

🤖 CI Integration

claude-eval outputs JUnit XML, so it plugs into:

  • GitHub Actions
  • GitLab CI
  • CircleCI
  • Buildkite
  • Any system that understands JUnit

GitHub Actions example:

- name: Claude Eval
  run: claude-eval run --junit results.xml
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

- name: Publish results
  # Run even when the eval step fails, so failing cases still get reported
  if: always()
  uses: dorny/test-reporter@v1
  with:
    name: Claude Evals
    path: results.xml
    reporter: jest-junit
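For readers unfamiliar with the format, the following sketch shows roughly what a JUnit-style report for eval results could look like, built with Python's standard `xml.etree.ElementTree`. The element layout follows the common `testsuite`/`testcase`/`failure` convention; the exact attributes claude-eval writes are not documented here, so `to_junit_xml` and the result dictionaries are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def to_junit_xml(suite_name, results):
    """Build a minimal JUnit-style XML string.

    results: list of dicts with keys "name", "seconds", and "failure"
    (a message string, or None when the case passed). Hypothetical shape.
    """
    failures = sum(1 for r in results if r["failure"])
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for r in results:
        case = ET.SubElement(
            suite, "testcase",
            classname=suite_name, name=r["name"], time=f"{r['seconds']:.1f}",
        )
        if r["failure"]:
            ET.SubElement(case, "failure", message=r["failure"])
    return ET.tostring(suite, encoding="unicode")


xml_text = to_junit_xml("security-audit", [
    {"name": "missing-csrf (haiku)", "seconds": 3.1, "failure": None},
    {"name": "missing-csrf (sonnet)", "seconds": 6.4,
     "failure": "expected mentions_pattern=csrf"},
])
print(xml_text)
```

Because CI systems key on `testcase` names, including the model in the name keeps per-model results distinguishable in the report.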

💰 Cost Control

Default budget per run: $1.00. Configurable.

claude-eval run --budget 5.00
claude-eval run --cost-strategy=cheapest-passing

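The cheapest-passing strategy runs each case on Haiku first and falls back to Sonnet only if Haiku fails. In our usage this cuts eval costs by roughly 60%; the actual saving depends on how often Haiku passes.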
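The fallback logic described for cheapest-passing amounts to trying models in ascending cost order and stopping at the first pass. A minimal sketch, where `cheapest_passing` and the `run_case` callback are hypothetical stand-ins for whatever the tool does internally:

```python
from typing import Callable, List, Optional, Tuple


def cheapest_passing(
    models_by_cost: List[str],
    run_case: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Try models from cheapest to most expensive; stop at the first pass.

    Returns (passing model or None, list of models actually invoked).
    """
    tried = []
    for model in models_by_cost:
        tried.append(model)
        if run_case(model):
            return model, tried
    return None, tried


# Stand-in runner for illustration: pretend only the larger model passes.
winner, tried = cheapest_passing(["haiku", "sonnet"], lambda model: model == "sonnet")
print(winner, tried)
```

The saving comes from the cases where the first model passes and the more expensive one is never invoked; a case that fails on Haiku costs more than running Sonnet alone.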


🌟 Real-World Use

We've used claude-eval to:

  • 🔄 Migrate our prompts from Sonnet 4.5 → 4.6 → 4.7 with confidence
  • 🐛 Catch regressions when a model update silently changed behavior
  • 📊 Compare custom skills to community baselines
  • 💰 Pick the cheapest model that meets our quality bar

📜 License

MIT.


Star ⭐ to ship Claude Code skills with confidence.
