test(comment-review): promptfoo eval suite + root eval tooling by grixu · Pull Request #4 · grixu/cc-toolkit

grixu · 2026-06-18T08:49:43Z

What

Turns the leftover skill-creator eval artifacts for the comment-review skill into a repeatable promptfoo eval suite, plus shared root-level tooling to run evals across the repo.

How the suite works

Provider: native anthropic:claude-agent-sdk, loading comment-review as a local plugin and running the real skill with tools — on the Claude Code subscription (apiKeyRequired: false, no ANTHROPIC_API_KEY needed).
5 tests (one per fixture set). Assertions are ported from the bullets in the saved eval_metadata.json / evals.json as llm-rubric checks, plus a regex "skill ran" proxy. The proxy is needed because plugin skills load via Agent-Skills injection, so skill-used/skillCalls stay empty and cannot be asserted on directly.
Fixtures: 2 ported from the skill-creator workspace (payment-validator.ts, host-allowlist.ts) + 4 reconstructed from their assertion lists (datadog-integration.tf, scheduler.ts, dlq-codes.ts + dlq.handler.ts).
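
To make the regex "skill ran" proxy concrete, here is a minimal shell sketch of the idea. The pattern is an assumption, not the suite's actual regex: it guesses that a real comment-review run cites at least one rule id in the R1-R12 range together with a verdict keyword. KEEP is a hypothetical verdict name (this PR only names REWRITE and REMOVE), and skill_ran is an illustrative helper that does not exist in the repo.

```shell
# Hypothetical stand-in for the suite's regex "skill ran" check.
# Assumption: output from a real run mentions a rule id (R1..R12) and a
# verdict keyword. The pattern used by the actual suite may differ.
skill_ran() {
  # First grep: a standalone rule id R1..R12 (rejects R13, R0, XR1).
  # Second grep: a standalone verdict keyword (KEEP is an assumed name).
  printf '%s\n' "$1" | grep -Eq '(^|[^A-Za-z0-9])R([1-9]|1[0-2])([^0-9]|$)' &&
    printf '%s\n' "$1" | grep -Eq '(^|[^A-Za-z])(KEEP|REWRITE|REMOVE)([^A-Za-z]|$)'
}
```

A check like this only shows that the output looks like the skill's output. It says nothing about whether the verdicts are right, which is what the llm-rubric assertions cover.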

Baseline

                    sonnet     opus
asserts passed      41 / 42    42 / 42

Two assertions are deliberately instructive:

eval-3 §4.1 — genuine judgment boundary (after the spec-id is stripped the comment borders on R1). Accepts REWRITE or REMOVE; only "kept as-is" fails.
eval-4 token-verification — real capability discriminator: kept strict. Sonnet takes a "letter+number ⇒ spec-id" shortcut and fails; opus shows the per-token check and passes.

Root tooling

package.json — single shared node_modules, dev deps + eval scripts.
scripts/run-evals.sh — discovers and runs every plugins/*/evals/promptfooconfig.yaml.
pnpm-workspace.yaml — allowBuilds: false (avoids ERR_PNPM_IGNORED_BUILDS breaking pnpm 11's pre-run check).
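
The discover-and-run behaviour described for scripts/run-evals.sh could look roughly like the sketch below. This is not the script from the PR: run_all and the EVAL_CMD override are hypothetical names introduced for illustration, and the default command assumes promptfoo is invoked as `pnpm exec promptfoo eval -c <config>`.

```shell
# Hypothetical sketch of a discover-and-run loop; not the actual scripts/run-evals.sh.
# run_all ROOT: run every plugins/*/evals/promptfooconfig.yaml found under ROOT.
run_all() {
  root="${1:-.}"
  failed=0
  for config in "$root"/plugins/*/evals/promptfooconfig.yaml; do
    # An unmatched glob stays literal, so skip anything that is not a real file.
    [ -f "$config" ] || continue
    echo "==> $config"
    # EVAL_CMD is an assumed override hook (handy for dry runs); it is left
    # unquoted on purpose so a multi-word command splits into arguments.
    ${EVAL_CMD:-pnpm exec promptfoo eval -c} "$config" || failed=1
  done
  # Non-zero if any suite failed, but only after every suite has been tried.
  return "$failed"
}
```

Calling `run_all .` from the repo root would run each suite in turn. Continuing past a failing suite and reporting the failure at the end is a design choice of this sketch, so one broken plugin does not hide results from the others.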

Run

pnpm install
pnpm eval                 # all suites
pnpm eval:comment-review  # just this one

See plugins/comment-review/evals/README.md for details and knobs.

Single shared root node_modules + scripts to run every plugin's eval suite.
- package.json: devDeps (@anthropic-ai/claude-agent-sdk, promptfoo) + eval scripts
- scripts/run-evals.sh: discovers and runs plugins/*/evals/promptfooconfig.yaml
- pnpm-workspace.yaml: allowBuilds=false (avoids ERR_PNPM_IGNORED_BUILDS on pnpm 11)
- .gitignore: node_modules, pnpm-lock.yaml, eval outputs

5 evals (R1-R12 comment-quality verdicts) running the real skill via the native anthropic:claude-agent-sdk provider (local plugin, subscription auth).
- fixtures: 2 ported from skill-creator + 4 reconstructed (datadog, scheduler, dlq pair)
- assertions: llm-rubric per criterion + regex 'skill ran' proxy
- baseline: sonnet 41/42, opus 42/42; eval-3 §4.1 accepts REWRITE-or-REMOVE, eval-4 token-verification kept strict as a model-capability marker

grixu added 2 commits June 18, 2026 10:19

grixu merged commit bef1150 into main Jun 18, 2026
2 checks passed

grixu merged 2 commits into main from feat/add-evals-to-comments
