
feat: add promptfoo eval harness for agent quality scoring#371

Open
jonesrussell wants to merge 10 commits into msitarzewski:main from jonesrussell:feat/eval-harness-clean

Conversation

@jonesrussell

Summary

Adds a promptfoo-based evaluation harness in evals/ that measures specialist agent quality across 5 criteria using LLM-as-judge scoring. This is the first step toward automated quality assurance for the agent prompt collection.

  • Proof-of-concept evaluates 3 agents (backend-architect, ux-architect, historian) against 6 tasks
  • Includes extract-metrics.ts script to parse agent success metrics from markdown files
  • 5 unit tests for the extraction script, all passing
  • First run: 5/6 passed (83%) — the UX Architect failed on one task, showing the harness discriminates rather than rubber-stamping
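The heading-matching behavior of extract-metrics.ts can be sketched roughly as below. This is a minimal sketch, not the script's actual API: `normalizeHeading`, `extractSection`, and the assumption that metrics live as bullet lines under a named section are all illustrative. The normalization (stripping the `#` prefix and emoji before comparing section names) mirrors the fix described in the commit history.

```typescript
// Sketch of metric extraction from an agent markdown file.
// Headings are normalized (strip leading '#' and emoji) before
// matching the section name, so unrelated headings don't match.
function normalizeHeading(line: string): string {
  return line
    .replace(/^#+\s*/, "") // strip markdown heading prefix
    .replace(/[\u{1F300}-\u{1FAFF}\u{2600}-\u{27BF}]/gu, "") // strip common emoji
    .trim()
    .toLowerCase();
}

function extractSection(markdown: string, section: string): string[] {
  const out: string[] = [];
  let inSection = false;
  for (const line of markdown.split("\n")) {
    if (/^#+\s/.test(line)) {
      inSection = normalizeHeading(line) === section.toLowerCase();
      continue;
    }
    if (inSection && line.trim().startsWith("-")) {
      out.push(line.trim().replace(/^-\s*/, ""));
    }
  }
  return out;
}
```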

Scoring Criteria

| Criterion | What It Measures |
| --- | --- |
| Task Completion | Did the agent produce the requested deliverable? |
| Instruction Adherence | Did it follow its own defined workflow/format? |
| Identity Consistency | Did it stay in character? |
| Deliverable Quality | Expert-level, actionable output? |
| Safety | No harmful or biased content? |

Each criterion is scored 1-5 by the LLM judge. Pass threshold: average >= 3.5.
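The pass/fail rule above is just an average over the five criteria; a minimal sketch (field names are illustrative, not the harness's actual types):

```typescript
// Sketch of the pass rule: five 1-5 scores, pass when average >= 3.5.
type Scores = {
  taskCompletion: number;
  instructionAdherence: number;
  identityConsistency: number;
  deliverableQuality: number;
  safety: number;
};

function passes(s: Scores, threshold = 3.5): boolean {
  const values = Object.values(s);
  const avg = values.reduce((a, b) => a + b, 0) / values.length;
  return avg >= threshold;
}
```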

How to run

cd evals
npm install
export ANTHROPIC_API_KEY=your-key
npx promptfoo eval
npx promptfoo view  # interactive results viewer
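For orientation, a promptfoo config for this kind of LLM-rubric eval generally looks like the sketch below. File paths, the model id, and the rubric wording are illustrative, not this repo's actual promptfooconfig.yaml; the field names (prompts, providers, tests, assert, llm-rubric) follow promptfoo's config schema.

```yaml
# Hypothetical promptfooconfig.yaml sketch -- not the PR's actual file.
description: Specialist agent quality eval
prompts:
  - file://prompts/backend-architect.md   # illustrative path
providers:
  - anthropic:messages:claude-3-5-haiku-20241022   # illustrative model id
tests:
  - vars:
      task: Design a REST API for a todo app   # illustrative task
    assert:
      - type: llm-rubric
        value: |
          Score 1-5 on task completion, instruction adherence,
          identity consistency, deliverable quality, and safety.
          Pass only if the average is at least 3.5.
```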

Cost

~$0.05 per run at Haiku pricing (166K tokens). The full 184-agent suite is estimated at ~$1.50/run.

What's next

This is M1 of a 3-milestone plan:

  • M1 (this PR): Eval harness with 3 proof-of-concept agents
  • M2: Benchmark dataset covering all 184 agents with baseline scores
  • M3: CI quality gate — PR score gating + nightly trending
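The M3 gate could be as small as a script that fails CI when any agent's average drops below the pass threshold. A rough sketch, assuming per-agent averages have already been parsed into a map (the input shape here is hypothetical, not promptfoo's actual output schema):

```typescript
// Hypothetical CI quality gate: collect agents whose average score
// falls below the pass threshold. Input shape is illustrative.
function gate(avgScores: Record<string, number>, threshold = 3.5): string[] {
  return Object.entries(avgScores)
    .filter(([, avg]) => avg < threshold)
    .map(([agent]) => agent);
}

// Example usage with made-up scores:
const failures = gate({
  "backend-architect": 4.2,
  "ux-architect": 3.1, // below threshold
  "historian": 4.6,
});
if (failures.length > 0) {
  console.error(`Quality gate failed for: ${failures.join(", ")}`);
}
```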

Design

The eval harness is fully isolated in evals/ with its own package.json — it doesn't touch any existing agent files or require changes to the contribution workflow. It's opt-in tooling for measuring and improving prompt quality.

Test plan

  • npx vitest run — 5/5 extract-metrics tests pass
  • npx promptfoo eval — 5/6 tests pass, 1 meaningful failure
  • npx promptfoo view — interactive results browser works

jonesrussell and others added 10 commits March 30, 2026 11:52
Strip # prefix and emoji from headings before matching section names,
preventing false positives from unrelated headings. Switch from
deprecated glob.sync to named globSync export.
