
feat: add promptfoo eval harness for agent quality scoring#371

Open
jonesrussell wants to merge 10 commits into msitarzewski:main from jonesrussell:feat/eval-harness-clean

Conversation

@jonesrussell

Summary

Adds a promptfoo-based evaluation harness in evals/ that measures specialist agent quality across 5 criteria using LLM-as-judge scoring. This is the first step toward automated quality assurance for the agent prompt collection.

  • Proof-of-concept evaluates 3 agents (backend-architect, ux-architect, historian) against 6 tasks
  • Includes extract-metrics.ts script to parse agent success metrics from markdown files
  • 5 unit tests for the extraction script, all passing
  • First run: 5/6 passed (83%) — the UX Architect failed on one task, showing the harness discriminates rather than rubber-stamping
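The heading-matching behavior of extract-metrics.ts can be sketched roughly as below. This is a minimal sketch, not the script's actual API: `normalizeHeading`, `extractSection`, and the assumption that metrics live as bullet lines under a named section are all illustrative. The normalization (stripping the `#` prefix and emoji before comparing section names) mirrors the fix described in the commit history.

```typescript
// Sketch of metric extraction from an agent markdown file.
// Headings are normalized (strip leading '#' and emoji) before
// matching the section name, so unrelated headings don't match.
function normalizeHeading(line: string): string {
  return line
    .replace(/^#+\s*/, "") // strip markdown heading prefix
    .replace(/[\u{1F300}-\u{1FAFF}\u{2600}-\u{27BF}]/gu, "") // strip common emoji
    .trim()
    .toLowerCase();
}

function extractSection(markdown: string, section: string): string[] {
  const out: string[] = [];
  let inSection = false;
  for (const line of markdown.split("\n")) {
    if (/^#+\s/.test(line)) {
      inSection = normalizeHeading(line) === section.toLowerCase();
      continue;
    }
    if (inSection && line.trim().startsWith("-")) {
      out.push(line.trim().replace(/^-\s*/, ""));
    }
  }
  return out;
}
```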

Scoring Criteria

| Criterion | What It Measures |
| --- | --- |
| Task Completion | Did the agent produce the requested deliverable? |
| Instruction Adherence | Did it follow its own defined workflow/format? |
| Identity Consistency | Did it stay in character? |
| Deliverable Quality | Expert-level, actionable output? |
| Safety | No harmful or biased content? |

Each criterion is scored 1-5 by the LLM judge. Pass threshold: average >= 3.5.
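The pass/fail rule above is just an average over the five criteria; a minimal sketch (field names are illustrative, not the harness's actual types):

```typescript
// Sketch of the pass rule: five 1-5 scores, pass when average >= 3.5.
type Scores = {
  taskCompletion: number;
  instructionAdherence: number;
  identityConsistency: number;
  deliverableQuality: number;
  safety: number;
};

function passes(s: Scores, threshold = 3.5): boolean {
  const values = Object.values(s);
  const avg = values.reduce((a, b) => a + b, 0) / values.length;
  return avg >= threshold;
}
```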

How to run

cd evals
npm install
export ANTHROPIC_API_KEY=your-key
npx promptfoo eval
npx promptfoo view  # interactive results viewer
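For orientation, a promptfoo config for this kind of LLM-rubric eval generally looks like the sketch below. File paths, the model id, and the rubric wording are illustrative, not this repo's actual promptfooconfig.yaml; the field names (prompts, providers, tests, assert, llm-rubric) follow promptfoo's config schema.

```yaml
# Hypothetical promptfooconfig.yaml sketch -- not the PR's actual file.
description: Specialist agent quality eval
prompts:
  - file://prompts/backend-architect.md   # illustrative path
providers:
  - anthropic:messages:claude-3-5-haiku-20241022   # illustrative model id
tests:
  - vars:
      task: Design a REST API for a todo app   # illustrative task
    assert:
      - type: llm-rubric
        value: |
          Score 1-5 on task completion, instruction adherence,
          identity consistency, deliverable quality, and safety.
          Pass only if the average is at least 3.5.
```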

Cost

~$0.05 per run at Haiku pricing (166K tokens). The full 184-agent suite is estimated at ~$1.50/run.

What's next

This is M1 of a 3-milestone plan:

  • M1 (this PR): Eval harness with 3 proof-of-concept agents
  • M2: Benchmark dataset covering all 184 agents with baseline scores
  • M3: CI quality gate — PR score gating + nightly trending
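The M3 gate could be as small as a script that fails CI when any agent's average drops below the pass threshold. A rough sketch, assuming per-agent averages have already been parsed into a map (the input shape here is hypothetical, not promptfoo's actual output schema):

```typescript
// Hypothetical CI quality gate: collect agents whose average score
// falls below the pass threshold. Input shape is illustrative.
function gate(avgScores: Record<string, number>, threshold = 3.5): string[] {
  return Object.entries(avgScores)
    .filter(([, avg]) => avg < threshold)
    .map(([agent]) => agent);
}

// Example usage with made-up scores:
const failures = gate({
  "backend-architect": 4.2,
  "ux-architect": 3.1, // below threshold
  "historian": 4.6,
});
if (failures.length > 0) {
  console.error(`Quality gate failed for: ${failures.join(", ")}`);
}
```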

Design

The eval harness is fully isolated in evals/ with its own package.json — it doesn't touch any existing agent files or require changes to the contribution workflow. It's opt-in tooling for measuring and improving prompt quality.

Test plan

  • npx vitest run — 5/5 extract-metrics tests pass
  • npx promptfoo eval — 5/6 tests pass, 1 meaningful failure
  • npx promptfoo view — interactive results browser works

jonesrussell and others added 10 commits March 30, 2026 11:52
Strip # prefix and emoji from headings before matching section names,
preventing false positives from unrelated headings. Switch from
deprecated glob.sync to named globSync export.
