feat: add promptfoo eval harness for agent quality scoring #371
Open
jonesrussell wants to merge 10 commits into msitarzewski:main from
Conversation
Strip # prefix and emoji from headings before matching section names, preventing false positives from unrelated headings. Switch from deprecated glob.sync to named globSync export.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
siphomaribo approved these changes on Mar 31, 2026
Summary
Adds a promptfoo-based evaluation harness in `evals/` that measures specialist agent quality across 5 criteria using LLM-as-judge scoring. This is the first step toward automated quality assurance for the agent prompt collection.

- `extract-metrics.ts` script to parse agent success metrics from markdown files

Scoring Criteria
Each criterion is scored 1-5 by an LLM judge. Pass threshold: average >= 3.5.
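A promptfoo test expressing this kind of rubric might look like the sketch below. This is illustrative only, not the actual config in `evals/`: the agent name, prompt wiring, and rubric wording are assumptions.

```yaml
# Illustrative sketch of one eval test; the real config lives in evals/.
tests:
  - vars:
      agent: code-reviewer        # hypothetical agent name
    assert:
      - type: llm-rubric          # promptfoo's LLM-as-judge assertion
        value: |
          Score the response 1-5 on each criterion.
          Pass only if the average score is >= 3.5.
```

`llm-rubric` delegates grading to a judge model, which is what keeps the harness cheap to extend: adding a criterion is a rubric edit, not new code.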
How to run
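The commands below are assembled from the test plan and the design notes (the harness lives in `evals/` with its own `package.json`); the `npm install` step is an assumed prerequisite.

```shell
# From the repo root; the eval harness is isolated in evals/
cd evals
npm install          # install promptfoo, vitest, and other dev dependencies (assumed)
npx promptfoo eval   # run the eval suite
npx promptfoo view   # browse results in the interactive viewer
```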
Cost
~$0.05 per run at Haiku pricing (166K tokens). The full 184-agent suite is estimated at ~$1.50/run.
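The full-suite figure is consistent with simple linear scaling. The 6-agents-per-run divisor below is an assumption (one eval test per agent, matching the 6 tests in the test plan), not a number stated in the PR:

```python
# Back-of-envelope check of the full-suite cost estimate.
per_run_cost = 0.05        # observed: ~$0.05/run at Haiku pricing (166K tokens)
agents_per_run = 6         # assumption: one eval test per agent in this PR
full_suite_agents = 184

estimate = per_run_cost * full_suite_agents / agents_per_run
print(f"${estimate:.2f}")  # ≈ $1.53, matching the ~$1.50/run figure
```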
What's next
This is M1 of a 3-milestone plan:
Design
The eval harness is fully isolated in `evals/` with its own `package.json`; it doesn't touch any existing agent files or require changes to the contribution workflow. It's opt-in tooling for measuring and improving prompt quality.

Test plan
- `npx vitest run`: 5/5 extract-metrics tests pass
- `npx promptfoo eval`: 5/6 tests pass, 1 meaningful failure
- `npx promptfoo view`: interactive results browser works
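On the extract-metrics side, the commit history notes that headings are normalized (the `#` prefix and emoji are stripped) before section names are matched, to avoid false positives from unrelated headings. A minimal sketch of that normalization, with hypothetical function names:

```typescript
// Sketch of the heading normalization described in the commits (names are
// hypothetical): strip the markdown "#" prefix and emoji so a decorated
// heading like "## 📊 Success Metrics" still matches a plain section name.

function normalizeHeading(line: string): string {
  return line
    .replace(/^#+\s*/, "")                                    // drop markdown "#" prefix
    .replace(/[\u{1F000}-\u{1FAFF}\u{2600}-\u{27BF}]/gu, "")  // drop common emoji ranges
    .trim();
}

function isSection(line: string, name: string): boolean {
  // Exact match after normalization, so "Success Metrics Overview"
  // does not count as the "Success Metrics" section.
  return normalizeHeading(line).toLowerCase() === name.toLowerCase();
}
```

Matching exactly (rather than by substring) is what prevents unrelated headings from being picked up as section starts.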