6 changes: 6 additions & 0 deletions evals/.gitignore
@@ -0,0 +1,6 @@
node_modules/
dist/
.promptfoo/
results/latest.json
*.log
.env
88 changes: 88 additions & 0 deletions evals/README.md
@@ -0,0 +1,88 @@
# Agency-Agents Evaluation Harness

Automated quality evaluation for the agency-agents specialist prompt collection using [promptfoo](https://www.promptfoo.dev/).

## Quick Start

```bash
cd evals
npm install
export ANTHROPIC_API_KEY=your-key-here
npx promptfoo eval
```

## How It Works

The eval harness tests each specialist agent prompt by:

1. Loading the agent's markdown file as a system prompt
2. Sending it a representative task for its category
3. Using a separate LLM-as-judge to score the output on 5 criteria
4. Reporting pass/fail per agent
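
Concretely, a test entry in `promptfooconfig.yaml` could take roughly the shape below; the prompt template path, model ids, task text, and rubric wording are illustrative assumptions, not the project's actual config.

```yaml
# Illustrative sketch only — not the actual promptfooconfig.yaml.
prompts:
  - file://prompts/agent-chat.json   # hypothetical chat template: agent markdown as system message, {{task}} as user message
providers:
  - anthropic:messages:claude-3-5-sonnet-latest   # agent-under-test
tests:
  - description: "engineering: API design task"
    vars:
      task: "Design a REST API for a small inventory service."
    assert:
      - type: llm-rubric
        value: "The response delivers the requested design and follows the agent's stated output format."
```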

### Scoring Criteria

| Criterion | What It Measures |
|---|---|
| Task Completion | Did the agent produce the requested deliverable? |
| Instruction Adherence | Did it follow its own defined workflow and output format? |
| Identity Consistency | Did it stay in character per its personality and communication style? |
| Deliverable Quality | Is the output well-structured, actionable, and domain-appropriate? |
| Safety | No harmful, biased, or off-topic content |

Each criterion is scored **1-5**. An agent passes if its average score is **>= 3.5**.
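
For instance, per-criterion scores of 4, 4, 3, 4, and 3 average 3.6 and pass, while 4, 3, 3, 3, and 3 average 3.2 and fail.

The score anchors live in `rubrics/universal.yaml`. A rough sketch of one criterion entry, assuming hypothetical key names and anchor wording:

```yaml
# Illustrative only — the real rubrics/universal.yaml may use different keys and anchor text.
- criterion: task_completion
  question: "Did the agent produce the requested deliverable?"
  anchors:
    1: "Ignores the task or produces something unrelated."
    3: "Addresses the task but misses part of the requested deliverable."
    5: "Fully delivers what was asked, in the form requested."
```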

### Judge Model

The agent-under-test uses Claude Sonnet. The judge uses Claude Haiku (a different model to avoid self-preference bias).
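
In promptfoo terms, this split is usually expressed by overriding the grading provider. A minimal sketch, assuming illustrative model ids:

```yaml
# Illustrative — exact model ids may differ from the real config.
providers:
  - anthropic:messages:claude-3-5-sonnet-latest   # agent-under-test
defaultTest:
  options:
    provider: anthropic:messages:claude-3-5-haiku-latest   # judge used by llm-rubric assertions
```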

## Viewing Results

```bash
npx promptfoo view
```

Opens an interactive browser UI with detailed scores, outputs, and judge reasoning.

## Project Structure

```
evals/
  promptfooconfig.yaml   # Main config — providers, test suites, assertions
  rubrics/
    universal.yaml       # 5 universal criteria with score anchor descriptions
  tasks/
    engineering.yaml     # Test tasks for engineering agents
    design.yaml          # Test tasks for design agents
    academic.yaml        # Test tasks for academic agents
  scripts/
    extract-metrics.ts   # Parses agent markdown → structured metrics JSON
```

## Adding Test Cases

Create or edit a file in `tasks/` following this format:

```yaml
- id: unique-task-id
description: "Short description of what this tests"
prompt: |
The actual prompt/task to send to the agent.
Be specific about what you want the agent to produce.
```
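
A filled-in example, with a hypothetical id and task wording:

```yaml
- id: design-landing-page-critique
  description: "Tests whether a design agent can critique a landing page and propose fixes"
  prompt: |
    A fictional SaaS landing page converts at 0.8%.
    Identify the three most likely conversion blockers and propose one specific fix for each.
```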

## Extract Metrics Script

Parse agent files to see their structured success metrics:

```bash
npx ts-node scripts/extract-metrics.ts "../engineering/*.md"
```

## Cost

Each evaluation runs the agent model once per task and the judge model 5 times per task (once per criterion). For the current 3-agent proof of concept (6 test cases):

- **Agent calls:** ~6 (Claude Sonnet)
- **Judge calls:** ~30 (Claude Haiku)
- **Estimated cost:** < $1 per run
24 changes: 24 additions & 0 deletions evals/package.json
@@ -0,0 +1,24 @@
{
"name": "agency-agents-evals",
"version": "0.1.0",
"private": true,
"description": "Evaluation harness for agency-agents specialist prompts",
"scripts": {
"eval": "promptfoo eval",
"eval:view": "promptfoo view",
"eval:cache-clear": "promptfoo cache clear",
"extract": "ts-node scripts/extract-metrics.ts",
"test": "vitest run",
"test:watch": "vitest"
},
"dependencies": {
"gray-matter": "^4.0.3",
"promptfoo": "^0.121.3"
},
"devDependencies": {
"@types/node": "^22.0.0",
"ts-node": "^10.9.0",
"typescript": "^5.7.0",
"vitest": "^3.0.0"
}
}