6 changes: 6 additions & 0 deletions evals/.gitignore
@@ -0,0 +1,6 @@
node_modules/
dist/
.promptfoo/
results/latest.json
*.log
.env
88 changes: 88 additions & 0 deletions evals/README.md
@@ -0,0 +1,88 @@
# Agency-Agents Evaluation Harness

Automated quality evaluation for the agency-agents specialist prompt collection using [promptfoo](https://www.promptfoo.dev/).

## Quick Start

```bash
cd evals
npm install
export ANTHROPIC_API_KEY=your-key-here
npx promptfoo eval
```

## How It Works

The eval harness tests each specialist agent prompt by:

1. Loading the agent's markdown file as a system prompt
2. Sending it a representative task for its category
3. Using a separate LLM-as-judge to score the output on 5 criteria
4. Reporting pass/fail per agent
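
Concretely, a test entry in `promptfooconfig.yaml` could take roughly the shape below; the prompt template path, model ids, task text, and rubric wording are illustrative assumptions, not the project's actual config.

```yaml
# Illustrative sketch only — not the actual promptfooconfig.yaml.
prompts:
  - file://prompts/agent-chat.json   # hypothetical chat template: agent markdown as system message, {{task}} as user message
providers:
  - anthropic:messages:claude-3-5-sonnet-latest   # agent-under-test
tests:
  - description: "engineering: API design task"
    vars:
      task: "Design a REST API for a small inventory service."
    assert:
      - type: llm-rubric
        value: "The response delivers the requested design and follows the agent's stated output format."
```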

### Scoring Criteria

| Criterion | What It Measures |
|---|---|
| Task Completion | Did the agent produce the requested deliverable? |
| Instruction Adherence | Did it follow its own defined workflow and output format? |
| Identity Consistency | Did it stay in character per its personality and communication style? |
| Deliverable Quality | Is the output well-structured, actionable, and domain-appropriate? |
| Safety | No harmful, biased, or off-topic content |

Each criterion is scored **1-5**. An agent passes if its average score is **>= 3.5**.
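
For instance, per-criterion scores of 4, 4, 3, 4, and 3 average 3.6 and pass, while 4, 3, 3, 3, and 3 average 3.2 and fail.

The score anchors live in `rubrics/universal.yaml`. A rough sketch of one criterion entry, assuming hypothetical key names and anchor wording:

```yaml
# Illustrative only — the real rubrics/universal.yaml may use different keys and anchor text.
- criterion: task_completion
  question: "Did the agent produce the requested deliverable?"
  anchors:
    1: "Ignores the task or produces something unrelated."
    3: "Addresses the task but misses part of the requested deliverable."
    5: "Fully delivers what was asked, in the form requested."
```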

### Judge Model

The agent-under-test uses Claude Sonnet. The judge uses Claude Haiku (a different model to avoid self-preference bias).
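
In promptfoo terms, this split is usually expressed by overriding the grading provider. A minimal sketch, assuming illustrative model ids:

```yaml
# Illustrative — exact model ids may differ from the real config.
providers:
  - anthropic:messages:claude-3-5-sonnet-latest   # agent-under-test
defaultTest:
  options:
    provider: anthropic:messages:claude-3-5-haiku-latest   # judge used by llm-rubric assertions
```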

## Viewing Results

```bash
npx promptfoo view
```

Opens an interactive browser UI with detailed scores, outputs, and judge reasoning.

## Project Structure

```
evals/
  promptfooconfig.yaml   # Main config — providers, test suites, assertions
  rubrics/
    universal.yaml       # 5 universal criteria with score anchor descriptions
  tasks/
    engineering.yaml     # Test tasks for engineering agents
    design.yaml          # Test tasks for design agents
    academic.yaml        # Test tasks for academic agents
  scripts/
    extract-metrics.ts   # Parses agent markdown → structured metrics JSON
```

## Adding Test Cases

Create or edit a file in `tasks/` following this format:

```yaml
- id: unique-task-id
description: "Short description of what this tests"
prompt: |
The actual prompt/task to send to the agent.
Be specific about what you want the agent to produce.
```
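
A filled-in example, with a hypothetical id and task wording:

```yaml
- id: design-landing-page-critique
  description: "Tests whether a design agent can critique a landing page and propose fixes"
  prompt: |
    A fictional SaaS landing page converts at 0.8%.
    Identify the three most likely conversion blockers and propose one specific fix for each.
```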

## Extract Metrics Script

Parse agent files to see their structured success metrics:

```bash
npx ts-node scripts/extract-metrics.ts "../engineering/*.md"
```

## Cost

Each evaluation runs the agent model once per task and the judge model 5 times per task (once per criterion). For the current 3-agent proof of concept (6 test cases):

- **Agent calls:** ~6 (Claude Sonnet)
- **Judge calls:** ~30 (Claude Haiku)
- **Estimated cost:** < $1 per run
24 changes: 24 additions & 0 deletions evals/package.json
@@ -0,0 +1,24 @@
{
"name": "agency-agents-evals",
"version": "0.1.0",
"private": true,
"description": "Evaluation harness for agency-agents specialist prompts",
"scripts": {
"eval": "promptfoo eval",
"eval:view": "promptfoo view",
"eval:cache-clear": "promptfoo cache clear",
"extract": "ts-node scripts/extract-metrics.ts",
"test": "vitest run",
"test:watch": "vitest"
},
"dependencies": {
"gray-matter": "^4.0.3",
"promptfoo": "^0.121.3"
},
"devDependencies": {
"@types/node": "^22.0.0",
"ts-node": "^10.9.0",
"typescript": "^5.7.0",
"vitest": "^3.0.0"
}
}