Add vally eval suite (copilot-sdk executor, real agent evals) by jongio · Pull Request #1115 · bradygaster/squad

jongio · 2026-05-14T17:00:09Z

Fresh Start: Real Agent Evals with Vally

Previous iterations on this PR were mock-based unit tests that tested the model's ability to read markdown, not the actual agent behavior. This is a complete rewrite.

What this does

Adds a proper eval suite using @microsoft/vally-cli (v0.4.0) with the copilot-sdk executor to test Squad's agent behavior end-to-end. No mocks, no workarounds.

Eval Suites

Suite	Stimuli	What it tests

Each suite uses realistic .squad/ workspace fixtures (team.md, routing.md, agent charters, SKILL.md files) so vally's copilot-sdk executor sees the same context a real Squad agent would.

Grading

output-matches graders for deterministic routing checks
prompt graders with rubrics for open-ended quality assessment
Scoring threshold: 70% per suite

CI Integration

.github/workflows/evals.yml provides two jobs:

validate runs �ally lint on every PR to catch schema errors early
run-evals executes the full eval suite, gated behind the
un-evals label or manual dispatch (requires GITHUB_TOKEN with Copilot access)

Results upload as artifacts in JUnit format.

Running locally

�ash npm run eval:lint # validate eval YAML schemas npm run eval # run all evals (requires Copilot-enabled GITHUB_TOKEN)

Files

�vals/routing/eval.yaml + fixtures
�vals/skill-invocation/eval.yaml + fixtures
�vals/task-completion/eval.yaml + fixtures
�vals/README.md
.github/workflows/evals.yml
package.json (added eval scripts + vally devDeps)

Closes #1112

wbreza

Code Review — PR #1115

TL;DR: Adds a Vally evaluation suite with 4 evals and 19 stimuli (routing-accuracy, skill-matching, model-selection, config-validation), CI integration, and documentation. Well-structured eval infrastructure, but 4 issues need attention before leaving draft.

🔴 P0 — Blocker

docs-quality CI check is failing
Hard line breaks in evals/README.md, docs/vally-integration.md, and README.md split words across lines (e.g., "valid\ntion"). This is the only failing required CI check.
Fix: Reflow broken lines to continuous text; let the markdown renderer handle wrapping.

🟡 P1 — Major (3 findings)

1. PR title/description materially misleading
Title claims "181 stimuli across 4 evals" but the code delivers 19 stimuli (6+4+4+5). Description references TypeScript grader files, esbuild bundling, and npm run eval:build — none of which exist in the diff. The internal docs (vally-integration.md) correctly say "19 stimuli," confirming metadata wasn't updated after refactor.
Fix: Update title to "19 stimuli," remove stale references to TS graders, esbuild, and eval:build.

2. CI continue-on-error: true masks eval failures
The evals job uses continue-on-error: true and a || echo "Vally CLI not available" fallback. Since @microsoft/vally-cli is a published package, the evals should be running and enforcing results. Additionally, only the last eval's exit code propagates — failures in evals 1-3 are silently swallowed.
Fix: Remove continue-on-error: true and add proper error aggregation across all 4 eval runs (capture each exit code, fail if any are non-zero).

3. Unpinned @microsoft/vally-cli@latest in CI
npm install -g @microsoft/vally-cli@latest installs whatever version is currently published. This is a supply chain risk — pin to a specific version and upgrade deliberately.
Fix: Replace @latest with a pinned version (e.g., @microsoft/vally-cli@1.x.y).

🔵 P2 — Minor (follow-up suggestions)

No tests for eval infrastructure: Consider adding vally lint step in CI to validate eval YAML schemas and regex patterns.
Redundant skill list in skill-matching stimuli: Same 6-skill list hardcoded in all 4 prompts — consider moving to a fixture file.

⚪ P3 — Nits (non-blocking)

DRY violations in eval YAML (19× repeated grader structure) — manageable at current scale
Sequential eval execution — 42s total today; optimize when scale demands it
.vally.yaml results path inconsistency (results/ vs vally-results/ in docs)

Overall: Clean architecture, good separation of concerns, well-documented eval strategy. The 4 blocking items are straightforward fixes. Looking forward to seeing this land! 🚀

wbreza

Supplementary Review — head 2a7511

Follow-up to my prior CHANGES_REQUESTED review on c1b7576. Posting as a comment (non-blocking) since the PR is still in draft — flagging two new P0s and confirming which prior items remain open.

TL;DR

The 181-stimuli claim is now accurate in YAML (60 + 30 + 36 + 55 = 181) and graders genuinely import real SDK functions. However, the evals never actually executed in CI on this branch — the �vals job logs Error: No executor registered for mock and exits 1 on the first �ally eval call, swallowed by continue-on-error: true. The two CI fixes acknowledged earlier (continue-on-error removal and pinning @microsoft/vally-cli) are still absent, and docs/vally-integration.md is wholly stale.

Prior Finding Status

Prior finding	Status
P0 — docs-quality CI hard line breaks	✅ Resolved (via cspell.json)
P1.1 — "181 stimuli" inaccurate vs 19 in code	✅ Resolved in code (but docs disagree — see F4)
P1.2 — continue-on-error: true masks failures	❌ Not resolved
P1.3 — @microsoft/vally-cli@latest unpinned	❌ Not resolved
P3 — �ally-results/ vs
esults/ path	✅ Resolved

New Findings

🔴 P0 — Blockers

F1. Evals are silently failing in CI; "all checks PASS" is misleading.
CI run 26054894241, job �vals:

Error in evals/routing-accuracy/eval.yaml:
  No executor registered for mock. Available executors: copilot-sdk
##[error]Process completed with exit code 1.

Only continue-on-error: true keeps the job green. Zero stimuli executed by CI on this PR. The 100%/100%/85.6%/100% scores must come from a local environment with a different vally build.
Fix: drop �xecutor: mock (or use the supported executor name in �ally-cli@0.4.0), remove continue-on-error, re-run.

F2. �val:build is never invoked in CI, and .mjs graders are gitignored.
package.json:25 defines �val:build (esbuild → .mjs); eval YAMLs invoke graders by .mjs path. .gitignore:38 ignores �vals/**/graders/.mjs. The workflow never runs
pm run eval:build. Even if F1 is fixed, vally will fail to spawn the graders.
Fix: add Build eval graders: npm run eval:build step before "Run evals", and add �sbuild to devDependencies.

🟡 P1 — Major

F3. docs/vally-integration.md documents the abandoned design.
Added by this PR but describes the old LLM-judge architecture this PR replaces — prompt/pairwise graders, rubrics, "LLM API access needed", "Total: 19 stimuli", per-eval counts of 6/4/4/5. Line 169 even says "Vally evals test whether an agent can correctly apply Squad's concepts" — the inverse of what the PR implements.
Fix: rewrite to match the program-grader design, or delete (evals/README.md already covers it).

F4. Stimulus count disagreement between docs.
�vals/README.md: "181 stimuli". docs/vally-integration.md: "19 stimuli". Same PR. Pick one.

F5. PR description references stale build pipeline.
PR body still shows
pm run eval:build # bundle graders with esbuild. The 181 number is correct but �val:build is not what CI runs (F2). Either commit to esbuild bundling and wire it into CI, or drop �val:build and use sx/native node TS loaders.

F6. �sbuild is not in devDependencies.
package.json:25 calls
px esbuild ... with no entry in devDependencies.
px downloads a fresh esbuild each invocation — unpinned, slow, supply-chain risk.
Fix:
pm i -D esbuild@.

🔵 P2 — Minor

F7. Exit-code aggregation still missing across 4 eval runs.
Four bare �ally eval lines in one
un: block. Without set -e, only the last command's exit code propagates. Use set -e or:

fail=0
for spec in routing-accuracy skill-matching model-selection config-validation; do
  vally eval -e "evals//eval.yaml" || fail=1
done
exit

F8. check-skill-matching.ts silently swallows skill load errors (lines 28-51, catch {}). Malformed SKILL.md is invisibly dropped from the registry; shows up later as a confusing ranking failure. Log to stderr.

F9. check-model.ts strict triple-equality on ier/source (lines 27-29). If a stimulus omits these fields, the comparison may silently pass or silently fail depending on SDK shape. Validate which fields are supplied and only compare those.

F10. NDCG_THRESHOLD parseFloat with no validation (check-skill-matching.ts:18). Non-numeric env → NaN → every stimulus silently fails. Add if (Number.isNaN(threshold)) throw ….

⚪ P3 — Nits

F11. Per-stimulus re-parse of ixtures/routing-rules.md in check-routing.ts (60 spawns). Fine today.
F12. �xpectedErrorPattern substring match is case-insensitive (check-config.ts:55). Risky if matching SDK error codes (e.g. E_AGENT_MISSING).

Strengths

Graders genuinely import real SDK exports — no mocking masquerading as coverage.
NDCG@5 implementation in check-skill-matching.ts is mathematically correct (relevance-rank weighting + log2-discount).
The 181-stimuli claim now matches reality (60 + 30 + 36 + 55).
Stimuli cover both happy-path and negative cases (config-validation has explicit �xpectedValid: false branches with error-pattern checks).

Bottom line: Architecture and grader design are solid. F1 + F2 are the hard blockers — the CI signal is currently a false-positive. Once those are fixed, the still-open prior items (F7, P1.2, P1.3) become trivial cleanup. 🚀

- F1/F2: Replace vally-cli CI dependency with scripts/run-evals.mjs runner that executes graders directly; removes continue-on-error masking - F3/F4: Delete stale docs/vally-integration.md (evals/README.md covers it) - F6: Add esbuild to devDependencies (pinned, no more npx downloads) - F8: Log stderr on skill load failures instead of silent catch - F9: Only compare model fields present in expected (skip undefined) - F10: Throw on NaN NDCG_THRESHOLD instead of silent failure - F7: Runner handles exit-code aggregation via threshold-based pass/fail Addresses review feedback from bradygaster#1115. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Regenerated package-lock.json from clean main base to eliminate nested @bradygaster/squad-sdk registry resolution - Fixed broken link in README.md (removed reference to deleted docs/vally-integration.md) - Reset root version to 0.9.4 (undo local prebuild bump) Closes bradygaster#1115 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace mock-based unit-test evals with proper end-to-end agent evals using @microsoft/vally-cli. Evals send real user prompts to a Copilot agent configured with Squad skills/fixtures, then grade the agent's actual responses using built-in vally graders. Three eval suites: - routing: verifies agent routes to correct squad member - skill-invocation: verifies agent applies right skills for tasks - task-completion: verifies agent produces working solutions All eval YAMLs pass vally lint validation. Uses copilot-sdk executor (default) with output-matches and prompt (LLM-as-judge) graders. CI workflow validates schemas on PR, full eval runs gated behind 'run-evals' label or workflow_dispatch. Closes bradygaster#1115 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

wbreza

Supplementary Review — head `6905868`

Follow-up to my prior reviews on fc1b757 (CHANGES_REQUESTED) and f2a7511 (COMMENTED). Posting as a comment since the PR is still in draft.

TL;DR

This is a complete fresh start and the architectural rewrite resolves every prior P0/P1 from earlier reviews: continue-on-error masking, @latest unpinning, esbuild + gitignored .mjs graders, mock executor, stimulus-count mismatch, and the hard-line-break docs issue are all gone. Now: 3 capability evals (6 + 5 + 4 = 15 stimuli) using @microsoft/vally-cli@^0.4.0 with the real copilot-sdk executor, gated CI, and JUnit artifacts. All 11 required checks pass on this head; Run eval suites is correctly skipped without the run-evals label.

Prior Finding Status

Prior finding	Status
P0 — docs-quality hard line breaks	✅ Obsolete (rewrite)
P1.1 — stimulus count mismatch	✅ Resolved (15 in body matches 15 in YAMLs)
P1.2 — `continue-on-error: true` masks failures	✅ Resolved (flag removed)
P1.3 — `@microsoft/vally-cli@latest` unpinned	✅ Resolved (`^0.4.0` devDep)
P0 (prior) — evals never actually executed in CI	✅ Resolved (real executor; `validate` job runs `vally lint` on every PR)
P0 (prior) — `eval:build` + gitignored `.mjs` graders	✅ Obsolete (no custom graders)
F3/F4 (prior) — `docs/vally-integration.md` stale + count disagreement	✅ Obsolete (file removed)
F7 (prior) — exit-code aggregation	✅ Resolved (single `vally eval` call)
F8–F10 (prior) — TS grader robustness	✅ Obsolete (no TS graders)

New Findings

🟡 P1 — Major

F1. evals/README.md claims "Reports scores as PR comments" — workflow does not.
The README's CI section advertises "Reports scores as PR comments", and .github/workflows/evals.yml grants pull-requests: write. But no workflow step actually posts a comment — only actions/upload-artifact@v4. Either add a comment step (and use the permission) or drop the claim from the README and reduce permissions to contents: read. As written, reviewers who add the run-evals label will see no inline scores anywhere — they'll have to download the artifact.

🔵 P2 — Minor

F2. output-matches graders are positive-only and would pass false-negatives.
e.g. pattern: (developer|Developer) in evals/routing/eval.yaml would match a response saying "This is not a Developer task — route to Tester". The companion prompt grader carries the real weight, but the deterministic check is misleading and inflates pass rates. Consider stricter framing (route to (the )?Developer) or pair with output-not-matches for the wrong-agent names.

F3. evals/skill-invocation/ fixtures omit agent charters.
Other suites ship .squad/agents/<name>/charter.md; this one only ships team.md, routing.md, and skills/. Skill matching may behave differently without agent context in a real Squad workspace. Either add minimal charters for parity, or document the intentional omission in evals/README.md.

F4. Bare prompt graders have no name: — score attribution will collide.
Other Vally evals (e.g. in microsoft/evaluate) give prompt graders a name: so JUnit/results output is attributable per stimulus. Optional polish; helpful for triaging failures.

⚪ P3 — Nits

evals/skill-invocation/eval.yaml (API design stimulus): regex (/teams|/v[0-9]|plural|resource naming|pagination) will match a response that merely mentions the word "pagination" with no actual endpoint design. The prompt rubric carries the real signal.
evals/task-completion/eval.yaml (cost-aware model selection): rubric line "matches Squad's existing tier model (direct/lightweight/standard/full)" contradicts the prompt's stated tiers (fast / standard / premium). Pick one canonical tier vocabulary.
evals/routing/eval.yaml "Handles ambiguous request with reasoning" has only a prompt grader — that's a feature (an ambiguity test shouldn't have a deterministic answer), but the suite is then 100% LLM-judge for that stimulus. Worth calling out in evals/README.md so future readers don't try to add a deterministic check.

Strengths

Real agent execution via the copilot-sdk executor — no mocks, no synthetic checks. The pivot from the prior iteration is the right call.
run-evals job is opt-in via label / workflow_dispatch — cost-aware design.
validate job runs cheap vally lint on every PR, gating the expensive eval job via needs: validate. Clean pipeline shape.
Stimulus + top-level rubric: paired with bare type: prompt matches the canonical Vally pattern (microsoft/evaluate, microsoft/waza examples).
Versions pinned ^0.4.0 ⇒ >=0.4.0 <0.5.0 on a 0.x SDK is appropriately tight.
All required CI checks green on this head.

Bottom line: Architecture and design are solid. F1 is the only meaningful pre-undraft cleanup; F2–F4 are quality polish; P3s are taste calls. Ready to leave draft after the README/permission alignment. 🚀

- Remove 'Reports scores as PR comments' claim from README (F1) - Drop unnecessary pull-requests: write permission from CI (F1) - Tighten output-matches patterns to require routing context (F2) - Add agent charters to skill-invocation fixtures (F3) - Add name: field to all prompt graders for JUnit attribution (F4) - Fix tier vocabulary inconsistency in task-completion rubric (P3) - Replace em dashes with hyphens in eval YAML (P3) Closes bradygaster#1112 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

wbreza

Supplementary Review — head `55c2414`

Follow-up to my prior reviews on fc1b757 (CHANGES_REQUESTED), f2a7511 (COMMENTED), and 6905868 (COMMENTED). Posting as a comment since the PR is still in draft.

TL;DR

The fixup commit (55c2414) cleanly addresses every prior finding from the 6905868 review. F1 took the "drop the claim and reduce permissions" path (no PR-comment step added, pull-requests: write removed). F2–F4 are resolved with stricter regexes, charter fixtures, and named graders. P3 tier-vocabulary mismatch is fixed by removing the conflicting rubric line. All 11 required CI checks pass on this head; Run eval suites correctly skipped without the run-evals label; Validate eval schemas is green. Ready to leave draft.

Prior Finding Status

Prior finding	Status
F1 (P1) — README claims "Reports scores as PR comments" but no step posts one	✅ Resolved — README rewritten to describe what actually runs (validate on every PR, full suites behind label, JUnit artifacts); `pull-requests: write` permission removed from workflow
F2 (P2) — `output-matches` regexes positive-only / false-negative risk	✅ Resolved — patterns now require action verbs + context, e.g. `(route.[Dd]eveloper\|assign.[Dd]eveloper\|[Dd]eveloper.*(agent\|handle\|fix))` across all routing stimuli
F3 (P2) — `evals/skill-invocation/` fixtures omit agent charters	✅ Resolved — added `fixtures/agents/developer/charter.md` and `fixtures/agents/lead/charter.md`, wired into `environment.files[]`
F4 (P2) — Bare `prompt` graders have no `name:`	✅ Resolved — every prompt grader now named (`routing-correctness`, `ambiguity-handling`, `typescript-skill-usage`, `api-design-skill-usage`, `testing-skill-usage`, `security-skill-usage`, `cross-cutting-skill-usage`, `bug-fix-quality`, `model-selection-quality`, `onboarding-quality`, `routing-table-quality`)
P3 — tier vocabulary contradiction (`fast/standard/premium` vs `direct/lightweight/standard/full`)	✅ Resolved — conflicting rubric line removed from `task-completion/eval.yaml`

Remaining Nits (non-blocking, fine to land)

evals/skill-invocation/eval.yaml API-design stimulus still uses the loose (/teams|/v[0-9]|plural|resource naming|pagination) regex called out previously — a response that just mentions "pagination" passes. The companion prompt grader (name: api-design-skill-usage) carries the real signal, so leaving it is defensible.
Ambiguity stimulus in evals/routing/eval.yaml still has only the prompt grader (now name: ambiguity-handling). That's intentional and correct (no deterministic answer for an ambiguity test) — could be worth a sentence in evals/README.md so future readers don't add a regex grader by reflex.

Neither is a blocker.

Observation (not a finding)

The fixup also normalized a few — / ⇒ Unicode chars to ASCII in rubrics and one prompt. No behavioral impact — graders use markdown literally and Vally doesn't tokenize on those characters.

Strengths

The "drop the unused permission" path for F1 is the right call given how the workflow currently runs — least privilege, no half-built comment step.
Charter additions in skill-invocation bring fixture parity with the other two suites without changing eval semantics.
Named graders will pay off the first time a single stimulus fails — JUnit output will point at which grader (deterministic vs. rubric) dropped the score.

Bottom line: F1–F4 + the actionable P3 are closed. The two remaining nits are taste calls. ✅ Approve for leaving draft.

jongio marked this pull request as draft May 14, 2026 17:16

wbreza suggested changes May 15, 2026

View reviewed changes

jongio changed the title ~~feat: add Vally evaluation suite (181 stimuli across 4 evals)~~ feat: add Vally evaluation suite (19 stimuli across 4 evals) May 15, 2026

jongio changed the title ~~feat: add Vally evaluation suite (19 stimuli across 4 evals)~~ feat(evals): Vally eval suite testing real SDK behavior May 18, 2026

wbreza reviewed May 18, 2026

View reviewed changes

jongio requested a review from wbreza May 18, 2026 19:58

jongio force-pushed the squad/vally-integration-clean branch from 14d6b1e to 9e80b07 Compare May 20, 2026 22:07

jongio changed the title ~~feat(evals): Vally eval suite testing real SDK behavior~~ feat(evals): add vally eval suite with copilot-sdk executor May 20, 2026

jongio force-pushed the squad/vally-integration-clean branch from 9e80b07 to 6905868 Compare May 20, 2026 22:35

jongio changed the title ~~feat(evals): add vally eval suite with copilot-sdk executor~~ Add vally eval suite (copilot-sdk executor, real agent evals) May 21, 2026

wbreza reviewed May 21, 2026

View reviewed changes

jongio marked this pull request as ready for review May 21, 2026 01:37

wbreza reviewed May 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add vally eval suite (copilot-sdk executor, real agent evals)#1115

Add vally eval suite (copilot-sdk executor, real agent evals)#1115
jongio wants to merge 2 commits into
bradygaster:mainfrom
jongio:squad/vally-integration-clean

jongio commented May 14, 2026 •

edited

Loading

Uh oh!

wbreza left a comment

Uh oh!

wbreza left a comment

Uh oh!

wbreza left a comment

Uh oh!

wbreza left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jongio commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fresh Start: Real Agent Evals with Vally

What this does

Eval Suites

Grading

CI Integration

Running locally

Files

Closes #1112

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

Code Review — PR #1115

🔴 P0 — Blocker

🟡 P1 — Major (3 findings)

🔵 P2 — Minor (follow-up suggestions)

⚪ P3 — Nits (non-blocking)

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

Supplementary Review — head 2a7511

TL;DR

Prior Finding Status

New Findings

🔴 P0 — Blockers

🟡 P1 — Major

🔵 P2 — Minor

⚪ P3 — Nits

Strengths

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

Supplementary Review — head 6905868

TL;DR

Prior Finding Status

New Findings

🟡 P1 — Major

🔵 P2 — Minor

⚪ P3 — Nits

Strengths

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

Supplementary Review — head 55c2414

TL;DR

Prior Finding Status

Remaining Nits (non-blocking, fine to land)

Observation (not a finding)

Strengths

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jongio commented May 14, 2026 •

edited

Loading

Supplementary Review — head `6905868`

Supplementary Review — head `55c2414`