Add vally eval suite (copilot-sdk executor, real agent evals)#1115
Add vally eval suite (copilot-sdk executor, real agent evals)#1115jongio wants to merge 2 commits into
Conversation
wbreza
left a comment
There was a problem hiding this comment.
Code Review — PR #1115
TL;DR: Adds a Vally evaluation suite with 4 evals and 19 stimuli (routing-accuracy, skill-matching, model-selection, config-validation), CI integration, and documentation. Well-structured eval infrastructure, but 4 issues need attention before leaving draft.
🔴 P0 — Blocker
docs-quality CI check is failing
Hard line breaks in evals/README.md, docs/vally-integration.md, and README.md split words across lines (e.g., "valid\ntion"). This is the only failing required CI check.
Fix: Reflow broken lines to continuous text; let the markdown renderer handle wrapping.
🟡 P1 — Major (3 findings)
1. PR title/description materially misleading
Title claims "181 stimuli across 4 evals" but the code delivers 19 stimuli (6+4+4+5). Description references TypeScript grader files, esbuild bundling, and npm run eval:build — none of which exist in the diff. The internal docs (vally-integration.md) correctly say "19 stimuli," confirming metadata wasn't updated after refactor.
Fix: Update title to "19 stimuli," remove stale references to TS graders, esbuild, and eval:build.
2. CI continue-on-error: true masks eval failures
The evals job uses continue-on-error: true and a || echo "Vally CLI not available" fallback. Since @microsoft/vally-cli is a published package, the evals should be running and enforcing results. Additionally, only the last eval's exit code propagates — failures in evals 1-3 are silently swallowed.
Fix: Remove continue-on-error: true and add proper error aggregation across all 4 eval runs (capture each exit code, fail if any are non-zero).
3. Unpinned @microsoft/vally-cli@latest in CI
npm install -g @microsoft/vally-cli@latest installs whatever version is currently published. This is a supply chain risk — pin to a specific version and upgrade deliberately.
Fix: Replace @latest with a pinned version (e.g., @microsoft/vally-cli@1.x.y).
🔵 P2 — Minor (follow-up suggestions)
- No tests for eval infrastructure: Consider adding
vally lintstep in CI to validate eval YAML schemas and regex patterns. - Redundant skill list in skill-matching stimuli: Same 6-skill list hardcoded in all 4 prompts — consider moving to a fixture file.
⚪ P3 — Nits (non-blocking)
- DRY violations in eval YAML (19× repeated grader structure) — manageable at current scale
- Sequential eval execution — 42s total today; optimize when scale demands it
.vally.yamlresults path inconsistency (results/vsvally-results/in docs)
Overall: Clean architecture, good separation of concerns, well-documented eval strategy. The 4 blocking items are straightforward fixes. Looking forward to seeing this land! 🚀
wbreza
left a comment
There was a problem hiding this comment.
Supplementary Review — head 2a7511
Follow-up to my prior CHANGES_REQUESTED review on c1b7576. Posting as a comment (non-blocking) since the PR is still in draft — flagging two new P0s and confirming which prior items remain open.
TL;DR
The 181-stimuli claim is now accurate in YAML (60 + 30 + 36 + 55 = 181) and graders genuinely import real SDK functions. However, the evals never actually executed in CI on this branch — the �vals job logs Error: No executor registered for mock and exits 1 on the first �ally eval call, swallowed by continue-on-error: true. The two CI fixes acknowledged earlier (continue-on-error removal and pinning @microsoft/vally-cli) are still absent, and docs/vally-integration.md is wholly stale.
Prior Finding Status
| Prior finding | Status |
|---|---|
| P0 — docs-quality CI hard line breaks | ✅ Resolved (via cspell.json) |
| P1.1 — "181 stimuli" inaccurate vs 19 in code | ✅ Resolved in code (but docs disagree — see F4) |
| P1.2 — continue-on-error: true masks failures | ❌ Not resolved |
| P1.3 — @microsoft/vally-cli@latest unpinned | ❌ Not resolved |
| P3 — �ally-results/ vs | |
| esults/ path | ✅ Resolved |
New Findings
🔴 P0 — Blockers
F1. Evals are silently failing in CI; "all checks PASS" is misleading.
CI run 26054894241, job �vals:
Error in evals/routing-accuracy/eval.yaml:
No executor registered for mock. Available executors: copilot-sdk
##[error]Process completed with exit code 1.
Only continue-on-error: true keeps the job green. Zero stimuli executed by CI on this PR. The 100%/100%/85.6%/100% scores must come from a local environment with a different vally build.
Fix: drop �xecutor: mock (or use the supported executor name in �ally-cli@0.4.0), remove continue-on-error, re-run.
F2. �val:build is never invoked in CI, and .mjs graders are gitignored.
package.json:25 defines �val:build (esbuild → .mjs); eval YAMLs invoke graders by .mjs path. .gitignore:38 ignores �vals/**/graders/.mjs. The workflow never runs
pm run eval:build. Even if F1 is fixed, vally will fail to spawn the graders.
Fix: add Build eval graders: npm run eval:build step before "Run evals", and add �sbuild to devDependencies.
🟡 P1 — Major
F3. docs/vally-integration.md documents the abandoned design.
Added by this PR but describes the old LLM-judge architecture this PR replaces — prompt/pairwise graders, rubrics, "LLM API access needed", "Total: 19 stimuli", per-eval counts of 6/4/4/5. Line 169 even says "Vally evals test whether an agent can correctly apply Squad's concepts" — the inverse of what the PR implements.
Fix: rewrite to match the program-grader design, or delete (evals/README.md already covers it).
F4. Stimulus count disagreement between docs.
�vals/README.md: "181 stimuli". docs/vally-integration.md: "19 stimuli". Same PR. Pick one.
F5. PR description references stale build pipeline.
PR body still shows
pm run eval:build # bundle graders with esbuild. The 181 number is correct but �val:build is not what CI runs (F2). Either commit to esbuild bundling and wire it into CI, or drop �val:build and use sx/native node TS loaders.
F6. �sbuild is not in devDependencies.
package.json:25 calls
px esbuild ... with no entry in devDependencies.
px downloads a fresh esbuild each invocation — unpinned, slow, supply-chain risk.
Fix:
pm i -D esbuild@.
🔵 P2 — Minor
F7. Exit-code aggregation still missing across 4 eval runs.
Four bare �ally eval lines in one
un: block. Without set -e, only the last command's exit code propagates. Use set -e or:
fail=0
for spec in routing-accuracy skill-matching model-selection config-validation; do
vally eval -e "evals//eval.yaml" || fail=1
done
exit F8. check-skill-matching.ts silently swallows skill load errors (lines 28-51, catch {}). Malformed SKILL.md is invisibly dropped from the registry; shows up later as a confusing ranking failure. Log to stderr.
F9. check-model.ts strict triple-equality on ier/source (lines 27-29). If a stimulus omits these fields, the comparison may silently pass or silently fail depending on SDK shape. Validate which fields are supplied and only compare those.
F10. NDCG_THRESHOLD parseFloat with no validation (check-skill-matching.ts:18). Non-numeric env → NaN → every stimulus silently fails. Add if (Number.isNaN(threshold)) throw ….
⚪ P3 — Nits
- F11. Per-stimulus re-parse of ixtures/routing-rules.md in check-routing.ts (60 spawns). Fine today.
- F12. �xpectedErrorPattern substring match is case-insensitive (check-config.ts:55). Risky if matching SDK error codes (e.g. E_AGENT_MISSING).
Strengths
- Graders genuinely import real SDK exports — no mocking masquerading as coverage.
- NDCG@5 implementation in check-skill-matching.ts is mathematically correct (relevance-rank weighting + log2-discount).
- The 181-stimuli claim now matches reality (60 + 30 + 36 + 55).
- Stimuli cover both happy-path and negative cases (config-validation has explicit �xpectedValid: false branches with error-pattern checks).
Bottom line: Architecture and grader design are solid. F1 + F2 are the hard blockers — the CI signal is currently a false-positive. Once those are fixed, the still-open prior items (F7, P1.2, P1.3) become trivial cleanup. 🚀
- F1/F2: Replace vally-cli CI dependency with scripts/run-evals.mjs runner that executes graders directly; removes continue-on-error masking - F3/F4: Delete stale docs/vally-integration.md (evals/README.md covers it) - F6: Add esbuild to devDependencies (pinned, no more npx downloads) - F8: Log stderr on skill load failures instead of silent catch - F9: Only compare model fields present in expected (skip undefined) - F10: Throw on NaN NDCG_THRESHOLD instead of silent failure - F7: Runner handles exit-code aggregation via threshold-based pass/fail Addresses review feedback from bradygaster#1115. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Regenerated package-lock.json from clean main base to eliminate nested @bradygaster/squad-sdk registry resolution - Fixed broken link in README.md (removed reference to deleted docs/vally-integration.md) - Reset root version to 0.9.4 (undo local prebuild bump) Closes bradygaster#1115 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace mock-based unit-test evals with proper end-to-end agent evals using @microsoft/vally-cli. Evals send real user prompts to a Copilot agent configured with Squad skills/fixtures, then grade the agent's actual responses using built-in vally graders. Three eval suites: - routing: verifies agent routes to correct squad member - skill-invocation: verifies agent applies right skills for tasks - task-completion: verifies agent produces working solutions All eval YAMLs pass vally lint validation. Uses copilot-sdk executor (default) with output-matches and prompt (LLM-as-judge) graders. CI workflow validates schemas on PR, full eval runs gated behind 'run-evals' label or workflow_dispatch. Closes bradygaster#1115 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
14d6b1e to
9e80b07
Compare
Replace mock-based unit-test evals with proper end-to-end agent evals using @microsoft/vally-cli. Evals send real user prompts to a Copilot agent configured with Squad skills/fixtures, then grade the agent's actual responses using built-in vally graders. Three eval suites: - routing: verifies agent routes to correct squad member - skill-invocation: verifies agent applies right skills for tasks - task-completion: verifies agent produces working solutions All eval YAMLs pass vally lint validation. Uses copilot-sdk executor (default) with output-matches and prompt (LLM-as-judge) graders. CI workflow validates schemas on PR, full eval runs gated behind 'run-evals' label or workflow_dispatch. Closes bradygaster#1115 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
9e80b07 to
6905868
Compare
wbreza
left a comment
There was a problem hiding this comment.
Supplementary Review — head 6905868
Follow-up to my prior reviews on
fc1b757(CHANGES_REQUESTED) andf2a7511(COMMENTED). Posting as a comment since the PR is still in draft.
TL;DR
This is a complete fresh start and the architectural rewrite resolves every prior P0/P1 from earlier reviews: continue-on-error masking, @latest unpinning, esbuild + gitignored .mjs graders, mock executor, stimulus-count mismatch, and the hard-line-break docs issue are all gone. Now: 3 capability evals (6 + 5 + 4 = 15 stimuli) using @microsoft/vally-cli@^0.4.0 with the real copilot-sdk executor, gated CI, and JUnit artifacts. All 11 required checks pass on this head; Run eval suites is correctly skipped without the run-evals label.
Prior Finding Status
| Prior finding | Status |
|---|---|
| P0 — docs-quality hard line breaks | ✅ Obsolete (rewrite) |
| P1.1 — stimulus count mismatch | ✅ Resolved (15 in body matches 15 in YAMLs) |
P1.2 — continue-on-error: true masks failures |
✅ Resolved (flag removed) |
P1.3 — @microsoft/vally-cli@latest unpinned |
✅ Resolved (^0.4.0 devDep) |
| P0 (prior) — evals never actually executed in CI | ✅ Resolved (real executor; validate job runs vally lint on every PR) |
P0 (prior) — eval:build + gitignored .mjs graders |
✅ Obsolete (no custom graders) |
F3/F4 (prior) — docs/vally-integration.md stale + count disagreement |
✅ Obsolete (file removed) |
| F7 (prior) — exit-code aggregation | ✅ Resolved (single vally eval call) |
| F8–F10 (prior) — TS grader robustness | ✅ Obsolete (no TS graders) |
New Findings
🟡 P1 — Major
F1. evals/README.md claims "Reports scores as PR comments" — workflow does not.
The README's CI section advertises "Reports scores as PR comments", and .github/workflows/evals.yml grants pull-requests: write. But no workflow step actually posts a comment — only actions/upload-artifact@v4. Either add a comment step (and use the permission) or drop the claim from the README and reduce permissions to contents: read. As written, reviewers who add the run-evals label will see no inline scores anywhere — they'll have to download the artifact.
🔵 P2 — Minor
F2. output-matches graders are positive-only and would pass false-negatives.
e.g. pattern: (developer|Developer) in evals/routing/eval.yaml would match a response saying "This is not a Developer task — route to Tester". The companion prompt grader carries the real weight, but the deterministic check is misleading and inflates pass rates. Consider stricter framing (route to (the )?Developer) or pair with output-not-matches for the wrong-agent names.
F3. evals/skill-invocation/ fixtures omit agent charters.
Other suites ship .squad/agents/<name>/charter.md; this one only ships team.md, routing.md, and skills/. Skill matching may behave differently without agent context in a real Squad workspace. Either add minimal charters for parity, or document the intentional omission in evals/README.md.
F4. Bare prompt graders have no name: — score attribution will collide.
Other Vally evals (e.g. in microsoft/evaluate) give prompt graders a name: so JUnit/results output is attributable per stimulus. Optional polish; helpful for triaging failures.
⚪ P3 — Nits
evals/skill-invocation/eval.yaml(API design stimulus): regex(/teams|/v[0-9]|plural|resource naming|pagination)will match a response that merely mentions the word "pagination" with no actual endpoint design. The prompt rubric carries the real signal.evals/task-completion/eval.yaml(cost-aware model selection): rubric line "matches Squad's existing tier model (direct/lightweight/standard/full)" contradicts the prompt's stated tiers (fast / standard / premium). Pick one canonical tier vocabulary.evals/routing/eval.yaml"Handles ambiguous request with reasoning" has only apromptgrader — that's a feature (an ambiguity test shouldn't have a deterministic answer), but the suite is then 100% LLM-judge for that stimulus. Worth calling out inevals/README.mdso future readers don't try to add a deterministic check.
Strengths
- Real agent execution via the
copilot-sdkexecutor — no mocks, no synthetic checks. The pivot from the prior iteration is the right call. run-evalsjob is opt-in via label /workflow_dispatch— cost-aware design.validatejob runs cheapvally linton every PR, gating the expensive eval job vianeeds: validate. Clean pipeline shape.- Stimulus + top-level
rubric:paired with baretype: promptmatches the canonical Vally pattern (microsoft/evaluate,microsoft/wazaexamples). - Versions pinned
^0.4.0⇒>=0.4.0 <0.5.0on a 0.x SDK is appropriately tight. - All required CI checks green on this head.
Bottom line: Architecture and design are solid. F1 is the only meaningful pre-undraft cleanup; F2–F4 are quality polish; P3s are taste calls. Ready to leave draft after the README/permission alignment. 🚀
- Remove 'Reports scores as PR comments' claim from README (F1) - Drop unnecessary pull-requests: write permission from CI (F1) - Tighten output-matches patterns to require routing context (F2) - Add agent charters to skill-invocation fixtures (F3) - Add name: field to all prompt graders for JUnit attribution (F4) - Fix tier vocabulary inconsistency in task-completion rubric (P3) - Replace em dashes with hyphens in eval YAML (P3) Closes bradygaster#1112 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wbreza
left a comment
There was a problem hiding this comment.
Supplementary Review — head 55c2414
Follow-up to my prior reviews on
fc1b757(CHANGES_REQUESTED),f2a7511(COMMENTED), and6905868(COMMENTED). Posting as a comment since the PR is still in draft.
TL;DR
The fixup commit (55c2414) cleanly addresses every prior finding from the 6905868 review. F1 took the "drop the claim and reduce permissions" path (no PR-comment step added, pull-requests: write removed). F2–F4 are resolved with stricter regexes, charter fixtures, and named graders. P3 tier-vocabulary mismatch is fixed by removing the conflicting rubric line. All 11 required CI checks pass on this head; Run eval suites correctly skipped without the run-evals label; Validate eval schemas is green. Ready to leave draft.
Prior Finding Status
| Prior finding | Status |
|---|---|
| F1 (P1) — README claims "Reports scores as PR comments" but no step posts one | ✅ Resolved — README rewritten to describe what actually runs (validate on every PR, full suites behind label, JUnit artifacts); pull-requests: write permission removed from workflow |
F2 (P2) — output-matches regexes positive-only / false-negative risk |
✅ Resolved — patterns now require action verbs + context, e.g. (route.*[Dd]eveloper|assign.*[Dd]eveloper|[Dd]eveloper.*(agent|handle|fix)) across all routing stimuli |
F3 (P2) — evals/skill-invocation/ fixtures omit agent charters |
✅ Resolved — added fixtures/agents/developer/charter.md and fixtures/agents/lead/charter.md, wired into environment.files[] |
F4 (P2) — Bare prompt graders have no name: |
✅ Resolved — every prompt grader now named (routing-correctness, ambiguity-handling, typescript-skill-usage, api-design-skill-usage, testing-skill-usage, security-skill-usage, cross-cutting-skill-usage, bug-fix-quality, model-selection-quality, onboarding-quality, routing-table-quality) |
P3 — tier vocabulary contradiction (fast/standard/premium vs direct/lightweight/standard/full) |
✅ Resolved — conflicting rubric line removed from task-completion/eval.yaml |
Remaining Nits (non-blocking, fine to land)
evals/skill-invocation/eval.yamlAPI-design stimulus still uses the loose(/teams|/v[0-9]|plural|resource naming|pagination)regex called out previously — a response that just mentions "pagination" passes. The companionpromptgrader (name: api-design-skill-usage) carries the real signal, so leaving it is defensible.- Ambiguity stimulus in
evals/routing/eval.yamlstill has only thepromptgrader (nowname: ambiguity-handling). That's intentional and correct (no deterministic answer for an ambiguity test) — could be worth a sentence inevals/README.mdso future readers don't add a regex grader by reflex.
Neither is a blocker.
Observation (not a finding)
The fixup also normalized a few — / ⇒ Unicode chars to ASCII in rubrics and one prompt. No behavioral impact — graders use markdown literally and Vally doesn't tokenize on those characters.
Strengths
- The "drop the unused permission" path for F1 is the right call given how the workflow currently runs — least privilege, no half-built comment step.
- Charter additions in skill-invocation bring fixture parity with the other two suites without changing eval semantics.
- Named graders will pay off the first time a single stimulus fails — JUnit output will point at which grader (deterministic vs. rubric) dropped the score.
Bottom line: F1–F4 + the actionable P3 are closed. The two remaining nits are taste calls. ✅ Approve for leaving draft.
Fresh Start: Real Agent Evals with Vally
Previous iterations on this PR were mock-based unit tests that tested the model's ability to read markdown, not the actual agent behavior. This is a complete rewrite.
What this does
Adds a proper eval suite using @microsoft/vally-cli (v0.4.0) with the copilot-sdk executor to test Squad's agent behavior end-to-end. No mocks, no workarounds.
Eval Suites
|
outing | 6 | Agent routes user prompts to the correct squad member |
| skill-invocation | 5 | Agent identifies and invokes the right skill for a request |
| ask-completion | 4 | Agent produces complete, correct outputs for real tasks |
Each suite uses realistic .squad/ workspace fixtures (team.md, routing.md, agent charters, SKILL.md files) so vally's copilot-sdk executor sees the same context a real Squad agent would.
Grading
CI Integration
.github/workflows/evals.yml provides two jobs:
un-evals label or manual dispatch (requires GITHUB_TOKEN with Copilot access)
Results upload as artifacts in JUnit format.
Running locally
�ash npm run eval:lint # validate eval YAML schemas npm run eval # run all evals (requires Copilot-enabled GITHUB_TOKEN)Files
Closes #1112