Skip to content

Add promptfoo eval harness for skills + consolidate MCP endpoint#65

Merged
ari-launchdarkly merged 16 commits into
mainfrom
dsanchez/eval-harness
May 21, 2026
Merged

Add promptfoo eval harness for skills + consolidate MCP endpoint#65
ari-launchdarkly merged 16 commits into
mainfrom
dsanchez/eval-harness

Conversation

@dakotasanchez
Copy link
Copy Markdown
Contributor

@dakotasanchez dakotasanchez commented May 19, 2026

Summary

  • Adds a promptfoo-based eval harness under evals/ that runs each skill end-to-end through the Claude Agent SDK with mocked LaunchDarkly MCP tools, asserting on both tool-call trajectory and response quality
  • Adds a CI workflow (.github/workflows/eval-skills.yml) that runs evals on PRs and nightly
  • Fixes several bugs found during review of the harness implementation
  • Consolidates .mcp.json and migration docs to the unified mcp/launchdarkly endpoint

Eval harness (evals/)

Provider (providers/claude-skill-agent-sdk.js): loads the skill the same way a real Claude Code session does (off-disk via .claude/skills/<slug>/), exposes an in-process mock MCP server, and records a full tool-call trajectory for assertions.

Suites (one promptfooconfig.yaml per skill, 3–5 test cases each):

  • aiconfig-create — 4/4 passing (100%)
  • aiconfig-update — 4/5 passing (80%)
  • aiconfig-tools — 3/4 passing (75%)
  • aiconfig-variations — 4/5 passing (80%)
  • launchdarkly-flag-create — 2/3 passing (67%, pre-existing mock conflict with seed data)

Scripts:

  • aggregate.js — runs all suites, writes eval-scores.json at repo root
  • diff-changed-skills.js — git-diff-gated re-run (only re-evaluates suites whose source changed)
  • render-badges.js — rewrites per-skill README eval score badges
  • run-models.js — cross-model matrix runner (haiku / sonnet / opus)

Unit tests (tests/*.test.js, 66 tests, no API calls needed):

  • transform.test.js, output-valid.test.js, assertions.test.js, jsonschema-to-zod.test.js, mock.test.js
  • Run via npm test from evals/

CI workflow (.github/workflows/eval-skills.yml)

  • Updated default AGENT_MODEL from claude-sonnet-4-20250514claude-sonnet-4-6
  • Added a unit-test job that runs npm test (no API key needed) before the LLM eval jobs

Job flow:

  1. unit-test — runs 66 unit tests on every trigger, no API calls
  2. diff — computes which suites changed since their last recorded score
  3. evaluate — runs changed suites in parallel (max 3), gated on unit-test passing
  4. aggregate — merges results into eval-scores.json, posts PR score-diff comment, commits refreshed scores/badges on schedule

Triggers: pull_request (on skills/**, evals/**, workflow file), schedule (09:17 UTC daily), workflow_dispatch (with run_all override)

Bug fixes (found during review against PR #54 implementation)

  • run-models.js: .filter(Array.isArray([]) || true) evaluated to .filter(true) — a TypeError crash on any non-empty results array; removed the no-op filter
  • run-models.js: runSuiteWithModel only checked result.error (OS spawn errors), not result.status !== 0; promptfoo eval failures were silently treated as success
  • run-models.js: output path was relative instead of absolute, inconsistent with aggregate.js
  • _mock.js: update-ai-config-variation spread all of input into the variation, polluting state with configKey, variationKey, etc.; now only patches the allowed variation fields
  • _mock.js: list-ai-configs dropped agent-created configs whose key collided with a seed template key; state entries now always win
  • _mock.js: create-ai-config-variation for a nonexistent parent config silently orphaned the variation; now creates a minimal stub
  • previewMock: unconditionally appended "..." even for strings under 200 chars
  • aiconfig-create Test 3: dead explored variable (computed but never used in pass)
  • flag-create Test 2: pass: false, score: 0.5 was internally inconsistent; changed to pass: true, score: 0.5
  • tool-responses.json: list-feature-flags returned empty while list-flags had seed data; aligned them

MCP endpoint consolidation

  • .mcp.json: replaced separate mcp/fm + mcp/aiconfigs entries with single mcp/launchdarkly entry
  • mcp-config-templates.md: updated migration section to deprecate both mcp/fm and mcp/aiconfigs (previously only mcp/aiconfigs was marked deprecated; mcp/fm was incorrectly described as a valid mirror)

Assorted

  • evals/.env.example: added links to credential consoles (console.anthropic.com, platform.openai.com/api-keys)
  • Removed undocumented EVAL_MODEL env var fallback from the provider (only AGENT_MODEL is supported)
  • Added unit tests for all utility modules

Test plan

  • cd evals && npm test — 66 unit tests pass, no API calls needed
  • npm run eval:aiconfig-create:single — smoke test (~30s, requires ANTHROPIC_API_KEY)
  • npm run eval:all — full suite run, check eval-scores.json matches expected scores above

@dakotasanchez dakotasanchez requested a review from a team as a code owner May 19, 2026 23:22
Copy link
Copy Markdown
Contributor

@ari-launchdarkly ari-launchdarkly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot. I'm gonna ask Devin to take a pass

Comment thread evals/.env.example Outdated
Comment thread evals/shared/output-valid.js Outdated
@devin-ai-integration
Copy link
Copy Markdown
Contributor

Review Summary

Thorough and well-structured PR. The eval harness design is solid — running skills through the Claude Agent SDK with an in-process mock MCP server is the right call for realism. The bug fixes are all clearly motivated and the unit test coverage (66 tests, all passing) is strong. A few observations below.

Architecture & Design

  • The provider (claude-skill-agent-sdk.js) is well-documented with a clear header block, and the isolation strategy (per-invocation temp dirs with symlinked skills + empty .mcp.json stub) is clean. Good call using fs.mkdtempSync for concurrent test safety.
  • The stateful mock overlay in _mock.js is a nice approach — write-tools record into state, read-tools reflect it back. The SEED_CONFIGS data gives the agent believable context without API calls.
  • _manifest.js as a single source of truth for suite ↔ skill mappings is a good pattern for keeping the scripts, CI, and badge rendering in sync.

Bug Fixes

The fixes called out in the description all look correct:

  • The .filter(Array.isArray([]) || true).filter(true) crash in run-models.js was a real bug (good catch).
  • update-ai-config-variation now correctly allowlists VARIATION_FIELDS instead of spreading the entire input — the test at mock.test.js:101-128 validates this well.
  • The previewMock truncation fix is straightforward and correct.

Items Worth Considering

  1. list-feature-flags vs list-flags alignment — The description mentions aligning these in tool-responses.json, but definitions.json only defines list-flags (no list-feature-flags). If any skill or agent happens to call list-feature-flags, it will hit the { error: "No mock configured for tool: ..." } fallback. Worth confirming no skills reference that tool name, or adding a list-feature-flags entry to definitions.json that mirrors list-flags.

  2. CI workflow — contents: write on PRs — The workflow sets permissions: contents: write globally but only needs write access for the schedule/dispatch commit step. Since the commit step is already gated on event_name, this is safe in practice, but tightening to contents: read at the workflow level and overriding to write only in the aggregate job (on schedule/dispatch) would follow least-privilege more closely.

  3. Root package.json — dangling test:llm-evals reference — The old test:llm-evals script pointed to --prefix tests. The diff replaces it with the new eval / eval:all / eval:view scripts, but if there was a tests/ directory that's now stale, it might be worth cleaning up or noting in the description.

  4. Temp directory cleanup on provider failure — The finally block at line 466 in claude-skill-agent-sdk.js does fs.rmSync(cwd, ...) wrapped in a try/catch-swallow. This is fine for normal runs. On early errors (e.g., SDK import failure before cwd is defined), cwd would be undefined, but the catch block handles it gracefully. No action needed — just calling it out.

  5. create-ai-config-variation stub creation — When a variation is created for a nonexistent parent config, the mock creates a stub with mode: "completion" as default. This is pragmatic for mock purposes, but if the agent subsequently checks the config's mode, it might get a misleading answer. A minor edge case — probably fine for eval purposes.

  6. launchdarkly-flag-create score at 67% — The PR description acknowledges the flag-create suite has a "pre-existing mock conflict with seed data" causing Test 2 to fail. Since this is a known issue, might be worth tracking it as a follow-up (e.g., a TODO comment in the config or a GitHub issue) so it doesn't get lost.

Minor Nits

  • evals/.env.example line 21 hardcodes claude-haiku-4-5-20251001 as the default rubric model. If the model aliases in _models.js ever diverge from this, they could drift. Not a real risk today, just noting the coupling.
  • The 9 vulnerabilities (4 moderate, 5 high) from npm install are likely from transitive deps in promptfoo — worth a quick npm audit to confirm none are exploitable in this context.

Overall this is a well-thought-out eval harness with solid engineering. The CI workflow design (diff-gated re-runs, artifact-based aggregation, PR score-diff comments) is particularly nice for keeping API costs down while maintaining visibility.

Copy link
Copy Markdown
Contributor

@ari-launchdarkly ari-launchdarkly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to see if the CI will allow this to go through with the failures

Copy link
Copy Markdown
Contributor

@ari-launchdarkly ari-launchdarkly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's block merges when the CI failures exist

dakotasanchez and others added 2 commits May 20, 2026 09:54
Co-authored-by: ari-launchdarkly <asalem@launchdarkly.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 20, 2026

Skill eval results

Skill Before After Δ
ai-configs/aiconfig-create 100/100 (4/4) 75/100 (3/4) -25
ai-configs/aiconfig-tools 75/100 (3/4) 75/100 (3/4) no change
ai-configs/aiconfig-update 80/100 (4/5) 80/100 (4/5) no change
ai-configs/aiconfig-variations 80/100 (4/5) 80/100 (4/5) no change
feature-flags/launchdarkly-flag-create 100/100 (3/3) 100/100 (3/3) no change

Only suites whose source actually changed since their last recorded score were re-run. Soft-failing while we stabilise the baseline.

Add eval-gate job for required status check and remove continue-on-error from evaluate
devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
devin-ai-integration[bot]

This comment was marked as resolved.

@dakotasanchez dakotasanchez enabled auto-merge (squash) May 21, 2026 19:35
@ari-launchdarkly ari-launchdarkly marked this pull request as draft May 21, 2026 19:40
auto-merge was automatically disabled May 21, 2026 19:40

Pull request was converted to draft

@ari-launchdarkly ari-launchdarkly marked this pull request as ready for review May 21, 2026 19:40
@ari-launchdarkly ari-launchdarkly enabled auto-merge (squash) May 21, 2026 19:40
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 16 additional findings in Devin Review.

Open in Devin Review

Comment thread evals/providers/_mock.js
@dakotasanchez dakotasanchez force-pushed the dsanchez/eval-harness branch from 14b5af1 to 1ee2558 Compare May 21, 2026 20:26
@ari-launchdarkly ari-launchdarkly merged commit ce4f04e into main May 21, 2026
31 checks passed
@ari-launchdarkly ari-launchdarkly deleted the dsanchez/eval-harness branch May 21, 2026 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants