Add promptfoo eval harness for skills + consolidate MCP endpoint by dakotasanchez · Pull Request #65 · launchdarkly/ai-tooling

dakotasanchez · 2026-05-19T23:22:22Z

Summary

Adds a promptfoo-based eval harness under evals/ that runs each skill end-to-end through the Claude Agent SDK with mocked LaunchDarkly MCP tools, asserting on both tool-call trajectory and response quality
Adds a CI workflow (.github/workflows/eval-skills.yml) that runs evals on PRs and nightly
Fixes several bugs found during review of the harness implementation
Consolidates .mcp.json and migration docs to the unified mcp/launchdarkly endpoint

Eval harness (`evals/`)

Provider (providers/claude-skill-agent-sdk.js): loads the skill the same way a real Claude Code session does (off-disk via .claude/skills/<slug>/), exposes an in-process mock MCP server, and records a full tool-call trajectory for assertions.

Suites (one promptfooconfig.yaml per skill, 3–5 test cases each):

aiconfig-create — 4/4 passing (100%)
aiconfig-update — 4/5 passing (80%)
aiconfig-tools — 3/4 passing (75%)
aiconfig-variations — 4/5 passing (80%)
launchdarkly-flag-create — 2/3 passing (67%, pre-existing mock conflict with seed data)

Scripts:

aggregate.js — runs all suites, writes eval-scores.json at repo root
diff-changed-skills.js — git-diff-gated re-run (only re-evaluates suites whose source changed)
render-badges.js — rewrites per-skill README eval score badges
run-models.js — cross-model matrix runner (haiku / sonnet / opus)

Unit tests (tests/*.test.js, 66 tests, no API calls needed):

transform.test.js, output-valid.test.js, assertions.test.js, jsonschema-to-zod.test.js, mock.test.js
Run via npm test from evals/

CI workflow (`.github/workflows/eval-skills.yml`)

Updated default AGENT_MODEL from claude-sonnet-4-20250514 → claude-sonnet-4-6
Added a unit-test job that runs npm test (no API key needed) before the LLM eval jobs

Job flow:

unit-test — runs 66 unit tests on every trigger, no API calls
diff — computes which suites changed since their last recorded score
evaluate — runs changed suites in parallel (max 3), gated on unit-test passing
aggregate — merges results into eval-scores.json, posts PR score-diff comment, commits refreshed scores/badges on schedule

Triggers: pull_request (on skills/**, evals/**, workflow file), schedule (09:17 UTC daily), workflow_dispatch (with run_all override)

Bug fixes (found during review against PR #54 implementation)

run-models.js: .filter(Array.isArray([]) || true) evaluated to .filter(true) — a TypeError crash on any non-empty results array; removed the no-op filter
run-models.js: runSuiteWithModel only checked result.error (OS spawn errors), not result.status !== 0; promptfoo eval failures were silently treated as success
run-models.js: output path was relative instead of absolute, inconsistent with aggregate.js
_mock.js: update-ai-config-variation spread all of input into the variation, polluting state with configKey, variationKey, etc.; now only patches the allowed variation fields
_mock.js: list-ai-configs dropped agent-created configs whose key collided with a seed template key; state entries now always win
_mock.js: create-ai-config-variation for a nonexistent parent config silently orphaned the variation; now creates a minimal stub
previewMock: unconditionally appended "..." even for strings under 200 chars
aiconfig-create Test 3: dead explored variable (computed but never used in pass)
flag-create Test 2: pass: false, score: 0.5 was internally inconsistent; changed to pass: true, score: 0.5
tool-responses.json: list-feature-flags returned empty while list-flags had seed data; aligned them

MCP endpoint consolidation

.mcp.json: replaced separate mcp/fm + mcp/aiconfigs entries with single mcp/launchdarkly entry
mcp-config-templates.md: updated migration section to deprecate both mcp/fm and mcp/aiconfigs (previously only mcp/aiconfigs was marked deprecated; mcp/fm was incorrectly described as a valid mirror)

Assorted

evals/.env.example: added links to credential consoles (console.anthropic.com, platform.openai.com/api-keys)
Removed undocumented EVAL_MODEL env var fallback from the provider (only AGENT_MODEL is supported)
Added unit tests for all utility modules

Test plan

cd evals && npm test — 66 unit tests pass, no API calls needed
npm run eval:aiconfig-create:single — smoke test (~30s, requires ANTHROPIC_API_KEY)
npm run eval:all — full suite run, check eval-scores.json matches expected scores above

ari-launchdarkly

This is a lot. I'm gonna ask Devin to take a pass

devin-ai-integration · 2026-05-20T15:53:08Z

Review Summary

Thorough and well-structured PR. The eval harness design is solid — running skills through the Claude Agent SDK with an in-process mock MCP server is the right call for realism. The bug fixes are all clearly motivated and the unit test coverage (66 tests, all passing) is strong. A few observations below.

Architecture & Design

The provider (claude-skill-agent-sdk.js) is well-documented with a clear header block, and the isolation strategy (per-invocation temp dirs with symlinked skills + empty .mcp.json stub) is clean. Good call using fs.mkdtempSync for concurrent test safety.
The stateful mock overlay in _mock.js is a nice approach — write-tools record into state, read-tools reflect it back. The SEED_CONFIGS data gives the agent believable context without API calls.
_manifest.js as a single source of truth for suite ↔ skill mappings is a good pattern for keeping the scripts, CI, and badge rendering in sync.

Bug Fixes

The fixes called out in the description all look correct:

The .filter(Array.isArray([]) || true) → .filter(true) crash in run-models.js was a real bug (good catch).
update-ai-config-variation now correctly allowlists VARIATION_FIELDS instead of spreading the entire input — the test at mock.test.js:101-128 validates this well.
The previewMock truncation fix is straightforward and correct.

Items Worth Considering

list-feature-flags vs list-flags alignment — The description mentions aligning these in tool-responses.json, but definitions.json only defines list-flags (no list-feature-flags). If any skill or agent happens to call list-feature-flags, it will hit the { error: "No mock configured for tool: ..." } fallback. Worth confirming no skills reference that tool name, or adding a list-feature-flags entry to definitions.json that mirrors list-flags.
CI workflow — contents: write on PRs — The workflow sets permissions: contents: write globally but only needs write access for the schedule/dispatch commit step. Since the commit step is already gated on event_name, this is safe in practice, but tightening to contents: read at the workflow level and overriding to write only in the aggregate job (on schedule/dispatch) would follow least-privilege more closely.
Root package.json — dangling test:llm-evals reference — The old test:llm-evals script pointed to --prefix tests. The diff replaces it with the new eval / eval:all / eval:view scripts, but if there was a tests/ directory that's now stale, it might be worth cleaning up or noting in the description.
Temp directory cleanup on provider failure — The finally block at line 466 in claude-skill-agent-sdk.js does fs.rmSync(cwd, ...) wrapped in a try/catch-swallow. This is fine for normal runs. On early errors (e.g., SDK import failure before cwd is defined), cwd would be undefined, but the catch block handles it gracefully. No action needed — just calling it out.
create-ai-config-variation stub creation — When a variation is created for a nonexistent parent config, the mock creates a stub with mode: "completion" as default. This is pragmatic for mock purposes, but if the agent subsequently checks the config's mode, it might get a misleading answer. A minor edge case — probably fine for eval purposes.
launchdarkly-flag-create score at 67% — The PR description acknowledges the flag-create suite has a "pre-existing mock conflict with seed data" causing Test 2 to fail. Since this is a known issue, might be worth tracking it as a follow-up (e.g., a TODO comment in the config or a GitHub issue) so it doesn't get lost.

Minor Nits

evals/.env.example line 21 hardcodes claude-haiku-4-5-20251001 as the default rubric model. If the model aliases in _models.js ever diverge from this, they could drift. Not a real risk today, just noting the coupling.
The 9 vulnerabilities (4 moderate, 5 high) from npm install are likely from transitive deps in promptfoo — worth a quick npm audit to confirm none are exploitable in this context.

Overall this is a well-thought-out eval harness with solid engineering. The CI workflow design (diff-gated re-runs, artifact-based aggregation, PR score-diff comments) is particularly nice for keeping API costs down while maintaining visibility.

ari-launchdarkly

I want to see if the CI will allow this to go through with the failures

ari-launchdarkly

Let's block merges when the CI failures exist

Co-authored-by: ari-launchdarkly <asalem@launchdarkly.com>

github-actions · 2026-05-20T19:25:48Z

Skill eval results

Skill	Before	After	Δ
`ai-configs/aiconfig-create`	100/100 (4/4)	75/100 (3/4)	-25
`ai-configs/aiconfig-tools`	75/100 (3/4)	75/100 (3/4)	no change
`ai-configs/aiconfig-update`	80/100 (4/5)	80/100 (4/5)	no change
`ai-configs/aiconfig-variations`	80/100 (4/5)	80/100 (4/5)	no change
`feature-flags/launchdarkly-flag-create`	100/100 (3/3)	100/100 (3/3)	no change

Only suites whose source actually changed since their last recorded score were re-run. Soft-failing while we stabilise the baseline.

Add eval-gate job for required status check and remove continue-on-error from evaluate

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

devin-ai-integration

Devin Review found 1 new potential issue.

View 16 additional findings in Devin Review.

dakotasanchez added 2 commits May 19, 2026 15:54

Add prompt evaluations and tests

8a514cb

consolidated MCP server for LD

5115186

dakotasanchez requested a review from a team as a code owner May 19, 2026 23:22

Add CI workflow for skill evals (port from #54 + unit-test job)

7c08190

ari-launchdarkly reviewed May 20, 2026

View reviewed changes

Comment thread evals/.env.example Outdated

Comment thread evals/shared/output-valid.js Outdated

ari-launchdarkly approved these changes May 20, 2026

View reviewed changes

ari-launchdarkly requested changes May 20, 2026

View reviewed changes

dakotasanchez and others added 2 commits May 20, 2026 09:54

Update evals/shared/output-valid.js

8760d0d

Co-authored-by: ari-launchdarkly <asalem@launchdarkly.com>

npm ci calls need --legacy-peer-deps

f86a128

ari-launchdarkly approved these changes May 20, 2026

View reviewed changes

dakotasanchez added 4 commits May 20, 2026 11:16

fix .env detection

f5ffd15

fail CI only if score drops below 75%

3ecda5d

bug in suite eval

8441bf2

fix mocking issue

587a9b2

API key comment on .env.example

2926735

ari-launchdarkly approved these changes May 20, 2026

View reviewed changes

Add eval-gate job for required status checks (#66)

234481e

Add eval-gate job for required status check and remove continue-on-error from evaluate