Provider-dialect smoke harness for AI caller stages #188
Merged
QualOps Code Quality Analysis
Status: Summary
🟡 Medium Issues (1)
valdis
reviewed
May 21, 2026
8f94b19 to
d1337fb
…OPS-45)

Automates the unchecked manual smoke item from PR #145's test plan: exercises the 4 AI caller stages migrated to native structured-output (file-reviewer, validation-resolver, dedup-resolver, root-cause-extract) against each real provider (anthropic, openai, bedrock, github) using one eval dataset entry as input. Validates plumbing only: that the provider-specific dialect path returns a zod-validated response without throwing. Output quality remains scoped to the deferred per-stage golden-evals follow-up.

Why: PR #145 introduced six provider-dialect paths (OpenAI strict json_schema, OpenAI json_object fallback, Anthropic output_config, Anthropic tool_use fallback, Bedrock forced tool_use, GitHub Models via OpenAI-compatible) and four zod schemas. Unit tests cover each path with mocked SDKs; nothing exercises a full stage call end-to-end against a real provider. The risk surface is the stage × dialect matrix.

Design:
- Standalone tsx script at tests/smoke/provider-dialect-smoke.ts. Not a Jest spec: paid API calls must never enter the default npm test run.
- Reuses evals/src/run-log.js for run-log shape + error classification.
- Per-provider env-var presence determines skip vs attempt; the provider classes' own validateApiKey()/validateConfiguration() handle format validation, so a malformed CI secret surfaces as a real failure (classified errorCode) rather than a silent skip.
- root-cause-extract uses AIFactory.createForStage('review') internally and swallows provider errors, so the harness writes a per-provider temp .qualopsrc.*.json, swaps ConfigService.setConfigPath(), and cross-checks token stats + classification distribution post-call to surface silent failures.
- 4 stages × 4 providers = 16 calls per full run. Exit 0 if every attempted combination passed (or was skipped for missing credentials), 1 otherwise. Run log uploaded as CI artifact.

CI lane: .github/workflows/provider-dialect-smoke.yml, triggered by manual workflow_dispatch plus a nightly cron at 03:17 UTC. Secret names mirror env-var names (secrets.ANTHROPIC_API_KEY, secrets.OPENAI_API_KEY, secrets.GITHUB_API_KEY, AWS_*) matching what src/config/env.ts reads at runtime. Concurrency-gated; not part of PR-blocking CI.

Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skips, exit 0
- npm run test:smoke with a malformed Anthropic key → 4 attempts, 4 fails (3 AUTH_FAILED + 1 UNKNOWN for the silent-fallback stage), exit 1
- Cleanup leaves no prompt files, no tmp configs, no leftover session

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
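
The skip-versus-attempt rule and the exit-code rule from this commit message can be sketched as two pure functions. This is an illustration under stated assumptions, not the harness's actual code: `planRun`, `exitCode`, and the `REQUIRED_ENV` table are hypothetical names, the two Bedrock variable names are a guess (the message only says `AWS_*`), and treating an empty string as "missing" is also an assumption.

```typescript
// Sketch only: all names below are illustrative, not taken from the harness.
type Provider = "anthropic" | "openai" | "bedrock" | "github";
type Stage =
  | "file-reviewer"
  | "validation-resolver"
  | "dedup-resolver"
  | "root-cause-extract";
type Outcome = "pass" | "fail" | "skip";
type Env = Record<string, string | undefined>;

const PROVIDERS: Provider[] = ["anthropic", "openai", "bedrock", "github"];
const STAGES: Stage[] = [
  "file-reviewer",
  "validation-resolver",
  "dedup-resolver",
  "root-cause-extract",
];

// Assumption: the Bedrock entries stand in for the unspecified AWS_* variables.
const REQUIRED_ENV: Record<Provider, string[]> = {
  anthropic: ["ANTHROPIC_API_KEY"],
  openai: ["OPENAI_API_KEY"],
  bedrock: ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
  github: ["GITHUB_API_KEY"],
};

interface PlannedCall {
  stage: Stage;
  provider: Provider;
  action: "attempt" | "skip";
}

// Presence check only. Format validation is deliberately left to the provider,
// so a malformed key is attempted and fails instead of being skipped.
function hasCredentials(provider: Provider, env: Env): boolean {
  return REQUIRED_ENV[provider].every((name) => {
    const value = env[name];
    return value !== undefined && value !== "";
  });
}

// Builds the full stage x provider matrix (4 x 4 = 16 entries).
function planRun(env: Env): PlannedCall[] {
  const plan: PlannedCall[] = [];
  for (const provider of PROVIDERS) {
    const action = hasCredentials(provider, env) ? "attempt" : "skip";
    for (const stage of STAGES) {
      plan.push({ stage, provider, action });
    }
  }
  return plan;
}

// Exit 0 when nothing failed; skips do not count as failures.
function exitCode(outcomes: Outcome[]): 0 | 1 {
  return outcomes.some((o) => o === "fail") ? 1 : 0;
}
```

With an empty environment every entry in the plan is a skip and the exit code is 0, which matches the "16 skips, exit 0" result reported above.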
Addresses review comments on the smoke harness:
1. Provider/model configuration now flows through ConfigService instead
of a hardcoded PROVIDER_DEFAULTS table local to the smoke harness.
The spec writes a per-provider temp .qualopsrc.json under
tests/smoke/.tmp/, calls ConfigService.setConfigPath(), and obtains
the AIProvider via AIFactory.createForStage('review') — the same
path production code uses. Pricing + model defaults come from
PROVIDER_DEFAULTS in src/config/config.ts (with one inline default
for GitHub Models, which is not in that table).
2. Standalone tsx script replaced with a Jest spec at
tests/smoke/provider-dialect-smoke.spec.ts running under its own
jest.smoke.config.ts. The base jest.config.js already constrains
roots to tests/unit/, so this file is unreachable from the default
`npm test` run — no testPathIgnorePatterns entry needed.
`npm run test:smoke` uses the smoke config. Per-provider credential
presence is checked at module load and missing-credential providers
are statically marked describe.skip() so the entire 4-stage block
shows up as Skipped in the test report rather than Pass.
3. Input is now a slice fixture under
evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree),
loosely following TDR 0002 (docs/tdr/0002-evals-from-real-prs.md).
The inbox dataset infrastructure from PR #152 has not landed yet,
so this fixture is a self-contained smoke input; it slots into the
new format if/when the slice harness lands.
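
The module-load credential check in item 2 amounts to choosing between `describe` and `describe.skip` before any test is registered. A minimal sketch follows, written against a small structural interface so that it type-checks without Jest installed. `DescribeLike`, `describeForProvider`, and `hasAllEnv` are hypothetical names, not identifiers from the spec file.

```typescript
// Sketch only: a structural stand-in for the parts of Jest's `describe` used here.
type SuiteFn = (name: string, body: () => void) => void;

interface DescribeLike {
  (name: string, body: () => void): void;
  skip: SuiteFn;
}

// True when every listed variable is set to a non-empty string (an assumption
// about what "credential presence" means).
function hasAllEnv(
  names: string[],
  env: Record<string, string | undefined>
): boolean {
  return names.every((name) => {
    const value = env[name];
    return value !== undefined && value !== "";
  });
}

// Picks the registration function once, at module load, so a provider without
// credentials is reported as Skipped rather than as a vacuous Pass.
function describeForProvider(
  describeFn: DescribeLike,
  credentialsPresent: boolean
): SuiteFn {
  return credentialsPresent ? describeFn : describeFn.skip;
}

// In a Jest spec this would be used roughly as:
//   const suite = describeForProvider(
//     describe,
//     hasAllEnv(["ANTHROPIC_API_KEY"], process.env)
//   );
//   suite("anthropic", () => { /* one test per stage */ });
```

Deciding statically keeps the skip visible in the report; a runtime early return inside each test would show up as a pass instead.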
Workflow file is left in its current repo-root location for now; a
follow-up with workflow-scoped credentials will move it back under
.github/workflows/.
Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skipped, 0 failed
- npm run test:smoke with malformed Anthropic key → 4 failed (3 with
401 from anthropic.completeStructured wrapError, 1 root-cause-extract
caught by the token-stats silent-failure assertion), 12 skipped
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lication
- Load .env via dotenv in smoke.setup.ts before the envConfig singleton initialises, so npm run test:smoke works without pre-exporting env vars in the shell
- Remove the exists-guard in setupPrompts (files are always written and always cleaned up in afterAll, so the guard added complexity with no benefit)
- Remove the separate system prompt fallback string; a PROJECT_ROOT-relative readFile of the bundled quality.md is sufficient (the file is always present in the source tree)
d1337fb to
0086bbc
Summary
Automates the unchecked manual smoke item from PR #145's test plan: exercises each of the 4 AI caller stages migrated to native structured output (`file-reviewer`, `validation-resolver`, `dedup-resolver`, `root-cause-extract`) against each real provider (`anthropic`, `openai`, `bedrock`, `github`) using a slice fixture as input. Validates plumbing only: the provider-specific dialect path returns a zod-validated response without throwing. Output quality stays scoped to the deferred per-stage golden-evals follow-up.

Why
PR #145 introduced six provider-dialect paths (OpenAI strict `json_schema`, OpenAI `json_object` fallback, Anthropic `output_config`, Anthropic `tool_use` fallback, Bedrock forced `tool_use`, GitHub Models via OpenAI-compatible) and four zod schemas. Unit tests cover each path with mocked SDKs; nothing exercises a full stage call end-to-end against a real provider. The risk surface is the stage × dialect matrix.

Approach
- Jest spec at `tests/smoke/provider-dialect-smoke.spec.ts` with its own `jest.smoke.config.ts`. The base `jest.config.js` constrains `roots` to `tests/unit/`, so this file is unreachable from default `npm test`; `npm run test:smoke` uses the smoke config.
- Provider/model configuration flows through `ConfigService`: a per-provider temp `.qualopsrc.json` is written under `tests/smoke/.tmp/` and loaded via `ConfigService.setConfigPath()`. Pricing + model defaults come from `PROVIDER_DEFAULTS` in `src/config/config.ts` (with one inline default for GitHub Models, which is not in that table). Stage classes are obtained via `AIFactory.createForStage('review')`, the same path production code uses.
- Providers with missing credentials are statically marked `describe.skip()`, so the entire 4-stage block shows up as Skipped in the test report rather than Pass. Providers with present-but-malformed credentials are attempted and fail loudly via the provider class's own `validateApiKey()`.
- Input is a slice fixture under `evals/datasets/inbox/smoke-sql-injection/` (`slice.json` + `repo/` tree), loosely following TDR 0002. The inbox dataset infrastructure from PR #152 hasn't landed yet, so this fixture is self-contained; it slots into the new format if/when the slice harness lands.
- `root-cause-extract` swallows provider errors internally and returns synthetic `{rootCause: 'other', confidence: 0}` classifications. The spec cross-checks `AIFactory.createForStage('review').getTokenStats()` and the classification distribution to surface silent failures as test failures.

CI
Nightly + manual workflow. Workflow file is currently at the repo root pending a workflow-scoped push that moves it under `.github/workflows/`.

Test plan
- `npm run lint` clean
- `npm run test:smoke` with no credentials → 16 skipped, 0 failed (all 4 describes skipped)
- `npm run test:smoke` with a malformed Anthropic key → 4 failed (3 with 401 from `anthropic.completeStructured` wrapError, 1 from the `root-cause-extract` silent-failure assertion), 12 skipped
- Cleanup leaves no residue (`.qualops/prompts/_smoke-*.md`, `tests/smoke/.tmp/`, `.qualops/reports/.smoke-*` all cleaned by `afterAll`)
- Move the workflow file under `.github/workflows/` (pending the workflow-scoped push)
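
Because `root-cause-extract` swallows provider errors, a genuine result has to be told apart from a swallowed failure after the call returns. The cross-check described in this PR can be sketched as a pure function over the token stats and the returned classifications. This is a hedged illustration: the `TokenStats` shape (a single `totalTokens` field) and the name `findSilentFailure` are assumptions, and the real `getTokenStats()` return type may differ. Only the `{rootCause: 'other', confidence: 0}` fallback shape comes from the description.

```typescript
// Assumed shape; the real getTokenStats() result may carry different fields.
interface TokenStats {
  totalTokens: number;
}

interface Classification {
  rootCause: string;
  confidence: number;
}

// Returns a human-readable reason when the call looks like a swallowed
// provider failure, or null when the result looks genuine.
function findSilentFailure(
  stats: TokenStats,
  results: Classification[]
): string | null {
  // No tokens recorded suggests the provider was never successfully reached.
  if (stats.totalTokens === 0) {
    return "no tokens recorded for the call";
  }
  // A distribution made up entirely of the synthetic fallback suggests every
  // classification came from the error path.
  const synthetic = results.filter(
    (r) => r.rootCause === "other" && r.confidence === 0
  );
  if (results.length > 0 && synthetic.length === results.length) {
    return "every classification is the synthetic fallback";
  }
  return null;
}
```

In a spec, the returned reason would typically be asserted to be `null`, so that a swallowed 401 fails the test instead of passing quietly.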