
Provider-dialect smoke harness for AI caller stages #188

Merged: valdis merged 9 commits into main from feat/qualops-45-provider-dialect-smoke-tests on Jun 4, 2026

Conversation

sebastianwessel (Contributor) commented May 21, 2026

Summary

Automates the unchecked manual smoke item from PR #145's test plan: exercises each of the 4 AI caller stages migrated to native structured output (file-reviewer, validation-resolver, dedup-resolver, root-cause-extract) against each real provider (anthropic, openai, bedrock, github) using a slice fixture as input. Validates plumbing only — the provider-specific dialect path returns a zod-validated response without throwing. Output quality stays scoped to the deferred per-stage golden-evals follow-up.

Why

PR #145 introduced six provider-dialect paths (OpenAI strict json_schema, OpenAI json_object fallback, Anthropic output_config, Anthropic tool_use fallback, Bedrock forced tool_use, GitHub Models via OpenAI-compatible) and four zod schemas. Unit tests cover each path with mocked SDKs; nothing exercises a full stage call end-to-end against a real provider. The risk surface is the stage × dialect matrix.
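The stage × provider side of that matrix can be sketched as a plain enumeration. This is an illustrative sketch only: the stage and provider names come from the summary above, while `SmokeCase` and `buildSmokeMatrix` are hypothetical names, and the sketch does not model which of the six dialect paths a provider picks at runtime.

```typescript
// Stage and provider names as listed in the PR summary.
const STAGES = ['file-reviewer', 'validation-resolver', 'dedup-resolver', 'root-cause-extract'] as const;
const PROVIDERS = ['anthropic', 'openai', 'bedrock', 'github'] as const;

type Stage = (typeof STAGES)[number];
type Provider = (typeof PROVIDERS)[number];

// Hypothetical shape for one smoke call.
interface SmokeCase {
  provider: Provider;
  stage: Stage;
}

// Hypothetical helper: one case per (provider, stage) pair.
function buildSmokeMatrix(): SmokeCase[] {
  const cases: SmokeCase[] = [];
  for (const provider of PROVIDERS) {
    for (const stage of STAGES) {
      cases.push({ provider, stage });
    }
  }
  return cases;
}

console.log(buildSmokeMatrix().length); // 16
```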

Approach

  • Jest spec at tests/smoke/provider-dialect-smoke.spec.ts with its own jest.smoke.config.ts. The base jest.config.js constrains roots to tests/unit/, so this file is unreachable from default npm test; npm run test:smoke uses the smoke config.
  • Provider configuration via ConfigService: per-provider temp .qualopsrc.json written under tests/smoke/.tmp/, loaded via ConfigService.setConfigPath(). Pricing + model defaults come from PROVIDER_DEFAULTS in src/config/config.ts (with one inline default for GitHub Models, which is not in that table). Stage classes are obtained via AIFactory.createForStage('review') — same path production code uses.
  • Skip vs fail: per-provider credential presence is checked at module load and missing-credential providers are statically marked describe.skip(), so the entire 4-stage block shows up as Skipped in the test report rather than Pass. Providers with present-but-malformed credentials are attempted and fail loudly via the provider class's own validateApiKey().
  • Input: slice fixture at evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree), loosely following TDR 0002. The inbox dataset infrastructure from PR #152 (chore(evals): add eval case for PR 144 bash tool security findings) hasn't landed yet, so this fixture is self-contained; it slots into the new format if/when the slice harness lands.
  • root-cause-extract swallows provider errors internally and returns synthetic {rootCause: 'other', confidence: 0} classifications. The spec cross-checks AIFactory.createForStage('review').getTokenStats() and the classification distribution to surface silent failures as test failures.
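The silent-failure cross-check in the last bullet could look roughly like the sketch below. The exact rule the spec applies is not spelled out here, so this assumes a run is suspicious when no tokens were spent or when every classification matches the synthetic fallback. `Classification`, `TokenStats`, its `totalTokens` field, and `looksLikeSilentFailure` are hypothetical names, not the real QualOps types.

```typescript
// Hypothetical shapes; the real types in the QualOps source may differ.
interface Classification {
  rootCause: string;
  confidence: number;
}

interface TokenStats {
  totalTokens: number; // assumed field name
}

// Assumed rule: flag the run when the provider was evidently never reached
// (zero tokens) or when every result is the synthetic fallback
// {rootCause: 'other', confidence: 0}.
function looksLikeSilentFailure(classifications: Classification[], stats: TokenStats): boolean {
  const noTokensSpent = stats.totalTokens === 0;
  const allSynthetic =
    classifications.length > 0 &&
    classifications.every((c) => c.rootCause === 'other' && c.confidence === 0);
  return noTokensSpent || allSynthetic;
}
```

In a Jest spec this would typically back an assertion such as `expect(looksLikeSilentFailure(results, stats)).toBe(false)`, so that a swallowed provider error fails the test instead of passing quietly.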

CI

Nightly + manual workflow. Workflow file is currently at the repo root pending a workflow-scoped push that moves it under .github/workflows/.

Test plan

  • npm run lint clean
  • npm run test:smoke with no credentials → 16 skipped, 0 failed (all 4 describes skipped)
  • npm run test:smoke with a malformed Anthropic key → 4 failed (3 with 401 from anthropic.completeStructured wrapError, 1 from the root-cause-extract silent-failure assertion), 12 skipped
  • No artifacts left behind after a smoke run (.qualops/prompts/_smoke-*.md, tests/smoke/.tmp/, .qualops/reports/.smoke-* all cleaned by afterAll)
  • CI workflow dry-run after the workflow file is moved into .github/workflows/
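The skipped counts in the test plan follow directly from credential presence. A minimal sketch of that arithmetic, assuming the env-var names mirror the secret names in the commit message (ANTHROPIC_API_KEY, OPENAI_API_KEY, GITHUB_API_KEY); the two AWS variable names are an assumption, since the source only says AWS_*, and `hasCredentials` and `planRun` are hypothetical helpers:

```typescript
type Env = Record<string, string | undefined>;

// Env vars whose presence decides skip vs attempt, per provider.
const REQUIRED_ENV: Record<string, string[]> = {
  anthropic: ['ANTHROPIC_API_KEY'],
  openai: ['OPENAI_API_KEY'],
  github: ['GITHUB_API_KEY'],
  bedrock: ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'], // assumed names
};

const STAGES_PER_PROVIDER = 4;

// Presence check only: a malformed key still counts as present,
// so the provider is attempted and fails loudly rather than skipping.
function hasCredentials(provider: string, env: Env): boolean {
  const names = REQUIRED_ENV[provider] ?? [];
  return names.length > 0 && names.every((name) => (env[name] ?? '').trim() !== '');
}

function planRun(env: Env): { attempted: string[]; skippedTests: number } {
  const providers = Object.keys(REQUIRED_ENV);
  const attempted = providers.filter((p) => hasCredentials(p, env));
  const skippedTests = (providers.length - attempted.length) * STAGES_PER_PROVIDER;
  return { attempted, skippedTests };
}

console.log(planRun({}).skippedTests); // 16
console.log(planRun({ ANTHROPIC_API_KEY: 'not-a-real-key' }).skippedTests); // 12
```

In the spec itself, the same boolean would choose between `describe` and `describe.skip` at module load.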

github-actions (bot) commented May 21, 2026

QualOps Code Quality Analysis

Status: ⚠️ WARNINGS - Medium severity issues found

Summary

  • Total Issues: 1
  • Critical: 0 🔴
  • High: 0 🟠
  • Medium: 1 🟡
  • Low: 0 🟢
  • Files Analyzed: 11

🟡 Medium Issues (1)

  • .github/workflows/provider-dialect-smoke.yml:56 - bug
    GitHub Actions workflow inputs providers and model are silently ignored — the test spec does not parse CLI arguments
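One possible way to make such an input take effect, sketched under assumptions: the workflow would export the `providers` input as an environment variable and the spec would read it. The variable name `SMOKE_PROVIDERS` and the helper `parseProviderFilter` are hypothetical and are not part of this PR.

```typescript
const KNOWN_PROVIDERS = ['anthropic', 'openai', 'bedrock', 'github'];

// Parses a comma-separated provider list, e.g. "anthropic, openai".
// An unset or empty value means "run all providers".
function parseProviderFilter(raw: string | undefined): string[] {
  if (raw === undefined || raw.trim() === '') {
    return KNOWN_PROVIDERS.slice();
  }
  const requested = raw
    .split(',')
    .map((s) => s.trim().toLowerCase())
    .filter((s) => s !== '');
  const unknown = requested.filter((p) => KNOWN_PROVIDERS.indexOf(p) === -1);
  if (unknown.length > 0) {
    // Fail loudly so a typo in the workflow input is not silently ignored.
    throw new Error(`Unknown provider(s): ${unknown.join(', ')}`);
  }
  return requested;
}
```

The spec could then call `parseProviderFilter(process.env.SMOKE_PROVIDERS)` at module load and skip any provider outside the returned list.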

📊 Full Report

View detailed report


Powered by QualOps

@sebastianwessel sebastianwessel requested a review from valdis May 21, 2026 11:23
4 comment threads on tests/smoke/provider-dialect-smoke.ts (all outdated)
@sebastianwessel sebastianwessel changed the title Feat/qualops 45 provider dialect smoke tests Provider-dialect smoke harness for AI caller stages May 21, 2026
@valdis valdis force-pushed the feat/qualops-45-provider-dialect-smoke-tests branch 2 times, most recently from 8f94b19 to d1337fb Compare May 29, 2026 11:43
sebastianwessel and others added 9 commits June 4, 2026 16:27
…OPS-45)

Automates the unchecked manual smoke item from PR #145's test plan:
exercises the 4 AI caller stages migrated to native structured-output
(file-reviewer, validation-resolver, dedup-resolver, root-cause-extract)
against each real provider (anthropic, openai, bedrock, github) using one
eval dataset entry as input. Validates plumbing only — that the
provider-specific dialect path returns a zod-validated response without
throwing. Output quality remains scoped to the deferred per-stage
golden-evals follow-up.

Why: PR #145 introduced six provider-dialect paths (OpenAI strict
json_schema, OpenAI json_object fallback, Anthropic output_config,
Anthropic tool_use fallback, Bedrock forced tool_use, GitHub Models via
OpenAI-compatible) and four zod schemas. Unit tests cover each path
with mocked SDKs; nothing exercises a full stage call end-to-end against
a real provider. The risk surface is the stage × dialect matrix.

Design:
- Standalone tsx script at tests/smoke/provider-dialect-smoke.ts. Not a
  Jest spec — paid API calls must never enter the default npm test run.
- Reuses evals/src/run-log.js for run-log shape + error classification.
- Per-provider env-var presence determines skip vs attempt; the provider
  classes' own validateApiKey()/validateConfiguration() handle format
  validation, so a malformed CI secret surfaces as a real failure
  (classified errorCode) rather than a silent skip.
- root-cause-extract uses AIFactory.createForStage('review') internally
  and swallows provider errors, so the harness writes a per-provider
  temp .qualopsrc.*.json, swaps ConfigService.setConfigPath(), and
  cross-checks token stats + classification distribution post-call to
  surface silent failures.
- 4 stages × 4 providers = 16 calls per full run. Exit 0 if every
  attempted combination passed (or was skipped for missing credentials),
  1 otherwise. Run log uploaded as CI artifact.
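
The exit-code rule in the last bullet reduces to "any failure wins". A minimal sketch, with `CallResult` and `exitCodeFor` as hypothetical names rather than the script's actual identifiers:

```typescript
type Outcome = 'pass' | 'fail' | 'skip';

// Hypothetical record for one stage/provider call.
interface CallResult {
  provider: string;
  stage: string;
  outcome: Outcome;
}

// 0 when every attempted combination passed (skips do not count against
// the run), 1 as soon as any combination failed.
function exitCodeFor(results: CallResult[]): 0 | 1 {
  return results.some((r) => r.outcome === 'fail') ? 1 : 0;
}
```

Under this rule a run with no credentials at all (16 skips) exits 0, which matches the local verification below.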

CI lane: .github/workflows/provider-dialect-smoke.yml — manual
workflow_dispatch + nightly cron at 03:17 UTC. Secret names mirror env-
var names (secrets.ANTHROPIC_API_KEY, secrets.OPENAI_API_KEY,
secrets.GITHUB_API_KEY, AWS_*) matching what src/config/env.ts reads at
runtime. Concurrency-gated; not part of PR-blocking CI.

Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skips, exit 0
- npm run test:smoke with a malformed Anthropic key → 4 attempts, 4
  fails (3 AUTH_FAILED + 1 UNKNOWN for the silent-fallback stage), exit 1
- Cleanup leaves no prompt files, no tmp configs, no leftover session

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses review comments on the smoke harness:

1. Provider/model configuration now flows through ConfigService instead
   of a hardcoded PROVIDER_DEFAULTS table local to the smoke harness.
   The spec writes a per-provider temp .qualopsrc.json under
   tests/smoke/.tmp/, calls ConfigService.setConfigPath(), and obtains
   the AIProvider via AIFactory.createForStage('review') — the same
   path production code uses. Pricing + model defaults come from
   PROVIDER_DEFAULTS in src/config/config.ts (with one inline default
   for GitHub Models, which is not in that table).

2. Standalone tsx script replaced with a Jest spec at
   tests/smoke/provider-dialect-smoke.spec.ts running under its own
   jest.smoke.config.ts. The base jest.config.js already constrains
   roots to tests/unit/, so this file is unreachable from the default
   `npm test` run — no testPathIgnorePatterns entry needed.
   `npm run test:smoke` uses the smoke config. Per-provider credential
   presence is checked at module load and missing-credential providers
   are statically marked describe.skip() so the entire 4-stage block
   shows up as Skipped in the test report rather than Pass.

3. Input is now a slice fixture under
   evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree),
   loosely following TDR 0002 (docs/tdr/0002-evals-from-real-prs.md).
   The inbox dataset infrastructure from PR #152 has not landed yet,
   so this fixture is a self-contained smoke input; it slots into the
   new format if/when the slice harness lands.

Workflow file is left in its current repo-root location for now; a
follow-up with workflow-scoped credentials will move it back under
.github/workflows/.

Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skipped, 0 failed
- npm run test:smoke with malformed Anthropic key → 4 failed (3 with
  401 from anthropic.completeStructured wrapError, 1 root-cause-extract
  caught by the token-stats silent-failure assertion), 12 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lication

- Load .env via dotenv in smoke.setup.ts before envConfig singleton initialises,
  so npm run test:smoke works without pre-exporting env vars in the shell
- Remove the exists-guard in setupPrompts (files are always written and always
  cleaned up in afterAll, so the guard added complexity with no benefit)
- Remove the separate system prompt fallback string; PROJECT_ROOT-relative
  readFile of the bundled quality.md is sufficient (file always present in source tree)
@valdis valdis force-pushed the feat/qualops-45-provider-dialect-smoke-tests branch from d1337fb to 0086bbc Compare June 4, 2026 13:28
@valdis valdis merged commit 0aa3a97 into main Jun 4, 2026
7 checks passed
@valdis valdis deleted the feat/qualops-45-provider-dialect-smoke-tests branch June 4, 2026 13:32


2 participants