chore(evals): add eval case for PR 144 bash tool security findings#152
chore(evals): add eval case for PR 144 bash tool security findings#152valdis wants to merge 1 commit into
Conversation
4a3fbea to
e9ece28
Compare
QualOps Code Quality AnalysisStatus: Summary
🟠 High Issues (2)
🟡 Medium Issues (2)
📊 Full ReportPowered by QualOps |
04976ff to
fb05563
Compare
Captures 11 security findings from the hardened bash tool PR across multiple self-review iterations. baseSha 415d0e5.
fb05563 to
ceba928
Compare
Addresses review comments on the smoke harness:
1. Provider/model configuration now flows through ConfigService instead
of a hardcoded PROVIDER_DEFAULTS table local to the smoke harness.
The spec writes a per-provider temp .qualopsrc.json under
tests/smoke/.tmp/, calls ConfigService.setConfigPath(), and obtains
the AIProvider via AIFactory.createForStage('review') — the same
path production code uses. Pricing + model defaults come from
PROVIDER_DEFAULTS in src/config/config.ts (with one inline default
for GitHub Models, which is not in that table).
2. Standalone tsx script replaced with a Jest spec at
tests/smoke/provider-dialect-smoke.spec.ts running under its own
jest.smoke.config.ts. The base jest.config.js already constrains
roots to tests/unit/, so this file is unreachable from the default
`npm test` run — no testPathIgnorePatterns entry needed.
`npm run test:smoke` uses the smoke config. Per-provider credential
presence is checked at module load and missing-credential providers
are statically marked describe.skip() so the entire 4-stage block
shows up as Skipped in the test report rather than Pass.
3. Input is now a slice fixture under
evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree),
loosely following TDR 0002 (docs/tdr/0002-evals-from-real-prs.md).
The inbox dataset infrastructure from PR #152 has not landed yet,
so this fixture is a self-contained smoke input; it slots into the
new format if/when the slice harness lands.
Workflow file is left in its current repo-root location for now; a
follow-up with workflow-scoped credentials will move it back under
.github/workflows/.
Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skipped, 0 failed
- npm run test:smoke with malformed Anthropic key → 4 failed (3 with
401 from anthropic.completeStructured wrapError, 1 root-cause-extract
caught by the token-stats silent-failure assertion), 12 skipped
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review comments on the smoke harness:
1. Provider/model configuration now flows through ConfigService instead
of a hardcoded PROVIDER_DEFAULTS table local to the smoke harness.
The spec writes a per-provider temp .qualopsrc.json under
tests/smoke/.tmp/, calls ConfigService.setConfigPath(), and obtains
the AIProvider via AIFactory.createForStage('review') — the same
path production code uses. Pricing + model defaults come from
PROVIDER_DEFAULTS in src/config/config.ts (with one inline default
for GitHub Models, which is not in that table).
2. Standalone tsx script replaced with a Jest spec at
tests/smoke/provider-dialect-smoke.spec.ts running under its own
jest.smoke.config.ts. The base jest.config.js already constrains
roots to tests/unit/, so this file is unreachable from the default
`npm test` run — no testPathIgnorePatterns entry needed.
`npm run test:smoke` uses the smoke config. Per-provider credential
presence is checked at module load and missing-credential providers
are statically marked describe.skip() so the entire 4-stage block
shows up as Skipped in the test report rather than Pass.
3. Input is now a slice fixture under
evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree),
loosely following TDR 0002 (docs/tdr/0002-evals-from-real-prs.md).
The inbox dataset infrastructure from PR #152 has not landed yet,
so this fixture is a self-contained smoke input; it slots into the
new format if/when the slice harness lands.
Workflow file is left in its current repo-root location for now; a
follow-up with workflow-scoped credentials will move it back under
.github/workflows/.
Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skipped, 0 failed
- npm run test:smoke with malformed Anthropic key → 4 failed (3 with
401 from anthropic.completeStructured wrapError, 1 root-cause-extract
caught by the token-stats silent-failure assertion), 12 skipped
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review comments on the smoke harness:
1. Provider/model configuration now flows through ConfigService instead
of a hardcoded PROVIDER_DEFAULTS table local to the smoke harness.
The spec writes a per-provider temp .qualopsrc.json under
tests/smoke/.tmp/, calls ConfigService.setConfigPath(), and obtains
the AIProvider via AIFactory.createForStage('review') — the same
path production code uses. Pricing + model defaults come from
PROVIDER_DEFAULTS in src/config/config.ts (with one inline default
for GitHub Models, which is not in that table).
2. Standalone tsx script replaced with a Jest spec at
tests/smoke/provider-dialect-smoke.spec.ts running under its own
jest.smoke.config.ts. The base jest.config.js already constrains
roots to tests/unit/, so this file is unreachable from the default
`npm test` run — no testPathIgnorePatterns entry needed.
`npm run test:smoke` uses the smoke config. Per-provider credential
presence is checked at module load and missing-credential providers
are statically marked describe.skip() so the entire 4-stage block
shows up as Skipped in the test report rather than Pass.
3. Input is now a slice fixture under
evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree),
loosely following TDR 0002 (docs/tdr/0002-evals-from-real-prs.md).
The inbox dataset infrastructure from PR #152 has not landed yet,
so this fixture is a self-contained smoke input; it slots into the
new format if/when the slice harness lands.
Workflow file is left in its current repo-root location for now; a
follow-up with workflow-scoped credentials will move it back under
.github/workflows/.
Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skipped, 0 failed
- npm run test:smoke with malformed Anthropic key → 4 failed (3 with
401 from anthropic.completeStructured wrapError, 1 root-cause-extract
caught by the token-stats silent-failure assertion), 12 skipped
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review comments on the smoke harness:
1. Provider/model configuration now flows through ConfigService instead
of a hardcoded PROVIDER_DEFAULTS table local to the smoke harness.
The spec writes a per-provider temp .qualopsrc.json under
tests/smoke/.tmp/, calls ConfigService.setConfigPath(), and obtains
the AIProvider via AIFactory.createForStage('review') — the same
path production code uses. Pricing + model defaults come from
PROVIDER_DEFAULTS in src/config/config.ts (with one inline default
for GitHub Models, which is not in that table).
2. Standalone tsx script replaced with a Jest spec at
tests/smoke/provider-dialect-smoke.spec.ts running under its own
jest.smoke.config.ts. The base jest.config.js already constrains
roots to tests/unit/, so this file is unreachable from the default
`npm test` run — no testPathIgnorePatterns entry needed.
`npm run test:smoke` uses the smoke config. Per-provider credential
presence is checked at module load and missing-credential providers
are statically marked describe.skip() so the entire 4-stage block
shows up as Skipped in the test report rather than Pass.
3. Input is now a slice fixture under
evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree),
loosely following TDR 0002 (docs/tdr/0002-evals-from-real-prs.md).
The inbox dataset infrastructure from PR #152 has not landed yet,
so this fixture is a self-contained smoke input; it slots into the
new format if/when the slice harness lands.
Workflow file is left in its current repo-root location for now; a
follow-up with workflow-scoped credentials will move it back under
.github/workflows/.
Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skipped, 0 failed
- npm run test:smoke with malformed Anthropic key → 4 failed (3 with
401 from anthropic.completeStructured wrapError, 1 root-cause-extract
caught by the token-stats silent-failure assertion), 12 skipped
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary Automates the unchecked manual smoke item from PR #145's test plan: exercises each of the 4 AI caller stages migrated to native structured output (`file-reviewer`, `validation-resolver`, `dedup-resolver`, `root-cause-extract`) against each real provider (`anthropic`, `openai`, `bedrock`, `github`) using a slice fixture as input. Validates plumbing only — the provider-specific dialect path returns a zod-validated response without throwing. Output quality stays scoped to the deferred per-stage golden-evals follow-up. ## Why PR #145 introduced six provider-dialect paths (OpenAI strict `json_schema`, OpenAI `json_object` fallback, Anthropic `output_config`, Anthropic `tool_use` fallback, Bedrock forced `tool_use`, GitHub Models via OpenAI-compatible) and four zod schemas. Unit tests cover each path with mocked SDKs; nothing exercises a full stage call end-to-end against a real provider. The risk surface is the stage × dialect matrix. ## Approach - **Jest spec** at `tests/smoke/provider-dialect-smoke.spec.ts` with its own `jest.smoke.config.ts`. The base `jest.config.js` constrains `roots` to `tests/unit/`, so this file is unreachable from default `npm test`; `npm run test:smoke` uses the smoke config. - **Provider configuration via `ConfigService`**: per-provider temp `.qualopsrc.json` written under `tests/smoke/.tmp/`, loaded via `ConfigService.setConfigPath()`. Pricing + model defaults come from `PROVIDER_DEFAULTS` in `src/config/config.ts` (with one inline default for GitHub Models, which is not in that table). Stage classes are obtained via `AIFactory.createForStage('review')` — same path production code uses. - **Skip vs fail**: per-provider credential presence is checked at module load and missing-credential providers are statically marked `describe.skip()`, so the entire 4-stage block shows up as Skipped in the test report rather than Pass. Providers with present-but-malformed credentials are attempted and fail loudly via the provider class's own `validateApiKey()`. - **Input**: slice fixture at `evals/datasets/inbox/smoke-sql-injection/` (`slice.json` + `repo/` tree), loosely following [TDR 0002](https://github.com/eggai-tech/qualops/blob/main/docs/tdr/0002-evals-from-real-prs.md). The inbox dataset infrastructure from PR #152 hasn't landed yet, so this fixture is self-contained; it slots into the new format if/when the slice harness lands. - **`root-cause-extract`** swallows provider errors internally and returns synthetic `{rootCause: 'other', confidence: 0}` classifications. The spec cross-checks `AIFactory.createForStage('review').getTokenStats()` and the classification distribution to surface silent failures as test failures. ## CI Nightly + manual workflow. Workflow file is currently at the repo root pending a workflow-scoped push that moves it under `.github/workflows/`. ## Test plan - [x] `npm run lint` clean - [x] `npm run test:smoke` with no credentials → 16 skipped, 0 failed (all 4 describes skipped) - [x] `npm run test:smoke` with a malformed Anthropic key → 4 failed (3 with 401 from `anthropic.completeStructured` wrapError, 1 from the `root-cause-extract` silent-failure assertion), 12 skipped - [x] No artifacts left behind after a smoke run (`.qualops/prompts/_smoke-*.md`, `tests/smoke/.tmp/`, `.qualops/reports/.smoke-*` all cleaned by `afterAll`) - [ ] CI workflow dry-run after the workflow file is moved into `.github/workflows/` --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Valdis Pornieks <pornieks@gmail.com>
Captures 11 security findings from the hardened bash tool PR across multiple self-review iterations. baseSha 415d0e5.