Skip to content

chore(evals): add eval case for PR 144 bash tool security findings#152

Open
valdis wants to merge 1 commit into
mainfrom
proposed-eval-for-security-scans
Open

chore(evals): add eval case for PR 144 bash tool security findings#152
valdis wants to merge 1 commit into
mainfrom
proposed-eval-for-security-scans

Conversation

@valdis

@valdis valdis commented May 11, 2026

Copy link
Copy Markdown
Collaborator

Captures 11 security findings from the hardened bash tool PR across multiple self-review iterations. baseSha 415d0e5.

@valdis valdis force-pushed the proposed-eval-for-security-scans branch from 4a3fbea to e9ece28 Compare May 11, 2026 16:30
@github-actions

github-actions Bot commented May 11, 2026

Copy link
Copy Markdown

QualOps Code Quality Analysis

Status: ⚠️ FAILED - Critical or high severity issues found

Summary

  • Total Issues: 4
  • Critical: 0 🔴
  • High: 2 🟠
  • Medium: 2 🟡
  • Low: 0 🟢
  • Files Analyzed: 18

🟠 High Issues (2)

  • evals/datasets/inbox/qualops-pr-144-bash-tool/repo/src/stages/review/agentic/tools/bash/policy.ts:393 - security
    git -c config injection bypass: only the first -c flag is checked, allowing subsequent -c flags to inject dangerous git config like core.hooksPath or core.fsmonitor

  • evals/datasets/inbox/qualops-pr-144-bash-tool/repo/src/stages/review/agentic/tools/bash/env-scrub.ts:140 - security
    BASH_ENV environment variable not scrubbed — allows pre-execution code injection into the non-interactive bash shell

🟡 Medium Issues (2)

  • evals/datasets/inbox/qualops-pr-144-bash-tool/repo/src/stages/review/agentic/tools/bash/policy.ts:365 - security
    Path traversal via relative cd + relative file access: cd with ../ segments is allowed by policy, enabling file reads outside workspace

  • evals/datasets/inbox/qualops-pr-144-bash-tool/repo/src/stages/review/agentic/tools/bash/policy.ts:31 - security
    Policy bypass via proxy commands (xargs, find -exec/-delete, perl, ruby, awk) that can execute denied binaries indirectly

📊 Full Report

View detailed report


Powered by QualOps

@valdis valdis force-pushed the proposed-eval-for-security-scans branch 4 times, most recently from 04976ff to fb05563 Compare May 15, 2026 11:46
Captures 11 security findings from the hardened bash tool PR across
multiple self-review iterations. baseSha 415d0e5.
@valdis valdis force-pushed the proposed-eval-for-security-scans branch from fb05563 to ceba928 Compare May 19, 2026 08:04
sebastianwessel added a commit that referenced this pull request May 21, 2026
Addresses review comments on the smoke harness:

1. Provider/model configuration now flows through ConfigService instead
   of a hardcoded PROVIDER_DEFAULTS table local to the smoke harness.
   The spec writes a per-provider temp .qualopsrc.json under
   tests/smoke/.tmp/, calls ConfigService.setConfigPath(), and obtains
   the AIProvider via AIFactory.createForStage('review') — the same
   path production code uses. Pricing + model defaults come from
   PROVIDER_DEFAULTS in src/config/config.ts (with one inline default
   for GitHub Models, which is not in that table).

2. Standalone tsx script replaced with a Jest spec at
   tests/smoke/provider-dialect-smoke.spec.ts running under its own
   jest.smoke.config.ts. The base jest.config.js already constrains
   roots to tests/unit/, so this file is unreachable from the default
   `npm test` run — no testPathIgnorePatterns entry needed.
   `npm run test:smoke` uses the smoke config. Per-provider credential
   presence is checked at module load and missing-credential providers
   are statically marked describe.skip() so the entire 4-stage block
   shows up as Skipped in the test report rather than Pass.

3. Input is now a slice fixture under
   evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree),
   loosely following TDR 0002 (docs/tdr/0002-evals-from-real-prs.md).
   The inbox dataset infrastructure from PR #152 has not landed yet,
   so this fixture is a self-contained smoke input; it slots into the
   new format if/when the slice harness lands.

Workflow file is left in its current repo-root location for now; a
follow-up with workflow-scoped credentials will move it back under
.github/workflows/.

Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skipped, 0 failed
- npm run test:smoke with malformed Anthropic key → 4 failed (3 with
  401 from anthropic.completeStructured wrapError, 1 root-cause-extract
  caught by the token-stats silent-failure assertion), 12 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
valdis pushed a commit that referenced this pull request May 29, 2026
Addresses review comments on the smoke harness:

1. Provider/model configuration now flows through ConfigService instead
   of a hardcoded PROVIDER_DEFAULTS table local to the smoke harness.
   The spec writes a per-provider temp .qualopsrc.json under
   tests/smoke/.tmp/, calls ConfigService.setConfigPath(), and obtains
   the AIProvider via AIFactory.createForStage('review') — the same
   path production code uses. Pricing + model defaults come from
   PROVIDER_DEFAULTS in src/config/config.ts (with one inline default
   for GitHub Models, which is not in that table).

2. Standalone tsx script replaced with a Jest spec at
   tests/smoke/provider-dialect-smoke.spec.ts running under its own
   jest.smoke.config.ts. The base jest.config.js already constrains
   roots to tests/unit/, so this file is unreachable from the default
   `npm test` run — no testPathIgnorePatterns entry needed.
   `npm run test:smoke` uses the smoke config. Per-provider credential
   presence is checked at module load and missing-credential providers
   are statically marked describe.skip() so the entire 4-stage block
   shows up as Skipped in the test report rather than Pass.

3. Input is now a slice fixture under
   evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree),
   loosely following TDR 0002 (docs/tdr/0002-evals-from-real-prs.md).
   The inbox dataset infrastructure from PR #152 has not landed yet,
   so this fixture is a self-contained smoke input; it slots into the
   new format if/when the slice harness lands.

Workflow file is left in its current repo-root location for now; a
follow-up with workflow-scoped credentials will move it back under
.github/workflows/.

Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skipped, 0 failed
- npm run test:smoke with malformed Anthropic key → 4 failed (3 with
  401 from anthropic.completeStructured wrapError, 1 root-cause-extract
  caught by the token-stats silent-failure assertion), 12 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
valdis pushed a commit that referenced this pull request May 29, 2026
Addresses review comments on the smoke harness:

1. Provider/model configuration now flows through ConfigService instead
   of a hardcoded PROVIDER_DEFAULTS table local to the smoke harness.
   The spec writes a per-provider temp .qualopsrc.json under
   tests/smoke/.tmp/, calls ConfigService.setConfigPath(), and obtains
   the AIProvider via AIFactory.createForStage('review') — the same
   path production code uses. Pricing + model defaults come from
   PROVIDER_DEFAULTS in src/config/config.ts (with one inline default
   for GitHub Models, which is not in that table).

2. Standalone tsx script replaced with a Jest spec at
   tests/smoke/provider-dialect-smoke.spec.ts running under its own
   jest.smoke.config.ts. The base jest.config.js already constrains
   roots to tests/unit/, so this file is unreachable from the default
   `npm test` run — no testPathIgnorePatterns entry needed.
   `npm run test:smoke` uses the smoke config. Per-provider credential
   presence is checked at module load and missing-credential providers
   are statically marked describe.skip() so the entire 4-stage block
   shows up as Skipped in the test report rather than Pass.

3. Input is now a slice fixture under
   evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree),
   loosely following TDR 0002 (docs/tdr/0002-evals-from-real-prs.md).
   The inbox dataset infrastructure from PR #152 has not landed yet,
   so this fixture is a self-contained smoke input; it slots into the
   new format if/when the slice harness lands.

Workflow file is left in its current repo-root location for now; a
follow-up with workflow-scoped credentials will move it back under
.github/workflows/.

Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skipped, 0 failed
- npm run test:smoke with malformed Anthropic key → 4 failed (3 with
  401 from anthropic.completeStructured wrapError, 1 root-cause-extract
  caught by the token-stats silent-failure assertion), 12 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
valdis pushed a commit that referenced this pull request Jun 4, 2026
Addresses review comments on the smoke harness:

1. Provider/model configuration now flows through ConfigService instead
   of a hardcoded PROVIDER_DEFAULTS table local to the smoke harness.
   The spec writes a per-provider temp .qualopsrc.json under
   tests/smoke/.tmp/, calls ConfigService.setConfigPath(), and obtains
   the AIProvider via AIFactory.createForStage('review') — the same
   path production code uses. Pricing + model defaults come from
   PROVIDER_DEFAULTS in src/config/config.ts (with one inline default
   for GitHub Models, which is not in that table).

2. Standalone tsx script replaced with a Jest spec at
   tests/smoke/provider-dialect-smoke.spec.ts running under its own
   jest.smoke.config.ts. The base jest.config.js already constrains
   roots to tests/unit/, so this file is unreachable from the default
   `npm test` run — no testPathIgnorePatterns entry needed.
   `npm run test:smoke` uses the smoke config. Per-provider credential
   presence is checked at module load and missing-credential providers
   are statically marked describe.skip() so the entire 4-stage block
   shows up as Skipped in the test report rather than Pass.

3. Input is now a slice fixture under
   evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree),
   loosely following TDR 0002 (docs/tdr/0002-evals-from-real-prs.md).
   The inbox dataset infrastructure from PR #152 has not landed yet,
   so this fixture is a self-contained smoke input; it slots into the
   new format if/when the slice harness lands.

Workflow file is left in its current repo-root location for now; a
follow-up with workflow-scoped credentials will move it back under
.github/workflows/.

Verified locally:
- npm run lint clean
- npm run test:smoke (no credentials) → 16 skipped, 0 failed
- npm run test:smoke with malformed Anthropic key → 4 failed (3 with
  401 from anthropic.completeStructured wrapError, 1 root-cause-extract
  caught by the token-stats silent-failure assertion), 12 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
valdis added a commit that referenced this pull request Jun 4, 2026
## Summary

Automates the unchecked manual smoke item from PR #145's test plan:
exercises each of the 4 AI caller stages migrated to native structured
output (`file-reviewer`, `validation-resolver`, `dedup-resolver`,
`root-cause-extract`) against each real provider (`anthropic`, `openai`,
`bedrock`, `github`) using a slice fixture as input. Validates plumbing
only — the provider-specific dialect path returns a zod-validated
response without throwing. Output quality stays scoped to the deferred
per-stage golden-evals follow-up.

## Why

PR #145 introduced six provider-dialect paths (OpenAI strict
`json_schema`, OpenAI `json_object` fallback, Anthropic `output_config`,
Anthropic `tool_use` fallback, Bedrock forced `tool_use`, GitHub Models
via OpenAI-compatible) and four zod schemas. Unit tests cover each path
with mocked SDKs; nothing exercises a full stage call end-to-end against
a real provider. The risk surface is the stage × dialect matrix.

## Approach

- **Jest spec** at `tests/smoke/provider-dialect-smoke.spec.ts` with its
own `jest.smoke.config.ts`. The base `jest.config.js` constrains `roots`
to `tests/unit/`, so this file is unreachable from default `npm test`;
`npm run test:smoke` uses the smoke config.
- **Provider configuration via `ConfigService`**: per-provider temp
`.qualopsrc.json` written under `tests/smoke/.tmp/`, loaded via
`ConfigService.setConfigPath()`. Pricing + model defaults come from
`PROVIDER_DEFAULTS` in `src/config/config.ts` (with one inline default
for GitHub Models, which is not in that table). Stage classes are
obtained via `AIFactory.createForStage('review')` — same path production
code uses.
- **Skip vs fail**: per-provider credential presence is checked at
module load and missing-credential providers are statically marked
`describe.skip()`, so the entire 4-stage block shows up as Skipped in
the test report rather than Pass. Providers with present-but-malformed
credentials are attempted and fail loudly via the provider class's own
`validateApiKey()`.
- **Input**: slice fixture at
`evals/datasets/inbox/smoke-sql-injection/` (`slice.json` + `repo/`
tree), loosely following [TDR
0002](https://github.com/eggai-tech/qualops/blob/main/docs/tdr/0002-evals-from-real-prs.md).
The inbox dataset infrastructure from PR #152 hasn't landed yet, so this
fixture is self-contained; it slots into the new format if/when the
slice harness lands.
- **`root-cause-extract`** swallows provider errors internally and
returns synthetic `{rootCause: 'other', confidence: 0}` classifications.
The spec cross-checks
`AIFactory.createForStage('review').getTokenStats()` and the
classification distribution to surface silent failures as test failures.

## CI

Nightly + manual workflow. Workflow file is currently at the repo root
pending a workflow-scoped push that moves it under `.github/workflows/`.

## Test plan

- [x] `npm run lint` clean
- [x] `npm run test:smoke` with no credentials → 16 skipped, 0 failed
(all 4 describes skipped)
- [x] `npm run test:smoke` with a malformed Anthropic key → 4 failed (3
with 401 from `anthropic.completeStructured` wrapError, 1 from the
`root-cause-extract` silent-failure assertion), 12 skipped
- [x] No artifacts left behind after a smoke run
(`.qualops/prompts/_smoke-*.md`, `tests/smoke/.tmp/`,
`.qualops/reports/.smoke-*` all cleaned by `afterAll`)
- [ ] CI workflow dry-run after the workflow file is moved into
`.github/workflows/`

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Valdis Pornieks <pornieks@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant