chore(evals): add eval case for PR 144 bash tool security findings by valdis · Pull Request #152 · eggai-tech/qualops

valdis · 2026-05-11T16:30:22Z

Captures 11 security findings from the hardened bash tool PR across multiple self-review iterations. baseSha 415d0e5.

github-actions · 2026-05-11T16:34:36Z

QualOps Code Quality Analysis

Status: ⚠️ FAILED - Critical or high severity issues found

Summary

Total Issues: 4
Critical: 0 🔴
High: 2 🟠
Medium: 2 🟡
Low: 0 🟢
Files Analyzed: 18

🟠 High Issues (2)

evals/datasets/inbox/qualops-pr-144-bash-tool/repo/src/stages/review/agentic/tools/bash/policy.ts:393 - security
git -c config injection bypass: only the first -c flag is checked, allowing subsequent -c flags to inject dangerous git config like core.hooksPath or core.fsmonitor
evals/datasets/inbox/qualops-pr-144-bash-tool/repo/src/stages/review/agentic/tools/bash/env-scrub.ts:140 - security
BASH_ENV environment variable not scrubbed — allows pre-execution code injection into the non-interactive bash shell

🟡 Medium Issues (2)

evals/datasets/inbox/qualops-pr-144-bash-tool/repo/src/stages/review/agentic/tools/bash/policy.ts:365 - security
Path traversal via relative cd + relative file access: cd with ../ segments is allowed by policy, enabling file reads outside workspace
evals/datasets/inbox/qualops-pr-144-bash-tool/repo/src/stages/review/agentic/tools/bash/policy.ts:31 - security
Policy bypass via proxy commands (xargs, find -exec/-delete, perl, ruby, awk) that can execute denied binaries indirectly

📊 Full Report

View detailed report

Powered by QualOps

Captures 11 security findings from the hardened bash tool PR across multiple self-review iterations. baseSha 415d0e5.

Addresses review comments on the smoke harness: 1. Provider/model configuration now flows through ConfigService instead of a hardcoded PROVIDER_DEFAULTS table local to the smoke harness. The spec writes a per-provider temp .qualopsrc.json under tests/smoke/.tmp/, calls ConfigService.setConfigPath(), and obtains the AIProvider via AIFactory.createForStage('review') — the same path production code uses. Pricing + model defaults come from PROVIDER_DEFAULTS in src/config/config.ts (with one inline default for GitHub Models, which is not in that table). 2. Standalone tsx script replaced with a Jest spec at tests/smoke/provider-dialect-smoke.spec.ts running under its own jest.smoke.config.ts. The base jest.config.js already constrains roots to tests/unit/, so this file is unreachable from the default `npm test` run — no testPathIgnorePatterns entry needed. `npm run test:smoke` uses the smoke config. Per-provider credential presence is checked at module load and missing-credential providers are statically marked describe.skip() so the entire 4-stage block shows up as Skipped in the test report rather than Pass. 3. Input is now a slice fixture under evals/datasets/inbox/smoke-sql-injection/ (slice.json + repo/ tree), loosely following TDR 0002 (docs/tdr/0002-evals-from-real-prs.md). The inbox dataset infrastructure from PR #152 has not landed yet, so this fixture is a self-contained smoke input; it slots into the new format if/when the slice harness lands. Workflow file is left in its current repo-root location for now; a follow-up with workflow-scoped credentials will move it back under .github/workflows/. Verified locally: - npm run lint clean - npm run test:smoke (no credentials) → 16 skipped, 0 failed - npm run test:smoke with malformed Anthropic key → 4 failed (3 with 401 from anthropic.completeStructured wrapError, 1 root-cause-extract caught by the token-stats silent-failure assertion), 12 skipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

## Summary Automates the unchecked manual smoke item from PR #145's test plan: exercises each of the 4 AI caller stages migrated to native structured output (`file-reviewer`, `validation-resolver`, `dedup-resolver`, `root-cause-extract`) against each real provider (`anthropic`, `openai`, `bedrock`, `github`) using a slice fixture as input. Validates plumbing only — the provider-specific dialect path returns a zod-validated response without throwing. Output quality stays scoped to the deferred per-stage golden-evals follow-up. ## Why PR #145 introduced six provider-dialect paths (OpenAI strict `json_schema`, OpenAI `json_object` fallback, Anthropic `output_config`, Anthropic `tool_use` fallback, Bedrock forced `tool_use`, GitHub Models via OpenAI-compatible) and four zod schemas. Unit tests cover each path with mocked SDKs; nothing exercises a full stage call end-to-end against a real provider. The risk surface is the stage × dialect matrix. ## Approach - **Jest spec** at `tests/smoke/provider-dialect-smoke.spec.ts` with its own `jest.smoke.config.ts`. The base `jest.config.js` constrains `roots` to `tests/unit/`, so this file is unreachable from default `npm test`; `npm run test:smoke` uses the smoke config. - **Provider configuration via `ConfigService`**: per-provider temp `.qualopsrc.json` written under `tests/smoke/.tmp/`, loaded via `ConfigService.setConfigPath()`. Pricing + model defaults come from `PROVIDER_DEFAULTS` in `src/config/config.ts` (with one inline default for GitHub Models, which is not in that table). Stage classes are obtained via `AIFactory.createForStage('review')` — same path production code uses. - **Skip vs fail**: per-provider credential presence is checked at module load and missing-credential providers are statically marked `describe.skip()`, so the entire 4-stage block shows up as Skipped in the test report rather than Pass. Providers with present-but-malformed credentials are attempted and fail loudly via the provider class's own `validateApiKey()`. - **Input**: slice fixture at `evals/datasets/inbox/smoke-sql-injection/` (`slice.json` + `repo/` tree), loosely following [TDR 0002](https://github.com/eggai-tech/qualops/blob/main/docs/tdr/0002-evals-from-real-prs.md). The inbox dataset infrastructure from PR #152 hasn't landed yet, so this fixture is self-contained; it slots into the new format if/when the slice harness lands. - **`root-cause-extract`** swallows provider errors internally and returns synthetic `{rootCause: 'other', confidence: 0}` classifications. The spec cross-checks `AIFactory.createForStage('review').getTokenStats()` and the classification distribution to surface silent failures as test failures. ## CI Nightly + manual workflow. Workflow file is currently at the repo root pending a workflow-scoped push that moves it under `.github/workflows/`. ## Test plan - [x] `npm run lint` clean - [x] `npm run test:smoke` with no credentials → 16 skipped, 0 failed (all 4 describes skipped) - [x] `npm run test:smoke` with a malformed Anthropic key → 4 failed (3 with 401 from `anthropic.completeStructured` wrapError, 1 from the `root-cause-extract` silent-failure assertion), 12 skipped - [x] No artifacts left behind after a smoke run (`.qualops/prompts/_smoke-*.md`, `tests/smoke/.tmp/`, `.qualops/reports/.smoke-*` all cleaned by `afterAll`) - [ ] CI workflow dry-run after the workflow file is moved into `.github/workflows/` --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Valdis Pornieks <pornieks@gmail.com>

valdis force-pushed the proposed-eval-for-security-scans branch from 4a3fbea to e9ece28 Compare May 11, 2026 16:30

valdis force-pushed the proposed-eval-for-security-scans branch 4 times, most recently from 04976ff to fb05563 Compare May 15, 2026 11:46

chore(evals): add eval case for PR 144 bash tool security findings

ceba928

Captures 11 security findings from the hardened bash tool PR across multiple self-review iterations. baseSha 415d0e5.

valdis force-pushed the proposed-eval-for-security-scans branch from fb05563 to ceba928 Compare May 19, 2026 08:04

valdis mentioned this pull request May 21, 2026

Provider-dialect smoke harness for AI caller stages #188

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(evals): add eval case for PR 144 bash tool security findings#152

chore(evals): add eval case for PR 144 bash tool security findings#152
valdis wants to merge 1 commit into
mainfrom
proposed-eval-for-security-scans

valdis commented May 11, 2026

Uh oh!

github-actions Bot commented May 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

valdis commented May 11, 2026

Uh oh!

github-actions Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

QualOps Code Quality Analysis

Summary

🟠 High Issues (2)

🟡 Medium Issues (2)

📊 Full Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 11, 2026 •

edited

Loading