Replace grep-based structural tests with behavioral tests (#160) by TomHennen · Pull Request #236 · TomHennen/wrangle

TomHennen · 2026-05-20T02:15:05Z

Closes #160.

Summary

build/actions/{npm,python}/test.bats and tools/scorecard/test.bats relied heavily on greps that asserted a particular string appeared somewhere in the action.yml or helper script. Those tests confirmed the presence of text, not behavior: moving a keyword into a comment would keep them passing, and a regression that broke the actual logic might not trip them. They also broke on harmless refactors (PR #156's helper extraction forced unrelated test edits with no behavioral change).

This PR follows the template established by lib/validate_path.sh and container/validate_inputs.sh: invoke each helper script against fixture inputs and assert on exit codes, GITHUB_OUTPUT contents, and stdout/stderr. Per the issue, greps remain only as supply-chain guard rails (no curl | sh, SHA-pinned actions, ${{ inputs.* }} flowing through env:, action.yml actually delegating to the scripts the behavioral tests now cover) and for workflow-level structural checks that have no behavioral counterpart.

What changed

npm: +27 behavioral tests for validate_inputs.sh (package.json check, lockfile detection, yarn rejection, workspaces rejection, npm+pnpm ambiguity), detect_tooling.sh (the four node-version-resolution branches, npm/pnpm cache split that encodes the issue When pnpm/Yarn support lands, do NOT enable pnpm-store / yarn cache (Mini Shai-Hulud lesson) #205 decision), and build_and_pack.sh (npm/pnpm flow, --ignore-scripts threading through install AND pack, 'no test specified' substring guard, tarball-count check, scripts: null safety). PATH shims for npm/pnpm record invocations; jq stays real. Deleted ~15 source-string greps the behavioral tests supersede.
python: +11 behavioral tests for the pyproject.toml requirement in validate_inputs.sh, the three test-discovery branches in run_tests.sh (tests/, test/, [tool.pytest]), the uv/pip split, and the [test] → [dev] → bare install fallback chain in install_deps.sh. Stub pytest/uv/python on PATH record invocations. Deleted 3 source-string greps.
scorecard: +7 behavioral tests for sarif_to_markdown.sh (header+rows, empty results, HTML stripping, WRANGLE_MAX_SUMMARY truncation, missing-file and invalid-JSON exit codes, usage error). The truncation and HTML-stripping logic in particular were silent-failure-class bugs that no grep could catch.

Net: +28 behavioral tests, -18 source-string greps. Total test count in the touched files: npm 77, python 62, scorecard 13.

What was deliberately NOT changed

build/actions/container/test.bats — already has solid behavioral tests for validate_inputs.sh and resolve_cache.sh (added in claude: SLSA L3 audit defense-in-depth — ::stop-commands:: guard + adopter-tunable PR cache knobs #225). Remaining greps are workflow/action.yml structural checks the issue explicitly calls out as legitimate guard rails.
build/actions/shell/test.bats — the action wraps shellcheck/bats; no in-action script logic to test behaviorally.
tools/{dependency-review,osv,syft}/test.bats — already have comprehensive behavioral tests with mocks.
tools/{zizmor,scorecard}/action.yml checks, supply-chain greps (curl | sh, SHA-pin assertions, set -f presence), and action.yml → script delegation checks — these are the guard rails the issue says to keep.

Test plan

bats build/actions/npm/test.bats — 77/77 pass
bats build/actions/python/test.bats — 62/62 pass
bats tools/scorecard/test.bats — 13/13 pass
Full bats sweep across all in-scope files — 0 new failures (pre-existing bump_action_pins.bats failures are an unrelated test-env git config issue, present on main as well)
CI passes (will be exercised on PR)
Integration test in wrangle-test still passes (unit-test surface deliberately narrowed; integration covers the real npm/python/pip/uv invocations the PATH shims stand in for)

Generated by Claude Code

The build-action and scorecard test.bats files relied heavily on greps that asserted a particular string appeared somewhere in the action.yml or helper script. Those tests confirmed the presence of text, not behavior: moving a keyword into a comment would keep them passing, and a regression that broke the actual logic might not trip them. They also broke on harmless refactors (PR #156's helper extraction forced unrelated test edits). This change follows the template established by lib/validate_path.sh and container/validate_inputs.sh: invoke each helper script against fixture inputs and assert on exit codes, GITHUB_OUTPUT contents, and stdout. Greps remain only as supply-chain guard rails (no curl|sh, SHA-pinned actions, action.yml flowing inputs through env:, action.yml actually delegating to the script the behavioral tests cover) and for workflow-level structural checks that have no behavioral counterpart. npm: + 27 behavioral tests covering validate_inputs.sh (package.json, lockfile, yarn rejection, workspaces rejection, npm+pnpm ambiguity), detect_tooling.sh (the four node-version-resolution branches, npm/pnpm cache split that encodes issue #205), and build_and_pack.sh (npm/pnpm pack flow, --ignore-scripts threading, 'no test specified' substring guard, tarball count check). PATH shims for npm/pnpm record invocations; jq stays real. Deleted ~15 source-grep tests these supersede. python: + 11 behavioral tests covering pyproject.toml requirement in validate_inputs.sh, the three test-discovery branches in run_tests.sh, the uv/pip split, and the [test]/[dev]/bare install fallback chain in install_deps.sh. Stub pytest/uv/python on PATH. Deleted 3 source-grep tests. scorecard: + 7 behavioral tests for sarif_to_markdown.sh (header+rows, empty results, HTML stripping, WRANGLE_MAX_SUMMARY truncation, missing-file and invalid-JSON exit codes, usage error). The truncation and HTML stripping in particular were silent-failure-class bugs that grep could not catch. Net: +28 behavioral tests, -18 source-string greps. All 152 tests in the touched files pass locally.

TomHennen

Verdict: approve-with-nits

Delivers what it claims — supply-chain greps preserved, fixtures exercise real branches, no actual coverage regression among the 18 deleted greps, CI green. A few should-fix items and one approach-level thought.

Should-fix

The [dev] middle branch of install_deps.sh fallback isn't directly tested. build/actions/python/test.bats:264-291 covers [test] success and the bare-install endpoint, but not the middle ([test] fails → [dev] succeeds → log "Installed with [dev] extra"). A regression that swapped [test] and [dev] order, or dropped the [dev] attempt entirely, would still pass. Add a fixture where the shim fails call 2 and succeeds on call 3, asserting "Installed with [dev] extra".
No explicit pack-uses-same-PM-as-install assertion. build/actions/npm/test.bats:421-451 — the deleted grep asserted "$PM" pack literally. Coverage is preserved by indirection (a regression in the pnpm path would fail the pnpm test), but consider one composite assertion: an npm-path test that asserts ! [[ "$output" == *"pnpm "* ]], and vice versa.
Shim relies on TMP_DIR from caller env — build/actions/python/test.bats:262-291. Fragile if bats ever sandboxes the run. Use $BATS_TEST_TMPDIR or a dedicated env var.
Pre-existing: run_tests.sh:30 grep -qF '[tool.pytest' matches the literal substring inside any string in pyproject.toml. Worth one extra negative fixture in a follow-up.

Nits

build/actions/python/test.bats:268 — case "${COUNTER_FILE:-/tmp/missing}" in /tmp/missing) ;; *) ;; esac is dead code. Remove.
build/actions/npm/test.bats:432-450 — install_pm_shim comment claims "$TMP_DIR/calls.log" but the shims write to stdout. Stale.
setup() recreates shims per-test (77 of them). Move read-only shims to setup_file() for speed.

Approach-level

PATH-shim is the middle of a tradeoff — realistic enough to catch most regressions, fragile enough that real-tool behavior changes can silently break the assumptions. Worth picking deliberately rather than defaulting to "shim everything":

Pure function — extract decision logic so it's just args in → string out. Tightest tests, no shim drift, no real-tool flakiness. detect_tooling.sh's node-version resolution is a strong candidate.
Real tool — works for tools you can install cheaply in CI (jq already, pytest plausibly).
Shim — only when neither of the above fits.

Default to (1)/(2); fall to (3) only when forced.

Good

Fixtures realistically drive decision branches (workspaces, scripts:null, both lockfiles, all four node-version-resolution branches, all four cache modes).
Issue #205 "decision" asserted via cache= empty + belt-and-braces negative grep — exactly the "assert the decision, not the implementation" framing.
Scorecard tests catch truncation/HTML-stripping silent-failure classes that no grep could.
Supply-chain greps (curl|sh, /usr/local/bin, SHA-pin, env-flow, action.yml→script delegation, stop_commands_guard wrapping) preserved verbatim.
Negative-path tests assert on error-message shape, not just exit code.
CI all green.

Generated by Claude Code

Review feedback on the initial PR was: PATH-shimming everything is the middle of a tradeoff. Where the decision logic is pure (args in → string out), extract it and test the function directly — tightest tests, no shim drift, no real-tool flakiness. Reserve shims for the orchestration that legitimately needs to drive external tools. Refactor (the "approach-level" item, the main thing here): detect_tooling.sh — split into two pure functions: resolve_node_version(input_version, project_dir) — prints "<version>|<file>|<reason>" for the four version-source branches. resolve_pm_cache(project_dir) — prints "<package-manager>|<cache>". main() composes them and writes to GITHUB_OUTPUT. build_and_pack.sh — split into four pure decision functions: detect_pm(path) — "npm" or "pnpm" from the lockfile. has_build_script(path) — null-safe scripts.build check. has_real_test_script(path) — null-safe + "no test specified" substring guard. find_one_tarball(path) — globs dist/*.tgz, asserts count==1, prints the bare filename. main() orchestrates install/build/test/pack against real npm/pnpm — that strand still legitimately needs a shim. run_tests.sh — split out: should_run_pytest(path) — exit 0 iff tests/, test/, or [tool.pytest.*] header in pyproject.toml. The original grep-qF substring match also fixed: anchored to start of line so a description string containing "[tool.pytest" doesn't false-positive. Each script has a sourcing guard (`if [[ "${BASH_SOURCE[0]}" == \ "${0}" ]]; then main "$@"; fi`) so the functions can be sourced into a test subshell without running main. Test reorg follows the same three-tier shape: Tier 1 — Pure-function tests: `run bash -c 'source script; func "$@"' -- args`. No shims, no GITHUB_OUTPUT plumbing. Covers every branch of every pure function. Includes the "scripts: null" null-safety case and the new `should_run_pytest` start-of-line guard against the literal-substring-in-description regression. Tier 2 — validate_inputs.sh end-to-end (no external deps beyond jq). Tier 3 — Integration: a small set of orchestration tests that drive build_and_pack.sh / run_tests.sh / install_deps.sh through PATH shims. The shims now live in setup_file() under $BATS_FILE_TMPDIR (per the reviewer's perf nit), not per-test. Should-fix items from the review: 1. Missing [dev] middle-branch test for install_deps.sh's fallback cascade. Added: WRANGLE_TEST_PIP_FAILS=1 makes [test] fail and asserts [dev] runs and succeeds. WRANGLE_TEST _PIP_FAILS=1,2 covers the full cascade to bare install. The python shim takes the fail list as a comma-separated env var. 2. No explicit pack-uses-same-PM-as-install assertion. Added cross-PM negation in both end-to-end pipeline tests: `! grep -qE '^pnpm '` on the npm path, `! grep -qE '^npm '` on the pnpm path. 3. Python shim referenced caller's `$TMP_DIR`. Replaced with $BATS_TEST_TMPDIR (per-test) and $BATS_FILE_TMPDIR (per-file shim dir). 4. run_tests.sh's grep -qF '[tool.pytest' matched the substring anywhere in pyproject.toml. Anchored to ^[tool\.pytest and added a regression test (description containing the literal "[tool.pytest configuration]" must NOT trigger discovery). Nits from the review: - Removed dead `case "${COUNTER_FILE:-/tmp/missing}"` code. - Removed stale "writes to calls.log" comment; shims write to stdout. - Shim creation moved from per-test setup() to per-file setup_file() under $BATS_FILE_TMPDIR. Test counts (touched files): npm 82, python 66, scorecard 13. All scripts shellcheck-clean; the three refactored scripts preserve their action.yml calling contracts.

…-binary e2e Restructures tools/osv/test.bats into three layers, mirroring the behavioral-test pattern adopted in #236: - render_md.sh — direct invocation with inline SARIF heredocs per test. New coverage: CVSS bucket mapping (CRITICAL/HIGH/MEDIUM/ LOW/UNKNOWN), per-ruleId dedupe, file:// stripping, fixed-version extraction from rule.help.markdown, em-dash on missing fixed versions, empty-SARIF "No known vulnerabilities" line, exit codes for missing file (1), invalid JSON (2), and missing arg (1). - adapter.sh — end-to-end via the PATH-shimmed osv-scanner. Keeps the existing exit-code contract tests; #197 regression now asserts count parity between unique ruleIds in SARIF and rendered rows. - install.sh — unchanged verification-chain tests, renamed TEST_DIR → TMP_DIR. Drops the source-mutating "render failure does not fail the adapter" test (it backed up render_md.sh, overwrote it with a stub, and restored — a cleanup race could leave the repo dirty). The behaviour it covered is now split across direct render_md.sh tests that assert clean non-zero exits, plus an adapter test that confirms output.md is non-empty on the happy path. The `|| placeholder` shell wiring in adapter.sh is one line and doesn't need its own bats test. Adds a real-binary e2e (`osv e2e: real osv-scanner produces consistent SARIF + MD`) that drives the full pipeline against testdata/vulnerable_go.mod (pinned to go 1.20 stdlib for stable advisories). Skips gracefully when osv-scanner isn't on PATH or osv.dev is unreachable (sandboxed CI), so it doesn't flake; in CI with network access it asserts the #197 invariant against findings the tool actually produces. Net: 30 osv tests (was 17), all pass; shellcheck clean.

TomHennen temporarily deployed to integration-test May 20, 2026 02:15 — with GitHub Actions Inactive

TomHennen commented May 21, 2026

View reviewed changes

TomHennen temporarily deployed to integration-test May 21, 2026 01:10 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace grep-based structural tests with behavioral tests (#160)#236

Replace grep-based structural tests with behavioral tests (#160)#236
TomHennen wants to merge 2 commits into
mainfrom
claude/wrangle-issue-160-DHJ5P

TomHennen commented May 20, 2026

Uh oh!

TomHennen left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

TomHennen commented May 20, 2026

Summary

What changed

What was deliberately NOT changed

Test plan

Uh oh!

TomHennen left a comment

Choose a reason for hiding this comment

Verdict: approve-with-nits

Should-fix

Nits

Approach-level

Good

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants