Skip to content

Replace grep-based structural tests with behavioral tests (#160)#236

Open
TomHennen wants to merge 2 commits into
mainfrom
claude/wrangle-issue-160-DHJ5P
Open

Replace grep-based structural tests with behavioral tests (#160)#236
TomHennen wants to merge 2 commits into
mainfrom
claude/wrangle-issue-160-DHJ5P

Conversation

@TomHennen
Copy link
Copy Markdown
Owner

Closes #160.

Summary

build/actions/{npm,python}/test.bats and tools/scorecard/test.bats relied heavily on greps that asserted a particular string appeared somewhere in the action.yml or helper script. Those tests confirmed the presence of text, not behavior: moving a keyword into a comment would keep them passing, and a regression that broke the actual logic might not trip them. They also broke on harmless refactors (PR #156's helper extraction forced unrelated test edits with no behavioral change).

This PR follows the template established by lib/validate_path.sh and container/validate_inputs.sh: invoke each helper script against fixture inputs and assert on exit codes, GITHUB_OUTPUT contents, and stdout/stderr. Per the issue, greps remain only as supply-chain guard rails (no curl | sh, SHA-pinned actions, ${{ inputs.* }} flowing through env:, action.yml actually delegating to the scripts the behavioral tests now cover) and for workflow-level structural checks that have no behavioral counterpart.

What changed

  • npm: +27 behavioral tests for validate_inputs.sh (package.json check, lockfile detection, yarn rejection, workspaces rejection, npm+pnpm ambiguity), detect_tooling.sh (the four node-version-resolution branches, npm/pnpm cache split that encodes the issue When pnpm/Yarn support lands, do NOT enable pnpm-store / yarn cache (Mini Shai-Hulud lesson) #205 decision), and build_and_pack.sh (npm/pnpm flow, --ignore-scripts threading through install AND pack, 'no test specified' substring guard, tarball-count check, scripts: null safety). PATH shims for npm/pnpm record invocations; jq stays real. Deleted ~15 source-string greps the behavioral tests supersede.
  • python: +11 behavioral tests for the pyproject.toml requirement in validate_inputs.sh, the three test-discovery branches in run_tests.sh (tests/, test/, [tool.pytest]), the uv/pip split, and the [test][dev] → bare install fallback chain in install_deps.sh. Stub pytest/uv/python on PATH record invocations. Deleted 3 source-string greps.
  • scorecard: +7 behavioral tests for sarif_to_markdown.sh (header+rows, empty results, HTML stripping, WRANGLE_MAX_SUMMARY truncation, missing-file and invalid-JSON exit codes, usage error). The truncation and HTML-stripping logic in particular were silent-failure-class bugs that no grep could catch.

Net: +28 behavioral tests, -18 source-string greps. Total test count in the touched files: npm 77, python 62, scorecard 13.

What was deliberately NOT changed

  • build/actions/container/test.bats — already has solid behavioral tests for validate_inputs.sh and resolve_cache.sh (added in claude: SLSA L3 audit defense-in-depth — ::stop-commands:: guard + adopter-tunable PR cache knobs #225). Remaining greps are workflow/action.yml structural checks the issue explicitly calls out as legitimate guard rails.
  • build/actions/shell/test.bats — the action wraps shellcheck/bats; no in-action script logic to test behaviorally.
  • tools/{dependency-review,osv,syft}/test.bats — already have comprehensive behavioral tests with mocks.
  • tools/{zizmor,scorecard}/action.yml checks, supply-chain greps (curl | sh, SHA-pin assertions, set -f presence), and action.yml → script delegation checks — these are the guard rails the issue says to keep.

Test plan

  • bats build/actions/npm/test.bats — 77/77 pass
  • bats build/actions/python/test.bats — 62/62 pass
  • bats tools/scorecard/test.bats — 13/13 pass
  • Full bats sweep across all in-scope files — 0 new failures (pre-existing bump_action_pins.bats failures are an unrelated test-env git config issue, present on main as well)
  • CI passes (will be exercised on PR)
  • Integration test in wrangle-test still passes (unit-test surface deliberately narrowed; integration covers the real npm/python/pip/uv invocations the PATH shims stand in for)

Generated by Claude Code

The build-action and scorecard test.bats files relied heavily on
greps that asserted a particular string appeared somewhere in the
action.yml or helper script. Those tests confirmed the presence of
text, not behavior: moving a keyword into a comment would keep them
passing, and a regression that broke the actual logic might not
trip them. They also broke on harmless refactors (PR #156's helper
extraction forced unrelated test edits).

This change follows the template established by lib/validate_path.sh
and container/validate_inputs.sh: invoke each helper script against
fixture inputs and assert on exit codes, GITHUB_OUTPUT contents, and
stdout. Greps remain only as supply-chain guard rails (no curl|sh,
SHA-pinned actions, action.yml flowing inputs through env:, action.yml
actually delegating to the script the behavioral tests cover) and for
workflow-level structural checks that have no behavioral counterpart.

  npm: + 27 behavioral tests covering validate_inputs.sh (package.json,
       lockfile, yarn rejection, workspaces rejection, npm+pnpm
       ambiguity), detect_tooling.sh (the four node-version-resolution
       branches, npm/pnpm cache split that encodes issue #205), and
       build_and_pack.sh (npm/pnpm pack flow, --ignore-scripts threading,
       'no test specified' substring guard, tarball count check).
       PATH shims for npm/pnpm record invocations; jq stays real.
       Deleted ~15 source-grep tests these supersede.

  python: + 11 behavioral tests covering pyproject.toml requirement in
          validate_inputs.sh, the three test-discovery branches in
          run_tests.sh, the uv/pip split, and the [test]/[dev]/bare
          install fallback chain in install_deps.sh. Stub pytest/uv/python
          on PATH. Deleted 3 source-grep tests.

  scorecard: + 7 behavioral tests for sarif_to_markdown.sh (header+rows,
             empty results, HTML stripping, WRANGLE_MAX_SUMMARY
             truncation, missing-file and invalid-JSON exit codes, usage
             error). The truncation and HTML stripping in particular were
             silent-failure-class bugs that grep could not catch.

Net: +28 behavioral tests, -18 source-string greps. All 152 tests in
the touched files pass locally.
@TomHennen TomHennen temporarily deployed to integration-test May 20, 2026 02:15 — with GitHub Actions Inactive
Copy link
Copy Markdown
Owner Author

@TomHennen TomHennen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: approve-with-nits

Delivers what it claims — supply-chain greps preserved, fixtures exercise real branches, no actual coverage regression among the 18 deleted greps, CI green. A few should-fix items and one approach-level thought.

Should-fix

  1. The [dev] middle branch of install_deps.sh fallback isn't directly tested. build/actions/python/test.bats:264-291 covers [test] success and the bare-install endpoint, but not the middle ([test] fails → [dev] succeeds → log "Installed with [dev] extra"). A regression that swapped [test] and [dev] order, or dropped the [dev] attempt entirely, would still pass. Add a fixture where the shim fails call 2 and succeeds on call 3, asserting "Installed with [dev] extra".
  2. No explicit pack-uses-same-PM-as-install assertion. build/actions/npm/test.bats:421-451 — the deleted grep asserted "$PM" pack literally. Coverage is preserved by indirection (a regression in the pnpm path would fail the pnpm test), but consider one composite assertion: an npm-path test that asserts ! [[ "$output" == *"pnpm "* ]], and vice versa.
  3. Shim relies on TMP_DIR from caller envbuild/actions/python/test.bats:262-291. Fragile if bats ever sandboxes the run. Use $BATS_TEST_TMPDIR or a dedicated env var.
  4. Pre-existing: run_tests.sh:30 grep -qF '[tool.pytest' matches the literal substring inside any string in pyproject.toml. Worth one extra negative fixture in a follow-up.

Nits

  • build/actions/python/test.bats:268case "${COUNTER_FILE:-/tmp/missing}" in /tmp/missing) ;; *) ;; esac is dead code. Remove.
  • build/actions/npm/test.bats:432-450install_pm_shim comment claims "$TMP_DIR/calls.log" but the shims write to stdout. Stale.
  • setup() recreates shims per-test (77 of them). Move read-only shims to setup_file() for speed.

Approach-level

PATH-shim is the middle of a tradeoff — realistic enough to catch most regressions, fragile enough that real-tool behavior changes can silently break the assumptions. Worth picking deliberately rather than defaulting to "shim everything":

  1. Pure function — extract decision logic so it's just args in → string out. Tightest tests, no shim drift, no real-tool flakiness. detect_tooling.sh's node-version resolution is a strong candidate.
  2. Real tool — works for tools you can install cheaply in CI (jq already, pytest plausibly).
  3. Shim — only when neither of the above fits.

Default to (1)/(2); fall to (3) only when forced.

Good

  • Fixtures realistically drive decision branches (workspaces, scripts:null, both lockfiles, all four node-version-resolution branches, all four cache modes).
  • Issue #205 "decision" asserted via cache= empty + belt-and-braces negative grep — exactly the "assert the decision, not the implementation" framing.
  • Scorecard tests catch truncation/HTML-stripping silent-failure classes that no grep could.
  • Supply-chain greps (curl|sh, /usr/local/bin, SHA-pin, env-flow, action.yml→script delegation, stop_commands_guard wrapping) preserved verbatim.
  • Negative-path tests assert on error-message shape, not just exit code.
  • CI all green.

Generated by Claude Code

Review feedback on the initial PR was: PATH-shimming everything is
the middle of a tradeoff. Where the decision logic is pure (args in
→ string out), extract it and test the function directly — tightest
tests, no shim drift, no real-tool flakiness. Reserve shims for the
orchestration that legitimately needs to drive external tools.

Refactor (the "approach-level" item, the main thing here):

  detect_tooling.sh — split into two pure functions:
    resolve_node_version(input_version, project_dir) — prints
        "<version>|<file>|<reason>" for the four version-source
        branches.
    resolve_pm_cache(project_dir) — prints "<package-manager>|<cache>".
    main() composes them and writes to GITHUB_OUTPUT.

  build_and_pack.sh — split into four pure decision functions:
    detect_pm(path) — "npm" or "pnpm" from the lockfile.
    has_build_script(path) — null-safe scripts.build check.
    has_real_test_script(path) — null-safe + "no test specified"
        substring guard.
    find_one_tarball(path) — globs dist/*.tgz, asserts count==1,
        prints the bare filename.
    main() orchestrates install/build/test/pack against real
        npm/pnpm — that strand still legitimately needs a shim.

  run_tests.sh — split out:
    should_run_pytest(path) — exit 0 iff tests/, test/, or
        [tool.pytest.*] header in pyproject.toml. The original
        grep-qF substring match also fixed: anchored to start of
        line so a description string containing "[tool.pytest"
        doesn't false-positive.

Each script has a sourcing guard (`if [[ "${BASH_SOURCE[0]}" == \
"${0}" ]]; then main "$@"; fi`) so the functions can be sourced
into a test subshell without running main.

Test reorg follows the same three-tier shape:

  Tier 1 — Pure-function tests: `run bash -c 'source script;
  func "$@"' -- args`. No shims, no GITHUB_OUTPUT plumbing. Covers
  every branch of every pure function. Includes the "scripts: null"
  null-safety case and the new `should_run_pytest` start-of-line
  guard against the literal-substring-in-description regression.

  Tier 2 — validate_inputs.sh end-to-end (no external deps beyond jq).

  Tier 3 — Integration: a small set of orchestration tests that
  drive build_and_pack.sh / run_tests.sh / install_deps.sh through
  PATH shims. The shims now live in setup_file() under
  $BATS_FILE_TMPDIR (per the reviewer's perf nit), not per-test.

Should-fix items from the review:

  1. Missing [dev] middle-branch test for install_deps.sh's
     fallback cascade. Added: WRANGLE_TEST_PIP_FAILS=1 makes
     [test] fail and asserts [dev] runs and succeeds. WRANGLE_TEST
     _PIP_FAILS=1,2 covers the full cascade to bare install. The
     python shim takes the fail list as a comma-separated env var.

  2. No explicit pack-uses-same-PM-as-install assertion. Added
     cross-PM negation in both end-to-end pipeline tests:
     `! grep -qE '^pnpm '` on the npm path, `! grep -qE '^npm '`
     on the pnpm path.

  3. Python shim referenced caller's `$TMP_DIR`. Replaced with
     $BATS_TEST_TMPDIR (per-test) and $BATS_FILE_TMPDIR (per-file
     shim dir).

  4. run_tests.sh's grep -qF '[tool.pytest' matched the substring
     anywhere in pyproject.toml. Anchored to ^[tool\.pytest and
     added a regression test (description containing the literal
     "[tool.pytest configuration]" must NOT trigger discovery).

Nits from the review:

  - Removed dead `case "${COUNTER_FILE:-/tmp/missing}"` code.
  - Removed stale "writes to calls.log" comment; shims write to
    stdout.
  - Shim creation moved from per-test setup() to per-file
    setup_file() under $BATS_FILE_TMPDIR.

Test counts (touched files): npm 82, python 66, scorecard 13.
All scripts shellcheck-clean; the three refactored scripts
preserve their action.yml calling contracts.
@TomHennen TomHennen temporarily deployed to integration-test May 21, 2026 01:10 — with GitHub Actions Inactive
TomHennen pushed a commit that referenced this pull request May 21, 2026
…-binary e2e

Restructures tools/osv/test.bats into three layers, mirroring the
behavioral-test pattern adopted in #236:

  - render_md.sh — direct invocation with inline SARIF heredocs per
    test. New coverage: CVSS bucket mapping (CRITICAL/HIGH/MEDIUM/
    LOW/UNKNOWN), per-ruleId dedupe, file:// stripping, fixed-version
    extraction from rule.help.markdown, em-dash on missing fixed
    versions, empty-SARIF "No known vulnerabilities" line, exit codes
    for missing file (1), invalid JSON (2), and missing arg (1).
  - adapter.sh — end-to-end via the PATH-shimmed osv-scanner. Keeps
    the existing exit-code contract tests; #197 regression now
    asserts count parity between unique ruleIds in SARIF and rendered
    rows.
  - install.sh — unchanged verification-chain tests, renamed
    TEST_DIR → TMP_DIR.

Drops the source-mutating "render failure does not fail the adapter"
test (it backed up render_md.sh, overwrote it with a stub, and
restored — a cleanup race could leave the repo dirty). The behaviour
it covered is now split across direct render_md.sh tests that assert
clean non-zero exits, plus an adapter test that confirms output.md
is non-empty on the happy path. The `|| placeholder` shell wiring
in adapter.sh is one line and doesn't need its own bats test.

Adds a real-binary e2e (`osv e2e: real osv-scanner produces consistent
SARIF + MD`) that drives the full pipeline against
testdata/vulnerable_go.mod (pinned to go 1.20 stdlib for stable
advisories). Skips gracefully when osv-scanner isn't on PATH or
osv.dev is unreachable (sandboxed CI), so it doesn't flake; in CI
with network access it asserts the #197 invariant against findings
the tool actually produces.

Net: 30 osv tests (was 17), all pass; shellcheck clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace grep-based structural tests with behavioral tests

2 participants