Replace grep-based structural tests with behavioral tests (#160)#236
Open
TomHennen wants to merge 2 commits into
Open
Replace grep-based structural tests with behavioral tests (#160)#236TomHennen wants to merge 2 commits into
TomHennen wants to merge 2 commits into
Conversation
The build-action and scorecard test.bats files relied heavily on greps that asserted a particular string appeared somewhere in the action.yml or helper script. Those tests confirmed the presence of text, not behavior: moving a keyword into a comment would keep them passing, and a regression that broke the actual logic might not trip them. They also broke on harmless refactors (PR #156's helper extraction forced unrelated test edits). This change follows the template established by lib/validate_path.sh and container/validate_inputs.sh: invoke each helper script against fixture inputs and assert on exit codes, GITHUB_OUTPUT contents, and stdout. Greps remain only as supply-chain guard rails (no curl|sh, SHA-pinned actions, action.yml flowing inputs through env:, action.yml actually delegating to the script the behavioral tests cover) and for workflow-level structural checks that have no behavioral counterpart. npm: + 27 behavioral tests covering validate_inputs.sh (package.json, lockfile, yarn rejection, workspaces rejection, npm+pnpm ambiguity), detect_tooling.sh (the four node-version-resolution branches, npm/pnpm cache split that encodes issue #205), and build_and_pack.sh (npm/pnpm pack flow, --ignore-scripts threading, 'no test specified' substring guard, tarball count check). PATH shims for npm/pnpm record invocations; jq stays real. Deleted ~15 source-grep tests these supersede. python: + 11 behavioral tests covering pyproject.toml requirement in validate_inputs.sh, the three test-discovery branches in run_tests.sh, the uv/pip split, and the [test]/[dev]/bare install fallback chain in install_deps.sh. Stub pytest/uv/python on PATH. Deleted 3 source-grep tests. scorecard: + 7 behavioral tests for sarif_to_markdown.sh (header+rows, empty results, HTML stripping, WRANGLE_MAX_SUMMARY truncation, missing-file and invalid-JSON exit codes, usage error). The truncation and HTML stripping in particular were silent-failure-class bugs that grep could not catch. Net: +28 behavioral tests, -18 source-string greps. All 152 tests in the touched files pass locally.
TomHennen
commented
May 21, 2026
Owner
Author
TomHennen
left a comment
There was a problem hiding this comment.
Verdict: approve-with-nits
Delivers what it claims — supply-chain greps preserved, fixtures exercise real branches, no actual coverage regression among the 18 deleted greps, CI green. A few should-fix items and one approach-level thought.
Should-fix
- The
[dev]middle branch ofinstall_deps.shfallback isn't directly tested.build/actions/python/test.bats:264-291covers[test]success and the bare-install endpoint, but not the middle ([test]fails →[dev]succeeds → log"Installed with [dev] extra"). A regression that swapped[test]and[dev]order, or dropped the[dev]attempt entirely, would still pass. Add a fixture where the shim fails call 2 and succeeds on call 3, asserting"Installed with [dev] extra". - No explicit pack-uses-same-PM-as-install assertion.
build/actions/npm/test.bats:421-451— the deleted grep asserted"$PM" packliterally. Coverage is preserved by indirection (a regression in the pnpm path would fail the pnpm test), but consider one composite assertion: an npm-path test that asserts! [[ "$output" == *"pnpm "* ]], and vice versa. - Shim relies on
TMP_DIRfrom caller env —build/actions/python/test.bats:262-291. Fragile if bats ever sandboxes the run. Use$BATS_TEST_TMPDIRor a dedicated env var. - Pre-existing:
run_tests.sh:30grep -qF '[tool.pytest'matches the literal substring inside any string in pyproject.toml. Worth one extra negative fixture in a follow-up.
Nits
build/actions/python/test.bats:268—case "${COUNTER_FILE:-/tmp/missing}" in /tmp/missing) ;; *) ;; esacis dead code. Remove.build/actions/npm/test.bats:432-450—install_pm_shimcomment claims"$TMP_DIR/calls.log"but the shims write to stdout. Stale.setup()recreates shims per-test (77 of them). Move read-only shims tosetup_file()for speed.
Approach-level
PATH-shim is the middle of a tradeoff — realistic enough to catch most regressions, fragile enough that real-tool behavior changes can silently break the assumptions. Worth picking deliberately rather than defaulting to "shim everything":
- Pure function — extract decision logic so it's just args in → string out. Tightest tests, no shim drift, no real-tool flakiness.
detect_tooling.sh's node-version resolution is a strong candidate. - Real tool — works for tools you can install cheaply in CI (
jqalready,pytestplausibly). - Shim — only when neither of the above fits.
Default to (1)/(2); fall to (3) only when forced.
Good
- Fixtures realistically drive decision branches (workspaces, scripts:null, both lockfiles, all four node-version-resolution branches, all four cache modes).
- Issue #205 "decision" asserted via
cache=empty + belt-and-braces negative grep — exactly the "assert the decision, not the implementation" framing. - Scorecard tests catch truncation/HTML-stripping silent-failure classes that no grep could.
- Supply-chain greps (
curl|sh,/usr/local/bin, SHA-pin, env-flow, action.yml→script delegation, stop_commands_guard wrapping) preserved verbatim. - Negative-path tests assert on error-message shape, not just exit code.
- CI all green.
Generated by Claude Code
Review feedback on the initial PR was: PATH-shimming everything is
the middle of a tradeoff. Where the decision logic is pure (args in
→ string out), extract it and test the function directly — tightest
tests, no shim drift, no real-tool flakiness. Reserve shims for the
orchestration that legitimately needs to drive external tools.
Refactor (the "approach-level" item, the main thing here):
detect_tooling.sh — split into two pure functions:
resolve_node_version(input_version, project_dir) — prints
"<version>|<file>|<reason>" for the four version-source
branches.
resolve_pm_cache(project_dir) — prints "<package-manager>|<cache>".
main() composes them and writes to GITHUB_OUTPUT.
build_and_pack.sh — split into four pure decision functions:
detect_pm(path) — "npm" or "pnpm" from the lockfile.
has_build_script(path) — null-safe scripts.build check.
has_real_test_script(path) — null-safe + "no test specified"
substring guard.
find_one_tarball(path) — globs dist/*.tgz, asserts count==1,
prints the bare filename.
main() orchestrates install/build/test/pack against real
npm/pnpm — that strand still legitimately needs a shim.
run_tests.sh — split out:
should_run_pytest(path) — exit 0 iff tests/, test/, or
[tool.pytest.*] header in pyproject.toml. The original
grep-qF substring match also fixed: anchored to start of
line so a description string containing "[tool.pytest"
doesn't false-positive.
Each script has a sourcing guard (`if [[ "${BASH_SOURCE[0]}" == \
"${0}" ]]; then main "$@"; fi`) so the functions can be sourced
into a test subshell without running main.
Test reorg follows the same three-tier shape:
Tier 1 — Pure-function tests: `run bash -c 'source script;
func "$@"' -- args`. No shims, no GITHUB_OUTPUT plumbing. Covers
every branch of every pure function. Includes the "scripts: null"
null-safety case and the new `should_run_pytest` start-of-line
guard against the literal-substring-in-description regression.
Tier 2 — validate_inputs.sh end-to-end (no external deps beyond jq).
Tier 3 — Integration: a small set of orchestration tests that
drive build_and_pack.sh / run_tests.sh / install_deps.sh through
PATH shims. The shims now live in setup_file() under
$BATS_FILE_TMPDIR (per the reviewer's perf nit), not per-test.
Should-fix items from the review:
1. Missing [dev] middle-branch test for install_deps.sh's
fallback cascade. Added: WRANGLE_TEST_PIP_FAILS=1 makes
[test] fail and asserts [dev] runs and succeeds. WRANGLE_TEST
_PIP_FAILS=1,2 covers the full cascade to bare install. The
python shim takes the fail list as a comma-separated env var.
2. No explicit pack-uses-same-PM-as-install assertion. Added
cross-PM negation in both end-to-end pipeline tests:
`! grep -qE '^pnpm '` on the npm path, `! grep -qE '^npm '`
on the pnpm path.
3. Python shim referenced caller's `$TMP_DIR`. Replaced with
$BATS_TEST_TMPDIR (per-test) and $BATS_FILE_TMPDIR (per-file
shim dir).
4. run_tests.sh's grep -qF '[tool.pytest' matched the substring
anywhere in pyproject.toml. Anchored to ^[tool\.pytest and
added a regression test (description containing the literal
"[tool.pytest configuration]" must NOT trigger discovery).
Nits from the review:
- Removed dead `case "${COUNTER_FILE:-/tmp/missing}"` code.
- Removed stale "writes to calls.log" comment; shims write to
stdout.
- Shim creation moved from per-test setup() to per-file
setup_file() under $BATS_FILE_TMPDIR.
Test counts (touched files): npm 82, python 66, scorecard 13.
All scripts shellcheck-clean; the three refactored scripts
preserve their action.yml calling contracts.
TomHennen
pushed a commit
that referenced
this pull request
May 21, 2026
…-binary e2e Restructures tools/osv/test.bats into three layers, mirroring the behavioral-test pattern adopted in #236: - render_md.sh — direct invocation with inline SARIF heredocs per test. New coverage: CVSS bucket mapping (CRITICAL/HIGH/MEDIUM/ LOW/UNKNOWN), per-ruleId dedupe, file:// stripping, fixed-version extraction from rule.help.markdown, em-dash on missing fixed versions, empty-SARIF "No known vulnerabilities" line, exit codes for missing file (1), invalid JSON (2), and missing arg (1). - adapter.sh — end-to-end via the PATH-shimmed osv-scanner. Keeps the existing exit-code contract tests; #197 regression now asserts count parity between unique ruleIds in SARIF and rendered rows. - install.sh — unchanged verification-chain tests, renamed TEST_DIR → TMP_DIR. Drops the source-mutating "render failure does not fail the adapter" test (it backed up render_md.sh, overwrote it with a stub, and restored — a cleanup race could leave the repo dirty). The behaviour it covered is now split across direct render_md.sh tests that assert clean non-zero exits, plus an adapter test that confirms output.md is non-empty on the happy path. The `|| placeholder` shell wiring in adapter.sh is one line and doesn't need its own bats test. Adds a real-binary e2e (`osv e2e: real osv-scanner produces consistent SARIF + MD`) that drives the full pipeline against testdata/vulnerable_go.mod (pinned to go 1.20 stdlib for stable advisories). Skips gracefully when osv-scanner isn't on PATH or osv.dev is unreachable (sandboxed CI), so it doesn't flake; in CI with network access it asserts the #197 invariant against findings the tool actually produces. Net: 30 osv tests (was 17), all pass; shellcheck clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #160.
Summary
build/actions/{npm,python}/test.batsandtools/scorecard/test.batsrelied heavily on greps that asserted a particular string appeared somewhere in the action.yml or helper script. Those tests confirmed the presence of text, not behavior: moving a keyword into a comment would keep them passing, and a regression that broke the actual logic might not trip them. They also broke on harmless refactors (PR #156's helper extraction forced unrelated test edits with no behavioral change).This PR follows the template established by
lib/validate_path.shandcontainer/validate_inputs.sh: invoke each helper script against fixture inputs and assert on exit codes,GITHUB_OUTPUTcontents, and stdout/stderr. Per the issue, greps remain only as supply-chain guard rails (nocurl | sh, SHA-pinned actions,${{ inputs.* }}flowing throughenv:, action.yml actually delegating to the scripts the behavioral tests now cover) and for workflow-level structural checks that have no behavioral counterpart.What changed
validate_inputs.sh(package.json check, lockfile detection, yarn rejection, workspaces rejection, npm+pnpm ambiguity),detect_tooling.sh(the four node-version-resolution branches, npm/pnpm cache split that encodes the issue When pnpm/Yarn support lands, do NOT enable pnpm-store / yarn cache (Mini Shai-Hulud lesson) #205 decision), andbuild_and_pack.sh(npm/pnpm flow,--ignore-scriptsthreading through install AND pack,'no test specified'substring guard, tarball-count check,scripts: nullsafety). PATH shims fornpm/pnpmrecord invocations;jqstays real. Deleted ~15 source-string greps the behavioral tests supersede.validate_inputs.sh, the three test-discovery branches inrun_tests.sh(tests/,test/,[tool.pytest]), the uv/pip split, and the[test]→[dev]→ bare install fallback chain ininstall_deps.sh. Stubpytest/uv/pythonon PATH record invocations. Deleted 3 source-string greps.sarif_to_markdown.sh(header+rows, empty results, HTML stripping,WRANGLE_MAX_SUMMARYtruncation, missing-file and invalid-JSON exit codes, usage error). The truncation and HTML-stripping logic in particular were silent-failure-class bugs that no grep could catch.Net: +28 behavioral tests, -18 source-string greps. Total test count in the touched files: npm 77, python 62, scorecard 13.
What was deliberately NOT changed
build/actions/container/test.bats— already has solid behavioral tests forvalidate_inputs.shandresolve_cache.sh(added in claude: SLSA L3 audit defense-in-depth — ::stop-commands:: guard + adopter-tunable PR cache knobs #225). Remaining greps are workflow/action.yml structural checks the issue explicitly calls out as legitimate guard rails.build/actions/shell/test.bats— the action wraps shellcheck/bats; no in-action script logic to test behaviorally.tools/{dependency-review,osv,syft}/test.bats— already have comprehensive behavioral tests with mocks.tools/{zizmor,scorecard}/action.ymlchecks, supply-chain greps (curl | sh, SHA-pin assertions,set -fpresence), andaction.yml → scriptdelegation checks — these are the guard rails the issue says to keep.Test plan
bats build/actions/npm/test.bats— 77/77 passbats build/actions/python/test.bats— 62/62 passbats tools/scorecard/test.bats— 13/13 passbump_action_pins.batsfailures are an unrelated test-env git config issue, present onmainas well)Generated by Claude Code