Releases: thejefflarson/soundcheck
v1.8.1
Patch release: self-review hardening and quarterly threat-review automation.
Security cleanup (self-review)
Ran /security-review against Soundcheck itself and applied the 13 fixes it flagged (commit 22fcd05). Highlights:
- Self-review poisoning: `.github/workflows/self-review.yml` now loads `security-review/SKILL.md` from a trusted second checkout of the base ref, so a PR that edits its own reviewer prompt can't flip the gate.
- Empty-findings integrity gate: `scripts/security-review-action.py` exits non-zero when a diff touches 3+ files but returns zero findings (prevents a silent-pass failure mode) and emits a signed audit record with the skill's sha256.
- Budget + timeout clamps: `scripts/_claude_cli.py` gained a 1 USD default budget, a 20 USD hard cap, a 30–1800s timeout range, and a `SOUNDCHECK_DISABLE=1` kill switch.
- Dependabot allowlist: auto-merge now requires `semver-patch` AND a dependency name on an explicit allowlist of first-party actions.
- Release-cascade safety: `scripts/release.py` verifies the pushed commit exists on the remote before moving the `v1` floating tag, and consults `git ls-remote --tags` when picking the next action version.
- Markdown-cell escaping: fixed a backslash/backtick ordering bug in the findings table renderer.
One finding (F14, API key rotation/canary) was skipped as operational, not code-level.
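The budget and timeout clamps listed above can be illustrated with a minimal sketch. The numeric limits and the `SOUNDCHECK_DISABLE` variable come from these notes; the function and constant names are hypothetical and are not taken from `scripts/_claude_cli.py`.

```python
import os

# Limits quoted in the release note; the constant names are hypothetical.
DEFAULT_BUDGET_USD = 1.0
MAX_BUDGET_USD = 20.0
MIN_TIMEOUT_S = 30
MAX_TIMEOUT_S = 1800


def clamp_budget(requested=None):
    """Return a budget in USD: 1 by default, never more than 20."""
    if requested is None:
        return DEFAULT_BUDGET_USD
    # Assumption: a negative request is treated as zero spend.
    return min(max(float(requested), 0.0), MAX_BUDGET_USD)


def clamp_timeout(requested):
    """Force a timeout in seconds into the 30-1800 range."""
    return min(max(int(requested), MIN_TIMEOUT_S), MAX_TIMEOUT_S)


def is_disabled(env=None):
    """True when the SOUNDCHECK_DISABLE=1 kill switch is set."""
    env = os.environ if env is None else env
    return env.get("SOUNDCHECK_DISABLE") == "1"
```

Clamping at the wrapper, rather than trusting each caller, means a misconfigured workflow input cannot raise spend or runtime past the hard limits.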
Quarterly threat review now does the checklist
The `skill-smoke-tests.yml` quarterly job used to open an empty checklist issue. It now drafts the issue body: `scripts/quarterly-threat-review.py` reads `docs/threat-radar.md`, pulls its last-modified date from git, and runs `claude -p` with WebSearch/WebFetch/Read to check the OWASP LLM/API Top 10, NVD for AI/LLM CVEs, and watching-tier promotion candidates. It falls back to the prior checklist if the CLI call fails.
Action pin
thejefflarson/soundcheck-action v1.0.12 / v1 now pins soundcheck e63ae37.
v1.8.0
Highlights
All 45 skills de-specialized — dropped language-specific secure-pattern code blocks
in favor of prose principles with a compact pseudo-code anchor. Skills no longer bias
reviews toward a specific framework or struct shape; the security property, not the
implementation, is what the model applies.
New paired smoke test replaces the previous plugin-vs-bare A/B. Every fixture is now
reviewed twice (with skill loaded vs neutral reviewer), scored against the skill's
verification criteria per-criterion, and compared with a Wilcoxon signed-rank test. See
docs/smoke-test-methodology.md for the design rationale.
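For readers unfamiliar with the test, here is a minimal pure-Python sketch of a two-sided Wilcoxon signed-rank test on paired per-fixture scores. It uses the normal approximation without a continuity correction. The release does not say which implementation the harness uses, so the function name and input shape are assumptions; a library routine such as `scipy.stats.wilcoxon` would be the usual choice in practice.

```python
import math


def wilcoxon_signed_rank(with_skill, without_skill):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (w_statistic, p_value). Zero differences are dropped and tied
    absolute differences share their average rank.
    """
    diffs = [a - b for a, b in zip(with_skill, without_skill) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_correction = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_correction += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_correction / 48
    if var <= 0:
        return w, 1.0
    z = (w - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, min(p, 1.0)
```

Each pair is one fixture scored under both arms, so the test measures whether the per-fixture differences lean consistently in one direction instead of comparing two independent pass rates.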
Measured effect (paired smoke, after excluding judge-parse failures):
| Model | Plugin full-pass | Bare full-pass | Gap | Wilcoxon p |
|---|---|---|---|---|
| Haiku | 77% (98/126) | 40% (51/126) | +37pts | < 1e-6 |
| Sonnet | 90% (117/130) | 58% (75/130) | +32pts | < 1e-4 |
External validation — SecurityEval (104 Python samples, external CWE-labeled dataset):
98% full-pass, 99% detection, 99% fix on plugin arm.
Targeted skill fixes based on cross-model regression analysis:
- `excessive-agency`: prescriptive guidance to redesign tool interfaces, not add denylists
- `ssrf`: flags proxy/webhook/URL-preview as highest-risk shapes
- `prompt-injection`: full call site showing output validation gate
- `logging-failures`: CRLF stripping covers dedicated actor/subject parameters
- `model-theft`: log criterion accepts prompt fingerprint as a privacy tradeoff
- `integrity-failures`, `broken-access-control`, `sensitive-disclosure`: criterion wording tightened for conditional application
Harness fixes:
- Judge-parse failures now surface distinctly instead of masquerading as 0-score rows
- A/B bare-mode used `--bare` flag which stripped session auth; replaced with empty plugin-dir approach
- Per-call latency tracking on smoke + SecurityEval JSONL rows
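One way to keep judge-parse failures distinct from genuine zero scores is to carry an explicit status next to the score. The sketch below assumes the judge replies with JSON such as `{"score": 0.5}`; that reply format and every name here are hypothetical, not taken from the harness.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeResult:
    status: str             # "ok" or "parse_error"
    score: Optional[float]  # None when the reply could not be parsed
    raw: str                # original reply, kept for debugging


def parse_judge_reply(raw: str) -> JudgeResult:
    """Parse a judge reply; never turn a malformed reply into a 0 score."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (ValueError, KeyError, TypeError):
        return JudgeResult("parse_error", None, raw)
    return JudgeResult("ok", score, raw)


def summarize(results):
    """Report parse errors separately and average only the parsed scores."""
    scored = [r.score for r in results if r.status == "ok"]
    return {
        "scored": len(scored),
        "parse_errors": sum(r.status == "parse_error" for r in results),
        "mean_score": sum(scored) / len(scored) if scored else None,
    }
```

Excluding parse errors from the mean keeps a flaky judge from dragging down a pass rate.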
Removed scripts/test-auto-invocation*.py (superseded by paired smoke test).
v1.7.0
Eight new skills beyond OWASP, distinctness audit, expanded A/B auto-invocation harness (41 skills / 82 prompts), smoke + A/B retry/robustness. See commits v1.6.0..v1.7.0.
v1.6.0
What's new
Real-world benchmark
New scripts/benchmark-realworld.py validates Soundcheck skills against files from intentionally vulnerable open-source applications at pinned commits:
- OWASP Juice Shop (TypeScript/Node.js) — 10 files: injection, broken-access-control, cryptographic-failures, authentication-failures, integrity-failures
- OWASP PyGoat (Python/Django) — 3 files: injection, authentication-failures, broken-access-control
Result: 13/13 passing, 100% detection rate, 100% fix rate on first full clean run.
SecurityEval benchmark improvements
- 8 more CWEs mapped: 102 → 110 of 121 samples now covered (91%)
- `temperature=0` on all judge calls for deterministic results
Security fixes (from self-review)
- CI expression injection: `workflow_dispatch` inputs bound to env vars before shell use (CWE-78)
- Prompt injection defense in `security-review-action.py`: neutralize `<soundcheck-*>` tags in reviewed file content
- Atomic file writes in `apply_rewrites` with reviewed-file allowlist check
- Implicit `None` return in `api_call_with_retry` replaced with explicit `RuntimeError`
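The tag-neutralization defense can be sketched as follows. The idea is that reviewed file content is untrusted, so any `<soundcheck-*>` tag inside it is escaped before it reaches the prompt. The function name and the choice of HTML-style escaping are assumptions; the actual approach in `security-review-action.py` may differ.

```python
import re

# Matches the start of an opening or closing <soundcheck-...> tag,
# case-insensitively. Escaping only the "<" is enough to break the tag.
_SOUNDCHECK_TAG = re.compile(r"<(/?soundcheck-)", re.IGNORECASE)


def neutralize_soundcheck_tags(text: str) -> str:
    """Escape <soundcheck-*> tags found in untrusted reviewed content."""
    return _SOUNDCHECK_TAG.sub(r"&lt;\1", text)
```

Matching on the tag prefix instead of a complete `<...>` pair also covers a tag that is left unclosed on purpose.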
Other
- Dependabot automerge workflow for patch-level dependency updates
- `retry-after` header parsed as float to handle fractional values
v1.5.0
What's new
4 new skills
- `insecure-local-storage` (A02:2025): plaintext credential storage in files, NSUserDefaults, SharedPreferences, localStorage
- `ipc-security` (A01:2025): unvalidated URL scheme handlers, exported Android intents, unauthenticated IPC sockets
- `multi-agent-trust` (LLM08:2025): agent-to-agent auth, permission scoping, message validation in multi-agent pipelines
- `token-smuggling` (LLM01:2025): Unicode RTL override, homoglyph, and zero-width character injection in LLM prompts
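To make the `token-smuggling` category concrete, here is a small sketch that flags bidirectional control characters and zero-width characters in prompt text. The character lists are illustrative and not exhaustive, the function name is hypothetical, and homoglyph detection is not covered.

```python
# Explicit bidirectional controls: embeddings, overrides, and isolates.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}
# Common zero-width characters (illustrative list, not exhaustive).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def find_hidden_characters(text):
    """Return (index, codepoint, kind) for each suspicious character."""
    hits = []
    for i, ch in enumerate(text):
        if ch in BIDI_CONTROLS:
            hits.append((i, f"U+{ord(ch):04X}", "bidi-control"))
        elif ch in ZERO_WIDTH:
            hits.append((i, f"U+{ord(ch):04X}", "zero-width"))
    return hits
```

A reviewer can reject or strip flagged input before it is interpolated into a prompt, since these characters change what a human sees without changing what the model receives.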
threat-model improvements
- Renamed from `threat-modeling` for consistency
- Added STRIDE Repudiation checks (audit logs, tamper-evident logs)
- Added STRIDE DoS checks (compute cost caps, timeouts, circuit breakers)
SecurityEval benchmark
- New `scripts/benchmark-securityeval.py`: tests skills against the SecurityEval dataset (121 Python samples, 69 CWEs); 102 samples mapped across 10 skills
Security review action
- New `scripts/security-review-action.py`: powers the soundcheck-action GitHub Action
Bug fixes
- Fix CI expression injection: bind `workflow_dispatch` inputs to env vars before shell use (CWE-78)
- Fix implicit `None` return in `api_call_with_retry` when all retries are exhausted
- Fix `Retry-After` float parsing (`int(float(...))`) in smoke test and benchmark scripts
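The two retry fixes above can be illustrated together. The name `api_call_with_retry` and the `int(float(...))` idiom come from these notes; the signature, the `RateLimited` exception, and the `parse_retry_after` helper are hypothetical stand-ins, not the project's actual code.

```python
import time


def parse_retry_after(value, default=1):
    """Turn a Retry-After header value into whole seconds.

    int("1.5") raises ValueError, so the value goes through float() first.
    """
    try:
        return max(int(float(value)), 0)
    except (TypeError, ValueError, OverflowError):
        return default  # header missing, an HTTP date, or not finite


class RateLimited(Exception):
    """Hypothetical error carrying the server's Retry-After header."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def api_call_with_retry(call, max_attempts=3, sleep=time.sleep):
    """Run call() until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimited as exc:
            if attempt < max_attempts:
                sleep(parse_retry_after(exc.retry_after))
    # Falling off the loop would return None implicitly; raise instead.
    raise RuntimeError(f"API call failed after {max_attempts} attempts")
```

Raising on exhaustion forces callers to handle the failure instead of later tripping over an unexpected `None`.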