Releases · thejefflarson/soundcheck

21 Apr 08:11

thejefflarson

v1.8.1

e63ae37

v1.8.1 Latest


Patch release: self-review hardening and quarterly threat-review automation.

Security cleanup (self-review)

Ran /security-review against Soundcheck itself and applied the 13 fixes it flagged (commit 22fcd05). Highlights:

Self-review poisoning — .github/workflows/self-review.yml now loads security-review/SKILL.md from a trusted second checkout of the base ref, so a PR that edits its own reviewer prompt can't flip the gate.
Empty-findings integrity gate — scripts/security-review-action.py exits non-zero when a diff touches 3+ files but returns zero findings (prevents a silent-pass failure mode) and emits a signed audit record with the skill's sha256.
Budget + timeout clamps — scripts/_claude_cli.py gained a 1 USD default budget, a 20 USD hard cap, a 30–1800s timeout range, and a SOUNDCHECK_DISABLE=1 kill switch.
Dependabot allowlist — auto-merge now requires semver-patch AND a dependency name on an explicit allowlist of first-party actions.
Release-cascade safety — scripts/release.py verifies the pushed commit exists on the remote before moving the v1 floating tag, and consults git ls-remote --tags when picking the next action version.
Markdown-cell escaping — fixed a backslash/backtick ordering bug in the findings table renderer.
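The budget and timeout clamps listed above might look roughly like the sketch below. It is an illustration only, not the code in scripts/_claude_cli.py: the function names are hypothetical, and only the numeric limits and the SOUNDCHECK_DISABLE variable come from the notes.

```python
import os

# Limits taken from the release note; how the real script reads them is assumed.
DEFAULT_BUDGET_USD = 1.0
MAX_BUDGET_USD = 20.0
MIN_TIMEOUT_S = 30
MAX_TIMEOUT_S = 1800


def clamp_budget(requested=None):
    """Return a budget in USD, defaulting to 1 and never exceeding 20."""
    if requested is None:
        return DEFAULT_BUDGET_USD
    return max(0.0, min(float(requested), MAX_BUDGET_USD))


def clamp_timeout(requested):
    """Force a timeout in seconds into the 30 to 1800 range."""
    return max(MIN_TIMEOUT_S, min(int(requested), MAX_TIMEOUT_S))


def is_disabled(env=None):
    """Kill switch: SOUNDCHECK_DISABLE=1 turns the CLI wrapper off."""
    env = os.environ if env is None else env
    return env.get("SOUNDCHECK_DISABLE") == "1"
```

Clamping rather than rejecting out-of-range values keeps a misconfigured caller running, but always inside the hard limits.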

One finding (F14, API key rotation/canary) was skipped as operational, not code-level.
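The empty-findings integrity gate from the highlights above could be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the audit record fields are hypothetical, and signing of the record is omitted because the notes do not say how it is done.

```python
import hashlib
import json

MIN_FILES_FOR_GATE = 3  # "3+ files" per the release note


def skill_digest(skill_text):
    """sha256 of the reviewer prompt, so the audit trail shows which skill ran."""
    return hashlib.sha256(skill_text.encode("utf-8")).hexdigest()


def integrity_gate(changed_files, findings, skill_text):
    """Return (exit_code, audit_record_json).

    A review that touches 3+ files and reports nothing is treated as
    suspicious rather than as a clean pass.
    """
    suspicious = len(changed_files) >= MIN_FILES_FOR_GATE and len(findings) == 0
    record = {
        "files_changed": len(changed_files),
        "findings": len(findings),
        "skill_sha256": skill_digest(skill_text),
        "gate": "fail" if suspicious else "pass",
    }
    # Hypothetical record layout; the real record is also signed.
    return (1 if suspicious else 0), json.dumps(record, sort_keys=True)
```

A caller would pass the returned code to sys.exit so the CI job fails when the gate trips.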

Quarterly threat review now does the checklist

The skill-smoke-tests.yml quarterly job used to open an empty checklist issue. It now drafts the issue body: scripts/quarterly-threat-review.py reads docs/threat-radar.md, pulls its last-modified date from git, and runs claude -p with WebSearch/WebFetch/Read to check OWASP LLM/API Top 10, NVD for AI/LLM CVEs, and watching-tier promotion candidates. Falls back to the prior checklist if the CLI call fails.
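The last-modified lookup and the fallback behaviour could be sketched like this. It is not the code in scripts/quarterly-threat-review.py: the function names are hypothetical, and the drafting step is passed in as a callable so the sketch does not guess at the claude CLI arguments.

```python
import subprocess


def last_modified(path, run=subprocess.run):
    """Ask git for the date of the last commit touching `path` (ISO 8601), or None."""
    result = run(
        ["git", "log", "-1", "--format=%cI", "--", path],
        capture_output=True, text=True, check=False,
    )
    date = (result.stdout or "").strip()
    return date if result.returncode == 0 and date else None


def draft_issue_body(draft_with_cli, fallback_checklist):
    """Use the CLI-drafted body when available, otherwise the static checklist."""
    try:
        body = draft_with_cli()
    except (subprocess.SubprocessError, OSError):
        # Covers a non-zero exit, a timeout, or a missing executable.
        return fallback_checklist
    return body if body and body.strip() else fallback_checklist
```

Treating an empty reply the same as a failed call means the quarterly issue never opens with a blank body.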

Action pin

thejefflarson/soundcheck-action v1.0.12 / v1 now pins soundcheck e63ae37.


20 Apr 04:42

thejefflarson

v1.8.0

89ccdb5

v1.8.0

Highlights

All 45 skills de-specialized — dropped language-specific secure-pattern code blocks in favor of prose principles with a compact pseudo-code anchor. Skills no longer bias reviews toward a specific framework or struct shape; the security property, not the implementation, is what the model applies.

New paired smoke test replaces the previous plugin-vs-bare A/B. Every fixture is now reviewed twice (once with the skill loaded, once by a neutral reviewer), scored per criterion against the skill's verification criteria, and compared with a Wilcoxon signed-rank test. See docs/smoke-test-methodology.md for the design rationale.
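For readers unfamiliar with the statistic, here is a minimal pure-Python sketch of a Wilcoxon signed-rank test on paired scores. It is not the harness code: it assumes each fixture yields one numeric score per arm (for example, the fraction of criteria passed), and it uses the normal approximation with a tie correction instead of an exact test.

```python
import math


def wilcoxon_signed_rank(with_skill, neutral):
    """Return (W_plus, W_minus, two-sided p) for paired scores.

    Zero differences are dropped. The p-value uses the normal
    approximation, so it is only a rough guide for small samples.
    """
    diffs = [a - b for a, b in zip(with_skill, neutral) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0

    # Rank |d| ascending, giving tied values the average of their ranks.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p
```

Because the test works on per-fixture differences, a fixture that is hard for both arms does not dilute the comparison the way it would in an unpaired pass-rate gap.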

Measured effect (paired smoke, after excluding judge-parse failures):

Model	Plugin full-pass	Bare full-pass	Gap	Wilcoxon p
Haiku	77% (98/126)	40% (51/126)	+37pts	< 1e-6
Sonnet	90% (117/130)	58% (75/130)	+32pts	< 1e-4

External validation — SecurityEval (104 Python samples, external CWE-labeled dataset): 98% full-pass, 99% detection, 99% fix on plugin arm.

Targeted skill fixes based on cross-model regression analysis:

excessive-agency — prescriptive guidance to redesign tool interfaces, not add denylists
ssrf — flags proxy/webhook/URL-preview as highest-risk shapes
prompt-injection — full call site showing output validation gate
logging-failures — CRLF stripping covers dedicated actor/subject parameters
model-theft — log criterion accepts prompt fingerprint as a privacy tradeoff
integrity-failures, broken-access-control, sensitive-disclosure — criterion wording tightened for conditional application

Harness fixes:

Judge-parse failures now surface distinctly instead of masquerading as 0-score rows
A/B bare mode used the --bare flag, which stripped session auth; replaced with an empty plugin-dir approach
Per-call latency tracking on smoke + SecurityEval JSONL rows
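The judge-parse fix can be illustrated with a small sketch. The reply format (a JSON object with a "score" field), the status strings, and the function names are all assumptions made for this example. The point is only that an unparseable reply gets its own status instead of a score of 0.

```python
import json


def parse_judge_reply(raw):
    """Turn a judge reply into a result row.

    A reply that cannot be parsed gets status "judge_parse_error" and no
    score, so it can be excluded from pass rates instead of counting as 0.
    """
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (ValueError, KeyError, TypeError):
        return {"status": "judge_parse_error", "score": None}
    return {"status": "ok", "score": score}


def full_pass_rate(rows):
    """Pass rate over rows the judge scored; parse failures are excluded."""
    scored = [r for r in rows if r["status"] == "ok"]
    if not scored:
        return None
    return sum(1 for r in scored if r["score"] >= 1.0) / len(scored)
```

With this shape, a run of three rows where one reply is garbage reports a rate over two rows, which matches the "after excluding judge-parse failures" wording used for the measured effect.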

Removed scripts/test-auto-invocation*.py (superseded by paired smoke test).


17 Apr 05:51

thejefflarson

v1.7.0

b77dec2

v1.7.0

Eight new skills beyond OWASP, distinctness audit, expanded A/B auto-invocation harness (41 skills / 82 prompts), smoke + A/B retry/robustness. See commits v1.6.0..v1.7.0.


24 Mar 10:51

thejefflarson

v1.6.0

0b1b0a2

v1.6.0

What's new

Real-world benchmark

New scripts/benchmark-realworld.py validates Soundcheck skills against files from intentionally vulnerable open-source applications at pinned commits:

OWASP Juice Shop (TypeScript/Node.js) — 10 files: injection, broken-access-control, cryptographic-failures, authentication-failures, integrity-failures
OWASP PyGoat (Python/Django) — 3 files: injection, authentication-failures, broken-access-control

Result: 13/13 passing, 100% detection rate, 100% fix rate on first full clean run.

SecurityEval benchmark improvements

8 more CWEs mapped: 102 → 110 of 121 samples now covered (91%)
temperature=0 on all judge calls for deterministic results

Security fixes (from self-review)

CI expression injection: workflow_dispatch inputs bound to env vars before shell use (CWE-78)
Prompt injection defense in security-review-action.py: neutralize <soundcheck-*> tags in reviewed file content
Atomic file writes in apply_rewrites with reviewed-file allowlist check
Implicit None return in api_call_with_retry replaced with explicit RuntimeError
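As an illustration of the prompt injection defense, the sketch below escapes any tag whose name starts with soundcheck- before reviewed file content is placed in a prompt. How security-review-action.py actually neutralizes these tags is not stated in the notes, so the regular expression, the function name, and the choice of &lt; as the replacement are assumptions.

```python
import re

# Matches the start of an opening or closing tag named "soundcheck-...".
_TAG = re.compile(r"<(/?)(soundcheck-[A-Za-z0-9_-]*)", re.IGNORECASE)


def neutralize_soundcheck_tags(text):
    """Escape the leading "<" of any <soundcheck-*> tag found in reviewed content."""
    return _TAG.sub(lambda m: "&lt;" + m.group(1) + m.group(2), text)
```

Escaping only the reserved tag family leaves ordinary markup in the reviewed file untouched, so the reviewer still sees the code as written.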

Other

Dependabot automerge workflow for patch-level dependency updates
Retry-After header parsed as float to handle fractional values


19 Mar 01:21

thejefflarson

v1.5.0

6a37673

v1.5.0

What's new

4 new skills

insecure-local-storage (A02:2025) — plaintext credential storage in files, NSUserDefaults, SharedPreferences, localStorage
ipc-security (A01:2025) — unvalidated URL scheme handlers, exported Android intents, unauthenticated IPC sockets
multi-agent-trust (LLM08:2025) — agent-to-agent auth, permission scoping, message validation in multi-agent pipelines
token-smuggling (LLM01:2025) — Unicode RTL override, homoglyph, and zero-width character injection in LLM prompts
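To make the token-smuggling category concrete, here is a minimal sketch that flags bidirectional control characters and zero-width characters in text bound for an LLM prompt. It is an illustration, not part of the skill: the character sets are a reasonable starting point chosen for this example, the function name is hypothetical, and homoglyph detection is not covered.

```python
import unicodedata

# Bidirectional control characters (includes U+202E RIGHT-TO-LEFT OVERRIDE).
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069", "\u200e", "\u200f",
}
# Zero-width characters commonly used to hide or split tokens.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def find_hidden_characters(text):
    """Return (index, code point, Unicode name) for each suspicious character."""
    hits = []
    for i, ch in enumerate(text):
        if ch in BIDI_CONTROLS or ch in ZERO_WIDTH:
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits
```

Reporting the position and name, instead of silently stripping, lets a reviewer decide whether the character is legitimate (for example in right-to-left text) or an injection attempt.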

threat-model improvements

Renamed from threat-modeling for consistency
Added STRIDE Repudiation checks (audit logs, tamper-evident logs)
Added STRIDE DoS checks (compute cost caps, timeouts, circuit breakers)

SecurityEval benchmark

New scripts/benchmark-securityeval.py — tests skills against the SecurityEval dataset (121 Python samples, 69 CWEs); 102 samples mapped across 10 skills

Security review action

New scripts/security-review-action.py — powers the soundcheck-action GitHub Action

Bug fixes

Fix CI expression injection: bind workflow_dispatch inputs to env vars before shell use (CWE-78)
Fix implicit None return in api_call_with_retry when all retries are exhausted
Fix Retry-After float parsing (int(float(...))) in smoke test and benchmark scripts
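The two retry fixes above can be sketched together. This is not the project's api_call_with_retry: the signature, the retry_after attribute on the exception, and the default values are hypothetical. It shows the int(float(...)) parse, which accepts a value such as "1.5" where int("1.5") would raise, and an explicit error once all attempts are exhausted instead of falling off the end of the function and returning None.

```python
import time


def parse_retry_after(value, default=1):
    """Parse a Retry-After value such as "2" or "1.5" into whole seconds."""
    try:
        return max(0, int(float(value)))
    except (TypeError, ValueError, OverflowError):
        # Missing header, HTTP-date form, "nan", or "inf": use the default.
        return default


def api_call_with_retry(call, max_attempts=3, sleep=time.sleep):
    """Call `call()` until it succeeds; raise once every attempt has failed."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt + 1 < max_attempts:
                # Hypothetical attribute carrying the server's Retry-After value.
                sleep(parse_retry_after(getattr(exc, "retry_after", None)))
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

Raising makes the failure visible at the call site; a silent None would otherwise propagate until something tried to index into it.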

