Releases: thejefflarson/soundcheck

v1.8.1

21 Apr 08:11

Patch release: self-review hardening and quarterly threat-review automation.

Security cleanup (self-review)

Ran /security-review against Soundcheck itself and applied the 13 fixes it flagged (commit 22fcd05). Highlights:

  • Self-review poisoning: .github/workflows/self-review.yml now loads security-review/SKILL.md from a trusted second checkout of the base ref, so a PR that edits its own reviewer prompt can't flip the gate.
  • Empty-findings integrity gate: scripts/security-review-action.py exits non-zero when a diff touches 3+ files but returns zero findings (prevents a silent-pass failure mode) and emits a signed audit record with the skill's sha256.
  • Budget + timeout clamps: scripts/_claude_cli.py gained a 1 USD default budget, a 20 USD hard cap, a 30–1800s timeout range, and a SOUNDCHECK_DISABLE=1 kill switch.
  • Dependabot allowlist: auto-merge now requires a semver-patch update AND a dependency name on an explicit allowlist of first-party actions.
  • Release-cascade safety: scripts/release.py verifies the pushed commit exists on the remote before moving the v1 floating tag, and consults git ls-remote --tags when picking the next action version.
  • Markdown-cell escaping: fixed a backslash/backtick ordering bug in the findings table renderer.
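The budget and timeout clamps can be pictured with a short sketch. The limits (1 USD default, 20 USD cap, 30–1800s, SOUNDCHECK_DISABLE=1) come from the note above; the function and constant names are hypothetical, and the real scripts/_claude_cli.py may be structured differently.

```python
import os

# Limits quoted in the release note; the names themselves are hypothetical.
DEFAULT_BUDGET_USD = 1.0
MAX_BUDGET_USD = 20.0
MIN_TIMEOUT_S = 30
MAX_TIMEOUT_S = 1800


def clamp_budget(requested=None):
    """Default to 1 USD and never exceed the 20 USD hard cap."""
    if requested is None:
        return DEFAULT_BUDGET_USD
    # Assumption: negative budgets are treated as zero.
    return max(0.0, min(float(requested), MAX_BUDGET_USD))


def clamp_timeout(requested):
    """Force the timeout into the 30-1800 second range."""
    return max(MIN_TIMEOUT_S, min(int(requested), MAX_TIMEOUT_S))


def is_disabled(env=None):
    """True when the SOUNDCHECK_DISABLE=1 kill switch is set."""
    env = os.environ if env is None else env
    return env.get("SOUNDCHECK_DISABLE") == "1"
```

Clamping rather than rejecting out-of-range values keeps a misconfigured caller running, just within safe bounds; whether the project clamps or errors is not stated in the note.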

One finding (F14, API key rotation/canary) was skipped as operational, not code-level.
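For the Markdown-cell escaping fix listed above, the usual ordering rule is that backslashes must be escaped before any character whose escape itself introduces a backslash. The sketch below illustrates that general rule; escape_table_cell is a hypothetical name, and the actual renderer and the exact bug are not shown in these notes.

```python
def escape_table_cell(text):
    """Escape text for use inside a Markdown table cell."""
    # Backslashes go first: the escapes added below introduce new
    # backslashes, which must not be escaped a second time.
    text = text.replace("\\", "\\\\")
    text = text.replace("`", "\\`")
    text = text.replace("|", "\\|")
    # A newline would end the table row, so flatten it to a space.
    return text.replace("\n", " ")
```

Running the backtick replacement first would turn a backtick into a backslash plus backtick, and the later backslash pass would then double that backslash and leave the backtick unescaped.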

Quarterly threat review now does the checklist

The skill-smoke-tests.yml quarterly job used to open an empty checklist issue. It now drafts the issue body: scripts/quarterly-threat-review.py reads docs/threat-radar.md, pulls the file's last-modified date from git, and runs claude -p with WebSearch/WebFetch/Read to check the OWASP LLM and API Top 10 lists, NVD for AI/LLM CVEs, and watching-tier promotion candidates. If the CLI call fails, the job falls back to the prior checklist.
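A minimal sketch of two pieces of that flow: reading a file's last-modified date from git, and falling back to a static checklist when drafting fails. The function names and the drafter callable are assumptions for illustration; this is not the code in scripts/quarterly-threat-review.py.

```python
import subprocess


def last_modified_iso(path, run=subprocess.run):
    """Return the ISO-8601 date of the last commit touching `path`, or None."""
    try:
        result = run(
            ["git", "log", "-1", "--format=%cI", "--", path],
            capture_output=True, text=True, check=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        # git missing, non-zero exit, or timeout: report "unknown".
        return None
    return result.stdout.strip() or None


def draft_issue_body(drafter, fallback_checklist):
    """Use the drafter's output; fall back to the checklist on any failure."""
    try:
        body = drafter()  # hypothetical callable wrapping the CLI call
    except Exception:
        return fallback_checklist
    return body or fallback_checklist
```

Passing `run` as a parameter keeps the git call easy to stub in tests without a repository on disk.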

Action pin

thejefflarson/soundcheck-action v1.0.12 / v1 now pins soundcheck e63ae37.

v1.8.0

20 Apr 04:42

Highlights

All 45 skills de-specialized — dropped language-specific secure-pattern code blocks
in favor of prose principles with a compact pseudo-code anchor. Skills no longer bias
reviews toward a specific framework or struct shape; the security property, not the
implementation, is what the model applies.

New paired smoke test replaces the previous plugin-vs-bare A/B. Every fixture is now
reviewed twice (with skill loaded vs neutral reviewer), scored against the skill's
verification criteria per-criterion, and compared with a Wilcoxon signed-rank test. See
docs/smoke-test-methodology.md for the design rationale.
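To make the paired comparison concrete, here is a stdlib-only sketch of a two-sided Wilcoxon signed-rank test using the normal approximation. It assumes each fixture yields one numeric score per arm (for example, the fraction of criteria met); the function name and input shape are hypothetical, and the actual harness may use a statistics library or an exact test instead.

```python
import math


def wilcoxon_signed_rank(with_skill, baseline):
    """Return (W+, two-sided p) for paired scores, normal approximation."""
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(with_skill, baseline) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        group = j - i + 1
        tie_term += group ** 3 - group
        i = j + 1

    # W+ is the sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, p_value
```

A paired test suits this design because both arms review the same fixture, so fixture difficulty cancels out of each difference. The normal approximation is reasonable at the sample sizes reported here but is too coarse for very small n.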

Measured effect (paired smoke, after excluding judge-parse failures):

Model    Plugin full-pass   Bare full-pass   Gap      Wilcoxon p
Haiku    77% (98/126)       40% (51/126)     +37pts   < 1e-6
Sonnet   90% (117/130)      58% (75/130)     +32pts   < 1e-4

External validation — SecurityEval (104 Python samples, external CWE-labeled dataset):
98% full-pass, 99% detection, 99% fix on plugin arm.

Targeted skill fixes based on cross-model regression analysis:

  • excessive-agency — prescriptive guidance to redesign tool interfaces, not add
    denylists
  • ssrf — flags proxy/webhook/URL-preview as highest-risk shapes
  • prompt-injection — full call site showing output validation gate
  • logging-failures — CRLF stripping covers dedicated actor/subject parameters
  • model-theft — log criterion accepts prompt fingerprint as a privacy tradeoff
  • integrity-failures, broken-access-control, sensitive-disclosure — criterion
    wording tightened for conditional application

Harness fixes:

  • Judge-parse failures now surface distinctly instead of masquerading as 0-score rows
  • A/B bare mode used the --bare flag, which stripped session auth; replaced with an
    empty plugin-dir approach
  • Per-call latency tracking on smoke + SecurityEval JSONL rows

Removed scripts/test-auto-invocation*.py (superseded by paired smoke test).

v1.7.0

17 Apr 05:51

Eight new skills beyond OWASP, distinctness audit, expanded A/B auto-invocation harness (41 skills / 82 prompts), smoke + A/B retry/robustness. See commits v1.6.0..v1.7.0.

v1.6.0

24 Mar 10:51

What's new

Real-world benchmark

New scripts/benchmark-realworld.py validates Soundcheck skills against files from intentionally vulnerable open-source applications at pinned commits:

  • OWASP Juice Shop (TypeScript/Node.js) — 10 files: injection, broken-access-control, cryptographic-failures, authentication-failures, integrity-failures
  • OWASP PyGoat (Python/Django) — 3 files: injection, authentication-failures, broken-access-control

Result: 13/13 passing, 100% detection rate, 100% fix rate on first full clean run.

SecurityEval benchmark improvements

  • 8 more CWEs mapped: 102 → 110 of 121 samples now covered (91%)
  • temperature=0 on all judge calls for deterministic results

Security fixes (from self-review)

  • CI expression injection: workflow_dispatch inputs bound to env vars before shell use (CWE-78)
  • Prompt injection defense in security-review-action.py: neutralize <soundcheck-*> tags in reviewed file content
  • Atomic file writes in apply_rewrites with reviewed-file allowlist check
  • Implicit None return in api_call_with_retry replaced with explicit RuntimeError
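Two of the fixes above lend themselves to a short sketch: neutralizing <soundcheck-*> tags in untrusted file content, and writing rewrites atomically only to files on a reviewed allowlist. Both functions are hypothetical illustrations; the escaping strategy and the signature of apply_rewrites are not described in these notes.

```python
import os
import re
import tempfile

# Matches the start of an opening or closing <soundcheck-*> tag.
_TAG = re.compile(r"<(/?)(soundcheck-[^>\s]*)", re.IGNORECASE)


def neutralize_tags(content):
    """Escape the leading bracket so reviewed text cannot forge a tag."""
    return _TAG.sub(r"&lt;\1\2", content)


def atomic_write(path, data, allowed_paths):
    """Write `data` to `path` atomically, but only if `path` was reviewed."""
    real = os.path.realpath(path)
    if real not in {os.path.realpath(p) for p in allowed_paths}:
        raise PermissionError(f"{path} is not on the reviewed-file allowlist")
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(real))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(data)
        os.replace(tmp, real)
    except BaseException:
        os.unlink(tmp)
        raise
```

Resolving paths with realpath before the allowlist check keeps symlinks and `..` segments from slipping past it. Note that the temp file is created with restrictive permissions, so a real implementation may want to copy the original file's mode before the rename.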

Other

  • Dependabot automerge workflow for patch-level dependency updates
  • retry-after header parsed as float to handle fractional values

v1.5.0

19 Mar 01:21

What's new

4 new skills

  • insecure-local-storage (A02:2025) — plaintext credential storage in files, NSUserDefaults, SharedPreferences, localStorage
  • ipc-security (A01:2025) — unvalidated URL scheme handlers, exported Android intents, unauthenticated IPC sockets
  • multi-agent-trust (LLM08:2025) — agent-to-agent auth, permission scoping, message validation in multi-agent pipelines
  • token-smuggling (LLM01:2025) — Unicode RTL override, homoglyph, and zero-width character injection in LLM prompts
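As an illustration of what the token-smuggling skill is concerned with, the sketch below flags bidirectional control characters and zero-width characters in a prompt string. The character sets and the function name are assumptions chosen for this example, and homoglyph detection is deliberately left out because it needs a confusables table.

```python
import unicodedata

# Bidirectional embedding, override, and isolate controls.
BIDI_CONTROLS = {chr(c) for c in range(0x202A, 0x202F)} | {
    chr(c) for c in range(0x2066, 0x206A)
}
# Zero-width space, non-joiner, joiner, word joiner, and BOM.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def find_hidden_chars(text):
    """Return (index, codepoint, name) for each suspicious character."""
    hits = []
    for index, ch in enumerate(text):
        if ch in BIDI_CONTROLS or ch in ZERO_WIDTH:
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits
```

Some of these characters are legitimate in places (the zero-width joiner appears in emoji sequences and several scripts), so a reviewer would typically flag them for inspection in prompt text rather than strip them blindly.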

threat-model improvements

  • Renamed from threat-modeling for consistency
  • Added STRIDE Repudiation checks (audit logs, tamper-evident logs)
  • Added STRIDE DoS checks (compute cost caps, timeouts, circuit breakers)

SecurityEval benchmark

  • New scripts/benchmark-securityeval.py — tests skills against the SecurityEval dataset (121 Python samples, 69 CWEs); 102 samples mapped across 10 skills

Security review action

  • New scripts/security-review-action.py — powers the soundcheck-action GitHub Action

Bug fixes

  • Fix CI expression injection: bind workflow_dispatch inputs to env vars before shell use (CWE-78)
  • Fix implicit None return in api_call_with_retry when all retries are exhausted
  • Fix Retry-After float parsing (int(float(...))) in smoke test and benchmark scripts
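The last two fixes can be sketched as follows: parse Retry-After through float before truncating to whole seconds, and raise once retries run out instead of falling off the end of the function. This is a simplified stand-in with hypothetical names and a hypothetical (ok, result, retry_after) call shape, not the project's api_call_with_retry.

```python
import time


def parse_retry_after(value, default=1):
    """Parse a Retry-After value such as '2' or '1.5' into whole seconds."""
    try:
        return max(0, int(float(value)))
    except (TypeError, ValueError, OverflowError):
        # Missing header, HTTP-date form, or nonsense: use the default.
        return default


def call_with_retry(call, retries=3, sleep=time.sleep):
    """Run `call()` until it succeeds; raise when retries are exhausted."""
    for _ in range(retries):
        ok, result, retry_after = call()
        if ok:
            return result
        sleep(parse_retry_after(retry_after))
    # Explicit failure: an implicit `return None` here would look like
    # an empty but successful response to the caller.
    raise RuntimeError(f"gave up after {retries} attempts")
```

Calling int() directly on a string like "1.5" raises ValueError, which is why the value goes through float first.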