Detect whitespace padding used to hide prompt-injection instructions (P9) #24
Open
korjavin wants to merge 12 commits into
Plan for issue NVIDIA#20 — detect large whitespace padding used to hide prompt-injection instructions from review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds rule P9 "Whitespace Padding" under Prompt Injection, for issue #20. It detects padding that pushes injected instructions out of a reviewer's view while the agent still reads them.
P6 through P8 were taken by System Prompt Leakage, so this uses P9. One id covers all three signals; confidence carries the weighting.
Signals (all reported as P9):
Whitespace is classified by Unicode category rather than ASCII space/tab: controls (\t \n \r \v \f), categories Zs/Zl/Zp (U+00A0, U+2028, U+2029, U+3000, and so on), and the zero-width family (U+200B/C/D, U+2060, U+FEFF). That zero-width set is now one shared constant (ZERO_WIDTH_CHARS) used by P2's regex and the mcp_tool_poisoning zero-width check, so the two cannot drift; the MCP check also picks up U+2060/U+FEFF.
Each finding points at the line where padding starts and includes a visible snippet of what was hidden (for example U+00A0 x82 or \n x80).
False-positive guards: markdown fenced code is skipped for the horizontal signal; vendored files are skipped (*.min.js, *.min.css, *.lock, package-lock.json, yarn.lock, *.svg, *.map); binary-ish content (containing U+FFFD) bails out; the ratio signal stays at LOW. Eval-dataset prose and files over 1 MB are already skipped upstream.
MCP manifest description fields are covered by the same detector (horizontal and block signals; the per-file ratio signal is skipped since fields are short).
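For reviewers who want a concrete picture of the classification and snippet format described above, here is a minimal sketch. It is not the code in this PR: `is_padding_char`, `describe_run`, `find_padding_runs`, and the `MIN_RUN` value of 80 are hypothetical, and `ZERO_WIDTH_CHARS` is rebuilt here from the code points listed in the description. It also simplifies by only flagging runs of a single repeated character.

```python
import unicodedata
from itertools import groupby

# Rebuilt from the code points named in the description; the real constant
# in the PR may be defined differently.
ZERO_WIDTH_CHARS = frozenset("\u200b\u200c\u200d\u2060\ufeff")
CONTROL_WHITESPACE = frozenset("\t\n\r\v\f")
SEPARATOR_CATEGORIES = frozenset({"Zs", "Zl", "Zp"})

# Hypothetical threshold, chosen only for this example.
MIN_RUN = 80

_ESCAPES = {"\t": "\\t", "\n": "\\n", "\r": "\\r", "\v": "\\v", "\f": "\\f"}


def is_padding_char(ch: str) -> bool:
    """True for control whitespace, Zs/Zl/Zp separators, and zero-width chars."""
    return (
        ch in CONTROL_WHITESPACE
        or ch in ZERO_WIDTH_CHARS
        or unicodedata.category(ch) in SEPARATOR_CATEGORIES
    )


def describe_run(ch: str, count: int) -> str:
    """Render an invisible run visibly, e.g. 'U+00A0 x82' or '\\n x80'."""
    label = _ESCAPES.get(ch, f"U+{ord(ch):04X}")
    return f"{label} x{count}"


def find_padding_runs(text: str, min_run: int = MIN_RUN) -> list[tuple[int, str]]:
    """Return (line_number, snippet) for each long run of one padding char.

    line_number is 1-based and refers to the line where the run starts.
    """
    findings = []
    offset = 0
    for ch, group in groupby(text):
        count = sum(1 for _ in group)
        if count >= min_run and is_padding_char(ch):
            line = text.count("\n", 0, offset) + 1
            findings.append((line, describe_run(ch, count)))
        offset += count
    return findings
```

Classifying by `unicodedata.category` rather than a fixed list means separators such as U+3000 are caught without enumerating each one. The zero-width family still needs an explicit set because those code points are in category Cf, not Z*.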
Tests cover all three signals at their thresholds, the full Unicode evasion set, the false-positive guards, and the MCP path. Thresholds are named constants, easy to tune against a real corpus. Happy to adjust the signals or thresholds before merge.
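The false-positive guards listed earlier can be sketched along these lines. This illustrates the idea under stated assumptions and is not the PR's implementation: `should_scan`, `blank_fenced_code`, and the constant names are hypothetical, while the file patterns and the U+FFFD check are taken from the description.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# File patterns copied from the description; the constant name is hypothetical.
VENDORED_PATTERNS = (
    "*.min.js", "*.min.css", "*.lock",
    "package-lock.json", "yarn.lock", "*.svg", "*.map",
)
REPLACEMENT_CHAR = "\ufffd"  # left behind when bytes fail to decode as text
FENCE = "`" * 3              # markdown code-fence marker


def should_scan(path: str, text: str) -> bool:
    """Skip vendored/generated files and content that looks binary."""
    name = PurePosixPath(path).name
    if any(fnmatchcase(name, pattern) for pattern in VENDORED_PATTERNS):
        return False
    if REPLACEMENT_CHAR in text:
        return False
    return True


def blank_fenced_code(text: str) -> str:
    """Blank out fenced code so a horizontal-padding check ignores it.

    Lines are emptied rather than removed so that line numbers in any
    finding still match the original file. The result is meant for the
    horizontal check only, since the emptied lines would look like a
    block of blank lines to other checks.
    """
    out = []
    in_fence = False
    for line in text.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            out.append("")
            continue
        out.append("" if in_fence else line)
    return "\n".join(out)
```

Matching on the file name alone keeps the guard independent of directory layout, and `fnmatchcase` avoids platform-dependent case folding.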