Skip to content

Bound build_context file reads and fail closed on oversized inputs#19

Open
wernerkasselman-au wants to merge 2 commits into
NVIDIA:mainfrom
wernerkasselman-au:fix/bound-build-context-reads
Open

Bound build_context file reads and fail closed on oversized inputs#19
wernerkasselman-au wants to merge 2 commits into
NVIDIA:mainfrom
wernerkasselman-au:fix/bound-build-context-reads

Conversation

@wernerkasselman-au

Copy link
Copy Markdown
Contributor

Hi Keshav,

This is the separate MR we agreed to on #6, holding the _MAX_READ_BYTES work that I pulled out so that one could stay scoped to the TP3 fix. It is stacked on #6 (the first commit shown here is #6's TP3 commit), so please merge #6 first, after which this diff reduces to the single bounded-reads commit.

What it does:

  • raises the per-file cap to 50 MiB and aligns it with static_runner.MAX_FILE_BYTES, so the limit lives as one number rather than two that can drift.
  • makes build_context fail closed: it now aborts in _validate_file_sizes before _read_file_cache ever runs, rather than silently caching an oversized file as empty. This is the heart of your original concern, a file large enough that we skip it should not come back reported as "safe".
  • counts lines in fixed 64 KiB binary chunks (so a multi-GB, newline-free file is never materialised whole).
  • reads only a bounded prefix of an oversized SKILL.md, since the frontmatter sits at the top of the file.
  • adds the same byte-counted guard to the direct analyzer entry points (static, AST, taint, YARA), so a caller that bypasses build_context does not reintroduce a silent skip.

Before raising this I had three other models review it independently against the actual code (not my description of it), and I want to be upfront about what they found, because it bears on exactly the discussion you wanted to have:

  1. the fail-closed guard does not yet cover the LLM semantic_* analyzers. llm_analyzer_base.get_batches has no size check, so a programmatic caller that injects an oversized file_cache still gets through on that path.
  2. the manifest prefix read is in text mode, so the cap is counting characters and not UTF-8 bytes (the comment claims bytes), which only matters for multibyte content reaching _parse_manifest directly.
  3. user-supplied YARA rule files (via --yara-rules-dir) are compiled with no size cap. They are not skill components, so they sidestep the gate.
  4. the larger one, which is really a surrounding issue rather than something this MR introduces: the whole ingest layer in input_handler.py runs before build_context ever sees a file, and it is unbounded. The URL download buffers the full body in memory, extractall has no uncompressed-size bound so a "zip bomb" fills the disk, and the git clone has no post-clone size check. So the 50 MiB per-file gate can be defeated upstream, before it runs.

My instinct, and I am happy to be argued out of it, is to keep this MR to the per-file build_context bounding (which is self-contained and tested at 605 passing), and to raise a separate issue for the ingest-layer hardening, since that is a bigger change with its own design questions (streaming downloads against a running byte counter, a total-uncompressed-size budget for zip extraction, a clone size cap). I can fold (1) to (3) into this MR if you would rather have them here, or split them out as well, whichever you prefer.

Please add comments inline or on the thread, whichever is easier for you.

…cans

The MCP tool-poisoning analyzer's parameter checks never ran on real
scans: _parse_manifest dropped the `parameters` frontmatter, so the
manifest reaching the analyzer never carried it. Parse and preserve
`parameters` (as dicts) so those checks fire. Verified end-to-end:
scanning tests/fixtures/mcp_poisoned_tool/ now surfaces TP3.

Minor fixes surfaced alongside:
- mcp_tool_poisoning TP4 use_llm gate defaults to True, matching every
  other LLM-using node (semantic_*, meta_analyzer).
- Remove the unused, incomplete LLM_BASED_NODE_IDS constant from state.py.
- README: drop the dead OSS_RELEASE.md link and the non-existent
  `skillspector patterns` command; fix the example version (v0.1.0 -> v2.0.0).

Add a regression test for parameter parsing that fails against the
pre-fix code.
build_context read whole files into memory with no size guard, so a
large or binary file in a scanned directory, zip, or cloned repo could
exhaust memory. Bound the read paths and fail the scan (rather than
silently skip) when a file exceeds the per-file analysis limit, so
oversized content cannot hide from analysis and be reported as safe:

- _MAX_READ_BYTES raised to 50 MiB and aligned with static_runner.
- build_context aborts via _validate_file_sizes before populating
  file_cache when any discovered file exceeds the limit.
- _count_lines counts newlines in fixed 64 KiB binary chunks, so a
  multi-GB or newline-free file is never materialized.
- _parse_manifest reads only a bounded prefix (frontmatter lives at the
  top of the file).
- Direct analyzer paths (static_runner, behavioral_ast,
  behavioral_taint_tracking, static_yara, llm_analyzer_base) raise as
  well, counting UTF-8 bytes (not Python chars), so bypassing
  build_context cannot reintroduce a silent skip.
- meta_analyzer: fix a partial-chunk edge where findings in a failed
  chunk could be dropped.

Adds regression tests for the fail-closed behavior and all bounded read
paths; each fails against the pre-fix code.
@wernerkasselman-au

Copy link
Copy Markdown
Contributor Author

I opened #21 for the ingest-layer hardening I mentioned above (the unbounded URL download, the extractall zip-bomb, the git clone with no size check), so this MR can stay scoped to the per-file build_context bounding. A multi-model review of the issue draft turned up two more paths worth folding into that work, the single-file shutil.copy2 in _wrap_single_file, and symlinks planted in a zip or clone (a symlink to /etc/passwd passes the size gate today and gets read), so #21 now covers those as well.

@wernerkasselman-au

Copy link
Copy Markdown
Contributor Author

While I was in this area I also opened #20, which is a different concern but a related one: detecting large blocks of whitespace used to hide prompt-injection instructions from a human reviewer (padding a file so the injected text sits well below, or off to the right of, what anyone sees on opening it). It is a detection pattern rather than ingest bounding, so it does not belong in this MR, but I am noting it here since it came out of the same read-through.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant