Bound build_context file reads and fail closed on oversized inputs by wernerkasselman-au · Pull Request #19 · NVIDIA/SkillSpector

wernerkasselman-au · 2026-06-10T02:52:32Z

Hi Keshav,

This is the separate MR we agreed to on #6, holding the _MAX_READ_BYTES work that I pulled out so that one could stay scoped to the TP3 fix. It is stacked on #6 (the first commit shown here is #6's TP3 commit), so please merge #6 first, after which this diff reduces to the single bounded-reads commit.

What it does:

raises the per-file cap to 50 MiB and aligns it with static_runner.MAX_FILE_BYTES, so the limit lives as one number rather than two that can drift.
makes build_context fail closed: it now aborts in _validate_file_sizes before _read_file_cache ever runs, rather than silently caching an oversized file as empty. This is the heart of your original concern, a file large enough that we skip it should not come back reported as "safe".
counts lines in fixed 64 KiB binary chunks (so a multi-GB, newline-free file is never materialised whole).
reads only a bounded prefix of an oversized SKILL.md, since the frontmatter sits at the top of the file.
adds the same byte-counted guard to the direct analyzer entry points (static, AST, taint, YARA), so a caller that bypasses build_context does not reintroduce a silent skip.

Before raising this I had three other models review it independently against the actual code (not my description of it), and I want to be upfront about what they found, because it bears on exactly the discussion you wanted to have:

the fail-closed guard does not yet cover the LLM semantic_* analyzers. llm_analyzer_base.get_batches has no size check, so a programmatic caller that injects an oversized file_cache still gets through on that path.
the manifest prefix read is in text mode, so the cap is counting characters and not UTF-8 bytes (the comment claims bytes), which only matters for multibyte content reaching _parse_manifest directly.
user-supplied YARA rule files (via --yara-rules-dir) are compiled with no size cap. They are not skill components, so they sidestep the gate.
the larger one, which is really a surrounding issue rather than something this MR introduces: the whole ingest layer in input_handler.py runs before build_context ever sees a file, and it is unbounded. The URL download buffers the full body in memory, extractall has no uncompressed-size bound so a "zip bomb" fills the disk, and the git clone has no post-clone size check. So the 50 MiB per-file gate can be defeated upstream, before it runs.

My instinct, and I am happy to be argued out of it, is to keep this MR to the per-file build_context bounding (which is self-contained and tested at 605 passing), and to raise a separate issue for the ingest-layer hardening, since that is a bigger change with its own design questions (streaming downloads against a running byte counter, a total-uncompressed-size budget for zip extraction, a clone size cap). I can fold (1) to (3) into this MR if you would rather have them here, or split them out as well, whichever you prefer.

Please add comments inline or on the thread, whichever is easier for you.

…cans The MCP tool-poisoning analyzer's parameter checks never ran on real scans: _parse_manifest dropped the `parameters` frontmatter, so the manifest reaching the analyzer never carried it. Parse and preserve `parameters` (as dicts) so those checks fire. Verified end-to-end: scanning tests/fixtures/mcp_poisoned_tool/ now surfaces TP3. Minor fixes surfaced alongside: - mcp_tool_poisoning TP4 use_llm gate defaults to True, matching every other LLM-using node (semantic_*, meta_analyzer). - Remove the unused, incomplete LLM_BASED_NODE_IDS constant from state.py. - README: drop the dead OSS_RELEASE.md link and the non-existent `skillspector patterns` command; fix the example version (v0.1.0 -> v2.0.0). Add a regression test for parameter parsing that fails against the pre-fix code.

build_context read whole files into memory with no size guard, so a large or binary file in a scanned directory, zip, or cloned repo could exhaust memory. Bound the read paths and fail the scan (rather than silently skip) when a file exceeds the per-file analysis limit, so oversized content cannot hide from analysis and be reported as safe: - _MAX_READ_BYTES raised to 50 MiB and aligned with static_runner. - build_context aborts via _validate_file_sizes before populating file_cache when any discovered file exceeds the limit. - _count_lines counts newlines in fixed 64 KiB binary chunks, so a multi-GB or newline-free file is never materialized. - _parse_manifest reads only a bounded prefix (frontmatter lives at the top of the file). - Direct analyzer paths (static_runner, behavioral_ast, behavioral_taint_tracking, static_yara, llm_analyzer_base) raise as well, counting UTF-8 bytes (not Python chars), so bypassing build_context cannot reintroduce a silent skip. - meta_analyzer: fix a partial-chunk edge where findings in a failed chunk could be dropped. Adds regression tests for the fail-closed behavior and all bounded read paths; each fails against the pre-fix code.

wernerkasselman-au · 2026-06-10T03:33:44Z

I opened #21 for the ingest-layer hardening I mentioned above (the unbounded URL download, the extractall zip-bomb, the git clone with no size check), so this MR can stay scoped to the per-file build_context bounding. A multi-model review of the issue draft turned up two more paths worth folding into that work, the single-file shutil.copy2 in _wrap_single_file, and symlinks planted in a zip or clone (a symlink to /etc/passwd passes the size gate today and gets read), so #21 now covers those as well.

wernerkasselman-au · 2026-06-10T03:35:44Z

While I was in this area I also opened #20, which is a different concern but a related one: detecting large blocks of whitespace used to hide prompt-injection instructions from a human reviewer (padding a file so the injected text sits well below, or off to the right of, what anyone sees on opening it). It is a detection pattern rather than ingest bounding, so it does not belong in this MR, but I am noting it here since it came out of the same read-through.

wernerkasselman-au added 2 commits June 10, 2026 12:44

wernerkasselman-au mentioned this pull request Jun 10, 2026

Bound the input_handler ingest layer (URL download, zip extraction, git clone) before the per-file gate #21

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bound build_context file reads and fail closed on oversized inputs#19

Bound build_context file reads and fail closed on oversized inputs#19
wernerkasselman-au wants to merge 2 commits into
NVIDIA:mainfrom
wernerkasselman-au:fix/bound-build-context-reads

wernerkasselman-au commented Jun 10, 2026

Uh oh!

wernerkasselman-au commented Jun 10, 2026

Uh oh!

wernerkasselman-au commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wernerkasselman-au commented Jun 10, 2026

Uh oh!

wernerkasselman-au commented Jun 10, 2026

Uh oh!

wernerkasselman-au commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant