Bound build_context file reads and fail closed on oversized inputs#19
Bound build_context file reads and fail closed on oversized inputs#19wernerkasselman-au wants to merge 2 commits into
Conversation
…cans The MCP tool-poisoning analyzer's parameter checks never ran on real scans: _parse_manifest dropped the `parameters` frontmatter, so the manifest reaching the analyzer never carried it. Parse and preserve `parameters` (as dicts) so those checks fire. Verified end-to-end: scanning tests/fixtures/mcp_poisoned_tool/ now surfaces TP3. Minor fixes surfaced alongside: - mcp_tool_poisoning TP4 use_llm gate defaults to True, matching every other LLM-using node (semantic_*, meta_analyzer). - Remove the unused, incomplete LLM_BASED_NODE_IDS constant from state.py. - README: drop the dead OSS_RELEASE.md link and the non-existent `skillspector patterns` command; fix the example version (v0.1.0 -> v2.0.0). Add a regression test for parameter parsing that fails against the pre-fix code.
build_context read whole files into memory with no size guard, so a large or binary file in a scanned directory, zip, or cloned repo could exhaust memory. Bound the read paths and fail the scan (rather than silently skip) when a file exceeds the per-file analysis limit, so oversized content cannot hide from analysis and be reported as safe: - _MAX_READ_BYTES raised to 50 MiB and aligned with static_runner. - build_context aborts via _validate_file_sizes before populating file_cache when any discovered file exceeds the limit. - _count_lines counts newlines in fixed 64 KiB binary chunks, so a multi-GB or newline-free file is never materialized. - _parse_manifest reads only a bounded prefix (frontmatter lives at the top of the file). - Direct analyzer paths (static_runner, behavioral_ast, behavioral_taint_tracking, static_yara, llm_analyzer_base) raise as well, counting UTF-8 bytes (not Python chars), so bypassing build_context cannot reintroduce a silent skip. - meta_analyzer: fix a partial-chunk edge where findings in a failed chunk could be dropped. Adds regression tests for the fail-closed behavior and all bounded read paths; each fails against the pre-fix code.
|
I opened #21 for the ingest-layer hardening I mentioned above (the unbounded URL download, the |
|
While I was in this area I also opened #20, which is a different concern but a related one: detecting large blocks of whitespace used to hide prompt-injection instructions from a human reviewer (padding a file so the injected text sits well below, or off to the right of, what anyone sees on opening it). It is a detection pattern rather than ingest bounding, so it does not belong in this MR, but I am noting it here since it came out of the same read-through. |
Hi Keshav,
This is the separate MR we agreed to on #6, holding the
_MAX_READ_BYTESwork that I pulled out so that one could stay scoped to the TP3 fix. It is stacked on #6 (the first commit shown here is #6's TP3 commit), so please merge #6 first, after which this diff reduces to the single bounded-reads commit.What it does:
static_runner.MAX_FILE_BYTES, so the limit lives as one number rather than two that can drift.build_contextfail closed: it now aborts in_validate_file_sizesbefore_read_file_cacheever runs, rather than silently caching an oversized file as empty. This is the heart of your original concern, a file large enough that we skip it should not come back reported as "safe".build_contextdoes not reintroduce a silent skip.Before raising this I had three other models review it independently against the actual code (not my description of it), and I want to be upfront about what they found, because it bears on exactly the discussion you wanted to have:
semantic_*analyzers.llm_analyzer_base.get_batcheshas no size check, so a programmatic caller that injects an oversizedfile_cachestill gets through on that path._parse_manifestdirectly.--yara-rules-dir) are compiled with no size cap. They are not skill components, so they sidestep the gate.input_handler.pyruns beforebuild_contextever sees a file, and it is unbounded. The URL download buffers the full body in memory,extractallhas no uncompressed-size bound so a "zip bomb" fills the disk, and the git clone has no post-clone size check. So the 50 MiB per-file gate can be defeated upstream, before it runs.My instinct, and I am happy to be argued out of it, is to keep this MR to the per-file
build_contextbounding (which is self-contained and tested at 605 passing), and to raise a separate issue for the ingest-layer hardening, since that is a bigger change with its own design questions (streaming downloads against a running byte counter, a total-uncompressed-size budget for zip extraction, a clone size cap). I can fold (1) to (3) into this MR if you would rather have them here, or split them out as well, whichever you prefer.Please add comments inline or on the thread, whichever is easier for you.