This is the follow-up I mentioned on #19. That MR makes build_context fail closed on oversized files, but the fail-closed gate (_validate_file_sizes) only runs once we are walking files inside the resolved skill directory. Everything that gets us to that directory, the ingest layer in input_handler.py, runs first and is unbounded, so the 50 MiB per-file limit can be defeated upstream before it ever runs.
Three concrete paths, all confirmed by an independent multi-model review of #19:
- URL download buffers the whole body in memory.
input_handler.py:156-170 (_download_file) does content = response.content, which materialises the entire response before anything checks its size. A multi-GB URL is a memory-exhaustion DoS, and it happens before resolve_input hands off to build_context.
- Zip extraction has no uncompressed-size bound.
input_handler.py:181-182 (_extract_zip) calls zf.extractall(extract_dir) with no per-entry or total-uncompressed budget. A "zip bomb" expands to disk during extraction, again before the gate. (Path traversal is not the concern here, zipfile sanitises member names, the resource exhaustion is.)
- Git clone has no post-clone size check.
input_handler.py:130-135 clones with --depth 1 only, so a large repository lands on disk with nothing capping it before build_context walks it.
- Single local file is copied unbounded. I missed this one in my first pass, a review of the draft caught it:
_wrap_single_file at input_handler.py:191-197 does shutil.copy2(file_path, dest) for a local file input (a bare SKILL.md, say), with no size check, so a multi-GB local file is duplicated to the temp directory before the gate ever runs. The gate then rejects it, but the disk is already spent.
Why it matters
SkillSpector's job is to decide whether a skill is safe to install, and a scanner that can be knocked over (or made to fill a disk) by the input it is asked to scan is itself a soft target. This is the same underlying concern @keshprad raised on #6: oversized input should fail the scan, not get a chance to exhaust the host first.
Proposed hardening
I would reuse the same 50 MiB cap that build_context._MAX_READ_BYTES and static_runner.MAX_FILE_BYTES already share, applied at ingest:
- Download: stream the body (
httpx stream() / iter_bytes()) into the temp file with a running byte counter, and abort the moment it crosses the cap. Honour Content-Length as an early reject when present, but do not trust it as the only check. For a zip delivered over a URL, stream straight to the temp file rather than buffering the body and then writing it, so we do not hold the whole archive in memory before extraction even begins.
- Zip: the bound has to be a running byte counter during decompression, not a declared-size pre-sum.
ZipInfo.file_size is attacker-controlled metadata (I confirmed a ~10 KiB archive can declare 10 MiB, and the reverse, a small declared size hiding a bomb, is the dangerous direction), so it is only good for an early reject. Extract entry by entry, reading each member through a bounded loop, and abort on the first of: per-entry cap, cumulative-uncompressed cap, entry-count cap, or a compression-ratio limit (reject members that expand past, say, 100x their compressed size). Reject symlink entries outright (see below).
- Clone: after
--depth 1, check the clone's total on-disk tree size against the cap, not just per-file, since many sub-cap files still add up, and fail closed over it.
- Single file: stat the source before
shutil.copy2 and reject over the cap (or stream-copy under a byte counter), so _wrap_single_file does not duplicate an oversized file.
- Symlinks (cross-cutting). A review of the draft showed this is a real hole, not a hypothetical:
build_context discovers files with item.is_file() and reads them with read_text(), both of which follow symlinks, so a symlink planted by a crafted zip or a cloned repo can point build_context at, for example, /etc/passwd, and it is read and scanned (the size gate passes, since the link target is small). We should reject symlinks (lstat() / is_symlink()) during zip extraction and during the build_context walk, certainly for cloned and extracted input.
Test gaps
We have no coverage for any of these. Existing fail-closed tests cover the local-directory path and the programmatic file_cache bypass for the static analyzers only (tests/nodes/test_build_context.py, tests/unit/test_cli.py). This issue should add tests for an oversized URL body, a zip bomb (both a high declared size and a low-declared-size-but-high-actual bomb), an oversized clone, an oversized single-file input, and a symlink planted in a zip or clone, each driven through resolve_input so they exercise the real ingest path.
Scope note
I am keeping this separate from #19 deliberately, since it is a larger change with its own design questions (streaming vs buffered download, the total-uncompressed budget for zips, how to bound a clone cheaply) and I did not want to widen #19 past the self-contained build_context bounding it already has.
This is the follow-up I mentioned on #19. That MR makes
build_contextfail closed on oversized files, but the fail-closed gate (_validate_file_sizes) only runs once we are walking files inside the resolved skill directory. Everything that gets us to that directory, the ingest layer ininput_handler.py, runs first and is unbounded, so the 50 MiB per-file limit can be defeated upstream before it ever runs.Three concrete paths, all confirmed by an independent multi-model review of #19:
input_handler.py:156-170(_download_file) doescontent = response.content, which materialises the entire response before anything checks its size. A multi-GB URL is a memory-exhaustion DoS, and it happens beforeresolve_inputhands off tobuild_context.input_handler.py:181-182(_extract_zip) callszf.extractall(extract_dir)with no per-entry or total-uncompressed budget. A "zip bomb" expands to disk during extraction, again before the gate. (Path traversal is not the concern here,zipfilesanitises member names, the resource exhaustion is.)input_handler.py:130-135clones with--depth 1only, so a large repository lands on disk with nothing capping it beforebuild_contextwalks it._wrap_single_fileatinput_handler.py:191-197doesshutil.copy2(file_path, dest)for a local file input (a bareSKILL.md, say), with no size check, so a multi-GB local file is duplicated to the temp directory before the gate ever runs. The gate then rejects it, but the disk is already spent.Why it matters
SkillSpector's job is to decide whether a skill is safe to install, and a scanner that can be knocked over (or made to fill a disk) by the input it is asked to scan is itself a soft target. This is the same underlying concern @keshprad raised on #6: oversized input should fail the scan, not get a chance to exhaust the host first.
Proposed hardening
I would reuse the same 50 MiB cap that
build_context._MAX_READ_BYTESandstatic_runner.MAX_FILE_BYTESalready share, applied at ingest:httpxstream()/iter_bytes()) into the temp file with a running byte counter, and abort the moment it crosses the cap. HonourContent-Lengthas an early reject when present, but do not trust it as the only check. For a zip delivered over a URL, stream straight to the temp file rather than buffering the body and then writing it, so we do not hold the whole archive in memory before extraction even begins.ZipInfo.file_sizeis attacker-controlled metadata (I confirmed a ~10 KiB archive can declare 10 MiB, and the reverse, a small declared size hiding a bomb, is the dangerous direction), so it is only good for an early reject. Extract entry by entry, reading each member through a bounded loop, and abort on the first of: per-entry cap, cumulative-uncompressed cap, entry-count cap, or a compression-ratio limit (reject members that expand past, say, 100x their compressed size). Reject symlink entries outright (see below).--depth 1, check the clone's total on-disk tree size against the cap, not just per-file, since many sub-cap files still add up, and fail closed over it.shutil.copy2and reject over the cap (or stream-copy under a byte counter), so_wrap_single_filedoes not duplicate an oversized file.build_contextdiscovers files withitem.is_file()and reads them withread_text(), both of which follow symlinks, so a symlink planted by a crafted zip or a cloned repo can pointbuild_contextat, for example,/etc/passwd, and it is read and scanned (the size gate passes, since the link target is small). We should reject symlinks (lstat()/is_symlink()) during zip extraction and during thebuild_contextwalk, certainly for cloned and extracted input.Test gaps
We have no coverage for any of these. Existing fail-closed tests cover the local-directory path and the programmatic
file_cachebypass for the static analyzers only (tests/nodes/test_build_context.py,tests/unit/test_cli.py). This issue should add tests for an oversized URL body, a zip bomb (both a high declared size and a low-declared-size-but-high-actual bomb), an oversized clone, an oversized single-file input, and a symlink planted in a zip or clone, each driven throughresolve_inputso they exercise the real ingest path.Scope note
I am keeping this separate from #19 deliberately, since it is a larger change with its own design questions (streaming vs buffered download, the total-uncompressed budget for zips, how to bound a clone cheaply) and I did not want to widen #19 past the self-contained
build_contextbounding it already has.