Bound the input_handler ingest layer (URL download, zip extraction, git clone) before the per-file gate

This is the follow-up I mentioned on #19. That MR makes `build_context` fail closed on oversized files, but the fail-closed gate (`_validate_file_sizes`) only runs once we are walking files inside the resolved skill directory. Everything that gets us to that directory, the ingest layer in `input_handler.py`, runs first and is unbounded, so the 50 MiB per-file limit can be defeated upstream before it ever runs.

Three concrete paths, all confirmed by an independent multi-model review of #19:

1. **URL download buffers the whole body in memory.** `input_handler.py:156-170` (`_download_file`) does `content = response.content`, which materialises the entire response before anything checks its size. A multi-GB URL is a memory-exhaustion DoS, and it happens before `resolve_input` hands off to `build_context`.
2. **Zip extraction has no uncompressed-size bound.** `input_handler.py:181-182` (`_extract_zip`) calls `zf.extractall(extract_dir)` with no per-entry or total-uncompressed budget. A "zip bomb" expands to disk during extraction, again before the gate. (Path traversal is not the concern here, `zipfile` sanitises member names, the resource exhaustion is.)
3. **Git clone has no post-clone size check.** `input_handler.py:130-135` clones with `--depth 1` only, so a large repository lands on disk with nothing capping it before `build_context` walks it.
4. **Single local file is copied unbounded.** I missed this one in my first pass, a review of the draft caught it: `_wrap_single_file` at `input_handler.py:191-197` does `shutil.copy2(file_path, dest)` for a local file input (a bare `SKILL.md`, say), with no size check, so a multi-GB local file is duplicated to the temp directory before the gate ever runs. The gate then rejects it, but the disk is already spent.

### Why it matters

SkillSpector's job is to decide whether a skill is safe to install, and a scanner that can be knocked over (or made to fill a disk) by the input it is asked to scan is itself a soft target. This is the same underlying concern @keshprad raised on #6: oversized input should fail the scan, not get a chance to exhaust the host first.

### Proposed hardening

I would reuse the same 50 MiB cap that `build_context._MAX_READ_BYTES` and `static_runner.MAX_FILE_BYTES` already share, applied at ingest:
- **Download:** stream the body (`httpx` `stream()` / `iter_bytes()`) into the temp file with a running byte counter, and abort the moment it crosses the cap. Honour `Content-Length` as an early reject when present, but do not trust it as the only check. For a zip delivered over a URL, stream straight to the temp file rather than buffering the body and then writing it, so we do not hold the whole archive in memory before extraction even begins.
- **Zip:** the bound has to be a running byte counter during decompression, not a declared-size pre-sum. `ZipInfo.file_size` is attacker-controlled metadata (I confirmed a ~10 KiB archive can declare 10 MiB, and the reverse, a small declared size hiding a bomb, is the dangerous direction), so it is only good for an early reject. Extract entry by entry, reading each member through a bounded loop, and abort on the first of: per-entry cap, cumulative-uncompressed cap, entry-count cap, or a compression-ratio limit (reject members that expand past, say, 100x their compressed size). Reject symlink entries outright (see below).
- **Clone:** after `--depth 1`, check the clone's total on-disk tree size against the cap, not just per-file, since many sub-cap files still add up, and fail closed over it.
- **Single file:** stat the source before `shutil.copy2` and reject over the cap (or stream-copy under a byte counter), so `_wrap_single_file` does not duplicate an oversized file.
- **Symlinks (cross-cutting).** A review of the draft showed this is a real hole, not a hypothetical: `build_context` discovers files with `item.is_file()` and reads them with `read_text()`, both of which follow symlinks, so a symlink planted by a crafted zip or a cloned repo can point `build_context` at, for example, `/etc/passwd`, and it is read and scanned (the size gate passes, since the link target is small). We should reject symlinks (`lstat()` / `is_symlink()`) during zip extraction and during the `build_context` walk, certainly for cloned and extracted input.

### Test gaps

We have no coverage for any of these. Existing fail-closed tests cover the local-directory path and the programmatic `file_cache` bypass for the static analyzers only (`tests/nodes/test_build_context.py`, `tests/unit/test_cli.py`). This issue should add tests for an oversized URL body, a zip bomb (both a high declared size and a low-declared-size-but-high-actual bomb), an oversized clone, an oversized single-file input, and a symlink planted in a zip or clone, each driven through `resolve_input` so they exercise the real ingest path.

### Scope note

I am keeping this separate from #19 deliberately, since it is a larger change with its own design questions (streaming vs buffered download, the total-uncompressed budget for zips, how to bound a clone cheaply) and I did not want to widen #19 past the self-contained `build_context` bounding it already has.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bound the input_handler ingest layer (URL download, zip extraction, git clone) before the per-file gate #21

Why it matters

Proposed hardening

Test gaps

Scope note

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Bound the input_handler ingest layer (URL download, zip extraction, git clone) before the per-file gate #21

Description

Why it matters

Proposed hardening

Test gaps

Scope note

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions