Optimize stat #1

Open
ervinoro wants to merge 5 commits into master from optimize-stat

@ervinoro

Owner

No description provided.

ervinoro and others added 5 commits April 18, 2026 21:39
Adds pre-refactor coverage where silent regressions could land
undetected in the upcoming scandir/Entry refactor:

- test/test_check.py: 4 tests for the check command (Gap A)
  (previously 0 of 24 statements in check were covered).
- test/test_walk_trees.py: 3 collision tests (Gap B) +
  1 empty-dir test (Gap C).

Test test_compare_dirs_raises_when_index_is_index_but_fs_is_file is
marked @expectedfailure; C3 adds the symmetric isinstance check in
_compare_dirs' matched-child branch, which un-marks it.

Test test_check_reports_file_missing_from_disk asserts the pre-C4
"Unable to open" output from slurp(); C4's rewrite flips it to an
explicit "File missing from disk" line.

Latent bug surfaced by test 1: check's "missing from index" loop
flagged index.txt itself on every run, so every healthy archive
reported FAIL. Adds a one-line filter to skip Index.FILENAME. The
filter is preserved when C4 rewrites check.

64 tests pass (was 56); isort/flake8/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI matrix goes to 3.14 only across ubuntu/macos/windows. actions/checkout
and setup-python bump from v2 to v4/v5 respectively — v2's setup-python
predates 3.14 and would fail to install it.

README's "Depends on Python 3.8+" was already a lie (the code uses 3.10+
syntax); updated to 3.14+. The stale ".gitignore" section is removed —
commit 3dcd517 ripped that functionality out and the docs never caught up.

requirements.txt carries no Python pin; untouched. Independent of the
upcoming scandir refactor — the Entry architecture in C2-C4 doesn't lean
on any 3.14-only API. Landing this first lets later commits assume 3.14
without ambiguity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ryPath

Adds Entry, list_dir, walk in utils.py built on os.scandir. Entry carries
path, is_dir, is_file, and size — everything the kernel told us at listing
time. On Windows, DirEntry.is_dir/is_file/stat read from the FindNextFile
cache, so list_dir issues zero per-entry syscalls. On Linux/macOS one stat
per file is unavoidable for size (d_type doesn't include it) but now paid
once per run, not three times.
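
A minimal sketch of the shape described above, for illustration only. The field names come from the commit message; the dataclass layout, the size of 0 for directories, and the function bodies are assumptions, not the code in utils.py.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass(frozen=True)
class Entry:
    """What the directory listing told us about one child (assumed layout)."""
    path: Path
    is_dir: bool
    is_file: bool
    size: int  # 0 for directories in this sketch


def list_dir(directory: Path) -> list[Entry]:
    """List one directory with a single os.scandir call."""
    entries = []
    with os.scandir(directory) as it:
        for de in it:
            if de.is_dir():
                entries.append(Entry(Path(de.path), True, False, 0))
            elif de.is_file():
                # On Windows this stat is normally served from the DirEntry
                # cache; on POSIX it costs one stat call per file.
                size = de.stat().st_size
                entries.append(Entry(Path(de.path), False, True, size))
            # Anything else (broken symlink, socket, FIFO) is skipped.
    return entries


def walk(root: Path) -> Iterator[Entry]:
    """Yield every regular file under root, recursing into directories."""
    for entry in list_dir(root):
        if entry.is_dir:
            yield from walk(entry.path)
        else:
            yield entry
```

With this shape, a total such as sum(e.size for e in walk(root)) needs no further stat calls, because the size was recorded at listing time.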

hash_tree and cp rewritten to consume the new walk (no more rglob, no more
per-file stat in the copy loop). sync's four pre-flight passes collapse to
one per side; the "Calculating data size" pbar becomes count-only since we
no longer pre-count files.

hot_dir / cold_dir are made absolute at CLI entry via .absolute() so every
Entry.path carried through the system is absolute. .absolute() is chosen
over .resolve() to preserve current user-visible behavior (the latter
would follow symlinks on the root path).
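
The difference between the two calls can be seen with a symlinked root. This is a standalone illustration; describe_root is a hypothetical helper, not part of the project.

```python
import os
import tempfile
from pathlib import Path


def describe_root(path: Path) -> tuple[Path, Path]:
    """Return (absolute, resolved) forms of the same path for comparison."""
    return path.absolute(), path.resolve()


with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "real"
    real.mkdir()
    link = Path(tmp) / "link"
    try:
        os.symlink(real, link, target_is_directory=True)
    except (OSError, NotImplementedError):
        # Symlink creation may be unavailable (e.g. unprivileged Windows).
        print("symlinks unavailable; nothing to compare")
    else:
        absolute, resolved = describe_root(link)
        print(absolute.name)  # "link": the symlink's own name is kept
        print(resolved.name)  # "real": the symlink was followed
```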

Incidental changes required to keep C2 bisect-safe (existing tests green):
- _compare_dirs' Appeared branch: sum(file.stat().st_size for file in walk)
  → sum(e.size for e in walk).
- check's "missing from index" loop: file.relative_to → entry.path.relative_to.

is_relevant keeps its Path signature at the C2 checkpoint because
_compare_dirs still uses iterdir(); C3 flips both.

Behavior note: broken symlinks and exotic entries (sockets, FIFOs) no
longer raise ValueError from walk — they are silently skipped, matching
os.walk's default.

DirEntryPath.py deleted. Its Path-subclass overrides were invalidated by
3.12's with_segments routing (self.entry unset on derived paths), and
3.14's Path.info doesn't carry size so can't replace it.

71 tests pass (was 64), 1 skipped (broken-symlink test on Windows without
SeCreateSymbolicLinkPrivilege). isort/flake8/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_files

Rewrites _compare_dirs to key children by name with Entry values — no more
Set[PurePath] strip-and-rebuild, no more is_file()/is_dir() re-stat in
walk_trees after listing. Matched pairs carry types and sizes straight
from the scandir listing into the match branch.

The file-vs-dir dispatch that walk_trees does today migrates into
_compare_dirs' H=1,C=1 branch, where the Entries are already in hand.
walk_trees becomes a thin wrapper preserving its public signature (so
test_walk_trees is unchanged); it checks the root sub_index type once and
delegates.

Collision raises now fire symmetrically at each matched child:
- FS file vs FS dir (on either side)           → NotImplementedError
- FS file/file but index has Index sub-tree    → NotImplementedError (new)
- FS dir/dir but index has hash string         → NotImplementedError

The second case was silently returning ModifiedCopied on every run,
rewriting the index with a file hash and discarding the sub-tree record.
test_compare_dirs_raises_when_index_is_index_but_fs_is_file was marked
@expectedfailure in C0; un-marked here.
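
The three collision rules can be sketched as a check applied to each matched child. Everything here is hypothetical: the Entry stand-in, the helper names, and the representation of index records (a str for a file hash, a dict for a sub-tree) are assumptions made for illustration, not the project's types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Stand-in for the listing record; only the type flag matters here."""
    name: str
    is_dir: bool


def check_matched_child(hot: Entry, cold: Entry, indexed: object) -> str:
    """Validate one child present on both sides (the H=1,C=1 case).

    `indexed` is the index record for this name: a str stands in for a
    file hash and a dict for a sub-tree (both are assumptions).
    """
    if hot.is_dir != cold.is_dir:
        raise NotImplementedError(f"file vs directory collision: {hot.name}")
    if not hot.is_dir and isinstance(indexed, dict):
        raise NotImplementedError(f"index has a sub-tree for file: {hot.name}")
    if hot.is_dir and isinstance(indexed, str):
        raise NotImplementedError(f"index has a hash for directory: {hot.name}")
    return "dir" if hot.is_dir else "file"


def matched_names(hot: dict[str, Entry], cold: dict[str, Entry]) -> list[str]:
    """Names present on both sides, found by intersecting dict keys."""
    return sorted(hot.keys() & cold.keys())
```

Keying both listings by name makes the matched set a plain key intersection, and the types needed for the checks are already on the Entry values.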

_compare_files now takes two Entry values and reads hot.size directly —
four avoidable stat()/getsize() call sites gone. hot_dir and cold_dir
parameters drop out of its signature.

is_relevant signature flipped from Path to Entry (one less is_file() per
child — the scandir listing already told us). Empty-dir detection still
costs one scandir, scoped to the edge case.

Result: one os.scandir per visited directory per side. Zero extra
type-check syscalls on the matched-child path. Zero stats in
_compare_files. On Windows, zero per-file stat end-to-end on the matched
path.

71 tests pass, 1 skipped (symlink-on-Windows). isort/flake8/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrites check to walk cold_dir exactly once, materializing {rel: entry}.
Progress-bar total derives from that walk instead of a per-index stat
loop with a double-stat-via-exists() guard. The index-completeness check
subtracts key sets of that same dict, so there is no second walk.

Behavior change: files indexed-but-missing-from-disk now produce an
explicit "File missing from disk: '<path>'" failure line rather than
surfacing only as "Unable to open '<path>'" from slurp(). The latter was
accidental — slurp() swallows IOError and emits the message as a
side-effect of hash attempts against missing files. Now:

- missing from disk → "File missing from disk: '<p>'"   FAIL
- hash mismatch     → "Verification failed: '<p>'"       FAIL
- missing from idx  → "File missing from index: '<p>'"   FAIL
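
A rough sketch of how the three outcomes above could fall out of two dicts and set operations on their keys. check_report and its parameters are hypothetical; in particular, on_disk maps paths to precomputed hashes to keep the example self-contained, whereas the real check holds entries and hashes files as it goes.

```python
from pathlib import PurePosixPath


def check_report(indexed: dict[PurePosixPath, str],
                 on_disk: dict[PurePosixPath, str],
                 index_filename: str = "index.txt") -> list[str]:
    """Return failure lines; an empty list means the archive is healthy.

    `indexed` maps relative path -> expected hash; `on_disk` maps relative
    path -> actual hash, standing in for the {rel: entry} dict from one walk.
    """
    failures = []
    for rel in sorted(indexed.keys() - on_disk.keys()):
        failures.append(f"File missing from disk: '{rel}'")
    for rel in sorted(indexed.keys() & on_disk.keys()):
        if indexed[rel] != on_disk[rel]:
            failures.append(f"Verification failed: '{rel}'")
    for rel in sorted(on_disk.keys() - indexed.keys()):
        if rel.name == index_filename:
            continue  # the index file itself is never listed in the index
        failures.append(f"File missing from index: '{rel}'")
    return failures
```

The index_filename filter mirrors the Index.FILENAME skip added in the first commit; matching on the file name alone is a simplification made for this sketch.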

test_check_reports_file_missing_from_disk (added in C0 asserting the
"Unable to open" message) flipped to assert the new message, per plan.

Syscalls on Windows: one scandir per directory, period. The double-stat
via exists() in the old pbar-total computation, and the separate walk for
extras detection, are both gone.

71 tests pass, 1 skipped. isort/flake8/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>