Memery 1.0: restore working state, fix MPS indexer, add tests by deepfates · Pull Request #41 · deepfates/memery

deepfates · 2024-10-15T23:13:41Z

Summary

The rewrite branch was unbuildable on a fresh install (dead streamlit imports, deps from 2022 that no longer co-resolve) and even after fixing that, had a perf disaster on Apple Silicon at scale. This PR makes it run, makes it fast, and adds a test suite so future regressions don't slip in silently.

Four commits in this order — each is independently reviewable:

fix: restore working state — modern deps, swap dead streamlit.cli / streamlit.report_thread imports, expose -nt on the CLI, gitignore local-only paths
perf: eliminate the MPS indexer hang — root-cause the 30-min "hang" at scale, plus related encoder/crafter/loader cleanups
feat: default --workers 2 — empirically determined sweet spot, with the data in the commit body
test: add suite + run in CI — 16 tests (12 fast, 4 integration), CI now actually invokes pytest

Headline numbers

End-to-end on a 700-image realistic-distribution corpus (sampled from a real Downloads folder):

	before	after
`memery build` wall clock	59.7 s	6.7 s (9×)
Indexer phase on 89k images	30+ min and silent	~2 s with progress bar

The indexer story (commit 2)

Profiling the live build at 89k images showed 98.5% of wall clock in one stack: annoy.add_item → PyFloat_AsDouble → Tensor.item() → _local_scalar_dense_mps → Metal waitUntilCompleted. Annoy reads each 512-D vector element-by-element, and on MPS each element access forces a full GPU sync. ~46 million serialized GPU syncs.

Fix at the source: encoder hands CPU tensors to everything downstream; indexer is defensive about whatever it receives and now shows a tqdm bar so the previously-silent "Building treemap" phase is visible.

The workers story (commit 3)

Three trials each on a 4000-image corpus:

```
workers=0: 34.89s, 33.10s, 32.54s → mean 33.5s
workers=2: 28.54s, 28.61s, 28.25s → mean 28.5s (-15%)
```

Distributions don't overlap. Single-trial sweep showed monotonic decline beyond 2 (workers=8 was 63% slower; workers=12 was 133% slower) — a single CPU thread can't quite keep MPS fed, but ~2 can, and beyond that the GPU becomes the serial bottleneck while IPC overhead piles up. `--workers 0` remains an escape hatch and is documented in `--help`.

Test plan

`pytest` (12 fast tests) — passes locally in <1s
`pytest -m integration` (4 end-to-end tests with real CLIP) — passes locally in ~7s
`memery --help`, `memery build`, `memery recall -t … -nt …`, `memery purge` — all smoke-tested on a real folder
`memery serve` — boots streamlit and reaches the welcome screen
CI run on this PR (will fire on push)
Untested: CUDA, Linux, Windows. The bottleneck shape should generalize but flagging here in case anyone hits trouble; the documented fix is `--workers 0`.

Caveats

Skipped fixing `hash_path` (currently `stem + mtime`, can collide for two `IMG_1234.jpg` files in different subfolders saved at the same second). It's a real bug, but fixing it invalidates every existing cache, so I left it for a future change.
I left workers default at the value most likely to win on the typical use case (search a folder of thousands of images on Apple Silicon). If anyone reports regressions on a different platform, falling back to 0 is a one-line change.

The branch was unbuildable on a fresh install: streamlit was pinned to 1.3.1 (incompatible with everything else), `streamlit.cli` was removed upstream years ago, and `streamlit.report_thread` was renamed. - pyproject: relax pins (streamlit ^1.30, typer >=0.12, altair ^5, Pillow ^10), constrain python to <3.13 since torch 2.2 caps there, drop the no-longer-needed protobuf pin - memery/cli.py: import `streamlit.web.cli` (lazy, so --help stays fast) and expose -nt/--negative-text on `recall` so the CLI matches what the GUI already supported - memery/streamlit_app.py: drop the dead `report_thread` import and the 50-line stdout-redirect contraption it powered; replace with st.spinner / logbox.warn-success — same UX, much less surface area - .gitignore: ignore local-only .claude/, .venv/, .pytest_cache/

The dominant bottleneck on Apple Silicon at scale wasn't encoding — it was annoy's `add_item` reading each 512-D embedding element-by- element from a tensor still on the GPU, forcing a Metal `waitUntilCompleted` per float. On an 89k-image library that's ~46 million serialized GPU syncs; profiling showed 98.5% of wall clock in this single path. The indexer ran longer than the entire encoding pass and looked silently hung from the user's side. Fix at the source: - encoder.py: cat batch features into a list and concat once at the end (the previous in-loop torch.cat was O(n^2) memory churn; on MPS each cat also paid a kernel-launch cost — measured 250x microbench speedup at 702 batches, ~15s on a real run). Move the final tensor to CPU before returning so nothing downstream pays the per-element sync tax. Also call model.eval() — defensive correctness, free. - indexer.py: defensive `_to_cpu_array` in case anything still passes a GPU tensor; wrap the add_item loop in tqdm so the previously- silent "Building treemap" phase is now visible. While in the area, two related perf/robustness wins: - crafter.py: pil_loader uses Image.draft('RGB', (256,256)) for fast partial-scale JPEG decode (measured 1.63x preprocessing speedup on a realistic Downloads sample); add safe_collate so a single bad image no longer crashes a whole DataLoader batch; drop the dead-code `slugs` parallel list make_dataset built and nobody read. - loader.py: archive_loader was re-running verify_image on files already filtered upstream, and used O(n) list lookups for hash membership; both fixed. End-to-end on a 700-image realistic-distribution corpus: 59.7s -> 6.7s (9x). Indexer phase alone on 89k images: ~30+ minutes -> ~2 seconds.

Empirically determined on macOS/MPS with a 4000-image corpus sampled from a realistic Downloads folder. Three trials each: workers=0: 34.89s, 33.10s, 32.54s (mean 33.5s) workers=2: 28.54s, 28.61s, 28.25s (mean 28.5s) -15% Distributions don't overlap; the win is real, not noise. Single-trial sweep showed monotonic decline beyond 2 (workers=4 ties workers=0; workers=8 is 63% slower; workers=12 is 133% slower). Read: a single CPU thread can't quite keep the GPU fed, but ~2 can, and beyond that the GPU is a serial bottleneck while IPC overhead piles up. Trade-off: ~3% slower on tiny corpora (<1k images) where spawn cost isn't amortized. Bounded and not user-visible. Documented in --help along with the escape hatch (--workers 0) for environments where multiprocessing misbehaves. Untested on CUDA / Linux / Windows; the bottleneck shape should generalize but if anyone reports trouble, --workers 0 is the fix.

…in CI There were zero tests on this branch. The CI workflow even installed pytest but never invoked it — only flake8 ran. So a previous breakage (e.g. the `streamlit.cli` import that died years ago, or the `streamlit.report_thread` rename) would land silently. - tests/test_loader.py: hash_path format, extension filter, corrupt- file skipping, missing-archive fallbacks - tests/test_crafter.py: pil_loader RGB conversion + None-on-corrupt, safe_collate drops None items / returns None on all-bad batches - tests/test_cli.py: typer CliRunner smoke + verifies -nt is exposed on `recall` + `purge` is idempotent on empty dirs - tests/test_core.py: end-to-end index_flow / query_flow / clean against a 3-image fixture. Marked `integration` so it's opt-in (downloads ~338MB CLIP weights on first run); CI runs only the fast suite by default. - tests/conftest.py: copies the bundled images/ fixtures into tmp dirs per-test so index files never end up in the working tree - pyproject.toml [tool.pytest.ini_options]: addopts = "-m 'not integration'" so default `pytest` is fast - .github/workflows/python-app.yml: bump python to 3.12 to match the new pyproject constraint, also trigger on the rewrite branch, and actually run pytest after lint

Poetry can't resolve streamlit ^1.30 against python >=3.9 because the intermediate streamlit versions had a !=3.9.7 marker that poetry won't unify with our open lower bound. Modern streamlit requires 3.10+ anyway. Pinning to >=3.10,<3.13 lets the resolver pick streamlit 1.50+ cleanly.

Copilot

Pull request overview

This PR restores the rewrite branch to a buildable/runnable state, fixes a major Apple Silicon (MPS) indexing performance issue by ensuring downstream consumers receive CPU-resident embeddings, and adds a pytest suite + CI coverage to prevent regressions.

Changes:

Modernizes dependencies and Streamlit integration, and updates the CLI surface (including negative text flag and default worker behavior).
Fixes the MPS “indexer hang” by avoiding element-wise GPU syncs when feeding embeddings into Annoy; adds safer batching/decoding handling.
Adds unit + integration tests, and updates GitHub Actions to run lint + pytest.

Reviewed changes

Copilot reviewed 27 out of 38 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`pyproject.toml`	Updates dependency constraints and configures pytest defaults/markers.
`.github/workflows/python-app.yml`	Updates CI to use newer actions/Python and to run lint + pytest.
`.gitignore`	Ignores local-only dev artifacts (venv, pytest cache, etc.).
`memery/cli.py`	Updates Typer CLI (negative text option, `serve` wiring, default workers).
`memery/streamlit_app.py`	Streamlit app cleanup (arg parsing, spinners/logging, UI tweaks).
`memery/loader.py`	Loader/archive selection performance tweak (set-based hash lookups).
`memery/crafter.py`	Faster PIL loading path, `safe_collate`, and DataLoader worker defaults.
`memery/encoder.py`	Improves encoder performance (avoid O(n²) cat) and returns CPU embeddings.
`memery/indexer.py`	Adds defensive CPU conversion for embeddings before Annoy indexing + tqdm progress.
`tests/conftest.py`	Adds fixtures (bundled images, tmp copies) to keep tests isolated from repo artifacts.
`tests/test_loader.py`	Unit tests for loader behaviors (hashing, extension filtering, corrupt skipping).
`tests/test_crafter.py`	Tests PIL loader behavior and `safe_collate` semantics.
`tests/test_core.py`	Integration tests for index/query/clean flows (marked `integration`).
`tests/test_cli.py`	CLI smoke tests for command wiring and help output.
`tests/__init__.py`	Initializes test package.
`memery/gui.py`	Removes deprecated/unused GUI code (commented-out widget implementation).
`notebooks/memery.ipynb`	Removes legacy notebook artifact.
`notebooks/09_streamlit_app.ipynb`	Removes legacy notebook artifact.
`notebooks/08_jupyter_gui.ipynb`	Removes legacy notebook artifact.
`notebooks/07_cli.ipynb`	Removes legacy notebook artifact.
`notebooks/05_ranker.ipynb`	Removes legacy notebook artifact.
`notebooks/04_indexer.ipynb`	Removes legacy notebook artifact.
`notebooks/03_encoder.ipynb`	Removes legacy notebook artifact.
`notebooks/02_crafter.ipynb`	Removes legacy notebook artifact.
`notebooks/01_loader.ipynb`	Removes legacy notebook artifact.
`notebooks/00_core.ipynb`	Removes legacy notebook artifact.
`notebooks/_working_pipeline.ipynb`	Removes legacy notebook artifact.
`notebooks/_visualize.ipynb`	Removes legacy notebook artifact.

Comments suppressed due to low confidence (1)

.github/workflows/python-app.yml:30

CI installs flake8 via pip, but the lint step runs poetry run flake8. Unless Poetry is configured to use the system environment, poetry run executes inside Poetry’s virtualenv where flake8 is not installed (it’s not in pyproject.toml dev-dependencies), so this step is likely to fail. Either add flake8 to [tool.poetry.dev-dependencies] or run flake8 without poetry run.

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest poetry
        poetry install

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

-    new_files = [(str(path), hash) for path, hash in filepaths if hash not in archive_hashes and verify_image(path)]
-
-    return(archive_db, new_files)
+def archive_loader(filepaths: list[str], db: Any) -> tuple[set[str], list[str]]:


+            batch_features.append(model.encode_image(images.to(device)))

+    if not batch_features:
+        return torch.empty((0, 512), device=device)



-def crafter(new_files: list[str], device: device, batch_size: int=128, num_workers: int=4):
+
+def crafter(new_files: list[str], device: device,


+    return img_loader
+

 def preproc(img: Tensor) -> Compose:


+                # safe_collate (in crafter) returns None when every image in a
+                # batch failed to decode — skip rather than crash.
+                continue
+            images, _ = batch
+            batch_features.append(model.encode_image(images.to(device)))


 def pil_loader(path: str) -> Image.Image:
-    ImageFile.LOAD_TRUNCATED_IMAGES = True  # Allow truncated images
+    """Open `path` and return an RGB PIL image, or None on failure.
+
+    Uses PIL's JPEG `draft` mode to decode at the smallest IDCT scale that's
+    still ≥ 256px. For large JPEGs this skips most of the inverse-DCT work


    ) -> list[str]:
    """Search recursively over a folder from the command line"""
    memery = Memery()
-    ranked = memery.query_flow(root, query=text, image_query=image)
+    ranked = memery.query_flow(root, query=text, negative_query=negative, image_query=image)
    print(ranked[:number])


Caught by Copilot review on #41. When safe_collate drops items from a partially-failing batch (an image that passed verify_image but failed at PIL decode time — truncated files, weird color spaces), the encoder used to discard the surviving labels and just concat the embeddings. Downstream, indexer.join_all correlates new_files and embeddings by position, so dropped items would shift every subsequent file's embedding by one — silent data corruption — and the last few files would IndexError. - encoder.image_encoder now returns (embeddings, surviving_labels). Both are CPU tensors. Empty-batch path also returns CPU tensors for consistency with the non-empty path. - core.index_flow filters new_files using surviving_labels before passing to join_all, so positions stay aligned. Logs how many were skipped. - New regression test in test_core: monkey-patches pil_loader to fail on one fixture and verifies the resulting db has correct fpath -> embed mapping with no IndexError. Marked integration (uses real CLIP).

Also from Copilot review on #41. Each is small but worth getting right since the type annotations are how readers and IDEs reason about the API. - loader.archive_loader: type hints now describe the actual shapes (filepaths is list[(Path, hash)], not list[str]; returns (dict, list[(str, str)]) not (set, list)). - crafter.crafter: parameter is list[(str, str)], not list[str]. - crafter.preproc: returns Tensor, not Compose; takes a PIL Image. - crafter.pil_loader: returns Optional[Image.Image] since it returns None on decode failure. - cli.recall: annotated as -> None to match actual behavior (it prints rather than returns). CI dep hygiene: - Add flake8 to dev-dependencies so `poetry run flake8` actually finds the tool inside the poetry-managed venv (was previously pip-installed outside the venv and worked by accident). - Workflow installs flake8/pytest via `poetry install` instead of a separate `pip install`, so we stop risking version skew between the CI image's pip and what poetry resolves.

refactor: 🔥 remove unused code and old notebooks

abc476c

deepfates changed the title ~~(WIP) Memery 1.~~ (WIP) Memery 1.0 Oct 15, 2024

deepfates added 4 commits May 7, 2026 15:37

deepfates changed the title ~~(WIP) Memery 1.0~~ Memery 1.0: restore working state, fix MPS indexer, add tests May 7, 2026

deepfates requested review from Copilot May 7, 2026 22:42

Copilot started reviewing on behalf of deepfates May 7, 2026 22:43 View session

Copilot AI reviewed May 7, 2026

View reviewed changes

deepfates added 2 commits May 7, 2026 15:53

deepfates merged commit aec9510 into main May 7, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memery 1.0: restore working state, fix MPS indexer, add tests#41

Memery 1.0: restore working state, fix MPS indexer, add tests#41
deepfates merged 8 commits into
mainfrom
rewrite

deepfates commented Oct 15, 2024 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		def crafter(new_files: list[str], device: device, batch_size: int=128, num_workers: int=4):

		def crafter(new_files: list[str], device: device,

Conversation

deepfates commented Oct 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Headline numbers

The indexer story (commit 2)

The workers story (commit 3)

Test plan

Caveats

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

deepfates commented Oct 15, 2024 •

edited

Loading