Conversation
The branch was unbuildable on a fresh install: streamlit was pinned to 1.3.1 (incompatible with everything else), `streamlit.cli` was removed upstream years ago, and `streamlit.report_thread` was renamed. - pyproject: relax pins (streamlit ^1.30, typer >=0.12, altair ^5, Pillow ^10), constrain python to <3.13 since torch 2.2 caps there, drop the no-longer-needed protobuf pin - memery/cli.py: import `streamlit.web.cli` (lazy, so --help stays fast) and expose -nt/--negative-text on `recall` so the CLI matches what the GUI already supported - memery/streamlit_app.py: drop the dead `report_thread` import and the 50-line stdout-redirect contraption it powered; replace with st.spinner / logbox.warn-success — same UX, much less surface area - .gitignore: ignore local-only .claude/, .venv/, .pytest_cache/
The dominant bottleneck on Apple Silicon at scale wasn't encoding —
it was annoy's `add_item` reading each 512-D embedding element-by-
element from a tensor still on the GPU, forcing a Metal
`waitUntilCompleted` per float. On an 89k-image library that's ~46
million serialized GPU syncs; profiling showed 98.5% of wall clock
in this single path. The indexer ran longer than the entire encoding
pass and looked silently hung from the user's side.
Fix at the source:
- encoder.py: cat batch features into a list and concat once at the
end (the previous in-loop torch.cat was O(n^2) memory churn; on MPS
each cat also paid a kernel-launch cost — measured 250x microbench
speedup at 702 batches, ~15s on a real run). Move the final tensor
to CPU before returning so nothing downstream pays the per-element
sync tax. Also call model.eval() — defensive correctness, free.
- indexer.py: defensive `_to_cpu_array` in case anything still passes
a GPU tensor; wrap the add_item loop in tqdm so the previously-
silent "Building treemap" phase is now visible.
While in the area, two related perf/robustness wins:
- crafter.py: pil_loader uses Image.draft('RGB', (256,256)) for fast
partial-scale JPEG decode (measured 1.63x preprocessing speedup
on a realistic Downloads sample); add safe_collate so a single
bad image no longer crashes a whole DataLoader batch; drop the
dead-code `slugs` parallel list make_dataset built and nobody read.
- loader.py: archive_loader was re-running verify_image on files
already filtered upstream, and used O(n) list lookups for hash
membership; both fixed.
End-to-end on a 700-image realistic-distribution corpus: 59.7s -> 6.7s
(9x). Indexer phase alone on 89k images: ~30+ minutes -> ~2 seconds.
Empirically determined on macOS/MPS with a 4000-image corpus sampled from a realistic Downloads folder. Three trials each: workers=0: 34.89s, 33.10s, 32.54s (mean 33.5s) workers=2: 28.54s, 28.61s, 28.25s (mean 28.5s) -15% Distributions don't overlap; the win is real, not noise. Single-trial sweep showed monotonic decline beyond 2 (workers=4 ties workers=0; workers=8 is 63% slower; workers=12 is 133% slower). Read: a single CPU thread can't quite keep the GPU fed, but ~2 can, and beyond that the GPU is a serial bottleneck while IPC overhead piles up. Trade-off: ~3% slower on tiny corpora (<1k images) where spawn cost isn't amortized. Bounded and not user-visible. Documented in --help along with the escape hatch (--workers 0) for environments where multiprocessing misbehaves. Untested on CUDA / Linux / Windows; the bottleneck shape should generalize but if anyone reports trouble, --workers 0 is the fix.
…in CI There were zero tests on this branch. The CI workflow even installed pytest but never invoked it — only flake8 ran. So a previous breakage (e.g. the `streamlit.cli` import that died years ago, or the `streamlit.report_thread` rename) would land silently. - tests/test_loader.py: hash_path format, extension filter, corrupt- file skipping, missing-archive fallbacks - tests/test_crafter.py: pil_loader RGB conversion + None-on-corrupt, safe_collate drops None items / returns None on all-bad batches - tests/test_cli.py: typer CliRunner smoke + verifies -nt is exposed on `recall` + `purge` is idempotent on empty dirs - tests/test_core.py: end-to-end index_flow / query_flow / clean against a 3-image fixture. Marked `integration` so it's opt-in (downloads ~338MB CLIP weights on first run); CI runs only the fast suite by default. - tests/conftest.py: copies the bundled images/ fixtures into tmp dirs per-test so index files never end up in the working tree - pyproject.toml [tool.pytest.ini_options]: addopts = "-m 'not integration'" so default `pytest` is fast - .github/workflows/python-app.yml: bump python to 3.12 to match the new pyproject constraint, also trigger on the rewrite branch, and actually run pytest after lint
Poetry can't resolve streamlit ^1.30 against python >=3.9 because the intermediate streamlit versions had a !=3.9.7 marker that poetry won't unify with our open lower bound. Modern streamlit requires 3.10+ anyway. Pinning to >=3.10,<3.13 lets the resolver pick streamlit 1.50+ cleanly.
There was a problem hiding this comment.
Pull request overview
This PR restores the rewrite branch to a buildable/runnable state, fixes a major Apple Silicon (MPS) indexing performance issue by ensuring downstream consumers receive CPU-resident embeddings, and adds a pytest suite + CI coverage to prevent regressions.
Changes:
- Modernizes dependencies and Streamlit integration, and updates the CLI surface (including negative text flag and default worker behavior).
- Fixes the MPS “indexer hang” by avoiding element-wise GPU syncs when feeding embeddings into Annoy; adds safer batching/decoding handling.
- Adds unit + integration tests, and updates GitHub Actions to run lint + pytest.
Reviewed changes
Copilot reviewed 27 out of 38 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
pyproject.toml |
Updates dependency constraints and configures pytest defaults/markers. |
.github/workflows/python-app.yml |
Updates CI to use newer actions/Python and to run lint + pytest. |
.gitignore |
Ignores local-only dev artifacts (venv, pytest cache, etc.). |
memery/cli.py |
Updates Typer CLI (negative text option, serve wiring, default workers). |
memery/streamlit_app.py |
Streamlit app cleanup (arg parsing, spinners/logging, UI tweaks). |
memery/loader.py |
Loader/archive selection performance tweak (set-based hash lookups). |
memery/crafter.py |
Faster PIL loading path, safe_collate, and DataLoader worker defaults. |
memery/encoder.py |
Improves encoder performance (avoid O(n²) cat) and returns CPU embeddings. |
memery/indexer.py |
Adds defensive CPU conversion for embeddings before Annoy indexing + tqdm progress. |
tests/conftest.py |
Adds fixtures (bundled images, tmp copies) to keep tests isolated from repo artifacts. |
tests/test_loader.py |
Unit tests for loader behaviors (hashing, extension filtering, corrupt skipping). |
tests/test_crafter.py |
Tests PIL loader behavior and safe_collate semantics. |
tests/test_core.py |
Integration tests for index/query/clean flows (marked integration). |
tests/test_cli.py |
CLI smoke tests for command wiring and help output. |
tests/__init__.py |
Initializes test package. |
memery/gui.py |
Removes deprecated/unused GUI code (commented-out widget implementation). |
notebooks/memery.ipynb |
Removes legacy notebook artifact. |
notebooks/09_streamlit_app.ipynb |
Removes legacy notebook artifact. |
notebooks/08_jupyter_gui.ipynb |
Removes legacy notebook artifact. |
notebooks/07_cli.ipynb |
Removes legacy notebook artifact. |
notebooks/05_ranker.ipynb |
Removes legacy notebook artifact. |
notebooks/04_indexer.ipynb |
Removes legacy notebook artifact. |
notebooks/03_encoder.ipynb |
Removes legacy notebook artifact. |
notebooks/02_crafter.ipynb |
Removes legacy notebook artifact. |
notebooks/01_loader.ipynb |
Removes legacy notebook artifact. |
notebooks/00_core.ipynb |
Removes legacy notebook artifact. |
notebooks/_working_pipeline.ipynb |
Removes legacy notebook artifact. |
notebooks/_visualize.ipynb |
Removes legacy notebook artifact. |
Comments suppressed due to low confidence (1)
.github/workflows/python-app.yml:30
- CI installs
flake8viapip, but the lint step runspoetry run flake8. Unless Poetry is configured to use the system environment,poetry runexecutes inside Poetry’s virtualenv whereflake8is not installed (it’s not inpyproject.tomldev-dependencies), so this step is likely to fail. Either addflake8to[tool.poetry.dev-dependencies]or runflake8withoutpoetry run.
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8 pytest poetry
poetry install
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| new_files = [(str(path), hash) for path, hash in filepaths if hash not in archive_hashes and verify_image(path)] | ||
|
|
||
| return(archive_db, new_files) | ||
| def archive_loader(filepaths: list[str], db: Any) -> tuple[set[str], list[str]]: |
| batch_features.append(model.encode_image(images.to(device))) | ||
|
|
||
| if not batch_features: | ||
| return torch.empty((0, 512), device=device) |
|
|
||
| def crafter(new_files: list[str], device: device, batch_size: int=128, num_workers: int=4): | ||
|
|
||
| def crafter(new_files: list[str], device: device, |
| return img_loader | ||
|
|
||
|
|
||
| def preproc(img: Tensor) -> Compose: |
Comment on lines
+28
to
+32
| # safe_collate (in crafter) returns None when every image in a | ||
| # batch failed to decode — skip rather than crash. | ||
| continue | ||
| images, _ = batch | ||
| batch_features.append(model.encode_image(images.to(device))) |
Comment on lines
+18
to
+22
| def pil_loader(path: str) -> Image.Image: | ||
| ImageFile.LOAD_TRUNCATED_IMAGES = True # Allow truncated images | ||
| """Open `path` and return an RGB PIL image, or None on failure. | ||
|
|
||
| Uses PIL's JPEG `draft` mode to decode at the smallest IDCT scale that's | ||
| still ≥ 256px. For large JPEGs this skips most of the inverse-DCT work |
Comment on lines
23
to
27
| ) -> list[str]: | ||
| """Search recursively over a folder from the command line""" | ||
| memery = Memery() | ||
| ranked = memery.query_flow(root, query=text, image_query=image) | ||
| ranked = memery.query_flow(root, query=text, negative_query=negative, image_query=image) | ||
| print(ranked[:number]) |
Caught by Copilot review on #41. When safe_collate drops items from a partially-failing batch (an image that passed verify_image but failed at PIL decode time — truncated files, weird color spaces), the encoder used to discard the surviving labels and just concat the embeddings. Downstream, indexer.join_all correlates new_files and embeddings by position, so dropped items would shift every subsequent file's embedding by one — silent data corruption — and the last few files would IndexError. - encoder.image_encoder now returns (embeddings, surviving_labels). Both are CPU tensors. Empty-batch path also returns CPU tensors for consistency with the non-empty path. - core.index_flow filters new_files using surviving_labels before passing to join_all, so positions stay aligned. Logs how many were skipped. - New regression test in test_core: monkey-patches pil_loader to fail on one fixture and verifies the resulting db has correct fpath -> embed mapping with no IndexError. Marked integration (uses real CLIP).
Also from Copilot review on #41. Each is small but worth getting right since the type annotations are how readers and IDEs reason about the API. - loader.archive_loader: type hints now describe the actual shapes (filepaths is list[(Path, hash)], not list[str]; returns (dict, list[(str, str)]) not (set, list)). - crafter.crafter: parameter is list[(str, str)], not list[str]. - crafter.preproc: returns Tensor, not Compose; takes a PIL Image. - crafter.pil_loader: returns Optional[Image.Image] since it returns None on decode failure. - cli.recall: annotated as -> None to match actual behavior (it prints rather than returns). CI dep hygiene: - Add flake8 to dev-dependencies so `poetry run flake8` actually finds the tool inside the poetry-managed venv (was previously pip-installed outside the venv and worked by accident). - Workflow installs flake8/pytest via `poetry install` instead of a separate `pip install`, so we stop risking version skew between the CI image's pip and what poetry resolves.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The
rewritebranch was unbuildable on a fresh install (dead streamlit imports, deps from 2022 that no longer co-resolve) and even after fixing that, had a perf disaster on Apple Silicon at scale. This PR makes it run, makes it fast, and adds a test suite so future regressions don't slip in silently.Four commits in this order — each is independently reviewable:
streamlit.cli/streamlit.report_threadimports, expose-nton the CLI, gitignore local-only paths--workers 2— empirically determined sweet spot, with the data in the commit bodyHeadline numbers
End-to-end on a 700-image realistic-distribution corpus (sampled from a real Downloads folder):
memery buildwall clockThe indexer story (commit 2)
Profiling the live build at 89k images showed 98.5% of wall clock in one stack:
annoy.add_item→PyFloat_AsDouble→Tensor.item()→_local_scalar_dense_mps→ MetalwaitUntilCompleted. Annoy reads each 512-D vector element-by-element, and on MPS each element access forces a full GPU sync. ~46 million serialized GPU syncs.Fix at the source: encoder hands CPU tensors to everything downstream; indexer is defensive about whatever it receives and now shows a tqdm bar so the previously-silent "Building treemap" phase is visible.
The workers story (commit 3)
Three trials each on a 4000-image corpus:
```
workers=0: 34.89s, 33.10s, 32.54s → mean 33.5s
workers=2: 28.54s, 28.61s, 28.25s → mean 28.5s (-15%)
```
Distributions don't overlap. Single-trial sweep showed monotonic decline beyond 2 (workers=8 was 63% slower; workers=12 was 133% slower) — a single CPU thread can't quite keep MPS fed, but ~2 can, and beyond that the GPU becomes the serial bottleneck while IPC overhead piles up. `--workers 0` remains an escape hatch and is documented in `--help`.
Test plan
Caveats