Skip to content

Memery 1.0: restore working state, fix MPS indexer, add tests#41

Merged
deepfates merged 8 commits into
mainfrom
rewrite
May 7, 2026
Merged

Memery 1.0: restore working state, fix MPS indexer, add tests#41
deepfates merged 8 commits into
mainfrom
rewrite

Conversation

@deepfates
Copy link
Copy Markdown
Owner

@deepfates deepfates commented Oct 15, 2024

Summary

The rewrite branch was unbuildable on a fresh install (dead streamlit imports, deps from 2022 that no longer co-resolve) and even after fixing that, had a perf disaster on Apple Silicon at scale. This PR makes it run, makes it fast, and adds a test suite so future regressions don't slip in silently.

Four commits in this order — each is independently reviewable:

  1. fix: restore working state — modern deps, swap dead streamlit.cli / streamlit.report_thread imports, expose -nt on the CLI, gitignore local-only paths
  2. perf: eliminate the MPS indexer hang — root-cause the 30-min "hang" at scale, plus related encoder/crafter/loader cleanups
  3. feat: default --workers 2 — empirically determined sweet spot, with the data in the commit body
  4. test: add suite + run in CI — 16 tests (12 fast, 4 integration), CI now actually invokes pytest

Headline numbers

End-to-end on a 700-image realistic-distribution corpus (sampled from a real Downloads folder):

before after
memery build wall clock 59.7 s 6.7 s (9×)
Indexer phase on 89k images 30+ min and silent ~2 s with progress bar

The indexer story (commit 2)

Profiling the live build at 89k images showed 98.5% of wall clock in one stack: annoy.add_itemPyFloat_AsDoubleTensor.item()_local_scalar_dense_mps → Metal waitUntilCompleted. Annoy reads each 512-D vector element-by-element, and on MPS each element access forces a full GPU sync. ~46 million serialized GPU syncs.

Fix at the source: encoder hands CPU tensors to everything downstream; indexer is defensive about whatever it receives and now shows a tqdm bar so the previously-silent "Building treemap" phase is visible.

The workers story (commit 3)

Three trials each on a 4000-image corpus:

```
workers=0: 34.89s, 33.10s, 32.54s → mean 33.5s
workers=2: 28.54s, 28.61s, 28.25s → mean 28.5s (-15%)
```

Distributions don't overlap. Single-trial sweep showed monotonic decline beyond 2 (workers=8 was 63% slower; workers=12 was 133% slower) — a single CPU thread can't quite keep MPS fed, but ~2 can, and beyond that the GPU becomes the serial bottleneck while IPC overhead piles up. `--workers 0` remains an escape hatch and is documented in `--help`.

Test plan

  • `pytest` (12 fast tests) — passes locally in <1s
  • `pytest -m integration` (4 end-to-end tests with real CLIP) — passes locally in ~7s
  • `memery --help`, `memery build`, `memery recall -t … -nt …`, `memery purge` — all smoke-tested on a real folder
  • `memery serve` — boots streamlit and reaches the welcome screen
  • CI run on this PR (will fire on push)
  • Untested: CUDA, Linux, Windows. The bottleneck shape should generalize but flagging here in case anyone hits trouble; the documented fix is `--workers 0`.

Caveats

  • Skipped fixing `hash_path` (currently `stem + mtime`, can collide for two `IMG_1234.jpg` files in different subfolders saved at the same second). It's a real bug, but fixing it invalidates every existing cache, so I left it for a future change.
  • I left workers default at the value most likely to win on the typical use case (search a folder of thousands of images on Apple Silicon). If anyone reports regressions on a different platform, falling back to 0 is a one-line change.

@deepfates deepfates changed the title (WIP) Memery 1. (WIP) Memery 1.0 Oct 15, 2024
deepfates added 4 commits May 7, 2026 15:37
The branch was unbuildable on a fresh install: streamlit was pinned to
1.3.1 (incompatible with everything else), `streamlit.cli` was removed
upstream years ago, and `streamlit.report_thread` was renamed.

- pyproject: relax pins (streamlit ^1.30, typer >=0.12, altair ^5,
  Pillow ^10), constrain python to <3.13 since torch 2.2 caps there,
  drop the no-longer-needed protobuf pin
- memery/cli.py: import `streamlit.web.cli` (lazy, so --help stays
  fast) and expose -nt/--negative-text on `recall` so the CLI matches
  what the GUI already supported
- memery/streamlit_app.py: drop the dead `report_thread` import and
  the 50-line stdout-redirect contraption it powered; replace with
  st.spinner / logbox.warn-success — same UX, much less surface area
- .gitignore: ignore local-only .claude/, .venv/, .pytest_cache/
The dominant bottleneck on Apple Silicon at scale wasn't encoding —
it was annoy's `add_item` reading each 512-D embedding element-by-
element from a tensor still on the GPU, forcing a Metal
`waitUntilCompleted` per float. On an 89k-image library that's ~46
million serialized GPU syncs; profiling showed 98.5% of wall clock
in this single path. The indexer ran longer than the entire encoding
pass and looked silently hung from the user's side.

Fix at the source:
- encoder.py: cat batch features into a list and concat once at the
  end (the previous in-loop torch.cat was O(n^2) memory churn; on MPS
  each cat also paid a kernel-launch cost — measured 250x microbench
  speedup at 702 batches, ~15s on a real run). Move the final tensor
  to CPU before returning so nothing downstream pays the per-element
  sync tax. Also call model.eval() — defensive correctness, free.
- indexer.py: defensive `_to_cpu_array` in case anything still passes
  a GPU tensor; wrap the add_item loop in tqdm so the previously-
  silent "Building treemap" phase is now visible.

While in the area, two related perf/robustness wins:
- crafter.py: pil_loader uses Image.draft('RGB', (256,256)) for fast
  partial-scale JPEG decode (measured 1.63x preprocessing speedup
  on a realistic Downloads sample); add safe_collate so a single
  bad image no longer crashes a whole DataLoader batch; drop the
  dead-code `slugs` parallel list make_dataset built and nobody read.
- loader.py: archive_loader was re-running verify_image on files
  already filtered upstream, and used O(n) list lookups for hash
  membership; both fixed.

End-to-end on a 700-image realistic-distribution corpus: 59.7s -> 6.7s
(9x). Indexer phase alone on 89k images: ~30+ minutes -> ~2 seconds.
Empirically determined on macOS/MPS with a 4000-image corpus sampled
from a realistic Downloads folder. Three trials each:

  workers=0:  34.89s, 33.10s, 32.54s  (mean 33.5s)
  workers=2:  28.54s, 28.61s, 28.25s  (mean 28.5s)  -15%

Distributions don't overlap; the win is real, not noise.

Single-trial sweep showed monotonic decline beyond 2 (workers=4 ties
workers=0; workers=8 is 63% slower; workers=12 is 133% slower).
Read: a single CPU thread can't quite keep the GPU fed, but ~2 can,
and beyond that the GPU is a serial bottleneck while IPC overhead
piles up.

Trade-off: ~3% slower on tiny corpora (<1k images) where spawn cost
isn't amortized. Bounded and not user-visible. Documented in --help
along with the escape hatch (--workers 0) for environments where
multiprocessing misbehaves.

Untested on CUDA / Linux / Windows; the bottleneck shape should
generalize but if anyone reports trouble, --workers 0 is the fix.
…in CI

There were zero tests on this branch. The CI workflow even installed
pytest but never invoked it — only flake8 ran. So a previous breakage
(e.g. the `streamlit.cli` import that died years ago, or the
`streamlit.report_thread` rename) would land silently.

- tests/test_loader.py: hash_path format, extension filter, corrupt-
  file skipping, missing-archive fallbacks
- tests/test_crafter.py: pil_loader RGB conversion + None-on-corrupt,
  safe_collate drops None items / returns None on all-bad batches
- tests/test_cli.py: typer CliRunner smoke + verifies -nt is exposed
  on `recall` + `purge` is idempotent on empty dirs
- tests/test_core.py: end-to-end index_flow / query_flow / clean
  against a 3-image fixture. Marked `integration` so it's opt-in
  (downloads ~338MB CLIP weights on first run); CI runs only the
  fast suite by default.
- tests/conftest.py: copies the bundled images/ fixtures into tmp
  dirs per-test so index files never end up in the working tree
- pyproject.toml [tool.pytest.ini_options]: addopts = "-m 'not
  integration'" so default `pytest` is fast
- .github/workflows/python-app.yml: bump python to 3.12 to match
  the new pyproject constraint, also trigger on the rewrite branch,
  and actually run pytest after lint
@deepfates deepfates changed the title (WIP) Memery 1.0 Memery 1.0: restore working state, fix MPS indexer, add tests May 7, 2026
@deepfates deepfates requested review from Copilot May 7, 2026 22:42
Poetry can't resolve streamlit ^1.30 against python >=3.9 because the
intermediate streamlit versions had a !=3.9.7 marker that poetry won't
unify with our open lower bound. Modern streamlit requires 3.10+ anyway.
Pinning to >=3.10,<3.13 lets the resolver pick streamlit 1.50+ cleanly.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR restores the rewrite branch to a buildable/runnable state, fixes a major Apple Silicon (MPS) indexing performance issue by ensuring downstream consumers receive CPU-resident embeddings, and adds a pytest suite + CI coverage to prevent regressions.

Changes:

  • Modernizes dependencies and Streamlit integration, and updates the CLI surface (including negative text flag and default worker behavior).
  • Fixes the MPS “indexer hang” by avoiding element-wise GPU syncs when feeding embeddings into Annoy; adds safer batching/decoding handling.
  • Adds unit + integration tests, and updates GitHub Actions to run lint + pytest.

Reviewed changes

Copilot reviewed 27 out of 38 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
pyproject.toml Updates dependency constraints and configures pytest defaults/markers.
.github/workflows/python-app.yml Updates CI to use newer actions/Python and to run lint + pytest.
.gitignore Ignores local-only dev artifacts (venv, pytest cache, etc.).
memery/cli.py Updates Typer CLI (negative text option, serve wiring, default workers).
memery/streamlit_app.py Streamlit app cleanup (arg parsing, spinners/logging, UI tweaks).
memery/loader.py Loader/archive selection performance tweak (set-based hash lookups).
memery/crafter.py Faster PIL loading path, safe_collate, and DataLoader worker defaults.
memery/encoder.py Improves encoder performance (avoid O(n²) cat) and returns CPU embeddings.
memery/indexer.py Adds defensive CPU conversion for embeddings before Annoy indexing + tqdm progress.
tests/conftest.py Adds fixtures (bundled images, tmp copies) to keep tests isolated from repo artifacts.
tests/test_loader.py Unit tests for loader behaviors (hashing, extension filtering, corrupt skipping).
tests/test_crafter.py Tests PIL loader behavior and safe_collate semantics.
tests/test_core.py Integration tests for index/query/clean flows (marked integration).
tests/test_cli.py CLI smoke tests for command wiring and help output.
tests/__init__.py Initializes test package.
memery/gui.py Removes deprecated/unused GUI code (commented-out widget implementation).
notebooks/memery.ipynb Removes legacy notebook artifact.
notebooks/09_streamlit_app.ipynb Removes legacy notebook artifact.
notebooks/08_jupyter_gui.ipynb Removes legacy notebook artifact.
notebooks/07_cli.ipynb Removes legacy notebook artifact.
notebooks/05_ranker.ipynb Removes legacy notebook artifact.
notebooks/04_indexer.ipynb Removes legacy notebook artifact.
notebooks/03_encoder.ipynb Removes legacy notebook artifact.
notebooks/02_crafter.ipynb Removes legacy notebook artifact.
notebooks/01_loader.ipynb Removes legacy notebook artifact.
notebooks/00_core.ipynb Removes legacy notebook artifact.
notebooks/_working_pipeline.ipynb Removes legacy notebook artifact.
notebooks/_visualize.ipynb Removes legacy notebook artifact.
Comments suppressed due to low confidence (1)

.github/workflows/python-app.yml:30

  • CI installs flake8 via pip, but the lint step runs poetry run flake8. Unless Poetry is configured to use the system environment, poetry run executes inside Poetry’s virtualenv where flake8 is not installed (it’s not in pyproject.toml dev-dependencies), so this step is likely to fail. Either add flake8 to [tool.poetry.dev-dependencies] or run flake8 without poetry run.
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest poetry
        poetry install

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread memery/loader.py Outdated
new_files = [(str(path), hash) for path, hash in filepaths if hash not in archive_hashes and verify_image(path)]

return(archive_db, new_files)
def archive_loader(filepaths: list[str], db: Any) -> tuple[set[str], list[str]]:
Comment thread memery/encoder.py Outdated
batch_features.append(model.encode_image(images.to(device)))

if not batch_features:
return torch.empty((0, 512), device=device)
Comment thread memery/crafter.py Outdated

def crafter(new_files: list[str], device: device, batch_size: int=128, num_workers: int=4):

def crafter(new_files: list[str], device: device,
Comment thread memery/crafter.py Outdated
return img_loader


def preproc(img: Tensor) -> Compose:
Comment thread memery/encoder.py Outdated
Comment on lines +28 to +32
# safe_collate (in crafter) returns None when every image in a
# batch failed to decode — skip rather than crash.
continue
images, _ = batch
batch_features.append(model.encode_image(images.to(device)))
Comment thread memery/crafter.py Outdated
Comment on lines +18 to +22
def pil_loader(path: str) -> Image.Image:
ImageFile.LOAD_TRUNCATED_IMAGES = True # Allow truncated images
"""Open `path` and return an RGB PIL image, or None on failure.

Uses PIL's JPEG `draft` mode to decode at the smallest IDCT scale that's
still ≥ 256px. For large JPEGs this skips most of the inverse-DCT work
Comment thread memery/cli.py Outdated
Comment on lines 23 to 27
) -> list[str]:
"""Search recursively over a folder from the command line"""
memery = Memery()
ranked = memery.query_flow(root, query=text, image_query=image)
ranked = memery.query_flow(root, query=text, negative_query=negative, image_query=image)
print(ranked[:number])
deepfates added 2 commits May 7, 2026 15:53
Caught by Copilot review on #41.

When safe_collate drops items from a partially-failing batch (an image
that passed verify_image but failed at PIL decode time — truncated files,
weird color spaces), the encoder used to discard the surviving labels
and just concat the embeddings. Downstream, indexer.join_all correlates
new_files and embeddings by position, so dropped items would shift every
subsequent file's embedding by one — silent data corruption — and the
last few files would IndexError.

- encoder.image_encoder now returns (embeddings, surviving_labels). Both
  are CPU tensors. Empty-batch path also returns CPU tensors for
  consistency with the non-empty path.
- core.index_flow filters new_files using surviving_labels before passing
  to join_all, so positions stay aligned. Logs how many were skipped.
- New regression test in test_core: monkey-patches pil_loader to fail on
  one fixture and verifies the resulting db has correct fpath -> embed
  mapping with no IndexError. Marked integration (uses real CLIP).
Also from Copilot review on #41. Each is small but worth getting right
since the type annotations are how readers and IDEs reason about the API.

- loader.archive_loader: type hints now describe the actual shapes
  (filepaths is list[(Path, hash)], not list[str]; returns
  (dict, list[(str, str)]) not (set, list)).
- crafter.crafter: parameter is list[(str, str)], not list[str].
- crafter.preproc: returns Tensor, not Compose; takes a PIL Image.
- crafter.pil_loader: returns Optional[Image.Image] since it returns
  None on decode failure.
- cli.recall: annotated as -> None to match actual behavior (it prints
  rather than returns).

CI dep hygiene:
- Add flake8 to dev-dependencies so `poetry run flake8` actually finds
  the tool inside the poetry-managed venv (was previously pip-installed
  outside the venv and worked by accident).
- Workflow installs flake8/pytest via `poetry install` instead of a
  separate `pip install`, so we stop risking version skew between the
  CI image's pip and what poetry resolves.
@deepfates deepfates merged commit aec9510 into main May 7, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants