A CUDA-accelerated image similarity search engine. Give it a folder of query images and it returns the most visually similar pictures from a pre-encoded database.
CLIP turns each query into an embedding, a pair of custom CUDA kernels score it against the database and pick the top matches, and a small stream pool keeps the GPU busy while the next batch loads and the last batch's results are written out.
For a detailed report with results and analysis, see REPORT.md.
The database is built from the Oxford-IIIT Pet dataset:
~7,400 images of cats and dogs across 37 breeds. Each
image is encoded once with CLIP ViT-B/32 into a 512-dim float32 vector
and packed into data/embeddings.bin.
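As a rough illustration of that layout, the sketch below reads such a file back into rows using only the Python standard library. It assumes the file is a headerless, row-major dump of native-endian float32 values, which the `[N][512] f32` description suggests but this README does not spell out; `load_embeddings` is a hypothetical helper, not a function from this repository.

```python
import os
from array import array

DIM = 512  # embedding width of CLIP ViT-B/32, as stated above


def load_embeddings(path, dim=DIM):
    """Read a flat float32 file into a list of rows of length `dim`.

    Assumption: no header, row-major order, native byte order.
    """
    size = os.path.getsize(path)
    row_bytes = dim * 4  # 4 bytes per float32
    if size % row_bytes != 0:
        raise ValueError(
            f"{path}: {size} bytes is not a multiple of {row_bytes}"
        )
    flat = array("f")
    with open(path, "rb") as fh:
        flat.fromfile(fh, size // 4)
    # Slice the flat buffer into one array per image.
    return [flat[i:i + dim] for i in range(0, len(flat), dim)]
```

Under these assumptions, about 7,400 rows of 512 floats come to roughly 15 MB on disk, and the row count is simply the file size divided by 2,048 bytes.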
./setup_project.sh && ./run_project.sh
Note: the full setup and run can take more than 10 minutes.
setup_project.sh is a one-time prerequisite: it sets up the environment,
downloads the Oxford-IIIT Pet dataset, generates
data/embeddings.bin, and pre-builds the benchmark query batches under
data/query_batches/.
run_project.sh is the repeatable benchmark sweep: it builds the binaries,
regenerates data/query_batches_<size>/ from scratch for each size, runs
all profiling scripts at BATCH_SIZE=100 and BATCH_SIZE=500, and
routes artifacts to output/batch_100/ and output/batch_500/ before
printing a lookup table of every produced file.
Image dataset ──Python CLIP──▶ data/embeddings.bin (DB, [N][512] f32)
Per batch (interactive):
query folder ──Python CLIP──▶ data/query_embeddings.bin + query_paths.txt
│
▼
StreamManager.acquire() (waits on prior batch on this slot,
prints its GPU kernel time)
│
grow pinned host + device buffers if Q grew
cudaMemcpyAsync H→D
cosine_similarity_batch_kernel<<<N, 64, shmem>>>
topk_per_query_kernel<<<Q, 128>>> (1 block / query)
cudaMemcpyAsync D→H (only Q*K idx + Q*K scores)
cudaLaunchHostFunc → CSV + (HTML &)
Two stream slots in the round-robin pool let the next batch's H→D and kernel overlap with the previous batch's D→H + CSV write.
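The scoring and selection steps in the pipeline above can be expressed as a small CPU reference. This is only a sketch of the math (cosine similarity followed by top-k selection), not the CUDA kernels themselves, which may differ in details such as normalization and tie-breaking; the names `cosine` and `top_k` are illustrative and do not come from engine.cu.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, database, k=10):
    """Return the k best (row_index, score) pairs, highest score first."""
    scored = ((cosine(query, row), idx) for idx, row in enumerate(database))
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [(idx, score) for score, idx in best]
```

For Q queries this would be called once per query, which mirrors the one-block-per-query shape of the top-k launch; only the Q*K indices and scores need to travel back to the host.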
To see a demo of the final HTML output, open demo/demo.html in a browser.
Put the pet images you want to use as queries in a folder.
The engine accepts .jpg/.jpeg/.png files.
Example images from https://www.pexels.com/ :
mkdir -p query/batch_0
curl -L "https://images.pexels.com/photos/9428235/pexels-photo-9428235.jpeg" -o query/batch_0/image1.jpeg
curl -L "https://images.pexels.com/photos/33205883/pexels-photo-33205883.jpeg" -o query/batch_0/image2.jpeg
curl -L "https://images.pexels.com/photos/32155881/pexels-photo-32155881.jpeg" -o query/batch_0/image3.jpeg
The interactive REPL lives in the binary directly:
make
./bin/image_search # interactive REPL, default top_k=10
At the prompt, enter the query dir:
> query/batch_0
You can also pass in a precomputed embedding batch directory with --no-embed (see below).
Type quit or done to exit. Per batch the engine writes:
output/results_<batch_id>.csv
output/results_<batch_id>.html
--top-k K / -k K / positional K # default 10, max 16 (GPU kernel cap)
--streams N # CUDA stream pool size (default 2)
--block-size N # cosine kernel block size; power of two
# <= 1024 (default 64)
--no-embed # input is a precomputed batch dir
# (embeddings.bin + paths.txt) — skip CLIP
--no-output # suppress CSV/HTML + per-result logging
# (benchmark mode)
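The numeric limits listed above (top-k capped at 16, block size a power of two no larger than 1024) can be checked before launching a sweep. The sketch below is a hypothetical pre-flight check, not the parser in helper.h; the lower bound of 1 for both values is an assumption, since the options list only states upper limits.

```python
MAX_TOP_K = 16         # GPU kernel cap stated in the options list
MAX_BLOCK_SIZE = 1024  # upper limit stated for --block-size


def is_valid_top_k(k):
    """True if k is within the documented cap (lower bound 1 is assumed)."""
    return 1 <= k <= MAX_TOP_K


def is_valid_block_size(n):
    """True if n is a power of two no larger than MAX_BLOCK_SIZE."""
    # A positive power of two has exactly one bit set, so n & (n - 1) == 0.
    return 1 <= n <= MAX_BLOCK_SIZE and (n & (n - 1)) == 0
```

A wrapper script could run these checks on each value of a sweep and skip invalid entries rather than letting the binary reject them one at a time.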
The profiling scripts under profiling/ consume per-batch directories under
data/query_batches/ so no bench pays for Python CLIP startup:
data/query_batches/
manifest.json
warmup_001/{embeddings.bin, paths.txt}
batch_001/{embeddings.bin, paths.txt}
...
Each embeddings.bin is Q x 512 float32 sliced from data/embeddings.bin
with a deterministic per-batch seed. Re-run the preprocess with --force
to change batch count, batch size, or warmup count. The bench scripts
validate manifest.json before running and fail fast on mismatch.
Note: because rows are sliced from the cached DB embeddings rather than
re-derived by running CLIP on the source JPEGs, benchmark results_*.csv
top-k rows will not be byte-identical to interactive (CLIP-driven) runs.
Kernel work measured is identical; only the input query vectors differ.
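The deterministic slicing described above might look like the following sketch. How preprocess_query_batches.py actually derives its per-batch seed and selects rows is not documented here, so the seed formula, the `base_seed` parameter, and sampling without replacement are all assumptions, and `pick_rows` is a hypothetical name.

```python
import random


def pick_rows(n_db_rows, batch_size, batch_index, base_seed=0):
    """Choose `batch_size` distinct DB row indices for one batch.

    Assumption: the seed is base_seed + batch_index, so re-running the
    preprocess with the same arguments reproduces the same batches.
    """
    rng = random.Random(base_seed + batch_index)
    return rng.sample(range(n_db_rows), batch_size)
```

Because every query row under this scheme is also a database row, each query's best match should be itself, which is the same property the self-match smoke test in tests/ checks.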
./run_project.sh invokes all four benchmark scripts under profiling/:
| Script | Purpose | Output |
|---|---|---|
| bench_cpu_vs_gpu.sh | Times image_search_cpu vs image_search | cpu_v_gpu.txt |
| benchmark.sh | Nsight Compute --set full (per block size) + Nsight Systems (per stream count) | ncu_test_all_b{64,128,256}.*, nsys_image_search_s{1,2}.* |
| bench_streams.sh | Engine sweep across CUDA stream counts 1/2/4 | bench_streams.txt |
| bench_blocksize.sh | Engine sweep across cosine block sizes 32–512 | bench_blocksize.txt |
bench_streams.sh and bench_blocksize.sh are thin wrappers over the generic
bench_sweep.sh. All four scripts share helpers in profiling/lib.sh and
consume the same precomputed batches under data/query_batches/, so there is
no CLIP cost at benchmark time. Common env vars: TOP_K, BATCHES, BATCH_SIZE,
WARMUP, QUERY_BATCH_DIR, OUT_DIR, plus the sweep sets STREAM_COUNTS,
BLOCK_SIZES, NCU_BLOCK_SIZES, NSYS_STREAM_COUNTS. See REPORT.md for results
and analysis.
src/
engine.cu CUDA engine + StreamManager + cosine/top-k kernels
engine_cpu.cpp CPU baseline (mirrors engine.cu I/O + CLI)
helper.h shared CLI parsing, I/O, config
scripts/
download_data.py Oxford-IIIT Pet
generate_embeddings.py DB embeddings
embed_query.py interactive-mode CLIP
preprocess_query_batches.py precomputed batches for benches
make_html.py CSV → results HTML
profiling/
benchmark.sh Nsight Compute (per block size) + Nsight Systems (per stream count)
bench_cpu_vs_gpu.sh CPU vs GPU timing
bench_streams.sh CUDA stream-count sweep (1, 2, 4) — wraps bench_sweep.sh
bench_blocksize.sh cosine block-size sweep (32–512) — wraps bench_sweep.sh
bench_sweep.sh generic single-flag sweep driver
lib.sh shared bench helpers (manifest check, queries, parsing)
tests/
self_match.sh, images.txt smoke test (query == DB row → rank 1)
demo/ example results HTML + assets
data/
embeddings.bin, image_paths.txt, labels.txt, embeddings_meta.json
query_batches{,_100,_500}/{manifest.json, warmup_*/, batch_*/}
bin/ image_search, image_search_cpu (built)
output/
batch_{100,500}/ per-size benchmark artifacts (cpu_v_gpu.txt,
ncu_test_all_b{64,128,256}.*, nsys_image_search_s{1,2}.*,
bench_streams.txt, bench_blocksize.txt)
results_<batch_id>.{csv,html} interactive-mode per-batch results
Makefile build image_search (nvcc, -lineinfo) + image_search_cpu (g++)
requirements.txt Python deps for setup + CLIP
setup_project.sh one-shot env + data + preprocess (run once)
run_project.sh two-size benchmark sweep (100 + 500), four bench scripts