ethanLin520/cuda-image-search-engine

CUDA Image Search Engine

A CUDA-accelerated image similarity search engine. Give it a folder of query images and it returns the most visually similar pictures from a pre-encoded database.

CLIP turns each query into an embedding, a pair of custom CUDA kernels score it against the database and pick the top matches, and a small stream pool keeps the GPU busy while the next batch loads and the last batch's results are written out.
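
As a rough CPU reference for what the scoring and selection steps compute (this is not the repository's kernel code), the idea can be sketched in plain Python: cosine similarity of a query against every database row, then the K best rows. The function names and the tiny 3-dimensional vectors are illustrative assumptions; real embeddings are 512-dimensional.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(query, database, k):
    """Return the k best (score, row_index) pairs, highest score first."""
    scored = ((cosine_similarity(query, row), i) for i, row in enumerate(database))
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy data: 4 database rows and 1 query (hypothetical values).
database = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
]
query = [2.0, 0.0, 0.0]
results = top_k_matches(query, database, k=2)
best_rows = [index for _, index in results]
```

Here row 0 points in the same direction as the query (score 1.0) and row 2 is 45 degrees away, so they rank first and second. The GPU version performs the same arithmetic for all queries in a batch in parallel.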

For a detailed report, see REPORT.md.

Dataset

The database is built from the Oxford-IIIT Pet dataset: ~7,400 images of cats and dogs across 37 breeds. Each image is encoded once with CLIP ViT-B/32 into a 512-dim float32 vector and packed into data/embeddings.bin.
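
A minimal sketch of reading such a file, assuming it is a headerless, row-major, little-endian float32 array (the README only states `[N][512] f32`; check `data/embeddings_meta.json` or `scripts/generate_embeddings.py` for the actual layout). `load_embeddings` is a hypothetical helper, and the demo writes a temporary file so it runs without the dataset.

```python
import struct
import tempfile
from pathlib import Path

DIM = 512                # embedding width stated in the README
BYTES_PER_ROW = DIM * 4  # one float32 is 4 bytes


def load_embeddings(path):
    """Read a headerless [N][512] float32 file into a list of N tuples.

    Assumption: little-endian, row-major, no header.
    """
    raw = Path(path).read_bytes()
    if len(raw) % BYTES_PER_ROW != 0:
        raise ValueError(f"{path}: size {len(raw)} is not a multiple of {BYTES_PER_ROW}")
    return list(struct.iter_unpack(f"<{DIM}f", raw))


# Round-trip demo: write 3 synthetic rows, then read them back.
demo_rows = [[float(i)] * DIM for i in range(3)]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "embeddings.bin"
    demo_path.write_bytes(b"".join(struct.pack(f"<{DIM}f", *row) for row in demo_rows))
    loaded = load_embeddings(demo_path)
```

Under this assumption each row takes 2,048 bytes, so a database of about 7,400 images comes to roughly 15 MB.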

Quick Start

./setup_project.sh && ./run_project.sh

Note: the full setup and run can take more than 10 minutes.

setup_project.sh is a one-time prerequisite: it sets up the environment, downloads the Oxford-IIIT Pet dataset, generates data/embeddings.bin, and pre-builds the benchmark query batches under data/query_batches/.

run_project.sh is the repeatable benchmark sweep: it builds the binaries, regenerates data/query_batches_<size>/ from scratch for each size, runs all profiling scripts at BATCH_SIZE=100 and BATCH_SIZE=500, and routes artifacts to output/batch_100/ and output/batch_500/ before printing a lookup table of every produced file.

Pipeline

Image dataset ──Python CLIP──▶  data/embeddings.bin     (DB, [N][512] f32)

Per batch (interactive):
  query folder ──Python CLIP──▶ data/query_embeddings.bin + query_paths.txt
                                          │
                                          ▼
                  StreamManager.acquire()  (waits on prior batch on this slot,
                                            prints its GPU kernel time)
                                          │
                  grow pinned host + device buffers if Q grew
                  cudaMemcpyAsync H→D
                  cosine_similarity_batch_kernel<<<N, 64, shmem>>>
                  topk_per_query_kernel<<<Q, 128>>>         (1 block / query)
                  cudaMemcpyAsync D→H  (only Q*K idx + Q*K scores)
                  cudaLaunchHostFunc → CSV + (HTML &)

Two stream slots in the round-robin pool let the next batch's H→D and kernel overlap with the previous batch's D→H + CSV write.
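
The round-robin slot reuse can be illustrated with a toy model in Python. No CUDA is involved; `SlotPool` and its `acquire` method are hypothetical stand-ins, not the engine's `StreamManager`. Each call hands out the next slot and reports which earlier batch, if any, must finish before the slot can be reused.

```python
class SlotPool:
    """Toy model of a round-robin stream pool (no real CUDA involved)."""

    def __init__(self, num_slots=2):
        self.in_flight = [None] * num_slots  # batch id currently owning each slot
        self.next_slot = 0

    def acquire(self, batch_id):
        """Claim the next slot; return (slot, batch that must finish first)."""
        slot = self.next_slot
        must_wait_for = self.in_flight[slot]
        self.in_flight[slot] = batch_id
        self.next_slot = (slot + 1) % len(self.in_flight)
        return slot, must_wait_for


pool = SlotPool(num_slots=2)
schedule = [pool.acquire(batch_id) for batch_id in range(5)]
```

With two slots, batch 2 only waits for batch 0, so batch 1 can still be copying results back and writing its CSV while batch 2 uploads and runs its kernels.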

Demo

To view a demo for the final HTML output, open demo/demo.html in a browser.

Run the Engine

Put the pet images you'd like to search for in a folder. The engine accepts .jpg/.jpeg/.png files.

Example images from https://www.pexels.com/ :

mkdir -p query/batch_0

curl -L "https://images.pexels.com/photos/9428235/pexels-photo-9428235.jpeg" -o query/batch_0/image1.jpeg
curl -L "https://images.pexels.com/photos/33205883/pexels-photo-33205883.jpeg" -o query/batch_0/image2.jpeg
curl -L "https://images.pexels.com/photos/32155881/pexels-photo-32155881.jpeg" -o query/batch_0/image3.jpeg

The interactive REPL lives in the binary directly:

make
./bin/image_search                 # interactive REPL, default top_k=10

At the prompt, enter the query dir:

> query/batch_0

You can also pass in a precomputed embedding batch directory with --no-embed (see below).

Type quit or done to exit. Per batch the engine writes:

output/results_<batch_id>.csv
output/results_<batch_id>.html

Engine flags

--top-k K / -k K / positional K   # default 10, max 16 (GPU kernel cap)
--streams N                       # CUDA stream pool size (default 2)
--block-size N                    # cosine kernel block size; power of two
                                  # <= 1024 (default 64)
--no-embed                        # input is a precomputed batch dir
                                  # (embeddings.bin + paths.txt) — skip CLIP
--no-output                       # suppress CSV/HTML + per-result logging
                                  # (benchmark mode)
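
The documented limits on --top-k and --block-size can be expressed as simple checks. This is a sketch of the stated constraints only, not the engine's actual argument parser; the helper names and the lower bound of 1 are assumptions.

```python
MAX_TOP_K = 16         # "max 16 (GPU kernel cap)"
MAX_BLOCK_SIZE = 1024  # "power of two <= 1024"


def is_valid_top_k(k):
    """True if k is within the documented cap (lower bound of 1 is assumed)."""
    return 1 <= k <= MAX_TOP_K


def is_valid_block_size(n):
    """True for powers of two up to 1024 (lower bound of 1 is assumed)."""
    return 1 <= n <= MAX_BLOCK_SIZE and (n & (n - 1)) == 0


accepted = [n for n in (32, 64, 96, 512, 1024, 2048) if is_valid_block_size(n)]
```

Of the candidates above, 96 is rejected because it is not a power of two and 2048 because it exceeds the limit.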

Pre-computed Query Batches (benchmark fast-path)

The profiling scripts under profiling/ consume per-batch directories under data/query_batches/, so no benchmark pays the Python CLIP startup cost:

data/query_batches/
  manifest.json
  warmup_001/{embeddings.bin, paths.txt}
  batch_001/{embeddings.bin, paths.txt}
  ...

Each embeddings.bin is Q x 512 float32 sliced from data/embeddings.bin with a deterministic per-batch seed. Re-run the preprocess with --force to change batch count, batch size, or warmup count. The bench scripts validate manifest.json before running and fail fast on mismatch.
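
To illustrate what a deterministic per-batch seed buys, here is a hedged sketch of reproducible row selection. The seeding scheme (`base_seed + batch_index`), sampling without replacement, and the function name are all assumptions; see `scripts/preprocess_query_batches.py` for what the project really does.

```python
import random


def pick_batch_rows(num_db_rows, batch_size, batch_index, base_seed=0):
    """Choose batch_size distinct DB row indices, reproducibly per batch.

    Hypothetical scheme: seed = base_seed + batch_index, sampled
    without replacement.
    """
    rng = random.Random(base_seed + batch_index)
    return rng.sample(range(num_db_rows), batch_size)


first = pick_batch_rows(num_db_rows=7400, batch_size=100, batch_index=1)
again = pick_batch_rows(num_db_rows=7400, batch_size=100, batch_index=1)
other = pick_batch_rows(num_db_rows=7400, batch_size=100, batch_index=2)
```

Calling the function twice with the same batch index yields the same rows, which keeps benchmark inputs stable across runs, while a different index yields a different batch.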

Note: because rows are sliced from the cached DB embeddings rather than re-derived by running CLIP on the source JPEGs, benchmark results_*.csv top-k rows will not be byte-identical to interactive (CLIP-driven) runs. Kernel work measured is identical; only the input query vectors differ.

Benchmarks

./run_project.sh invokes all four benchmark scripts under profiling/:

| Script | Purpose | Output |
| --- | --- | --- |
| bench_cpu_vs_gpu.sh | times image_search_cpu vs image_search | cpu_v_gpu.txt |
| benchmark.sh | Nsight Compute --set full (per block size) + Nsight Systems (per stream count) | ncu_test_all_b{64,128,256}.*, nsys_image_search_s{1,2}.* |
| bench_streams.sh | engine sweep across CUDA stream counts 1/2/4 | bench_streams.txt |
| bench_blocksize.sh | engine sweep across cosine block sizes 32–512 | bench_blocksize.txt |

bench_streams.sh and bench_blocksize.sh are thin wrappers over the generic bench_sweep.sh. All four scripts share helpers in profiling/lib.sh and consume the same precomputed batches under data/query_batches/, so there is no CLIP cost at benchmark time. Common env vars: TOP_K, BATCHES, BATCH_SIZE, WARMUP, QUERY_BATCH_DIR, OUT_DIR, plus the sweep sets STREAM_COUNTS, BLOCK_SIZES, NCU_BLOCK_SIZES, NSYS_STREAM_COUNTS. See REPORT.md for results and analysis.

Project Layout

src/
  engine.cu              CUDA engine + StreamManager + cosine/top-k kernels
  engine_cpu.cpp         CPU baseline (mirrors engine.cu I/O + CLI)
  helper.h               shared CLI parsing, I/O, config

scripts/
  download_data.py                  Oxford-IIIT Pet
  generate_embeddings.py            DB embeddings
  embed_query.py                    interactive-mode CLIP
  preprocess_query_batches.py       precomputed batches for benches
  make_html.py                      CSV → results HTML

profiling/
  benchmark.sh           Nsight Compute (per block size) + Nsight Systems (per stream count)
  bench_cpu_vs_gpu.sh    CPU vs GPU timing
  bench_streams.sh       CUDA stream-count sweep (1, 2, 4) — wraps bench_sweep.sh
  bench_blocksize.sh     cosine block-size sweep (32–512) — wraps bench_sweep.sh
  bench_sweep.sh         generic single-flag sweep driver
  lib.sh                 shared bench helpers (manifest check, queries, parsing)

tests/
  self_match.sh, images.txt         smoke test (query == DB row → rank 1)

demo/                    example results HTML + assets
data/
  embeddings.bin, image_paths.txt, labels.txt, embeddings_meta.json
  query_batches{,_100,_500}/{manifest.json, warmup_*/, batch_*/}
bin/                     image_search, image_search_cpu (built)
output/
  batch_{100,500}/       per-size benchmark artifacts (cpu_v_gpu.txt,
                         ncu_test_all_b{64,128,256}.*, nsys_image_search_s{1,2}.*,
                         bench_streams.txt, bench_blocksize.txt)
  results_<batch_id>.{csv,html}     interactive-mode per-batch results

Makefile                 build image_search (nvcc, -lineinfo) + image_search_cpu (g++)
requirements.txt         Python deps for setup + CLIP
setup_project.sh         one-shot env + data + preprocess (run once)
run_project.sh           two-size benchmark sweep (100 + 500), four bench scripts
