CUDA Image Search Engine

A CUDA-accelerated image similarity search engine. Give it a folder of query images and it returns the most visually similar pictures from a pre-encoded database.

CLIP turns each query into an embedding, a pair of custom CUDA kernels score it against the database and pick the top matches, and a small stream pool keeps the GPU busy while the next batch loads and the last batch's results are written out.

For a detailed report, see REPORT.md.

Dataset

The database is built from the Oxford-IIIT Pet dataset: ~7,400 images of cats and dogs across 37 breeds. Each image is encoded once with CLIP ViT-B/32 into a 512-dim float32 vector and packed into data/embeddings.bin.
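A minimal sketch of reading that file back, assuming data/embeddings.bin is a headerless, row-major array of native-endian float32 values. The README does not spell out the on-disk layout, so treat that as an assumption; `load_embeddings` is a hypothetical helper, not part of the repository.

```python
import array
import os

DIM = 512  # CLIP ViT-B/32 embedding width, per the README

def load_embeddings(path, dim=DIM):
    """Read a raw float32 file into a list of rows.

    Assumes no header and row-major layout (an assumption, not confirmed
    by the repository). Raises if the file is not a whole number of rows.
    """
    size = os.path.getsize(path)
    row_bytes = dim * 4  # float32 = 4 bytes
    if size % row_bytes != 0:
        raise ValueError(f"{path}: {size} bytes is not a multiple of {row_bytes}")
    flat = array.array("f")
    with open(path, "rb") as f:
        flat.fromfile(f, size // 4)
    # Slice the flat buffer into one array per embedding.
    return [flat[i:i + dim] for i in range(0, len(flat), dim)]
```

Under the same assumption, a database of roughly 7,400 rows occupies about 7,400 x 512 x 4 bytes, or around 15 MB.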

Quick Start

./setup_project.sh && ./run_project.sh

Note: the full setup and run can take more than 10 minutes.

setup_project.sh is a one-time prerequisite: it sets up the environment, downloads the Oxford-IIIT Pet dataset, generates data/embeddings.bin, and pre-builds the benchmark query batches under data/query_batches/.

run_project.sh is the repeatable benchmark sweep: it builds the binaries, regenerates data/query_batches_<size>/ from scratch for each size, runs all profiling scripts at BATCH_SIZE=100 and BATCH_SIZE=500, and routes artifacts to output/batch_100/ and output/batch_500/ before printing a lookup table of every produced file.

Pipeline

Image dataset ──Python CLIP──▶  data/embeddings.bin     (DB, [N][512] f32)

Per batch (interactive):
  query folder ──Python CLIP──▶ data/query_embeddings.bin + query_paths.txt
                                          │
                                          ▼
                  StreamManager.acquire()  (waits on prior batch on this slot,
                                            prints its GPU kernel time)
                                          │
                  grow pinned host + device buffers if Q grew
                  cudaMemcpyAsync H→D
                  cosine_similarity_batch_kernel<<<N, 64, shmem>>>
                  topk_per_query_kernel<<<Q, 128>>>         (1 block / query)
                  cudaMemcpyAsync D→H  (only Q*K idx + Q*K scores)
                  cudaLaunchHostFunc → CSV + (HTML &)
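
The two kernel stages in the diagram above compute, for every query, a similarity score against each database row and then keep the K best. A plain-Python CPU reference of that computation might look like the sketch below. It is not the kernel code: the function names are hypothetical, and the real kernels may normalize or reduce differently (for example, by working on pre-normalized embeddings).

```python
import heapq
import math

def cosine_scores(query, db):
    """Cosine similarity of one query vector against every DB row."""
    q_norm = math.sqrt(sum(x * x for x in query))
    scores = []
    for row in db:
        dot = sum(a * b for a, b in zip(query, row))
        r_norm = math.sqrt(sum(x * x for x in row))
        denom = q_norm * r_norm
        # Guard against zero vectors instead of dividing by zero.
        scores.append(dot / denom if denom > 0.0 else 0.0)
    return scores

def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda p: p[1])

def search(queries, db, k=10):
    """Score every query against the DB and keep the top k per query."""
    return [top_k(cosine_scores(q, db), k) for q in queries]
```

A reference like this is mainly useful for spot-checking GPU results on a handful of queries; it is far too slow for benchmarking.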

Two stream slots in the round-robin pool let the next batch's H→D and kernel overlap with the previous batch's D→H + CSV write.
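
The round-robin idea can be illustrated on the CPU with threads standing in for CUDA streams. This is only an analogy under stated assumptions: `SlotPool` and its methods are hypothetical names, and the real StreamManager works with CUDA streams and events rather than Python futures.

```python
from concurrent.futures import ThreadPoolExecutor

class SlotPool:
    """Round-robin pool of worker slots, loosely modelled on the README's
    StreamManager. Each slot remembers the last job submitted to it."""

    def __init__(self, num_slots=2):
        self._executor = ThreadPoolExecutor(max_workers=num_slots)
        self._pending = [None] * num_slots
        self._next = 0

    def acquire(self):
        """Pick the next slot and block until its previous job has finished.

        Returns (slot_index, previous_result); previous_result is None the
        first time a slot is used.
        """
        slot = self._next
        self._next = (self._next + 1) % len(self._pending)
        prior = self._pending[slot]
        result = prior.result() if prior is not None else None
        return slot, result

    def submit(self, slot, fn, *args):
        """Queue work on a slot without waiting for it."""
        self._pending[slot] = self._executor.submit(fn, *args)

    def drain(self):
        """Wait for outstanding jobs and return their results in slot order."""
        results = [f.result() for f in self._pending if f is not None]
        self._executor.shutdown()
        return results
```

With two slots, batch 2 waits only for batch 0 (the previous occupant of slot 0), so batch 1 can still be in flight while batch 2 is being prepared.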

Demo

To see an example of the final HTML output, open demo/demo.html in a browser.

Run the Engine

Put the pet images you'd like to search for in a folder. The engine accepts .jpg/.jpeg/.png files.

Example images from https://www.pexels.com/ :

mkdir -p query/batch_0

curl -L "https://images.pexels.com/photos/9428235/pexels-photo-9428235.jpeg" -o query/batch_0/image1.jpeg
curl -L "https://images.pexels.com/photos/33205883/pexels-photo-33205883.jpeg" -o query/batch_0/image2.jpeg
curl -L "https://images.pexels.com/photos/32155881/pexels-photo-32155881.jpeg" -o query/batch_0/image3.jpeg

The interactive REPL lives in the binary directly:

make
./bin/image_search                 # interactive REPL, default top_k=10

At the prompt, enter the query dir:

> query/batch_0

You can also pass in a directory of precomputed embeddings with --no-embed (see Engine flags and Pre-computed Query Batches below).

Type quit or done to exit. Per batch the engine writes:

output/results_<batch_id>.csv
output/results_<batch_id>.html

Engine flags

--top-k K / -k K / positional K   # default 10, max 16 (GPU kernel cap)
--streams N                       # CUDA stream pool size (default 2)
--block-size N                    # cosine kernel block size; power of two
                                  # <= 1024 (default 64)
--no-embed                        # input is a precomputed batch dir
                                  # (embeddings.bin + paths.txt) — skip CLIP
--no-output                       # suppress CSV/HTML + per-result logging
                                  # (benchmark mode)
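
The limits in the flag list can be checked before launching the binary, for example from a wrapper script. The sketch below is hypothetical and not part of the repository; the upper bounds come from the list above, while the lower bounds (at least 1 for top-k and streams) are assumptions.

```python
MAX_TOP_K = 16          # "GPU kernel cap" from the flag list
MAX_BLOCK_SIZE = 1024   # upper bound stated for --block-size

def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; False for zero and negatives."""
    return n > 0 and (n & (n - 1)) == 0

def check_flags(top_k=10, streams=2, block_size=64):
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    if not 1 <= top_k <= MAX_TOP_K:
        problems.append(f"--top-k must be in 1..{MAX_TOP_K}, got {top_k}")
    if streams < 1:
        problems.append(f"--streams must be >= 1, got {streams}")
    if not (is_power_of_two(block_size) and block_size <= MAX_BLOCK_SIZE):
        problems.append(
            f"--block-size must be a power of two <= {MAX_BLOCK_SIZE}, got {block_size}"
        )
    return problems
```

Calling `check_flags()` with the documented defaults returns an empty list.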

Pre-computed Query Batches (benchmark fast-path)

The profiling scripts under profiling/ consume per-batch directories under data/query_batches/, so no benchmark run pays for Python CLIP startup:

data/query_batches/
  manifest.json
  warmup_001/{embeddings.bin, paths.txt}
  batch_001/{embeddings.bin, paths.txt}
  ...

Each embeddings.bin is a Q x 512 float32 array sliced from data/embeddings.bin with a deterministic per-batch seed. To change the batch count, batch size, or warmup count, re-run the preprocessing step with --force. The bench scripts validate manifest.json before running and fail fast on a mismatch.
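
One way to get reproducible batches is to seed a private random generator per batch. The sketch below shows the idea only; the seed derivation (`base_seed + batch_index`), sampling without replacement, and the function name are all assumptions rather than what scripts/preprocess_query_batches.py necessarily does.

```python
import random

def pick_batch_rows(num_db_rows, batch_size, batch_index, base_seed=0):
    """Choose which DB rows make up one query batch.

    Seeding a private RNG with (base_seed + batch_index) is an assumed
    scheme; it makes each batch reproducible regardless of the order in
    which batches are generated.
    """
    rng = random.Random(base_seed + batch_index)
    # Sample without replacement so a batch never repeats a row.
    return rng.sample(range(num_db_rows), batch_size)
```

Because each batch depends only on its own seed, regenerating a single batch yields the same rows as regenerating all of them.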

Note: because rows are sliced from the cached DB embeddings rather than re-derived by running CLIP on the source JPEGs, the top-k rows in benchmark results_*.csv will not be byte-identical to those from interactive (CLIP-driven) runs. The kernel work measured is identical; only the input query vectors differ.

Benchmarks

./run_project.sh invokes all four benchmark scripts under profiling/:

| Script | Purpose | Output |
| --- | --- | --- |
| `bench_cpu_vs_gpu.sh` | times `image_search_cpu` vs `image_search` | `cpu_v_gpu.txt` |
| `benchmark.sh` | Nsight Compute `--set full` (per block size) + Nsight Systems (per stream count) | `ncu_test_all_b{64,128,256}.*`, `nsys_image_search_s{1,2}.*` |
| `bench_streams.sh` | engine sweep across CUDA stream counts 1/2/4 | `bench_streams.txt` |
| `bench_blocksize.sh` | engine sweep across cosine block sizes 32–512 | `bench_blocksize.txt` |

bench_streams.sh and bench_blocksize.sh are thin wrappers over the generic bench_sweep.sh. All four scripts share helpers in profiling/lib.sh and consume the same precomputed batches under data/query_batches/, so there is no CLIP cost at benchmark time. Common environment variables are TOP_K, BATCHES, BATCH_SIZE, WARMUP, QUERY_BATCH_DIR, and OUT_DIR; the sweep sets are STREAM_COUNTS, BLOCK_SIZES, NCU_BLOCK_SIZES, and NSYS_STREAM_COUNTS. See REPORT.md for results and analysis.

Project Layout

src/
  engine.cu              CUDA engine + StreamManager + cosine/top-k kernels
  engine_cpu.cpp         CPU baseline (mirrors engine.cu I/O + CLI)
  helper.h               shared CLI parsing, I/O, config

scripts/
  download_data.py                  Oxford-IIIT Pet
  generate_embeddings.py            DB embeddings
  embed_query.py                    interactive-mode CLIP
  preprocess_query_batches.py       precomputed batches for benches
  make_html.py                      CSV → results HTML

profiling/
  benchmark.sh           Nsight Compute (per block size) + Nsight Systems (per stream count)
  bench_cpu_vs_gpu.sh    CPU vs GPU timing
  bench_streams.sh       CUDA stream-count sweep (1, 2, 4) — wraps bench_sweep.sh
  bench_blocksize.sh     cosine block-size sweep (32–512) — wraps bench_sweep.sh
  bench_sweep.sh         generic single-flag sweep driver
  lib.sh                 shared bench helpers (manifest check, queries, parsing)

tests/
  self_match.sh, images.txt         smoke test (query == DB row → rank 1)

demo/                    example results HTML + assets
data/
  embeddings.bin, image_paths.txt, labels.txt, embeddings_meta.json
  query_batches{,_100,_500}/{manifest.json, warmup_*/, batch_*/}
bin/                     image_search, image_search_cpu (built)
output/
  batch_{100,500}/       per-size benchmark artifacts (cpu_v_gpu.txt,
                         ncu_test_all_b{64,128,256}.*, nsys_image_search_s{1,2}.*,
                         bench_streams.txt, bench_blocksize.txt)
  results_<batch_id>.{csv,html}     interactive-mode per-batch results

Makefile                 build image_search (nvcc, -lineinfo) + image_search_cpu (g++)
requirements.txt         Python deps for setup + CLIP
setup_project.sh         one-shot env + data + preprocess (run once)
run_project.sh           two-size benchmark sweep (100 + 500), four bench scripts
