6 changes: 4 additions & 2 deletions nemo_retriever/Dockerfile
@@ -89,15 +89,17 @@ COPY src src
COPY api api
COPY client client

-# Use conda env's uv; create venv and install retriever in editable mode (path deps: ../src, ../api, ../client)
+# Use conda env's uv; create venv and install retriever with uv pip install (no lock; path deps: ../src, ../api, ../client).
# INSTALL_ASR=1 installs the [asr] extra (transformers>=5, soundfile, scipy) for local Parakeet ASR; omit for vLLM/embedding-only images.
ARG INSTALL_ASR=0
SHELL ["/bin/bash", "-c"]
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=cache,target=/root/.cache/uv \
source /opt/conda/etc/profile.d/conda.sh \
&& conda activate retriever_libcudart \
&& uv venv .retriever \
&& . .retriever/bin/activate \
-&& uv pip install -e ./retriever
+&& if [ "$INSTALL_ASR" = "1" ]; then uv pip install -e "./retriever[asr]"; else uv pip install -e ./retriever; fi

# Default: run in-process pipeline (help if no args)
ENTRYPOINT ["/workspace/.retriever/bin/python", "-m", "retriever.examples.inprocess_pipeline"]
76 changes: 74 additions & 2 deletions nemo_retriever/README.md
@@ -10,7 +10,7 @@ RAG ingestion pipeline for PDFs: extract structure (text, tables, charts, infogr

## Installation

-Installation is done with **UV** from the **nv-ingest root**. UV manages the environment and dependencies; pip is not supported.
+Installation is done with **UV** from the **nv-ingest root** using **uv pip install** (no lockfile or `uv sync`, so optional extras stay independent). Pip is not supported.

From the repo root:

@@ -21,7 +21,7 @@ source .retriever/bin/activate
uv pip install -e ./nemo_retriever
```

-This installs the retriever in editable mode and its in-repo dependencies. Core dependencies (see `nemo_retriever/pyproject.toml`) include Ray, pypdfium2, pandas, LanceDB, PyYAML, torch, transformers, and the Nemotron packages (page-elements, graphic-elements, table-structure). The retriever also depends on the sibling packages `nv-ingest`, `nv-ingest-api`, and `nv-ingest-client` in this repo.
+This installs the retriever in editable mode and its in-repo dependencies. Core dependencies (see `nemo_retriever/pyproject.toml`) include Ray, pypdfium2, pandas, LanceDB, PyYAML, torch, transformers (4.x), vLLM 0.16, and the Nemotron packages (page-elements, graphic-elements, table-structure). The retriever also depends on the sibling packages `nv-ingest`, `nv-ingest-api`, and `nv-ingest-client` in this repo.

### Optional: ASR extra (local Parakeet)

For **local ASR** (nvidia/parakeet-ctc-1.1b with `audio_endpoints` unset), install the `[asr]` extra. This pulls in `transformers>=5`, `soundfile`, and `scipy` and is mutually exclusive with the default stack (vLLM 0.16 uses transformers<5):

```bash
uv pip install -e "./nemo_retriever[asr]"
```
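As a hedged sketch of what the extra enables: read an audio chunk, downmix/resample to 16 kHz mono (the soundfile/scipy part), and transcribe with a transformers pipeline. The exact pipeline call is illustrative; the parakeet-ctc-1.1b model card is the canonical reference.

```python
# Illustrative local Parakeet ASR flow; requires the [asr] extra at runtime.
def transcribe(wav_path):
    import soundfile as sf
    from scipy.signal import resample_poly
    from transformers import pipeline

    audio, sr = sf.read(wav_path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)                 # downmix to mono
    if sr != 16_000:
        audio = resample_poly(audio, 16_000, sr)   # resample to 16 kHz
    asr = pipeline("automatic-speech-recognition",
                   model="nvidia/parakeet-ctc-1.1b")
    return asr({"raw": audio, "sampling_rate": 16_000})["text"]
```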

Docker: build with ASR support using `--build-arg INSTALL_ASR=1`.

### OCR and CUDA 13 runtime

@@ -122,3 +132,65 @@ To stop and remove both stacks:
docker compose -p ingest-gpu0 down
docker compose -p ingest-gpu1 down
```

## Embedding backends

Embeddings can be served by a **remote HTTP endpoint** (NIM, vLLM, or any OpenAI-compatible server) or by a **local HuggingFace model** when no endpoint is configured.

- **Config**: Set `embedding_nim_endpoint` in `ingest-config.yaml` or stage config (e.g. `http://localhost:8000/v1`). Leave empty or null to use the local HF embedder.
- **CLI**: Use `--embed-invoke-url` (inprocess/batch pipelines) or `--embedding-endpoint` / `--embedding-http-endpoint` (recall CLI) to point at a remote server.
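For example, a minimal `ingest-config.yaml` fragment for a remote endpoint (the key name is from this README; any surrounding structure is assumed):

```yaml
# Remote OpenAI-compatible embedding endpoint; leave empty/null to use the
# local HuggingFace embedder instead.
embedding_nim_endpoint: "http://localhost:8000/v1"
```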

### Using vLLM for embeddings

You can serve an embedding model with [vLLM](https://docs.vllm.ai/) and point the retriever at it. vLLM exposes an OpenAI-compatible `/v1/embeddings` API. Set the embedding endpoint to the vLLM base URL (e.g. `http://localhost:8000/v1`).

**vLLM compatibility**: The default NIM-style client sends `input_type` and `truncate` in the request body; some vLLM versions or configs may not accept these. When using a **vLLM** server, enable the vLLM-compatible payload:

- **Ingest**: `--embed-use-vllm-compat` (inprocess pipeline) or set `embed_use_vllm_compat: true` in `EmbedParams`.
- **Recall**: `--embedding-use-vllm-compat` (recall CLI).

This sends only `model`, `input`, and `encoding_format` (minimal OpenAI-compatible payload).
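As a sketch, the two request bodies differ only in the extra NIM fields (function names and the `input_type`/`truncate` values here are illustrative, not the retriever's actual client code):

```python
# Illustrative request bodies for POST <base_url>/embeddings.
def nim_style_payload(model, texts):
    # Default NIM-style body: `input_type` and `truncate` are the fields
    # some vLLM versions/configs reject.
    return {
        "model": model,
        "input": texts,
        "encoding_format": "float",
        "input_type": "passage",   # assumed value
        "truncate": "END",         # assumed value
    }

def vllm_compat_payload(model, texts):
    # vLLM-compatible body: minimal OpenAI-compatible fields only.
    return {"model": model, "input": texts, "encoding_format": "float"}
```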

### llama-nemotron-embed-1b-v2 with vLLM

For **nvidia/llama-nemotron-embed-1b-v2**, follow the model’s official vLLM instructions:

1. Use **vllm==0.11.0**.
2. Clone the [model repo](https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2) and **overwrite `config.json` with `config_vllm.json`** from that repo.
3. Start the server (replace `<path_to_the_cloned_repository>` and `<num_gpus_to_use>`):

```bash
vllm serve \
<path_to_the_cloned_repository> \
--trust-remote-code \
--runner pooling \
--model-impl vllm \
--override-pooler-config '{"pooling_type": "MEAN"}' \
--data-parallel-size <num_gpus_to_use> \
--dtype float32 \
--port 8000
```

4. Set the retriever embedding endpoint to `http://localhost:8000/v1` and use `--embed-use-vllm-compat` / `--embedding-use-vllm-compat` as above.

See the [model README](https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2) for the canonical vLLM setup and client example.
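Step 2 (overwriting `config.json` with `config_vllm.json`) can be sketched as a small helper; path handling is illustrative, and re-cloning undoes it:

```python
# Replace config.json with config_vllm.json inside a local clone of the
# model repo so vLLM loads the vLLM-specific config.
import os
import shutil

def use_vllm_config(repo_dir):
    src = os.path.join(repo_dir, "config_vllm.json")
    dst = os.path.join(repo_dir, "config.json")
    shutil.copyfile(src, dst)
    return dst
```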

### Using vLLM offline batched inference

You can run the same embedding model (e.g. llama-nemotron-embed-1b-v2) **without a vLLM server** by using vLLM’s Python API for batched inference. This loads the model in-process and runs `LLM.embed()` in batches.

- **When to use**: No server to run; same model and behavior as vLLM server; good for batch ingest or recall in a single process.
- **Install**: vLLM is an optional dependency. Install with `uv pip install -e ".[vllm]"` (requires vllm>=0.11.0 for llama-nemotron-embed-1b-v2).
- **Model path**: You can pass a HuggingFace model id (e.g. `nvidia/llama-nemotron-embed-1b-v2`) or a **local path**. For llama-nemotron-embed-1b-v2, a local clone with `config.json` replaced by `config_vllm.json` (from the model repo) may be required for vLLM to load it correctly.
- **Ingest**: Set `embed_use_vllm_offline: true` in `EmbedParams` or use `--embed-use-vllm-offline` in the inprocess pipeline. Optionally set `embed_model_path` (or `--embed-model-path`) to a local model path.
- **Recall**: Use `--embedding-use-vllm-offline` (recall CLI). Optionally `--embedding-vllm-model-path` to override the model path.
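A minimal sketch of the offline path, assuming vllm>=0.11.0 is installed. The `runner="pooling"` argument mirrors the `--runner pooling` server flag; the model id, batch size, and constructor arguments are illustrative and may differ by vLLM version:

```python
# Hedged sketch: load the embedding model in-process and embed in batches.
def batched(items, size):
    """Yield successive chunks of `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_offline(texts, model="nvidia/llama-nemotron-embed-1b-v2", batch_size=256):
    from vllm import LLM  # imported lazily; needs a GPU at runtime
    llm = LLM(model=model, runner="pooling", trust_remote_code=True)
    vectors = []
    for chunk in batched(texts, batch_size):
        for out in llm.embed(chunk):
            vectors.append(out.outputs.embedding)
    return vectors
```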

### Optional: vLLM attention extras (flash-attn, flashinfer-cubin, xformers)

To add prebuilt attention-related packages without changing the project’s torch version, install the `[vllm-attention]` extra:

```bash
uv pip install -e "./nemo_retriever[vllm-attention]"
```

This installs **flashinfer-cubin** (match version to flashinfer-python), **flash-attn** (prebuilt wheel for torch 2.9 + cu12, Linux py312), and **xformers** (0.0.33.x compatible with torch 2.9). It can reduce vLLM startup time (e.g. CUDA graph capture). The `flash-attn` wheel is sourced from GitHub releases; on other platforms you may need to install it separately.
30 changes: 21 additions & 9 deletions nemo_retriever/pyproject.toml
@@ -44,10 +44,9 @@ dependencies = [
"numpy>=1.26.0",
"debugpy>=1.8.0",
"python-multipart>=0.0.9",
-# transformers>=5 enables loading nvidia/parakeet-ctc-1.1b via pipeline (see
-# parakeet-ctc-1.1b README). If using llama_nemotron_embed_1b_v2, verify
-# compatibility with transformers 5 (it previously relied on HybridCache).
-"transformers>=5.0.0",
+# transformers 4.x for vLLM 0.16 and embedding models (e.g. llama_nemotron_embed_1b_v2).
+# Versions 4.54.0-4.55.x have a flash attention bug; exclude that range.
+"transformers>=4.49.0,<5.0.0,!=4.54.*,!=4.55.*",
"tokenizers>=0.20.3",
"accelerate>=1.1.0",
"torch~=2.9.1",
@@ -64,12 +63,23 @@ dependencies = [
"accelerate==1.12.0",
"albumentations==2.0.8",
"open-clip-torch==3.2.0",
-# Local ASR (Parakeet): read chunk files and resample to 16 kHz mono
-"soundfile>=0.12.0",
-"scipy>=1.11.0",
+"vllm==0.16.0",
]

[project.optional-dependencies]
asr = [
# Local ASR (Parakeet nvidia/parakeet-ctc-1.1b): transformers>=5 required by model.
"transformers>=5.0.0",
"soundfile>=0.12.0",
"scipy>=1.11.0",
]
# Optional: prebuilt FlashInfer cubins + flash-attn + xformers for vLLM (faster startup / attention).
# Keeps existing torch~=2.9.1; flash-attn uses a Linux py3.12 wheel from GitHub.
vllm-attention = [
"flashinfer-cubin==0.6.3",
"flash-attn>=2.8.3",
"xformers>=0.0.33,<0.0.34",
]
dev = [
"build>=1.2.2",
"pytest>=8.0.2",
@@ -89,8 +99,10 @@ nemotron-page-elements-v3 = { index = "test-pypi" }
nemotron-graphic-elements-v1 = { index = "test-pypi" }
nemotron-table-structure-v1 = { index = "test-pypi" }
nemotron-ocr = { index = "test-pypi" }
-torch = { index = "torch-cuda"}
-torchvision = { index ="torch-cuda"}
+torch = { index = "torch-cuda" }
+torchvision = { index = "torch-cuda" }
# Prebuilt wheel for torch 2.9 + cu12 (Linux x86_64, py312). Avoids source build and keeps torch as-is.
flash-attn = { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl" }

[[tool.uv.index]]
name = "test-pypi"
142 changes: 142 additions & 0 deletions nemo_retriever/scripts/README.md
@@ -0,0 +1,142 @@
# vLLM vs baseline embedding comparison scripts

Scripts for comparing **local HF (baseline)** and **vLLM offline** embedding on pre-embedded parquet data, measuring ingest time and recall. Use them to run a single comparison or a parameter sweep and plot the results.

**Assumptions:** Run from the **repository root** (parent of `nemo_retriever/`) with the nemo_retriever env (e.g. `uv run`). The machine (or Ray cluster) must have **at least one GPU** for the embed service.

---

## 1. Single-run comparison (`vllm_embedding_comparison.py`)

Compares baseline (HF) and vLLM offline on a pre-embed parquet dir: starts a long-lived embed service per backend, warms it up, runs a **timed** ingest (model load excluded), then recall.

### Usage

```bash
# From repo root; unset RAY_ADDRESS so we start our own Ray cluster
unset RAY_ADDRESS

uv run python nemo_retriever/scripts/vllm_embedding_comparison.py compare-from-pre-embed \
--pre-embed-dir /path/to/pre_embed_dir \
--query-csv /path/to/query_gt.csv \
--embed-model-path /path/to/llama-nemotron-embed-1b-v2/main
```

### Options (often used)

| Option | Description |
|--------|-------------|
| `--max-rows N` | Use only N rows from pre-embed (faster; e.g. 1000). |
| `--gpu-memory-utilization 0.55` | vLLM GPU memory fraction (default 0.55). |
| `--enforce-eager` | vLLM: disable CUDA graphs (slower, avoids some env issues). |
| `--sort-key COL` | Sort by column before limit so baseline and vLLM see the same rows. |
| `--output-csv FILE` | Append one row of metrics to CSV for later plotting. |

### Example (1K rows, with output for sweep)

```bash
unset RAY_ADDRESS
uv run python nemo_retriever/scripts/vllm_embedding_comparison.py compare-from-pre-embed \
--pre-embed-dir /path/to/bo767_pre_embed \
--query-csv data/bo767_query_gt.csv \
--embed-model-path /path/to/llama-nemotron-embed-1b-v2/main \
--max-rows 1000 \
--output-csv comparison_sweep.csv
```

---

## 2. Sweep: grid of runs (`run_comparison_sweep.py`)

Runs `compare-from-pre-embed` over a grid of **gpu_memory_utilization** × **max_rows**, appending one CSV row per run. Useful for collecting data to plot ingest time vs scale and GPU util.

### Grid (defaults)

- **gpu_utils:** 0.4, 0.5, 0.6, 0.7, 0.8
- **max_rows:** 1000, 2000, 5000, 10000

Override with `--gpu-utils 0.4,0.6,0.8` and `--max-rows-list 1000,5000`. To sweep **embed_batch_size** (e.g. 256, 512, 768) with an env that has **flashinfer-cubin**, see [embed_batch_size_sweep_setup.md](embed_batch_size_sweep_setup.md).
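The grid above is a plain cross product; a sketch of how the runs enumerate (the real script's internals may differ):

```python
# Sketch of the sweep's run grid: one run per (gpu_memory_utilization, max_rows),
# each appending one row to --output-csv.
from itertools import product

GPU_UTILS = [0.4, 0.5, 0.6, 0.7, 0.8]
MAX_ROWS_LIST = [1000, 2000, 5000, 10000]

def sweep_grid(gpu_utils=GPU_UTILS, max_rows_list=MAX_ROWS_LIST):
    """Return the (gpu_util, max_rows) pairs the sweep runs, in order."""
    return list(product(gpu_utils, max_rows_list))
```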

### FlashInfer / flashinfer-cubin

By default the sweep **requires** an environment with **flashinfer-cubin** (e.g. `uv pip install -e "./nemo_retriever[vllm-attention]"` from the repo root). If it is not installed, the script exits with install instructions. Use `--no-require-flashinfer-cubin` to run anyway (e.g. to collect flashinfer_cubin=false rows).

`flashinfer_cubin` is **detected at runtime** and written to the CSV. To get both true/false in one CSV:

1. **With FlashInfer:** run the sweep with `flashinfer-cubin` installed (e.g. `[vllm-attention]` extra).
2. **Without FlashInfer:** `uv pip uninstall flashinfer-cubin`, then run the **same** command and **same** `--output-csv` (with `--no-require-flashinfer-cubin`) so rows are appended.

### Usage

```bash
unset RAY_ADDRESS

uv run python nemo_retriever/scripts/run_comparison_sweep.py \
--pre-embed-dir /path/to/pre_embed_dir \
--query-csv /path/to/query_gt.csv \
--embed-model-path /path/to/llama-nemotron-embed-1b-v2/main \
--output-csv comparison_sweep.csv
```

### Tracking progress (log file)

```bash
unset RAY_ADDRESS
uv run python nemo_retriever/scripts/run_comparison_sweep.py \
--pre-embed-dir /path/to/pre_embed_dir \
--query-csv /path/to/query_gt.csv \
--embed-model-path /path/to/llama-nemotron-embed-1b-v2/main \
--output-csv comparison_sweep.csv \
2>&1 | tee comparison_sweep.log
```

Then in another terminal: `tail -f comparison_sweep.log`.

---

## 3. Plotting (`plot_comparison_sweep.py`)

Reads the CSV produced by the sweep and writes figures (ingest time vs max_rows, ingest vs gpu_util, recall@10, optional FlashInfer impact).

### Usage

```bash
uv run python nemo_retriever/scripts/plot_comparison_sweep.py run \
--input-csv comparison_sweep.csv \
--output-dir ./comparison_plots
```

Requires **pandas** and **matplotlib** (usually already in the env).

---

## 4. Optional: Ray GPU check (`check_ray_gpu.py`)

Prints Ray cluster resources and actors that use or are waiting for GPU. Useful when the comparison is “stuck pending” (no GPU placement).

- To see the comparison’s actors, the comparison must use a **persistent** Ray cluster (`ray start --head --num-gpus=1`) and you must set **RAY_ADDRESS** for both the comparison and this script.
- If the comparison is run **without** `--ray-address` (default), it uses `ray.init("local")` and no separate Ray server exists, so this script cannot attach.

```bash
RAY_ADDRESS=127.0.0.1:6379 uv run python nemo_retriever/scripts/check_ray_gpu.py
```

---

## Running on other machines

1. **Paths:** Replace `/path/to/pre_embed_dir`, `/path/to/query_gt.csv`, and `--embed-model-path` with paths valid on that machine.
2. **Environment:** Use the same Python env as for development (e.g. `uv run` from repo root, or activate the venv and run `python nemo_retriever/scripts/...`).
3. **RAY_ADDRESS:** Run `unset RAY_ADDRESS` before the comparison/sweep so the script starts its own Ray cluster; otherwise you may attach to another user’s cluster and see 0 GPUs.
4. **GPU:** The embed service needs 1 GPU. If Ray reports 0 GPUs, the script will error; ensure the node has a GPU and Ray can see it (e.g. `ray start --head --num-gpus=1` if using a persistent cluster).

---

## Script summary

| Script | Purpose |
|--------|--------|
| `vllm_embedding_comparison.py` | Single compare-from-pre-embed run (baseline + vLLM); optional `--output-csv`. |
| `run_comparison_sweep.py` | Sweep over gpu_util × max_rows; appends to `--output-csv`. |
| `plot_comparison_sweep.py` | Read sweep CSV, write figures to `--output-dir`. |
| `check_ray_gpu.py` | Print Ray GPU/resources and GPU actors (for debugging). |