6 changes: 4 additions & 2 deletions nemo_retriever/Dockerfile
@@ -89,15 +89,17 @@ COPY src src
COPY api api
COPY client client

-# Use conda env's uv; create venv and install retriever in editable mode (path deps: ../src, ../api, ../client)
+# Use conda env's uv; create venv and install retriever with uv pip install (no lock; path deps: ../src, ../api, ../client).
# INSTALL_ASR=1 installs the [asr] extra (transformers>=5, soundfile, scipy) for local Parakeet ASR; omit for vLLM/embedding-only images.
ARG INSTALL_ASR=0
SHELL ["/bin/bash", "-c"]
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=cache,target=/root/.cache/uv \
source /opt/conda/etc/profile.d/conda.sh \
&& conda activate retriever_libcudart \
&& uv venv .retriever \
&& . .retriever/bin/activate \
-&& uv pip install -e ./retriever
+&& if [ "$INSTALL_ASR" = "1" ]; then uv pip install -e "./retriever[asr]"; else uv pip install -e ./retriever; fi

# Default: run in-process pipeline (help if no args)
ENTRYPOINT ["/workspace/.retriever/bin/python", "-m", "retriever.examples.inprocess_pipeline"]
76 changes: 74 additions & 2 deletions nemo_retriever/README.md
@@ -10,7 +10,7 @@ RAG ingestion pipeline for PDFs: extract structure (text, tables, charts, infogr

## Installation

-Installation is done with **UV** from the **nv-ingest root**. UV manages the environment and dependencies; pip is not supported.
+Installation is done with **UV** from the **nv-ingest root** using **uv pip install** (no lockfile or `uv sync`, so optional extras stay independent). Pip is not supported.

From the repo root:

@@ -21,7 +21,7 @@ source .retriever/bin/activate
uv pip install -e ./nemo_retriever
```

-This installs the retriever in editable mode and its in-repo dependencies. Core dependencies (see `nemo_retriever/pyproject.toml`) include Ray, pypdfium2, pandas, LanceDB, PyYAML, torch, transformers, and the Nemotron packages (page-elements, graphic-elements, table-structure). The retriever also depends on the sibling packages `nv-ingest`, `nv-ingest-api`, and `nv-ingest-client` in this repo.
+This installs the retriever in editable mode and its in-repo dependencies. Core dependencies (see `nemo_retriever/pyproject.toml`) include Ray, pypdfium2, pandas, LanceDB, PyYAML, torch, transformers (4.x), vLLM 0.16, and the Nemotron packages (page-elements, graphic-elements, table-structure). The retriever also depends on the sibling packages `nv-ingest`, `nv-ingest-api`, and `nv-ingest-client` in this repo.

### Optional: ASR extra (local Parakeet)

For **local ASR** (nvidia/parakeet-ctc-1.1b with `audio_endpoints` unset), install the `[asr]` extra. This pulls in `transformers>=5`, `soundfile`, and `scipy` and is mutually exclusive with the default stack (vLLM 0.16 uses transformers<5):

```bash
uv pip install -e "./nemo_retriever[asr]"
```
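As a hedged sketch of what the extra enables: read an audio chunk, downmix/resample to 16 kHz mono (the soundfile/scipy part), and transcribe with a transformers pipeline. The exact pipeline call is illustrative; the parakeet-ctc-1.1b model card is the canonical reference.

```python
# Illustrative local Parakeet ASR flow; requires the [asr] extra at runtime.
def transcribe(wav_path):
    import soundfile as sf
    from scipy.signal import resample_poly
    from transformers import pipeline

    audio, sr = sf.read(wav_path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)                 # downmix to mono
    if sr != 16_000:
        audio = resample_poly(audio, 16_000, sr)   # resample to 16 kHz
    asr = pipeline("automatic-speech-recognition",
                   model="nvidia/parakeet-ctc-1.1b")
    return asr({"raw": audio, "sampling_rate": 16_000})["text"]
```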

Docker: build with ASR support using `--build-arg INSTALL_ASR=1`.

### OCR and CUDA 13 runtime

@@ -122,3 +132,65 @@ To stop and remove both stacks:
docker compose -p ingest-gpu0 down
docker compose -p ingest-gpu1 down
```

## Embedding backends

Embeddings can be served by a **remote HTTP endpoint** (NIM, vLLM, or any OpenAI-compatible server) or by a **local HuggingFace model** when no endpoint is configured.

- **Config**: Set `embedding_nim_endpoint` in `ingest-config.yaml` or stage config (e.g. `http://localhost:8000/v1`). Leave empty or null to use the local HF embedder.
- **CLI**: Use `--embed-invoke-url` (inprocess/batch pipelines) or `--embedding-endpoint` / `--embedding-http-endpoint` (recall CLI) to point at a remote server.
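For example, a minimal `ingest-config.yaml` fragment for a remote endpoint (the key name is from this README; any surrounding structure is assumed):

```yaml
# Remote OpenAI-compatible embedding endpoint; leave empty/null to use the
# local HuggingFace embedder instead.
embedding_nim_endpoint: "http://localhost:8000/v1"
```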

### Using vLLM for embeddings

You can serve an embedding model with [vLLM](https://docs.vllm.ai/) and point the retriever at it. vLLM exposes an OpenAI-compatible `/v1/embeddings` API. Set the embedding endpoint to the vLLM base URL (e.g. `http://localhost:8000/v1`).

**vLLM compatibility**: The default NIM-style client sends `input_type` and `truncate` in the request body; some vLLM versions or configs may not accept these. When using a **vLLM** server, enable the vLLM-compatible payload:

- **Ingest**: `--embed-use-vllm-compat` (inprocess pipeline) or set `embed_use_vllm_compat: true` in `EmbedParams`.
- **Recall**: `--embedding-use-vllm-compat` (recall CLI).

This sends only `model`, `input`, and `encoding_format` (minimal OpenAI-compatible payload).
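As a sketch, the two request bodies differ only in the extra NIM fields (function names and the `input_type`/`truncate` values here are illustrative, not the retriever's actual client code):

```python
# Illustrative request bodies for POST <base_url>/embeddings.
def nim_style_payload(model, texts):
    # Default NIM-style body: `input_type` and `truncate` are the fields
    # some vLLM versions/configs reject.
    return {
        "model": model,
        "input": texts,
        "encoding_format": "float",
        "input_type": "passage",   # assumed value
        "truncate": "END",         # assumed value
    }

def vllm_compat_payload(model, texts):
    # vLLM-compatible body: minimal OpenAI-compatible fields only.
    return {"model": model, "input": texts, "encoding_format": "float"}
```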

### llama-nemotron-embed-1b-v2 with vLLM

For **nvidia/llama-nemotron-embed-1b-v2**, follow the model’s official vLLM instructions:

1. Use **vllm==0.11.0**.
2. Clone the [model repo](https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2) and **overwrite `config.json` with `config_vllm.json`** from that repo.
3. Start the server (replace `<path_to_the_cloned_repository>` and `<num_gpus_to_use>`):

```bash
vllm serve \
<path_to_the_cloned_repository> \
--trust-remote-code \
--runner pooling \
--model-impl vllm \
--override-pooler-config '{"pooling_type": "MEAN"}' \
--data-parallel-size <num_gpus_to_use> \
--dtype float32 \
--port 8000
```

4. Set the retriever embedding endpoint to `http://localhost:8000/v1` and use `--embed-use-vllm-compat` / `--embedding-use-vllm-compat` as above.

See the [model README](https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2) for the canonical vLLM setup and client example.
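Step 2 (overwriting `config.json` with `config_vllm.json`) can be sketched as a small helper; path handling is illustrative, and re-cloning undoes it:

```python
# Replace config.json with config_vllm.json inside a local clone of the
# model repo so vLLM loads the vLLM-specific config.
import os
import shutil

def use_vllm_config(repo_dir):
    src = os.path.join(repo_dir, "config_vllm.json")
    dst = os.path.join(repo_dir, "config.json")
    shutil.copyfile(src, dst)
    return dst
```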

### Using vLLM offline batched inference

You can run the same embedding model (e.g. llama-nemotron-embed-1b-v2) **without a vLLM server** by using vLLM’s Python API for batched inference. This loads the model in-process and runs `LLM.embed()` in batches.

- **When to use**: No server to run; same model and behavior as vLLM server; good for batch ingest or recall in a single process.
- **Install**: vLLM is an optional dependency. Install with `uv pip install -e ".[vllm]"` (requires vllm>=0.11.0 for llama-nemotron-embed-1b-v2).
- **Model path**: You can pass a HuggingFace model id (e.g. `nvidia/llama-nemotron-embed-1b-v2`) or a **local path**. For llama-nemotron-embed-1b-v2, a local clone with `config.json` replaced by `config_vllm.json` (from the model repo) may be required for vLLM to load it correctly.
- **Ingest**: Set `embed_use_vllm_offline: true` in `EmbedParams` or use `--embed-use-vllm-offline` in the inprocess pipeline. Optionally set `embed_model_path` (or `--embed-model-path`) to a local model path.
- **Recall**: Use `--embedding-use-vllm-offline` (recall CLI). Optionally `--embedding-vllm-model-path` to override the model path.
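A minimal sketch of the offline path, assuming vllm>=0.11.0 is installed. The `runner="pooling"` argument mirrors the `--runner pooling` server flag; the model id, batch size, and constructor arguments are illustrative and may differ by vLLM version:

```python
# Hedged sketch: load the embedding model in-process and embed in batches.
def batched(items, size):
    """Yield successive chunks of `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_offline(texts, model="nvidia/llama-nemotron-embed-1b-v2", batch_size=256):
    from vllm import LLM  # imported lazily; needs a GPU at runtime
    llm = LLM(model=model, runner="pooling", trust_remote_code=True)
    vectors = []
    for chunk in batched(texts, batch_size):
        for out in llm.embed(chunk):
            vectors.append(out.outputs.embedding)
    return vectors
```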

### Optional: vLLM attention extras (flash-attn, flashinfer-cubin, xformers)

To add prebuilt attention-related packages without changing the project’s torch version, install the `[vllm-attention]` extra:

```bash
uv pip install -e "./nemo_retriever[vllm-attention]"
```

This installs **flashinfer-cubin** (match version to flashinfer-python), **flash-attn** (prebuilt wheel for torch 2.9 + cu12, Linux py312), and **xformers** (0.0.33.x compatible with torch 2.9). It can reduce vLLM startup time (e.g. CUDA graph capture). The `flash-attn` wheel is sourced from GitHub releases; on other platforms you may need to install it separately.
30 changes: 21 additions & 9 deletions nemo_retriever/pyproject.toml
@@ -44,10 +44,9 @@ dependencies = [
"numpy>=1.26.0",
"debugpy>=1.8.0",
"python-multipart>=0.0.9",
-# transformers>=5 enables loading nvidia/parakeet-ctc-1.1b via pipeline (see
-# parakeet-ctc-1.1b README). If using llama_nemotron_embed_1b_v2, verify
-# compatibility with transformers 5 (it previously relied on HybridCache).
-"transformers>=5.0.0",
+# transformers 4.x for vLLM 0.16 and embedding models (e.g. llama_nemotron_embed_1b_v2).
+# Versions 4.54.0-4.55.x have a flash attention bug; exclude that range.
+"transformers>=4.49.0,<5.0.0,!=4.54.*,!=4.55.*",
"tokenizers>=0.20.3",
"accelerate>=1.1.0",
"torch~=2.9.1",
@@ -64,12 +63,23 @@ dependencies = [
"accelerate==1.12.0",
"albumentations==2.0.8",
"open-clip-torch==3.2.0",
-# Local ASR (Parakeet): read chunk files and resample to 16 kHz mono
-"soundfile>=0.12.0",
-"scipy>=1.11.0",
+"vllm==0.16.0",
]

[project.optional-dependencies]
asr = [
# Local ASR (Parakeet nvidia/parakeet-ctc-1.1b): transformers>=5 required by model.
"transformers>=5.0.0",
"soundfile>=0.12.0",
"scipy>=1.11.0",
]
# Optional: prebuilt FlashInfer cubins + flash-attn + xformers for vLLM (faster startup / attention).
# Keeps existing torch~=2.9.1; flash-attn uses a Linux py3.12 wheel from GitHub.
vllm-attention = [
"flashinfer-cubin==0.6.3",
"flash-attn>=2.8.3",
"xformers>=0.0.33,<0.0.34",
]
dev = [
"build>=1.2.2",
"pytest>=8.0.2",
@@ -89,8 +99,10 @@ nemotron-page-elements-v3 = { index = "test-pypi" }
nemotron-graphic-elements-v1 = { index = "test-pypi" }
nemotron-table-structure-v1 = { index = "test-pypi" }
nemotron-ocr = { index = "test-pypi" }
-torch = { index = "torch-cuda"}
-torchvision = { index ="torch-cuda"}
+torch = { index = "torch-cuda" }
+torchvision = { index = "torch-cuda" }
# Prebuilt wheel for torch 2.9 + cu12 (Linux x86_64, py312). Avoids source build and keeps torch as-is.
flash-attn = { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl" }

[[tool.uv.index]]
name = "test-pypi"
142 changes: 142 additions & 0 deletions nemo_retriever/scripts/README.md
@@ -0,0 +1,142 @@
# vLLM vs baseline embedding comparison scripts

Scripts for comparing **local HF (baseline)** and **vLLM offline** embedding on pre-embedded parquet data, measuring ingest time and recall. Use them to run a single comparison or a parameter sweep and plot the results.

**Assumptions:** Run from the **repository root** (parent of `nemo_retriever/`) with the nemo_retriever env (e.g. `uv run`). The machine (or Ray cluster) must have **at least one GPU** for the embed service.

---

## 1. Single-run comparison (`vllm_embedding_comparison.py`)

Compares baseline (HF) and vLLM offline on a pre-embed parquet dir: starts a long-lived embed service per backend, warms it up, runs a **timed** ingest (model load excluded), then recall.

### Usage

```bash
# From repo root; unset RAY_ADDRESS so we start our own Ray cluster
unset RAY_ADDRESS

uv run python nemo_retriever/scripts/vllm_embedding_comparison.py compare-from-pre-embed \
--pre-embed-dir /path/to/pre_embed_dir \
--query-csv /path/to/query_gt.csv \
--embed-model-path /path/to/llama-nemotron-embed-1b-v2/main
```

### Options (often used)

| Option | Description |
|--------|-------------|
| `--max-rows N` | Use only N rows from pre-embed (faster; e.g. 1000). |
| `--gpu-memory-utilization 0.55` | vLLM GPU memory fraction (default 0.55). |
| `--enforce-eager` | vLLM: disable CUDA graphs (slower, avoids some env issues). |
| `--sort-key COL` | Sort by column before limit so baseline and vLLM see the same rows. |
| `--output-csv FILE` | Append one row of metrics to CSV for later plotting. |

### Example (1K rows, with output for sweep)

```bash
unset RAY_ADDRESS
uv run python nemo_retriever/scripts/vllm_embedding_comparison.py compare-from-pre-embed \
--pre-embed-dir /path/to/bo767_pre_embed \
--query-csv data/bo767_query_gt.csv \
--embed-model-path /path/to/llama-nemotron-embed-1b-v2/main \
--max-rows 1000 \
--output-csv comparison_sweep.csv
```

---

## 2. Sweep: grid of runs (`run_comparison_sweep.py`)

Runs `compare-from-pre-embed` over a grid of **gpu_memory_utilization** × **max_rows**, appending one CSV row per run. Useful for collecting data to plot ingest time vs scale and GPU util.

### Grid (defaults)

- **gpu_utils:** 0.4, 0.5, 0.6, 0.7, 0.8
- **max_rows:** 1000, 2000, 5000, 10000

Override with `--gpu-utils 0.4,0.6,0.8` and `--max-rows-list 1000,5000`. To sweep **embed_batch_size** (e.g. 256, 512, 768) with an env that has **flashinfer-cubin**, see [embed_batch_size_sweep_setup.md](embed_batch_size_sweep_setup.md).
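The grid above is a plain cross product; a sketch of how the runs enumerate (the real script's internals may differ):

```python
# Sketch of the sweep's run grid: one run per (gpu_memory_utilization, max_rows),
# each appending one row to --output-csv.
from itertools import product

GPU_UTILS = [0.4, 0.5, 0.6, 0.7, 0.8]
MAX_ROWS_LIST = [1000, 2000, 5000, 10000]

def sweep_grid(gpu_utils=GPU_UTILS, max_rows_list=MAX_ROWS_LIST):
    """Return the (gpu_util, max_rows) pairs the sweep runs, in order."""
    return list(product(gpu_utils, max_rows_list))
```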

### FlashInfer / flashinfer-cubin

By default the sweep **requires** an environment with **flashinfer-cubin** (e.g. `uv pip install -e "./nemo_retriever[vllm-attention]"` from the repo root). If it is not installed, the script exits with install instructions. Use `--no-require-flashinfer-cubin` to run anyway (e.g. to collect flashinfer_cubin=false rows).

`flashinfer_cubin` is **detected at runtime** and written to the CSV. To get both true/false in one CSV:

1. **With FlashInfer:** run the sweep with `flashinfer-cubin` installed (e.g. `[vllm-attention]` extra).
2. **Without FlashInfer:** `uv pip uninstall flashinfer-cubin`, then run the **same** command and **same** `--output-csv` (with `--no-require-flashinfer-cubin`) so rows are appended.

### Usage

```bash
unset RAY_ADDRESS

uv run python nemo_retriever/scripts/run_comparison_sweep.py \
--pre-embed-dir /path/to/pre_embed_dir \
--query-csv /path/to/query_gt.csv \
--embed-model-path /path/to/llama-nemotron-embed-1b-v2/main \
--output-csv comparison_sweep.csv
```

### Tracking progress (log file)

```bash
unset RAY_ADDRESS
uv run python nemo_retriever/scripts/run_comparison_sweep.py \
--pre-embed-dir /path/to/pre_embed_dir \
--query-csv /path/to/query_gt.csv \
--embed-model-path /path/to/llama-nemotron-embed-1b-v2/main \
--output-csv comparison_sweep.csv \
2>&1 | tee comparison_sweep.log
```

Then in another terminal: `tail -f comparison_sweep.log`.

---

## 3. Plotting (`plot_comparison_sweep.py`)

Reads the CSV produced by the sweep and writes figures (ingest time vs max_rows, ingest vs gpu_util, recall@10, optional FlashInfer impact).

### Usage

```bash
uv run python nemo_retriever/scripts/plot_comparison_sweep.py run \
--input-csv comparison_sweep.csv \
--output-dir ./comparison_plots
```

Requires **pandas** and **matplotlib** (usually already in the env).

---

## 4. Optional: Ray GPU check (`check_ray_gpu.py`)

Prints Ray cluster resources and actors that use or are waiting for GPU. Useful when the comparison is “stuck pending” (no GPU placement).

- To see the comparison’s actors, the comparison must use a **persistent** Ray cluster (`ray start --head --num-gpus=1`) and you must set **RAY_ADDRESS** for both the comparison and this script.
- If the comparison is run **without** `--ray-address` (default), it uses `ray.init("local")` and no separate Ray server exists, so this script cannot attach.

```bash
RAY_ADDRESS=127.0.0.1:6379 uv run python nemo_retriever/scripts/check_ray_gpu.py
```

---

## Running on other machines

1. **Paths:** Replace `/path/to/pre_embed_dir`, `/path/to/query_gt.csv`, and `--embed-model-path` with paths valid on that machine.
2. **Environment:** Use the same Python env as for development (e.g. `uv run` from repo root, or activate the venv and run `python nemo_retriever/scripts/...`).
3. **RAY_ADDRESS:** Run `unset RAY_ADDRESS` before the comparison/sweep so the script starts its own Ray cluster; otherwise you may attach to another user’s cluster and see 0 GPUs.
4. **GPU:** The embed service needs 1 GPU. If Ray reports 0 GPUs, the script will error; ensure the node has a GPU and Ray can see it (e.g. `ray start --head --num-gpus=1` if using a persistent cluster).

---

## Script summary

| Script | Purpose |
|--------|--------|
| `vllm_embedding_comparison.py` | Single compare-from-pre-embed run (baseline + vLLM); optional `--output-csv`. |
| `run_comparison_sweep.py` | Sweep over gpu_util × max_rows; appends to `--output-csv`. |
| `plot_comparison_sweep.py` | Read sweep CSV, write figures to `--output-dir`. |
| `check_ray_gpu.py` | Print Ray GPU/resources and GPU actors (for debugging). |