Skip to content

vishalkakadiya/docwizard

Repository files navigation

DocWizard

A fully local Retrieval-Augmented Generation (RAG) system that lets you chat with your PDF documents. No cloud, no API keys, no data leaving your machine.

Built with ChromaDB, nomic-embed-text, and Ollama.


What is RAG?

Large Language Models (LLMs) are trained on general knowledge — they know nothing about your documents. RAG solves this by retrieving relevant content from your files at query time and injecting it into the LLM's prompt as context, so the model answers based on your data rather than hallucinating.

PDF → chunk → embed → vector store     (ingestion)
Question → embed → similarity search → top chunks → LLM → answer   (query)

DocWizard implements this pipeline entirely locally using Ollama.


Architecture

Ingestion pipeline

PDF file
  │
  ▼
PyPDFLoader          — extracts raw text page by page
  │
  ▼
_dedupe_text()       — removes duplicate text layers (common in low-quality PDFs)
  │
  ▼
RecursiveCharacterTextSplitter  — splits into 500-char chunks with 50-char overlap
  │                               overlap preserves context across chunk boundaries
  ▼
nomic-embed-text     — converts each chunk into a 768-dimensional vector
  │                    prefixed with "search_document:" for task-aware embedding
  ▼
ChromaDB             — persists vectors with cosine similarity index (HNSW)
                       MD5 hash of filename:page:index used as chunk ID to prevent duplicates

Query pipeline

User question
  │
  ▼
nomic-embed-text     — embeds the question (prefixed "search_query:" for asymmetric retrieval)
  │
  ▼
ChromaDB ANN search  — approximate nearest-neighbour search over stored vectors
  │                    returns TOP_K=8 most semantically similar chunks
  ▼
Keyword re-rank      — lightweight second pass: boosts chunks that contain
  │                    the question's keywords, compensating for embedding model
  │                    limitations on topically similar documents
  ▼
Top 3 chunks         — only the most relevant chunks sent to the LLM
  │                    (small models degrade with too much context)
  ▼
llama3.2:1b          — generates a grounded answer from the retrieved context

Key design decisions

Decision Choice Reason
Distance metric Cosine similarity nomic-embed-text vectors are not unit-normalised (L2 norm ≈ 19–22), so Euclidean distance gives wrong rankings
Task-aware embeddings search_query: / search_document: prefixes nomic-embed-text is an asymmetric model — query and document embeddings live in different sub-spaces without prefixes
Hybrid retrieval Semantic (ChromaDB) + keyword re-rank Compensates for quantised small embedding models that cluster similar topics too closely
Retrieval width vs LLM context TOP_K=8 retrieve, LLM_TOP_K=3 send Wide retrieval improves recall; narrow LLM context prevents 1B models from losing the relevant chunk
Module separation ingest.py, query.py, embeddings.py have zero Streamlit coupling Clean path to swap the UI for FastAPI without touching any business logic

Prerequisites

  • Python 3.10+
  • Ollama installed and running

Start the Ollama server (must be running before you use the app):

ollama serve

On macOS, Ollama also starts automatically if you launched the desktop app from Applications. On Linux you always need to run ollama serve manually. Leave this terminal open.

Pull the two required models (one-time, run in a new terminal):

ollama pull nomic-embed-text   # embeddings
ollama pull llama3.2:1b        # LLM for answers

Setup

1. Clone the repo

git clone git@github.com:vishalkakadiya/docwizard.git
cd docwizard

2. Create a virtual environment

python3 -m venv .venv
source .venv/bin/activate       # Windows: .venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

Run

streamlit run app.py

Open http://localhost:8501 in your browser.


Usage

  1. Upload — drag a PDF into the sidebar and click Ingest PDFs
  2. Ask — type any question in the chat box
  3. Clear — use 🗑️ Clear all chunks in the sidebar to reset and ingest a new file

To inspect what is stored in ChromaDB:

python inspect_db.py

Run tests

pytest tests/

All 7 tests are fully mocked — no Ollama or ChromaDB instance required.


Project structure

docwizard/
├── app.py            # Streamlit UI (zero business logic)
├── config.py         # Single source of truth for all settings
├── embeddings.py     # Ollama embedding wrapper (task-type aware)
├── ingest.py         # PDF loading, deduplication, chunking, vector storage
├── query.py          # Semantic search, keyword re-rank, LLM answer generation
├── inspect_db.py     # Dev utility: inspect ChromaDB contents
├── requirements.txt
└── tests/            # Unit tests (mocked, no Ollama needed)

Configuration

All tuneable settings live in config.py:

Setting Default Description
EMBED_MODEL nomic-embed-text Ollama embedding model
LLM_MODEL llama3.2:1b Ollama LLM for answers
CHUNK_SIZE 500 Characters per chunk
CHUNK_OVERLAP 50 Overlap between chunks
TOP_K 8 Chunks retrieved from ChromaDB (semantic search)
LLM_TOP_K 3 Chunks passed to the LLM after re-ranking

Tech stack

Layer Technology
UI Streamlit
Vector store ChromaDB (persistent, cosine similarity)
Embeddings nomic-embed-text via Ollama
LLM llama3.2:1b via Ollama
PDF parsing LangChain PyPDFLoader + RecursiveCharacterTextSplitter
Runtime 100% local — no external API calls

About

Document wizard - A fully local Retrieval-Augmented Generation (RAG) system that lets you chat with your PDF documents. No cloud, no API keys, no data leaving your machine.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages