Local-first RAG CLI that ingests documents, stores embeddings in Qdrant, and answers queries via LMStudio models using LangGraph orchestration.
RAG System

A Retrieval-Augmented Generation (RAG) system that combines document retrieval with language model generation using LMStudio models and LangGraph orchestration.

Features

  • Document Processing: Support for PDF, TXT, Markdown, Word (.docx), and CSV files
  • Vector Storage: Qdrant integration for efficient similarity search
  • Local LLM: LMStudio integration for privacy-preserving generation
  • Workflow Orchestration: LangGraph-based pipeline management
  • Configurable: Flexible configuration system with environment variable support
  • Modular Architecture: Clean separation of concerns for maintainability
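
The retrieve-then-generate flow behind these features can be sketched in a few lines of plain Python. This is a toy stand-in only: cosine similarity over an in-memory list replaces Qdrant, and prompt assembly replaces the LMStudio call.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=2):
    """Return the top_k (score, passage) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for vec, text in index]
    return sorted(scored, reverse=True)[:top_k]

def build_prompt(question, passages):
    """Assemble retrieved passages and the question into an LLM prompt."""
    context = "\n".join(f"- {p}" for _, p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy index: (embedding, passage) pairs standing in for Qdrant search hits
index = [
    ([1.0, 0.0], "Qdrant stores vectors."),
    ([0.0, 1.0], "LMStudio serves local models."),
]
hits = retrieve([0.9, 0.1], index, top_k=1)
prompt = build_prompt("Where are vectors stored?", hits)
```

In the real system, `retrieve` is a Qdrant similarity search and the prompt is sent to the LMStudio endpoint; the data flow is the same.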

Quick Start

Prerequisites

  • Python 3.8 or higher
  • Docker (for Qdrant vector database)
  • LMStudio (for local LLM inference)

Installation

1. Set up Qdrant Vector Database

Run Qdrant using Docker with persistent storage:

# Create a directory for Qdrant data persistence
mkdir qdrant_storage

# Run Qdrant with Docker (with data persistence)
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -p 6334:6334 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
  qdrant/qdrant

Verify Qdrant is running:

curl http://localhost:6333/
# Should return: {"title":"qdrant - vector search engine","version":"1.x.x"}

2. Install and Configure LMStudio

Download and Install LMStudio:

  1. Visit the LMStudio website (https://lmstudio.ai)
  2. Download LMStudio for your operating system (Windows/macOS/Linux)
  3. Run the installer and follow the setup wizard

Download the Required Models:

This project uses Google's Gemma 3 4B Instruct model for generation and the All-MiniLM-L6-v2 model for embeddings. Download both using the CLI:

# Download the specific models used in this project
lms get unsloth/gemma-3-4b-it-GGUF

lms get second-state/All-MiniLM-L6-v2-Embedding-GGUF

# Verify models were downloaded
lms ls

Start the LMStudio Server via CLI:

# Load the model into memory
lms load unsloth/gemma-3-4b-it-GGUF

# Start the server on default port (1234)
lms server start

# Or start with specific configuration
lms server start --port 1234 --cors

# Check server status
lms server status

# View loaded models
lms ps

# Stop the server
lms server stop

3. Install Python Dependencies

  1. Clone the repository:
git clone <repository-url>
cd <repository>
  2. Install the RAG CLI system-wide using pipx (recommended):
# Install pipx if not already installed
# On macOS:
brew install pipx

# On other systems, see: https://pipx.pypa.io/stable/installation/

# Install the RAG CLI
pipx install .

# Ensure pipx is in your PATH
pipx ensurepath

# Restart your terminal or source your shell config
source ~/.zshrc  # or ~/.bashrc

Alternative: Virtual Environment Installation

If you prefer using a virtual environment:

# Create a virtual environment
python -m venv rag-env

# Activate it
# On macOS/Linux:
source rag-env/bin/activate
# On Windows:
# rag-env\Scripts\activate

# Install dependencies and package
pip install -r requirements.txt
pip install -e .

Configuration

Configuration Setup

  1. The system uses a unified configuration file with performance optimizations built-in:
# The main configuration file is already optimized for performance
config/rag_config.yaml
  2. Edit the configuration file to match your setup:
# config/rag_config.yaml
qdrant_host: "localhost"
qdrant_port: 6333
collection_name: "documents"

lmstudio_endpoint: "http://localhost:1234"
model_name: "unsloth/gemma-3-4b-it-GGUF"  # The specific model used in this project

embedding_model: "sentence-transformers/all-MiniLM-L6-v2"

# Performance optimizations are already configured
embedding_batch_size: 64          # Optimized for better GPU utilization
retrieval_enable_cache: true      # Enable query caching
qdrant_prefer_grpc: true          # Use gRPC for better performance

Troubleshooting Setup

If Qdrant won't start:

# Check if port is in use
# On macOS/Linux:
netstat -an | grep :6333
# On Windows:
netstat -an | findstr :6333

# Stop existing container
docker stop qdrant
docker rm qdrant

# Restart with fresh container
docker run -d --name qdrant -p 6333:6333 -p 6334:6334 -v "$(pwd)/qdrant_storage:/qdrant/storage:z" qdrant/qdrant

If LMStudio connection fails:

  1. Ensure LMStudio server is running:
    lms server status
    lms ps  # Check if model is loaded
  2. Check if the correct model is loaded:
    lms ls  # List available models
    lms load unsloth/gemma-3-4b-it-GGUF  # Load the required model
  3. Restart the server if needed:
    lms server stop
    lms server start --port 1234 --cors
  4. Verify model endpoint:
    curl http://localhost:1234/v1/models

Usage

Command Line Interface

# Interactive mode
rag-cli

# Single query
rag-cli query "What is machine learning?"

# Batch processing
rag-cli batch queries.txt --output results.json

# Custom configuration
rag-cli --config production_config.yaml query "Your question"
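
The batch mode's input/output shape can be illustrated with a short sketch. Note this is hypothetical: `run_batch` and the stub answer function are illustrations of the one-query-per-line input and JSON-list output, not the actual CLI internals.

```python
import json
from pathlib import Path

def run_batch(in_path, out_path, answer):
    """Read one query per line, answer each, and write a JSON list of results."""
    queries = [q.strip() for q in Path(in_path).read_text().splitlines() if q.strip()]
    results = [{"query": q, "answer": answer(q)} for q in queries]
    Path(out_path).write_text(json.dumps(results, indent=2))
    return results

# A sample input file, one query per line (as passed to `rag-cli batch queries.txt`)
Path("queries.txt").write_text("What is machine learning?\nWhat is RAG?\n")

# Stub answerer standing in for the full retrieve-and-generate pipeline
results = run_batch("queries.txt", "results.json", lambda q: "stub answer")
```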

Docker Management Commands

# Start Qdrant
docker start qdrant

# Stop Qdrant
docker stop qdrant

# View Qdrant logs
docker logs qdrant

# Remove Qdrant container (keeps data in qdrant_storage/)
docker rm qdrant

# Backup Qdrant data
# On macOS/Linux:
cp -r qdrant_storage qdrant_backup
# On Windows:
xcopy qdrant_storage qdrant_backup /E /I

# Restore Qdrant data
# On macOS/Linux:
cp -r qdrant_backup/* qdrant_storage/
# On Windows:
xcopy qdrant_backup qdrant_storage /E /I /Y

Project Structure

rag_system/
├── __init__.py              # Package initialization
├── config.py                # Configuration management
├── models.py                # Data models
├── logging_config.py        # Logging setup
├── document_processing/     # Document ingestion and processing
├── storage/                 # Vector database integration
├── retrieval/               # Query processing and retrieval
├── generation/              # LMStudio integration and generation
├── orchestration/           # LangGraph workflow management
└── interfaces/              # CLI and API interfaces

Configuration Reference

The system uses a unified configuration file that includes both basic settings and performance optimizations:

Key Configuration Sections:

  1. Qdrant Settings: Vector database connection and performance settings
  2. LMStudio Settings: Local LLM configuration with streaming and connection pooling
  3. Embedding Settings: Model configuration with GPU optimization and batch processing
  4. Retrieval Settings: Search parameters with caching and concurrency limits
  5. Document Processing: File handling with streaming and concurrent processing
  6. Performance Settings: Memory management, async configuration, and optimization flags

Configuration Methods

The system supports configuration through:

  1. YAML file: config/rag_config.yaml (recommended)
  2. Environment variables: Prefix with RAG_ (e.g., RAG_QDRANT_HOST)
  3. Programmatic configuration: Use the RAGConfig class directly
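
The precedence of environment variables over file defaults can be sketched with a minimal `RAGConfig` stand-in. The field names mirror `config/rag_config.yaml`, but the override logic here is an illustration, not the project's actual implementation.

```python
import os
from dataclasses import dataclass, fields

@dataclass
class RAGConfig:
    # Defaults mirror config/rag_config.yaml
    qdrant_host: str = "localhost"
    qdrant_port: int = 6333
    collection_name: str = "documents"

def apply_env_overrides(cfg):
    """Override any field from a RAG_-prefixed environment variable."""
    for f in fields(cfg):
        raw = os.environ.get("RAG_" + f.name.upper())
        if raw is not None:
            setattr(cfg, f.name, f.type(raw))  # cast via the field's declared type
    return cfg

os.environ["RAG_QDRANT_HOST"] = "qdrant.internal"
cfg = apply_env_overrides(RAGConfig())
```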

Performance Optimizations Included

The unified configuration includes built-in performance optimizations:

  • Embedding batch size: Increased to 64 for better GPU utilization
  • Query caching: LRU cache with 128 query capacity
  • gRPC connections: Enabled for Qdrant for better performance
  • Concurrent processing: Optimized limits for file processing and searches
  • Memory management: Garbage collection hints and monitoring thresholds
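
The query cache behaves like a standard LRU memoizer with the documented 128-entry capacity; a minimal sketch (the `cached_search` function is a stand-in, not the project's API):

```python
from functools import lru_cache

@lru_cache(maxsize=128)  # capacity matches the documented query cache
def cached_search(query):
    """Stand-in for an expensive retrieval call; results are memoized."""
    return f"results for {query!r}"

cached_search("what is rag?")   # first call: computed (cache miss)
cached_search("what is rag?")   # repeat call: served from cache (hit)
info = cached_search.cache_info()
```

`cache_info()` exposes the hit/miss counters that the benchmarks below report as cache effectiveness.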

Development

Installation Methods

The RAG CLI can be installed in several ways:

Method 1: pipx (Recommended for CLI tools)

# Install pipx if not already installed
# On macOS:
brew install pipx
# On other systems: https://pipx.pypa.io/stable/installation/

# Install RAG CLI
pipx install .

# Ensure pipx is in PATH
pipx ensurepath

# Restart terminal or source shell config
source ~/.zshrc  # or ~/.bashrc

Method 2: Virtual Environment

python -m venv rag-env
source rag-env/bin/activate  # On Windows: rag-env\Scripts\activate
pip install -e .

Method 3: User Installation

pip install --user .

Note: On modern Python installations (following PEP 668), system-wide pip installation may be restricted. Use pipx or virtual environments instead.

Setup Development Environment

# Install development dependencies
pip install -r requirements-dev.txt

Code Quality Checks

Before committing code, run quality checks:

# Run all quality checks (formatting, linting, type checking, tests)
python pre-commit-check.py

# Auto-fix formatting and import issues
python pre-commit-check.py --fix

# Quick check without tests
python pre-commit-check.py --skip-tests

# Show detailed output
python pre-commit-check.py --verbose

See docs/QUALITY_CHECKS.md for detailed information.

Performance Benchmarking

The system includes comprehensive benchmarking tools to measure and validate performance improvements:

Available Benchmark Scripts

  1. Quick Benchmark (quick_benchmark.py)

    • Fast performance test (< 2 minutes)
    • Tests core functionality and optimizations
    • Provides immediate performance assessment
    python quick_benchmark.py
  2. Comprehensive Benchmark (benchmark_rag_performance.py)

    • Complete performance analysis (5-10 minutes)
    • Tests all system components
    • Detailed metrics and system monitoring
    • Memory usage analysis
    python benchmark_rag_performance.py

Benchmark Results

The benchmarks test:

  • Embedding Performance: Single vs batch processing, throughput metrics
  • Retrieval Performance: Query response times, result quality
  • Cache Effectiveness: Hit/miss ratios, speedup factors
  • Batch Operations: Concurrent vs sequential processing
  • Memory Usage: Peak usage, cleanup efficiency
  • End-to-End Performance: Complete RAG pipeline timing
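
The single-vs-batch embedding comparison can be mimicked with stub embedders that model fixed per-call overhead; the functions and timings here are illustrative only, not the benchmark scripts' actual code.

```python
import time

def embed_one(text):
    """Stub embedder: per-request overhead dominates each call."""
    time.sleep(0.001)  # simulate one round-trip's latency
    return [float(len(text))]

def embed_batch(texts):
    """Stub batch embedder: one overhead charge for the whole batch."""
    time.sleep(0.001)
    return [[float(len(t))] for t in texts]

docs = ["doc %d" % i for i in range(50)]

t0 = time.perf_counter()
single = [embed_one(d) for d in docs]
t_single = time.perf_counter() - t0

t0 = time.perf_counter()
batched = embed_batch(docs)
t_batch = time.perf_counter() - t0
```

With 50 documents, the batched path pays the overhead once instead of 50 times, which is why the configuration raises the embedding batch size to 64.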

Running Tests

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=rag_system --cov-report=html

# Run specific test file
pytest tests/test_config.py -v

# Run specific test
pytest tests/test_config.py::TestRAGConfig::test_default_config -v

Code Style

# Format code
ruff format rag_system/ tests/

# Lint code
ruff check rag_system/ tests/

# Auto-fix linting issues
ruff check --fix rag_system/ tests/

# Type checking
pyright rag_system/
