A Retrieval-Augmented Generation (RAG) system that combines document retrieval with language model generation using LMStudio models and LangGraph orchestration.
- Document Processing: Support for PDF, TXT, Markdown, Word (.docx), and CSV files
- Vector Storage: Qdrant integration for efficient similarity search
- Local LLM: LMStudio integration for privacy-preserving generation
- Workflow Orchestration: LangGraph-based pipeline management
- Configurable: Flexible configuration system with environment variable support
- Modular Architecture: Clean separation of concerns for maintainability
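As a sketch of how the document-processing layer might route the supported formats, here is a hypothetical extension-to-loader dispatch; the names (`SUPPORTED`, `detect_format`) are illustrative, not the package's real registry:

```python
from pathlib import Path

# Hypothetical mapping from file extension to loader name; the real
# registry in document_processing/ may differ.
SUPPORTED = {
    ".pdf": "pdf",
    ".txt": "text",
    ".md": "markdown",
    ".docx": "word",
    ".csv": "csv",
}

def detect_format(path: str) -> str:
    """Return the loader name for a file, or raise for unsupported types."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return SUPPORTED[suffix]
```

Dispatching on the lowercased suffix keeps the check case-insensitive, so `report.PDF` and `report.pdf` resolve to the same loader.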
- Python 3.8 or higher
- Docker (for Qdrant vector database)
- LMStudio (for local LLM inference)
Run Qdrant using Docker with persistent storage:

```bash
# Create a directory for Qdrant data persistence
mkdir qdrant_storage

# Run Qdrant with Docker (with data persistence)
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -p 6334:6334 \
  -v ./qdrant_storage:/qdrant/storage:z \
  qdrant/qdrant
```

Verify Qdrant is running:

```bash
curl http://localhost:6333/
# Should return: {"title":"qdrant - vector search engine","version":"1.x.x"}
```

Download and Install LMStudio:
- Visit the LMStudio website
- Download LMStudio for your operating system (Windows/macOS/Linux)
- Run the installer and follow the setup wizard
Download the Required Models:
This project uses Google's Gemma 3 4B Instruct model. Download it using the CLI:
```bash
# Download the specific models used in this project
lms get unsloth/gemma-3-4b-it-GGUF
lms get second-state/All-MiniLM-L6-v2-Embedding-GGUF

# Verify the models were downloaded
lms ls
```

Start the LMStudio Server via CLI:
```bash
# Load the model into memory
lms load unsloth/gemma-3-4b-it-GGUF

# Start the server on the default port (1234)
lms server start

# Or start with a specific configuration
lms server start --port 1234 --cors

# Check server status
lms server status

# View loaded models
lms ps

# Stop the server
lms server stop
```

- Clone the repository:

```bash
git clone <repository-url>
cd <repository>
```

- Install the RAG CLI system-wide using pipx (recommended):
```bash
# Install pipx if not already installed
# On macOS:
brew install pipx
# On other systems, see: https://pipx.pypa.io/stable/installation/

# Install the RAG CLI
pipx install .

# Ensure pipx is in your PATH
pipx ensurepath

# Restart your terminal or source your shell config
source ~/.zshrc  # or ~/.bashrc
```

Alternative: Virtual Environment Installation
If you prefer using a virtual environment:
```bash
# Create a virtual environment
python -m venv rag-env

# Activate it
# On macOS/Linux:
source rag-env/bin/activate
# On Windows:
# rag-env\Scripts\activate

# Install dependencies and package
pip install -r requirements.txt
pip install -e .
```

- The system uses a unified configuration file with performance optimizations built in:

```
# The main configuration file is already optimized for performance
config/rag_config.yaml
```

- Edit the configuration file to match your setup:
```yaml
# config/rag_config.yaml
qdrant_host: "localhost"
qdrant_port: 6333
collection_name: "documents"

lmstudio_endpoint: "http://localhost:1234"
model_name: "unsloth/gemma-3-4b-it-GGUF"  # The specific model used in this project
embedding_model: "sentence-transformers/all-MiniLM-L6-v2"

# Performance optimizations are already configured
embedding_batch_size: 64       # Optimized for better GPU utilization
retrieval_enable_cache: true   # Enable query caching
qdrant_prefer_grpc: true       # Use gRPC for better performance
```

If Qdrant won't start:
```bash
# Check if the port is in use
# On macOS/Linux:
netstat -an | grep :6333
# On Windows:
netstat -an | findstr :6333

# Stop the existing container
docker stop qdrant
docker rm qdrant

# Restart with a fresh container
docker run -d --name qdrant -p 6333:6333 -p 6334:6334 -v ./qdrant_storage:/qdrant/storage:z qdrant/qdrant
```

If LMStudio connection fails:
- Ensure the LMStudio server is running:

```bash
lms server status
lms ps  # Check if the model is loaded
```

- Check that the correct model is loaded:

```bash
lms ls  # List available models
lms load unsloth/gemma-3-4b-it-GGUF  # Load the required model
```

- Restart the server if needed:

```bash
lms server stop
lms server start --port 1234 --cors
```

- Verify the model endpoint:

```bash
curl http://localhost:1234/v1/models
```
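LMStudio serves an OpenAI-compatible API, so the `/v1/models` body follows the OpenAI models-list shape. A small sketch for checking which models the endpoint reports (the sample payload below is assumed, not captured from a live server):

```python
import json

# Assumed /v1/models response shape, following the OpenAI models-list format
sample_body = '{"object": "list", "data": [{"id": "unsloth/gemma-3-4b-it-GGUF", "object": "model"}]}'

def loaded_model_ids(body: str) -> list:
    """Extract model ids from a /v1/models JSON body."""
    return [entry["id"] for entry in json.loads(body).get("data", [])]
```

You could feed this the body returned by the `curl` command above to confirm the required model appears in the list.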
```bash
# Interactive mode
rag-cli

# Single query
rag-cli query "What is machine learning?"

# Batch processing
rag-cli batch queries.txt --output results.json

# Custom configuration
rag-cli --config production_config.yaml query "Your question"
```

Managing the Qdrant container:

```bash
# Start Qdrant
docker start qdrant

# Stop Qdrant
docker stop qdrant

# View Qdrant logs
docker logs qdrant

# Remove the Qdrant container (keeps data in qdrant_storage/)
docker rm qdrant

# Back up Qdrant data
# On macOS/Linux:
cp -r qdrant_storage qdrant_backup
# On Windows:
xcopy qdrant_storage qdrant_backup /E /I

# Restore Qdrant data
# On macOS/Linux:
cp -r qdrant_backup/* qdrant_storage/
# On Windows:
xcopy qdrant_backup qdrant_storage /E /I /Y
```

```
rag_system/
├── __init__.py              # Package initialization
├── config.py                # Configuration management
├── models.py                # Data models
├── logging_config.py        # Logging setup
├── document_processing/     # Document ingestion and processing
├── storage/                 # Vector database integration
├── retrieval/               # Query processing and retrieval
├── generation/              # LMStudio integration and generation
├── orchestration/           # LangGraph workflow management
└── interfaces/              # CLI and API interfaces
```
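The layers in the tree above compose in a straightforward way: storage answers similarity searches, retrieval queries storage, generation turns query plus context into an answer, and orchestration wires the steps together. A minimal sketch of that composition, with stand-in classes whose names and methods are illustrative rather than the package's actual API:

```python
class InMemoryStore:
    """Stand-in for the Qdrant-backed storage/ layer."""
    def __init__(self, docs):
        self.docs = docs

    def search(self, query, k):
        # Naive keyword match instead of vector similarity
        hits = [d for d in self.docs if any(w in d for w in query.lower().split())]
        return hits[:k]

class Retriever:
    """Stand-in for retrieval/: query processing over the store."""
    def __init__(self, store):
        self.store = store

    def retrieve(self, query, k=3):
        return self.store.search(query, k)

class Generator:
    """Stand-in for generation/: the LMStudio-backed answer step."""
    def generate(self, query, context):
        return f"answer({query}) from {len(context)} chunk(s)"

class RAGPipeline:
    """Stand-in for orchestration/: wires retrieval into generation."""
    def __init__(self, retriever, generator):
        self.retriever = retriever
        self.generator = generator

    def answer(self, query):
        return self.generator.generate(query, self.retriever.retrieve(query))

store = InMemoryStore(["machine learning basics", "cooking tips"])
pipeline = RAGPipeline(Retriever(store), Generator())
```

Keeping each layer behind a small interface like this is what lets the real system swap Qdrant, LMStudio, or the LangGraph workflow independently.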
The system uses a unified configuration file that includes both basic settings and performance optimizations:
Key Configuration Sections:
- Qdrant Settings: Vector database connection and performance settings
- LMStudio Settings: Local LLM configuration with streaming and connection pooling
- Embedding Settings: Model configuration with GPU optimization and batch processing
- Retrieval Settings: Search parameters with caching and concurrency limits
- Document Processing: File handling with streaming and concurrent processing
- Performance Settings: Memory management, async configuration, and optimization flags
The system supports configuration through:
- YAML file: `config/rag_config.yaml` (recommended)
- Environment variables: prefix keys with `RAG_` (e.g., `RAG_QDRANT_HOST`)
- Programmatic configuration: use the `RAGConfig` class directly
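The environment-variable convention can be sketched as follows; this is an assumed illustration of the `RAG_` prefix behavior, and the real `RAGConfig` class may implement it differently:

```python
import os

# Illustrative defaults; the real values live in config/rag_config.yaml
defaults = {"qdrant_host": "localhost", "qdrant_port": 6333}

def apply_env_overrides(config):
    """Override any config key with RAG_<UPPERCASED_KEY> from the environment."""
    merged = dict(config)
    for key, default in config.items():
        raw = os.environ.get("RAG_" + key.upper())
        if raw is not None:
            merged[key] = type(default)(raw)  # cast to the default's type
    return merged

os.environ["RAG_QDRANT_HOST"] = "qdrant.internal"
cfg = apply_env_overrides(defaults)
```

Casting to the default's type means `RAG_QDRANT_PORT=6400` arrives as the integer `6400`, not the string `"6400"`.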
The unified configuration includes built-in performance optimizations:
- Embedding batch size: Increased to 64 for better GPU utilization
- Query caching: LRU cache with 128 query capacity
- gRPC connections: Enabled for Qdrant for better performance
- Concurrent processing: Optimized limits for file processing and searches
- Memory management: Garbage collection hints and monitoring thresholds
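The 128-entry query cache can be pictured with Python's standard `functools.lru_cache`; this is a minimal sketch, and the real retrieval cache presumably keys on more than the raw query string:

```python
from functools import lru_cache

@lru_cache(maxsize=128)  # matches the documented 128-query capacity
def cached_retrieve(query: str):
    """Stand-in for the vector search; only runs on a cache miss."""
    return f"results for {query}"

cached_retrieve("what is ml?")  # miss: runs the search
cached_retrieve("what is ml?")  # hit: served from the cache
info = cached_retrieve.cache_info()
```

Repeated queries skip the embedding and search steps entirely, which is where the documented speedup comes from.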
The RAG CLI can be installed in several ways:

```bash
# Install pipx if not already installed
# On macOS:
brew install pipx
# On other systems: https://pipx.pypa.io/stable/installation/

# Install the RAG CLI
pipx install .

# Ensure pipx is in PATH
pipx ensurepath

# Restart terminal or source shell config
source ~/.zshrc  # or ~/.bashrc
```

Or with a virtual environment:

```bash
python -m venv rag-env
source rag-env/bin/activate  # On Windows: rag-env\Scripts\activate
pip install -e .
```

Or as a user-level install:

```bash
pip install --user .
```

Note: On modern Python installations (following PEP 668), system-wide pip installation may be restricted. Use pipx or a virtual environment instead.
```bash
# Install development dependencies
pip install -r requirements-dev.txt
```

Before committing code, run quality checks:

```bash
# Run all quality checks (formatting, linting, type checking, tests)
python pre-commit-check.py

# Auto-fix formatting and import issues
python pre-commit-check.py --fix

# Quick check without tests
python pre-commit-check.py --skip-tests

# Show detailed output
python pre-commit-check.py --verbose
```

See docs/QUALITY_CHECKS.md for detailed information.
The system includes comprehensive benchmarking tools to measure and validate performance improvements:
- Quick Benchmark (`quick_benchmark.py`)
  - Fast performance test (under 2 minutes)
  - Tests core functionality and optimizations
  - Provides an immediate performance assessment

  ```bash
  python quick_benchmark.py
  ```

- Comprehensive Benchmark (`benchmark_rag_performance.py`)
  - Complete performance analysis (5-10 minutes)
  - Tests all system components
  - Detailed metrics and system monitoring
  - Memory usage analysis

  ```bash
  python benchmark_rag_performance.py
  ```
The benchmarks test:
- Embedding Performance: Single vs batch processing, throughput metrics
- Retrieval Performance: Query response times, result quality
- Cache Effectiveness: Hit/miss ratios, speedup factors
- Batch Operations: Concurrent vs sequential processing
- Memory Usage: Peak usage, cleanup efficiency
- End-to-End Performance: Complete RAG pipeline timing
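The single-vs-batch embedding comparison rests on a simple effect: every embedding call pays a fixed overhead, which batching amortizes. A toy model of that measurement (the `embed` function and its timings are illustrative only, not the real embedding backend):

```python
import time

def embed(texts):
    """Stand-in embedding call: fixed per-call overhead plus per-item cost."""
    time.sleep(0.002 + 0.0005 * len(texts))
    return [[0.0] * 8 for _ in texts]

def throughput(texts, batch_size):
    """Documents embedded per second at a given batch size."""
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        embed(texts[i:i + batch_size])
    return len(texts) / (time.perf_counter() - start)

docs = [f"doc {i}" for i in range(64)]
# A batch size of 64 pays the fixed overhead once instead of 64 times,
# which is the rationale for the embedding_batch_size: 64 setting.
```

The real benchmarks measure the same ratio against the actual embedding model, where the per-call overhead (tokenization setup, GPU kernel launches) is far larger than in this toy.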
```bash
# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=rag_system --cov-report=html

# Run a specific test file
pytest tests/test_config.py -v

# Run a specific test
pytest tests/test_config.py::TestRAGConfig::test_default_config -v
```

```bash
# Format code
ruff format rag_system/ tests/

# Lint code
ruff check rag_system/ tests/

# Auto-fix linting issues
ruff check --fix rag_system/ tests/

# Type checking
pyright rag_system/
```