An intelligent SAT Reading comprehension tutor powered by a fine-tuned, quantized LLaMA 3.2 3B model with LoRA adapters, deployed with Open WebUI, FastAPI, LangChain, and vLLM. This project demonstrates end-to-end machine learning deployment using modern containerization and a microservices architecture for educational AI applications.
- Project Status
- Overview
- Quick Start
- Architecture
- Model configuration for fine-tuning
- API Usage
- License & Citation
- Model Development: LLaMA 3.2 3B + 4-bit quantization + LoRA fine-tuning (73.68% accuracy)
- Core Architecture: FastAPI backend, Gradio frontend, vLLM inference
- Containerization: Docker + Docker Compose orchestration
- API Integration: OpenAI-compatible endpoints
- Performance: prefix caching and vLLM optimizations for throughput and latency
- Benchmarking: integration and performance testing
- Monitoring: Logging, metrics, and error tracking with LangSmith, Loki, Prometheus, and Grafana
- Answer Explanations: Enhance prompts to generate detailed explanations with step-by-step reasoning
- Enhanced Features: Multi-model support, Multi-LoRA
- CI/CD: Kubernetes deployment and cloud integration via ArgoCD
- Model registry: Implement a model registry for versioning various fine-tuned models
- Question bank with RAG: Implement a question bank with Retrieval-Augmented Generation (RAG) for dynamic question retrieval
- Import and Export Features: Allow users to import questions and export practice session results and progress reports
- Data Persistence: Database integration for storing user data and question responses
- Timed Practice: Implement realistic SAT timing constraints and practice modes
- Authentication: Simple user authentication and authorization system
- Rate Limiting: API rate limiting and request throttling to prevent abuse
This system implements a specialized AI tutor that achieves:
- 73.68% accuracy on SAT Reading comprehension tasks (compared to 47.37% baseline), demonstrating a +26.31 percentage point improvement through domain-specific fine-tuning
- 82% VRAM savings through 4-bit quantization, reducing the model's footprint to 2.28 GB of VRAM while maintaining performance
- 6.3x higher throughput with vLLM than with static batching, and a 40% latency reduction via prefix caching, resulting in ~1.8s per question
- LoRA fine-tuning that trains only 24.3M parameters (0.75% of the total), leaving the other 99.25% frozen
- Interactive web interface for SAT practice using Gradio
- RESTful, OpenAI-compatible backend processing with vLLM
- Real-time model comparison (`arena` mode) and preference feedback with Open WebUI
- Accelerated development using LangChain for LLM orchestration
- Production-ready deployment with Docker containerization and microservices architecture
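The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this README (including the 3.24B parameter count given for the base model); the variable names are illustrative.

```python
# Cross-check the headline metrics using only figures quoted in this README.
baseline_acc = 47.37       # % accuracy before fine-tuning
finetuned_acc = 73.68      # % accuracy after fine-tuning
trainable_params = 24.3e6  # LoRA parameters
total_params = 3.24e9      # parameter count quoted for the model

gain_pp = finetuned_acc - baseline_acc         # absolute gain in percentage points
relative_gain = gain_pp / baseline_acc * 100   # gain relative to the baseline
trainable_pct = trainable_params / total_params * 100

print(f"Accuracy gain: +{gain_pp:.2f} pp ({relative_gain:.1f}% relative)")
print(f"Trainable share: {trainable_pct:.2f}% of parameters")
```

Separating the absolute gain (percentage points) from the relative gain avoids a common misreading of the accuracy improvement.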
- GPU: NVIDIA GPU with CUDA support (8GB VRAM or more recommended)
- RAM: 16GB+ system memory for optimal performance
- Storage: 60GB+ available disk space for models and containers
- Docker and Docker Compose installed on your system
- NVIDIA Container Toolkit installed and configured
- CUDA-compatible GPU drivers
Copy the example environment file and customize configuration:
```bash
cp .env.example .env   # Windows (cmd): copy .env.example .env
```

Edit the .env file to customize your deployment:

```bash
# Host Configuration
HOST=0.0.0.0 # Set to 0.0.0.0 for Docker deployment

# Port Configuration
VLLM_PORT=8000 # vLLM inference server
OPEN_WEBUI_PORT=8080 # Open WebUI interface
BACKEND_PORT=8090 # FastAPI backend
FRONTEND_PORT=7860 # Gradio frontend

# Model Configuration
BASE_MODEL=meta-llama/Llama-3.2-3B-Instruct
LORA_MODEL=tiviluson/Llama-3.2-3B-SAT
SAT_LORA_MODEL_NAME=sat-lora
MAX_MODEL_LEN=2048 # Maximum context length

# API Configuration
VLLM_API_KEY=EMPTY
VLLM_API_BASE_URL=http://vllm_app:8000/v1
OPENAI_API_BASE_URL=http://vllm_app:8000/v1
BACKEND_URL=http://backend:8090/submit_mcq
```

Build and start all services:

```bash
docker-compose up
```

Once running, access the services:
- Frontend (Gradio UI): http://localhost:7860 - Main SAT practice interface
- Backend API: http://localhost:8090 - REST API endpoints
- Open WebUI: http://localhost:8080 - Chat-based model interaction
- vLLM API: http://localhost:8000/docs - OpenAI-compatible API docs
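To confirm that the stack came up, the URLs above can be probed from the host. The following is a minimal sketch using only the Python standard library. It assumes the default ports from the example .env file, and the helper names (`service_urls`, `is_up`, `report`) are hypothetical, not part of the project.

```python
import os
import urllib.error
import urllib.request

# Default ports mirror the example .env file; environment variables override them.
DEFAULT_PORTS = {
    "FRONTEND_PORT": 7860,
    "BACKEND_PORT": 8090,
    "OPEN_WEBUI_PORT": 8080,
    "VLLM_PORT": 8000,
}

def service_urls(env=os.environ, host="localhost"):
    """Build one host-side probe URL per service from the port settings."""
    ports = {name: int(env.get(name, default)) for name, default in DEFAULT_PORTS.items()}
    return {
        "frontend": f"http://{host}:{ports['FRONTEND_PORT']}",
        "backend": f"http://{host}:{ports['BACKEND_PORT']}/health",
        "open_webui": f"http://{host}:{ports['OPEN_WEBUI_PORT']}",
        "vllm": f"http://{host}:{ports['VLLM_PORT']}/docs",
    }

def is_up(url, timeout=3.0):
    """Return True if an HTTP server answers at the URL, whatever the status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server responded, e.g. with 401 or 404
    except (urllib.error.URLError, OSError):
        return False  # connection refused, DNS failure, or timeout

def report():
    """Print a one-line status for every service."""
    for name, url in service_urls().items():
        print(f"{name:<11} {url:<40} {'up' if is_up(url) else 'unreachable'}")
```

Calling `report()` after `docker-compose up` gives a quick overview. Note that the vLLM container may take a while to answer on first start while the model loads.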
The application consists of four containerized microservices orchestrated via Docker Compose. Services communicate via internal Docker network sat-network for security and performance.
├── backend_server.py # FastAPI backend implementation
├── frontend.py # Gradio frontend interface
├── sat_chain.py # LangChain orchestration logic
├── docker-compose.yaml # Service orchestration
├── Dockerfile.backend # Backend container definition
├── Dockerfile.frontend # Frontend container definition
├── requirements-backend.txt # Backend Python dependencies
├── requirements-frontend.txt # Frontend Python dependencies
├── .env.example # Environment configuration template
└── LLM_FineTuning_SAT_Reading_042325.ipynb # Fine-tuning notebook
- Image: `vllm/vllm-openai:latest`
- Purpose: High-performance LLM inference server
- Model: LLaMA 3.2 3B with SAT-specific LoRA adapters
- Features:
  - GPU acceleration with NVIDIA runtime
  - LoRA adapter support with runtime updating
  - OpenAI-compatible API endpoints
  - Prefix caching for improved throughput
- Port: 8000
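Prefix caching helps most when many requests begin with the same tokens, for example a fixed instruction header followed by the same passage. The toy sketch below is not vLLM code. It only counts how many fixed-size token blocks two prompts share from the start, which is roughly the granularity at which a block-based KV cache can reuse earlier work. The whitespace tokenizer, the block size of 16, and the sample prompts are assumptions made for illustration.

```python
def shared_prefix_blocks(tokens_a, tokens_b, block_size=16):
    """Count the leading full blocks of `block_size` tokens that are identical."""
    shared = 0
    limit = min(len(tokens_a), len(tokens_b))
    for start in range(0, limit - block_size + 1, block_size):
        if tokens_a[start:start + block_size] != tokens_b[start:start + block_size]:
            break  # the first differing block ends the reusable prefix
        shared += 1
    return shared

# Hypothetical prompts: same instructions and passage, different question.
header = ("You are an SAT Reading tutor . Read the passage and answer . " * 4).split()
passage = ("The river had shaped the valley for centuries . " * 6).split()
prompt_1 = header + passage + "Question : What is the main idea ?".split()
prompt_2 = header + passage + "Question : What does the author imply ?".split()

blocks = shared_prefix_blocks(prompt_1, prompt_2)
print(f"{blocks} shared blocks, {blocks * 16} of {len(prompt_2)} tokens reusable")
```

The practical takeaway is prompt layout: keeping the static parts (instructions, passage) first and the varying part (question, choices) last maximizes the shared prefix.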
- Framework: FastAPI with Uvicorn ASGI server
- Language: Python 3.11
- Purpose: Prompt processing and LLM orchestration
- Features:
  - RESTful API with automatic OpenAPI documentation
  - CORS middleware for cross-origin requests
  - Pydantic models for request/response validation
  - Prompt formatting with LangChain
  - Health check endpoints
- Port: 8090
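The backend formats prompts with LangChain. As a dependency-free stand-in, the sketch below uses plain `str.format` rather than a LangChain prompt template to show the general shape of turning a passage, a question, and four choices into one prompt. The template wording and the `build_prompt` helper are hypothetical; the project's own logic lives in `sat_chain.py` and may differ.

```python
# Hypothetical template; the real wording in sat_chain.py may differ.
PROMPT_TEMPLATE = (
    "Read the passage and answer the multiple-choice question.\n\n"
    "Passage:\n{text}\n\n"
    "Question: {question}\n"
    "Choices:\n{choices}\n\n"
    "Answer:"
)

def build_prompt(text, question, choices):
    """Fill the template; `choices` is a list such as ["A) ...", "B) ..."]."""
    if len(choices) != 4:
        raise ValueError("expected exactly four choices (A-D)")
    return PROMPT_TEMPLATE.format(
        text=text.strip(),
        question=question.strip(),
        choices="\n".join(choices),
    )

prompt = build_prompt(
    text="Reading passage here...",
    question="What is the main idea?",
    choices=["A) Option 1", "B) Option 2", "C) Option 3", "D) Option 4"],
)
print(prompt)
```

Placing the fixed instruction first and the variable fields after it also keeps the prompt friendly to prefix caching on the inference server.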
- Framework: Gradio
- Purpose: Interactive web interface for SAT practice
- Features:
  - Reading passage input (multi-line text area)
  - Question input field
  - Four multiple-choice options (A, B, C, D)
  - Real-time API communication with backend
- Port: 7860
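A form with these fields could be wired up roughly as follows. This is a sketch, not the contents of `frontend.py`: the names `label_choices`, `build_ui`, and `submit_fn` are hypothetical, and Gradio is imported lazily so that the pure helper can be used and tested without it installed.

```python
def label_choices(a, b, c, d):
    """Prefix the four option texts with A) to D), matching the API examples."""
    return [f"{letter}) {text.strip()}" for letter, text in zip("ABCD", (a, b, c, d))]

def build_ui(submit_fn):
    """Create a Gradio form; `submit_fn(passage, question, choices)` returns a string."""
    import gradio as gr  # imported lazily so label_choices works without Gradio

    def on_submit(passage, question, a, b, c, d):
        return submit_fn(passage, question, label_choices(a, b, c, d))

    return gr.Interface(
        fn=on_submit,
        inputs=[
            gr.Textbox(lines=12, label="Reading passage"),
            gr.Textbox(label="Question"),
            gr.Textbox(label="Choice A"),
            gr.Textbox(label="Choice B"),
            gr.Textbox(label="Choice C"),
            gr.Textbox(label="Choice D"),
        ],
        outputs=gr.Textbox(label="Answer"),
        title="SAT Reading Tutor",
    )

print(label_choices("Option 1", "Option 2", "Option 3", "Option 4"))
```

With Gradio installed, `build_ui(my_submit).launch(server_name="0.0.0.0", server_port=7860)` would serve the form on the frontend port, where `my_submit` is whatever function forwards the request to the backend.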
- Image: `ghcr.io/open-webui/open-webui:main`
- Purpose: Alternative chat-based interface for direct model interaction
- Features:
  - ChatGPT-like interface
  - Direct model conversation
  - Conversation history
  - Model comparison (`arena` mode)
  - User feedback collection
- Port: 8080
- Base Model: `meta-llama/Llama-3.2-3B-Instruct` (3.24B parameters)
- Fine-Tuning: LoRA (Low-Rank Adaptation) with 24.3M trainable parameters (0.75% of total)
- Quantization: 4-bit NF4 with double quantization (bfloat16 compute)
- Fine-Tuned Model: `tiviluson/Llama-3.2-3B-SAT`
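The 24.3M figure can be reproduced from the adapter shape. The sketch below assumes the layer dimensions from the base model's published configuration (hidden size 3072, MLP size 8192, 28 layers, 8 key/value heads of dimension 128), which are not stated in this README, together with LoRA rank 16 applied to all seven projection modules.

```python
# LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r),
# so each adapted projection contributes r * (d_in + d_out) parameters.
RANK = 16
NUM_LAYERS = 28   # assumed from the base model's published config
HIDDEN = 3072     # assumed hidden size
MLP = 8192        # assumed intermediate (MLP) size
KV_DIM = 8 * 128  # assumed 8 key/value heads x head dimension 128

# (d_in, d_out) for each target module in one transformer block
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer x {NUM_LAYERS} layers = {total:,} trainable parameters")
```

Under these assumptions the total is 24,313,856, consistent with the 24.3M quoted above. The three MLP projections account for most of it because of their larger dimensions.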
```python
import torch
from transformers import BitsAndBytesConfig

# Quantization Configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

```json
{
  "r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "target_modules": ["up_proj", "o_proj", "q_proj", "gate_proj", "down_proj", "k_proj", "v_proj"]
}
```

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,  # effective batch size of 2 per device
    num_train_epochs=2,
    learning_rate=2e-4,
    fp16=True,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)
```

```bash
curl -X POST "http://localhost:8090/submit_mcq" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Reading passage here...",
    "question": "What is the main idea?",
    "choices": ["A) Option 1", "B) Option 2", "C) Option 3", "D) Option 4"]
  }'
```

```bash
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sat-lora",
    "prompt": "Your prompt here"
  }'
```

Backend (port 8090):
- `POST /submit_mcq` - Process SAT reading comprehension questions
- `GET /health` - Service health check
- `GET /` - Root endpoint with service information
- `GET /docs` - OpenAPI documentation

vLLM (port 8000):
- `POST /v1/completions` - OpenAI-compatible text completion
- `POST /v1/chat/completions` - Chat-based completions
- `GET /v1/models` - Available model information
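The backend curl call above has a direct standard-library equivalent in Python. The sketch below prepares the same `POST /submit_mcq` request and separates building it from sending it, so the payload can be inspected without a running server. The helper names `build_mcq_request` and `send` are hypothetical.

```python
import json
import urllib.request

def build_mcq_request(text, question, choices, base_url="http://localhost:8090"):
    """Prepare (but do not send) a POST /submit_mcq request."""
    payload = {"text": text, "question": question, "choices": choices}
    return urllib.request.Request(
        url=f"{base_url}/submit_mcq",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request, timeout=60.0):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = build_mcq_request(
    "Reading passage here...",
    "What is the main idea?",
    ["A) Option 1", "B) Option 2", "C) Option 3", "D) Option 4"],
)
print(req.get_method(), req.full_url)
# With the stack running: print(send(req)["answer"])
```

The generous default timeout in `send` is a guess that leaves room for model inference; adjust it to the latency you observe.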
```python
from typing import List

from pydantic import BaseModel

# MCQ Request Model
class MCQRequest(BaseModel):
    text: str           # Reading passage
    question: str       # Question text
    choices: List[str]  # Multiple choice options

# MCQ Response Model
class MCQResponse(BaseModel):
    answer: str  # Selected answer with explanation
```

This project is licensed under the MIT License - see the LICENSE file for details.
If you use this project in your research, please cite:
@misc{llm-finetuning-sat-reading,
title={LLM Fine-Tuning for SAT Reading Comprehension},
author={tiviluson},
year={2024},
url={https://github.com/tiviluson/LLM-FineTuning-SAT-Reading}
}


