Hypothetical Questions Answer

Hypothetical Questions Answer is a specialized tool designed for Retrieval-Augmented Generation (RAG) systems that transforms text chunks into high-quality hypothetical question-answer pairs using Large Language Models (LLMs). The system intelligently analyzes input text and generates contextually relevant questions with comprehensive answers, enhancing RAG systems by creating synthetic data that improves retrieval performance and query understanding.

Built on modern FastAPI architecture with seamless Ollama integration, this production-ready solution leverages state-of-the-art language models to understand content semantics and generate relevant question-answer pairs. The system employs sophisticated prompt engineering and quality control mechanisms to ensure generated content maintains semantic accuracy and contextual relevance to the source material.

The system is designed for scalable deployment in RAG pipelines, vector database enrichment, and AI assistants. Whether you're processing knowledge base documents, technical documentation, or domain-specific content, the Hypothetical QA Generator provides a robust solution for creating synthetic question-answer pairs that improve retrieval performance at scale.

Key Features

  • Hypothetical Q&A Generation: Creates contextually relevant questions with comprehensive answers from text chunks for RAG systems
  • Content-Aware Processing: The LLM automatically adapts question style and complexity to the domain and content type of each text chunk (scientific, technical, narrative, business, etc.)
  • LLM Integration: Seamless integration with Ollama for local LLM deployment and model flexibility
  • Quality Control System: Advanced parsing, validation, and quality scoring for generated content
  • Parallel Processing: Concurrent chunk processing with configurable worker pools for optimal performance
  • Production-Ready API: FastAPI-based REST interface with automatic documentation, validation, and error handling
  • Docker Support: Complete containerization with CPU, GPU, and external-Ollama deployment profiles
  • API Token Authentication: Optional Bearer token protection for POST /process-chunks/ via the API_TOKEN environment variable
  • External Ollama Support: Connect to any existing Ollama instance with optional API key forwarding (cpu-external / gpu-external profiles)
  • Comprehensive Testing: Full test suite with pytest integration for reliability assurance
  • Flexible Configuration: Extensive parameter control for model selection, generation settings, and processing options
  • Two-Level Retry System: Independent retry loops for network/connection failures (QA_APP_LLM_MAX_RETRIES) and for wrong Q&A pair counts (max_retries API parameter), providing robust error handling at both levels
  • Detailed Metadata: Comprehensive processing statistics, quality metrics, and performance analytics
  • Custom Prompt Support: Ability to supply custom instructions for specialized RAG use cases

How the Q&A Generation Algorithm Works

The Processing Pipeline

The Hypothetical QA Generator implements a sophisticated multi-stage pipeline that combines advanced prompt engineering with LLM-powered content generation:

  1. Input Processing: The API accepts JSON documents containing text chunks through the /process-chunks/ endpoint
  2. Content Analysis: Each text chunk is analyzed for content type, complexity, and RAG enhancement potential
  3. Intelligent Prompting: The system applies specialized prompts based on content characteristics (scientific, narrative, technical, etc.)
  4. Parallel Generation: Multiple chunks are processed concurrently using configurable worker pools for optimal throughput
  5. Q&A Extraction: Generated content is parsed using sophisticated regex patterns to extract clean question-answer pairs
  6. Quality Validation: Each Q&A pair undergoes comprehensive quality checks including completeness, relevance, and retrieval effectiveness
  7. Retry Logic: Two independent retry loops handle failures — one retries the LLM call when the wrong number of Q&A pairs is produced (max_retries API parameter), another retries on network/connection errors with exponential backoff (QA_APP_LLM_MAX_RETRIES)
  8. Metadata Collection: Detailed statistics are collected including processing times, quality scores, and generation success rates
  9. Response Assembly: All Q&A pairs are combined with comprehensive metadata for client consumption

Intelligent Content Analysis

The following is an illustrative pseudocode sketch of the processing flow. The functions analyze_content_type() and get_specialized_prompt() do not exist as separate functions in the codebase — content adaptation happens implicitly through the LLM responding to rule #2 of the default instructions ("Adapt questions to the domain and content type of the text").

# Illustrative pseudocode — not actual source code
def qa_generation_process(chunks):
    processed_chunks = []
    
    # Parallel processing of chunks
    for chunk in chunks:
        # Analyze content type and apply appropriate prompt
        content_type = analyze_content_type(chunk.text)
        specialized_prompt = get_specialized_prompt(content_type)
        
        # Generate Q&A pairs with retry logic
        for attempt in range(max_retries):
            try:
                # Send to LLM with optimized prompt
                response = generate_with_llm(
                    prompt=specialized_prompt.format(
                        chunk=chunk.text,
                        num_pairs=n_qa_pairs
                    ),
                    model=llm_model,
                    temperature=temperature
                )
                
                # Parse and validate Q&A pairs
                qa_pairs = parse_qa_response(response)
                quality_score = calculate_quality_score(qa_pairs)
                
                if quality_score >= quality_threshold:
                    processed_chunks.extend(qa_pairs)
                    break
                    
            except (NetworkError, ParseError) as e:
                if attempt == max_retries - 1:
                    log_failed_chunk(chunk, e)
                else:
                    time.sleep(exponential_backoff(attempt))
    
    return {
        "chunks": processed_chunks,
        "metadata": compile_processing_metadata()
    }

This approach ensures high-quality synthetic data generation for RAG systems while maintaining robustness and performance in production environments.

Quality Control and Validation

The Hypothetical QA Generator implements comprehensive quality assurance mechanisms (an illustrative validation sketch follows the list):

  • Content Validation: Ensures generated questions are complete, grammatically correct, and retrieval-optimized
  • Answer Verification: Validates that answers are comprehensive, accurate, and properly formatted
  • Semantic Consistency: Checks that Q&A pairs maintain semantic relationship with source content
  • RAG Standards: Applies retrieval principles to ensure questions improve search performance
  • Format Compliance: Validates proper question/answer formatting and structure
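
The checks above can be approximated with a lightweight validator. The sketch below is illustrative only; the function name, thresholds, and heuristics are assumptions rather than the project's actual implementation:

# Illustrative sketch — names and thresholds are assumptions, not the actual implementation
import re

def validate_qa_pair(question: str, answer: str, source_text: str) -> dict:
    """Apply simple completeness, format, and relevance checks to one Q&A pair."""
    issues = []

    # Format compliance: the question should be a real question, not a stub
    if not question.strip().endswith("?"):
        issues.append("question_missing_question_mark")
    if len(question.split()) < 4:
        issues.append("question_too_short")

    # Answer completeness: require a few sentences of self-contained content
    if len(answer.split()) < 15:
        issues.append("answer_too_short")

    # Semantic consistency (rough proxy): the answer should share vocabulary with the source
    source_terms = set(re.findall(r"[a-zA-Z]{5,}", source_text.lower()))
    answer_terms = set(re.findall(r"[a-zA-Z]{5,}", answer.lower()))
    overlap = len(source_terms & answer_terms) / max(len(answer_terms), 1)
    if overlap < 0.2:
        issues.append("low_overlap_with_source")

    return {"valid": not issues, "issues": issues, "source_overlap": round(overlap, 2)}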

Comparison with Traditional Q&A Generation

Feature | Traditional Q&A Tools | Hypothetical QA Generator
Content Understanding | Rule-based or template-driven | LLM-powered semantic analysis
Question Quality | Basic extraction or simple patterns | RAG-optimized with quality scoring
Content Adaptation | One-size-fits-all approach | Specialized prompts for different content types
Error Handling | Limited retry mechanisms | Robust retry logic with exponential backoff
Scalability | Sequential processing | Parallel processing with worker pools
Customization | Fixed templates | Flexible prompt engineering and parameter control
Production Readiness | Basic functionality | Comprehensive monitoring, logging, and health checks

Advantages of the Solution

RAG Enhancement Value

The Hypothetical QA Generator creates high-quality synthetic data for RAG systems:

  • Contextual Relevance: Questions are generated based on actual content meaning, not just keyword matching
  • RAG Optimization: Prompts are designed to generate retrieval-friendly questions with comprehensive answers
  • Content Type Awareness: The LLM adapts question style to narrative, scientific, technical, and other content types based on prompt instructions
  • Semantic Focus: Emphasizes understanding and analysis to improve retrieval performance
  • Quality Assurance: Multi-layered validation ensures retrieval effectiveness

Production Performance

The system is optimized for enterprise-grade deployment:

  • Concurrent Processing: Parallel chunk processing with configurable worker pools
  • Robust Error Handling: Comprehensive retry mechanisms and graceful failure management
  • Resource Optimization: Intelligent allocation of CPU cores and memory usage
  • Health Monitoring: Built-in health checks and comprehensive logging for production monitoring
  • Scalable Architecture: Stateless API design enables horizontal scaling

Flexibility and Customization

The Hypothetical QA Generator adapts to diverse RAG use cases:

  • Model Selection: Support for various LLM models through Ollama integration
  • Parameter Control: Fine-tune generation temperature, context windows, and retry behavior
  • Custom Prompting: Ability to provide custom prompt templates for specialized content
  • Integration Options: REST API allows integration with any programming language or platform
  • Deployment Flexibility: Run locally, in containers, or in cloud environments

Installation and Deployment

Prerequisites

  • Docker and Docker Compose (for Docker deployment)
  • Python 3.10+ (for local installation)
  • Ollama installed and running (for LLM functionality)
  • 4GB+ RAM recommended for optimal performance
  • Optional: NVIDIA GPU for accelerated Ollama performance

Getting the Code

Before proceeding with any installation method, clone or download the project:

# If using git
git clone https://github.com/smart-models/Hypothetical-QA.git
cd Hypothetical-QA

# Or download and extract the project files

Local Installation with Uvicorn

  1. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Linux/Mac

    For Windows users:

    • Using Command Prompt:
    venv\Scripts\activate.bat
    • Using PowerShell:
    # If you encounter execution policy restrictions, run this once per session:
    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process
    
    # Then activate the virtual environment:
    venv\Scripts\Activate.ps1
  2. Install dependencies:

    pip install -r requirements.txt
  3. Run the FastAPI server:

    uvicorn hypothetical_qa_api:app --reload --port 8003
  4. The API will be available at http://localhost:8003.

    Access the API documentation and interactive testing interface at http://localhost:8003/docs.

Docker Deployment (Recommended)

Option A: Pre-built Image from GitHub Container Registry

The easiest way to deploy is using our pre-built Docker images published to GitHub Container Registry.

Pull the latest image:

docker pull ghcr.io/smart-models/hypothetical-qa:latest

Run with GPU acceleration (recommended, requires NVIDIA GPU + drivers):

docker run -d \
  --name hypothetical-qa \
  --gpus all \
  -p 8003:8000 \
  -v $(pwd)/logs:/app/logs \
  ghcr.io/smart-models/hypothetical-qa:latest

Windows PowerShell:

docker run -d `
  --name hypothetical-qa `
  --gpus all `
  -p 8003:8000 `
  -v ${PWD}/logs:/app/logs `
  ghcr.io/smart-models/hypothetical-qa:latest

Run on CPU only (fallback for systems without GPU):

docker run -d \
  --name hypothetical-qa \
  -p 8003:8000 \
  -v $(pwd)/logs:/app/logs \
  ghcr.io/smart-models/hypothetical-qa:latest

Use a specific version (recommended for production):

# Replace v1.0.0 with your desired version
docker pull ghcr.io/smart-models/hypothetical-qa:v1.0.0
docker run -d --gpus all -p 8003:8000 \
  -v $(pwd)/logs:/app/logs \
  ghcr.io/smart-models/hypothetical-qa:v1.0.0

Verify the service is running:

curl http://localhost:8003/

Stop and remove the container:

docker stop hypothetical-qa
docker rm hypothetical-qa

Option B: Build from Source with Docker Compose

  1. Create required directories for persistent storage:

    # Linux/macOS
    mkdir -p logs
    
    # Windows CMD
    mkdir logs
    
    # Windows PowerShell
    New-Item -ItemType Directory -Path logs -Force
  2. Deploy with Docker Compose:

    CPU-only deployment:

    cd docker
    docker compose --profile cpu up -d

    GPU-accelerated deployment (requires NVIDIA GPU and Docker GPU support):

    cd docker
    docker compose --profile gpu up -d

    Stopping the service:

    # To stop CPU deployment
    docker compose --profile cpu down
    
    # To stop GPU deployment
    docker compose --profile gpu down
    
    # To stop external-Ollama deployments
    docker compose --profile cpu-external down
    docker compose --profile gpu-external down
  3. The API will be available at http://localhost:8003.

Using an external Ollama instance

If you already have Ollama running (local network, cloud VM, managed service, etc.), use the cpu-external or gpu-external profile so the bundled Ollama containers are not started.

Set QA_APP_OLLAMA_BASE_URL in docker/.env:

QA_APP_OLLAMA_BASE_URL=http://192.168.1.10:11434

Then start the QA API:

cd docker

# CPU (QA API runs on CPU, Ollama is external)
docker compose --profile cpu-external up -d

# GPU (QA API uses GPU, Ollama is external)
docker compose --profile gpu-external up -d

If your external Ollama instance requires authentication, also set QA_APP_OLLAMA_API_KEY in docker/.env:

QA_APP_OLLAMA_API_KEY=your-token-here

The API key is forwarded as Authorization: Bearer <key> on every request to Ollama.

Ollama Setup

For Docker Deployment

If you're using the Docker deployment method, Ollama is automatically included in the docker-compose configuration. The docker-compose.yml file defines dedicated ollama-cpu and ollama-gpu services that:

  • Use the official ollama/ollama:latest image
  • Are optimized for CPU or GPU operation respectively
  • Have automatic model management and caching
  • Are configured to work seamlessly with the QA API service

No additional Ollama setup is required when using Docker deployment.

For Local Installation

If you're using the local installation method with Uvicorn, you must set up Ollama separately before running the QA Generator:

  1. Install Ollama:

    # Linux
    curl -fsSL https://ollama.ai/install.sh | sh
    
    # macOS
    brew install ollama
    
    # Windows
    # Download from https://ollama.ai/download
  2. Start Ollama service:

    ollama serve
  3. Pull required model (default: gemma3:4b):

    ollama pull gemma3:4b

The QA API will connect to Ollama at http://localhost:11434 by default. You can change this by setting the QA_APP_OLLAMA_BASE_URL environment variable.

Using the API

API Endpoints

  • POST /process-chunks/ Processes text chunks and generates question-answer pairs.

    Authentication: When API_TOKEN is set, requests must include Authorization: Bearer <token>. Omitting or supplying a wrong token returns HTTP 403. Authentication is disabled when API_TOKEN is empty.

    File size limit: Maximum upload size is 50 MB. Larger files are rejected with HTTP 413.

    Parameters:

    • file: JSON file containing text chunks to be processed
    • llm_model: LLM model to use for Q&A generation (string, default: "gemma3:4b")
    • temperature: Controls randomness in LLM output (float, default: 0.2)
    • context_window: Maximum context window size for LLM (integer, default: 16384)
    • custom_instructions: Optional custom instructions for Q&A generation; the text chunk is appended automatically (string, optional)
    • n_qa_pairs: Number of Q&A pairs to generate per chunk (integer, default: 3, range: 1-10)
    • max_retries: Maximum retry attempts for wrong Q&A pair counts (integer, default: 3, range: 1-10; the actual floor/ceiling can be overridden server-side via QA_APP_MIN_RETRIES / QA_APP_MAX_RETRIES_LIMIT)
    • chunk_metadata_json: JSON string containing custom metadata to merge into each generated QA item (string, optional)

    Expected JSON Input Format:

    {
      "chunks": [
        {
          "text": "First chunk of text content to generate questions from...",
          "id": 1,
          "token_count": 150
        },
        {
          "text": "Second chunk of text content...",
          "id": 2,
          "token_count": 200
        }
      ]
    }
  • GET /
    Health check endpoint returning service status, API version, and Ollama connectivity status.

Example API Call

Note: All parameters (except file) are passed as query string parameters, not as a JSON body.

Using cURL:

# Basic usage
curl -X POST "http://localhost:8003/process-chunks/" \
  -F "file=@document.json" \
  -H "accept: application/json"

# With custom parameters
curl -X POST "http://localhost:8003/process-chunks/?llm_model=gemma3:4b&temperature=0.3&n_qa_pairs=5" \
  -F "file=@document.json" \
  -H "accept: application/json"

# With API token authentication
curl -X POST "http://localhost:8003/process-chunks/" \
  -F "file=@document.json" \
  -H "accept: application/json" \
  -H "Authorization: Bearer your-token-here"

Using Python:

import requests
import json

# API endpoint
api_url = 'http://localhost:8003/process-chunks/'
file_path = 'document.json'

# Prepare the document
document = {
    "chunks": [
        {
            "text": "Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed.",
            "id": 1,
            "token_count": 25
        },
        {
            "text": "Deep learning uses neural networks with multiple layers to model and understand complex patterns in data.",
            "id": 2,
            "token_count": 20
        }
    ]
}

# Save to file
with open(file_path, 'w') as f:
    json.dump(document, f)

# Make the request
try:
    with open(file_path, 'rb') as f:
        files = {'file': (file_path, f, 'application/json')}
        params = {
            'llm_model': 'gemma3:4b',
            'temperature': 0.2,
            'n_qa_pairs': 3,
            'max_retries': 2
        }
        
        response = requests.post(api_url, files=files, params=params)
        response.raise_for_status()
        
        result = response.json()
        
        print(f"Generated {result['metadata']['total_qa_pairs']} Q&A pairs from {result['metadata']['total_chunks']} chunks")
        print(f"Average quality score: {result['metadata']['average_quality_score']:.2f}")
        print(f"Processing time: {result['metadata']['processing_time']:.2f}s")
        
        # Display sample Q&A pairs
        for i, qa in enumerate(result['chunks'][:3]):
            print(f"\nQ&A Pair {i+1}:")
            print(f"Source Chunk ID: {qa['id_source']}")
            print(f"Content: {qa['text'][:200]}...")
        
except Exception as e:
    print(f"Error: {e}")

Response Format

A successful Q&A generation returns the following structure:

{
  "chunks": [
    {
      "text": "What is machine learning?\nMachine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed, allowing systems to automatically improve their performance on a specific task through data analysis.",
      "id": 1,
      "token_count": 45,
      "id_source": 1
    },
    {
      "text": "How does deep learning differ from traditional machine learning?\nDeep learning uses neural networks with multiple layers to model and understand complex patterns in data, enabling automatic feature extraction and representation learning, whereas traditional machine learning often requires manual feature engineering.",
      "id": 2,
      "token_count": 52,
      "id_source": 2
    }
  ],
  "metadata": {
    "total_chunks": 2,
    "number_qa": 3,
    "total_qa_pairs": 6,
    "failed_parses": 0,
    "average_quality_score": 0.92,
    "quality_issues": {
      "empty_chunks": 0,
      "short_chunks": 0,
      "minimal_content": 0,
      "special_chars_only": 0,
      "no_output_generated": 0
    },
    "llm_model": "gemma3:4b",
    "temperature": 0.2,
    "context_window": 16384,
    "custom_prompt_used": false,
    "source": "document.json",  // reflects the uploaded file's name
    "processing_time": 15.7
  }
}

Configuration

The QA Generator can be configured through environment variables (for Docker deployments) or a local .env file. The table below lists configuration options with their default values:

Variable | Description | Default
API_TOKEN | Bearer token for POST /process-chunks/ authentication. Leave empty to disable. | (disabled)
QA_APP_OLLAMA_BASE_URL | Base URL of the Ollama API server | http://localhost:11434
QA_APP_OLLAMA_API_KEY | API key for authenticated external Ollama instances (cpu-external / gpu-external profiles). Forwarded as Authorization: Bearer <key> to Ollama. Leave empty if not required. | (disabled)
QA_APP_LLM_MODEL | Default LLM model used for Q&A generation | gemma3:4b
QA_APP_TEMPERATURE | Default sampling temperature for the LLM | 0.2
QA_APP_CONTEXT_WINDOW | Maximum token window supplied to the LLM | 16384
QA_APP_LLM_MAX_RETRIES | Default maximum retry attempts for failed generations | 3
QA_APP_LLM_BASE_DELAY | Base delay in seconds for exponential backoff between retries | 2.0
QA_APP_LLM_TIMEOUT | Timeout in seconds for each LLM request | 360
QA_APP_N_QA_PAIRS | Default number of Q&A pairs to generate per chunk | 3
QA_APP_LLM_MAX_WORKERS | Number of concurrent LLM worker threads | 2
QA_APP_OLLAMA_NUM_THREAD | CPU threads used by Ollama for inference | 8
QA_APP_OLLAMA_NUM_GPU | GPU layers offloaded to the GPU (99 = all) | 99
QA_APP_OLLAMA_NUM_PREDICT | Max output tokens per LLM generation | 4096
QA_APP_MIN_RETRIES | Minimum allowed value for the max_retries API parameter (floor clamp) | 1
QA_APP_MAX_RETRIES_LIMIT | Maximum allowed value for the max_retries API parameter (ceiling clamp) | 10
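
For example, a docker/.env that points the service at an external, authenticated Ollama instance and raises concurrency might look like the following (all values are illustrative, not tuned recommendations):

# Illustrative docker/.env — values are examples, not tuned recommendations
API_TOKEN=change-me
QA_APP_OLLAMA_BASE_URL=http://192.168.1.10:11434
QA_APP_OLLAMA_API_KEY=your-token-here
QA_APP_LLM_MODEL=gemma3:4b
QA_APP_TEMPERATURE=0.2
QA_APP_LLM_MAX_WORKERS=4
QA_APP_LLM_TIMEOUT=360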

Custom Prompt Templates

The QA Generator allows you to customize the Q&A generation behavior by providing your own instructions through the custom_instructions parameter. The text chunk is automatically appended, so you only need to describe how the LLM should generate questions and answers.

Default Prompt Template

Below is the default prompt template used by the QA Generator. You can use this as a starting point for creating your own custom prompts:

DEFAULT_INSTRUCTIONS = """You are an expert content analyst. Generate question-answer pairs from the provided text.

RULES:
1. Derive ALL information strictly from the text. No external knowledge. No hallucinations.
2. Adapt questions to the domain and content type of the text.
3. Prioritize "why" and "how" questions that test understanding over simple recall.
4. Each question must test a DISTINCT concept — no redundancy.
5. Answers must be self-contained (2-4 sentences), accurate, and include specific details from the text.
6. Write in the SAME LANGUAGE as the source text.

QUESTION MIX:
- Conceptual: principles, frameworks, key ideas
- Analytical: cause-effect, comparisons, implications
- Applied: examples, techniques, practical applications
- Factual: definitions, key data points (use sparingly)

AVOID:
- Trivial or overly specific details
- Yes/no or single-word answers
- Questions requiring the exact text to answer

OUTPUT FORMAT:
Respond ONLY with <Q></Q> and <A></A> tag pairs. No other text.

<Q>Why does the lean startup methodology emphasize rapid iteration over extensive planning?</Q>
<A>The methodology prioritizes rapid iteration because it minimizes wasted resources on untested assumptions. By building small experiments, measuring real customer responses, and learning from results, teams can validate or pivot their approach before committing significant investment.</A>"""

CHUNK_WRAPPER = """
IMPORTANT: Generate EXACTLY {num_pairs} question-answer pairs (no more, no less).

<text>
{chunk}
</text>
"""

Using Custom Instructions

Pass your instructions as the custom_instructions parameter. The system automatically appends the text chunk to be processed — you do not need to include a {chunk} placeholder.
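
For example, the hedged sketch below sends domain-specific instructions for contract clauses using Python's requests library; the instruction text and parameter values are only illustrations:

import requests

# Hedged sketch: supply custom generation instructions. The chunk text is appended
# automatically by the API, so no {chunk} placeholder is needed.
custom_instructions = (
    "You are a legal analyst. Generate question-answer pairs that focus on "
    "obligations, deadlines, and defined terms in the provided contract clause. "
    "Respond ONLY with <Q></Q> and <A></A> tag pairs."
)

with open("document.json", "rb") as f:
    response = requests.post(
        "http://localhost:8003/process-chunks/",
        files={"file": ("document.json", f, "application/json")},
        params={"custom_instructions": custom_instructions, "n_qa_pairs": 4},
    )
response.raise_for_status()
print(response.json()["metadata"]["custom_prompt_used"])  # expected: true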

Testing

The QA Generator includes a comprehensive test suite to ensure reliability and quality.

Running Tests

# Run all tests
python -m pytest test/test_qa_api.py -v

# Run specific test
python -m pytest test/test_qa_api.py::test_qa_generation_basic_functionality -v

# Run with coverage report
python -m pytest test/test_qa_api.py --cov=hypothetical_qa_api --cov-report=html

Test Coverage

The test suite includes:

  • Core Functionality Tests: Basic Q&A generation workflow validation
  • Input Validation Tests: Parameter validation and error handling
  • Feature Tests: Custom instructions support and configuration options
  • Quality Assurance Tests: Q&A format validation and quality scoring
  • Health Check Tests: API health and service status validation
  • Metadata Tests: Processing metadata accuracy and completeness
  • Authentication Tests: Bearer token enforcement, bypass on empty API_TOKEN, and health check exemption
  • Ollama API Key Tests: Header injection at all Ollama call sites, correct omission when key is unset

For detailed testing information, see the test documentation.

Contributing

The Hypothetical QA Generator is an open-source project that welcomes contributions from the community. Your involvement helps improve RAG technology for everyone.

We value contributions of all kinds:

  • Bug fixes and performance improvements
  • Documentation enhancements
  • New features and RAG-focused capabilities
  • Test coverage improvements
  • Integration examples and tutorials
  • RAG-optimized prompt templates

If you're interested in contributing:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add or update tests as appropriate
  5. Ensure all tests pass
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

Please ensure your code follows the existing style conventions and includes appropriate documentation.

For major changes, please open an issue first to discuss what you would like to change.

Happy RAG Enhancement with the QA Generator!

