Skip to content

Implement Configurable RAG Pipeline with JSON/YAML Configuration #4

@brylie

Description

@brylie

Objective

Develop a flexible, configurable Retrieval Augmented Generation (RAG) pipeline that can be easily customized for different projects through a JSON or YAML configuration file.

Description

We need to refactor our current RAG system into a more generalized, configuration-driven pipeline. This will allow users to easily set up and customize RAG chatbots for different knowledge bases and use cases without modifying the core code.

Configuration Structure

The configuration file (in JSON or YAML) should include the following fields:

project_id: str  # Unique identifier for the project
endpoint_slug: str  # Unique slug for the chat endpoint
system_prompt: str  # Prompt defining chatbot behavior
knowledge_sources:  # List of document sources
  - type: str  # e.g., 'file', 'url'
    path: str  # File path or URL
    metadata:
      label: str
      tags: list[str]
keep_chat_memory: bool
collection_name: str  # Vector DB collection name
embedding_model: str  # Embedding model identifier

Tasks

  1. Configuration Management:

    • Implement a configuration parser for either JSON or YAML format
    • Create a validation script to check configuration validity
    • Develop a system to store and retrieve configurations (e.g., SQLite with ORM)
  2. Document Ingestion Pipeline:

    • Create a script to ingest documents from various sources (local files, URLs)
    • Implement document parsing for different file types (PDF, TXT, HTML, etc.)
      • use existing libraries where possible to avoid complexities of document parsing
    • Develop a vectorization process using the specified embedding model
    • Implement upsert functionality to avoid duplicates in the vector store
  3. Vector Store Integration:

    • Enhance vector store integration to support multiple collections
    • Implement metadata storage for embedding model information
    • Create a migration system for updating existing collections when configurations change
  4. Chatbot Endpoint Generation:

    • Develop a dynamic endpoint generation system based on endpoint_slug
    • Implement configuration-based chat processing (system prompt, chat memory, etc.)
  5. Runtime Processing:

    • Implement dynamic retrieval based on the configured collection and embedding model
    • Develop a flexible message construction system based on configuration
  6. ORM and Database Management:

    • Set up an ORM (e.g., SQLAlchemy) for configuration and metadata storage
    • Implement a migration system (e.g., Alembic) for database schema changes
  7. Configuration Change Management:

    • Develop a system to detect configuration changes (possible just re-running the validation script(s))
    • Implement processes to apply changes (e.g., re-vectorizing documents, creating new collections)
  8. Admin Interface (optional and probably a separate issue):

    • Create a basic admin interface for managing configurations
    • Implement functionality to view and edit configurations
    • Add features to monitor the status of document ingestion and vectorization

Technical Considerations

  • Use a robust ORM like SQLAlchemy for database interactions
  • Implement async operations where possible for better performance
  • Ensure proper error handling and logging throughout the pipeline
  • Consider using a task queue (e.g., Celery) for long-running operations like document ingestion
  • Implement proper security measures, especially for the admin interface

Acceptance Criteria

  • Users can create and modify RAG chatbots through configuration files
  • The system correctly ingests and vectorizes documents from various sources
  • Chatbot endpoints are dynamically created based on configuration
  • Configuration changes are detected and applied correctly
  • The admin interface provides a clear overview of all configured chatbots and their statuses
  • The system handles errors gracefully and provides clear feedback
  • Performance remains acceptable even with multiple configured chatbots

Additional Notes

  • Consider implementing a versioning system for configurations to allow rollbacks
  • Plan for future extensibility, such as supporting multiple LLM providers or vector stores
  • Ensure thorough documentation of the configuration format and system capabilities
  • Consider developing a simple web interface for non-technical users to create and manage configurations

Future Enhancements

  • Support for real-time document updates and incremental vectorization
  • Integration with popular document management systems or cloud storage providers
  • Advanced analytics and usage tracking for each configured chatbot
  • A/B testing capabilities for different configurations

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions