
Phase 5: LLM transforms and RAG data preparation #7

Closed

netsirius wants to merge 2 commits into main from feature/phase5-llm-transforms-rag

Conversation

@netsirius
Owner

Summary

Adds LLM-as-transformation capabilities and RAG data preparation transforms.

LLM Transforms

  • LLMTransformPlugin: generic transform that calls any LLM API with prompt templating, structured JSON output, batching, content-hash caching, and retry with exponential backoff
  • Supports claude, openai, local (Ollama), and any OpenAI-compatible API via custom baseUrl
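The prompt-templating and content-hash caching described above can be sketched as follows. These are hypothetical standalone helpers, not the plugin's actual internals:

```scala
import java.security.MessageDigest

// Sketch: fill {column} placeholders in a prompt template from a row's
// values, and derive a cache key from the model + rendered prompt.
// Helper names and semantics are assumptions for illustration.
def renderPrompt(template: String, row: Map[String, String]): String =
  row.foldLeft(template) { case (t, (k, v)) => t.replace(s"{$k}", v) }

def cacheKey(prompt: String, model: String): String = {
  val digest = MessageDigest.getInstance("SHA-256")
  digest.digest(s"$model|$prompt".getBytes("UTF-8"))
    .map("%02x".format(_)).mkString
}
```

Keying the cache on a hash of model + rendered prompt means identical rows hit the cache while any change to the prompt, data, or model invalidates it.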

RAG Transforms

  • ChunkingPlugin: split documents into chunks (fixed, sentence, recursive strategies)
  • EmbeddingPlugin: generate vector embeddings via OpenAI/Vertex AI/Cohere APIs
  • GraphExtractionPlugin: extract entities and relationships using LLM
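Chained together, a RAG preparation pipeline might look like the sketch below. The `strategy`, `chunkSize`, `overlap`, and embedding config keys are assumptions modeled on the LLMTransform example further down, not confirmed field names:

```yaml
# Hypothetical config sketch; field names are assumptions.
transformations:
  - id: chunk
    type: Chunking
    sources: [docs]
    config:
      strategy: recursive
      chunkSize: 512
      overlap: 64
  - id: embed
    type: Embedding
    sources: [chunk]
    config:
      provider: openai
      model: text-embedding-3-small
      apiKey: ${OPENAI_API_KEY}
```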

Local Model Support

All LLM plugins support provider: local for Ollama (http://localhost:11434). No API key required. Custom baseUrl supported for self-hosted endpoints.
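For a self-hosted endpoint, the config might look like this sketch (the host/port value is illustrative):

```yaml
# Hypothetical sketch: override the default Ollama URL with a
# self-hosted OpenAI-compatible endpoint.
config:
  provider: local
  model: llama3
  baseUrl: http://llm.internal:8000/v1
```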

Example YAML

transformations:
  - id: classify
    type: LLMTransform
    sources: [data]
    config:
      provider: local
      model: llama3
      prompt: "Classify: {text}"
      outputSchema: "category:string|confidence:double"
      batchSize: 5
      cache: true
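The `outputSchema` string above uses a `name:type|name:type` mini-format; parsing it could be sketched as follows (a hypothetical helper inferred from the example, not the plugin's actual parser):

```scala
// Sketch: parse "category:string|confidence:double" into (field, type)
// pairs. The format is inferred from the example config above.
def parseOutputSchema(s: String): Seq[(String, String)] =
  s.split('|').toSeq.map { f =>
    f.split(':') match {
      case Array(name, tpe) => (name, tpe)
      case _ => throw new IllegalArgumentException(s"bad field spec: $f")
    }
  }
```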

Test plan

  • sbt compile — All modules compile
  • CI green
  • Manual: LLMTransform with Ollama local model

New source connectors:
- Kafka (batch + streaming modes, configurable offsets)
- MongoDB (collection reads, aggregation pipeline support)
- REST API (generic, with offset/cursor pagination, bearer/api-key auth)
- BigQuery source (table reads + SQL queries)
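The cursor pagination mentioned for the REST API source can be sketched as a loop that follows the "next" cursor until the API stops returning one. `fetchPage` here is a hypothetical function standing in for the connector's HTTP call:

```scala
// Sketch of cursor-based pagination: fetchPage takes an optional cursor
// and returns (items, nextCursor). Real connector internals may differ.
def fetchAll(fetchPage: Option[String] => (Seq[String], Option[String])): Seq[String] = {
  @annotation.tailrec
  def loop(cursor: Option[String], acc: Seq[String]): Seq[String] =
    fetchPage(cursor) match {
      case (items, Some(next)) => loop(Some(next), acc ++ items)
      case (items, None)       => acc ++ items
    }
  loop(None, Seq.empty)
}
```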

New sink connectors:
- Kafka (batch + streaming with checkpointing)
- Elasticsearch (Spark ES connector with REST bulk API fallback)

All connectors include healthCheck for weaver doctor.
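A Kafka-to-Elasticsearch pipeline using these connectors might be configured like the sketch below. The top-level `sources`/`sinks` layout and keys such as `bootstrapServers`, `startingOffsets`, `nodes`, and `index` are assumptions, not confirmed schema fields:

```yaml
# Hypothetical config sketch; field names are assumptions.
sources:
  - id: events
    type: Kafka
    config:
      bootstrapServers: localhost:9092
      topic: events
      startingOffsets: earliest
sinks:
  - id: search
    type: Elasticsearch
    config:
      nodes: http://localhost:9200
      index: events
```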

Dependencies:
- kafka-clients 3.7.0 (provided, for health checks)

Schema:
- Updated pipeline.schema.json with new connector types

Total connectors: 8 sources + 6 sinks = 14 connectors
LLM Transforms:
- LLMTransformPlugin: generic LLM-as-transformation with prompt templating,
  structured JSON output parsing, batching, caching, retry with backoff
- Supports: claude, openai, local (Ollama), any OpenAI-compatible API
- Config: batchSize, maxConcurrent, retryOnError, cache, apiKey, baseUrl
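The retry-with-backoff behavior behind `retryOnError` can be sketched as a standalone helper (hypothetical; the plugin's actual retry policy may differ):

```scala
// Sketch: retry a failing operation up to `attempts` times, doubling the
// delay between tries. Illustrative only, not the plugin's implementation.
def retryWithBackoff[A](attempts: Int, delayMs: Long)(op: () => A): A =
  try op()
  catch {
    case _: Exception if attempts > 1 =>
      Thread.sleep(delayMs)
      retryWithBackoff(attempts - 1, delayMs * 2)(op)
  }
```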

RAG Transforms:
- ChunkingPlugin: fixed, sentence, recursive chunking strategies
  with configurable size and overlap
- EmbeddingPlugin: vector embedding generation via OpenAI/Vertex AI/Cohere
  APIs, returns embedding column (array of doubles)
- GraphExtractionPlugin: entity/relationship extraction using LLM,
  returns source_entity, source_type, relation, target_entity, target_type
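The fixed-size strategy with overlap can be sketched as below; this is an illustrative standalone function, not the ChunkingPlugin's actual code:

```scala
// Sketch of fixed-size chunking with overlap: each chunk starts
// (size - overlap) characters after the previous one, and iteration
// stops once a chunk would add nothing beyond the overlap.
def chunkFixed(text: String, size: Int, overlap: Int): Seq[String] = {
  require(overlap >= 0 && overlap < size, "need 0 <= overlap < size")
  val step = size - overlap
  Iterator.from(0, step)
    .takeWhile(i => i == 0 || i < text.length - overlap)
    .map(i => text.substring(i, math.min(i + size, text.length)))
    .toSeq
}
```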

Local model support:
- All LLM plugins support provider: "local" for Ollama
  (http://localhost:11434/v1/chat/completions)
- No API key required for local models
- Custom baseUrl for self-hosted endpoints

Total transform types: 6 (SQL, DataQuality, LLMTransform, Chunking,
Embedding, GraphExtraction)
@netsirius closed this on Apr 6, 2026