
Phase 5: LLM transforms and RAG data preparation #7

Closed

netsirius wants to merge 2 commits into main from feature/phase5-llm-transforms-rag

Conversation

@netsirius
Owner

Summary

Adds LLM-as-transformation capabilities and RAG data preparation transforms.

LLM Transforms

  • LLMTransformPlugin: generic transform that calls any LLM API with prompt templating, structured JSON output, batching, content-hash caching, and retry with exponential backoff
  • Supports claude, openai, local (Ollama), and any OpenAI-compatible API via custom baseUrl
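The prompt-templating and content-hash caching described above can be sketched as follows. These are hypothetical standalone helpers, not the plugin's actual internals:

```scala
import java.security.MessageDigest

// Sketch: fill {column} placeholders in a prompt template from a row's
// values, and derive a cache key from the model + rendered prompt.
// Helper names and semantics are assumptions for illustration.
def renderPrompt(template: String, row: Map[String, String]): String =
  row.foldLeft(template) { case (t, (k, v)) => t.replace(s"{$k}", v) }

def cacheKey(prompt: String, model: String): String = {
  val digest = MessageDigest.getInstance("SHA-256")
  digest.digest(s"$model|$prompt".getBytes("UTF-8"))
    .map("%02x".format(_)).mkString
}
```

Keying the cache on a hash of model + rendered prompt means identical rows hit the cache while any change to the prompt, data, or model invalidates it.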

RAG Transforms

  • ChunkingPlugin: split documents into chunks (fixed, sentence, recursive strategies)
  • EmbeddingPlugin: generate vector embeddings via OpenAI/Vertex AI/Cohere APIs
  • GraphExtractionPlugin: extract entities and relationships using LLM
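Chained together, a RAG preparation pipeline might look like the sketch below. The `strategy`, `chunkSize`, `overlap`, and embedding config keys are assumptions modeled on the LLMTransform example further down, not confirmed field names:

```yaml
# Hypothetical config sketch; field names are assumptions.
transformations:
  - id: chunk
    type: Chunking
    sources: [docs]
    config:
      strategy: recursive
      chunkSize: 512
      overlap: 64
  - id: embed
    type: Embedding
    sources: [chunk]
    config:
      provider: openai
      model: text-embedding-3-small
      apiKey: ${OPENAI_API_KEY}
```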

Local Model Support

All LLM plugins support provider: local for Ollama (http://localhost:11434). No API key required. Custom baseUrl supported for self-hosted endpoints.
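For a self-hosted endpoint, the config might look like this sketch (the host/port value is illustrative):

```yaml
# Hypothetical sketch: override the default Ollama URL with a
# self-hosted OpenAI-compatible endpoint.
config:
  provider: local
  model: llama3
  baseUrl: http://llm.internal:8000/v1
```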

Example YAML

transformations:
  - id: classify
    type: LLMTransform
    sources: [data]
    config:
      provider: local
      model: llama3
      prompt: "Classify: {text}"
      outputSchema: "category:string|confidence:double"
      batchSize: 5
      cache: true
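The `outputSchema` string above uses a `name:type|name:type` mini-format; parsing it could be sketched as follows (a hypothetical helper inferred from the example, not the plugin's actual parser):

```scala
// Sketch: parse "category:string|confidence:double" into (field, type)
// pairs. The format is inferred from the example config above.
def parseOutputSchema(s: String): Seq[(String, String)] =
  s.split('|').toSeq.map { f =>
    f.split(':') match {
      case Array(name, tpe) => (name, tpe)
      case _ => throw new IllegalArgumentException(s"bad field spec: $f")
    }
  }
```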

Test plan

  • sbt compile — All modules compile
  • CI green
  • Manual: LLMTransform with Ollama local model

New source connectors:
- Kafka (batch + streaming modes, configurable offsets)
- MongoDB (collection reads, aggregation pipeline support)
- REST API (generic, with offset/cursor pagination, bearer/api-key auth)
- BigQuery source (table reads + SQL queries)
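The cursor pagination mentioned for the REST API source can be sketched as a loop that follows the "next" cursor until the API stops returning one. `fetchPage` here is a hypothetical function standing in for the connector's HTTP call:

```scala
// Sketch of cursor-based pagination: fetchPage takes an optional cursor
// and returns (items, nextCursor). Real connector internals may differ.
def fetchAll(fetchPage: Option[String] => (Seq[String], Option[String])): Seq[String] = {
  @annotation.tailrec
  def loop(cursor: Option[String], acc: Seq[String]): Seq[String] =
    fetchPage(cursor) match {
      case (items, Some(next)) => loop(Some(next), acc ++ items)
      case (items, None)       => acc ++ items
    }
  loop(None, Seq.empty)
}
```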

New sink connectors:
- Kafka (batch + streaming with checkpointing)
- Elasticsearch (Spark ES connector with REST bulk API fallback)

All connectors include healthCheck for weaver doctor.
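A Kafka-to-Elasticsearch pipeline using these connectors might be configured like the sketch below. The top-level `sources`/`sinks` layout and keys such as `bootstrapServers`, `startingOffsets`, `nodes`, and `index` are assumptions, not confirmed schema fields:

```yaml
# Hypothetical config sketch; field names are assumptions.
sources:
  - id: events
    type: Kafka
    config:
      bootstrapServers: localhost:9092
      topic: events
      startingOffsets: earliest
sinks:
  - id: search
    type: Elasticsearch
    config:
      nodes: http://localhost:9200
      index: events
```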

Dependencies:
- kafka-clients 3.7.0 (provided, for health checks)

Schema:
- Updated pipeline.schema.json with new connector types

Total connectors: 8 sources + 6 sinks = 14 connectors
LLM Transforms:
- LLMTransformPlugin: generic LLM-as-transformation with prompt templating,
  structured JSON output parsing, batching, caching, retry with backoff
- Supports: claude, openai, local (Ollama), any OpenAI-compatible API
- Config: batchSize, maxConcurrent, retryOnError, cache, apiKey, baseUrl
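The retry-with-backoff behavior behind `retryOnError` can be sketched as a standalone helper (hypothetical; the plugin's actual retry policy may differ):

```scala
// Sketch: retry a failing operation up to `attempts` times, doubling the
// delay between tries. Illustrative only, not the plugin's implementation.
def retryWithBackoff[A](attempts: Int, delayMs: Long)(op: () => A): A =
  try op()
  catch {
    case _: Exception if attempts > 1 =>
      Thread.sleep(delayMs)
      retryWithBackoff(attempts - 1, delayMs * 2)(op)
  }
```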

RAG Transforms:
- ChunkingPlugin: fixed, sentence, recursive chunking strategies
  with configurable size and overlap
- EmbeddingPlugin: vector embedding generation via OpenAI/Vertex AI/Cohere
  APIs, returns embedding column (array of doubles)
- GraphExtractionPlugin: entity/relationship extraction using LLM,
  returns source_entity, source_type, relation, target_entity, target_type
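The fixed-size strategy with overlap can be sketched as below; this is an illustrative standalone function, not the ChunkingPlugin's actual code:

```scala
// Sketch of fixed-size chunking with overlap: each chunk starts
// (size - overlap) characters after the previous one, and iteration
// stops once a chunk would add nothing beyond the overlap.
def chunkFixed(text: String, size: Int, overlap: Int): Seq[String] = {
  require(overlap >= 0 && overlap < size, "need 0 <= overlap < size")
  val step = size - overlap
  Iterator.from(0, step)
    .takeWhile(i => i == 0 || i < text.length - overlap)
    .map(i => text.substring(i, math.min(i + size, text.length)))
    .toSeq
}
```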

Local model support:
- All LLM plugins support provider: "local" for Ollama
  (http://localhost:11434/v1/chat/completions)
- No API key required for local models
- Custom baseUrl for self-hosted endpoints

Total transform types: 6 (SQL, DataQuality, LLMTransform, Chunking,
Embedding, GraphExtraction)
@netsirius closed this on Apr 6, 2026