
Phases 4-7: Streaming, LLM/RAG, Deployment, Ecosystem #10

Merged
netsirius merged 7 commits into main from feature/phases-4-to-7
Apr 6, 2026

Conversation

@netsirius
Owner

Summary

Consolidated PR for Phases 4 through 7, completing the Data Weaver v2 roadmap.


Phase 4: Streaming + Advanced Connectors

New source connectors:

  • Kafka (batch + streaming, configurable offsets)
  • MongoDB (collection reads, aggregation pipeline)
  • REST API (generic, offset/cursor pagination, bearer/api-key auth)
  • BigQuery source (table + SQL queries)

New sink connectors:

  • Kafka (batch + streaming with checkpointing)
  • Elasticsearch (Spark ES + REST bulk API fallback)

Totals: 8 sources + 6 sinks = 14 connectors
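
The cursor-pagination mode in the REST source follows the usual drain-until-no-cursor loop. A minimal Python sketch (the real connector is Scala; `fetch_page` and the `next_cursor` field name are illustrative assumptions, not the connector's actual config keys):

```python
def paginate_cursor(fetch_page):
    """Drain a cursor-paginated API: each response carries the cursor for
    the next page; a missing/empty cursor marks the last page."""
    cursor = None
    while True:
        page = fetch_page(cursor)  # e.g. GET /items?cursor=<cursor>
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:
            break

# Fake three-page API standing in for the HTTP client:
pages = {
    None: {"items": [1, 2], "next_cursor": "a"},
    "a": {"items": [3], "next_cursor": "b"},
    "b": {"items": [4]},  # no next_cursor -> stop
}
records = list(paginate_cursor(lambda c: pages[c]))
```

Offset pagination is the same loop with a numeric offset advanced by page size instead of a server-issued cursor.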


Phase 5: LLM Transforms + RAG

  • LLMTransformPlugin: generic LLM-as-transformation with prompt templating, structured JSON output, batching, content-hash caching, retry with exponential backoff
  • ChunkingPlugin: fixed, sentence, recursive strategies
  • EmbeddingPlugin: vector embeddings via OpenAI/Vertex AI/Cohere
  • GraphExtractionPlugin: entity/relationship extraction using LLM
  • Local model support: all LLM plugins support provider: local for Ollama (no API key required)
  • Custom endpoints: baseUrl for self-hosted LLM servers

Total transform types: 6 (SQL, DataQuality, LLMTransform, Chunking, Embedding, GraphExtraction)
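
The content-hash caching and retry-with-backoff behavior of LLMTransformPlugin can be sketched as follows (illustrative Python, not the Scala implementation; function and parameter names are hypothetical):

```python
import hashlib
import time


def content_hash(prompt: str) -> str:
    """Cache key derived from the rendered prompt itself, so identical
    rows hit the cache instead of calling the LLM again."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def call_with_cache_and_retry(call_llm, prompt, cache, retries=3, base_delay=1.0):
    """Serve from cache when possible; otherwise call the LLM, retrying
    transient failures with exponential backoff (1x, 2x, 4x the base delay)."""
    key = content_hash(prompt)
    if key in cache:
        return cache[key]
    for attempt in range(retries):
        try:
            result = call_llm(prompt)
            cache[key] = result
            return result
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))
```

Batching and bounded concurrency would wrap this per-row call, but the cache-then-retry core is the part that keeps reruns cheap and transient API errors non-fatal.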


Phase 6: Deployment + Production

  • SubmitCommand: spark-submit wrapper for standalone, Kubernetes, AWS EMR, GCP Dataproc
  • Dockerfile: JDK 17 + Spark 4.0.2 base image
  • install.sh: one-line installer (curl | bash)
  • Airflow operator: DataWeaverOperator Python package with pyproject.toml

Phase 7: Ecosystem

  • Connector SDK: complete developer guide for building custom connectors
  • 3 tutorials with sample data:
    1. CSV to Parquet with quality checks
    2. Multi-source join with parallel transforms
    3. RAG pipeline (document chunking)
  • Reusable CI workflow: weaver-pipeline-ci.yml for user projects

Complete CLI

weaver init <project>                  # Scaffold project
weaver init --interactive              # Step-by-step wizard
weaver generate "<description>"        # AI pipeline generation
weaver doctor <pipeline>               # System diagnostic
weaver validate <pipeline>             # YAML + schema + DAG
weaver plan <pipeline>                 # Dry-run
weaver explain <pipeline>              # DAG visualization
weaver inspect <pipeline> <id>         # Source/transform details
weaver test <pipeline>                 # Run inline tests
weaver test --coverage                 # Coverage report
weaver apply <pipeline> [--env prod]   # Execute

Test plan

  • sbt compile — All modules compile
  • 47 core tests pass locally
  • CI green on GitHub Actions

All connectors include healthCheck for weaver doctor.

Dependencies:
- kafka-clients 3.7.0 (provided, for health checks)

Schema:
- Updated pipeline.schema.json with new connector types

LLM Transforms:
- LLMTransformPlugin: generic LLM-as-transformation with prompt templating,
  structured JSON output parsing, batching, caching, retry with backoff
- Supports: claude, openai, local (Ollama), any OpenAI-compatible API
- Config: batchSize, maxConcurrent, retryOnError, cache, apiKey, baseUrl

RAG Transforms:
- ChunkingPlugin: fixed, sentence, recursive chunking strategies
  with configurable size and overlap
- EmbeddingPlugin: vector embedding generation via OpenAI/Vertex AI/Cohere
  APIs, returns embedding column (array of doubles)
- GraphExtractionPlugin: entity/relationship extraction using LLM,
  returns source_entity, source_type, relation, target_entity, target_type
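
The fixed chunking strategy can be sketched as a sliding window (illustrative Python, not the Scala plugin; character-based size/overlap semantics are an assumption):

```python
def chunk_fixed(text: str, size: int, overlap: int) -> list:
    """Fixed-size chunking: slide a window of `size` characters, stepping
    by size - overlap so consecutive chunks share `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Sentence and recursive strategies differ only in where they are allowed to cut: sentence boundaries, or progressively finer separators (paragraph, sentence, word) until each piece fits.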

Local model support:
- All LLM plugins support provider: "local" for Ollama
  (http://localhost:11434/v1/chat/completions)
- No API key required for local models
- Custom baseUrl for self-hosted endpoints
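
A request to the local provider can be sketched as a standard OpenAI-compatible chat-completions payload (illustrative Python; the helper and the `model` value are assumptions, while the endpoint URL and the no-key-for-local rule come from the notes above):

```python
import json


def chat_request(prompt, provider="local", api_key=None,
                 base_url="http://localhost:11434/v1/chat/completions",
                 model="llama3"):
    """Build an OpenAI-compatible chat request; the local (Ollama)
    provider needs no Authorization header."""
    headers = {"Content-Type": "application/json"}
    if provider != "local":
        headers["Authorization"] = f"Bearer {api_key}"  # hosted providers only
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return base_url, headers, json.dumps(body)
```

Pointing `base_url` at any self-hosted server that speaks the same wire format is what makes the custom-endpoint support a one-line config change.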

Deployment targets:
- SubmitCommand: spark-submit wrapper for standalone, k8s, EMR, Dataproc
- Dockerfile: JDK 17 + Spark 4.0.2 + data-weaver.jar
- docker-entrypoint.sh: container entry point
- install.sh: one-line installer (curl | bash)

Airflow integration:
- DataWeaverOperator: Airflow operator for pipeline execution
- Python package (data-weaver-airflow) with pyproject.toml
- Template fields for pipeline, env, extra_args

Usage:
  docker run data-weaver:latest apply pipeline.yaml --env prod
  weaver apply --submit k8s --master k8s://host:6443 --image img:tag
  weaver apply --submit emr --cluster-id j-XXXXX
  curl -fsSL .../install.sh | bash
Connector SDK:
- CONNECTOR_SDK.md: complete guide for building custom connectors
- Source and Sink trait documentation with examples
- ServiceLoader registration instructions
- Best practices for health checks and config handling

Tutorials (3 real-world pipelines with sample data):
- 01-csv-to-parquet: CSV read, SQL transform, quality checks, Parquet write
- 02-multi-source-join: parallel transforms, multi-source join, JSON output
- 03-rag-pipeline: document chunking for RAG preparation
- Sample data: customers.csv, orders.csv, docs.json

Reusable CI workflow:
- weaver-pipeline-ci.yml: GitHub Actions workflow for pipeline validation
- Auto-validates all YAML files in pipelines/ directory
- Runs inline tests for each pipeline
- Reusable via workflow_call for user projects
…ronously

Multi-node levels still use Future for parallelism.
Single-node levels execute directly to avoid Spark classloader
conflicts in test environments.
PluginRegistry.safeLoad skips connectors whose dependencies are not
on the classpath (e.g., KafkaSourceConnector without kafka-clients).
This prevents ServiceConfigurationError in test environments.
@netsirius netsirius merged commit d0d9f8f into main Apr 6, 2026
1 check passed