
Phases 4-7: Streaming, LLM/RAG, Deployment, Ecosystem #10

Merged
netsirius merged 7 commits into main from feature/phases-4-to-7
Apr 6, 2026

Conversation

@netsirius
Owner

Summary

Consolidated PR for Phases 4 through 7, completing the Data Weaver v2 roadmap.


Phase 4: Streaming + Advanced Connectors

New source connectors:

  • Kafka (batch + streaming, configurable offsets)
  • MongoDB (collection reads, aggregation pipeline)
  • REST API (generic, offset/cursor pagination, bearer/api-key auth)
  • BigQuery source (table + SQL queries)

New sink connectors:

  • Kafka (batch + streaming with checkpointing)
  • Elasticsearch (Spark ES + REST bulk API fallback)

Totals: 8 sources + 6 sinks = 14 connectors
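
The cursor-pagination mode in the REST source follows the usual drain-until-no-cursor loop. A minimal Python sketch (the real connector is Scala; `fetch_page` and the `next_cursor` field name are illustrative assumptions, not the connector's actual config keys):

```python
def paginate_cursor(fetch_page):
    """Drain a cursor-paginated API: each response carries the cursor for
    the next page; a missing/empty cursor marks the last page."""
    cursor = None
    while True:
        page = fetch_page(cursor)  # e.g. GET /items?cursor=<cursor>
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:
            break

# Fake three-page API standing in for the HTTP client:
pages = {
    None: {"items": [1, 2], "next_cursor": "a"},
    "a": {"items": [3], "next_cursor": "b"},
    "b": {"items": [4]},  # no next_cursor -> stop
}
records = list(paginate_cursor(lambda c: pages[c]))
```

Offset pagination is the same loop with a numeric offset advanced by page size instead of a server-issued cursor.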


Phase 5: LLM Transforms + RAG

  • LLMTransformPlugin: generic LLM-as-transformation with prompt templating, structured JSON output, batching, content-hash caching, retry with exponential backoff
  • ChunkingPlugin: fixed, sentence, recursive strategies
  • EmbeddingPlugin: vector embeddings via OpenAI/Vertex AI/Cohere
  • GraphExtractionPlugin: entity/relationship extraction using LLM
  • Local model support: all LLM plugins support provider: local for Ollama (no API key required)
  • Custom endpoints: baseUrl for self-hosted LLM servers

Total transform types: 6 (SQL, DataQuality, LLMTransform, Chunking, Embedding, GraphExtraction)
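
The content-hash caching and retry-with-backoff behavior of LLMTransformPlugin can be sketched as follows (illustrative Python, not the Scala implementation; function and parameter names are hypothetical):

```python
import hashlib
import time


def content_hash(prompt: str) -> str:
    """Cache key derived from the rendered prompt itself, so identical
    rows hit the cache instead of calling the LLM again."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def call_with_cache_and_retry(call_llm, prompt, cache, retries=3, base_delay=1.0):
    """Serve from cache when possible; otherwise call the LLM, retrying
    transient failures with exponential backoff (1x, 2x, 4x the base delay)."""
    key = content_hash(prompt)
    if key in cache:
        return cache[key]
    for attempt in range(retries):
        try:
            result = call_llm(prompt)
            cache[key] = result
            return result
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))
```

Batching and bounded concurrency would wrap this per-row call, but the cache-then-retry core is the part that keeps reruns cheap and transient API errors non-fatal.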


Phase 6: Deployment + Production

  • SubmitCommand: spark-submit wrapper for standalone, Kubernetes, AWS EMR, GCP Dataproc
  • Dockerfile: JDK 17 + Spark 4.0.2 base image
  • install.sh: one-line installer (curl | bash)
  • Airflow operator: DataWeaverOperator Python package with pyproject.toml

Phase 7: Ecosystem

  • Connector SDK: complete developer guide for building custom connectors
  • 3 tutorials with sample data:
    1. CSV to Parquet with quality checks
    2. Multi-source join with parallel transforms
    3. RAG pipeline (document chunking)
  • Reusable CI workflow: weaver-pipeline-ci.yml for user projects

Complete CLI

weaver init <project>                  # Scaffold project
weaver init --interactive              # Step-by-step wizard
weaver generate "<description>"        # AI pipeline generation
weaver doctor <pipeline>               # System diagnostic
weaver validate <pipeline>             # YAML + schema + DAG
weaver plan <pipeline>                 # Dry-run
weaver explain <pipeline>              # DAG visualization
weaver inspect <pipeline> <id>         # Source/transform details
weaver test <pipeline>                 # Run inline tests
weaver test --coverage                 # Coverage report
weaver apply <pipeline> [--env prod]   # Execute

Test plan

  • sbt compile — All modules compile
  • 47 core tests pass locally
  • CI green on GitHub Actions

All connectors include healthCheck for weaver doctor.

Dependencies:
- kafka-clients 3.7.0 (provided, for health checks)

Schema:
- Updated pipeline.schema.json with new connector types

LLM Transforms:
- LLMTransformPlugin: generic LLM-as-transformation with prompt templating,
  structured JSON output parsing, batching, caching, retry with backoff
- Supports: claude, openai, local (Ollama), any OpenAI-compatible API
- Config: batchSize, maxConcurrent, retryOnError, cache, apiKey, baseUrl

RAG Transforms:
- ChunkingPlugin: fixed, sentence, recursive chunking strategies
  with configurable size and overlap
- EmbeddingPlugin: vector embedding generation via OpenAI/Vertex AI/Cohere
  APIs, returns embedding column (array of doubles)
- GraphExtractionPlugin: entity/relationship extraction using LLM,
  returns source_entity, source_type, relation, target_entity, target_type
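
The fixed chunking strategy can be sketched as a sliding window (illustrative Python, not the Scala plugin; character-based size/overlap semantics are an assumption):

```python
def chunk_fixed(text: str, size: int, overlap: int) -> list:
    """Fixed-size chunking: slide a window of `size` characters, stepping
    by size - overlap so consecutive chunks share `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Sentence and recursive strategies differ only in where they are allowed to cut: sentence boundaries, or progressively finer separators (paragraph, sentence, word) until each piece fits.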

Local model support:
- All LLM plugins support provider: "local" for Ollama
  (http://localhost:11434/v1/chat/completions)
- No API key required for local models
- Custom baseUrl for self-hosted endpoints
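
A request to the local provider can be sketched as a standard OpenAI-compatible chat-completions payload (illustrative Python; the helper and the `model` value are assumptions, while the endpoint URL and the no-key-for-local rule come from the notes above):

```python
import json


def chat_request(prompt, provider="local", api_key=None,
                 base_url="http://localhost:11434/v1/chat/completions",
                 model="llama3"):
    """Build an OpenAI-compatible chat request; the local (Ollama)
    provider needs no Authorization header."""
    headers = {"Content-Type": "application/json"}
    if provider != "local":
        headers["Authorization"] = f"Bearer {api_key}"  # hosted providers only
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return base_url, headers, json.dumps(body)
```

Pointing `base_url` at any self-hosted server that speaks the same wire format is what makes the custom-endpoint support a one-line config change.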

Deployment targets:
- SubmitCommand: spark-submit wrapper for standalone, k8s, EMR, Dataproc
- Dockerfile: JDK 17 + Spark 4.0.2 + data-weaver.jar
- docker-entrypoint.sh: container entry point
- install.sh: one-line installer (curl | bash)

Airflow integration:
- DataWeaverOperator: Airflow operator for pipeline execution
- Python package (data-weaver-airflow) with pyproject.toml
- Template fields for pipeline, env, extra_args

Usage:
  docker run data-weaver:latest apply pipeline.yaml --env prod
  weaver apply --submit k8s --master k8s://host:6443 --image img:tag
  weaver apply --submit emr --cluster-id j-XXXXX
  curl -fsSL .../install.sh | bash
Connector SDK:
- CONNECTOR_SDK.md: complete guide for building custom connectors
- Source and Sink trait documentation with examples
- ServiceLoader registration instructions
- Best practices for health checks and config handling

Tutorials (3 real-world pipelines with sample data):
- 01-csv-to-parquet: CSV read, SQL transform, quality checks, Parquet write
- 02-multi-source-join: parallel transforms, multi-source join, JSON output
- 03-rag-pipeline: document chunking for RAG preparation
- Sample data: customers.csv, orders.csv, docs.json

Reusable CI workflow:
- weaver-pipeline-ci.yml: GitHub Actions workflow for pipeline validation
- Auto-validates all YAML files in pipelines/ directory
- Runs inline tests for each pipeline
- Reusable via workflow_call for user projects
…ronously

Multi-node levels still use Future for parallelism.
Single-node levels execute directly to avoid Spark classloader
conflicts in test environments.
PluginRegistry.safeLoad skips connectors whose dependencies are not
on the classpath (e.g., KafkaSourceConnector without kafka-clients).
This prevents ServiceConfigurationError in test environments.
@netsirius netsirius merged commit d0d9f8f into main Apr 6, 2026
1 check passed