
Phase 6: Deployment and production readiness #8

Closed
netsirius wants to merge 3 commits into main from feature/phase6-deployment

Conversation

@netsirius
Owner

Summary

Makes Data Weaver deployable to any Spark target with Docker, Airflow, and cloud cluster support.

Deployment

  • SubmitCommand: spark-submit wrapper for standalone, Kubernetes, AWS EMR, GCP Dataproc
  • Dockerfile: JDK 17 + Spark 4.0.2 base image
  • install.sh: one-line installer (curl | bash)

Airflow Integration

  • DataWeaverOperator: Python Airflow operator
  • airflow-operator/ package with pyproject.toml

Usage

docker run data-weaver:latest apply pipeline.yaml --env prod
weaver apply --submit k8s --master k8s://host --image img:tag
curl -fsSL .../install.sh | bash
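The PR does not show SubmitCommand's internals (it is Scala code), but the flag mapping it implies can be sketched. The `dataweaver.Main` class name and the exact flags below are assumptions for illustration, not the wrapper's real interface:

```python
from typing import List, Optional

def build_spark_submit(target: str, jar: str, pipeline: str,
                       master: Optional[str] = None,
                       image: Optional[str] = None) -> List[str]:
    """Assemble a spark-submit invocation for a deployment target.

    Hypothetical sketch: the real SubmitCommand is Scala and its exact
    flags are not shown in this PR. `dataweaver.Main` is an assumed
    entry-point class name.
    """
    cmd = ["spark-submit", "--class", "dataweaver.Main"]
    if target == "k8s":
        # Kubernetes needs a k8s:// master URL and a container image.
        cmd += ["--master", master or "k8s://localhost:6443",
                "--conf", f"spark.kubernetes.container.image={image}"]
    elif target == "standalone":
        cmd += ["--master", master or "local[*]"]
    cmd += [jar, "apply", pipeline]
    return cmd
```

The same shape would extend to EMR and Dataproc, where the wrapper would shell out to `aws emr add-steps` or `gcloud dataproc jobs submit` instead of a bare spark-submit.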

Test plan

  • sbt compile succeeds
  • CI green
  • Docker image builds
  • install.sh runs successfully

New source connectors:
- Kafka (batch + streaming modes, configurable offsets)
- MongoDB (collection reads, aggregation pipeline support)
- REST API (generic, with offset/cursor pagination, bearer/api-key auth)
- BigQuery source (table reads + SQL queries)
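The cursor-pagination loop the REST source describes can be sketched independently of the connector's Scala interface. `fetch_page` is a hypothetical transport callback, not an API from this PR:

```python
def paginate(fetch_page, cursor=None):
    """Drain a cursor-paginated REST API.

    fetch_page(cursor) -> (records, next_cursor); a next_cursor of None
    ends the scan. Hypothetical sketch of the loop only; auth (bearer /
    api-key) would be handled inside fetch_page.
    """
    while True:
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break
```

Offset pagination is the same loop with the cursor replaced by a running record count.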

New sink connectors:
- Kafka (batch + streaming with checkpointing)
- Elasticsearch (Spark ES connector with REST bulk API fallback)

All connectors include healthCheck for weaver doctor.

Dependencies:
- kafka-clients 3.7.0 (provided, for health checks)

Schema:
- Updated pipeline.schema.json with new connector types

Total connectors: 8 sources + 6 sinks = 14 connectors

LLM Transforms:
- LLMTransformPlugin: generic LLM-as-transformation with prompt templating,
  structured JSON output parsing, batching, caching, retry with backoff
- Supports: claude, openai, local (Ollama), any OpenAI-compatible API
- Config: batchSize, maxConcurrent, retryOnError, cache, apiKey, baseUrl
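The caching and retry-with-backoff behavior can be sketched as follows. This is a minimal stand-in, not the plugin's Scala implementation; `send` is a hypothetical transport and the delay values are illustrative:

```python
import hashlib
import time

_cache = {}

def llm_call(prompt, send, retries=3, base_delay=0.01, cache=True):
    """Call an LLM with response caching and exponential-backoff retry.

    Sketch only: the real plugin's knobs are batchSize, maxConcurrent,
    retryOnError, cache, apiKey, baseUrl; `send` stands in for the HTTP
    transport.
    """
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if cache and key in _cache:
        return _cache[key]  # skip the API call entirely
    for attempt in range(retries):
        try:
            result = send(prompt)
            if cache:
                _cache[key] = result
            return result
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

Batching would sit one level up, grouping `batchSize` rows into a single prompt before calling this.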

RAG Transforms:
- ChunkingPlugin: fixed, sentence, recursive chunking strategies
  with configurable size and overlap
- EmbeddingPlugin: vector embedding generation via OpenAI/Vertex AI/Cohere
  APIs, returns embedding column (array of doubles)
- GraphExtractionPlugin: entity/relationship extraction using LLM,
  returns source_entity, source_type, relation, target_entity, target_type
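Of the chunking strategies above, the fixed strategy with overlap is the simplest to sketch; this is an illustrative reimplementation, not ChunkingPlugin's actual code:

```python
def chunk_fixed(text, size, overlap):
    """Split text into fixed-size chunks with overlap (the 'fixed'
    strategy; sentence and recursive strategies are not sketched here)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Stop once the remaining tail is already covered by the last chunk.
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk shares `overlap` characters with its predecessor, so embeddings of adjacent chunks retain shared context at the boundary.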

Local model support:
- All LLM plugins support provider: "local" for Ollama
  (http://localhost:11434/v1/chat/completions)
- No API key required for local models
- Custom baseUrl for self-hosted endpoints
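Since Ollama exposes an OpenAI-compatible endpoint, the request shape is the same across providers; only the URL and auth header change. The model name and payload below are illustrative assumptions:

```python
import json

def chat_request(prompt, provider="local", api_key=None, base_url=None):
    """Build an OpenAI-compatible chat request for any provider.

    Sketch: for provider "local" this targets Ollama's endpoint and sends
    no Authorization header; "llama3" is a placeholder model name.
    """
    url = base_url or (
        "http://localhost:11434/v1/chat/completions" if provider == "local"
        else "https://api.openai.com/v1/chat/completions")
    headers = {"Content-Type": "application/json"}
    if api_key:  # local models need no key
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": "llama3",
                       "messages": [{"role": "user", "content": prompt}]})
    return url, headers, body
```

A custom `base_url` covers self-hosted OpenAI-compatible servers without any other code change.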

Total transform types: 6 (SQL, DataQuality, LLMTransform, Chunking, Embedding, GraphExtraction)

Deployment targets:
- SubmitCommand: spark-submit wrapper for standalone, k8s, EMR, Dataproc
- Dockerfile: JDK 17 + Spark 4.0.2 + data-weaver.jar
- docker-entrypoint.sh: container entry point
- install.sh: one-line installer (curl | bash)

Airflow integration:
- DataWeaverOperator: Airflow operator for pipeline execution
- Python package (data-weaver-airflow) with pyproject.toml
- Template fields for pipeline, env, extra_args
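The mapping from the operator's template fields to the CLI call it executes can be sketched without importing Airflow; the function below is a stand-in, not DataWeaverOperator's real code:

```python
def weaver_command(pipeline, env, extra_args=()):
    """Render the CLI call a DataWeaverOperator-style task would run.

    pipeline, env, and extra_args mirror the operator's template fields
    (so Airflow can Jinja-render them per run); this is an illustrative
    stand-in for the operator's internals, which this PR does not show.
    """
    return ["weaver", "apply", pipeline, "--env", env, *extra_args]
```

In a DAG, usage would look roughly like `DataWeaverOperator(task_id="run_pipeline", pipeline="pipeline.yaml", env="{{ var.value.env }}")`, with the templated fields rendered before the command is built.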

Usage:
  docker run data-weaver:latest apply pipeline.yaml --env prod
  weaver apply --submit k8s --master k8s://host:6443 --image img:tag
  weaver apply --submit emr --cluster-id j-XXXXX
  curl -fsSL .../install.sh | bash
@netsirius closed this on Apr 6, 2026
