
Phase 6: Deployment and production readiness #8

Closed
netsirius wants to merge 3 commits into main from feature/phase6-deployment

Conversation

@netsirius
Owner

Summary

Makes Data Weaver deployable to any Spark target with Docker, Airflow, and cloud cluster support.

Deployment

  • SubmitCommand: spark-submit wrapper for standalone, Kubernetes, AWS EMR, GCP Dataproc
  • Dockerfile: JDK 17 + Spark 4.0.2 base image
  • install.sh: one-line installer (curl | bash)

Airflow Integration

  • DataWeaverOperator: Python Airflow operator
  • airflow-operator/ package with pyproject.toml

Usage

docker run data-weaver:latest apply pipeline.yaml --env prod
weaver apply --submit k8s --master k8s://host --image img:tag
curl -fsSL .../install.sh | bash
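The PR does not show SubmitCommand's internals (it is Scala code), but the flag mapping it implies can be sketched. The `dataweaver.Main` class name and the exact flags below are assumptions for illustration, not the wrapper's real interface:

```python
from typing import List, Optional

def build_spark_submit(target: str, jar: str, pipeline: str,
                       master: Optional[str] = None,
                       image: Optional[str] = None) -> List[str]:
    """Assemble a spark-submit invocation for a deployment target.

    Hypothetical sketch: the real SubmitCommand is Scala and its exact
    flags are not shown in this PR. `dataweaver.Main` is an assumed
    entry-point class name.
    """
    cmd = ["spark-submit", "--class", "dataweaver.Main"]
    if target == "k8s":
        # Kubernetes needs a k8s:// master URL and a container image.
        cmd += ["--master", master or "k8s://localhost:6443",
                "--conf", f"spark.kubernetes.container.image={image}"]
    elif target == "standalone":
        cmd += ["--master", master or "local[*]"]
    cmd += [jar, "apply", pipeline]
    return cmd
```

The same shape would extend to EMR and Dataproc, where the wrapper would shell out to `aws emr add-steps` or `gcloud dataproc jobs submit` instead of a bare spark-submit.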

Test plan

  • sbt compile succeeds
  • CI green
  • Docker image builds
  • install.sh runs successfully

New source connectors:
- Kafka (batch + streaming modes, configurable offsets)
- MongoDB (collection reads, aggregation pipeline support)
- REST API (generic, with offset/cursor pagination, bearer/api-key auth)
- BigQuery source (table reads + SQL queries)
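The cursor-pagination loop the REST source describes can be sketched independently of the connector's Scala interface. `fetch_page` is a hypothetical transport callback, not an API from this PR:

```python
def paginate(fetch_page, cursor=None):
    """Drain a cursor-paginated REST API.

    fetch_page(cursor) -> (records, next_cursor); a next_cursor of None
    ends the scan. Hypothetical sketch of the loop only; auth (bearer /
    api-key) would be handled inside fetch_page.
    """
    while True:
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break
```

Offset pagination is the same loop with the cursor replaced by a running record count.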

New sink connectors:
- Kafka (batch + streaming with checkpointing)
- Elasticsearch (Spark ES connector with REST bulk API fallback)

All connectors include healthCheck for weaver doctor.

Dependencies:
- kafka-clients 3.7.0 (provided, for health checks)

Schema:
- Updated pipeline.schema.json with new connector types

Total connectors: 8 sources + 6 sinks = 14 connectors

LLM Transforms:
- LLMTransformPlugin: generic LLM-as-transformation with prompt templating,
  structured JSON output parsing, batching, caching, retry with backoff
- Supports: claude, openai, local (Ollama), any OpenAI-compatible API
- Config: batchSize, maxConcurrent, retryOnError, cache, apiKey, baseUrl
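The caching and retry-with-backoff behavior can be sketched as follows. This is a minimal stand-in, not the plugin's Scala implementation; `send` is a hypothetical transport and the delay values are illustrative:

```python
import hashlib
import time

_cache = {}

def llm_call(prompt, send, retries=3, base_delay=0.01, cache=True):
    """Call an LLM with response caching and exponential-backoff retry.

    Sketch only: the real plugin's knobs are batchSize, maxConcurrent,
    retryOnError, cache, apiKey, baseUrl; `send` stands in for the HTTP
    transport.
    """
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if cache and key in _cache:
        return _cache[key]  # skip the API call entirely
    for attempt in range(retries):
        try:
            result = send(prompt)
            if cache:
                _cache[key] = result
            return result
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

Batching would sit one level up, grouping `batchSize` rows into a single prompt before calling this.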

RAG Transforms:
- ChunkingPlugin: fixed, sentence, recursive chunking strategies
  with configurable size and overlap
- EmbeddingPlugin: vector embedding generation via OpenAI/Vertex AI/Cohere
  APIs, returns embedding column (array of doubles)
- GraphExtractionPlugin: entity/relationship extraction using LLM,
  returns source_entity, source_type, relation, target_entity, target_type
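Of the chunking strategies above, the fixed strategy with overlap is the simplest to sketch; this is an illustrative reimplementation, not ChunkingPlugin's actual code:

```python
def chunk_fixed(text, size, overlap):
    """Split text into fixed-size chunks with overlap (the 'fixed'
    strategy; sentence and recursive strategies are not sketched here)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Stop once the remaining tail is already covered by the last chunk.
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk shares `overlap` characters with its predecessor, so embeddings of adjacent chunks retain shared context at the boundary.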

Local model support:
- All LLM plugins support provider: "local" for Ollama
  (http://localhost:11434/v1/chat/completions)
- No API key required for local models
- Custom baseUrl for self-hosted endpoints
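Since Ollama exposes an OpenAI-compatible endpoint, the request shape is the same across providers; only the URL and auth header change. The model name and payload below are illustrative assumptions:

```python
import json

def chat_request(prompt, provider="local", api_key=None, base_url=None):
    """Build an OpenAI-compatible chat request for any provider.

    Sketch: for provider "local" this targets Ollama's endpoint and sends
    no Authorization header; "llama3" is a placeholder model name.
    """
    url = base_url or (
        "http://localhost:11434/v1/chat/completions" if provider == "local"
        else "https://api.openai.com/v1/chat/completions")
    headers = {"Content-Type": "application/json"}
    if api_key:  # local models need no key
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": "llama3",
                       "messages": [{"role": "user", "content": prompt}]})
    return url, headers, body
```

A custom `base_url` covers self-hosted OpenAI-compatible servers without any other code change.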

Total transform types: 6 (SQL, DataQuality, LLMTransform, Chunking, Embedding, GraphExtraction)

Deployment targets:
- SubmitCommand: spark-submit wrapper for standalone, k8s, EMR, Dataproc
- Dockerfile: JDK 17 + Spark 4.0.2 + data-weaver.jar
- docker-entrypoint.sh: container entry point
- install.sh: one-line installer (curl | bash)

Airflow integration:
- DataWeaverOperator: Airflow operator for pipeline execution
- Python package (data-weaver-airflow) with pyproject.toml
- Template fields for pipeline, env, extra_args
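The mapping from the operator's template fields to the CLI call it executes can be sketched without importing Airflow; the function below is a stand-in, not DataWeaverOperator's real code:

```python
def weaver_command(pipeline, env, extra_args=()):
    """Render the CLI call a DataWeaverOperator-style task would run.

    pipeline, env, and extra_args mirror the operator's template fields
    (so Airflow can Jinja-render them per run); this is an illustrative
    stand-in for the operator's internals, which this PR does not show.
    """
    return ["weaver", "apply", pipeline, "--env", env, *extra_args]
```

In a DAG, usage would look roughly like `DataWeaverOperator(task_id="run_pipeline", pipeline="pipeline.yaml", env="{{ var.value.env }}")`, with the templated fields rendered before the command is built.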

Usage:
  docker run data-weaver:latest apply pipeline.yaml --env prod
  weaver apply --submit k8s --master k8s://host:6443 --image img:tag
  weaver apply --submit emr --cluster-id j-XXXXX
  curl -fsSL .../install.sh | bash
@netsirius closed this on Apr 6, 2026
