New source connectors:
- Kafka (batch + streaming modes, configurable offsets)
- MongoDB (collection reads, aggregation pipeline support)
- REST API (generic, with offset/cursor pagination, bearer/API-key auth)
- BigQuery (table reads + SQL queries)

New sink connectors:
- Kafka (batch + streaming with checkpointing)
- Elasticsearch (Spark ES connector with REST bulk API fallback)

All connectors include healthCheck for `weaver doctor`.

Dependencies:
- kafka-clients 3.7.0 (provided, for health checks)

Schema:
- Updated pipeline.schema.json with new connector types

Total connectors: 8 sources + 6 sinks = 14 connectors
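A pipeline wiring the new Kafka source to the new Elasticsearch sink might look like the sketch below. The field names (`sources`, `sinks`, `options`, and the keys inside them) are illustrative assumptions; the authoritative keys are defined in pipeline.schema.json.

```yaml
# Hypothetical sketch: Kafka source -> Elasticsearch sink.
# Key names are assumptions, not taken from pipeline.schema.json.
sources:
  - name: events
    type: kafka
    options:
      bootstrapServers: "broker:9092"
      topic: "events"
      startingOffsets: "earliest"   # configurable offsets
      mode: batch                   # or: streaming

sinks:
  - name: search-index
    type: elasticsearch
    input: events
    options:
      nodes: "http://es:9200"
      index: "events-v1"
```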
LLM Transforms:
- LLMTransformPlugin: generic LLM-as-transformation with prompt templating, structured JSON output parsing, batching, caching, and retry with backoff
- Supports: claude, openai, local (Ollama), any OpenAI-compatible API
- Config: batchSize, maxConcurrent, retryOnError, cache, apiKey, baseUrl

RAG Transforms:
- ChunkingPlugin: fixed, sentence, and recursive chunking strategies with configurable size and overlap
- EmbeddingPlugin: vector embedding generation via OpenAI/Vertex AI/Cohere APIs; returns an embedding column (array of doubles)
- GraphExtractionPlugin: entity/relationship extraction using an LLM; returns source_entity, source_type, relation, target_entity, target_type

Local model support:
- All LLM plugins support provider: "local" for Ollama (http://localhost:11434/v1/chat/completions)
- No API key required for local models
- Custom baseUrl for self-hosted endpoints

Total transform types: 6 (SQL, DataQuality, LLMTransform, Chunking, Embedding, GraphExtraction)
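To illustrate what the ChunkingPlugin's fixed strategy does with size and overlap, here is a minimal Python sketch. It assumes chunks are taken every `size - overlap` characters; the plugin's actual boundary handling (e.g. for the final partial chunk) may differ.

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap (illustrative re-implementation,
    not the ChunkingPlugin's actual code)."""
    if size <= overlap:
        raise ValueError("size must be greater than overlap")
    chunks = []
    start = 0
    while start < len(text):
        # Each chunk is at most `size` chars; consecutive chunks
        # share `overlap` chars.
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

With `size=4, overlap=2`, the string `"abcdefghij"` yields chunks starting at offsets 0, 2, 4, 6, 8, each sharing two characters with its predecessor.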
Deployment targets:
- SubmitCommand: spark-submit wrapper for standalone, k8s, EMR, Dataproc
- Dockerfile: JDK 17 + Spark 4.0.2 + data-weaver.jar
- docker-entrypoint.sh: container entry point
- install.sh: one-line installer (curl | bash)

Airflow integration:
- DataWeaverOperator: Airflow operator for pipeline execution
- Python package (data-weaver-airflow) with pyproject.toml
- Template fields for pipeline, env, extra_args

Usage:

```
docker run data-weaver:latest apply pipeline.yaml --env prod
weaver apply --submit k8s --master k8s://host:6443 --image img:tag
weaver apply --submit emr --cluster-id j-XXXXX
curl -fsSL .../install.sh | bash
```
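The DataWeaverOperator could be wired into a DAG along these lines. This is a sketch only: the import path and any constructor arguments beyond the documented template fields (pipeline, env, extra_args) are assumptions.

```python
from datetime import datetime

from airflow import DAG
from data_weaver_airflow import DataWeaverOperator  # package name from this PR; import path assumed

with DAG(
    dag_id="nightly_events_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_pipeline = DataWeaverOperator(
        task_id="apply_pipeline",
        pipeline="pipelines/events.yaml",  # templated field
        env="prod",                        # templated field
        extra_args=["--submit", "k8s"],    # templated field
    )
```

Because pipeline, env, and extra_args are template fields, they can carry Jinja expressions such as `{{ ds }}` that Airflow renders at runtime.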
Summary
Makes Data Weaver deployable to any Spark target with Docker, Airflow, and cloud cluster support.
Deployment
- SubmitCommand: spark-submit wrapper for standalone, Kubernetes, AWS EMR, GCP Dataproc
- Dockerfile: JDK 17 + Spark 4.0.2 base image
- install.sh: one-line installer (curl | bash)

Airflow Integration
- DataWeaverOperator: Python Airflow operator
- airflow-operator/ package with pyproject.toml

Usage
```
docker run data-weaver:latest apply pipeline.yaml --env prod
weaver apply --submit k8s --master k8s://host --image img:tag
curl -fsSL .../install.sh | bash
```

Test plan
- `sbt compile` — compiles