Skip to content

Phase 7: Connector SDK, tutorials, and CI workflow#9

Closed
netsirius wants to merge 4 commits intomainfrom
feature/phase7-ecosystem
Closed

Phase 7: Connector SDK, tutorials, and CI workflow#9
netsirius wants to merge 4 commits intomainfrom
feature/phase7-ecosystem

Conversation

@netsirius
Copy link
Copy Markdown
Owner

Summary

Adds developer documentation, real-world tutorials, and reusable CI workflow.

Connector SDK

  • Complete guide for building custom source and sink connectors
  • Code examples, ServiceLoader registration, best practices

Tutorials (3 pipelines with sample data)

  1. CSV to Parquet: read, transform, quality check, write
  2. Multi-source join: parallel transforms, aggregation
  3. RAG pipeline: document chunking for vector preparation

Reusable CI Workflow

  • weaver-pipeline-ci.yml: validates and tests all pipelines in a directory
  • Reusable via workflow_call for user projects

Test plan

  • All files created
  • CI green
  • Tutorials work with weaver validate

New source connectors:
- Kafka (batch + streaming modes, configurable offsets)
- MongoDB (collection reads, aggregation pipeline support)
- REST API (generic, with offset/cursor pagination, bearer/api-key auth)
- BigQuery source (table reads + SQL queries)

New sink connectors:
- Kafka (batch + streaming with checkpointing)
- Elasticsearch (Spark ES connector with REST bulk API fallback)

All connectors include healthCheck for weaver doctor.

Dependencies:
- kafka-clients 3.7.0 (provided, for health checks)

Schema:
- Updated pipeline.schema.json with new connector types

Total connectors: 8 sources + 6 sinks = 14 connectors
LLM Transforms:
- LLMTransformPlugin: generic LLM-as-transformation with prompt templating,
  structured JSON output parsing, batching, caching, retry with backoff
- Supports: claude, openai, local (Ollama), any OpenAI-compatible API
- Config: batchSize, maxConcurrent, retryOnError, cache, apiKey, baseUrl

RAG Transforms:
- ChunkingPlugin: fixed, sentence, recursive chunking strategies
  with configurable size and overlap
- EmbeddingPlugin: vector embedding generation via OpenAI/Vertex AI/Cohere
  APIs, returns embedding column (array of doubles)
- GraphExtractionPlugin: entity/relationship extraction using LLM,
  returns source_entity, source_type, relation, target_entity, target_type

Local model support:
- All LLM plugins support provider: "local" for Ollama
  (http://localhost:11434/v1/chat/completions)
- No API key required for local models
- Custom baseUrl for self-hosted endpoints

Total transform types: 6 (SQL, DataQuality, LLMTransform, Chunking,
Embedding, GraphExtraction)
Deployment targets:
- SubmitCommand: spark-submit wrapper for standalone, k8s, EMR, Dataproc
- Dockerfile: JDK 17 + Spark 4.0.2 + data-weaver.jar
- docker-entrypoint.sh: container entry point
- install.sh: one-line installer (curl | bash)

Airflow integration:
- DataWeaverOperator: Airflow operator for pipeline execution
- Python package (data-weaver-airflow) with pyproject.toml
- Template fields for pipeline, env, extra_args

Usage:
  docker run data-weaver:latest apply pipeline.yaml --env prod
  weaver apply --submit k8s --master k8s://host:6443 --image img:tag
  weaver apply --submit emr --cluster-id j-XXXXX
  curl -fsSL .../install.sh | bash
Connector SDK:
- CONNECTOR_SDK.md: complete guide for building custom connectors
- Source and Sink trait documentation with examples
- ServiceLoader registration instructions
- Best practices for health checks and config handling

Tutorials (3 real-world pipelines with sample data):
- 01-csv-to-parquet: CSV read, SQL transform, quality checks, Parquet write
- 02-multi-source-join: parallel transforms, multi-source join, JSON output
- 03-rag-pipeline: document chunking for RAG preparation
- Sample data: customers.csv, orders.csv, docs.json

Reusable CI workflow:
- weaver-pipeline-ci.yml: GitHub Actions workflow for pipeline validation
- Auto-validates all YAML files in pipelines/ directory
- Runs inline tests for each pipeline
- Reusable via workflow_call for user projects
@netsirius netsirius closed this Apr 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant