A simple end-to-end data pipeline that models ingestion, validation, and persistence of customer data.
This repository was created as part of a technical assessment.
The goal was to implement a small but complete data pipeline within a limited time frame, focusing on clarity, correctness, and structure rather than production-level completeness.
This repository implements a three-service customer data pipeline:
- `mock-server`: Flask API serving customer records from a JSON dataset
- `pipeline-service`: FastAPI service responsible for ingestion, validation, and read APIs
- `postgres`: PostgreSQL for persistence
Flow:
Flask JSON API → FastAPI ingestion → PostgreSQL → FastAPI read API
- The pipeline is intentionally synchronous and minimal
- Error handling is kept simple given the scope
- The focus is on clarity of flow and separation of concerns over abstraction
In a production setting, this would likely evolve toward:
- streaming or incremental ingestion instead of full in-memory batches
- retry and failure handling strategies
- idempotent processing guarantees
- separation via queues or event-driven pipelines
- observability (logging, metrics, tracing)
The goal is to provide a clear baseline that can be extended, rather than a production-ready system.
Start all services:
docker compose up --build -d
Test endpoints:
curl "http://localhost:5000/api/health"
curl "http://localhost:5000/api/customers?page=1&limit=5"
curl -X POST "http://localhost:8000/api/ingest"
curl "http://localhost:8000/api/customers?page=1&limit=5"
curl "http://localhost:8000/api/customers/{customer_id}"
Install dev dependencies:
uv pip install --python .venv/bin/python -r requirements-dev.txt -r mock-server/requirements.txt -r pipeline-service/requirements.txt
Run tests:
env PYTHONPYCACHEPREFIX=/tmp .venv/bin/python -m pytest -q
Tests focus on:
- API contract validation
- pagination behavior
- ingestion validation (invalid data, conflicts)
- Dataset: ~10,000 customers
- Generated using `Faker`
- Deterministic UUIDs ensure reproducibility
- Generator script is included (`mock-server/scripts/generate_customers.py`)
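One common way to get reproducible identifiers is a name-based UUID derived from a fixed namespace and the record index. The sketch below only illustrates that idea: the namespace string and the `customer_id_for` helper are assumptions for this example and are not taken from `generate_customers.py`, which may derive its UUIDs differently.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "customers.example.invalid")


def customer_id_for(index: int) -> str:
    """Return the same UUID string for the same index on every run."""
    return str(uuid.uuid5(NAMESPACE, f"customer-{index}"))


# Regenerating the dataset yields identical IDs, so fixtures stay stable.
first_run = [customer_id_for(i) for i in range(3)]
second_run = [customer_id_for(i) for i in range(3)]
```

Because the IDs depend only on the namespace and the index, tests can refer to specific customers without checking a generated file into version control.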
- Shared response shape: `data`, `total`, `page`, `limit`
- Defaults: `page=1`, `limit=10`
- Invalid values return `400`
- `limit > 100` is rejected
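The pagination contract above can be sketched as a small framework-independent helper. This is a minimal illustration, not the services' actual code: `paginate` and `PaginationError` are hypothetical names, and treating `page < 1` and `limit < 1` as invalid is an assumption about what "invalid values" covers.

```python
from typing import Any

MAX_LIMIT = 100  # limits above this are rejected, per the contract above


class PaginationError(ValueError):
    """Raised for values that an API layer would map to an HTTP 400."""


def paginate(items: list, page: int = 1, limit: int = 10) -> dict[str, Any]:
    """Slice `items` and wrap the slice in the shared response shape."""
    if page < 1:
        raise PaginationError("page must be >= 1")
    if limit < 1 or limit > MAX_LIMIT:
        raise PaginationError(f"limit must be between 1 and {MAX_LIMIT}")
    start = (page - 1) * limit
    return {
        "data": items[start:start + limit],
        "total": len(items),
        "page": page,
        "limit": limit,
    }


# The last page of a 25-item collection holds the remaining 5 items.
last_page = paginate(list(range(25)), page=3, limit=10)
```

A request handler would catch `PaginationError` and translate it into a 400 response, which keeps the validation rules in one place for both services.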
- Uses `dlt` to load into PostgreSQL
- Data is staged before merging into the main table
- Staging allows explicit validation and failure tracking
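To illustrate why staging helps, the sketch below splits staged rows into rows that may be merged and rows that are recorded as failures with a reason. It does not use `dlt`; the validation rules, the loose email pattern, and the `validate_row` and `split_staged` helpers are assumptions made for this example.

```python
import re
import uuid
from typing import Optional

# Deliberately loose check: something@something.tld with no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_row(row: dict) -> Optional[str]:
    """Return a failure reason, or None when the row looks valid."""
    try:
        uuid.UUID(str(row.get("customer_id")))
    except ValueError:
        return "invalid customer_id"
    email = row.get("email")
    if not isinstance(email, str) or not EMAIL_RE.match(email):
        return "invalid email"
    return None


def split_staged(rows: list) -> tuple:
    """Separate staged rows into (valid_rows, failures)."""
    valid, failures = [], []
    for row in rows:
        reason = validate_row(row)
        if reason is None:
            valid.append(row)
        else:
            failures.append({"row": row, "reason": reason})
    return valid, failures


staged = [
    {"customer_id": "123e4567-e89b-12d3-a456-426614174000", "email": "a@example.com"},
    {"customer_id": "not-a-uuid", "email": "b@example.com"},
    {"customer_id": "123e4567-e89b-12d3-a456-426614174001", "email": "missing-at-sign"},
]
valid_rows, failed_rows = split_staged(staged)
```

Only `valid_rows` would proceed to the merge step, while `failed_rows` keeps enough context to inspect each rejected record later.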
Notes:
- Source API is fetched page-by-page
- Current implementation accumulates data in memory before loading
This is acceptable for the current dataset (~10k rows), but larger datasets would require incremental ingestion.
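A possible shape for incremental ingestion is a generator that yields one page at a time, so each batch can be loaded and released before the next is fetched. This is a sketch under assumptions: `iter_pages` is a hypothetical helper, and `fake_fetch` stands in for the HTTP call to the source API, returning the same `data`/`total`/`page`/`limit` shape.

```python
from typing import Callable, Iterator


def iter_pages(fetch_page: Callable[[int, int], dict], limit: int = 100) -> Iterator[list]:
    """Yield one page of records at a time until the source is exhausted."""
    page = 1
    while True:
        body = fetch_page(page, limit)
        records = body["data"]
        if not records:
            return
        yield records
        if page * limit >= body["total"]:
            return
        page += 1


ROWS = list(range(250))  # stand-in dataset


def fake_fetch(page: int, limit: int) -> dict:
    """Stand-in for requesting one page from the source API."""
    start = (page - 1) * limit
    return {"data": ROWS[start:start + limit], "total": len(ROWS), "page": page, "limit": limit}


loaded = 0
for batch in iter_pages(fake_fetch, limit=100):
    loaded += len(batch)  # a real pipeline would load this batch here
```

With this structure, peak memory is bounded by the page size rather than by the size of the whole dataset.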
- `customer_id` is the primary key (upsert)
- Existing rows are updated while preserving `created_at`
- `updated_at` is set only on updates
- Email conflicts across different `customer_id` values are skipped and logged
- Invalid data is skipped and recorded in `ingest_failures`
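The merge rules above can be illustrated with a plain SQL upsert. The sketch uses an in-memory SQLite database so that it runs without any services; the project itself loads through `dlt` into PostgreSQL, so the table definition, the `upsert_customer` helper, and the reduced column set are assumptions for this example only. PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form.

```python
import logging
import sqlite3

log = logging.getLogger("ingest")

# updated_at stays NULL until a row is updated for the first time.
SCHEMA = """
CREATE TABLE customers (
    customer_id TEXT PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE,
    created_at  TEXT NOT NULL,
    updated_at  TEXT
);
"""

# On a customer_id conflict only email and updated_at change;
# created_at is deliberately left untouched.
UPSERT = """
INSERT INTO customers (customer_id, email, created_at, updated_at)
VALUES (:customer_id, :email, :now, NULL)
ON CONFLICT (customer_id) DO UPDATE SET
    email = excluded.email,
    updated_at = :now
"""


def upsert_customer(conn: sqlite3.Connection, row: dict, now: str) -> bool:
    """Apply one row; return False when an email conflict causes a skip."""
    params = {"customer_id": row["customer_id"], "email": row["email"], "now": now}
    try:
        conn.execute(UPSERT, params)
    except sqlite3.IntegrityError as exc:
        # Raised when a different customer_id already owns this email.
        log.warning("skipping %s: %s", row["customer_id"], exc)
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
upsert_customer(conn, {"customer_id": "c1", "email": "a@example.com"}, now="2024-01-01T00:00:00Z")
upsert_customer(conn, {"customer_id": "c1", "email": "b@example.com"}, now="2024-01-02T00:00:00Z")
applied = upsert_customer(conn, {"customer_id": "c2", "email": "b@example.com"}, now="2024-01-03T00:00:00Z")
conn.commit()
```

After these three calls the table holds a single row for `c1` with its original `created_at`, the later `updated_at`, and the new email, while the `c2` row is skipped because its email already belongs to `c1`.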
- `mock-server` is independent from PostgreSQL
- `pipeline-service` owns ingestion, validation, persistence, and read APIs
- SQLAlchemy is used for schema modeling and queries
- The mock server loads the full dataset in memory at startup for simplicity
This repository is intentionally scoped to a small dataset and a simple pipeline, but reflects the trade-offs involved when evolving toward more robust data systems.