Insurance Analytics End-to-End Data Pipeline

Project Overview

This project implements an end-to-end insurance analytics data platform, built as a technical showcase for the insurance domain.

The system simulates the full analytical lifecycle of an insurance portfolio — from synthetic data generation to interactive dashboards — with the goal of demonstrating how modern data engineering and analytics capabilities can be applied in insurance environments.

Rather than delivering a production-grade actuarial model, the project focuses on showcasing architecture, data flows, validation logic, analytical schemas, and BI consumption patterns that are commonly required in real insurance analytics initiatives.

Purpose & Scope

The purpose of this project is to serve as a demonstrative analytical platform for insurance-focused use cases.

Specifically, it aims to:

Illustrate how insurance-related data could be structured for modern analytics
Provide a controlled environment to explore insurance KPIs and portfolio metrics
Demonstrate how raw datasets can be transformed into analytics-ready models
Enable interactive exploration of metrics through BI dashboards

The underlying data model and assumptions are intentionally simplified. They are not intended to represent a realistic or validated actuarial model, but rather to act as a technical showcase and discussion starting point.

In a real-world engagement, a proper actuarial study, business discovery phase, and domain-specific modeling would be required.

Business Context (Simulated Insurance Company)

The datasets simulate a fictional insurance carrier:

Company: Eagle Insurance Inc. Market: United States Time Horizon: 2024–2025 Portfolio Size: ~2,000 simulated clients Geographic Coverage: 4 regions, 18 states

The simulated context supports exploratory analysis around:

Portfolio composition
Policy and premium distribution
Claims activity
Expenses and taxes
Geographic and product segmentation

The business narrative exists to anchor technical and analytical demonstrations, not to model real insurance behavior.

High-Level Architecture

The platform follows a layered analytical architecture designed to resemble common insurance data platforms:

Configuration Layer
   ↓
Synthetic Data Generation (RAW)
   ↓
Data Validation & Health Checks
   ↓
Normalization & Transformation
   ↓
Analytical Schema (PostgreSQL)
   ↓
Analytical Views
   ↓
Interactive BI Dashboards

Each layer introduces explicit validation to ensure data consistency and analytical usability downstream.

End-to-End Data Pipeline

The pipeline is implemented as a sequence of modular Python scripts executed in a controlled order:

Dataset Configuration
- Generates configuration parameters for the simulated portfolio
- Defines scope, time window, geography, and volume assumptions
Synthetic Dataset Generation
- Produces RAW CSV datasets:
  - Clients
  - Policies
  - Claims
  - Expenses
  - Taxes
RAW Dataset Validation
- Validates file presence
- Checks row counts and value ranges
Normalization & Transformation
- Cleans and standardizes RAW data
- Produces analytics-ready dimension and fact tables
Database Readiness Validation
- Verifies database connectivity and permissions
- Confirms readiness for schema deployment
Normalized Data Validation
- Ensures structural and formatting consistency
- Validates expected distributions and integrity
Schema Deployment
- Executes a full SQL schema including:
  - Tables and facts
  - Indexes
  - Row-Level Security (RLS)
  - Grants
  - Analytical views
Schema Validation
- Confirms successful schema creation
Data Loading
- Loads normalized data into the warehouse
- Enforces correct loading order
Post-load Validation
- Confirms data completeness and consistency at database level

Execution details are documented in docs/pipeline.md.

Analytical & BI Layer

BI Platform

Google Looker Studio

The BI layer is built on top of dedicated SQL analytical views, each designed to support a specific dashboard or analytical use case.

Implemented Dashboard

Dashboard 1 — Executive Portfolio Overview

An executive-style overview focused on high-level portfolio indicators, including:

Portfolio size and evolution
Premium exposure
Policy distribution
Geographic and product segmentation

🔗 Public Interactive Dashboard: https://lookerstudio.google.com/s/ruPQKlKs2CA

📄 Static Snapshots:

assets/executive_dashboard_annual_nofilters_2024.pdf
assets/executive_dashboard_annual_nofilters_2025.pdf

Planned Analytical Dashboards (TBD)

Claims & Loss Performance
Operational Metrics
Risk & Underwriting Insights

Each dashboard is backed by one SQL view per dashboard, maintaining clear analytical boundaries.

Technology Stack & Design Rationale

Data Warehouse

Supabase (PostgreSQL) Chosen for:
PostgreSQL compatibility
Row-Level Security support
Structured schema and permission management
Cost-efficient experimentation

Analytics & BI

SQL-based analytical views
Google Looker Studio for interactive reporting

Data Engineering

Python
Pandas
Modular pipeline architecture
Explicit validation and health checks

The stack and design choices emphasize clarity, reproducibility, and analytical accessibility.

Repository Structure

.
├── docs
├── output
│   ├── config.json
│   ├── raw
│   └── normalized
├── schema
│   └── schema.sql
├── assets
│   └── executive_dashboard_annual_*.pdf
├── scripts
│   └── pipeline modules & healthchecks
└── utils
    └── database reset utilities

Documentation Index

The following supporting documentation will be available in /docs in next updates of this repository:

architecture.md — System architecture
pipeline.md — Pipeline execution flow
schema_contract.md — Warehouse schema contract
data_model_design.md — Analytical data model
data_dictionary.md — Column-level definitions
data_quality.md — Validation rules
dashboards.md — BI and dashboard design
business_assumptions.md — Modeling assumptions
supabase.md — Database setup
env_example.md — Environment variables reference

Data Disclaimer

All data used in this project is fully synthetic.

It does not represent real customers, policies, claims, or financial records
The model is simplified and illustrative
The project is intended for technical demonstration and exploratory analysis only

Closing note

This repository represents a technical showcase of an insurance analytics platform, designed to illustrate how data engineering and analytics practices can be applied in the insurance domain, while leaving room for future, domain-specific modeling and actuarial validation if required.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Insurance Analytics End-to-End Data Pipeline

Table of Contents

Project Overview

Purpose & Scope

Business Context (Simulated Insurance Company)

High-Level Architecture

End-to-End Data Pipeline

Analytical & BI Layer

BI Platform

Implemented Dashboard

Dashboard 1 — Executive Portfolio Overview

Planned Analytical Dashboards (TBD)

Technology Stack & Design Rationale

Data Warehouse

Analytics & BI

Data Engineering

Repository Structure

Documentation Index

Data Disclaimer

Closing note

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
assets		assets
docs		docs
schema		schema
scripts		scripts
utils		utils
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Insurance Analytics End-to-End Data Pipeline

Table of Contents

Project Overview

Purpose & Scope

Business Context (Simulated Insurance Company)

High-Level Architecture

End-to-End Data Pipeline

Analytical & BI Layer

BI Platform

Implemented Dashboard

Dashboard 1 — Executive Portfolio Overview

Planned Analytical Dashboards (TBD)

Technology Stack & Design Rationale

Data Warehouse

Analytics & BI

Data Engineering

Repository Structure

Documentation Index

Data Disclaimer

Closing note

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages