- Project Overview
- Purpose & Scope
- Business Context (Simulated Insurance Company)
- High-Level Architecture
- End-to-End Data Pipeline
- Analytical & BI Layer
- Technology Stack & Design Rationale
- Repository Structure
- Documentation Index
- Data Disclaimer
This project implements an end-to-end insurance analytics data platform, built as a technical showcase for the insurance domain.
The system simulates the full analytical lifecycle of an insurance portfolio — from synthetic data generation to interactive dashboards — with the goal of demonstrating how modern data engineering and analytics capabilities can be applied in insurance environments.
Rather than delivering a production-grade actuarial model, the project focuses on showcasing architecture, data flows, validation logic, analytical schemas, and BI consumption patterns that are commonly required in real insurance analytics initiatives.
The purpose of this project is to serve as a demonstrative analytical platform for insurance-focused use cases.
Specifically, it aims to:
- Illustrate how insurance-related data could be structured for modern analytics
- Provide a controlled environment to explore insurance KPIs and portfolio metrics
- Demonstrate how raw datasets can be transformed into analytics-ready models
- Enable interactive exploration of metrics through BI dashboards
The underlying data model and assumptions are intentionally simplified. They are not intended to represent a realistic or validated actuarial model, but rather to act as a technical showcase and discussion starting point.
In a real-world engagement, a proper actuarial study, business discovery phase, and domain-specific modeling would be required.
The datasets simulate a fictional insurance carrier:
Company: Eagle Insurance Inc. Market: United States Time Horizon: 2024–2025 Portfolio Size: ~2,000 simulated clients Geographic Coverage: 4 regions, 18 states
The simulated context supports exploratory analysis around:
- Portfolio composition
- Policy and premium distribution
- Claims activity
- Expenses and taxes
- Geographic and product segmentation
The business narrative exists to anchor technical and analytical demonstrations, not to model real insurance behavior.
The platform follows a layered analytical architecture designed to resemble common insurance data platforms:
Configuration Layer
↓
Synthetic Data Generation (RAW)
↓
Data Validation & Health Checks
↓
Normalization & Transformation
↓
Analytical Schema (PostgreSQL)
↓
Analytical Views
↓
Interactive BI Dashboards
Each layer introduces explicit validation to ensure data consistency and analytical usability downstream.
The pipeline is implemented as a sequence of modular Python scripts executed in a controlled order:
-
Dataset Configuration
- Generates configuration parameters for the simulated portfolio
- Defines scope, time window, geography, and volume assumptions
-
Synthetic Dataset Generation
-
Produces RAW CSV datasets:
- Clients
- Policies
- Claims
- Expenses
- Taxes
-
-
RAW Dataset Validation
- Validates file presence
- Checks row counts and value ranges
-
Normalization & Transformation
- Cleans and standardizes RAW data
- Produces analytics-ready dimension and fact tables
-
Database Readiness Validation
- Verifies database connectivity and permissions
- Confirms readiness for schema deployment
-
Normalized Data Validation
- Ensures structural and formatting consistency
- Validates expected distributions and integrity
-
Schema Deployment
-
Executes a full SQL schema including:
- Tables and facts
- Indexes
- Row-Level Security (RLS)
- Grants
- Analytical views
-
-
Schema Validation
- Confirms successful schema creation
-
Data Loading
- Loads normalized data into the warehouse
- Enforces correct loading order
-
Post-load Validation
- Confirms data completeness and consistency at database level
Execution details are documented in docs/pipeline.md.
- Google Looker Studio
The BI layer is built on top of dedicated SQL analytical views, each designed to support a specific dashboard or analytical use case.
An executive-style overview focused on high-level portfolio indicators, including:
- Portfolio size and evolution
- Premium exposure
- Policy distribution
- Geographic and product segmentation
🔗 Public Interactive Dashboard: https://lookerstudio.google.com/s/ruPQKlKs2CA
📄 Static Snapshots:
assets/executive_dashboard_annual_nofilters_2024.pdfassets/executive_dashboard_annual_nofilters_2025.pdf
- Claims & Loss Performance
- Operational Metrics
- Risk & Underwriting Insights
Each dashboard is backed by one SQL view per dashboard, maintaining clear analytical boundaries.
- Supabase (PostgreSQL) Chosen for:
- PostgreSQL compatibility
- Row-Level Security support
- Structured schema and permission management
- Cost-efficient experimentation
- SQL-based analytical views
- Google Looker Studio for interactive reporting
- Python
- Pandas
- Modular pipeline architecture
- Explicit validation and health checks
The stack and design choices emphasize clarity, reproducibility, and analytical accessibility.
.
├── docs
├── output
│ ├── config.json
│ ├── raw
│ └── normalized
├── schema
│ └── schema.sql
├── assets
│ └── executive_dashboard_annual_*.pdf
├── scripts
│ └── pipeline modules & healthchecks
└── utils
└── database reset utilities
The following supporting documentation will be available in /docs in next updates of this repository:
architecture.md— System architecturepipeline.md— Pipeline execution flowschema_contract.md— Warehouse schema contractdata_model_design.md— Analytical data modeldata_dictionary.md— Column-level definitionsdata_quality.md— Validation rulesdashboards.md— BI and dashboard designbusiness_assumptions.md— Modeling assumptionssupabase.md— Database setupenv_example.md— Environment variables reference
All data used in this project is fully synthetic.
- It does not represent real customers, policies, claims, or financial records
- The model is simplified and illustrative
- The project is intended for technical demonstration and exploratory analysis only
This repository represents a technical showcase of an insurance analytics platform, designed to illustrate how data engineering and analytics practices can be applied in the insurance domain, while leaving room for future, domain-specific modeling and actuarial validation if required.
