data-spain — Azure Data Pipeline · Spanish Labor Market

An end-to-end Azure data pipeline that ingests open data from Spain's National Statistics Institute (INE), processes employment and labor market data through a medallion architecture (bronze, silver, and gold layers), and delivers insights through Power BI.

Architecture

flowchart LR
    INE[INE API<br/>servicios.ine.es] -->|REST/JSON| ADF[Azure Data Factory<br/>Orchestration]
    ADF -->|Raw Parquet| BRONZE[ADLS Gen2<br/>Bronze Layer]
    BRONZE -->|Spark| DBR[Azure Databricks<br/>PySpark + Delta Lake]
    DBR -->|Delta Tables| SILVER[ADLS Gen2<br/>Silver Layer]
    DBR -->|Aggregations| GOLD[ADLS Gen2<br/>Gold Layer]
    GOLD -->|SQL| SYN[Azure Synapse<br/>Analytics]
    SYN -->|SQL models| DBT[dbt<br/>Data Modeling]
    DBT -->|Marts via DirectQuery| PBI[Power BI Service<br/>Public Dashboard]
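
To make the role of each medallion layer concrete, here is a minimal plain-Python sketch of the bronze, silver, and gold steps. It is an illustration only: the real transformations run as PySpark on Databricks, and the record shape, the `2023T1` period format, the function names, and every value below are assumptions made up for the example, not real INE figures.

```python
from collections import defaultdict
from statistics import mean

# Illustrative raw records as they might land in bronze.
# Values are invented for the example, not real INE data.
bronze = [
    {"region": "Aragón", "period": "2023T1", "rate": "9.5"},
    {"region": "Aragón", "period": "2023T1", "rate": "9.5"},   # duplicate row
    {"region": "Aragón", "period": "2023T2", "rate": "8.5"},
    {"region": "National", "period": "2023T1", "rate": None},  # missing value
    {"region": "National", "period": "2023T2", "rate": "12.0"},
]


def to_silver(rows):
    """Clean and type raw rows: drop missing values and duplicates, parse fields."""
    seen, silver = set(), []
    for row in rows:
        if row["rate"] is None:
            continue
        key = (row["region"], row["period"])
        if key in seen:
            continue
        seen.add(key)
        year, quarter = row["period"].split("T")
        silver.append({
            "region": row["region"],
            "year": int(year),
            "quarter": int(quarter),
            "unemployment_rate": float(row["rate"]),
        })
    return silver


def to_gold(rows):
    """Aggregate cleaned rows into a yearly average rate per region."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["region"], row["year"])].append(row["unemployment_rate"])
    return {key: mean(values) for key, values in grouped.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze keeps the data exactly as received, silver holds one typed row per region and period, and gold holds the aggregates that the warehouse and dashboard consume.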

Dataset

Source: INE (Instituto Nacional de Estadística) — public REST API, no authentication required.

Datasets:

EPA (Encuesta de Población Activa) — quarterly unemployment rate by autonomous community (2005–present)
Social Security Affiliates — monthly active workers by sector and region

Focus: Aragón autonomous community labor market trends, with national comparison.
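
As a rough idea of what the ingestion step involves, the sketch below builds a request URL for the INE JSON service and flattens a response into rows. It is not the code in `scripts/ine_api_client.py`. The `DATOS_TABLA` path, the `nult` parameter, and the response field names (`COD`, `Nombre`, `Data`, `Anyo`, `FK_Periodo`, `Valor`) are assumptions about the service and should be checked against the INE API documentation; the table id in the usage comment is a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base path of the INE JSON service.
INE_BASE_URL = "https://servicios.ine.es/wstempus/js"


def build_table_url(table_id, language="ES", last_n=None):
    """Build the URL for a table request; last_n limits the number of periods."""
    url = f"{INE_BASE_URL}/{language}/DATOS_TABLA/{table_id}"
    if last_n is not None:
        url += "?" + urlencode({"nult": last_n})
    return url


def fetch_table(table_id, last_n=None, timeout=30):
    """Download and decode one table (performs a network call)."""
    with urlopen(build_table_url(table_id, last_n=last_n), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def flatten_series(payload):
    """Turn the assumed nested response (series -> data points) into flat rows."""
    rows = []
    for series in payload:
        for point in series.get("Data", []):
            rows.append({
                "series_code": series.get("COD"),
                "series_name": series.get("Nombre"),
                "year": point.get("Anyo"),
                "period_id": point.get("FK_Periodo"),
                "value": point.get("Valor"),
            })
    return rows


# Usage (placeholder table id, requires network access):
# rows = flatten_series(fetch_table("1234", last_n=4))
```

Keeping URL construction and flattening separate from the network call makes both easy to test offline with a saved sample response.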

Tech Stack

Layer	Technology
Orchestration	Azure Data Factory
Storage	Azure Data Lake Storage Gen2 (bronze / silver / gold)
Processing	Azure Databricks + PySpark + Delta Lake
Modeling	dbt (data build tool)
Warehouse	Azure Synapse Analytics
Visualization	Power BI Service (public dashboard)
IaC	Azure Bicep
CI/CD	GitHub Actions

DP-203 Coverage

This project covers the following DP-203 (Azure Data Engineer Associate) exam domains:

Design and implement data storage — ADLS Gen2, Delta Lake, Synapse dedicated SQL pool
Design and develop data processing — Databricks notebooks, ADF pipelines, PySpark transformations
Design and implement data security — RBAC, managed identities, Key Vault secret references
Monitor and optimize data storage and processing — Delta Lake optimization, Synapse performance tuning

Dashboard

Link placeholder — will be populated after Power BI Service deployment.

Local Development

pip install -r requirements.txt
cp .env.example .env
# Edit .env with your Azure credentials
python scripts/ine_api_client.py
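
For reference, a script can pick up the values in `.env` with a few lines of standard-library Python. This is a minimal sketch, assuming simple `KEY=VALUE` lines; the helper names are hypothetical, and the project may load its settings differently (for example with a dedicated package).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values):
    """Export parsed values without overriding variables already set in the shell."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

Using `setdefault` lets a variable exported in the shell or in CI take precedence over the local file.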

Infrastructure Deploy

az login
az group create --name rg-data-spain --location westeurope
az deployment group create \
  --resource-group rg-data-spain \
  --template-file infra/main.bicep \
  --parameters infra/parameters.json

Project Structure

data-spain/
├── infra/                  # Infrastructure as Code (Bicep)
├── pipelines/              # Azure Data Factory pipeline definitions
├── notebooks/              # Databricks notebooks (bronze/silver/gold)
├── dbt/                    # dbt models, tests, and config
├── scripts/                # Python utilities (INE API client, schema validation)
├── docs/                   # Technical documentation
├── .github/workflows/      # CI/CD (GitHub Actions)
├── .gitignore
├── .env.example
├── requirements.txt
└── README.md

