

# Sparkify Data Pipeline

*using Apache Airflow, AWS Redshift and AWS S3*

## Project Summary

In this project, I build on a previous ETL Pipeline project, replacing it with more automated and better-monitored pipelines orchestrated primarily through Apache Airflow. The data again comes from a fictitious music streaming service named Sparkify.

The pipeline channels data from Amazon Web Services' (AWS) Simple Storage Service (S3) into an AWS Redshift data warehouse (in the form of staging tables). The source datasets consist of JSON logs describing user activity in the application and JSON metadata about the songs the users listen to.
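Staging JSON from S3 into Redshift typically comes down to rendering a `COPY` statement. A minimal sketch of that rendering step is below; the table name, bucket paths, and IAM role ARN are illustrative placeholders, not values from this project.

```python
def build_copy_sql(table, s3_path, iam_role, json_option="auto"):
    """Render a Redshift COPY statement that loads JSON from S3 into a staging table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )

# Example: stage the event logs, pointing COPY at a hypothetical JSONPaths file.
print(build_copy_sql(
    "staging_events",
    "s3://example-bucket/log_data",
    "arn:aws:iam::123456789012:role/redshift-s3-read",
    "s3://example-bucket/log_json_path.json",
))
```

Parameterizing the statement this way is what makes the staging task reusable: the same operator can load either dataset by swapping the table and S3 path.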

The pipelines are dynamic and built from reusable tasks; they are monitored and allow for easy backfills. Automated data quality checks also run against the data warehouse after each load, to catch any discrepancies in the datasets.
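The shape of such a quality check is a SQL statement paired with an expectation on its result. The framework-free sketch below shows that pattern; the SQL, table names, and the `fake_results` stand-in for a live Redshift connection are all illustrative assumptions.

```python
def run_quality_checks(checks, run_query):
    """Run each check's SQL via run_query and raise if any expectation fails."""
    failures = []
    for check in checks:
        result = run_query(check["sql"])
        if not check["expect"](result):
            failures.append((check["sql"], result))
    if failures:
        raise ValueError(f"Data quality checks failed: {failures}")
    return len(checks)

# Stand-in for a database hook: canned scalar results keyed by SQL text.
fake_results = {
    "SELECT COUNT(*) FROM users": 104,
    "SELECT COUNT(*) FROM songs WHERE songid IS NULL": 0,
}
checks = [
    {"sql": "SELECT COUNT(*) FROM users", "expect": lambda n: n > 0},
    {"sql": "SELECT COUNT(*) FROM songs WHERE songid IS NULL", "expect": lambda n: n == 0},
]
run_quality_checks(checks, fake_results.get)
```

Raising on failure is the important design choice: in Airflow a raised exception marks the task as failed, which surfaces the discrepancy in monitoring and can trigger retries.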

## Apache Airflow

In Airflow, I create custom operators to perform tasks that stage the data, fill the data warehouse, and run data quality checks.
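The core of a custom staging operator can be sketched without the framework. In real Airflow code this class would subclass `BaseOperator` and get its connection from a hook; here `run_sql` is an injected stand-in, and the table, bucket, and role ARN are hypothetical.

```python
class StageToRedshiftOperator:
    """Framework-free sketch of a custom operator that stages S3 JSON into Redshift."""

    def __init__(self, table, s3_path, iam_role, run_sql):
        self.table = table
        self.s3_path = s3_path
        self.iam_role = iam_role
        self.run_sql = run_sql  # stand-in for a database hook's run method

    def execute(self):
        # Clear the staging table, then bulk-load the JSON files from S3.
        self.run_sql(f"TRUNCATE {self.table};")
        self.run_sql(
            f"COPY {self.table} FROM '{self.s3_path}' "
            f"IAM_ROLE '{self.iam_role}' FORMAT AS JSON 'auto';"
        )

# Record the statements the operator would run against Redshift.
executed = []
StageToRedshiftOperator(
    "staging_songs",
    "s3://example-bucket/song_data",
    "arn:aws:iam::123456789012:role/redshift-s3-read",
    executed.append,
).execute()
```

Truncating before loading keeps the staging step idempotent, which is what makes Airflow backfills safe to rerun.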

*(Diagram: flow of tasks)*