A comprehensive, end-to-end Data Lakehouse learning repository featuring production-grade data engineering and analytics projects built on Databricks. This bootcamp covers the complete data lifecycle, from raw data ingestion through advanced business intelligence reporting, using industry-standard tools and architectural patterns.
This repository is designed for aspiring and current data engineers looking to build real-world skills with the Databricks platform. It demonstrates:
- Medallion Architecture (Bronze → Silver → Gold) for data quality progression
- ETL/ELT Pipeline Development with orchestration and error handling
- Data Quality Management with automated validation checks
- Business Intelligence & Analytics using advanced SQL patterns
- Production Best Practices including configuration-driven development, modular code, and comprehensive documentation
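As a rough illustration of the Bronze → Silver → Gold progression listed above, here is a minimal plain-Python sketch. It is not code from this repository: the sample rows, the column names (`order_id`, `customer`, `amount`), and the helper functions `to_silver` and `to_gold` are all hypothetical, and the actual projects use PySpark and Delta Lake rather than lists of dictionaries.

```python
from collections import defaultdict

# Hypothetical raw rows, kept exactly as they arrived from a source system (Bronze).
bronze = [
    {"order_id": "1", "customer": " alice ", "amount": "100.0"},
    {"order_id": "2", "customer": "BOB", "amount": "not_a_number"},
    {"order_id": "3", "customer": "Alice", "amount": "50.5"},
    {"order_id": "3", "customer": "Alice", "amount": "50.5"},  # duplicate row
]


def to_silver(rows):
    """Clean and validate: normalize names, cast types, drop bad and duplicate rows."""
    seen, clean, rejected = set(), [], []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            rejected.append(row)  # keep rejects for inspection instead of failing
            continue
        if row["order_id"] in seen:
            continue  # skip duplicates of an already accepted order
        seen.add(row["order_id"])
        clean.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().title(),
            "amount": amount,
        })
    return clean, rejected


def to_gold(rows):
    """Aggregate cleaned rows into a business-level summary: total sales per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver, rejected = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each layer only reads from the one before it, so a bad row is caught once in Silver and never reaches the aggregated Gold output.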
Perfect for:
- Learning Databricks and modern data engineering
- Building a data engineering portfolio
- Preparing for data engineering interviews
- Understanding real-world lakehouse implementations
databricks_bootcamp_2026/
│
├── bike_lakehouse_2026/     # ETL Pipeline & Medallion Architecture Project
│   ├── bronze/              # Raw data ingestion layer
│   ├── silver/              # Data cleaning & transformation layer
│   ├── gold/                # Business-level aggregated data layer
│   ├── docs/                # Architecture & design documentation
│   └── README.md            # Detailed project documentation
│
├── data_analytics_2026/     # Business Intelligence & SQL Analytics Project
│   ├── 00-13 notebooks      # Progressive SQL analytics tutorials
│   └── README.md            # Detailed analytics documentation
│
├── datasets/                # Sample data files (CSV format)
│
├── LICENSE                  # MIT License
└── README.md                # This file
Focus: Data Engineering, ETL Pipelines, Medallion Architecture
A production-ready lakehouse implementation demonstrating end-to-end data engineering workflows for a bike sales business. This project simulates integrating data from multiple source systems (CRM and ERP) and transforming it into analytics-ready datasets.
- Medallion Architecture implementation (Bronze, Silver, Gold layers)
- Automated Pipeline Orchestration using Databricks Jobs
- Data Quality Checks at Silver and Gold layers
- Multi-Source Data Integration (CRM + ERP systems)
- Configuration-Driven Development for scalability
- Star Schema Design (Fact & Dimension tables)
- Multiple Pipeline Configuration Formats (JSON, YAML, Python SDK)
- PySpark for data transformations
- Delta Lake for ACID transactions
- Databricks Workflows for orchestration
- Unity Catalog for data governance
- Building scalable ETL pipelines
- Implementing data quality frameworks
- Designing dimensional data models
- Orchestrating multi-notebook workflows
- Managing lakehouse metadata and governance
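To make the star schema and multi-source integration ideas above more concrete, the following plain-Python sketch builds one dimension table and one fact table. Everything in it is an assumption for illustration: the record layouts, the names `crm_customers`, `erp_sales`, `dim_customer`, and `fact_sales`, and the use of Python lists in place of the project's Delta tables.

```python
# Hypothetical cleaned (Silver) records from two source systems.
crm_customers = [
    {"customer_id": "C-001", "name": "Alice"},
    {"customer_id": "C-002", "name": "Bob"},
]
erp_sales = [
    {"order_id": 1, "customer_id": "C-001", "amount": 120.0},
    {"order_id": 2, "customer_id": "C-002", "amount": 80.0},
    {"order_id": 3, "customer_id": "C-001", "amount": 40.0},
]

# Dimension table: one row per customer, with a generated surrogate key.
dim_customer = [
    {"customer_key": key, **customer}
    for key, customer in enumerate(crm_customers, start=1)
]

# Lookup from the source (business) key to the surrogate key.
key_by_id = {row["customer_id"]: row["customer_key"] for row in dim_customer}

# Fact table: measures plus a foreign key into the dimension.
fact_sales = [
    {
        "order_id": sale["order_id"],
        "customer_key": key_by_id[sale["customer_id"]],
        "amount": sale["amount"],
    }
    for sale in erp_sales
]

# A typical star-schema query: join fact to dimension, then aggregate.
name_by_key = {row["customer_key"]: row["name"] for row in dim_customer}
sales_by_customer = {}
for fact in fact_sales:
    name = name_by_key[fact["customer_key"]]
    sales_by_customer[name] = sales_by_customer.get(name, 0.0) + fact["amount"]

print(sales_by_customer)
```

The fact table stores only keys and measures, while descriptive attributes such as the customer name live in the dimension and are joined in at query time.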
View Bike Lakehouse Project Details
Focus: Business Intelligence, SQL Analytics, Data Analysis
A comprehensive SQL analytics module that leverages the Gold layer data from the Bike Lakehouse project. This project teaches advanced analytical patterns used by data analysts and analytics engineers in real business scenarios.
- 14 Progressive Analytics Notebooks covering fundamental to advanced techniques
- Window Functions & Aggregations for complex calculations
- Time-Series Analysis (YoY growth, cohorts, trends)
- Customer & Product Segmentation using SQL
- Business Reporting with production-ready SQL queries
- Analytical Patterns for magnitude, ranking, contribution, and cumulative analysis
- Apache Spark SQL
- SQL Window Functions
- Aggregation & Analytical Functions
- Common Table Expressions (CTEs)
- Advanced SQL analytical techniques
- Building business intelligence reports
- Cohort and growth analysis
- Customer segmentation strategies
- Data-driven decision making
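A year-over-year growth calculation is a typical use of the CTE and window-function techniques listed above. The sketch below runs such a query against an in-memory SQLite database from the Python standard library, used here only as a stand-in so the example runs without a cluster; `WITH`, `LAG() OVER (ORDER BY ...)`, and `ROUND` are also available in Spark SQL. The table name `yearly_sales`, its columns, and the figures are hypothetical and are not taken from the Gold layer.

```python
import sqlite3

# In-memory database standing in for a Gold-layer table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yearly_sales (order_year INTEGER, total_sales REAL)")
conn.executemany(
    "INSERT INTO yearly_sales VALUES (?, ?)",
    [(2021, 1000.0), (2022, 1200.0), (2023, 900.0)],
)

# The CTE fetches the previous year's sales with LAG; the outer query
# turns the difference into a percentage change.
query = """
WITH with_previous AS (
    SELECT
        order_year,
        total_sales,
        LAG(total_sales) OVER (ORDER BY order_year) AS previous_sales
    FROM yearly_sales
)
SELECT
    order_year,
    total_sales,
    ROUND((total_sales - previous_sales) * 100.0 / previous_sales, 1) AS yoy_growth_pct
FROM with_previous
ORDER BY order_year
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

The first year has no predecessor, so its growth value is NULL (`None` in Python) rather than zero.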
View Data Analytics Project Details
- Databricks Account (Community Edition or workspace access)
- Basic Knowledge Of:
  - Python fundamentals
  - SQL basics
  - Data warehousing concepts (helpful but not required)
1. Clone the Repository:
   `git clone https://github.com/gooliverani/databricks_bootcamp_2026.git`
2. Upload to Databricks:
   - Navigate to your Databricks workspace
   - Go to Workspace → Users → [Your User]
   - Click Import and select the repository folder
   - Choose Import as Git folder or upload individual notebooks
3. Initialize the Lakehouse:
   - Open `bike_lakehouse_2026/init_lakehouse.ipynb`
   - Run all cells to set up databases and schemas
   - This creates the necessary Unity Catalog structures
4. Load Sample Data:
   - Upload CSV files from `datasets/` to Databricks DBFS or cloud storage
   - Update file paths in Bronze layer notebooks to match your storage location
5. Run the Pipeline:
   - Manual Execution: Run notebooks sequentially (Bronze → Silver → Gold)
   - Automated Execution: Use the provided `pipeline.json`, `pipeline.yaml`, or `pipeline.py` to create a Databricks Job
6. Explore Analytics:
   - After the Gold layer is populated, open the `data_analytics_2026/` notebooks
   - Run analytics notebooks sequentially to learn SQL patterns
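For the automated execution path, a job definition needs to encode the Bronze → Silver → Gold ordering as task dependencies. The sketch below shows one way such a definition might look and derives the run order from it. The field names (`name`, `tasks`, `task_key`, `depends_on`, `notebook_task`, `notebook_path`) follow the Databricks Jobs API, but the job name and notebook paths are placeholders, and the repository's own `pipeline.json` may be structured differently.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job definition; the job name and notebook paths are placeholders.
job_config = {
    "name": "bike_lakehouse_pipeline",
    "tasks": [
        {
            "task_key": "bronze",
            "notebook_task": {"notebook_path": "/Workspace/path/to/bronze_notebook"},
        },
        {
            "task_key": "silver",
            "depends_on": [{"task_key": "bronze"}],
            "notebook_task": {"notebook_path": "/Workspace/path/to/silver_notebook"},
        },
        {
            "task_key": "gold",
            "depends_on": [{"task_key": "silver"}],
            "notebook_task": {"notebook_path": "/Workspace/path/to/gold_notebook"},
        },
    ],
}


def execution_order(config):
    """Return task keys in an order that respects every depends_on entry."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in config["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(job_config)
print(order)
print(json.dumps(job_config, indent=2))
```

Checking the order locally is a cheap way to catch a missing or circular dependency before creating the job; `TopologicalSorter` raises `CycleError` if the tasks depend on each other in a loop.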
| Step | Project | Focus Area | Time Estimate |
|---|---|---|---|
| 1 | Bike Lakehouse - Bronze Layer | Raw data ingestion | 2-3 hours |
| 2 | Bike Lakehouse - Silver Layer | Data cleaning & transformation | 4-6 hours |
| 3 | Bike Lakehouse - Gold Layer | Dimensional modeling | 3-4 hours |
| 4 | Bike Lakehouse - Orchestration | Pipeline automation | 2-3 hours |
| 5 | Data Analytics - Foundations | SQL basics & exploration | 2-3 hours |
| 6 | Data Analytics - Advanced | Analytical patterns | 6-8 hours |
| 7 | Data Analytics - Reporting | Business intelligence | 3-4 hours |
Total Learning Time: ~22-31 hours (sum of the step estimates above)
- Medallion Architecture (Bronze, Silver, Gold)
- ETL vs ELT patterns
- Data quality and validation frameworks
- Incremental data processing
- Pipeline orchestration and scheduling
- Error handling and logging
- Configuration-driven development
- Star schema dimensional modeling
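Incremental data processing, one of the concepts above, is often implemented with a high-watermark: each run loads only the rows that changed since the previous run. The sketch below shows that idea in plain Python. It is an illustrative assumption rather than this project's implementation; the function `incremental_load`, the `updated_at` column, and the sample rows are hypothetical.

```python
from datetime import datetime


def incremental_load(source_rows, target_rows, watermark):
    """Append only rows newer than the watermark and return the new watermark."""
    new_rows = [row for row in source_rows if row["updated_at"] > watermark]
    target_rows.extend(new_rows)
    if new_rows:
        watermark = max(row["updated_at"] for row in new_rows)
    return watermark


# Hypothetical source table with a last-modified timestamp per row.
source = [
    {"id": 1, "updated_at": datetime(2026, 1, 1)},
    {"id": 2, "updated_at": datetime(2026, 1, 2)},
]
target = []
watermark = datetime.min  # nothing has been loaded yet

# First run: both rows are newer than the initial watermark.
watermark = incremental_load(source, target, watermark)

# A new row arrives; the second run picks up only that row.
source.append({"id": 3, "updated_at": datetime(2026, 1, 3)})
watermark = incremental_load(source, target, watermark)

loaded_ids = [row["id"] for row in target]
print(loaded_ids, watermark)
```

Because the watermark only advances when new rows are found, rerunning the load with no new data leaves the target unchanged.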
- Dimensional vs measure analysis
- Magnitude, ranking, and contribution analysis
- Time-series and trend analysis
- Cohort analysis
- Customer segmentation
- Year-over-year growth calculations
- Cumulative metrics and moving averages
- Business reporting best practices
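Several of the analytical patterns above reduce to simple arithmetic over an ordered series. The plain-Python sketch below computes a cumulative total, a trailing moving average, and each period's contribution to the whole. The monthly figures and the helper `moving_average` are hypothetical; in the notebooks the same results would more likely come from SQL window functions such as `SUM(...) OVER (ORDER BY ...)`.

```python
from itertools import accumulate

# Hypothetical monthly sales, already ordered by month.
monthly_sales = [100.0, 150.0, 50.0, 200.0]

# Cumulative metric: running total up to and including each month.
running_total = list(accumulate(monthly_sales))


def moving_average(values, window):
    """Trailing moving average; early points average over the values seen so far."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


# Two-month trailing average smooths out the month-to-month swings.
moving_avg = moving_average(monthly_sales, window=2)

# Contribution analysis: each month's share of total sales, in percent.
total = sum(monthly_sales)
contribution_pct = [round(value * 100 / total, 1) for value in monthly_sales]

print(running_total)
print(moving_avg)
print(contribution_pct)
```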
- PySpark DataFrames and SQL
- Delta Lake and ACID transactions
- Databricks Workflows and Jobs
- Unity Catalog for governance
- Databricks notebooks and widgets
- DBFS and cloud storage integration
Each project includes comprehensive documentation:
- Bike Lakehouse README - ETL pipeline architecture, layer details, orchestration guide
- Data Analytics README - Analytics patterns, SQL techniques, notebook descriptions
This bootcamp prepares you for real-world scenarios including:
- Building enterprise data lakes and lakehouses
- Implementing data quality frameworks
- Creating automated ETL/ELT pipelines
- Designing dimensional data models for analytics
- Developing business intelligence reports
- Performing customer and product analytics
- Conducting cohort and growth analysis
Contributions are welcome! If you'd like to improve the bootcamp:
- Fork the repository
- Create a feature branch (`git checkout -b feature/your-feature`)
- Commit your changes (`git commit -m 'Add new feature'`)
- Push to the branch (`git push origin feature/your-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: Open an issue on GitHub for bugs or feature requests
- Discussions: Use GitHub Discussions for questions and community interaction
- Author: @gooliverani
This project is part of the comprehensive Databricks Bootcamp 2026 course by DataWithBaraa. Special thanks to Baraa for creating excellent educational content on data engineering, analytics, SQL, and Databricks!
Connect with DataWithBaraa:
- YouTube Channel
- Website
- GitHub
- Databricks Documentation
- Apache Spark Documentation
- Delta Lake Documentation
- Medallion Architecture Guide
If you find this bootcamp helpful, please star the repository!