Forecast weekly product sales to optimize stock replenishment for four selected stores in Istanbul. The aim is to reduce stockouts and overstocking using time series models and engineered features that reflect business and temporal dynamics.
The project was developed using a Lakehouse architecture in Microsoft Fabric, structured into:
- Bronze Layer: Raw data ingestion and conversion to Delta format
- Silver Layer: Data cleaning, normalization, and enrichment
- Gold Layer: Dimensional modeling using a star schema
- ML Area: Feature engineering and machine learning models
- Power BI: Visualization of results and reporting
- Notebook: `01_bronze_transform.ipynb` – converted raw data files into Delta format in `Projeto_II_Bronze_`
- Notebook: `02_silver_cleaning.ipynb` – applied data cleaning: handled nulls, standardized formats, fixed inconsistent codes (e.g., store/city codes)
- Notebook: `03_gold_modeling.ipynb` – created a star schema:
  - `dim_stores`
  - `dim_products`
  - `dim_dates` (with holiday/weekend flags)
  - `fact_sales`
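To show how the star schema supports analysis, here is a minimal pandas sketch of a fact-to-dimension join and aggregation. Table and column names follow the schema above; the sample rows are invented for illustration:

```python
import pandas as pd

# Minimal stand-ins for the gold-layer tables (invented sample rows)
dim_dates = pd.DataFrame({
    "date_id": [1, 2],
    "date": pd.to_datetime(["2021-05-01", "2021-05-03"]),
    "is_weekend": [True, False],
    "is_holiday": [False, False],
})
dim_stores = pd.DataFrame({"store_id": ["ST01"], "city": ["Istanbul"]})
fact_sales = pd.DataFrame({
    "date_id": [1, 2],
    "store_id": ["ST01", "ST01"],
    "product_id": ["P001", "P001"],
    "sales": [10, 4],
})

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate by a dimension attribute
weekend_sales = (
    fact_sales
    .merge(dim_dates, on="date_id")
    .merge(dim_stores, on="store_id")
    .groupby("is_weekend")["sales"]
    .sum()
)
print(weekend_sales)
```

Keeping descriptive attributes in the dimension tables is what lets downstream notebooks and Power BI slice sales by calendar flags without duplicating that logic.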
Notebooks: `04_eda_full_dataset.ipynb`, `05_eda_define_poc.ipynb`, `06_eda_initial_insights_poc.ipynb`
Key insights:
- Defined the Proof of Concept (PoC) with 4 Istanbul stores (one per store type: ST01 to ST04)
- Focused on products from categories `hierarchy1_id` ∈ [H00, H01, H02, H03]
- Detected seasonality, stockout patterns, outliers, and temporal trends
- Identified missing data patterns and selected relevant variables for modeling
Notebook: `07_feature_engineering.ipynb`
Engineered variables:
- `weekly_sales` – average product sales per week
- `avg_stock` – weekly average available stock
- `avg_price` – average price during the week
- `promo_bin_*_rate` – proportion of time in promotion
- `lag_sales_1w`, `lag_sales_2w`, `lag_sales_3w` – previous weeks' sales to capture autoregressive effects
- `is_weekend`, `is_holiday` – calendar-based flags
- `num_obs` – number of transactions per week
These features were used as input for time series and regression models.
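The lag features above can be sketched in a few lines of pandas. The weekly grouping and column names here are simplified assumptions about the underlying table; the sales values are invented:

```python
import pandas as pd

# Invented weekly sales for one product at one store
df = pd.DataFrame({
    "week": pd.period_range("2021-01-04", periods=6, freq="W"),
    "weekly_sales": [12.0, 15.0, 9.0, 14.0, 11.0, 13.0],
})

# Lag features capture autoregressive effects: last weeks' sales
for k in (1, 2, 3):
    df[f"lag_sales_{k}w"] = df["weekly_sales"].shift(k)

# The first rows have incomplete lags and cannot be used for training
train = df.dropna().reset_index(drop=True)
print(train[["weekly_sales", "lag_sales_1w", "lag_sales_2w", "lag_sales_3w"]])
```

In the real pipeline the `shift` would be applied per `(store_id, product_id)` group so lags never leak across series.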
Notebooks: `08_model_arima.ipynb`, `09_model_sarima.ipynb`, `10_model_sarimax.ipynb`, `11_model_linear_regression.ipynb`
Models used:
- ARIMA – baseline model without seasonality or exogenous variables
- SARIMA – includes weekly seasonality
- ARIMAX/SARIMAX – adds exogenous variables (e.g., `avg_stock`, `lag_sales_1w`)
- Linear Regression – serves as a benchmark using manually engineered features
Models were tested by product_id across the selected stores and hierarchies. Evaluation results were logged using Microsoft Fabric’s ML Experiments.
Power BI Reports:
- `project_II_final_report.pdf` – final model forecasts and performance
- `project_II_presentation.pdf` – final project presentation
Output files:
- Weekly forecasts and model evaluation results exported to `.csv` for further analysis
```
project_II_stock_forecasting/
│
├── notebooks/
│   ├── 01_bronze_transform.ipynb
│   ├── 02_silver_cleaning.ipynb
│   ├── 03_gold_modeling.ipynb
│   ├── 04_eda_full_dataset.ipynb
│   ├── 05_eda_define_poc.ipynb
│   ├── 06_eda_initial_insights_poc.ipynb
│   ├── 07_feature_engineering.ipynb
│   ├── 08_model_arima.ipynb
│   ├── 09_model_sarima.ipynb
│   ├── 10_model_sarimax.ipynb
│   └── 11_model_linear_regression.ipynb
│
├── reports/
│   ├── project_II_presentation.pdf
│   └── project_II_final_report.pdf
│
├── outputs/
│   └── README.txt   # Output files generated by the models
│
├── data/
│   └── README.md    # How to access Fabric data (without uploading raw files)
│
├── requirements.txt
├── .gitignore
└── README.md
```
Models were evaluated using:
- MAE – Mean Absolute Error
- AIC / BIC – Model fit comparison for ARIMA-type models
Evaluation was done at the product level, grouped by store and hierarchy (`store_id`, `hierarchy1_id`).
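For reference, the MAE metric and an AIC-based model comparison can be sketched as follows (the forecast values and AIC scores are invented):

```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of forecast errors."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs(actual - predicted))

actual = [12, 15, 9, 14]
predicted = [11, 16, 10, 13]
print("MAE:", mae(actual, predicted))  # → 1.0

# AIC/BIC: lower is better; used to compare ARIMA-type fits on the
# same series (invented scores for illustration)
candidates = {"ARIMA": 412.3, "SARIMA": 398.7, "SARIMAX": 391.2}
best = min(candidates, key=candidates.get)
print("Best by AIC:", best)
```

MAE is comparable across models on the same series but not across products with different sales scales, which is why evaluation is grouped per `store_id` and `hierarchy1_id`.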
- Demand-Instead-of-Sales Target
  - Detect `stock = 0` situations and impute latent demand so the model learns true customer interest, not shelf availability.
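One possible sketch of this idea in pandas, where stockout weeks are masked and a latent-demand estimate is interpolated from neighbouring weeks (the masking and linear-interpolation strategy is an assumption, not the final design):

```python
import pandas as pd

# Invented weekly series: sales drop to 0 only because stock ran out
df = pd.DataFrame({
    "weekly_sales": [10.0, 12.0, 0.0, 0.0, 11.0],
    "avg_stock":    [40.0, 35.0, 0.0, 0.0, 30.0],
})

# Zero sales during a stockout reflects shelf availability, not demand:
# mask those weeks, then interpolate a latent-demand estimate
stockout = df["avg_stock"] == 0
df["demand"] = df["weekly_sales"].mask(stockout).interpolate()
print(df)
```

Training on `demand` instead of `weekly_sales` keeps the model from learning that stockout weeks mean "no customer interest".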
- Marketplace-Seller Segmentation
  - Enrich the dataset with third-party marketplace sellers.
  - Cluster sellers via K-Means (or HDBSCAN for variable density) to uncover behaviour-based segments.
- Gradient-Boosting Experiments
  - Benchmark LightGBM / XGBoost against SARIMAX and Lasso to capture non-linear feature interactions.
  - Use SHAP values for interpretability.
- Continuous Forecasting Pipeline
  - Containerise the model and schedule weekly retraining + scoring (GitHub Actions → Azure Functions).
  - Push fresh forecasts to the `replenishment_table` and trigger a Power BI refresh via the REST API.
Developed as part of the Postgraduate Program in Analytics & Data Science.
- André Teixeira – Feature Engineering & Time Series Modeling
- Patrícia Pereira – Exploratory Data Analysis
- Rodrigo Diogo – Linear Regression & Time Series Modeling
- Vitor Meirelles – Lakehouse Architecture in Microsoft Fabric