This project showcases the use of Apache Spark (PySpark) for scalable customer analytics. It implements a full pipeline for:
- 📈 Customer segmentation using RFM metrics and K-Means clustering
- 🔍 Churn prediction using Random Forest and Gradient-Boosted Trees

The aim is to support data-driven marketing and customer retention strategies at scale.
Dataset:
- Source: Online Retail (Chen, 2015)
- Transactional data from a UK-based online retailer (Dec 2010 – Dec 2011)
- 541,909 rows × 8 variables
- Focuses on wholesale customers with behavior-based signals
Methodology:
- Data Preprocessing & Cleaning
- Feature Engineering:
  - Derivation of RFM (Recency, Frequency, Monetary) metrics
- Customer Segmentation:
  - K-Means clustering with WCSS and silhouette score optimization
- Churn Prediction:
  - Random Forest and Gradient-Boosted Trees classifiers
  - Hyperparameter tuning and cross-validation
Tech Stack:
- Apache Spark (PySpark)
- MLlib (for machine learning)
- Google Colab (for orchestration and analysis)
Results:
- Achieved 98% classification accuracy with AUC ≈ 0.997 on churn prediction
- Effective segmentation of customer groups for targeted strategies
- Demonstrated end-to-end scalable modeling on a mid-sized dataset (~540K rows)
Limitations:
- Long training time (~4 hours) for ensemble models in PySpark on a single machine
- Lack of distributed infrastructure limited speed and experimentation
Future Work:
- ⚙️ Distributed Training on Spark clusters with HDFS-backed storage to improve scalability
- 🔄 Real-Time Prediction via streaming pipelines for up-to-date churn detection
- 🧩 Feature Expansion to include customer demographics, browsing logs, and interaction metadata