This project showcases the use of Apache Spark (PySpark) for scalable customer analytics. It implements a full pipeline for:
- 📈 Customer segmentation using RFM metrics and K-Means clustering
- 🔍 Churn prediction using Random Forest and Gradient-Boosted Trees

The aim is to support data-driven marketing and customer retention strategies at scale.
Dataset:
- Source: Online Retail (Chen, 2015)
- Transactional data from a UK-based online retailer (Dec 2010 – Dec 2011)
- 541,909 rows × 8 variables
- Focuses on wholesale customers with behavior-based signals
Methodology:
- Data Preprocessing & Cleaning
- Feature Engineering:
  - Derivation of RFM (Recency, Frequency, Monetary) metrics
- Customer Segmentation:
  - K-Means clustering with WCSS and silhouette score optimization
- Churn Prediction:
  - Random Forest and Gradient-Boosted Trees classifiers
  - Hyperparameter tuning and cross-validation
Tech Stack:
- Apache Spark (PySpark)
- MLlib (for machine learning)
- Google Colab (for orchestration and analysis)
Results:
- Achieved 98% classification accuracy with AUC ≈ 0.997 on churn prediction
- Effective segmentation of customer groups for targeted strategies
- Demonstrated end-to-end scalable modeling on a mid-sized dataset (~540K rows)
Limitations:
- Long training time (~4 hours) for ensemble models in PySpark on a single machine
- Lack of distributed infrastructure limited speed and experimentation
Future Work:
- ⚙️ Distributed Training on Spark clusters with HDFS-backed storage to improve scalability
- 🔄 Real-Time Prediction via streaming pipelines for up-to-date churn detection
- 🧩 Feature Expansion to include customer demographics, browsing logs, and interaction metadata