This project is deployed on Streamlit: https://enterprise-email-classifier.streamlit.app/
This project focuses on automatic email classification for enterprise IT support systems. The objective is to classify incoming emails into meaningful categories and assign priorities using Natural Language Processing and Machine Learning techniques.
The dataset simulates real corporate inbox traffic, including noisy text, inconsistent casing, HTML content, and contact details.
Enterprise IT support teams receive a high volume of emails daily. These emails vary in intent and urgency and include spam, complaints, feedback, and service requests.
Manual sorting and prioritization lead to delays, inconsistency, and operational inefficiency. This project addresses the problem by building an automated email classification pipeline.
The main objectives are:
- Automatically classify emails into predefined categories
- Reduce manual effort in email triaging
- Enable faster response and escalation
- Build a scalable baseline NLP system using TF-IDF
Each email belongs to one of the following four classes:
- SPAM
- complaint
- feedback
- request
These categories represent common patterns observed in enterprise IT support inboxes.
Each email is also assigned a priority level:
- low
- medium
- high
This allows models to support priority-based routing and escalation.
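As a rough illustration of priority-based routing, the sketch below serves emails in priority order using a heap. The `TriageQueue` class, the `PRIORITY_RANK` mapping, and the email IDs are hypothetical names made up for this example; only the three priority levels come from the dataset.

```python
import heapq
from itertools import count

# Lower rank is served first; the three levels come from the dataset.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


class TriageQueue:
    """Hypothetical queue that serves higher-priority emails first."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker keeps arrival order within a level

    def push(self, email_id, priority):
        rank = PRIORITY_RANK[priority.lower()]
        heapq.heappush(self._heap, (rank, next(self._order), email_id))

    def pop(self):
        _, _, email_id = heapq.heappop(self._heap)
        return email_id


queue = TriageQueue()
queue.push("e1", "low")
queue.push("e2", "high")
queue.push("e3", "medium")
queue.push("e4", "high")

served = [queue.pop() for _ in range(4)]
print(served)  # ['e2', 'e4', 'e3', 'e1']
```

In practice the priority passed to `push` would be the model's predicted priority rather than a hand-written value.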
The dataset consists of enterprise-style email messages stored in CSV format.
Each row represents one email and contains:
- id
- subject
- body
- label
- priority
The dataset is balanced across all four labels to ensure fair model training.
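A quick way to confirm the balance claim is to count rows per label. The sketch below uses only the standard library and a tiny inline CSV whose rows are invented for illustration; the column names match the list above. With pandas, `read_csv` followed by `value_counts()` on the `label` column would give the same information.

```python
import csv
import io
from collections import Counter

# Tiny inline stand-in for the real CSV; the rows are invented for illustration.
SAMPLE_CSV = """id,subject,body,label,priority
1,Win a prize,Click here now,SPAM,low
2,VPN keeps dropping,This is the third time this week,complaint,high
3,Great support,Thanks for the quick fix,feedback,low
4,New laptop,Please provision a laptop for a new hire,request,medium
"""


def label_counts(csv_file):
    """Count rows per label in a file-like object with a 'label' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


counts = label_counts(io.StringIO(SAMPLE_CSV))
# The data is balanced when every label has the same number of rows.
is_balanced = len(set(counts.values())) == 1
print(dict(counts), is_balanced)
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`.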
The dataset intentionally includes realistic noise such as:
- Random capitalization
- HTML links
- Email addresses
- Lengthy and varied text content
These characteristics help test preprocessing robustness and model generalization.
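One possible cleaning step for this kind of noise is sketched below, using only the standard library. The function name `clean_text` and the regular expressions are assumptions for illustration, not the project's actual preprocessing code; the patterns are deliberately simple and will not cover every URL or address format.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                                # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")                  # bare links
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")     # email addresses
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Strip HTML, links, and addresses, then lowercase and collapse spaces."""
    text = html.unescape(text)      # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)    # also drops URLs inside href attributes
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = text.lower()             # neutralize random capitalization
    return SPACE_RE.sub(" ", text).strip()


raw = (
    'PLEASE Reset my PassWord <a href="http://intranet.example.com/reset">here</a> '
    "or mail John.Doe@Example.com &amp; visit https://example.com/help NOW"
)
cleaned = clean_text(raw)
print(cleaned)  # please reset my password here or mail & visit now
```

Whether to delete links and addresses or replace them with placeholder tokens is a design choice; placeholders can preserve a useful signal for spam detection.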
The project follows a clean ML workflow structure:
- Raw datasets are stored separately
- Cleaned datasets are maintained for modeling
- Train and test splits are isolated
- Experiments are conducted in Jupyter notebooks
This separation improves reproducibility and clarity.
The typical pipeline used in this project is:
- Load raw email data
- Clean and normalize text
- Combine subject and body fields
- Convert text to numerical features
- Train classification models
- Evaluate performance
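The combine and split steps might look like the sketch below. The rows are synthetic placeholders, and the `text` column name, the 75/25 split, and the `random_state` value are assumptions chosen for the example rather than settings taken from the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

LABELS = ["SPAM", "complaint", "feedback", "request"]

# Synthetic placeholder rows (4 per label) standing in for the cleaned CSV.
rows = [
    {"id": i, "subject": f"Subject {i}", "body": f"Body text {i}", "label": LABELS[i % 4]}
    for i in range(16)
]
df = pd.DataFrame(rows)

# Combine subject and body into one text field; fillna guards against blanks.
df["text"] = df["subject"].fillna("") + " " + df["body"].fillna("")

# Stratifying on the label keeps the class balance in both splits.
train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=42
)

print(len(train_df), len(test_df))
print(test_df["label"].value_counts().to_dict())
```

Splitting before fitting any vectorizer or model keeps test emails from influencing the learned vocabulary.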
TF-IDF vectorization is used to convert email text into numerical representations.
TF-IDF helps by:
- Reducing the impact of common words
- Highlighting discriminative terms
- Improving linear model performance
Unigrams and bigrams are used to capture both keywords and short phrases.
The following models are suitable for this dataset:
- Logistic Regression
- Naive Bayes
- Linear Support Vector Machine
Evaluation is performed using accuracy, precision, recall, and F1-score.
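The sketch below trains the three model families on a handful of hand-written emails and prints the listed metrics. The training and test sentences, the dictionary keys, and the hyperparameters (`max_iter=1000`, library defaults elsewhere) are assumptions for illustration; scores on such a tiny sample say nothing about performance on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, hand-written examples; the real project would use the train/test CSVs.
train = [
    ("win a free prize click now", "SPAM"),
    ("limited offer claim your free reward", "SPAM"),
    ("cheap loans click this link", "SPAM"),
    ("the vpn is down again and nobody responded", "complaint"),
    ("my laptop still crashes this is unacceptable", "complaint"),
    ("third outage this week very disappointed", "complaint"),
    ("thanks for the quick fix great service", "feedback"),
    ("the new portal works well nice job", "feedback"),
    ("support was helpful and friendly", "feedback"),
    ("please reset my password", "request"),
    ("please install office on my laptop", "request"),
    ("requesting access to the shared drive", "request"),
]
test = [
    ("claim your free prize now", "SPAM"),
    ("the vpn crashes again very disappointed", "complaint"),
    ("great job thanks for the helpful support", "feedback"),
    ("please grant access to my account", "request"),
]
X_train, y_train = zip(*train)
X_test, y_test = zip(*test)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

results = {}
for name, clf in candidates.items():
    # Each model gets its own TF-IDF step so the vocabulary is fit on training data only.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = accuracy_score(y_test, y_pred)
    print(name)
    # Per-class precision, recall, and F1-score.
    print(classification_report(y_test, y_pred, zero_division=0))
```

Wrapping the vectorizer and classifier in one pipeline also means a single object can be saved and reused for prediction.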
    python -m venv .venv

Windows:

    .venv\Scripts\activate

Linux or macOS:

    source .venv/bin/activate

    pip install -r requirements.txt

Open the notebook and execute cells sequentially:

    jupyter notebook notebooks-kaggle/email-data.ipynb

The notebook covers preprocessing, TF-IDF vectorization, training, and evaluation.
This project can be extended to:
- Enterprise email triaging
- IT support automation
- Spam filtering systems
- Priority-based ticket routing
- Academic NLP projects
Tools and libraries used:
- Python
- pandas
- NumPy
- scikit-learn
- Jupyter Notebook
TF-IDF combined with linear classifiers provides strong baseline performance for email classification tasks. Balanced data improves per-class recall and interpretability.
TF-IDF does not capture semantic meaning or context. Performance may drop on very short or ambiguous emails.
Possible enhancements include:
- Using word embeddings or transformer models
- Handling multilingual emails
- Adding thread-level context
- Deploying as an API-based service
This project is released under the MIT License. All data is synthetically generated and safe for academic and personal use.
Developed for enterprise email classification and NLP experimentation.