A machine learning-based SMS/Email spam classifier built with Streamlit and deployed on Heroku. The project uses Natural Language Processing (NLP) techniques and Naive Bayes classification to accurately identify spam messages.
This mini project demonstrates an end-to-end machine learning workflow, from data exploration to model deployment:
- Dataset: SMS Spam Collection with 5,572 messages
- Accuracy: 97.20%
- Algorithm: Multinomial Naive Bayes with TF-IDF vectorization
- Source: SMS Spam Collection Dataset
- Total Messages: 5,572 (5,169 after removing 403 duplicates)
- Distribution:
  - Ham (Legitimate): 87.37% (4,516 messages)
  - Spam: 12.63% (653 messages)
- File: `spam.csv`
The dataset is imbalanced with significantly more ham messages than spam. The model handles this through appropriate algorithm selection (Naive Bayes performs well on imbalanced text data).
SpamDetection_MiniProject/
├── app.py
├── sms-spam-detection.ipynb
├── model.pkl
├── vectorizer.pkl
├── spam.csv
├── requirements.txt
├── Procfile
├── setup.sh
├── nltk.txt
└── .gitignore
- Python 3.7+
- pip or conda package manager
- Clone the repository
  ```bash
  git clone <repository-url>
  cd SpamDetection_MiniProject
  ```
- Install dependencies
  ```bash
  pip install -r requirements.txt
  ```
- Download NLTK data (required for text preprocessing)
  ```bash
  python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
  ```
- Run the application
  ```bash
  streamlit run app.py
  ```
- Access the app: open your browser and navigate to `http://localhost:8501`
- Launch the Streamlit app
- Enter your SMS/Email message in the text area
- Click the "Predict" button
- The app will classify the message as either Spam or Not Spam
Spam Example:
WINNER!! You have been selected to receive a $1000 prize.
Call now at 555-0123 to claim your reward!
Ham Example:
Hey, are we still meeting for lunch at 1pm today? Let me know!
- Removed 403 duplicate messages
- Dropped unnecessary columns (Unnamed: 2, 3, 4)
- Encoded target labels (ham=0, spam=1)
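The cleaning steps above can be sketched with pandas. The tiny inline CSV is a stand-in for the real `spam.csv`, which has 5,572 rows plus the mostly empty `Unnamed: 2/3/4` columns:

```python
from io import StringIO

import pandas as pd

# Tiny stand-in for spam.csv (the real file has 5,572 rows)
raw = StringIO(
    "v1,v2,Unnamed: 2\n"
    "ham,Hey are we still meeting for lunch,\n"
    "spam,WINNER claim your prize now,\n"
    "ham,Hey are we still meeting for lunch,\n"  # duplicate row
)
df = pd.read_csv(raw)

df = df.drop(columns=["Unnamed: 2"])                    # drop unneeded columns
df = df.rename(columns={"v1": "target", "v2": "text"})
df = df.drop_duplicates(keep="first")                   # 5,572 -> 5,169 on the real data
df["target"] = df["target"].map({"ham": 0, "spam": 1})  # encode labels

print(len(df), df["target"].tolist())  # 2 rows remain: [0, 1]
```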
- Character Analysis: Spam messages are longer (avg 137 chars) vs ham (avg 70 chars)
- Word Count: Spam has more words (avg 27) vs ham (avg 17)
- Sentence Count: Spam has more sentences (avg 3) vs ham (avg 2)
- Word Clouds: Visualized most common words in spam vs ham messages
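The character/word/sentence statistics above come from per-message count columns. The notebook derives them with `nltk.word_tokenize` and `nltk.sent_tokenize`; the sketch below uses plain string operations as a stand-in so it runs without NLTK data:

```python
import pandas as pd

df = pd.DataFrame({
    "text": [
        "WINNER!! You have been selected to receive a $1000 prize. Call now!",
        "Hey, are we still meeting for lunch at 1pm today?",
    ]
})

# Approximate counts; the notebook uses NLTK tokenizers instead
df["num_characters"] = df["text"].str.len()
df["num_words"] = df["text"].str.split().str.len()
df["num_sentences"] = df["text"].str.count(r"[.!?]+")

print(df[["num_characters", "num_words", "num_sentences"]])
```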
1. Lowercase conversion
2. Tokenization (word_tokenize)
3. Remove special characters (keep alphanumeric only)
4. Remove stop words & punctuation
5. Porter Stemming (reduce words to root form)
Example Transformation:
Input: "I'm gonna be home soon and i don't want to talk about this stuff anymore tonight"
Output: "gon na home soon want talk stuff anymor tonight"
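The five steps above are typically wrapped in a single `transform_text` helper. The sketch below follows the same steps but swaps NLTK's `word_tokenize` and stopword corpus for a regex tokenizer and a small inline stopword subset, so it runs without downloading NLTK data (which is also why "gonna" stays whole here instead of splitting into "gon na"):

```python
import re

from nltk.stem.porter import PorterStemmer  # pure Python, no corpus download needed

# Small subset of NLTK's English stopword list, inlined for self-containment
STOPWORDS = {"i", "m", "be", "and", "don", "t", "to", "a", "about", "this"}

stemmer = PorterStemmer()

def transform_text(text: str) -> str:
    text = text.lower()                                  # 1. lowercase
    tokens = re.findall(r"[a-z0-9]+", text)              # 2+3. tokenize, keep alphanumeric
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. drop stop words
    return " ".join(stemmer.stem(t) for t in tokens)     # 5. Porter stemming

print(transform_text(
    "I'm gonna be home soon and i don't want to talk about this stuff anymore tonight"
))
```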
- TF-IDF Vectorization: max_features=3000
- Converts text into numerical vectors representing term importance
- Alternative features tested: character count, word count, sentence count
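A minimal sketch of the TF-IDF step on a toy corpus (the project fits the vectorizer on all 5,169 preprocessed messages):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "winner claim your free prize now",
    "are we still meeting for lunch",
    "free entry win a prize call now",
]

# max_features=3000 caps the vocabulary at the 3,000 most frequent terms;
# this toy corpus has far fewer distinct words than that
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(corpus)

print(X.shape)  # (number of documents, vocabulary size)
```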
| Algorithm | Accuracy | Precision | Notes |
|---|---|---|---|
| Multinomial Naive Bayes | 97.20% | 100% | Best - Selected for deployment |
| Extra Trees Classifier | 97.68% | 99.15% | High performance but more complex |
| Random Forest | 97.49% | 98.28% | Excellent but slower |
| SVC (Sigmoid Kernel) | 97.29% | 97.41% | Good but slower training |
| XGBoost | 97.00% | 94.21% | Good but overkill for this problem |
| AdaBoost | 97.20% | 95.04% | Comparable to MNB |
| Logistic Regression | 96.13% | 97.12% | Simple and effective |
| Gradient Boosting | 94.87% | 92.93% | Slower with marginal gains |
| Bagging Classifier | 96.81% | 86.15% | Lower precision |
| Decision Tree | 94.39% | 83.81% | Overfitting risk |
| K-Nearest Neighbors | 92.84% | 77.12% | Poor performance on text |
- 100% Precision: No false positives (crucial - don't want to block legitimate messages)
- 97.20% Accuracy: Excellent overall performance
- Fast Training & Inference: Real-time predictions
- Low Memory Footprint: Small model size (model.pkl is only 96KB)
- Proven for Text Classification: Industry standard for spam detection
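The training step itself is short. A minimal sketch on toy data, assuming the same TF-IDF → Multinomial Naive Bayes pairing used in the notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the 5,169 preprocessed messages
texts = [
    "free prize winner call now",
    "win cash prize claim now",
    "free entry call to win",
    "are we meeting for lunch",
    "see you at home tonight",
    "can you pick up the milk",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(tfidf.transform(["free prize call now"]))[0])  # 1 (spam)
```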
```
                 Predicted
                 Ham    Spam
Actual   Ham     896       0   ← perfect: no legitimate messages marked as spam
         Spam     29     109   ← 79% spam detection rate (recall)
```
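The headline numbers follow directly from that matrix:

```python
# Cells from the confusion matrix above
tn, fp, fn, tp = 896, 0, 29, 109

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 1005 / 1034 ≈ 0.972
precision = tp / (tp + fp)                   # 109 / 109 = 1.0
recall = tp / (tp + fn)                      # 109 / 138 ≈ 0.79

print(round(accuracy, 4), precision, round(recall, 2))
```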
- scikit-learn: Model training, evaluation, and vectorization
- pandas: Data manipulation and analysis
- numpy: Numerical computations
- nltk: Natural language processing (tokenization, stemming, stopwords)
- XGBoost: Gradient boosting framework (tested)
- matplotlib: Plotting and visualizations
- seaborn: Statistical data visualization
- wordcloud: Word cloud generation for EDA
- Streamlit: Interactive web interface for model deployment
- Heroku: Cloud platform for application hosting
- pickle: Model serialization
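The hand-off to deployment is just two pickle files: the fitted vectorizer and the fitted model. A minimal round-trip sketch on toy data (file names match the ones in the project tree):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Fit on toy data; the project fits on the full preprocessed dataset
texts = ["free prize call now", "lunch at noon today"]
tfidf = TfidfVectorizer()
model = MultinomialNB().fit(tfidf.fit_transform(texts), [1, 0])

# Serialize the fitted objects, as the notebook does before deployment
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf, f)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# app.py reloads them at startup and predicts on user input
with open("vectorizer.pkl", "rb") as f:
    tfidf = pickle.load(f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

print(model.predict(tfidf.transform(["free prize call now"]))[0])  # 1 (spam)
```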
Contents of `requirements.txt`:

```
streamlit
nltk
scikit-learn
```

Additional packages used in the notebook:
- pandas
- numpy
- matplotlib
- seaborn
- wordcloud
- xgboost
This project demonstrates:
- End-to-End ML Pipeline: From raw data to deployed application
- Text Classification: NLP techniques applied to a real-world problem
- Model Selection: Comparing 10+ algorithms and choosing the optimal one
- Handling Imbalanced Data: Strategies for skewed class distributions
- Feature Engineering: TF-IDF vectorization for text data
- Web Deployment: Creating interactive ML applications with Streamlit
- Cloud Deployment: Hosting on Heroku with proper configuration