- This project compares Logistic Regression and a Support Vector Machine (SVM) for classifying reviews in the IMDB dataset as "positive" or "negative".
- The models are evaluated with the following metrics:
  - Accuracy
  - Precision
  - Recall
  - F1-score
  - Average Precision Score
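As a rough illustration of how these metrics relate to one another, the sketch below computes them in plain Python for a tiny made-up example. The function names, the toy labels, and the toy scores are assumptions for illustration only; they are not code or data from this project, which may well compute the metrics with a library instead.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k where a positive example appears.

    Simplified version that assumes no two examples share the same score.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy data (made up): true labels, hard predictions, and model scores
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]

metrics = classification_metrics(y_true, y_pred)
ap = average_precision(y_true, scores)
print(metrics, ap)
```

Accuracy, precision, recall, and F1 depend only on the hard predictions, while average precision depends on how the model ranks the reviews by score.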
- Source: IMDB Sentiment Dataset (50,000 labeled movie reviews)
- Each review is labeled "positive" or "negative" (binary classification)
- Dataset format: CSV file with two columns:
  - 'review' -> The text of the movie review
  - 'sentiment' -> Either "positive" or "negative"
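A minimal sketch of reading a file in this format with Python's standard `csv` module and mapping the sentiment strings to binary labels. The in-memory sample rows, the `load_reviews` helper, and the 1/0 label encoding are assumptions for illustration; the project's own loading code and CSV file name are not shown in this README.

```python
import csv
import io

# Assumed encoding of the two classes as integers
LABELS = {"positive": 1, "negative": 0}


def load_reviews(handle):
    """Read rows with 'review' and 'sentiment' columns; return texts and 0/1 labels."""
    texts, labels = [], []
    for row in csv.DictReader(handle):
        texts.append(row["review"])
        labels.append(LABELS[row["sentiment"].strip().lower()])
    return texts, labels


# Made-up stand-in for the real CSV file
sample = io.StringIO(
    "review,sentiment\n"
    '"A moving, well-acted film.",positive\n'
    '"Dull plot and flat characters.",negative\n'
)

texts, labels = load_reviews(sample)
print(labels)
```

For a file on disk, the same helper could be called with a handle from `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV is stored.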
Follow these steps to set up the project locally:
git clone https://github.com/Harshit-collab104/IMDB_Sentiment-Analysis
cd <your-repo-folder>
Replace <your-repo-folder> with the path to the folder where you cloned the repository.
# On Windows
python -m venv venv
venv\Scripts\activate
# On macOS/Linux
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
python main.py
- Data preprocessing:
  - Removed HTML tags, punctuation, and stop words
  - Converted text to lowercase
  - Applied TF-IDF vectorization
- Models used:
  - Logistic Regression
  - Linear SVM
- Evaluation metrics:
  - Accuracy
  - Precision
  - Recall
  - F1-score
  - Average Precision Score
- Visualization:
  - Confusion matrices for both models
  - Bar chart comparing performance metrics
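The preprocessing steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's code: the tiny stop-word list, the `clean_review` and `tfidf` helpers, and the sample reviews are all hypothetical, and the TF-IDF shown is the basic textbook form, whereas library vectorizers typically apply smoothed and normalized variants.

```python
import html
import math
import re
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "it", "this", "of", "to"}

TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text):
    """Strip HTML tags, lowercase, drop punctuation and stop words; return tokens."""
    text = TAG_RE.sub(" ", text)        # remove tags such as <br />
    text = html.unescape(text)          # decode entities such as &amp;
    text = text.lower()
    text = text.translate(PUNCT_TABLE)  # replace punctuation with spaces
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def tfidf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up sample reviews
reviews = [
    "This movie was <br /> GREAT, truly great!",
    "The plot was dull &amp; the acting was weak.",
]
tokens = [clean_review(r) for r in reviews]
vectors = tfidf(tokens)
print(tokens[0])  # ['movie', 'great', 'truly', 'great']
```

Each review becomes a sparse mapping from term to weight; terms that occur in every document get a weight of zero under this basic formula, which is one reason practical implementations smooth the inverse document frequency.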
Across all evaluation metrics, Logistic Regression outperformed Linear SVM on the IMDB sentiment dataset.
The table below shows the performance of both models on the IMDB sentiment dataset:
| Model | Accuracy | Precision | Recall | F1-score | Average Precision |
|---|---|---|---|---|---|
| Logistic Regression | 0.8953 | 0.8834 | 0.9127 | 0.8978 | 0.9602 |
| Linear SVM | 0.8865 | 0.8824 | 0.8938 | 0.8881 | 0.9543 |
- Precision difference (Logistic Regression vs. Linear SVM): 0.10 percentage points (0.8834 vs. 0.8824)
- Confusion Matrix (Logistic Regression & SVM)
- Training vs Testing Accuracy
- Metrics Comparison (Accuracy, Precision, Recall, F1, Average Precision)
- Both Logistic Regression and SVM are effective for text and sentiment classification
- TF-IDF features and linear models work well for high-dimensional text classification tasks
- Logistic Regression outperforms Linear SVM on every evaluation metric (Accuracy, Precision, Recall, F1-score, and Average Precision), although the margins are small: about 0.1 percentage points in precision and at most about 1.9 points, in recall. This suggests that Logistic Regression captures the largely linear separability of the features somewhat more effectively on this dataset, giving slightly more reliable predictions when distinguishing positive from negative reviews.
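The gaps between the two models can be checked directly from the results table. The short sketch below copies the reported scores, computes each difference in percentage points, and confirms that the reported F1-scores are consistent with the reported precision and recall. The variable and function names are assumptions for illustration.

```python
# Scores copied from the results table
METRICS = ["Accuracy", "Precision", "Recall", "F1-score", "Average Precision"]
LOGREG = [0.8953, 0.8834, 0.9127, 0.8978, 0.9602]
SVM = [0.8865, 0.8824, 0.8938, 0.8881, 0.9543]

# Difference in percentage points (Logistic Regression minus Linear SVM)
gaps = {
    name: round((lr - svm) * 100, 2)
    for name, lr, svm in zip(METRICS, LOGREG, SVM)
}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, gap in gaps.items():
    print(f"{name}: +{gap:.2f} pp")

# Recomputed F1 values, rounded to four decimals like the table
f1_logreg = round(f1_from(0.8834, 0.9127), 4)  # 0.8978
f1_svm = round(f1_from(0.8824, 0.8938), 4)     # 0.8881
```

The precision gap is the smallest of the five (0.10 points) and the recall gap the largest (1.89 points), so most of Logistic Regression's advantage in F1 comes from recall.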