Harshit-collab104/IMDB_Sentiment-Analysis

1. IMDB Sentiment Classification

  • This project compares Logistic Regression and a Support Vector Machine (SVM) for classifying IMDB movie reviews as "positive" or "negative"

  • The models are evaluated with the following metrics:

    1. Accuracy
    2. Precision
    3. Recall
    4. F1-score
    5. Average Precision Score
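
As a reference for how these five metrics are defined, here is a minimal pure-Python sketch. The function names and the toy labels are illustrative assumptions, not code from this repository, and the average-precision helper assumes there are no tied scores.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def basic_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a positive example appears."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos if n_pos else 0.0


# Toy example (made-up labels and scores, 1 = positive review).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]

print(basic_metrics(y_true, y_pred))
print(average_precision(y_true, scores))
```

Unlike the other four, average precision is computed from ranking scores (for example probabilities or decision-function values) rather than hard predictions.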

2. Dataset

  • Source: IMDB Sentiment Dataset (50,000 labeled movie reviews)
  • Each review is classified as "Positive" or "Negative" (Binary Classification)
  • Data set format: CSV file with two columns:
    1. 'review' -> The text containing the movie reviews
    2. 'sentiment' -> Either "positive" or "negative"
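
A minimal sketch of reading a file in this two-column layout, using only the standard library. The sample rows, the `load_reviews` helper, and the 1/0 label encoding are assumptions made for illustration; the real CSV file name is not stated in this README, so an in-memory stand-in is used.

```python
import csv
import io

# Stand-in for the real CSV file (made-up rows in the documented layout).
SAMPLE_CSV = '''review,sentiment
"A wonderful, moving film.",positive
"Dull plot and wooden acting.",negative
'''

# Assumed encoding: 1 for positive, 0 for negative.
LABEL_TO_INT = {"positive": 1, "negative": 0}


def load_reviews(handle):
    """Read 'review' and 'sentiment' columns into parallel lists."""
    texts, labels = [], []
    for row in csv.DictReader(handle):
        texts.append(row["review"])
        labels.append(LABEL_TO_INT[row["sentiment"].strip().lower()])
    return texts, labels


texts, labels = load_reviews(io.StringIO(SAMPLE_CSV))
print(len(texts), labels)
```

To read from disk instead, pass an open file object, for example `open(path, newline="", encoding="utf-8")`, where `path` points at your copy of the dataset.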

3. Installation and Setup

Follow these steps to set up the project locally:

1. Clone the repository

git clone https://github.com/Harshit-collab104/IMDB_Sentiment-Analysis
cd <your-repo-folder>
Replace <your-repo-folder> with the path to the folder where you cloned the repository.

2. Create a virtual environment

# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate

3. Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

4. Run the main file

python main.py

4. Approach Used

  1. Data preprocessing:

    • Removed HTML tags, punctuation, and stop words
    • Converted text to lowercase
    • Applied TF-IDF vectorization
  2. Models used:

    • Logistic Regression
    • Linear SVM
  3. Evaluation Metrics:

    • Accuracy
    • Precision
    • Recall
    • F1-score
    • Average Precision score
  4. Visualization:

    • Confusion Matrices for both the models
    • Bar chart comparing performance metrics
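
The preprocessing and modelling steps above can be sketched roughly as follows. This assumes scikit-learn is among the dependencies; `clean_review`, the built-in English stop-word list, the default hyperparameters, and the toy reviews are all illustrative assumptions rather than the contents of main.py.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

TAG_RE = re.compile(r"<[^>]+>")
# Map every punctuation character to a space so words are not glued together.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text):
    """Strip HTML tags and punctuation, lowercase, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(text.split())


def build_models():
    """One TF-IDF + linear classifier pipeline per model being compared."""
    def vectorizer():
        # Stop words are dropped here; the exact list the project uses is unknown.
        return TfidfVectorizer(preprocessor=clean_review, stop_words="english")

    return {
        "Logistic Regression": make_pipeline(
            vectorizer(), LogisticRegression(max_iter=1000)),
        "Linear SVM": make_pipeline(vectorizer(), LinearSVC()),
    }


# Tiny made-up training set (1 = positive, 0 = negative).
train_texts = [
    "<br />A great film with wonderful acting!",
    "Wonderful story, great cast.",
    "I loved this great movie.",
    "Great fun, wonderful music.",
    "A terrible film with boring acting.",
    "Boring story, terrible cast.",
    "I hated this boring movie.",
    "Terrible pacing, boring music <br />",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["<b>Wonderful</b> and great!", "Boring and terrible."]

models = build_models()
for name, model in models.items():
    model.fit(train_texts, train_labels)
    print(name, model.predict(test_texts))
```

Wrapping the vectorizer and classifier in one pipeline keeps the TF-IDF vocabulary fitted on training data only, which avoids leaking test-set statistics into the features.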

5. Results

Across all evaluation metrics, Logistic Regression outperformed Linear SVM on the IMDB sentiment dataset.

The table below shows the performance of both models on the IMDB sentiment dataset:

| Model | Accuracy | Precision | Recall | F1-score | Average Precision |
|---|---|---|---|---|---|
| Logistic Regression | 0.8953 | 0.8834 | 0.9127 | 0.8978 | 0.9602 |
| Linear SVM | 0.8865 | 0.8824 | 0.8938 | 0.8881 | 0.9543 |
  • Precision difference: 0.10 percentage points (0.8834 vs. 0.8824)
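
For reference, the gap between the two models on every metric can be derived from the reported scores with a few lines of arithmetic. The dictionary layout and variable names here are only an illustrative sketch; the numbers are copied from the results above.

```python
# Scores copied from the results reported in this README.
results = {
    "Logistic Regression": {
        "Accuracy": 0.8953, "Precision": 0.8834, "Recall": 0.9127,
        "F1-score": 0.8978, "Average Precision": 0.9602,
    },
    "Linear SVM": {
        "Accuracy": 0.8865, "Precision": 0.8824, "Recall": 0.8938,
        "F1-score": 0.8881, "Average Precision": 0.9543,
    },
}

# Gap in percentage points: (Logistic Regression - Linear SVM) * 100.
gaps = {
    metric: round((lr_value - results["Linear SVM"][metric]) * 100, 2)
    for metric, lr_value in results["Logistic Regression"].items()
}

for metric, gap in gaps.items():
    print(f"{metric}: {gap:+.2f} percentage points")
```

Precision shows the smallest gap of the five metrics and recall the largest.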

6. Visualizations

  • Confusion Matrix (Logistic Regression & SVM)
  • Training vs Testing Accuracy
  • Metrics Comparison (Accuracy, Precision, Recall, F1, Average Precision)
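
A metrics comparison chart of this kind could be drawn along the following lines. This is a sketch that assumes matplotlib is installed; the function name, figure size, axis range, and output file name are illustrative choices, not taken from the project's plotting code. The scores are the ones reported in the Results section.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

METRICS = ["Accuracy", "Precision", "Recall", "F1-score", "Average Precision"]
SCORES = {
    "Logistic Regression": [0.8953, 0.8834, 0.9127, 0.8978, 0.9602],
    "Linear SVM": [0.8865, 0.8824, 0.8938, 0.8881, 0.9543],
}


def plot_metric_comparison(path=None):
    """Draw a grouped bar chart with one pair of bars per metric."""
    fig, ax = plt.subplots(figsize=(8, 4))
    width = 0.38
    positions = list(range(len(METRICS)))
    for offset, (model, values) in enumerate(SCORES.items()):
        # Shift each model's bars left or right of the tick position.
        xs = [p + (offset - 0.5) * width for p in positions]
        ax.bar(xs, values, width=width, label=model)
    ax.set_xticks(positions)
    ax.set_xticklabels(METRICS)
    ax.set_ylim(0.85, 1.0)  # zoom in, since all scores lie in a narrow band
    ax.set_ylabel("Score")
    ax.set_title("Logistic Regression vs. Linear SVM")
    ax.legend()
    fig.tight_layout()
    if path is not None:
        fig.savefig(path)
    return ax


ax = plot_metric_comparison()
```

Pass a file name such as `plot_metric_comparison("metrics_comparison.png")` to write the chart to disk.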

7. Conclusion

  • Both Logistic Regression and SVM are effective for text sentiment classification
  • TF-IDF features combined with linear models work well for high-dimensional text classification tasks
  • Logistic Regression outperforms Linear SVM on all five evaluation metrics (Accuracy, Precision, Recall, F1-score, and Average Precision), although the margin in precision is only about 0.1 percentage points. This suggests that Logistic Regression effectively captures the linear separability of features in the dataset, providing more reliable predictions for distinguishing positive and negative reviews.

About

Binary sentiment analysis on the IMDB dataset using Logistic Regression and SVM.
