SANAD Text Classification

SANAD (Single-label Arabic News Articles Dataset) is a dataset for automatic text categorization.
This repository implements an NLP pipeline for classifying SANAD articles.

Key Actions

  • Consolidated and organized data from multiple directories using pandas and os to streamline preprocessing and analysis workflows.
  • Cleaned and preprocessed a large Arabic text dataset using regex (re) and NLTK, including stopword removal, text normalization, and missing value handling.
  • Performed exploratory text analysis by computing features such as word count, character count, average characters per word, and stopword frequency.
  • Engineered statistical text features such as TF-IDF with scikit-learn to improve the input representation for downstream machine learning models.
  • Trained and evaluated several machine learning models using scikit-learn and Keras, including Logistic Regression (94% accuracy), Naive Bayes (92.4% accuracy), and Random Forest (89.5% accuracy).
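
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper names `normalize` and `text_stats`, the five-word stopword set (a stand-in for NLTK's Arabic stopword list), the specific normalization rules, and the four-sentence toy corpus are all assumptions made for the example. Only the overall flow (regex cleaning, simple text statistics, TF-IDF, Logistic Regression) comes from the list above.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative stopword set, written in normalized form (assumption).
# A real run would more likely load NLTK's Arabic stopword list instead.
STOPWORDS = {"في", "من", "على", "الى", "عن"}

DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # harakat and tatweel
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")     # anything but Arabic letters/whitespace


def normalize(text: str) -> str:
    """Strip diacritics, unify alef variants, drop non-Arabic characters and stopwords."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[إأآ]", "ا", text)
    text = NON_ARABIC.sub(" ", text)
    return " ".join(t for t in text.split() if t not in STOPWORDS)


def text_stats(text: str) -> dict:
    """Exploratory features for one document (characters counted without spaces)."""
    words = text.split()
    n_chars = sum(len(w) for w in words)
    return {
        "word_count": len(words),
        "char_count": n_chars,
        "avg_chars_per_word": n_chars / len(words) if words else 0.0,
        "stopword_count": sum(w in STOPWORDS for w in words),
    }


# Toy corpus (hypothetical sentences, not taken from SANAD).
docs = [
    "فاز الفريق في المباراة",
    "سجل اللاعب هدفا في المباراة",
    "ارتفعت أسعار الأسهم في البورصة",
    "انخفضت أرباح الشركة في البورصة",
]
labels = ["Sports", "Sports", "Finance", "Finance"]

# TF-IDF features feeding a Logistic Regression classifier.
# `normalize` is plugged in as the vectorizer's preprocessor so that
# training and prediction share the same cleaning step.
model = make_pipeline(
    TfidfVectorizer(preprocessor=normalize),
    LogisticRegression(max_iter=1000),
)
model.fit(docs, labels)

print(text_stats(docs[0]))
print(model.predict(["خسر الفريق المباراة"]))
```

Wrapping the vectorizer and classifier in a single scikit-learn pipeline keeps the cleaning step attached to the model, so new articles are normalized the same way at prediction time. Swapping `LogisticRegression` for `MultinomialNB` or `RandomForestClassifier` would give comparable sketches for the other two models mentioned.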