SANAD Text Classification

SANAD (Single-label Arabic News Articles Dataset) is a corpus for automatic text categorization.
This project implements an NLP pipeline for classifying its articles.

Key Actions

Consolidated and organized data from multiple directories using pandas and os to streamline preprocessing and analysis workflows.
Cleaned and preprocessed a large Arabic text dataset using regex (re) and NLTK, including stopword removal, text normalization, and missing value handling.
Performed exploratory text analysis by computing features such as word count, character count, average characters per word, and stopword frequency.
Engineered statistical text features such as TF-IDF with scikit-learn to improve the input representation for downstream machine learning models.
Trained and evaluated several machine learning models using scikit-learn and Keras, including Logistic Regression (94% accuracy), Naive Bayes (92.4% accuracy), and Random Forest (89.5% accuracy).
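
The cleaning and exploratory-feature steps listed above could look roughly like the following sketch. It uses only the Python standard library so it runs without downloads. The normalization rules (stripping diacritics and tatweel, unifying alef variants, mapping alef maqsura to ya) are common Arabic preprocessing conventions and are an assumption here, not necessarily the rules used in this project. The five-word stopword set and the function names `normalize` and `text_features` are hypothetical placeholders; the project itself uses NLTK for stopwords.

```python
import re
import unicodedata

# Assumed normalization rules (common conventions, not confirmed by the project).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # tashkeel marks and tatweel
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")       # digits, Latin, punctuation


def normalize(text):
    """Return text reduced to normalized Arabic letters and single spaces."""
    text = unicodedata.normalize("NFKC", text)  # compose characters first
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)    # all alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")     # alef maqsura -> ya
    text = NON_ARABIC.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical tiny stopword set for illustration; a real run would load a
# full list (for example from NLTK). Stopwords are normalized the same way
# as the text so that lookups match.
STOPWORDS = {normalize(w) for w in ("في", "من", "على", "إلى", "عن")}


def text_features(text, stopwords=STOPWORDS):
    """Compute simple exploratory features on the normalized text."""
    tokens = normalize(text).split()
    word_count = len(tokens)
    char_count = sum(len(t) for t in tokens)  # characters excluding spaces
    return {
        "word_count": word_count,
        "char_count": char_count,
        "avg_chars_per_word": char_count / word_count if word_count else 0.0,
        "stopword_count": sum(t in stopwords for t in tokens),
        # Text with stopwords removed, ready for a vectorizer such as TF-IDF.
        "clean_text": " ".join(t for t in tokens if t not in stopwords),
    }
```

In a setup like this, `text_features` would be applied to each article and the resulting `clean_text` field passed on to the TF-IDF step. Note that the counts here are computed after normalization and before stopword removal, which is one of several reasonable choices.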