giordanoscerra/ProjectAthena

Project Athena


Overview

Project Athena is a machine learning-based NLP project developed for the Human Language Technologies (HLT) course, academic year 2023-2024.
Its goal is to classify philosophical currents from textual data, inferring the school of thought behind a given sentence.

The project explores:

  • Dataset analysis and contrastive comparisons with Gutenberg, Brown, and Simple English Wikipedia corpora
  • Multiple modeling approaches: Naive Bayes, RNNs, BERT, DistilBERT, and Zero-shot bart-large-mnli
  • Evaluation via macro-averaged F1 score, addressing dataset imbalance
  • A hypothesis on the impact of short vs. long sentences in classification accuracy
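To illustrate why macro-averaged F1 suits an imbalanced dataset, here is a minimal sketch in plain Python. The function name `macro_f1` and the toy labels are illustrative assumptions, not code or data from the project. It shows that a classifier which always predicts the majority class can reach high accuracy while scoring poorly on macro F1, because every class counts equally in the average.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up toy labels: 8 sentences of one school, 2 of another.
y_true = ["analytic"] * 8 + ["stoicism"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["analytic"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}, macro F1 = {score:.2f}")
```

Here the minority class contributes an F1 of zero, which pulls the macro average well below the accuracy.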

Full report: HLT_Project.pdf


Dataset

We used the History of Philosophy dataset curated by Kourosh Alizadeh.

  • Contains 360,808 sentences
  • Drawn from 59 texts by 36 authors
  • Covers 13 philosophical schools of thought

Note: The dataset is highly imbalanced, with some schools being over-represented.
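One simple way to quantify such imbalance is to compute the share of sentences per school. The sketch below uses made-up labels and a hypothetical helper `class_distribution`; it does not reproduce the actual class proportions of the dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return each label's share of the data, most frequent first (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Made-up toy labels, not the real dataset proportions.
labels = ["analytic"] * 6 + ["plato"] * 3 + ["stoicism"] * 1
dist = class_distribution(labels)

# Ratio between the largest and smallest class as a rough imbalance indicator.
imbalance_ratio = max(dist.values()) / min(dist.values())
for label, share in dist.items():
    print(f"{label}: {share:.0%}")
```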


Methods

We experimented with three families of models:

  • Generative models → Naive Bayes (TF-IDF, Laplace smoothing tuning)
  • Discriminative models → Recurrent Neural Networks (LSTMs, GRUs, BiLSTMs with GloVe embeddings)
  • Transformer-based models → BERT, DistilBERT, Zero-shot bart-large-mnli
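As a rough illustration of the Naive Bayes baseline and of what the Laplace smoothing parameter controls, here is a minimal multinomial Naive Bayes sketch in plain Python. It is not the project's implementation: it uses raw word counts instead of TF-IDF for brevity, and the class name `TinyNaiveBayes`, the whitespace tokenizer, and the toy sentences are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes on raw word counts with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        # alpha must be > 0; alpha = 1.0 is classic Laplace smoothing.
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_label / n_docs)  # log prior
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                numerator = self.word_counts[label][word] + self.alpha
                denominator = total_words + self.alpha * len(self.vocab)
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy training sentences, not taken from the dataset.
texts = [
    "the forms are eternal and unchanging",
    "virtue alone is sufficient for happiness",
    "the soul recollects the forms",
    "accept what is not in our control",
]
labels = ["plato", "stoicism", "plato", "stoicism"]

model = TinyNaiveBayes(alpha=1.0).fit(texts, labels)
print(model.predict("the eternal forms"))
print(model.predict("what is in our control"))
```

Tuning the smoothing amounts to trying several values of `alpha` and keeping the one with the best validation score.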

Hypothesis: very short sentences (<15 words) may act like "semantic stopwords" and hinder classification.
We tested this by training and evaluating on both the full dataset and a reduced dataset (sentences ≥ 84 characters).
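The reduced dataset can be sketched as a simple length filter. The 84-character threshold comes from the text above; the function name `keep_long_sentences` and the two example sentences are hypothetical.

```python
MIN_CHARS = 84  # threshold for the reduced dataset, as stated above


def keep_long_sentences(sentences, labels, min_chars=MIN_CHARS):
    """Keep only (sentence, label) pairs whose sentence has at least min_chars characters."""
    return [(s, y) for s, y in zip(sentences, labels) if len(s) >= min_chars]


# Made-up examples: one short sentence and one long sentence.
short = "Know thyself."
long = (
    "The unexamined life is not worth living, and the examined life "
    "demands that we question every belief."
)
sentences = [short, long]
labels = ["plato", "plato"]

reduced = keep_long_sentences(sentences, labels)
print(f"kept {len(reduced)} of {len(sentences)} sentences")
```

Training and evaluating once on the full list and once on `reduced` mirrors the comparison described above.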


Results

  • BERT and DistilBERT outperformed all other approaches
  • Naive Bayes was a surprisingly strong baseline at low computational cost
  • Removing short sentences hurt performance, which suggests that they contribute useful signal
  • Best model: BERT (fine-tuned on full dataset) → F1 ≈ 0.83 (full test set), 0.88 (long sentences only)

Authors

  • Davide Borghini
  • Davide Marchi
  • Giordano Scerra
  • Andrea Marino
  • Yuri Ermes Negri

About

Machine learning-based NLP model for classifying philosophical currents from text. A NetRiders' production.
