This project focuses on Native Language Identification and Author Profiling from English text. Its main method is raw transformer-based encoders, however in some examples it leverages Convolutional Neural Networks (CNNs) alongside encoder-based models in a hybrid architecture to capture both local patterns and global contextual features. The system is designed to analyze linguistic traits and writing style, enabling accurate prediction of an author’s native language, age, gender and other traits. It serves as a approachable framework for training, and later, experimenting with fined tuned models on set classification tasks.
.
├── App/ # How to run it yourself in a Nutshell
│ └── ...
├── Assets/ # Images, Screenshots, Scripts used to plot data
│ └── ...
├── Testing/ # Evaluation scripts, notebooks, and test runs of models
│ └── ...
├── Training/ # Everything related to model training
│ └── ...
└── README.md
App/
├── ModelWrapper/ # Classes for easy loading and running different models
└── RunModels/ # Notebooks that allow you to play around with model predictions
- Model Wrappers allows for quick loading of models from the project folders.
- Models have to be firstly trained by running their corresponding training script in their folder.
Assets/
├── DataToPlot/ # Scripts with folders to plot data, originally jsons copied from logging training folders
├── Papers/ # Papers used as references when working on the project
├── Plots # Plots for README
└── Screenshots/ # Screenshots for README
- Here we hold various resources needed for presentation and creation of the project
Testing/
├── comparisons/ # Results of testing ReadyToDeploy HuggingFace models
├── metrics/ # Notebook where you can test Accuracy and F1 of different models
├── runs-legacy/ # Results from previous runs (before finding out optimal ways to do that) per task
└── survey/ # Notebook where you can test models on the data from survey conducted by us during project
- SHAP scripts allow per-model testing of actual performance
- SHAP highlights what part of input influenced the output the most which allows diagnostics
- Each subfolder under
runs-legacy/corresponds to a task likeage,gender,language, on other models runs-legacy/noteooks are LEGACY and WILL NOT WORK, they function as a kind of screenshot docummentation on project worksurvey/contains notebook with data from our collection survey, you can load the models you want and see how they work
Training/
├── DATA/ # Raw and preprocessed datasets
└── MODELS/ # Trained model checkpoints organized by task
- Models are organized by task in
Training/MODELS. - Different variants are trained for different tasks.
- Each folder contains a
train.pyscript responsible for training its model. - Data is collected and preprocessed in
Training/DATA.
MODELS/
├── age/
│ ├── DistilBERTRegressionBaseline/ # Only final regression layer trained
│ ├── DistilBERTRegressionFull/ # Fully fine-tuned
│ └── DistilBERTRegressionLoRa/ # LoRA (efficient low-rank adaptation)
├── gender/
│ ├── DistilBERTClassificationFull/ # Fully fine-tuned
│ ├── DistilBERTRegressionBaseline/ # Only final regression layer trained
│ ├── DistilBERTRegressionFull/ # Fully fine-tuned
│ ├── DistilBERTRegressionLoRA/ # LoRA adaptation
│ └── RoBERTaRegressionFull/ # Fully fine-tuned
├── language/
│ ├── CNN/ # Raw CNN on RoBERTa tokenizer embeddings
│ ├── DeBERTaClassificationLoRA/ # LoRA adaptation
│ ├── DistilBERTClassificationBaseline/ # Only final classification layer trained
│ ├── DistilBERTClassificationFull/ # Fully fine-tuned
│ ├── DistilBERTClassificationFullCNNHead/ # Fully fine-tuned standard model with CNN on last layer
│ ├── DistilBERTClassificationLoRA/ # LoRA adaptation
│ ├── MpNetClassificationFull/ # Fully fine-tuned
│ ├── RoBERTaClassificationFull/ # Fully fine-tuned
│ ├── RoBERTaClassificationFullCNNHead/ # Fully fine-tuned standard model with CNN on last layer
│ ├── RoBERTaLargeClassificationFull/ # Fully fine-tuned
│ ├── RoBERTaLargeMixedCNN/ # Parallel running CNN + RoBERTaLarge fully trained, head trained on combined output
│ └── RoBERTaMixedCNN/ # Parallel running CNN + RoBERTa fully trained, head trained on combined output
├── mbti/
│ ├── DistilBERTClassificationBaseline/ # Only final classification layer trained
│ ├── DistilBERTClassificationFull/ # Fully fine-tuned
│ └── DistilBERTClassificationLoRA/ # LoRA adaptation
└── political/
├── DistilBERTRegressionBaseline/ # Only final regression layer trained
├── DistilBERTRegressionFull/ # Fully fine-tuned
├── DistilBERTRegressionFullLongText/ # Unused (results very random because of cutting long articles)
├── DistilBERTRegressionFullRawLogits/ # Classifier head outputs without softmax
└── DistilBERTRegressionLoRA/ # LoRA adaptation
Organized by prediction task:
- Age: Regression models using DistilBERT variants
- Gender: Classification and regression model
- Language: Various classification methods including CNNs, Transformers (DistilBERT, RoBERTa, DeBERTa, MpNet) and mixed ( CNN + RoBERTa ...)
- MBTI: Classification models using DistilBERT
- Political Orientation: Regression models using DistilBERT
DATA/
├── age/ # parquets
│ └── ...
├── gender/ # parquets
│ └── ...
├── language/ # parquets + masking script + language maps
│ └── ...
├── mbti/ # parquets + label map
│ └── ...
├── political/ # parquets
│ └── ...
├── checkTokenDistribution.py # See if dataset is balanced in token lenght
├── deparquetize.py # Convert parquet datasets to other formats
├── parquetize.py # Convert CSV/text datasets to parquet
└── README.md # Detailed description of datasets and work regarding them
- Stored in
Training/DATA/...in parquet format. - Preprocessing scripts in
Training/handle tokenization, masking, and format conversion. - Tasks are as follows:
- Age prediction (regression based)
- Gender prediction (regression based)
- Language classification
- MBTI type classification
- Political orientation prediction (regression based)
-
App/folder contains the scripts neccesary to train and run recommended models when cloning this repository. -
This code allows users to play around with our project without diving to deep in the technicals.
-
Overall system requirements aside from a decent GPU are as follows:
Python >= 3.10
PyTorch (torch) with CUDA support
CUDA toolkit
Transformers (Hugging Face)
Pandas
NumPy
scikit-learn
spaCy
SHAP
tqdm
Matplotlib (for plotting)
Jupyter Notebook (optional, for notebooks)
- Here we present a visual summary of the model evaluation across different tasks. Each plot shows performance metrics (e.g., accuracy, Root Mean Squared Error) for the corresponding task.
| Language Classification |
|---|
![]() |
| Gender Prediction | Age Prediction |
|---|---|
![]() |
![]() |
| Political Orientation | MBTI Classification |
|---|---|
![]() |
![]() |
- Here we show some screenshots of the working prediction models:
| Screenshots (PLACEHOLDERS FOR NOW) |
|---|
![]() |
| ------------------------ |
![]() |
| ------------------------ |
![]() |
| ------------------------ |
![]() |
Maybe in the future








