ML MBTI Predictive Analyzer

An experiment to determine the user's MBTI personality type from provided text.

Which models were used?

All models were provided by the scikit-learn library:

RandomForestClassifier
HistGradientBoostingClassifier
LogisticRegression

What was the process?

1. EDA

First, the dataset was imported and analyzed using Exploratory Data Analysis (EDA) techniques. The dataset was examined for class imbalance and null entries, and the character count of each post row was measured.
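The checks above could look roughly like the following. This is a minimal sketch on a toy DataFrame, not the project's notebook code; the column names `type` and `posts` are assumptions.

```python
import pandas as pd

# Tiny stand-in for the real dataset. The column names `type` and `posts`
# are assumptions, not taken from the project.
df = pd.DataFrame({
    "type": ["INFP", "INFP", "INTJ", "ESFP", None],
    "posts": ["hello world", "I like cats", None, "party time", "orphan row"],
})

# Class balance: number of rows per MBTI type.
class_counts = df["type"].value_counts()

# Null entries per column.
null_counts = df.isnull().sum()

# Character count of each post row (missing posts stay missing).
df["post_length"] = df["posts"].str.len()

print(class_counts)
print(null_counts)
print(df["post_length"].describe())
```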

2. Preprocessing

Second, the data was preprocessed. The preprocessing steps included dropping null and duplicate entries, splitting entries and creating chunks to provide more samples to minority classes, and cleaning the post text by removing links and miscellaneous text.
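A rough sketch of these preprocessing steps is shown below. It assumes each row stores one user's posts joined by a `|||` separator and that every user is chunked the same way; the helper names `clean_text` and `chunk_posts`, the `user_id` column, and the chunk count are all hypothetical.

```python
import re
import pandas as pd

URL_PATTERN = re.compile(r"https?://\S+")

def clean_text(text: str) -> str:
    """Remove links and collapse runs of whitespace."""
    text = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk_posts(posts: list[str], n_chunks: int) -> list[str]:
    """Join a user's posts into at most n_chunks roughly equal groups."""
    size = max(1, -(-len(posts) // n_chunks))  # ceiling division
    return [" ".join(posts[i:i + size]) for i in range(0, len(posts), size)]

# Toy data; the "|||" separator between posts is an assumption.
df = pd.DataFrame({
    "type": ["INFP", "ESFP", "ESFP", None],
    "posts": ["a|||b http://x.co|||c|||d", "e|||f", "e|||f", "g"],
})

# Drop null and duplicate entries, then give each remaining user an id
# so that chunks can later be traced back to their author.
df = df.dropna().drop_duplicates().reset_index(drop=True)
df["user_id"] = df.index

rows = []
for user_id, mbti, raw in zip(df["user_id"], df["type"], df["posts"]):
    posts = [clean_text(p) for p in raw.split("|||")]
    posts = [p for p in posts if p]  # discard posts that were only a link
    for chunk in chunk_posts(posts, n_chunks=2):
        rows.append({"user_id": user_id, "type": mbti, "posts": chunk})

chunked = pd.DataFrame(rows)
print(chunked)
```

Keeping a `user_id` on every chunk is what makes a group-aware train/test split possible later on.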

3. Training and Evaluation

Following preprocessing, the models were trained and evaluated. To start off, the X (posts) and y (MBTI type) data were defined. Since the data had been chunked in the preprocessing step, a single user was now split into 5 entries, so a Group Shuffle split was necessary to ensure that all of a user's entries remain in the same set.

Following a Group Shuffle split, the text in the X DataFrame was vectorized using a TF-IDF Vectorizer. The 'X_train' split was fit-transformed, and the 'X_test' split was transformed. The y splits were encoded, with the 'y_train' split undergoing a fit-transform step, and the 'y_test' split undergoing a transform step.
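The split, vectorization, and encoding described above can be sketched as follows. The toy data, the `user_id` grouping column, and the `test_size` value are assumptions for illustration only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import LabelEncoder

# Toy chunked data: each user_id appears in several rows (assumed column names).
data = pd.DataFrame({
    "user_id": [0, 0, 1, 1, 2, 2, 3, 3],
    "type": ["INFP", "INFP", "INTJ", "INTJ", "INFP", "INFP", "INTJ", "INTJ"],
    "posts": [
        "quiet reading night", "poetry and tea",
        "plan the roadmap", "optimize the system",
        "daydream about stories", "writing in my journal",
        "strategy and chess", "build efficient tools",
    ],
})
X, y, groups = data["posts"], data["type"], data["user_id"]

# Split by user so that no user's chunks land in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# Fit the vocabulary and IDF weights on the training text only.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Fit the label mapping on the training labels only.
encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test)
```

Fitting the vectorizer and encoder on the training split alone keeps information from the test split out of the features.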

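The models were then fitted. LogisticRegression and RandomForestClassifier used class_weight='balanced' to account for the imbalanced dataset, while HistGradientBoostingClassifier was fitted without class weighting:

from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# class_weight='balanced' weights each class inversely to its frequency
lr_model = LogisticRegression(max_iter=1000, class_weight='balanced')
lr_model.fit(X_train_tfidf, y_train_encoded)

rf_model = RandomForestClassifier(n_estimators=100, min_samples_leaf=20, class_weight='balanced', random_state=42)
rf_model.fit(X_train_tfidf, y_train_encoded)

# HistGradientBoostingClassifier does not accept sparse matrices, hence .toarray()
hgb_model = HistGradientBoostingClassifier(max_iter=100, learning_rate=0.05, max_depth=4, random_state=42)
hgb_model.fit(X_train_tfidf.toarray(), y_train_encoded)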

After fitting, each model generated predictions on the test split:

y_pred_lr = lr_model.predict(X_test_tfidf)
y_pred_rf = rf_model.predict(X_test_tfidf)
y_pred_hgb = hgb_model.predict(X_test_tfidf.toarray())

Lastly, evaluations were collected from the three models.
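Reports in this format are what scikit-learn's `classification_report` produces. The snippet below is a minimal sketch of how one could be generated; the labels and predictions are made up for illustration and are not the project's data.

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Made-up true labels and predictions, for illustration only.
encoder = LabelEncoder()
y_test_encoded = encoder.fit_transform(["INFP", "INFP", "INTJ", "ESFP"])
y_pred = encoder.transform(["INFP", "INTJ", "INTJ", "INFP"])

report = classification_report(
    y_test_encoded,
    y_pred,
    labels=list(range(len(encoder.classes_))),
    target_names=list(encoder.classes_),  # show MBTI names instead of integers
    zero_division=0,  # report 0.0 for classes that were never predicted
)
print("lr:", report)
```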

First was the LogisticRegression model:

lr:              precision    recall  f1-score   support

        ENFJ       0.20      0.34      0.26       201
        ENFP       0.42      0.47      0.44       619
        ENTJ       0.23      0.45      0.30       221
        ENTP       0.43      0.38      0.40       682
        ESFJ       0.08      0.24      0.12        34
        ESFP       0.02      0.07      0.04        40
        ESTJ       0.15      0.40      0.22        35
        ESTP       0.14      0.34      0.20        70
        INFJ       0.55      0.35      0.43      1417
        INFP       0.62      0.46      0.53      1835
        INTJ       0.46      0.42      0.44       936
        INTP       0.57      0.47      0.52      1451
        ISFJ       0.33      0.47      0.39       222
        ISFP       0.22      0.37      0.27       258
        ISTJ       0.23      0.44      0.31       221
        ISTP       0.27      0.46      0.34       329

    accuracy                           0.42      8571
   macro avg       0.31      0.38      0.32      8571
weighted avg       0.48      0.42      0.44      8571

Second, the RandomForestClassifier model:

rf:              precision    recall  f1-score   support

        ENFJ       0.22      0.36      0.27       201
        ENFP       0.41      0.38      0.40       619
        ENTJ       0.28      0.47      0.35       221
        ENTP       0.45      0.41      0.43       682
        ESFJ       0.07      0.29      0.11        34
        ESFP       0.02      0.05      0.03        40
        ESTJ       0.14      0.37      0.20        35
        ESTP       0.14      0.46      0.22        70
        INFJ       0.59      0.31      0.41      1417
        INFP       0.54      0.46      0.50      1835
        INTJ       0.45      0.33      0.38       936
        INTP       0.49      0.43      0.45      1451
        ISFJ       0.32      0.53      0.40       222
        ISFP       0.24      0.41      0.30       258
        ISTJ       0.23      0.42      0.30       221
        ISTP       0.25      0.45      0.32       329

    accuracy                           0.40      8571
   macro avg       0.30      0.38      0.32      8571
weighted avg       0.46      0.40      0.41      8571

Last, the HistGradientBoostingClassifier model:

hgb:              precision    recall  f1-score   support

        ENFJ       0.38      0.15      0.22       201
        ENFP       0.46      0.41      0.43       619
        ENTJ       0.39      0.24      0.30       221
        ENTP       0.50      0.40      0.44       682
        ESFJ       0.24      0.15      0.18        34
        ESFP       0.50      0.03      0.05        40
        ESTJ       0.30      0.09      0.13        35
        ESTP       0.47      0.24      0.32        70
        INFJ       0.50      0.47      0.48      1417
        INFP       0.43      0.71      0.53      1835
        INTJ       0.48      0.43      0.45       936
        INTP       0.52      0.51      0.52      1451
        ISFJ       0.68      0.31      0.42       222
        ISFP       0.48      0.27      0.35       258
        ISTJ       0.43      0.24      0.31       221
        ISTP       0.53      0.33      0.40       329

    accuracy                           0.47      8571
   macro avg       0.46      0.31      0.35      8571
weighted avg       0.48      0.47      0.46      8571

Lessons Learned?

High-quality Datasets are as Important as the Model you Choose

The MBTI dataset chosen for this experiment limited the performance of the models from the start.

To start off, the data is skewed:

As the graph shows, sampling bias is present: entries for a single introverted MBTI type far outnumber entries from all extroverted types combined. This could largely be attributed to a shortage of extroverts on the platform, since online forums tend to attract introverted users, who are less likely than extroverts to be out and about.

Imbalanced datasets lead to model bias, specifically bias towards the majority groups in the dataset, as seen in the metrics of the HistGradientBoostingClassifier used in predictions:

hgb:              precision    recall  f1-score   support

        ENFJ       0.38      0.15      0.22       201
        ENFP       0.46      0.41      0.43       619
        ENTJ       0.39      0.24      0.30       221
        ENTP       0.50      0.40      0.44       682
        ESFJ       0.24      0.15      0.18        34
        ESFP       0.50      0.03      0.05        40
        ESTJ       0.30      0.09      0.13        35
        ESTP       0.47      0.24      0.32        70
        INFJ       0.50      0.47      0.48      1417
        INFP       0.43      0.71      0.53      1835
        INTJ       0.48      0.43      0.45       936
        INTP       0.52      0.51      0.52      1451
        ISFJ       0.68      0.31      0.42       222
        ISFP       0.48      0.27      0.35       258
        ISTJ       0.43      0.24      0.31       221
        ISTP       0.53      0.33      0.40       329

    accuracy                           0.47      8571
   macro avg       0.46      0.31      0.35      8571
weighted avg       0.48      0.47      0.46      8571

As seen in the hgb metrics, the supports for the extroverted sensing types (ESFJ, ESFP, ESTJ, ESTP) range from 34 to 70 samples. Compared to the supports of the introverted intuitive types (INFJ, INFP, INTJ, INTP), which range from 936 to 1835, the extroverted sensing types have roughly 30x fewer samples on average. The effect of this discrepancy is evident in the recall scores of ESFP (0.03) and ESTJ (0.09): the model rarely predicted these types, and did so only when it was fairly confident, as ESFP's precision of 0.50 alongside a recall of only 0.03 suggests.
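The figures in this paragraph can be checked with a few lines of arithmetic. The supports are copied from the hgb report; the ESFP counts are only estimates, because the report rounds precision and recall to two decimals.

```python
# Supports copied from the hgb report.
es_support = {"ESFJ": 34, "ESFP": 40, "ESTJ": 35, "ESTP": 70}
in_support = {"INFJ": 1417, "INFP": 1835, "INTJ": 936, "INTP": 1451}

es_mean = sum(es_support.values()) / len(es_support)
in_mean = sum(in_support.values()) / len(in_support)
ratio = in_mean / es_mean
print(f"mean ES support: {es_mean}, mean IN support: {in_mean}, ratio: {ratio:.1f}x")

# recall = true positives / support, precision = true positives / predictions.
# Both inputs are rounded in the report, so these counts are approximate.
esfp_true_positives = round(0.03 * es_support["ESFP"])
esfp_predictions = round(esfp_true_positives / 0.50)
print(f"ESFP: about {esfp_true_positives} correct out of about {esfp_predictions} predictions")
```

Under these estimates, ESFP's precision of 0.50 rests on only about two predictions, so it says little about how reliable the model is for that class.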

Numerical Encoding is Imperfect

Machines can't inherently interpret sarcasm and irony. That's a given, but it definitely comes into play when an INTP sarcastically states "I love Rust", which could be interpreted as a genuine love for the programming language.

TF-IDF works by weighting how often each word, such as "fart" or "love", appears in a post against how common that word is across all posts. The model then attempts to find correlations between those weights and the corresponding types.

TF-IDF, however, cannot detect...

Bias
Sentence Structure
Sarcasm and irony
"Vibes"

When faced with nuance, a model trained on these features can only fall back on surface-level word statistics. The numerical encoding of TF-IDF fails to represent the nuance and intricacies of natural language, and its limitations show in this dataset.
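A small example illustrates the sentence-structure limitation. With scikit-learn's default single-word TF-IDF, two sentences that use the same words in a different order receive identical vectors, even though they mean opposite things. The two sentences are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Same words, opposite meaning.
docs = [
    "i love rust not python",
    "i love python not rust",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs).toarray()

# Word order is discarded, so both rows come out the same.
identical = np.allclose(matrix[0], matrix[1])
print(identical)
```

The same holds for a sincere and a sarcastic use of one sentence: the text is identical, so the vector is too.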

How would I further improve this model?

To further improve the model, I would do the following:

Source more data (for underrepresented MBTIs like ESFJs) to combat class imbalance.
Implement better features by moving beyond word counts to "contextual embeddings" that allow for nuance and sarcasm to be detected by the model.
Simplify the goal from predicting all 4 axes of the MBTI at once (I/E, N/S, F/T, P/J) to predicting just a single axis at a time.
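As a sketch of the last point, each four-letter type can be turned into four binary labels, one per axis, so that a separate binary classifier could be trained for each. The helper name `to_axis_labels` is hypothetical and not part of the project.

```python
import pandas as pd

AXIS_NAMES = ["I/E", "N/S", "F/T", "P/J"]

def to_axis_labels(mbti: str) -> dict:
    """Map a four-letter type to four binary labels (1 = first letter of the axis)."""
    return {axis: int(letter == axis[0]) for axis, letter in zip(AXIS_NAMES, mbti)}

types = pd.Series(["INFP", "ESTJ", "ENTP"])
axis_labels = pd.DataFrame([to_axis_labels(t) for t in types])
print(axis_labels)
```

Each column is then a two-class target. Every axis pools samples from eight of the sixteen types, so the rarest types no longer have to be learned as separate classes.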
