An experiment to determine the user's MBTI personality type from provided text.
All models were provided by the scikit-learn library:
- RandomForestClassifier
- HistGradientBoostingClassifier
- LogisticRegression
First, the dataset was imported and explored using Exploratory Data Analysis (EDA) techniques. It was checked for class imbalance and null entries, and the character length of each row's posts was counted.
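The checks above can be sketched with pandas. This is only an illustration on toy data: the column names 'type' and 'posts' are assumptions, not something this write-up confirms.

```python
import pandas as pd

# Toy stand-in for the real dataset; the column names "type" and "posts"
# are assumptions made for this sketch.
df = pd.DataFrame({
    "type": ["INFP", "INFP", "INTJ", "ESFJ", None],
    "posts": ["hello world", "a longer post here", "short", None, "orphan"],
})

# Class balance: number of rows per MBTI type (missing types are ignored).
class_counts = df["type"].value_counts()

# Null entries per column.
null_counts = df.isnull().sum()

# Character length of each row's posts (missing posts stay missing).
post_lengths = df["posts"].str.len()

print(class_counts)
print(null_counts)
print(post_lengths.describe())
```

On the real data, a heavily lopsided value_counts() output is the first sign of the imbalance discussed later.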
Second, the data was preprocessed. The preprocessing steps included dropping null and duplicate entries, splitting entries and creating chunks to provide more samples to minority classes, and cleaning the post text by removing links and miscellaneous text.
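A minimal sketch of the cleaning and chunking steps, using only the standard library. The helper names clean_text and chunk_posts are hypothetical, and the "|||" separator between a user's posts is an assumption about the raw format, not something stated here.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def clean_text(text: str) -> str:
    """Remove links and collapse runs of whitespace."""
    without_links = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_links).strip()

def chunk_posts(posts: list[str], n_chunks: int = 5) -> list[str]:
    """Join one user's posts into at most n_chunks text chunks of similar size."""
    size = max(1, -(-len(posts) // n_chunks))  # ceiling division, at least 1
    groups = [posts[i:i + size] for i in range(0, len(posts), size)]
    return [" ".join(group) for group in groups]

# Hypothetical raw entry: posts joined by "|||" (an assumed separator).
raw = "I love rust|||see https://example.com/x now|||third|||fourth|||fifth post"
posts = [clean_text(p) for p in raw.split("|||")]
chunks = chunk_posts(posts, n_chunks=5)
print(chunks)
```

Each chunk becomes its own row with the user's MBTI type as the label, which is why the later split has to keep a user's chunks together.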
Following preprocessing, the models were trained and evaluated. To start off, X (posts) and y (MBTI type) were defined. Since the data had been chunked in the preprocessing step, a single user was now split into 5 entries, so a Group Shuffle split was necessary to ensure that all of a user's entries remain in the same set.
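A group-aware split can be sketched with scikit-learn's GroupShuffleSplit. The toy data, the 'groups' array of user ids, and the test_size value are assumptions for illustration; the actual settings are not given in this write-up.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 4 users, each split into 5 chunks (20 rows total).
n_users, chunks_per_user = 4, 5
groups = np.repeat(np.arange(n_users), chunks_per_user)  # hypothetical user ids
X = np.array([f"chunk {i} of user {g}" for i, g in enumerate(groups)])
y = np.repeat(["INFP", "INTJ", "ENTP", "ISFJ"], chunks_per_user)

# One shuffle split that assigns whole users to either train or test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Sanity check: no user id appears on both sides of the split.
shared_users = set(groups[train_idx]) & set(groups[test_idx])
print(shared_users)
```

A plain random split would scatter one user's chunks across both sets, letting the model score well by recognizing a user's writing rather than their type.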
Following a Group Shuffle split, the text in the X DataFrame was vectorized using a TF-IDF Vectorizer. The 'X_train' split was fit-transformed, and the 'X_test' split was transformed. The y splits were encoded, with the 'y_train' split undergoing a fit-transform step, and the 'y_test' split undergoing a transform step.
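The fit-transform versus transform distinction can be sketched as follows. The toy texts and labels are made up, and the default TfidfVectorizer settings are an assumption, since the actual vectorizer parameters are not listed here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy splits standing in for the real ones.
X_train = ["i love quiet evenings", "lets go to the party", "quiet books and tea"]
X_test = ["party tonight", "tea and books"]
y_train = ["INFP", "ESFP", "INFP"]
y_test = ["ESFP", "INFP"]

# Learn the vocabulary and IDF weights from the training text only,
# then reuse them unchanged on the test text.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Same idea for the labels: learn the class-to-integer mapping on train only.
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

print(X_train_tfidf.shape, X_test_tfidf.shape)
print(list(label_encoder.classes_))
```

Fitting on the training split alone keeps test-set vocabulary and word statistics from leaking into training. Words that only occur in the test text (such as "tonight" here) are simply ignored.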
The models were then fitted using hyperparameters tailored to the imbalanced dataset:
lr_model = LogisticRegression(max_iter=1000, class_weight='balanced')
lr_model.fit(X_train_tfidf, y_train_encoded)
rf_model = RandomForestClassifier(n_estimators=100, min_samples_leaf=20, class_weight='balanced', random_state=42)
rf_model.fit(X_train_tfidf, y_train_encoded)
hgb_model = HistGradientBoostingClassifier(max_iter=100, learning_rate=0.05, max_depth=4, random_state=42)
hgb_model.fit(X_train_tfidf.toarray(), y_train_encoded)
Following fitting, the models were left to predict:
y_pred_lr = lr_model.predict(X_test_tfidf)
y_pred_rf = rf_model.predict(X_test_tfidf)
y_pred_hgb = hgb_model.predict(X_test_tfidf.toarray())
Lastly, evaluations were collected from the three models.
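The per-class reports in this write-up have the layout of scikit-learn's classification_report. Assuming that function was used (this is not stated explicitly), a minimal sketch on toy labels looks like this; y_true and y_pred are hypothetical stand-ins for the real test labels and predictions.

```python
from sklearn.metrics import classification_report

# Toy labels: ESFP is a minority class that the model never predicts.
y_true = ["INFP", "INFP", "INFP", "INTJ", "INTJ", "ESFP"]
y_pred = ["INFP", "INFP", "INTJ", "INTJ", "INFP", "INFP"]

# Text report with per-class precision, recall, f1-score and support.
# zero_division=0 reports 0.0 for classes that were never predicted.
print(classification_report(y_true, y_pred, zero_division=0))

# The same numbers as a dictionary, convenient for comparing models.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
```

The support column is the number of true samples of each class in the test set, which is what makes the class imbalance visible in the reports.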
First was the LogisticRegression model:
lr:          precision    recall  f1-score   support
ENFJ              0.20      0.34      0.26       201
ENFP              0.42      0.47      0.44       619
ENTJ              0.23      0.45      0.30       221
ENTP              0.43      0.38      0.40       682
ESFJ              0.08      0.24      0.12        34
ESFP              0.02      0.07      0.04        40
ESTJ              0.15      0.40      0.22        35
ESTP              0.14      0.34      0.20        70
INFJ              0.55      0.35      0.43      1417
INFP              0.62      0.46      0.53      1835
INTJ              0.46      0.42      0.44       936
INTP              0.57      0.47      0.52      1451
ISFJ              0.33      0.47      0.39       222
ISFP              0.22      0.37      0.27       258
ISTJ              0.23      0.44      0.31       221
ISTP              0.27      0.46      0.34       329
accuracy                              0.42      8571
macro avg         0.31      0.38      0.32      8571
weighted avg      0.48      0.42      0.44      8571
Second up, the RandomForestClassifier model:
rf:          precision    recall  f1-score   support
ENFJ              0.22      0.36      0.27       201
ENFP              0.41      0.38      0.40       619
ENTJ              0.28      0.47      0.35       221
ENTP              0.45      0.41      0.43       682
ESFJ              0.07      0.29      0.11        34
ESFP              0.02      0.05      0.03        40
ESTJ              0.14      0.37      0.20        35
ESTP              0.14      0.46      0.22        70
INFJ              0.59      0.31      0.41      1417
INFP              0.54      0.46      0.50      1835
INTJ              0.45      0.33      0.38       936
INTP              0.49      0.43      0.45      1451
ISFJ              0.32      0.53      0.40       222
ISFP              0.24      0.41      0.30       258
ISTJ              0.23      0.42      0.30       221
ISTP              0.25      0.45      0.32       329
accuracy                              0.40      8571
macro avg         0.30      0.38      0.32      8571
weighted avg      0.46      0.40      0.41      8571
And last, the HistGradientBoostingClassifier model:
hgb:         precision    recall  f1-score   support
ENFJ              0.38      0.15      0.22       201
ENFP              0.46      0.41      0.43       619
ENTJ              0.39      0.24      0.30       221
ENTP              0.50      0.40      0.44       682
ESFJ              0.24      0.15      0.18        34
ESFP              0.50      0.03      0.05        40
ESTJ              0.30      0.09      0.13        35
ESTP              0.47      0.24      0.32        70
INFJ              0.50      0.47      0.48      1417
INFP              0.43      0.71      0.53      1835
INTJ              0.48      0.43      0.45       936
INTP              0.52      0.51      0.52      1451
ISFJ              0.68      0.31      0.42       222
ISFP              0.48      0.27      0.35       258
ISTJ              0.43      0.24      0.31       221
ISTP              0.53      0.33      0.40       329
accuracy                              0.47      8571
macro avg         0.46      0.31      0.35      8571
weighted avg      0.48      0.47      0.46      8571
The MBTI dataset that was chosen for this experiment was fated to plague the performance of the model.
To start off, the data is skewed:
As the graph shows, sampling bias is present: the entries for a single introverted MBTI type far outnumber the entries from all extroverted types combined. This could largely be attributed to a lack of extroverts on the platform, as online forums tend to attract introverted users, who are less likely than extroverts to be outside.
Imbalanced datasets lead to model bias, specifically bias towards the majority groups in the dataset, as seen in the metrics of the HistGradientBoostingClassifier used in predictions:
hgb:         precision    recall  f1-score   support
ENFJ              0.38      0.15      0.22       201
ENFP              0.46      0.41      0.43       619
ENTJ              0.39      0.24      0.30       221
ENTP              0.50      0.40      0.44       682
ESFJ              0.24      0.15      0.18        34
ESFP              0.50      0.03      0.05        40
ESTJ              0.30      0.09      0.13        35
ESTP              0.47      0.24      0.32        70
INFJ              0.50      0.47      0.48      1417
INFP              0.43      0.71      0.53      1835
INTJ              0.48      0.43      0.45       936
INTP              0.52      0.51      0.52      1451
ISFJ              0.68      0.31      0.42       222
ISFP              0.48      0.27      0.35       258
ISTJ              0.43      0.24      0.31       221
ISTP              0.53      0.33      0.40       329
accuracy                              0.47      8571
macro avg         0.46      0.31      0.35      8571
weighted avg      0.48      0.47      0.46      8571
As seen in the hgb metrics, the supports for the extroverted sensing types (ESFJ, ESFP, ESTJ, ESTP) range from 34 to 70 samples. The supports of the introverted intuitive types (INFJ, INFP, INTJ, INTP) range from 936 to 1835, so the extroverted sensing types have roughly 30x fewer samples on average. The effect of this discrepancy is evident in the recall scores of ESFP (0.03) and ESTJ (0.09), which suggest that the model rarely predicted these types unless it was fairly confident (ESFP reached a precision of 0.50 despite a recall of only 0.03).
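The "roughly 30x" figure can be checked directly from the support column of the hgb report. The short calculation below uses only those reported numbers; the variable names are mine.

```python
from statistics import mean

# Supports copied from the hgb report.
extroverted_sensing = {"ESFJ": 34, "ESFP": 40, "ESTJ": 35, "ESTP": 70}
introverted_intuitive = {"INFJ": 1417, "INFP": 1835, "INTJ": 936, "INTP": 1451}

es_mean = mean(extroverted_sensing.values())    # 44.75
in_mean = mean(introverted_intuitive.values())  # 1409.75
ratio = in_mean / es_mean

# Recall = true positives / support, so a recall of 0.03 on 40 ESFP samples
# corresponds to roughly one correctly identified ESFP entry.
esfp_true_positives = round(0.03 * 40)

print(f"{ratio:.1f}x fewer samples, about {esfp_true_positives} ESFP hit(s)")
```

With about one true positive, a precision of 0.50 for ESFP rests on only a couple of predictions, so it should be read with caution.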
Machines can't inherently interpret sarcasm and irony. That's a given, but it definitely comes into play when an INTP sarcastically states "I love rust", which could be interpreted as a genuine love for the programming language.
TF-IDF works by weighting how often each word, such as "fart" or "love", appears in a post against how common that word is across all posts, and the model attempts to find correlations between those weights and the corresponding types.
TF-IDF, however, cannot detect:
- Bias
- Sentence Structure
- Sarcasm and irony
- "Vibes"
When faced with nuance, machines fall back on quick assumptions. The numerical encoding of TF-IDF fails to represent the nuance and intricacies of natural language, and therefore shows its limitations in this dataset.
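The sentence-structure limitation is easy to demonstrate. The sketch below assumes the default TfidfVectorizer settings (single-word tokens); the example sentences are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the dog bit the man",
    "the man bit the dog",  # same words, opposite meaning
    "i love rust",          # sincere or sarcastic: the words are identical
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs).toarray()

# Word order is discarded, so the first two sentences get the same vector.
same_vector = np.allclose(vectors[0], vectors[1])
print(same_vector)
```

In the same way, a sarcastic "I love rust" and a sincere one produce identical features, so no model trained on them can tell the two apart.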
To further improve the model, I would do the following:
- Source more data (for underrepresented MBTIs like ESFJs) to combat class imbalance.
- Implement better features by moving beyond word counts to "contextual embeddings" that allow for nuance and sarcasm to be detected by the model.
- Simplify the goal from predicting all 4 axes of the MBTI at once (I/E, N/S, F/T, P/J) to predicting just a single axis at a time.
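The last idea amounts to turning one 16-class label into four binary labels. A minimal sketch of that conversion follows; the helper name to_axis_labels and the 0/1 encoding are hypothetical choices, not part of the experiment above.

```python
AXES = ("I/E", "N/S", "F/T", "P/J")

def to_axis_labels(mbti_type: str) -> dict[str, int]:
    """Map a four-letter type to four binary labels (1 = first letter of the axis)."""
    letters = mbti_type.upper()
    if len(letters) != 4:
        raise ValueError(f"expected a four-letter type, got {mbti_type!r}")
    labels = {}
    for axis, letter in zip(AXES, letters):
        first, second = axis.split("/")
        if letter not in (first, second):
            raise ValueError(f"{letter!r} is not valid on the {axis} axis")
        labels[axis] = int(letter == first)
    return labels

print(to_axis_labels("INFP"))
print(to_axis_labels("ESTJ"))
```

Each axis could then get its own binary classifier. Every type contributes to all four axes, so even rare types such as ESFJ still add training samples to the E, S, F and J sides.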
