Student academic performance is influenced by a wide range of behavioural, demographic, and academic factors. Predicting whether a student will pass or fail based on these factors is a classic binary-classification problem in education analytics.
The challenge is to:
- Identify the most informative features that determine student success.
- Build and compare multiple machine learning models to classify students as Pass (1) or Fail (0).
- Evaluate each model on standard metrics and determine the best performer.
We design and evaluate three machine learning models on a synthetically generated but realistic student-performance dataset:
| # | Model | Type |
|---|---|---|
| 1 | Logistic Regression | Linear parametric classifier |
| 2 | Random Forest | Ensemble (bagged decision trees) |
| 3 | Neural Network (MLP) | Deep-learning multi-layer perceptron |
- Dataset creation - Generate 1,000 student records with 10 behavioural and academic features plus realistic noise (3% missing values injected for authenticity).
- Pre-processing - Median imputation -> Standard scaling (shared pipeline).
- Train / Validation split - 80% train | 20% validation, stratified by class.
- Training - Each model trained on the training split.
- Evaluation - Validation Accuracy, Validation Loss (Log-Loss), Confusion Matrix, ROC Curve, Misclassification Analysis, and a composite Radar Chart.
- Comparison - A unified table ranks all three models across every metric.
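The stratified 80/20 split described in the steps above could look like the following. This is a minimal sketch on random stand-in data, assuming scikit-learn's `train_test_split`; the arrays `X` and `y` here are hypothetical and are not the repo's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 1,000 rows, 10 features, roughly 55% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.55).astype(int)

# 80/20 split; stratify=y keeps the Pass/Fail ratio nearly identical in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

print(X_train.shape, X_val.shape)
print(f"train pass rate: {y_train.mean():.3f}, val pass rate: {y_val.mean():.3f}")
```

Stratifying matters most when the validation set is small: with only 200 rows, an unstratified split can shift the class balance by a few percentage points and distort accuracy comparisons.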
Synthetically generated using dataset_generator.py (custom script, included in this repo).
Saved to: data/student_performance.csv
| Property | Value |
|---|---|
| Rows (students) | 1,000 |
| Feature columns | 10 |
| Target column | pass_fail (0 = Fail, 1 = Pass) |
| Class distribution | ~55% Pass · ~45% Fail |
| Missing values | ~3% (injected into 4 feature columns) |
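One way to inject missing values at this rate is sketched below. It assumes 3% of the rows are blanked in each affected column and picks four columns arbitrarily; which columns `dataset_generator.py` actually uses, and whether the 3% is per column or overall, is not stated here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 1000

# Toy frame with random values; the four column names are an assumption.
df = pd.DataFrame({
    "study_hours": rng.uniform(0, 12, n_rows),
    "attendance_pct": rng.uniform(30, 100, n_rows),
    "sleep_hours": rng.uniform(3, 10, n_rows),
    "lab_score": rng.uniform(0, 100, n_rows),
})

n_missing = round(0.03 * n_rows)  # 30 rows per column
for col in df.columns:
    # Choose distinct row positions at random and set them to NaN.
    idx = rng.choice(n_rows, size=n_missing, replace=False)
    df.loc[idx, col] = np.nan

print(df.isna().sum())
```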
| Feature | Type | Description |
|---|---|---|
| `study_hours` | Float | Daily study hours (0-12 h) |
| `attendance_pct` | Float | Class attendance percentage (30-100%) |
| `prev_gpa` | Float | Previous semester GPA (0-10) |
| `assignments_done` | Integer | Assignments submitted out of 10 |
| `sleep_hours` | Float | Average daily sleep (3-10 h) |
| `extracurricular` | Integer | Number of extracurricular activities (0-3) |
| `internet_hrs` | Float | Daily internet/social-media usage (hours) |
| `parental_edu` | Integer | Parental education level (0=None, 1=HS, 2=UG, 3=PG) |
| `library_visits` | Integer | Weekly library visits (0-7) |
| `lab_score` | Float | Lab practical score (0-100) |
| Value | Meaning |
|---|---|
| 1 | Pass |
| 0 | Fail |
score = 0.30 x study_hours + 0.20 x attendance_pct + 0.20 x prev_gpa
+ 0.10 x assignments_done + 0.08 x lab_score
+ 0.05 x (10-internet_hrs) + 0.04 x library_visits + 0.03 x parental_edu
+ Gaussian noise (sigma=5)
pass_fail = 1 if score > 58 else 0
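The weights in this rule sum to 1.00, but on the raw feature ranges the weighted sum tops out at about 35.5, which is below the threshold of 58. The sketch below therefore assumes each term is rescaled to 0-100 before weighting. That rescaling, the 0-10 range for `internet_hrs`, and the uniform sampling are assumptions made for illustration, not confirmed behaviour of `dataset_generator.py`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Toy raw features, sampled uniformly from the documented ranges.
study_hours = rng.uniform(0, 12, n)
attendance_pct = rng.uniform(30, 100, n)
prev_gpa = rng.uniform(0, 10, n)
assignments_done = rng.integers(0, 11, n)   # 0..10 inclusive
lab_score = rng.uniform(0, 100, n)
internet_hrs = rng.uniform(0, 10, n)        # assumed range
library_visits = rng.integers(0, 8, n)      # 0..7 inclusive
parental_edu = rng.integers(0, 4, n)        # 0..3 inclusive

# Largest value the rule can reach on the raw scales, before noise.
raw_max = (0.30 * 12 + 0.20 * 100 + 0.20 * 10 + 0.10 * 10
           + 0.08 * 100 + 0.05 * 10 + 0.04 * 7 + 0.03 * 3)

# Assumption: every term is rescaled to 0-100 before weighting.
terms = [
    (0.30, study_hours / 12 * 100),
    (0.20, attendance_pct),                  # already a percentage
    (0.20, prev_gpa / 10 * 100),
    (0.10, assignments_done / 10 * 100),
    (0.08, lab_score),
    (0.05, (10 - internet_hrs) / 10 * 100),  # more internet time lowers the score
    (0.04, library_visits / 7 * 100),
    (0.03, parental_edu / 3 * 100),
]
weight_sum = sum(w for w, _ in terms)
score = sum(w * x for w, x in terms) + rng.normal(0, 5, n)  # Gaussian noise, sigma=5
pass_fail = (score > 58).astype(int)

print(f"raw-scale maximum: {raw_max:.2f}")
print(f"weight sum: {weight_sum:.2f}")
print(f"pass rate in this toy sample: {pass_fail.mean():.2%}")
```

The pass rate printed here depends on the assumed sampling distributions, so it need not match the class balance of the real dataset.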
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def make_preprocessor():
    return Pipeline([
        ("imputer", SimpleImputer(strategy="median")),  # fill missing values
        ("scaler", StandardScaler()),                   # zero-mean, unit-var
    ])

from sklearn.linear_model import LogisticRegression

model_lr = Pipeline([
    ("pre", make_preprocessor()),
    ("clf", LogisticRegression(
        max_iter=1000,    # sufficient iterations for convergence
        C=1.0,            # inverse regularisation strength
        random_state=42,
    )),
])
model_lr.fit(X_train, y_train)
y_pred = model_lr.predict(X_val)
y_prob = model_lr.predict_proba(X_val)[:, 1]  # probability of class 1 (Pass)

Key Hyperparameters
| Parameter | Value | Meaning |
|---|---|---|
| `C` | 1.0 | Regularisation (higher = less regularised) |
| `max_iter` | 1000 | Max solver iterations |
| `solver` | lbfgs (default) | Optimisation algorithm |
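Because the features are standardised before the classifier, the fitted coefficients give a rough ranking of feature influence, which speaks to the goal of identifying the most informative features. The sketch below uses toy data with three columns and a hand-made label; the data, the weights 3.0 and 1.0, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
feature_names = ["study_hours", "attendance_pct", "sleep_hours"]
X = rng.normal(size=(500, 3))
# Toy label: driven mostly by column 0, a little by column 1, not at all by column 2.
y = (3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(C=1.0, max_iter=1000, random_state=42)),
])
model.fit(X, y)

# One coefficient per standardised feature; larger magnitude = stronger influence.
coefs = model.named_steps["clf"].coef_[0]
ranking = sorted(zip(feature_names, coefs), key=lambda t: abs(t[1]), reverse=True)
for name, c in ranking:
    print(f"{name:15s} {c:+.3f}")
```

Coefficient magnitudes are only comparable across features because of the scaling step; on unscaled inputs they would mostly reflect each feature's units.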
from sklearn.ensemble import RandomForestClassifier

model_rf = Pipeline([
    ("pre", make_preprocessor()),
    ("clf", RandomForestClassifier(
        n_estimators=200,   # 200 decision trees in the forest
        max_depth=12,       # limits tree depth to curb overfitting
        random_state=42,
        n_jobs=-1,          # use all CPU cores
    )),
])
model_rf.fit(X_train, y_train)

Key Hyperparameters
| Parameter | Value | Meaning |
|---|---|---|
| `n_estimators` | 200 | Number of trees |
| `max_depth` | 12 | Maximum tree depth |
| `n_jobs` | -1 | Parallelise training |
from sklearn.neural_network import MLPClassifier

model_nn = Pipeline([
    ("pre", make_preprocessor()),
    ("clf", MLPClassifier(
        hidden_layer_sizes=(128, 64, 32),  # 3 hidden layers
        activation="relu",
        solver="adam",
        learning_rate_init=0.001,
        max_iter=500,
        early_stopping=True,               # stop if val loss stops improving
        validation_fraction=0.1,
        random_state=42,
    )),
])
model_nn.fit(X_train, y_train)

Architecture
Input(10) -> Dense(128, ReLU) -> Dense(64, ReLU) -> Dense(32, ReLU) -> Output(1, Sigmoid)
Key Hyperparameters
| Parameter | Value | Meaning |
|---|---|---|
| `hidden_layer_sizes` | (128, 64, 32) | Neurons per hidden layer |
| `activation` | ReLU | Non-linearity |
| `solver` | Adam | Adaptive gradient optimiser |
| `early_stopping` | True | Prevent overfitting |
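The layer layout can be checked on a fitted model through its `coefs_` and `out_activation_` attributes. For a two-class target, scikit-learn's `MLPClassifier` ends in a single logistic (sigmoid) output unit. The sketch below fits the same hyperparameters on random toy data, which is an assumption standing in for the real training split.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary label

clf = MLPClassifier(
    hidden_layer_sizes=(128, 64, 32),
    activation="relu",
    solver="adam",
    learning_rate_init=0.001,
    max_iter=500,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=42,
)
clf.fit(X, y)

# One weight matrix per connection between consecutive layers.
layer_shapes = [w.shape for w in clf.coefs_]
print(layer_shapes)
print("output activation:", clf.out_activation_)
print("epochs run before stopping:", clf.n_iter_)
```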
from sklearn.metrics import (
    accuracy_score, log_loss,
    confusion_matrix, roc_curve, auc
)

# y_pred and y_prob come from the fitted model being evaluated
val_acc = accuracy_score(y_val, y_pred)
val_loss = log_loss(y_val, y_prob)       # y_prob = predicted probability of Pass
fpr, tpr, _ = roc_curve(y_val, y_prob)
roc_auc = auc(fpr, tpr)
cm = confusion_matrix(y_val, y_pred)

Sorted by Validation Accuracy (highest = best)
| Rank | Model | Val Accuracy | Val Loss | ROC-AUC | Precision | Recall | F1-Score | Misclassified |
|---|---|---|---|---|---|---|---|---|
| 1st | Logistic Regression | 0.8000 | 0.4381 | 0.8788 | 0.8000 | 0.7835 | 0.7917 | 40 |
| 2nd | Neural Network (MLP) | 0.7900 | 0.4503 | 0.8757 | 0.7957 | 0.7629 | 0.7789 | 42 |
| 3rd | Random Forest | 0.7600 | 0.5120 | 0.8354 | 0.7753 | 0.7113 | 0.7419 | 48 |
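The Precision, Recall, and F1 columns can be cross-checked against the accuracy and misclassification counts. The sketch below does this for the Logistic Regression row. The four confusion-matrix counts are inferred from the reported figures (200 validation rows, 40 errors, precision 0.8000, recall 0.7835) and are not read from `03_confusion_matrices.png`, so treat them as an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Inferred counts for Logistic Regression (assumption, see lead-in).
tp, fp, fn, tn = 76, 19, 21, 84

# Rebuild label vectors that produce exactly these counts.
y_true = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
y_pred = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)   # tp / (tp + fp)
recall = recall_score(y_true, y_pred)         # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two
misclassified = int((y_true != y_pred).sum())

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} misclassified={misclassified}")
```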
| Criterion | Winner |
|---|---|
| Highest Validation Accuracy | Logistic Regression (80.00%) |
| Lowest Validation Loss | Logistic Regression (0.4381) |
| Highest ROC-AUC | Logistic Regression (0.8788) |
| Fewest Misclassifications | Logistic Regression (40 / 200) |
| Best F1-Score | Logistic Regression (0.7917) |
- Logistic Regression is the clear winner on this dataset. The labels come from thresholding a weighted linear sum of the features plus Gaussian noise, so the underlying decision boundary is linear, which favours the linear classifier.
- Neural Network (MLP) is a close second. Its three hidden layers can model non-linearities, but it would likely need more data to surpass the linear model.
- Random Forest has the highest misclassification count (48) and lowest AUC (0.8354), suggesting it slightly over-fits the training data despite the `max_depth` constraint.
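A simple way to probe for over-fitting is to compare training and validation accuracy for each model: a large gap points to memorised noise. The sketch below does this on toy data with a noisy linear rule, which is an assumption that mimics, but does not reproduce, the generated dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
# Toy label: linear in the first three features, with substantial noise.
y = (X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(scale=1.5, size=1000) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(
        n_estimators=200, max_depth=12, random_state=42
    ),
}

gaps = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    gaps[name] = train_acc - val_acc
    print(f"{name:20s} train={train_acc:.3f} val={val_acc:.3f} gap={gaps[name]:+.3f}")
```

On data like this, a depth-12 forest can fit the training rows almost perfectly, so its gap is typically much wider than the linear model's; the same comparison on the real splits would show whether that explanation holds here.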
Lab Assesment/
|-- data/
| `-- student_performance.csv <- Generated dataset (1,000 rows x 11 cols)
|-- results/
| |-- 01_validation_accuracy.png
| |-- 02_validation_loss.png
| |-- 03_confusion_matrices.png
| |-- 04_roc_curves.png
| |-- 05_misclassification.png
| |-- 06_comparison_table.csv
| |-- 07_comparison_table.png
| `-- 08_radar_chart.png
|-- dataset_generator.py <- Creates and saves the dataset
|-- train_models.py <- Trains all 3 models + saves all graphs
`-- README.md <- This file
# Step 1 - Generate the dataset
python dataset_generator.py
# Step 2 - Train models and produce all result graphs
python train_models.py

Dependencies: scikit-learn, pandas, numpy, matplotlib, seaborn

pip install scikit-learn pandas numpy matplotlib seaborn

CDS3005 - Foundations of Data Science | Lab Assessment | VIT






