CDS3005 – Foundations of Data Science

Lab Assessment | Machine Learning Model Comparison

1. Problem Statement

Student academic performance is influenced by a wide range of behavioural, demographic, and academic factors. Predicting whether a student will pass or fail based on these factors is a classic binary-classification problem in education analytics.

The challenge is to:

Identify the most informative features that determine student success.
Build and compare multiple machine learning models to classify students as Pass (1) or Fail (0).
Evaluate each model on standard metrics and determine the best performer.

2. Solution Statement

We design and evaluate three machine learning models on a synthetically generated but realistic student-performance dataset:

#	Model	Type
1	Logistic Regression	Linear parametric classifier
2	Random Forest	Ensemble (bagged decision trees)
3	Neural Network (MLP)	Feed-forward neural network (multi-layer perceptron)

Approach

Dataset creation - Generate 1,000 student records with 10 behavioural and academic features, add Gaussian noise to the score, and inject ~3% missing values into 4 feature columns to mimic real-world data.
Pre-processing - Median imputation -> Standard scaling (shared pipeline).
Train / Validation split - 80% train | 20% validation, stratified by class.
Training - Each model is trained on the training split (800 records).
Evaluation - Validation Accuracy, Validation Loss (Log-Loss), Confusion Matrix, ROC Curve, Misclassification Analysis, and a composite Radar Chart.
Comparison - A unified table ranks all three models across every metric.
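
The stratified 80/20 split described above can be sketched with scikit-learn's `train_test_split`. The arrays below are random placeholders rather than the real dataset, and `random_state=42` is an assumption that mirrors the seed used for the models.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))             # placeholder features (not the real data)
y = (rng.random(1000) < 0.55).astype(int)   # placeholder labels, roughly 55% Pass

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.20,    # 80% train | 20% validation
    stratify=y,        # keep the Pass/Fail ratio the same in both splits
    random_state=42,   # assumed seed
)

print(X_train.shape, X_val.shape)
print(round(y_train.mean(), 3), round(y_val.mean(), 3))
```

With `stratify=y`, the Pass rate in the two splits should differ by at most a fraction of a percentage point.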

3. Dataset Details

Source

Synthetically generated using dataset_generator.py (custom script, included in this repo). Saved to: data/student_performance.csv

Size

Property	Value
Rows (students)	1,000
Feature columns	10
Target column	`pass_fail` (0 = Fail, 1 = Pass)
Class distribution	~55% Pass · ~45% Fail
Missing values	~3% (injected into 4 feature columns)

Feature Descriptions

Feature	Type	Description
`study_hours`	Float	Daily study hours (0-12 h)
`attendance_pct`	Float	Class attendance percentage (30-100%)
`prev_gpa`	Float	Previous semester GPA (0-10)
`assignments_done`	Integer	Assignments submitted out of 10
`sleep_hours`	Float	Average daily sleep (3-10 h)
`extracurricular`	Integer	Number of extracurricular activities (0-3)
`internet_hrs`	Float	Daily internet/social-media usage (hours)
`parental_edu`	Integer	Parental education level (0=None, 1=HS, 2=UG, 3=PG)
`library_visits`	Integer	Weekly library visits (0-7)
`lab_score`	Float	Lab practical score (0-100)

Target

Value	Meaning
`1`	Pass
`0`	Fail

Generation Formula (weighted score -> threshold)

score = 0.30 x study_hours  +  0.20 x attendance_pct  +  0.20 x prev_gpa
      + 0.10 x assignments_done  +  0.08 x lab_score
      + 0.05 x (10-internet_hrs)  +  0.04 x library_visits  +  0.03 x parental_edu
      + Gaussian noise (sigma=5)

pass_fail = 1  if score > 58  else  0
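
The weights above sum to 1.00, but in raw units the noise-free score tops out near 35.5, well below the threshold of 58. The generator therefore presumably rescales each term before weighting. The sketch below is one hypothetical reading, not the code in dataset_generator.py: it maps every term onto 0-100 by dividing by its maximum, and it assumes `internet_hrs` ranges over 0-10 because the formula uses `10 - internet_hrs`.

```python
import random

# Weights copied from the formula above.
WEIGHTS = {
    "study_hours": 0.30, "attendance_pct": 0.20, "prev_gpa": 0.20,
    "assignments_done": 0.10, "lab_score": 0.08, "internet_hrs": 0.05,
    "library_visits": 0.04, "parental_edu": 0.03,
}

# Largest noise-free score if the features are used in raw units.
raw_max = (0.30 * 12 + 0.20 * 100 + 0.20 * 10 + 0.10 * 10
           + 0.08 * 100 + 0.05 * 10 + 0.04 * 7 + 0.03 * 3)

def to_percent(s):
    """ASSUMPTION: rescale every term to 0-100 by dividing by its maximum."""
    return {
        "study_hours": s["study_hours"] / 12 * 100,
        "attendance_pct": s["attendance_pct"],                  # already a percentage
        "prev_gpa": s["prev_gpa"] / 10 * 100,
        "assignments_done": s["assignments_done"] / 10 * 100,
        "lab_score": s["lab_score"],                            # already 0-100
        "internet_hrs": (10 - s["internet_hrs"]) / 10 * 100,    # less usage scores higher
        "library_visits": s["library_visits"] / 7 * 100,
        "parental_edu": s["parental_edu"] / 3 * 100,
    }

def label(student, rng=None, sigma=5.0, threshold=58.0):
    """Return (score, pass_fail); Gaussian noise is added only when rng is given."""
    pct = to_percent(student)
    score = sum(WEIGHTS[k] * pct[k] for k in WEIGHTS)
    if rng is not None:
        score += rng.gauss(0.0, sigma)
    return score, int(score > threshold)

example = {
    "study_hours": 6, "attendance_pct": 80, "prev_gpa": 7, "assignments_done": 8,
    "lab_score": 70, "internet_hrs": 4, "library_visits": 3, "parental_edu": 2,
}
print(round(raw_max, 2))
print(label(example))                      # noise-free score and label
print(label(example, random.Random(42)))   # same student with noise
```

Under this assumed rescaling the score lives on a 0-100 scale, so a threshold of 58 and a noise sigma of 5 become plausible magnitudes.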

4. Algorithm Codes

4.1 Pre-Processing Pipeline (shared by all models)

from sklearn.pipeline      import Pipeline
from sklearn.impute        import SimpleImputer
from sklearn.preprocessing import StandardScaler

def make_preprocessor():
    return Pipeline([
        ("imputer", SimpleImputer(strategy="median")),  # fill missing values
        ("scaler",  StandardScaler()),                   # zero-mean, unit-var
    ])
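
A small illustration of why the pre-processor lives inside each pipeline: the medians, means, and standard deviations are learned from the training split only and then reused on the validation split. The toy arrays are made up for this sketch.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pre = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Toy data (hypothetical): one missing value in each split.
X_train_toy = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])
X_val_toy = np.array([[np.nan, 5.0]])

Z_train = pre.fit_transform(X_train_toy)   # statistics are learned here only
Z_val = pre.transform(X_val_toy)           # training statistics are reused

print(pre.named_steps["imputer"].statistics_)   # per-column training medians
print(Z_train)
print(Z_val)
```

A missing validation value is filled with the training median, so no information from the validation split leaks into the fitted statistics.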

4.2 Model 1 - Logistic Regression

from sklearn.linear_model import LogisticRegression

model_lr = Pipeline([
    ("pre", make_preprocessor()),
    ("clf", LogisticRegression(
        max_iter     = 1000,   # sufficient iterations for convergence
        C            = 1.0,    # inverse regularisation strength
        random_state = 42
    )),
])

model_lr.fit(X_train, y_train)
y_pred = model_lr.predict(X_val)
y_prob = model_lr.predict_proba(X_val)[:, 1]

Key Hyperparameters

Parameter	Value	Meaning
`C`	1.0	Inverse regularisation strength (higher = weaker regularisation)
`max_iter`	1000	Max solver iterations
`solver`	lbfgs (default)	Optimisation algorithm
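
Because the inputs are standardised, the fitted coefficients offer a rough way to rank features by influence, which relates to the first goal in the problem statement. The sketch below uses made-up data and hypothetical feature names; on the real pipeline the same attribute would be read from `model_lr.named_steps["clf"]`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
names = ["feat_a", "feat_b", "feat_c"]     # hypothetical feature names
X = rng.normal(size=(500, 3))
# Placeholder target: feat_a dominates, feat_b matters a little, feat_c not at all.
y = (3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, C=1.0)),
])
model.fit(X, y)

coefs = model.named_steps["clf"].coef_[0]   # one coefficient per feature
ranking = sorted(zip(names, coefs), key=lambda t: abs(t[1]), reverse=True)
for name, c in ranking:
    print(f"{name:8s} {c:+.3f}")
```

Coefficient magnitude is only a first-pass indicator; correlated features can share or trade weight between them.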

4.3 Model 2 - Random Forest

from sklearn.ensemble import RandomForestClassifier

model_rf = Pipeline([
    ("pre", make_preprocessor()),
    ("clf", RandomForestClassifier(
        n_estimators = 200,   # 200 decision trees in the forest
        max_depth    = 12,    # caps tree depth to limit overfitting
        random_state = 42,
        n_jobs       = -1,    # use all CPU cores
    )),
])

model_rf.fit(X_train, y_train)

Key Hyperparameters

Parameter	Value	Meaning
`n_estimators`	200	Number of trees
`max_depth`	12	Maximum tree depth
`n_jobs`	-1	Parallelise training
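
One way to check whether a forest with these settings over-fits is to compare its training accuracy with its validation accuracy. The sketch below does this on placeholder data generated from a noisy linear rule; the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))                       # placeholder features
noise = rng.normal(scale=1.0, size=1000)
y = (X[:, :4].sum(axis=1) + noise > 0).astype(int)    # placeholder noisy linear rule

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=200, max_depth=12, random_state=42)
rf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, rf.predict(X_train))
val_acc = accuracy_score(y_val, rf.predict(X_val))
print(f"train={train_acc:.3f}  val={val_acc:.3f}  gap={train_acc - val_acc:.3f}")
```

A large gap between the two figures points to over-fitting; lowering `max_depth` or raising `min_samples_leaf` are the usual levers to shrink it.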

4.4 Model 3 - Neural Network (MLP)

from sklearn.neural_network import MLPClassifier

model_nn = Pipeline([
    ("pre", make_preprocessor()),
    ("clf", MLPClassifier(
        hidden_layer_sizes  = (128, 64, 32),  # 3 hidden layers
        activation          = "relu",
        solver              = "adam",
        learning_rate_init  = 0.001,
        max_iter            = 500,
        early_stopping      = True,            # stop when the held-out validation score stops improving
        validation_fraction = 0.1,             # 10% of the training split held out for early stopping
        random_state        = 42,
    )),
])

model_nn.fit(X_train, y_train)

Architecture

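Input(10) -> Dense(128, ReLU) -> Dense(64, ReLU) -> Dense(32, ReLU) -> Output(1, Logistic/Sigmoid)

For a binary target, scikit-learn's MLPClassifier uses a single logistic output unit rather than a two-unit softmax layer; predict_proba still returns two columns, P(Fail) and P(Pass).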

Key Hyperparameters

Parameter	Value	Meaning
`hidden_layer_sizes`	(128, 64, 32)	Neurons per hidden layer
`activation`	ReLU	Non-linearity
`solver`	Adam	Adaptive gradient optimiser
`early_stopping`	True	Prevent overfitting
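
The layer shapes that scikit-learn actually builds can be read off a fitted `MLPClassifier`. The sketch below fits the same hidden-layer sizes on placeholder data (not the real dataset) and prints the weight-matrix shapes and the output activation; for a binary target scikit-learn uses a single logistic output unit.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))             # placeholder data, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # placeholder binary target

clf = MLPClassifier(hidden_layer_sizes=(128, 64, 32), max_iter=200, random_state=42)
clf.fit(X, y)

print([w.shape for w in clf.coefs_])       # one weight matrix per layer transition
print(clf.n_outputs_, clf.out_activation_)
print(clf.predict_proba(X[:3]).shape)      # two columns: P(class 0), P(class 1)
```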

4.5 Evaluation Metrics Code

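from sklearn.metrics import (
    accuracy_score, log_loss,
    confusion_matrix, roc_curve, auc
)

# `model` stands for any fitted pipeline above (model_lr, model_rf, model_nn)
y_pred      = model.predict(X_val)                         # hard 0/1 labels
y_prob      = model.predict_proba(X_val)[:, 1]             # P(Pass)

val_acc     = accuracy_score(y_val, y_pred)
val_loss    = log_loss(y_val, model.predict_proba(X_val))  # validation log-loss
fpr, tpr, _ = roc_curve(y_val, y_prob)
roc_auc     = auc(fpr, tpr)
cm          = confusion_matrix(y_val, y_pred)              # rows = true class, cols = predicted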
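
The comparison table also reports Precision, Recall, F1-Score, and a misclassification count, all of which follow from the four confusion-matrix cells. The sketch below computes them by hand. The counts are back-calculated from the Logistic Regression row of the comparison table, not read from the confusion-matrix figure, so treat them as an inference. In scikit-learn the equivalents are `precision_score`, `recall_score`, and `f1_score`, and `confusion_matrix(y_val, y_pred).ravel()` yields the cells in the order tn, fp, fn, tp.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive the headline metrics from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)   # of predicted Pass, how many truly passed
    recall = tp / (tp + fn)      # of true Pass, how many were found
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "misclassified": fp + fn,
    }

# Counts inferred from the reported metrics (assumption, see lead-in).
m = binary_metrics(tn=84, fp=19, fn=21, tp=76)
for key, value in m.items():
    print(f"{key:14s} {value:.4f}" if key != "misclassified" else f"{key:14s} {value}")
```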

5. Results

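5.1 Validation Accuracy

[Figure: results/01_validation_accuracy.png]

5.2 Validation Loss

[Figure: results/02_validation_loss.png]

5.3 Confusion Matrices

[Figure: results/03_confusion_matrices.png]

5.4 ROC Curves

[Figure: results/04_roc_curves.png]

5.5 Misclassification Graph

[Figure: results/05_misclassification.png]

5.6 Visual Comparison Table

[Figure: results/07_comparison_table.png]

5.7 Radar Chart - Overall Model Profile

[Figure: results/08_radar_chart.png]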

6. Comparison Table

Sorted by Validation Accuracy (highest = best)

Rank	Model	Val Accuracy	Val Loss	ROC-AUC	Precision	Recall	F1-Score	Misclassified
1st	Logistic Regression	0.8000	0.4381	0.8788	0.8000	0.7835	0.7917	40
2nd	Neural Network (MLP)	0.7900	0.4503	0.8757	0.7957	0.7629	0.7789	42
3rd	Random Forest	0.7600	0.5120	0.8354	0.7753	0.7113	0.7419	48

Best Model: Logistic Regression

Criterion	Winner
Highest Validation Accuracy	Logistic Regression (80.00%)
Lowest Validation Loss	Logistic Regression (0.4381)
Highest ROC-AUC	Logistic Regression (0.8788)
Fewest Misclassifications	Logistic Regression (40 / 200)
Best F1-Score	Logistic Regression (0.7917)

Interpretation

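Logistic Regression ranks first on every reported metric. Because the labels were produced by thresholding a weighted linear score, the underlying decision boundary is itself linear, which favours a linear classifier. The margin over the MLP is small, though: 40 versus 42 misclassified out of 200 validation samples.
Neural Network (MLP) is a close second. Its three hidden layers can capture non-linearities, but with a linear generating rule and only 800 training records that extra capacity does not translate into a gain over the linear model.
Random Forest has the highest misclassification count (48) and the lowest AUC (0.8354). This may reflect mild over-fitting despite the max_depth constraint; axis-aligned tree splits can also only approximate a linear boundary that cuts across several features.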
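
To put the accuracy gaps in context, a rough normal-approximation (Wald) interval can be placed around each validation accuracy for n = 200. This is a back-of-the-envelope sketch, not part of the original analysis. Since all three models are scored on the same validation samples, a paired comparison such as McNemar's test would be the more rigorous choice.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n samples."""
    se = math.sqrt(acc * (1 - acc) / n)   # binomial standard error
    return acc - z * se, acc + z * se

reported = [
    ("Logistic Regression", 0.80),
    ("Neural Network (MLP)", 0.79),
    ("Random Forest", 0.76),
]
intervals = {name: accuracy_interval(acc, 200) for name, acc in reported}
for name, (lo, hi) in intervals.items():
    print(f"{name:22s} [{lo:.3f}, {hi:.3f}]")
```

The intervals are each roughly eleven percentage points wide and overlap heavily, so the ranking is best read as indicative for this split rather than conclusive.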

7. Project Structure

Lab Assesment/
|-- data/
|   `-- student_performance.csv   <- Generated dataset (1,000 rows x 11 cols)
|-- results/
|   |-- 01_validation_accuracy.png
|   |-- 02_validation_loss.png
|   |-- 03_confusion_matrices.png
|   |-- 04_roc_curves.png
|   |-- 05_misclassification.png
|   |-- 06_comparison_table.csv
|   |-- 07_comparison_table.png
|   `-- 08_radar_chart.png
|-- dataset_generator.py          <- Creates and saves the dataset
|-- train_models.py               <- Trains all 3 models + saves all graphs
`-- README.md                     <- This file

8. How to Run

# Step 1 - Generate the dataset
python dataset_generator.py

# Step 2 - Train models and produce all result graphs
python train_models.py

Dependencies: scikit-learn, pandas, numpy, matplotlib, seaborn

pip install scikit-learn pandas numpy matplotlib seaborn

CDS3005 - Foundations of Data Science | Lab Assessment | VIT
