harshchelani08/Data-Science-Project


CDS3005 – Foundations of Data Science

Lab Assessment | Machine Learning Model Comparison


1. Problem Statement

Student academic performance is influenced by a wide range of behavioural, demographic, and academic factors. Predicting whether a student will pass or fail based on these factors is a classic binary-classification problem in education analytics.

The challenge is to:

  • Identify the most informative features that determine student success.
  • Build and compare multiple machine learning models to classify students as Pass (1) or Fail (0).
  • Evaluate each model on standard metrics and determine the best performer.

2. Solution Statement

We design and evaluate three machine learning models on a synthetically generated but realistic student-performance dataset:

| # | Model | Type |
|---|-------|------|
| 1 | Logistic Regression | Linear parametric classifier |
| 2 | Random Forest | Ensemble (bagged decision trees) |
| 3 | Neural Network (MLP) | Deep-learning multi-layer perceptron |

Approach

  1. Dataset creation - Generate 1,000 student records with 10 behavioural and academic features plus realistic noise (3% missing values injected for authenticity).
  2. Pre-processing - Median imputation -> Standard scaling (shared pipeline).
  3. Train / Validation split - 80% train | 20% validation, stratified by class.
  4. Training - Each model trained on the training split.
  5. Evaluation - Validation Accuracy, Validation Loss (Log-Loss), Confusion Matrix, ROC Curve, Misclassification Analysis, and a composite Radar Chart.
  6. Comparison - A unified table ranks all three models across every metric.
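The stratified 80/20 split in step 3 can be sketched as follows. This is a minimal illustration on stand-in data from make_classification (an assumption; the project reads its own CSV), and the variable names and random_state are illustrative rather than taken from train_models.py.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data: 1,000 rows, 10 features (NOT the project's real dataset).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.45, 0.55], random_state=42)

# 80% train / 20% validation, preserving the class ratio in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

print(X_train.shape, X_val.shape)
print("Pass rate (all / train / val):", y.mean(), y_train.mean(), y_val.mean())
```

Passing stratify=y keeps the Pass/Fail proportion nearly identical in both splits, which matters when the validation set has only 200 rows.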

3. Dataset Details

Source

Synthetically generated using dataset_generator.py (custom script, included in this repo). Saved to: data/student_performance.csv

Size

| Property | Value |
|----------|-------|
| Rows (students) | 1,000 |
| Feature columns | 10 |
| Target column | pass_fail (0 = Fail, 1 = Pass) |
| Class distribution | ~55% Pass · ~45% Fail |
| Missing values | ~3% (injected into 4 feature columns) |
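One way to inject roughly 3% missing values into four columns is sketched below. The choice of columns, the uniform feature distributions, and the per-cell random mask are all assumptions made for illustration; dataset_generator.py may do this differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy frame using a few of the documented feature names and ranges.
df = pd.DataFrame({
    "study_hours":    rng.uniform(0, 12, 1000),
    "attendance_pct": rng.uniform(30, 100, 1000),
    "prev_gpa":       rng.uniform(0, 10, 1000),
    "sleep_hours":    rng.uniform(3, 10, 1000),
    "lab_score":      rng.uniform(0, 100, 1000),
})

# HYPOTHETICAL choice of the four columns that receive missing values.
cols_with_gaps = ["study_hours", "attendance_pct", "sleep_hours", "lab_score"]
missing_rate = 0.03

for col in cols_with_gaps:
    mask = rng.random(len(df)) < missing_rate   # about 3% of rows per column
    df.loc[mask, col] = np.nan

print(df.isna().mean().round(3))   # fraction of missing cells per column
```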

Feature Descriptions

| Feature | Type | Description |
|---------|------|-------------|
| study_hours | Float | Daily study hours (0-12 h) |
| attendance_pct | Float | Class attendance percentage (30-100%) |
| prev_gpa | Float | Previous semester GPA (0-10) |
| assignments_done | Integer | Assignments submitted out of 10 |
| sleep_hours | Float | Average daily sleep (3-10 h) |
| extracurricular | Integer | Number of extracurricular activities (0-3) |
| internet_hrs | Float | Daily internet/social-media usage (hours) |
| parental_edu | Integer | Parental education level (0=None, 1=HS, 2=UG, 3=PG) |
| library_visits | Integer | Weekly library visits (0-7) |
| lab_score | Float | Lab practical score (0-100) |

Target

| Value | Meaning |
|-------|---------|
| 1 | Pass |
| 0 | Fail |

Generation Formula (weighted score -> threshold)

score = 0.30 x study_hours  +  0.20 x attendance_pct  +  0.20 x prev_gpa
      + 0.10 x assignments_done  +  0.08 x lab_score
      + 0.05 x (10-internet_hrs)  +  0.04 x library_visits  +  0.03 x parental_edu
      + Gaussian noise (sigma=5)

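pass_fail = 1  if score > 58  else  0

The weights sum to 1.00, and sleep_hours and extracurricular do not enter the score. With the raw feature ranges listed above, the noise-free score cannot exceed about 35.5, so a threshold of 58 implies that the features are rescaled to a common range (for example 0-100) before weighting; see dataset_generator.py for the exact scaling.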
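The weighted-score rule can be sketched in NumPy as below. Two things here are assumptions and not taken from the source: each feature is min-max rescaled to 0-100 before weighting (the helper to_0_100 is hypothetical), and features are drawn uniformly from their documented ranges, with internet_hrs assumed to lie in 0-10. The resulting pass rate depends on those choices and will not necessarily match the ~55% reported for the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

def to_0_100(x, lo, hi):
    """Min-max rescale to 0-100 (ASSUMPTION: not stated in the README)."""
    return 100.0 * (x - lo) / (hi - lo)

# Uniform draws from the documented ranges (the distribution is an assumption).
study   = to_0_100(rng.uniform(0, 12, n), 0, 12)
attend  = to_0_100(rng.uniform(30, 100, n), 30, 100)
gpa     = to_0_100(rng.uniform(0, 10, n), 0, 10)
assign  = to_0_100(rng.integers(0, 11, n), 0, 10)
lab     = to_0_100(rng.uniform(0, 100, n), 0, 100)
low_net = to_0_100(10 - rng.uniform(0, 10, n), 0, 10)   # internet range 0-10 assumed
library = to_0_100(rng.integers(0, 8, n), 0, 7)
par_edu = to_0_100(rng.integers(0, 4, n), 0, 3)

# Weights copied from the generation formula; they sum to 1.00.
score = (0.30 * study + 0.20 * attend + 0.20 * gpa + 0.10 * assign
         + 0.08 * lab + 0.05 * low_net + 0.04 * library + 0.03 * par_edu
         + rng.normal(0, 5, n))          # Gaussian noise, sigma = 5

pass_fail = (score > 58).astype(int)     # 1 = Pass, 0 = Fail
print("Pass rate:", pass_fail.mean())
```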

4. Algorithm Codes

4.1 Pre-Processing Pipeline (shared by all models)

from sklearn.pipeline      import Pipeline
from sklearn.impute        import SimpleImputer
from sklearn.preprocessing import StandardScaler

def make_preprocessor():
    return Pipeline([
        ("imputer", SimpleImputer(strategy="median")),  # fill missing values
        ("scaler",  StandardScaler()),                   # zero-mean, unit-var
    ])
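To see what the shared pre-processor does, the sketch below runs the same two steps on a small toy matrix (the values are invented for illustration). It fills each gap with the column median and then standardises each column.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pre = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler",  StandardScaler()),
])

# Toy matrix: two features, one missing value in each column.
X_toy = np.array([
    [2.0,    60.0],
    [4.0,    np.nan],
    [np.nan, 80.0],
    [8.0,    100.0],
])

X_out = pre.fit_transform(X_toy)
medians = pre.named_steps["imputer"].statistics_
print(medians)                                 # per-column medians used for filling
print(X_out.mean(axis=0), X_out.std(axis=0))   # approximately 0 and 1
```

Because the pre-processor sits inside each model's Pipeline, its medians, means, and variances are learned from the training split only when fit is called, and are then reused unchanged on the validation split.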

4.2 Model 1 - Logistic Regression

from sklearn.linear_model import LogisticRegression

model_lr = Pipeline([
    ("pre", make_preprocessor()),
    ("clf", LogisticRegression(
        max_iter     = 1000,   # sufficient iterations for convergence
        C            = 1.0,    # inverse regularisation strength
        random_state = 42
    )),
])

model_lr.fit(X_train, y_train)
y_pred = model_lr.predict(X_val)
y_prob = model_lr.predict_proba(X_val)[:, 1]

Key Hyperparameters

| Parameter | Value | Meaning |
|-----------|-------|---------|
| C | 1.0 | Regularisation (higher = less regularised) |
| max_iter | 1000 | Max solver iterations |
| solver | lbfgs (default) | Optimisation algorithm |
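Since the inputs are standardised, the fitted coefficients give a rough ranking of feature influence. The sketch below shows how they could be read out of a pipeline; it uses stand-in data and placeholder feature names, not the project's dataset or results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (NOT the project's dataset); feature names are placeholders.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=42)
names = [f"feature_{i}" for i in range(X.shape[1])]

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])
model.fit(X, y)

# With standardised inputs, |coefficient| is a rough measure of influence.
coefs = model.named_steps["clf"].coef_.ravel()
for name, c in sorted(zip(names, coefs), key=lambda t: -abs(t[1])):
    print(f"{name:>10s}  {c:+.3f}")
```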

4.3 Model 2 - Random Forest

from sklearn.ensemble import RandomForestClassifier

model_rf = Pipeline([
    ("pre", make_preprocessor()),
    ("clf", RandomForestClassifier(
        n_estimators = 200,   # 200 decision trees in the forest
        max_depth    = 12,    # caps tree depth to limit overfitting
        random_state = 42,
        n_jobs       = -1,    # use all CPU cores
    )),
])

model_rf.fit(X_train, y_train)

Key Hyperparameters

| Parameter | Value | Meaning |
|-----------|-------|---------|
| n_estimators | 200 | Number of trees |
| max_depth | 12 | Maximum tree depth |
| n_jobs | -1 | Parallelise training |

4.4 Model 3 - Neural Network (MLP)

from sklearn.neural_network import MLPClassifier

model_nn = Pipeline([
    ("pre", make_preprocessor()),
    ("clf", MLPClassifier(
        hidden_layer_sizes  = (128, 64, 32),  # 3 hidden layers
        activation          = "relu",
        solver              = "adam",
        learning_rate_init  = 0.001,
        max_iter            = 500,
        early_stopping      = True,            # stop if val loss stops improving
        validation_fraction = 0.1,
        random_state        = 42,
    )),
])

model_nn.fit(X_train, y_train)

Architecture

Input(10) -> Dense(128, ReLU) -> Dense(64, ReLU) -> Dense(32, ReLU) -> Output(1, Sigmoid)

Key Hyperparameters

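| Parameter | Value | Meaning |
|-----------|-------|---------|
| hidden_layer_sizes | (128, 64, 32) | Neurons per hidden layer |
| activation | ReLU | Non-linearity |
| solver | Adam | Adaptive gradient optimiser |
| early_stopping | True | Prevent overfitting |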
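For a binary target, scikit-learn's MLPClassifier builds a single output unit with a logistic (sigmoid) activation. The sketch below checks this on stand-in data and counts the trainable parameters for the (128, 64, 32) layout; max_iter is kept small only to make the check fast, so the fit itself is not meaningful.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Stand-in binary data with 10 features (NOT the project's dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

mlp = MLPClassifier(hidden_layer_sizes=(128, 64, 32), activation="relu",
                    solver="adam", max_iter=20, random_state=42)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)  # few epochs on purpose
    mlp.fit(X, y)

# Weight-matrix shapes from input to output, plus the total parameter count.
layer_shapes = [w.shape for w in mlp.coefs_]
n_params = sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)

print(layer_shapes)
print(mlp.n_outputs_, mlp.out_activation_)
print(n_params)
```

The count works out to 10·128 + 128·64 + 64·32 + 32·1 weights plus 128 + 64 + 32 + 1 biases, 11,777 parameters in total, which is large relative to 800 training rows.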
early_stopping True Prevent overfitting

4.5 Evaluation Metrics Code

from sklearn.metrics import (
    accuracy_score, log_loss,
    confusion_matrix, roc_curve, auc
)

# y_pred and y_prob come from the fitted pipeline being evaluated (e.g. model_lr)
val_acc     = accuracy_score(y_val, y_pred)
val_loss    = log_loss(y_val, y_prob)         # y_prob = P(class = 1)
fpr, tpr, _ = roc_curve(y_val, y_prob)
roc_auc     = auc(fpr, tpr)
cm          = confusion_matrix(y_val, y_pred)
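The same metrics can be wrapped in one helper so that every model is scored identically. The function below is a hypothetical sketch (the name evaluate and the returned keys are not from train_models.py), demonstrated on stand-in data. It also computes precision, recall, and F1, which appear in the comparison table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, log_loss, roc_auc_score,
                             precision_score, recall_score, f1_score,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

def evaluate(model, X_val, y_val):
    """Return validation metrics for one fitted binary classifier."""
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]      # P(class = 1)
    return {
        "accuracy":  accuracy_score(y_val, y_pred),
        "log_loss":  log_loss(y_val, y_prob),
        "roc_auc":   roc_auc_score(y_val, y_prob),
        "precision": precision_score(y_val, y_pred),
        "recall":    recall_score(y_val, y_pred),
        "f1":        f1_score(y_val, y_pred),
        "misclassified": int((y_pred != y_val).sum()),
        "confusion_matrix": confusion_matrix(y_val, y_pred),
    }

# Stand-in data and model (NOT the project's dataset).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
metrics = evaluate(LogisticRegression(max_iter=1000).fit(X_tr, y_tr), X_va, y_va)
print({k: v for k, v in metrics.items() if k != "confusion_matrix"})
```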

5. Results

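5.1 Validation Accuracy

[Figure: Validation Accuracy (results/01_validation_accuracy.png)]

5.2 Validation Loss

[Figure: Validation Loss (results/02_validation_loss.png)]

5.3 Confusion Matrices

[Figure: Confusion Matrices (results/03_confusion_matrices.png)]

5.4 ROC Curves

[Figure: ROC Curves (results/04_roc_curves.png)]

5.5 Misclassification Graph

[Figure: Misclassification (results/05_misclassification.png)]

5.6 Visual Comparison Table

[Figure: Comparison Table (results/07_comparison_table.png)]

5.7 Radar Chart - Overall Model Profile

[Figure: Radar Chart (results/08_radar_chart.png)]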

6. Comparison Table

Sorted by Validation Accuracy (highest = best)

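| Rank | Model | Val Accuracy | Val Loss | ROC-AUC | Precision | Recall | F1-Score | Misclassified |
|------|-------|--------------|----------|---------|-----------|--------|----------|---------------|
| 1st | Logistic Regression | 0.8000 | 0.4381 | 0.8788 | 0.8000 | 0.7835 | 0.7917 | 40 |
| 2nd | Neural Network (MLP) | 0.7900 | 0.4503 | 0.8757 | 0.7957 | 0.7629 | 0.7789 | 42 |
| 3rd | Random Forest | 0.7600 | 0.5120 | 0.8354 | 0.7753 | 0.7113 | 0.7419 | 48 |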
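As a sanity check on how the columns relate, the Logistic Regression row can be reproduced from a single set of confusion-matrix counts. The counts below were derived from the reported metrics, not read from the project's output, so treat them as one consistent reconstruction.

```python
# Candidate confusion-matrix counts for the Logistic Regression row
# (derived from the reported metrics, NOT taken from the project's output).
tp, fp, fn, tn = 76, 19, 21, 84

total     = tp + fp + fn + tn
errors    = fp + fn
accuracy  = (tp + tn) / total
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(total, errors)   # validation rows, misclassified rows
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

These counts give 200 validation rows, 40 errors, and accuracy, precision, recall, and F1 of 0.8000, 0.8000, 0.7835, and 0.7917, matching the table.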

Best Model: Logistic Regression

| Criterion | Winner |
|-----------|--------|
| Highest Validation Accuracy | Logistic Regression (80.00%) |
| Lowest Validation Loss | Logistic Regression (0.4381) |
| Highest ROC-AUC | Logistic Regression (0.8788) |
| Fewest Misclassifications | Logistic Regression (40 / 200) |
| Best F1-Score | Logistic Regression (0.7917) |

Interpretation

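  • Logistic Regression ranks first on every metric in the table, although its lead over the MLP is narrow (40 vs 42 misclassified out of 200). The labels were generated by thresholding a weighted linear score plus noise, so the true decision boundary is linear, which favours a linear classifier.
  • Neural Network (MLP) is a close second. The 3-hidden-layer architecture can model non-linearities, but the generating function has none to exploit, so the extra capacity brings no gain here.
  • Random Forest has the highest misclassification count (48) and the lowest AUC (0.8354). This may indicate mild over-fitting despite the max_depth constraint; axis-aligned tree splits also approximate a linear boundary only coarsely. Training accuracy is not reported, so over-fitting cannot be confirmed from this table alone.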
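A simple way to test for over-fitting is to compare training and validation accuracy for the same fitted model. The sketch below does this for a Random Forest with the same n_estimators and max_depth, on stand-in noisy data (not the project's dataset), so the printed numbers say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in noisy data (NOT the project's dataset); flip_y adds label noise.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           flip_y=0.15, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, max_depth=12, random_state=42)
rf.fit(X_tr, y_tr)

train_acc = rf.score(X_tr, y_tr)   # accuracy on data the model has seen
val_acc   = rf.score(X_va, y_va)   # accuracy on held-out data
print(f"train={train_acc:.3f}  val={val_acc:.3f}  gap={train_acc - val_acc:.3f}")
```

A large gap between the two scores points to over-fitting; a small gap with low accuracy on both points instead to a model that cannot represent the decision boundary well.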

7. Project Structure

Lab Assesment/
|-- data/
|   `-- student_performance.csv   <- Generated dataset (1,000 rows x 11 cols)
|-- results/
|   |-- 01_validation_accuracy.png
|   |-- 02_validation_loss.png
|   |-- 03_confusion_matrices.png
|   |-- 04_roc_curves.png
|   |-- 05_misclassification.png
|   |-- 06_comparison_table.csv
|   |-- 07_comparison_table.png
|   `-- 08_radar_chart.png
|-- dataset_generator.py          <- Creates and saves the dataset
|-- train_models.py               <- Trains all 3 models + saves all graphs
`-- README.md                     <- This file

8. How to Run

# Step 1 - Generate the dataset
python dataset_generator.py

# Step 2 - Train models and produce all result graphs
python train_models.py

Dependencies: scikit-learn, pandas, numpy, matplotlib, seaborn

pip install scikit-learn pandas numpy matplotlib seaborn

CDS3005 - Foundations of Data Science | Lab Assessment | VIT
