OliverDahle/tabular-classification-model-comparison

KNN vs Logistic Regression on a 4-Class Tabular Dataset

This project compares k-Nearest Neighbors and Logistic Regression on a multiclass tabular classification task with numeric and categorical features, missing values, and mild class imbalance.

Methods

  • Leakage-safe preprocessing using scikit-learn pipelines
  • Median imputation and scaling for numeric variables
  • Categorical imputation and one-hot encoding
  • Stratified train-validation split
  • Repeated stratified cross-validation
  • Hyperparameter tuning with GridSearchCV
  • Evaluation using macro F1, balanced accuracy, and accuracy
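The steps listed above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the synthetic data, the column names (num_a, num_b, cat_a), the most-frequent strategy for categorical imputation, StandardScaler as the scaler, the fold counts, and the small KNN grid are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import (
    GridSearchCV,
    RepeatedStratifiedKFold,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the course data (hypothetical columns and sizes).
rng = np.random.default_rng(0)
n = 240
X = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat_a": rng.choice(["red", "green", "blue"], size=n),
})
# Four classes with mild imbalance.
y = rng.choice(4, size=n, p=[0.35, 0.30, 0.20, 0.15])
# Inject missing values into one numeric and one categorical column.
X.loc[rng.choice(n, size=20, replace=False), "num_a"] = np.nan
X.loc[rng.choice(n, size=20, replace=False), "cat_a"] = np.nan

# Stratified train-validation split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing lives inside the pipeline, so imputers, scaler, and encoder
# are refit on each training fold only (no leakage from held-out folds).
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # assumed strategy
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["num_a", "num_b"]),
    ("cat", categorical, ["cat_a"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", KNeighborsClassifier())])

# Repeated stratified CV with a grid search, refit on macro F1.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=0)
search = GridSearchCV(
    model,
    param_grid={"clf__n_neighbors": [3, 7, 15]},  # hypothetical grid
    scoring={
        "macro_f1": "f1_macro",
        "balanced_accuracy": "balanced_accuracy",
        "accuracy": "accuracy",
    },
    refit="macro_f1",
    cv=cv,
)
search.fit(X_train, y_train)

# Score the refit best pipeline on the held-out validation split.
y_pred = search.predict(X_val)
val_scores = {
    "macro_f1": f1_score(y_val, y_pred, average="macro"),
    "balanced_accuracy": balanced_accuracy_score(y_val, y_pred),
    "accuracy": accuracy_score(y_val, y_pred),
}
print(search.best_params_, val_scores)
```

Because the labels here are random, the printed scores carry no meaning; the point is only the structure: every preprocessing step is a pipeline stage, so GridSearchCV never sees validation-fold statistics during fitting.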

Main Result

The best model was Logistic Regression with interaction-only PolynomialFeatures on the numeric branch, using class balancing and regularization.
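As a rough sketch of what such a model could look like, the example below builds only the numeric branch on synthetic data. The step order (impute, then interactions, then scale), class_weight="balanced" as the form of class balancing, and the value C=1.0 are assumptions for illustration; the tuned settings from the notebook are not reproduced here, and the categorical branch is omitted for brevity.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic numeric data with four classes (hypothetical stand-in).
rng = np.random.default_rng(0)
X_num = rng.normal(size=(200, 3))
y = rng.integers(0, 4, size=200)
X_num[rng.choice(200, size=15, replace=False), 0] = np.nan

# Numeric branch: interaction_only=True adds pairwise products (a*b, a*c, b*c)
# but no squared terms; include_bias=False leaves the intercept to the model.
numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("interactions", PolynomialFeatures(
        degree=2, interaction_only=True, include_bias=False)),
    ("scale", StandardScaler()),
])

clf = Pipeline([
    ("numeric", numeric_branch),
    # C is the inverse regularization strength; smaller C regularizes more.
    ("logreg", LogisticRegression(
        C=1.0, class_weight="balanced", max_iter=1000)),
])
clf.fit(X_num, y)

# Three input columns expand to 3 originals + 3 pairwise products.
n_expanded = (
    clf.named_steps["numeric"].named_steps["interactions"].n_output_features_
)
print(n_expanded)
```

In a grid search, C would typically be tuned through a parameter name such as logreg__C, in the same way as the KNN neighbor count.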

Data

The dataset is not included because it was provided for coursework. The notebook expects local files with the same structure as the original course data.

Reproducibility

Install dependencies:

pip install -r requirements.txt

License

This project is licensed under the MIT License.

The license applies only to the code and documentation in this repository. The original coursework dataset is not included and is not covered by this license.
