This project compares k-Nearest Neighbors and Logistic Regression on a multiclass tabular classification task with numeric and categorical features, missing values, and mild class imbalance. The workflow covers:
- Leakage-safe preprocessing using scikit-learn pipelines
- Median imputation and scaling for numeric variables
- Categorical imputation and one-hot encoding
- Stratified train-validation split
- Repeated stratified cross-validation
- Hyperparameter tuning with GridSearchCV
- Evaluation using macro F1, balanced accuracy, and accuracy
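
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names, the synthetic data, the `most_frequent` strategy for categorical imputation, the fold counts, and the `n_neighbors` grid are all assumptions made for the example. Keeping imputation, scaling, and encoding inside the pipeline means they are refit on each training fold, which is what keeps cross-validation leakage-safe.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import (
    GridSearchCV,
    RepeatedStratifiedKFold,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical; the coursework dataset is not included).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat_a": rng.choice(["x", "y", "z"], size=n),
})
X["cat_a"] = X["cat_a"].astype(object)
y = rng.choice([0, 1, 2], size=n, p=[0.5, 0.3, 0.2])  # mild class imbalance

# Inject some missing values into one numeric and one categorical column.
X.loc[rng.choice(n, size=20, replace=False), "num_a"] = np.nan
X.loc[rng.choice(n, size=20, replace=False), "cat_a"] = np.nan

# Numeric branch: median imputation, then scaling.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: imputation (strategy assumed), then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["num_a", "num_b"]),
    ("cat", categorical, ["cat_a"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", KNeighborsClassifier())])

# Stratified train-validation split preserves class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Repeated stratified CV drives the grid search; macro F1 is the selection metric.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=0)
search = GridSearchCV(
    model,
    param_grid={"clf__n_neighbors": [3, 7, 15]},
    scoring="f1_macro",
    cv=cv,
)
search.fit(X_train, y_train)
print(search.best_params_)
```

The validation split is held out from the search entirely, so it can be used afterwards for a final check of the selected model.
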
The best model was Logistic Regression with interaction-only PolynomialFeatures on the numeric branch, using class balancing and regularization.
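
A model of this shape might look like the sketch below. It shows only a numeric branch on synthetic data; the step order (scaling before the interaction terms), `C=1.0`, and `max_iter=1000` are assumptions for illustration rather than the tuned values from the notebook. The three reported metrics are computed on a stratified validation split.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, mildly imbalanced 3-class data (hypothetical stand-in).
X, y = make_classification(
    n_samples=400,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    weights=[0.5, 0.3, 0.2],
    random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Numeric branch: impute, scale, then add pairwise interaction terms only
# (no squared terms, no bias column).
numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("interactions", PolynomialFeatures(
        degree=2, interaction_only=True, include_bias=False
    )),
])

# class_weight="balanced" reweights classes inversely to their frequency;
# C is the inverse regularization strength (value assumed here).
clf = Pipeline([
    ("num", numeric_branch),
    ("logreg", LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)),
])
clf.fit(X_train, y_train)

pred = clf.predict(X_val)
scores = {
    "macro_f1": f1_score(y_val, pred, average="macro"),
    "balanced_accuracy": balanced_accuracy_score(y_val, pred),
    "accuracy": accuracy_score(y_val, pred),
}
print(scores)
```

With 4 input features, `interaction_only=True` yields the 4 original columns plus 6 pairwise products, so the classifier sees 10 features per sample.
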
The dataset is not included because it was provided for coursework. The notebook expects local files with the same structure as the original course data.
Install dependencies:

    pip install -r requirements.txt

This project is licensed under the MIT License.
The license applies only to the code and documentation in this repository. The original coursework dataset is not included and is not covered by this license.