This project explores the Forest CoverType dataset to predict the type of forest cover from cartographic variables such as elevation, aspect, slope, soil type, and more.
It’s a multi-class classification problem with 7 cover type classes. Using ensemble methods, this project investigates how the handling of class imbalance and the choice of model affect predictive performance.
- Source: Forest CoverType Dataset – Kaggle
- Classes: 7 types of forest cover.
- Characteristics:
  - Categorical variables are already one-hot encoded, so no raw categorical columns remain.
  - No duplicates or missing values.
  - Imbalanced classes: types 1 and 2 are the majority classes, while types 4 and 5 are underrepresented.
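The characteristics listed above can be checked with a few pandas calls. The following is only a minimal sketch on a tiny synthetic table: the column names (`Elevation`, `Slope`, `Soil_Type1`..`Soil_Type3`, `Cover_Type`) and the three-class setup are assumptions made for illustration, not the project's actual schema or loading code.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
soil = rng.integers(0, 3, size=n)  # index of the active one-hot soil column
df = pd.DataFrame({
    # A permutation guarantees unique elevations, hence no duplicate rows.
    "Elevation": rng.permutation(np.arange(1800, 1800 + n)),
    "Slope": rng.integers(0, 60, size=n),
    "Soil_Type1": (soil == 0).astype(int),
    "Soil_Type2": (soil == 1).astype(int),
    "Soil_Type3": (soil == 2).astype(int),
    "Cover_Type": rng.choice([1, 2, 3], size=n, p=[0.45, 0.45, 0.10]),
})

# Cleanliness checks: total missing cells and fully duplicated rows.
n_missing = int(df.isna().sum().sum())
n_duplicates = int(df.duplicated().sum())

# One-hot checks: every column numeric, soil columns binary, one active per row.
soil_cols = [c for c in df.columns if c.startswith("Soil_Type")]
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)
is_binary = bool(df[soil_cols].isin([0, 1]).all().all())
one_active = bool((df[soil_cols].sum(axis=1) == 1).all())

# Class balance: share of rows per cover type, sorted by class label.
class_share = df["Cover_Type"].value_counts(normalize=True).sort_index()

print(f"missing={n_missing}, duplicates={n_duplicates}")
print(f"numeric={all_numeric}, binary={is_binary}, one_active={one_active}")
print(class_share)
```

On the real dataset, a strongly uneven `class_share` would be the signal that the imbalance needs handling.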
Data Preprocessing
- Verified dataset cleanliness (no nulls, no duplicates).
- Confirmed one-hot encoding of categorical variables.
- Addressed class imbalance using SMOTE oversampling.
Modeling
- Random Forest Classifier
  - Evaluated with a confusion matrix, classification report, and feature importance visualization.
  - Validated with stratified cross-validation.
- XGBoost Classifier
  - Applied the same evaluation (confusion matrix, classification report, feature importances).
  - Validated with stratified cross-validation.
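The stratified cross-validation and feature importance steps described for the models might look roughly like the sketch below. It uses synthetic data and a Random Forest only. The fold count, `n_estimators`, and the `f1_macro` scoring are assumptions, since the averaging behind the reported F1 is not stated. The same loop would presumably apply to XGBoost's scikit-learn-style `XGBClassifier`, which is left out so that the sketch depends only on scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced 3-class data standing in for the real features.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.5, 0.35, 0.15],
    random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Stratified folds keep the class proportions roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Macro-averaged F1 weights each class equally (assumed averaging).
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
mean_f1 = float(scores.mean())
print(f"fold F1 scores: {np.round(scores, 3)}, mean: {mean_f1:.3f}")

# Fit once on all data to inspect impurity-based feature importances.
model.fit(X, y)
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
for idx in ranking[:5]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

If oversampling is combined with cross-validation, it is generally safer to resample inside each training fold (for example through an `imblearn.pipeline.Pipeline`) so that synthetic points do not reach the validation folds.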
Evaluation Metrics
- Confusion Matrix
- Classification Report (Precision, Recall, F1-score)
- Stratified Cross-Validation
- Random Forest delivered the strongest performance with an average CV F1 of 0.93, making it the stronger of the two models on this dataset.
- XGBoost performed reasonably well with an average CV F1 of 0.8855, about 0.04 below Random Forest.
- Feature importance analysis highlighted which environmental and soil-related factors most influenced forest cover classification.
- Class imbalance had to be addressed for a fairer evaluation, and SMOTE oversampling proved useful for this.
- Random Forest outperformed XGBoost on this dataset, which suggests it is robust to the imbalanced, largely one-hot-encoded feature set used here.
- Feature importance visualization provided ecological insights into which variables play the largest role in forest cover classification.
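For reference, the confusion matrix and classification report used in the evaluation can be produced with scikit-learn as in the following sketch. The labels and predictions are a small hand-made example with three classes, not output from the project's models.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Hand-made true labels and predictions for three hypothetical cover types.
y_true = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 2, 1, 3, 3, 2]
labels = [1, 2, 3]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-class precision, recall, and F1, plus macro and weighted averages.
report = classification_report(y_true, y_pred, labels=labels, digits=3)
print(report)

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")
print(f"macro F1: {macro_f1:.4f}")
```

Reading the matrix row by row shows which classes are confused with each other, which is particularly informative for the underrepresented cover types.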