This project was carried out as part of my program at Γcole Polytechnique and HEC Paris.
We had 2 days to build a machine learning model capable of predicting the forest cover type of a given 30x30m area based on cartographic variables.
The task was framed as a multiclass classification problem (7 classes).
Our final model achieved 83% accuracy on the professorβs private test set and up to 87% validation accuracy during training.
The goal of the challenge was:
Predict the forest cover type (a categorical variable) from various cartographic variables.
The metric used to evaluate performance is accuracy on a hidden test set.
The dataset was provided by the US Forest Service (USFS) and US Geological Survey (USGS).
It contains only cartographic features (no remotely sensed data).
Forest cover type was determined from USFS Region 2 Resource Information System (RIS) data.
- Each observation corresponds to a 30m x 30m cell in Roosevelt National Forest (Colorado, USA).
- The study area covers four wilderness areas: Neota, Rawah, Comanche Peak, and Cache la Poudre.
- Features include:
- Elevation, Slope, Aspect
- Distances to hydrology, roadways, fire points
- Hillshade indices (9am, Noon, 3pm)
- 4 binary wilderness area indicators
- 40 binary soil type indicators
- Target: Forest Cover Type (1β7)
- 1 = Spruce/Fir
- 2 = Lodgepole Pine
- 3 = Ponderosa Pine
- 4 = Cottonwood/Willow
- 5 = Aspen
- 6 = Douglas-fir
- 7 = Krummholz
We used Python (pandas, numpy, scikit-learn, LightGBM) to preprocess and model the data.
- Feature Engineering
- Aspect transformed into sine/cosine
- Euclidean hydrology distance
- Road/Fire/Hydro combinations
- Elevation interactions
- Hillshade averages and ranges
- Encoding Soil and Wilderness as categorical variables
- Modeling
- LightGBM (LGBMClassifier) with early stopping
- Hyperparameters tuned for regularization (num_leaves, min_child_samples, subsample, colsample_bytree, etc.)
- Used early stopping to find optimal iterations
- Final Training
- Retrained on the entire dataset with the best iteration
- Exported predictions in CSV and ZIP format for submission
- Validation Accuracy: ~87%
- Private Test Accuracy (professorβs set): 83%
- Best iteration (early stopping): ~2600β3000 rounds
train.csvβ training datasettest-full.csvβ test dataset (no labels)submission.csvβ final predictionsGBBoost.ipynbβ main notebook with training and evaluationexplorations.ipynbβ exploratory data analysis
pip install -r requirements.txt
python train_and_predict.py