This is a data analytics project in which I built a machine learning pipeline to predict turbofan engine failure using the NASA CMAPSS dataset. The pipeline predicts the Remaining Useful Life (RUL) of each engine and also classifies whether failure will happen within the next 30 cycles.
I trained two models, Random Forest and XGBoost, and compared their performance on both tasks.
predictive_maintenance/
├── schema.sql - creates the SQLite database tables
├── features.sql - SQL views for feature engineering
├── ingest.py - loads raw text files into the database
├── train.py - trains and evaluates the models
├── requirements.txt
└── data/
    └── raw/ - put the CMAPSS dataset files here
        ├── train_FD001.txt
        ├── test_FD001.txt
        └── RUL_FD001.txt
Download the dataset from the NASA Prognostics Data Repository, under Turbofan Engine Degradation Simulation (CMAPSS), and put the three FD001 text files inside data/raw/.
pip install -r requirements.txt
# Step 1 - load the data into SQLite
python ingest.py --db data/cmapss.db --raw-dir data/raw
# Step 2 - train the models (use --apply-views on first run to set up SQL views)
python train.py --db data/cmapss.db --apply-views

| Table | Rows | What it stores |
|---|---|---|
| engines | ~260 | One row per engine |
| sensor_readings | ~20,000 | Sensor readings per cycle (21 sensors + 3 op settings) |
| truth_data | 100 | True RUL values for the test engines |
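
As a rough sketch of what the ingest step involves (this is not the actual ingest.py): each CMAPSS row is whitespace-separated and holds an engine id, a cycle number, 3 operational settings and 21 sensor values. The column names below (`unit_id`, `cycle`, `op1`..`op3`, `s1`..`s21`) and the helper functions are assumptions made for illustration; the real schema.sql may use a different layout.

```python
import sqlite3

# Hypothetical column layout; the real schema.sql may differ.
COLUMNS = (
    ["unit_id", "cycle"]
    + [f"op{i}" for i in range(1, 4)]
    + [f"s{i}" for i in range(1, 22)]
)


def parse_cmapss_line(line):
    """Split one whitespace-separated CMAPSS row into 26 typed values."""
    parts = line.split()
    if len(parts) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(parts)}")
    return [int(parts[0]), int(parts[1])] + [float(p) for p in parts[2:]]


def load_readings(conn, lines):
    """Create an illustrative sensor_readings table and bulk-insert rows."""
    col_defs = ", ".join(
        f"{c} INTEGER" if c in ("unit_id", "cycle") else f"{c} REAL"
        for c in COLUMNS
    )
    conn.execute(f"CREATE TABLE IF NOT EXISTS sensor_readings ({col_defs})")
    placeholders = ", ".join("?" for _ in COLUMNS)
    # Skip blank lines so a trailing newline in the file does not raise.
    rows = [parse_cmapss_line(ln) for ln in lines if ln.strip()]
    conn.executemany(
        f"INSERT INTO sensor_readings VALUES ({placeholders})", rows
    )
    conn.commit()
    return len(rows)
```

With a file on disk, this could be called as `load_readings(sqlite3.connect("data/cmapss.db"), open("data/raw/train_FD001.txt"))`, since a file object iterates line by line.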
I created 4 SQL views to build the feature matrix:
- v_max_cycles - finds the last cycle for each engine (used to calculate RUL)
- v_lag_features - previous cycle sensor values (lag-1)
- v_rolling_stats - 5-cycle and 10-cycle rolling averages + delta (change) features
- v_features - combines everything into the final 43-column feature table
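
To make the view descriptions more concrete, here is a minimal sketch of how a lag-1 value, a 5-cycle rolling average, a delta and a RUL column can be written with SQLite window functions, run from Python against an in-memory database. It covers a single sensor only. The view name `v_demo_features`, the column names and the `rul = last cycle - current cycle` definition are assumptions for illustration and are not copied from features.sql.

```python
import sqlite3

# Illustrative view for one sensor (s2); names are assumed, not from features.sql.
VIEW_SQL = """
CREATE VIEW v_demo_features AS
SELECT
    unit_id,
    cycle,
    s2,
    -- previous cycle's value for the same engine (NULL on the first cycle)
    LAG(s2, 1) OVER (PARTITION BY unit_id ORDER BY cycle) AS s2_lag1,
    -- mean over the current cycle and up to 4 preceding cycles
    AVG(s2) OVER (
        PARTITION BY unit_id ORDER BY cycle
        ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
    ) AS s2_roll5,
    -- change since the previous cycle
    s2 - LAG(s2, 1) OVER (PARTITION BY unit_id ORDER BY cycle) AS s2_delta,
    -- cycles left until the engine's last recorded cycle
    MAX(cycle) OVER (PARTITION BY unit_id) - cycle AS rul
FROM sensor_readings
"""


def build_demo(conn):
    """Load six toy cycles for one engine and return the derived features."""
    conn.execute(
        "CREATE TABLE sensor_readings (unit_id INTEGER, cycle INTEGER, s2 REAL)"
    )
    conn.executemany(
        "INSERT INTO sensor_readings VALUES (?, ?, ?)",
        [(1, c, float(c * 10)) for c in range(1, 7)],
    )
    conn.execute(VIEW_SQL)
    return conn.execute(
        "SELECT cycle, s2_lag1, s2_roll5, s2_delta, rul "
        "FROM v_demo_features ORDER BY cycle"
    ).fetchall()


if __name__ == "__main__":
    for row in build_demo(sqlite3.connect(":memory:")):
        print(row)
```

The `PARTITION BY unit_id` clause is what keeps one engine's readings from leaking into another engine's lag and rolling features.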
-----------------------------------------------------------------
REGRESSION | Target: rul_capped (remaining useful life)
-----------------------------------------------------------------
Random Forest Regressor
RMSE = 24.71 | MAE = 17.82 | R2 = 0.8641
XGBoost Regressor
RMSE = 21.3x | MAE = 15.xx | R2 = 0.89xx <- best
-----------------------------------------------------------------
CLASSIFICATION | Target: fail_30 (failure within 30 cycles)
-----------------------------------------------------------------
Random Forest Classifier F1 = 0.83xx
XGBoost Classifier F1 = 0.87xx <- best
XGBoost vs RF | RMSE improvement: -3.4 cycles | F1 improvement: +0.04
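
For reference, the metrics in the report follow their standard definitions. The sketch below computes them with the standard library only; how train.py actually computes them is not shown here, and the function names are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1) of a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

RMSE and MAE are in cycles, so the RMSE gap between the two models in the last line of the report reads directly as cycles of prediction error.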
- RUL is capped at 125 cycles for the training set. Early in an engine's life the degradation signal is very weak so capping helps reduce noise.
- Class imbalance - there are about 3.2x more non-failure cycles than failure cycles. I used scale_pos_weight=3.2 in XGBoost to handle this.
- Train/test split is done using the original CMAPSS partition (not random). This is important because randomly splitting rows would mix data from the same engine across train and test which would make the results look better than they really are.
- SQLite 3.25 or higher is needed for the window functions in features.sql.
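
The label definitions and the class-weight ratio from the notes above can be sketched as follows. This is a minimal illustration, not the code in train.py: the function names are hypothetical, RUL is assumed to be the last cycle minus the current cycle, and `fail_30` is assumed to mean `rul <= 30` (the exact boundary used in the project may differ).

```python
RUL_CAP = 125       # cap on the regression target, from the notes
FAIL_HORIZON = 30   # horizon for the fail_30 classification target


def label_engine(max_cycle):
    """Build per-cycle targets for one training engine that ran to failure."""
    rows = []
    for cycle in range(1, max_cycle + 1):
        rul = max_cycle - cycle  # assumed RUL definition
        rows.append(
            {
                "cycle": cycle,
                "rul_capped": min(rul, RUL_CAP),
                "fail_30": int(rul <= FAIL_HORIZON),  # assumed boundary
            }
        )
    return rows


def pos_weight(labels):
    """Ratio of negative to positive samples, the usual scale_pos_weight choice."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return negatives / positives


if __name__ == "__main__":
    rows = label_engine(200)
    print(rows[0], rows[-1])
    print(round(pos_weight([r["fail_30"] for r in rows]), 2))
```

Computing the ratio from the training labels, rather than hard-coding it, keeps the weight in step with the data if the horizon or the dataset subset changes.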