This project implements a machine learning solution to flag fraudulent transactions under the constraints of the business problem: only 400 transactions can be reviewed per month, and the goal is to maximise the total fraud value prevented.
- Time-based train-test split to simulate production conditions (last month used as test data)
- Custom feature engineering for fraud detection
- Machine learning model (XGBoost) for fraud prediction
- Rule-based fraud detection for comparison
- Cross-validation for robust model evaluation
- Performance evaluation using a custom fraud capture score
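The time-based split listed above can be sketched as follows (the column names here are hypothetical; the real dataset uses obfuscated fields):

```python
import pandas as pd

# Hypothetical column names; the real dataset uses obfuscated fields.
df = pd.DataFrame({
    "transactionTime": pd.to_datetime([
        "2017-10-05", "2017-11-12", "2017-12-03", "2017-12-20"
    ]),
    "amount": [120.0, 55.5, 980.0, 13.2],
})

# Hold out the last calendar month as the test set to mimic production,
# where the model scores transactions it has never seen in time.
last_month = df["transactionTime"].dt.to_period("M").max()
is_test = df["transactionTime"].dt.to_period("M") == last_month
train, test = df[~is_test], df[is_test]
```

Splitting by time rather than at random avoids leaking future behaviour into training, which matters for fraud patterns that drift month to month.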
├── data/
│ ├── transactions_obf.csv
│ └── labels_obf.csv
├── src/
│ ├── main.py
│ ├── data_loader.py
│ ├── feature_engineering.py
│ ├── model_pipeline.py
│ └── evaluation.py
└── README.md
Install the required packages using requirements.txt:
pip install -r requirements.txt
Alternatively, if there are any dependency clashes, the following should work:
pip install pandas numpy scikit-learn xgboost imbalanced-learn joblib
Ensure the data files are in the data/ directory.
Run the main script:
python src/main.py
This will:
- Load and preprocess the data
- Perform cross-validation
- Train the final model
- Evaluate the model on the test set
- Compare the ML model with a rule-based approach
- Save the trained model
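A minimal rule-based baseline of the kind the ML model is compared against might simply flag the highest-value transactions up to the monthly review budget (the column name and rule are hypothetical; the project's actual rules live in the source):

```python
import pandas as pd

def rule_based_flags(df: pd.DataFrame, budget: int = 400) -> pd.Index:
    """Flag the `budget` highest-value transactions for manual review.

    A simple value-ranking rule; the project's actual rule set may differ.
    """
    return df["amount"].nlargest(budget).index

df = pd.DataFrame({"amount": [10.0, 500.0, 30.0, 950.0, 5.0]})
flagged = rule_based_flags(df, budget=2)
# With a budget of 2, the two largest transactions (950 and 500) are flagged.
```

A baseline like this is useful because it is cheap, interpretable, and sets the floor the ML model has to beat under the same 400-review constraint.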
The machine learning model significantly outperforms the rule-based approach:
- ML Model Fraud Capture Score: 90.52%
- Rule-Based Model Fraud Capture Score: 63.93%
- Improvement: 26.59 percentage points
Cross-validation results:
- ML Model Average Fraud Capture Score: 88.36% (+/- 3.89%)
- Rule-Based Model Average Fraud Capture Score: 77.51% (+/- 8.16%)
- Average Improvement: 10.86 percentage points
- DataLoader: Loads transaction and label data
- DataPreprocessor: Performs initial data preprocessing
- CustomFeatureEngineer: Implements domain-specific feature engineering
- FraudDetectionPipeline: Combines feature engineering, preprocessing, and model training
- evaluate_models_with_cv: Performs cross-validation for model evaluation
- fraud_capture_score: Custom metric for evaluating model performance
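The fraud capture score can be read as the share of total fraud value caught within the 400-transaction review budget. A plausible sketch of such a metric (the exact definition lives in src/evaluation.py):

```python
import numpy as np

def fraud_capture_score(amounts, is_fraud, scores, budget=400):
    """Fraction of total fraudulent value among the `budget` transactions
    ranked highest by model score. Hypothetical implementation; the
    project's actual metric is defined in src/evaluation.py."""
    amounts = np.asarray(amounts, dtype=float)
    is_fraud = np.asarray(is_fraud, dtype=bool)
    # Review only the top-scoring transactions, up to the monthly budget.
    order = np.argsort(scores)[::-1][:budget]
    caught = amounts[order][is_fraud[order]].sum()
    total = amounts[is_fraud].sum()
    return caught / total if total > 0 else 0.0
```

Ranking by score and truncating at the budget mirrors the operational constraint: analysts can only inspect 400 transactions, so value caught inside that window is what counts.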
- Feature importance analysis for better understanding of the model
- Hyperparameter tuning for potentially improved performance
- Exploration of other machine learning algorithms