diff --git a/README.md b/README.md index d541f97..151f387 100644 --- a/README.md +++ b/README.md @@ -19,6 +19,7 @@ Feel free to explore the topics and use this as a reference for software enginee ### Machine Learning & Data Science - [Croissant Metadata](./croissant-metadata.md) - Understanding the standardized metadata format for ML datasets, its components, advantages, and why it was created. +- [ML Model Reproducibility](./ml-model-reproducibility.md) - Comprehensive guide to ensuring reproducible machine learning experiments and models ### Coding best practices diff --git a/ml-model-reproducibility.md b/ml-model-reproducibility.md new file mode 100644 index 0000000..9da0829 --- /dev/null +++ b/ml-model-reproducibility.md @@ -0,0 +1,635 @@ +# Reproducibility in ML Models: A Comprehensive Guide + +## What is ML Model Reproducibility? + +**ML Model Reproducibility** refers to the ability to obtain consistent and identical results when training, validating, and deploying machine learning models. This means that given the same data, code, and environment, a model should produce the same outputs every time it's run. + +## Why is Reproducibility Important? + +### Benefits of Reproducible ML: + +1. **Scientific Validity**: Enables verification and validation of research results +2. **Debugging**: Makes it easier to identify and fix issues in model development +3. **Collaboration**: Teams can work together effectively on the same models +4. **Regulatory Compliance**: Many industries require reproducible results for auditing +5. **Model Comparison**: Fair evaluation of different approaches and improvements +6. **Production Reliability**: Consistent behavior in deployed systems + +### Problems with Non-Reproducible Models: + +1. **Research Credibility**: Results that can't be reproduced undermine scientific validity +2. **Production Issues**: Inconsistent model behavior can cause system failures +3. **Wasted Resources**: Time and compute spent on irreproducible experiments +4. **Compliance Risks**: Regulatory requirements may not be met + +## Key Components of ML Reproducibility + +### 1. Random Seed Management + +Random seeds control the initialization of random processes in machine learning: + +```python +import numpy as np +import torch +import random +import tensorflow as tf + +def set_reproducible_seeds(seed=42): + """Set seeds for all random number generators""" + # Python's built-in random module + random.seed(seed) + + # NumPy + np.random.seed(seed) + + # PyTorch + torch.manual_seed(seed) + torch.cuda.manual_seed(seed) + torch.cuda.manual_seed_all(seed) # For multi-GPU + + # TensorFlow + tf.random.set_seed(seed) + + # Additional PyTorch settings for reproducibility + torch.backends.cudnn.deterministic = True + torch.backends.cudnn.benchmark = False + +# Usage +set_reproducible_seeds(42) +``` + +### 2. Environment Management + +Document and control the computational environment: + +```python +# requirements.txt example +torch==1.12.1 +numpy==1.21.5 +pandas==1.4.2 +scikit-learn==1.1.1 +matplotlib==3.5.2 + +# Or using conda environment.yml +# name: ml-project +# channels: +# - conda-forge +# - pytorch +# dependencies: +# - python=3.9 +# - pytorch=1.12.1 +# - numpy=1.21.5 +# - pandas=1.4.2 +``` + +### 3. Data Version Control + +Track and version your datasets: + +```python +import hashlib +import pandas as pd + +def calculate_data_hash(dataframe): + """Calculate hash of dataset for versioning""" + return hashlib.md5( + pd.util.hash_pandas_object(dataframe, index=True).values + ).hexdigest() + +def log_data_info(dataframe, dataset_name): + """Log important dataset information""" + data_hash = calculate_data_hash(dataframe) + + info = { + 'dataset_name': dataset_name, + 'shape': dataframe.shape, + 'columns': list(dataframe.columns), + 'data_hash': data_hash, + 'null_counts': dataframe.isnull().sum().to_dict(), + 'data_types': dataframe.dtypes.to_dict() + } + + print(f"Dataset Info for {dataset_name}:") + print(f"Shape: {info['shape']}") + print(f"Hash: {info['data_hash']}") + return info + +# Usage +df = pd.read_csv('data.csv') +data_info = log_data_info(df, 'training_data_v1') +``` + +### 4. Model Configuration Management + +Store all hyperparameters and model configurations: + +```python +import json +from datetime import datetime + +class ModelConfig: + def __init__(self): + self.config = { + 'model_type': 'RandomForest', + 'hyperparameters': { + 'n_estimators': 100, + 'max_depth': 10, + 'random_state': 42, + 'min_samples_split': 2 + }, + 'preprocessing': { + 'scaling': 'StandardScaler', + 'feature_selection': 'SelectKBest', + 'k_features': 20 + }, + 'training': { + 'validation_split': 0.2, + 'cross_validation_folds': 5, + 'stratify': True + }, + 'environment': { + 'python_version': '3.9.7', + 'sklearn_version': '1.1.1', + 'timestamp': datetime.now().isoformat() + } + } + + def save_config(self, filepath): + """Save configuration to JSON file""" + with open(filepath, 'w') as f: + json.dump(self.config, f, indent=2) + + def load_config(self, filepath): + """Load configuration from JSON file""" + with open(filepath, 'r') as f: + self.config = json.load(f) + return self.config + +# Usage +config = ModelConfig() +config.save_config('model_config_v1.json') +``` + +## Reproducible ML Pipeline Example + +### Complete Reproducible Training Pipeline + +```python +import pandas as pd +import numpy as np +from sklearn.model_selection import train_test_split +from sklearn.ensemble import RandomForestClassifier +from sklearn.preprocessing import StandardScaler +from sklearn.metrics import accuracy_score, classification_report +import joblib +import json +from datetime import datetime + +class ReproducibleMLPipeline: + def __init__(self, config_path=None, random_seed=42): + self.random_seed = random_seed + self.set_seeds() + + if config_path: + self.load_config(config_path) + else: + self.config = self.default_config() + + def set_seeds(self): + """Set all random seeds for reproducibility""" + np.random.seed(self.random_seed) + + def default_config(self): + return { + 'model_params': { + 'n_estimators': 100, + 'max_depth': 10, + 'random_state': self.random_seed + }, + 'preprocessing': { + 'test_size': 0.2, + 'validation_size': 0.2, + 'stratify': True + } + } + + def load_config(self, config_path): + with open(config_path, 'r') as f: + self.config = json.load(f) + + def save_config(self, config_path): + with open(config_path, 'w') as f: + json.dump(self.config, f, indent=2) + + def prepare_data(self, X, y): + """Reproducible data preparation""" + # Split data with fixed random state + X_temp, X_test, y_temp, y_test = train_test_split( + X, y, + test_size=self.config['preprocessing']['test_size'], + random_state=self.random_seed, + stratify=y if self.config['preprocessing']['stratify'] else None + ) + + X_train, X_val, y_train, y_val = train_test_split( + X_temp, y_temp, + test_size=self.config['preprocessing']['validation_size'], + random_state=self.random_seed, + stratify=y_temp if self.config['preprocessing']['stratify'] else None + ) + + # Fit scaler only on training data + self.scaler = StandardScaler() + X_train_scaled = self.scaler.fit_transform(X_train) + X_val_scaled = self.scaler.transform(X_val) + X_test_scaled = self.scaler.transform(X_test) + + return (X_train_scaled, X_val_scaled, X_test_scaled, + y_train, y_val, y_test) + + def train_model(self, X_train, y_train): + """Train model with reproducible parameters""" + self.model = RandomForestClassifier(**self.config['model_params']) + self.model.fit(X_train, y_train) + return self.model + + def evaluate_model(self, X_test, y_test): + """Evaluate model and return metrics""" + y_pred = self.model.predict(X_test) + accuracy = accuracy_score(y_test, y_pred) + report = classification_report(y_test, y_pred, output_dict=True) + + return { + 'accuracy': accuracy, + 'classification_report': report, + 'predictions': y_pred.tolist() + } + + def save_experiment(self, experiment_name, results, X_test_shape): + """Save complete experiment information""" + experiment_data = { + 'experiment_name': experiment_name, + 'timestamp': datetime.now().isoformat(), + 'random_seed': self.random_seed, + 'config': self.config, + 'results': results, + 'data_info': { + 'test_set_shape': X_test_shape + } + } + + # Save model + joblib.dump(self.model, f'{experiment_name}_model.pkl') + joblib.dump(self.scaler, f'{experiment_name}_scaler.pkl') + + # Save experiment metadata + with open(f'{experiment_name}_experiment.json', 'w') as f: + json.dump(experiment_data, f, indent=2) + + print(f"Experiment '{experiment_name}' saved successfully!") + return experiment_data + +# Usage example +def run_reproducible_experiment(): + # Load your data here + # X, y = load_data() + + pipeline = ReproducibleMLPipeline(random_seed=42) + + # Prepare data + X_train, X_val, X_test, y_train, y_val, y_test = pipeline.prepare_data(X, y) + + # Train model + model = pipeline.train_model(X_train, y_train) + + # Evaluate on validation set + val_results = pipeline.evaluate_model(X_val, y_val) + print(f"Validation Accuracy: {val_results['accuracy']:.4f}") + + # Final evaluation on test set + test_results = pipeline.evaluate_model(X_test, y_test) + print(f"Test Accuracy: {test_results['accuracy']:.4f}") + + # Save complete experiment + experiment_data = pipeline.save_experiment( + 'experiment_v1', + test_results, + X_test.shape + ) + + return experiment_data +``` + +## Best Practices for Reproducible ML + +### 1. Documentation and Logging + +```python +import logging +import sys + +def setup_experiment_logging(experiment_name): + """Setup comprehensive logging for experiments""" + logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', + handlers=[ + logging.FileHandler(f'{experiment_name}.log'), + logging.StreamHandler(sys.stdout) + ] + ) + + logger = logging.getLogger(experiment_name) + + # Log system information + logger.info(f"Python version: {sys.version}") + logger.info(f"Experiment: {experiment_name}") + + return logger + +# Usage +logger = setup_experiment_logging('ml_experiment_v1') +logger.info("Starting model training...") +``` + +### 2. Experiment Tracking + +```python +class ExperimentTracker: + def __init__(self): + self.experiments = [] + + def log_experiment(self, name, params, metrics, artifacts=None): + experiment = { + 'name': name, + 'timestamp': datetime.now().isoformat(), + 'parameters': params, + 'metrics': metrics, + 'artifacts': artifacts or [] + } + self.experiments.append(experiment) + + def save_tracking_data(self, filepath): + with open(filepath, 'w') as f: + json.dump(self.experiments, f, indent=2) + + def compare_experiments(self, metric_name): + """Compare experiments by a specific metric""" + results = [] + for exp in self.experiments: + if metric_name in exp['metrics']: + results.append({ + 'name': exp['name'], + 'timestamp': exp['timestamp'], + metric_name: exp['metrics'][metric_name] + }) + + return sorted(results, key=lambda x: x[metric_name], reverse=True) + +# Usage +tracker = ExperimentTracker() +tracker.log_experiment( + 'RandomForest_v1', + {'n_estimators': 100, 'max_depth': 10}, + {'accuracy': 0.85, 'f1_score': 0.82} +) +``` + +### 3. Docker for Environment Reproducibility + +```dockerfile +# Dockerfile for reproducible ML environment +FROM python:3.9-slim + +WORKDIR /app + +# Install system dependencies +RUN apt-get update && apt-get install -y \ + gcc \ + && rm -rf /var/lib/apt/lists/* + +# Copy requirements and install Python dependencies +COPY requirements.txt . +RUN pip install --no-cache-dir -r requirements.txt + +# Copy application code +COPY . . + +# Set environment variables for reproducibility +ENV PYTHONHASHSEED=0 +ENV CUBLAS_WORKSPACE_CONFIG=:4096:8 + +# Run the training script +CMD ["python", "train_model.py"] +``` + +## Validation and Testing + +### 1. Reproducibility Tests + +```python +import unittest +import numpy as np + +class TestReproducibility(unittest.TestCase): + def test_model_reproducibility(self): + """Test that model produces same results with same seed""" + # Train model twice with same parameters + pipeline1 = ReproducibleMLPipeline(random_seed=42) + pipeline2 = ReproducibleMLPipeline(random_seed=42) + + # Use same data + X_train, X_val, X_test, y_train, y_val, y_test = pipeline1.prepare_data(X, y) + + model1 = pipeline1.train_model(X_train, y_train) + model2 = pipeline2.train_model(X_train, y_train) + + # Predictions should be identical + pred1 = model1.predict(X_test) + pred2 = model2.predict(X_test) + + np.testing.assert_array_equal(pred1, pred2) + + def test_data_preprocessing_reproducibility(self): + """Test that data preprocessing is reproducible""" + pipeline1 = ReproducibleMLPipeline(random_seed=42) + pipeline2 = ReproducibleMLPipeline(random_seed=42) + + # Same splits should be generated + splits1 = pipeline1.prepare_data(X, y) + splits2 = pipeline2.prepare_data(X, y) + + for i in range(len(splits1)): + if hasattr(splits1[i], 'shape'): # For arrays + np.testing.assert_array_equal(splits1[i], splits2[i]) + +if __name__ == '__main__': + unittest.main() +``` + +## Common Reproducibility Challenges + +### 1. Hardware Differences + +- **Problem**: Different GPUs, CPUs can produce different results +- **Solution**: Document hardware specifications, use deterministic algorithms + +```python +# For PyTorch +torch.use_deterministic_algorithms(True) +torch.backends.cudnn.deterministic = True +torch.backends.cudnn.benchmark = False + +# Log hardware info +import torch +print(f"CUDA available: {torch.cuda.is_available()}") +print(f"CUDA device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}") +``` + +### 2. Dependency Version Conflicts + +- **Problem**: Different package versions can change behavior +- **Solution**: Pin exact versions, use virtual environments + +```bash +# Create reproducible environment +pip freeze > requirements.txt + +# Or with conda +conda env export > environment.yml +``` + +### 3. Data Loading Order + +- **Problem**: Non-deterministic data loading can affect training +- **Solution**: Sort data, use fixed random seeds for shuffling + +```python +# Ensure deterministic data loading +def load_data_reproducibly(filepath, shuffle=True, random_seed=42): + df = pd.read_csv(filepath) + + # Sort by a deterministic column (like ID) first + if 'id' in df.columns: + df = df.sort_values('id') + + # Then shuffle with fixed seed if needed + if shuffle: + df = df.sample(frac=1, random_state=random_seed).reset_index(drop=True) + + return df +``` + +## Tools for ML Reproducibility + +### 1. MLflow for Experiment Tracking + +```python +import mlflow +import mlflow.sklearn + +def train_with_mlflow(): + with mlflow.start_run(): + # Log parameters + mlflow.log_param("n_estimators", 100) + mlflow.log_param("max_depth", 10) + mlflow.log_param("random_seed", 42) + + # Train model + model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42) + model.fit(X_train, y_train) + + # Log metrics + accuracy = model.score(X_test, y_test) + mlflow.log_metric("accuracy", accuracy) + + # Log model + mlflow.sklearn.log_model(model, "model") +``` + +### 2. DVC for Data Version Control + +```bash +# Initialize DVC +dvc init + +# Track data files +dvc add data/train.csv +dvc add data/test.csv + +# Commit to git +git add data/train.csv.dvc data/test.csv.dvc .gitignore +git commit -m "Add data files to DVC" + +# Push data to remote storage +dvc push +``` + +### 3. Weights & Biases for Experiment Tracking + +```python +import wandb + +# Initialize experiment +wandb.init(project="ml-reproducibility", config={ + "n_estimators": 100, + "max_depth": 10, + "random_seed": 42 +}) + +# Log metrics during training +wandb.log({"accuracy": accuracy, "loss": loss}) + +# Save model +wandb.save("model.pkl") +``` + +## Checklist for Reproducible ML + +### Before Training: +- [ ] Set all random seeds (Python, NumPy, PyTorch/TensorFlow) +- [ ] Document environment (Python version, package versions) +- [ ] Version control your data +- [ ] Save model configuration +- [ ] Set up logging and experiment tracking + +### During Training: +- [ ] Use deterministic algorithms where possible +- [ ] Log all hyperparameters +- [ ] Save intermediate checkpoints +- [ ] Record training metrics +- [ ] Document any manual interventions + +### After Training: +- [ ] Save final model and preprocessing artifacts +- [ ] Document final metrics and evaluation results +- [ ] Test reproducibility with same seeds +- [ ] Create model documentation +- [ ] Archive complete experiment + +### For Deployment: +- [ ] Containerize environment (Docker) +- [ ] Version control inference code +- [ ] Monitor model performance drift +- [ ] Maintain experiment lineage + +## Conclusion + +Reproducibility in machine learning is not just a best practice—it's essential for scientific validity, reliable production systems, and effective collaboration. By implementing the strategies and tools outlined in this guide, you can ensure that your ML experiments are reproducible, your results are trustworthy, and your models behave consistently across different environments. + +Key takeaways: +- **Always set random seeds** for all libraries and processes +- **Version control everything**: data, code, environment, and models +- **Document thoroughly**: configurations, decisions, and results +- **Test reproducibility** regularly with automated tests +- **Use appropriate tools** like MLflow, DVC, and containerization + +Remember: A reproducible model today saves countless hours of debugging tomorrow! + +## References + +- [Reproducible Research in Computational Science](https://science.sciencemag.org/content/334/6060/1226) +- [PyTorch Reproducibility Guide](https://pytorch.org/docs/stable/notes/randomness.html) +- [TensorFlow Determinism Guide](https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism) +- [MLflow Documentation](https://mlflow.org/docs/latest/index.html) +- [DVC Documentation](https://dvc.org/doc) \ No newline at end of file