This repository provides an implementation of Shapley regression, a game-theoretic extension of logistic regression based on the Choquet integral. The framework is designed to model non-linear feature interactions while preserving the interpretability, efficiency, and statistical grounding of classical logistic regression.
Traditional logistic regression assumes additive feature effects, which limits its ability to model interactions between variables. Shapley regression removes this restriction: by building the predictor on the Choquet integral, it captures pairwise and higher-order feature interactions while preserving interpretability.
The framework is motivated by application settings in which datasets are typically small, heterogeneous, and noisy. In such regimes, standard linear models fail to represent meaningful interaction patterns, while deep learning approaches often suffer from instability and limited interpretability. Shapley regression bridges this gap by explicitly modeling interactions in a structured and transparent manner, while maintaining the convexity and statistical grounding of classical regression models.
At its core, Shapley regression replaces the linear predictor with a k-additive cooperative game. This formulation enables controlled interaction modeling with theoretical guarantees on model complexity and performance, allowing predictive power, robustness, and interpretability to be balanced through the choice of k and regularization.
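To make the idea of a k-additive predictor concrete, the sketch below evaluates a Choquet integral in its Möbius form, where each coalition of at most k features contributes its coefficient times the minimum feature value in the coalition (this form assumes non-negative inputs). It is an illustration written for this README, not the code in core/models/choquet.py; the function name `choquet_mobius` and the coefficient values are hypothetical.

```python
from itertools import combinations


def choquet_mobius(x, mobius, k=2):
    """Evaluate a k-additive Choquet integral in its Moebius form.

    x      : sequence of non-negative feature values
    mobius : dict mapping a sorted tuple of feature indices to a coefficient
    k      : largest coalition size allowed a non-zero coefficient
    """
    total = 0.0
    for size in range(1, k + 1):
        for subset in combinations(range(len(x)), size):
            # Each coalition contributes its coefficient times the smallest
            # feature value among its members.
            total += mobius.get(subset, 0.0) * min(x[i] for i in subset)
    return total


# Hypothetical coefficients for three features (illustrative values only).
m = {(0,): 0.5, (1,): 0.3, (2,): 0.1, (0, 1): 0.2, (1, 2): -0.1}
x = [0.4, 0.9, 0.6]

additive_only = choquet_mobius(x, m, k=1)  # singletons only: a linear predictor
with_pairs = choquet_mobius(x, m, k=2)     # adds the pairwise min-terms
print(additive_only, with_pairs)
```

With k=1 the expression reduces to an ordinary weighted sum, which is why the additive logistic model appears as a special case; in a regression setting the resulting score would then be passed through the logistic link.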
The repository provides both theoretical and empirical tools for analyzing these trade-offs, along with experimental pipelines demonstrating the applicability of Shapley regression to biomedical classification tasks and structured clinical data.
- Multiple Representation Bases: Supports Game, Möbius, and Shapley representations of the Choquet integral. These bases are mathematically equivalent but offer distinct interpretability and sparsity properties.
- K-additivity Analysis: Tools to study the trade-off between model complexity, interpretability, and predictive performance as a function of k-additivity.
- Robustness Testing: A framework for testing model robustness under Gaussian feature noise and bootstrap resampling.
- Visualization Tools: Dedicated plotting functions for each representation basis.
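The complexity side of the k-additivity trade-off can be quantified directly: a k-additive model over n features has one coefficient per non-empty coalition of size at most k. The short sketch below counts these coefficients; the helper name `n_interaction_params` is hypothetical, and the count excludes any intercept (whether the repository reports it the same way is an assumption).

```python
from math import comb


def n_interaction_params(n_features, k):
    """Number of coalitions of size 1..k among n_features features."""
    return sum(comb(n_features, size) for size in range(1, k + 1))


# Growth of the parameter count for a 4-feature problem.
for k in range(1, 5):
    print(f"k={k}: {n_interaction_params(4, k)} coefficients")
```

At k = n the count reaches 2^n - 1, which is why small values of k are attractive on small datasets.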
project/
├── core/  # Core Choquet and regression implementations
│   ├── models/
│   │   ├── choquet.py  # Choquet integral implementations
│   │   └── regression.py  # ChoquisticRegression model
│   └── __init__.py
│
├── paper_code/  # Code used for experiments in the paper
│   ├── APDS/  # APDS case study
│   │   ├── models.py  # Model definitions
│   │   ├── preprocess.py  # APDS data preprocessing
│   │   ├── run_apds_experiment.py  # Main experiment script
│   │   └── visualisation.py  # Pairwise interaction visualization
│   │
│   └── benchmark/  # Benchmark experiments
│       ├── bootstrap_and_noise_robustness/
│       │   └── robustness_k_add.py  # K-additivity analysis with noise/bootstrap
│       └── bounds/
│           ├── validate_effective_dimension.py
│           └── validate_stability_perturbation.py
│
├── utils/  # Shared utilities
│   ├── plotting.py  # Visualization helpers
│   ├── data_loader.py  # Benchmark data loaders
│   └── metrics.py  # Evaluation metrics
│
├── examples/  # Example usage scripts
│   ├── comparison_example.py
│   ├── plot_coefficients_example.py
│   ├── plot_interaction_matrix_example.py
│   ├── plot_k_additivity_example.py
│   ├── plot_model_performance_comparison_example.py
│   └── plot_noise_robustness_example.py
│
├── results/  # Experimental results (auto-generated)
│   ├── APDS/
│   └── benchmark/
│       ├── bounds/
│       ├── noise_robustness/  # Noise robustness results by dataset
│       └── bootstrap/  # Bootstrap stability results by dataset
│
├── data/  # Data directory (git-ignored)
├── requirements.txt  # Python dependencies
├── setup.py  # Package installation
├── LICENSE
└── README.md  # Project documentation
The project supports three mathematical representations of the Choquet integral, which are related to one another by linear transformations:
- Game Representation: The traditional fuzzy-measure representation, except that constraints such as monotonicity are not enforced.
- Möbius Representation: An alternative representation that directly captures the interaction between features.
- Shapley Representation: A representation that uses the Shapley value and the pairwise interaction indices between features.
Each representation has its own interpretability properties and is suitable for different types of analysis.
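The linear relationship between the three representations can be sketched with the standard formulas: the Möbius coefficient of a coalition A is the alternating sum of game values over its subsets, the Shapley value of feature i is the sum of m(A)/|A| over coalitions containing i, and the pairwise interaction index of i and j is the sum of m(A)/(|A|-1) over coalitions containing both. The code below is a brute-force illustration for small n; the function names are hypothetical and are not the repository's API.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items as a sorted tuple."""
    items = tuple(items)
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def mobius_transform(game, n):
    """Convert game values (dict keyed by sorted tuples) to Moebius coefficients."""
    return {
        A: sum((-1) ** (len(A) - len(B)) * game[B] for B in subsets(A))
        for A in subsets(range(n))
    }


def shapley_values(mobius, n):
    """Shapley value of each feature from the Moebius coefficients."""
    return [sum(c / len(A) for A, c in mobius.items() if i in A) for i in range(n)]


def interaction_index(mobius, i, j):
    """Pairwise Shapley interaction index of features i and j."""
    return sum(c / (len(A) - 1) for A, c in mobius.items() if i in A and j in A)


# Toy two-feature game (illustrative values only).
game = {(): 0.0, (0,): 0.2, (1,): 0.3, (0, 1): 1.0}
m = mobius_transform(game, 2)
phi = shapley_values(m, 2)
print(m, phi, interaction_index(m, 0, 1))
```

Because every step is a fixed linear map, a model fitted in one basis can be re-expressed in another without changing its predictions; only the meaning of the individual coefficients changes.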
Option 1: Install as a package (recommended)
- Clone the repository:
git clone https://github.com/[anonymous]/shapely_regression.git
cd shapely_regression
- Install the package in development mode:
pip install -e .
- Install additional dependencies if needed:
pip install -r requirements.txt

Option 2: Use without installation
If you prefer not to install the package, you can still run the scripts by adding the project root to your Python path:
- Clone the repository:
git clone https://github.com/[anonymous]/shapely_regression.git
cd shapely_regression
- Install required dependencies:
pip install -r requirements.txt
- Run scripts from the project root:
python -m examples.comparison_example

Before running the examples, you need to place your datasets in the data/ directory. The data loader expects the following files:
- data_apds.csv: APDS dataset
- data_banknotes.csv: Banknote authentication dataset (with header: authentic column)
- transfusion.csv: Blood Transfusion Service Center dataset
- data_mammographic.data: Mammographic mass dataset
- data_raisin.xlsx: Raisin dataset
- data_rice.xlsx: Rice (Cammeo and Osmancik) dataset
- diabetes.csv: Diabetes (PIMA) dataset
- data_skin.csv: Skin segmentation dataset
- dados_covid_sbpo_atual.csv: COVID SBPO dataset
- pure_pairwise_interaction_dataset.csv: Pure pairwise interaction dataset
Note: The data directory is git-ignored to prevent pushing large datasets to the repository.
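Since the data directory is not versioned, it can help to check that the expected files are in place before running an experiment. The snippet below is a small convenience sketch, not part of the repository; the helper name `missing_data_files` is hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path

# File names the data loader is documented to expect in data/.
EXPECTED_FILES = [
    "data_apds.csv",
    "data_banknotes.csv",
    "transfusion.csv",
    "data_mammographic.data",
    "data_raisin.xlsx",
    "data_rice.xlsx",
    "diabetes.csv",
    "data_skin.csv",
    "dados_covid_sbpo_atual.csv",
    "pure_pairwise_interaction_dataset.csv",
]


def missing_data_files(data_dir="data"):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_data_files("data")
if missing:
    print("Missing dataset files:", ", ".join(missing))
else:
    print("All expected dataset files are present.")
```

Run it from the project root so that the relative path data/ resolves correctly.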
The data loader (utils/data_loader.py) automatically applies:
- Class balancing via RandomOverSampler for imbalanced datasets (not applied to synthetic datasets)
- Missing value handling for specific datasets (e.g., COVID, mammographic)
- Data type conversions as needed
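Random oversampling balances classes by duplicating randomly chosen minority-class rows until every class matches the size of the largest one. The loader relies on imbalanced-learn's RandomOverSampler for this; the plain-Python sketch below only illustrates the principle, and the function name `random_oversample` is hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random minority-class rows until all classes are equally sized."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        # Draw with replacement from the rows of this class.
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data with a 4:1 class imbalance.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Because oversampling only repeats existing rows, it is usually applied to training data alone so that duplicated rows do not leak into the evaluation split; whether the loader does this before or after splitting is not stated here.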
from core.models.regression import ChoquisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from utils.data_loader import func_read_data

# Load data
X, y = func_read_data("banknotes")

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train model with game representation
model_game = ChoquisticRegression(
    representation="game",
    k_add=2,
    scale_data=True
)
model_game.fit(X_train, y_train)

# Create and train model with Mobius representation
model_mobius = ChoquisticRegression(
    representation="mobius",
    k_add=2,
    scale_data=True
)
model_mobius.fit(X_train, y_train)

# Create and train model with Shapley representation (k=2)
model_shapley = ChoquisticRegression(
    representation="shapley",
    k_add=2,
    scale_data=True
)
model_shapley.fit(X_train, y_train)

# Evaluate models
y_pred_game = model_game.predict(X_test)
y_pred_mobius = model_mobius.predict(X_test)
y_pred_shapley = model_shapley.predict(X_test)
print(f"Game representation accuracy: {accuracy_score(y_test, y_pred_game):.4f}")
print(f"Mobius representation accuracy: {accuracy_score(y_test, y_pred_mobius):.4f}")
print(f"Shapley representation accuracy: {accuracy_score(y_test, y_pred_shapley):.4f}")

The repository includes an example script that demonstrates the use of different representations:
# If installed as a package (Option 1):
python examples/comparison_example.py
# If using without installation (Option 2):
python -m examples.comparison_example

Run a comprehensive k-additivity analysis with noise robustness and bootstrap stability:
import sys
import os
sys.path.append(os.path.abspath('.'))
from paper_code.benchmark.bootstrap_and_noise_robustness.robustness_k_add import run_analysis_for_dataset
# Run analysis for a single dataset
results = run_analysis_for_dataset(
dataset="banknotes",
representation="shapley",
regularization="l2",
random_state=42
)
# Results are saved to:
# - results/benchmark/noise_robustness/banknotes/
# - results/benchmark/bootstrap/banknotes/

For multiple datasets:
datasets = ['banknotes', 'mammographic', 'diabetes']
for dataset in datasets:
    run_analysis_for_dataset(
        dataset=dataset,
        representation="shapley",
        regularization="l2"
    )

To generate the corresponding plots:
from paper_code.benchmark.bootstrap_and_noise_robustness.paper_plots_same_scale import main
# Generate plots for a specific dataset
main(dataset_name='banknotes')
# Plots are saved to:
# - results/benchmark/noise_robustness/banknotes/noise_robustness_scaled.png
# - results/benchmark/bootstrap/banknotes/bootstrap_stability_scaled.png

The analysis generates:
- CSV files (results.csv) with metrics for each k value:
  - Number of parameters
  - Baseline accuracy
  - Noise robustness (at 0.1, 0.2, 0.3 noise levels)
  - Bootstrap stability (mean ± std)
- Summary files (summary.txt) with full results table
- Plots:
  - Noise robustness vs k-additivity
  - Bootstrap stability vs k-additivity
  - Scaled versions for cross-dataset comparison

- Automatic class balancing: Uses RandomOverSampler to handle imbalanced datasets
- Noise robustness: Tests model performance under Gaussian noise (scaled by feature std)
- Bootstrap stability: Evaluates prediction consistency across bootstrap samples
- K-additivity sweep: Analyzes all k values from 1 to n_features
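The two perturbations used in the robustness analysis can be sketched as follows: Gaussian noise whose standard deviation is a fixed fraction of each feature's own standard deviation, and bootstrap resampling of row indices with replacement. This is a standard-library illustration under stated assumptions (population standard deviation, independent noise per cell), not the code in robustness_k_add.py; both function names are hypothetical.

```python
import random
import statistics


def add_scaled_noise(X, noise_level, seed=0):
    """Add zero-mean Gaussian noise with std = noise_level * feature std."""
    rng = random.Random(seed)
    # Per-feature standard deviation, computed column by column.
    stds = [statistics.pstdev(col) for col in zip(*X)]
    return [
        [value + rng.gauss(0.0, noise_level * std) for value, std in zip(row, stds)]
        for row in X
    ]


def bootstrap_indices(n_samples, seed=0):
    """Sample n_samples row indices with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


# Toy data: the second feature is constant, so it receives no noise.
X = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]]
for level in (0.1, 0.2, 0.3):
    print(level, add_scaled_noise(X, level)[0])
print(bootstrap_indices(len(X)))
```

Scaling the noise by each feature's spread keeps the perturbation comparable across features measured in different units, which is what makes a single noise level such as 0.1 meaningful for a whole dataset.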
from utils.plotting import plot_coefficients, plot_interaction_matrix_2add

# Plot model coefficients
plot_coefficients(
    feature_names=X.columns.tolist(),
    all_coefficients=[model_shapley.coef_[0]],
    plot_folder="results/",
    k_add=2
)

# Plot interaction matrix for Shapley representation (2-additive model)
plot_interaction_matrix_2add(
    feature_names=X.columns.tolist(),
    coefs=model_shapley.coef_[0],  # Coefficients from the fitted model
    plot_folder="results/"
)

If you encounter import errors like ModuleNotFoundError: No module named 'core', try one of these solutions:
- Install the package (recommended):
pip install -e .
- Run from the project root:
python -m examples.comparison_example
- Add the project root to your Python path in your script:
import sys
import os
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))