🚀 MIL-Reliability

Beyond accuracy: Quantifying the reliability of multiple instance learning for whole slide image classification

This repository contains the official implementation, experiments, and evaluation code for our work on MIL reliability assessment.

Figure 1: Overall framework for evaluating the reliability of MIL models.

🧠 Overview

This project provides a unified pipeline for quantitatively evaluating the reliability of multiple instance learning (MIL) models. It targets whole slide image (WSI) classification and supports several MIL architectures, including ABMIL and CLAM. In this study, reliability is defined as the consistent focus of a MIL model on diagnostically relevant regions of interest (ROIs) within WSIs, a prerequisite for trustworthy and clinically useful predictions. To quantify this alignment, we use three complementary metrics that capture different aspects of the spatial concordance between predicted patch scores and ground-truth annotations:

Mutual Information (MI)
Spearman’s Correlation (Spearman’s)
Area Under the Precision-Recall Curve (AUPRC)

📂 Data Preparation

We follow the preprocessing pipeline of CLAM. Pre-extracted patch features should be organized similarly to CLAM’s directory structure. Please refer to the original CLAM documentation or the accompanying paper for detailed guidance.
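
Before training, it can help to confirm that every slide listed in the dataset CSV has a feature file. The sketch below assumes the layout described in the CLAM documentation, where each slide's features are stored as `<feature_dir>/pt_files/<slide_id>.pt`. The helper name and the slide IDs are hypothetical, and the script is not part of this repository.

```python
import tempfile
from pathlib import Path

def find_feature_files(feature_dir, slide_ids):
    """Split slide IDs into (found, missing) by checking <feature_dir>/pt_files/<id>.pt."""
    pt_dir = Path(feature_dir) / "pt_files"
    found = [s for s in slide_ids if (pt_dir / f"{s}.pt").is_file()]
    found_set = set(found)
    missing = [s for s in slide_ids if s not in found_set]
    return found, missing

# Self-contained demo on a temporary directory with one placeholder feature file.
with tempfile.TemporaryDirectory() as root:
    pt_dir = Path(root) / "pt_files"
    pt_dir.mkdir()
    (pt_dir / "slide_1.pt").touch()
    found, missing = find_feature_files(root, ["slide_1", "slide_2"])
    print("found:", found, "missing:", missing)
```

In practice, the slide IDs would come from the CSV passed via `--csv_path` and the directory from `--data_root_dir`.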

🏋️ Training

Train MIL models using:

python train.py \
  --data_root_dir feat-directory ... \
  --lr 1e-4 --reg 1e-5 --seed 2021 \
  --k 5 --k_end 5 \
  --split_dir task_camelyon16 \
  --model_type abmil \
  --task task_1_tumor_vs_normal \
  --csv_path ./dataset_csv/camelyon16.csv \
  --exp_code ABMIL

🔍 Evaluation

After training, compute and store patch-level attention / prediction scores:

python eval.py \
  --drop_out \
  --k 5 --k_start 0 --k_end -1 \
  --models_exp_code ABMIL_s2021 \
  --save_exp_code ABMIL_eval \
  --task task_1_tumor_vs_normal \
  --model_type abmil \
  --results_dir results \
  --data_root_dir ...

📊 Reliability Computation

Compute reliability metrics across folds:

python reliability.py \
  --model_name ABMIL \
  --att_path ... \
  --anno_path ...
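
As a rough illustration of what summarizing metrics across folds can look like, the sketch below reduces per-fold metric values to a mean and a sample standard deviation. The function name, the dictionary layout, and the numbers are placeholders chosen for illustration. They are not results from the paper, and reliability.py may aggregate differently.

```python
from statistics import mean, stdev

def summarize_folds(per_fold):
    """Map each metric name to (mean, sample standard deviation) over folds."""
    names = per_fold[0].keys()
    return {
        name: (mean(f[name] for f in per_fold), stdev(f[name] for f in per_fold))
        for name in names
    }

# Placeholder values for three folds; for illustration only.
example_folds = [
    {"auprc": 0.50, "spearman": 0.25},
    {"auprc": 0.75, "spearman": 0.50},
    {"auprc": 1.00, "spearman": 0.75},
]
summary = summarize_folds(example_folds)
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.3f} +/- {sd:.3f}")
```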

📈 Results

Below are reliability comparisons across MIL models based on AUPRC, Spearman correlation, and Mutual Information.

Figure 2: Reliability evaluation of different MIL architectures.

📚 Citation

If you find this repository useful, please consider citing:

@article{keshvarikhojasteh2025beyond,
  title={Beyond accuracy: Quantifying the reliability of multiple instance learning for whole slide image classification},
  author={Keshvarikhojasteh, Hassan and Aubreville, Marc and Bertram, Christof A and Pluim, Josien PW and Veta, Mitko},
  journal={PloS one},
  volume={20},
  number={12},
  pages={e0337261},
  year={2025},
  publisher={Public Library of Science San Francisco, CA USA}
}

⚖ License

This repository is licensed under the MIT License.

