Beyond accuracy: Quantifying the reliability of multiple instance learning for whole slide image classification
This repository contains the official implementation, experiments, and evaluation code for our work on MIL reliability assessment.
Figure 1: Overall framework for evaluating the reliability of MIL models.
This project provides a unified pipeline for quantitatively evaluating the reliability of multiple instance learning (MIL) models. It is designed for whole slide image (WSI) classification tasks and supports several MIL architectures (e.g., ABMIL and CLAM). In this study, reliability is defined as the consistent focus of MIL models on diagnostically relevant regions of interest (ROIs) within WSIs, a prerequisite for trustworthy and clinically useful predictions. To quantify this alignment, we selected three complementary metrics that together capture different aspects of the spatial concordance between predicted patch scores and ground-truth annotations:
- Mutual Information (MI)
- Spearman’s Correlation (Spearman’s)
- Area Under the Precision-Recall Curve (AUPRC)
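The three metrics listed above can be illustrated with a small, dependency-free sketch. It assumes per-patch scores and binary ground-truth labels (1 for a patch inside an annotated region) given as plain lists. The function names, the equal-width binning used for MI, and the bin count are illustrative assumptions, not the implementation in `reliability.py`, which may rely on library routines instead.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(scores, labels):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    x, y = average_ranks(scores), average_ranks(labels)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def auprc(scores, labels):
    """Average precision: precision weighted by each increase in recall."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("AUPRC needs at least one positive patch")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap, prev_recall = 0, 0.0, 0.0
    i = 0
    while i < len(order):
        j = i  # patches with identical scores share one threshold
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        tp += sum(labels[order[k]] for k in range(i, j + 1))
        recall = tp / n_pos
        ap += (tp / (j + 1)) * (recall - prev_recall)
        prev_recall = recall
        i = j + 1
    return ap


def mutual_information(scores, labels, n_bins=10):
    """MI (in nats) between equal-width-binned scores and binary labels."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0  # guard against constant scores
    bins = [min(int((s - lo) / width), n_bins - 1) for s in scores]
    n = len(scores)
    joint, p_bin, p_lab = Counter(zip(bins, labels)), Counter(bins), Counter(labels)
    return sum(
        (c / n) * math.log(c * n / (p_bin[b] * p_lab[l]))
        for (b, l), c in joint.items()
    )


if __name__ == "__main__":
    # Toy values for illustration only, not results from the paper.
    toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 1, 0, 0, 0]
    print("MI:", mutual_information(toy_scores, toy_labels))
    print("Spearman:", spearman(toy_scores, toy_labels))
    print("AUPRC:", auprc(toy_scores, toy_labels))
```

Higher values on all three indicate that high patch scores coincide with annotated regions. Note that the MI estimate depends on the chosen binning.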
We follow the preprocessing pipeline of CLAM. Pre-extracted patch features should be organized similarly to CLAM’s directory structure. Please refer to the original CLAM documentation or the accompanying paper for detailed guidance.
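As a quick sanity check before training, the sketch below lists the slides found in a feature directory. It assumes the layout that CLAM's feature extraction typically produces, with a `pt_files` subfolder of per-slide `.pt` tensors (usually alongside an `h5_files` subfolder). The function name `list_feature_slides` is hypothetical, and the CLAM documentation remains the authoritative source for the layout.

```python
from pathlib import Path


def list_feature_slides(feature_dir):
    """Return sorted slide IDs that have a .pt feature file in feature_dir/pt_files.

    Assumes a CLAM-style layout; raises if the pt_files subfolder is absent.
    """
    pt_dir = Path(feature_dir) / "pt_files"
    if not pt_dir.is_dir():
        raise FileNotFoundError(f"expected a 'pt_files' subfolder in {feature_dir}")
    return sorted(p.stem for p in pt_dir.glob("*.pt"))
```

Comparing the returned IDs with the slide IDs in the dataset CSV is one way to catch missing features early.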
Train MIL models using:
python train.py \
--data_root_dir feat-directory ... \
--lr 1e-4 --reg 1e-5 --seed 2021 \
--k 5 --k_end 5 \
--split_dir task_camelyon16 \
--model_type abmil \
--task task_1_tumor_vs_normal \
--csv_path ./dataset_csv/camelyon16.csv \
--exp_code ABMIL
After training, compute and store patch-level attention / prediction scores:
python eval.py \
--drop_out \
--k 5 --k_start 0 --k_end -1 \
--models_exp_code ABMIL_s2021 \
--save_exp_code ABMIL_eval \
--task task_1_tumor_vs_normal \
--model_type abmil \
--results_dir results \
--data_root_dir ...
Compute reliability metrics across folds:
python reliability.py \
--model_name ABMIL \
--att_path ... \
--anno_path ...
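When comparing models, per-fold metric values are commonly summarised as a mean and standard deviation. The sketch below shows one way to do that. The function name `summarize_folds` and the input format (one value per fold) are assumptions for illustration and may not match how `reliability.py` reports its results.

```python
from statistics import mean, stdev


def summarize_folds(values):
    """Return (mean, sample standard deviation) of one metric across folds."""
    if len(values) < 2:
        # The sample standard deviation is undefined for a single fold.
        return mean(values), 0.0
    return mean(values), stdev(values)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not results from the paper.
    fold_auprc = [0.5, 0.7]
    m, s = summarize_folds(fold_auprc)
    print(f"AUPRC: {m:.3f} +/- {s:.3f}")
```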
Below are reliability comparisons across MIL models based on AUPRC, Spearman correlation, and Mutual Information.
Figure 2: Reliability evaluation of different MIL architectures.
If you find this repository useful, please consider citing:
@article{keshvarikhojasteh2025beyond,
title={Beyond accuracy: Quantifying the reliability of multiple instance learning for whole slide image classification},
author={Keshvarikhojasteh, Hassan and Aubreville, Marc and Bertram, Christof A and Pluim, Josien PW and Veta, Mitko},
journal={PloS one},
volume={20},
number={12},
pages={e0337261},
year={2025},
publisher={Public Library of Science San Francisco, CA USA}
}
This repository is licensed under the MIT License.



