DRAFT: finalise metrics#40

Open
hspitzer wants to merge 61 commits into main from metrics

@hspitzer

hspitzer commented Jan 26, 2026

Collaborator

Draft PR for finalising metrics.
Basic structure for every metric: a compute_ and a plot_ function in cellseg_benchmark/metrics, plus a script in scripts/metrics. The script creates a CSV file with one row per sample and method. All scripts can be run on a specific set of methods or on all methods at once, and have an overwrite flag. The script also plots figures using the plot_ function. Finally, there is a notebook for every metric to interactively compute the metric and its plots.
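
For illustration, the per-metric script layout described above could look roughly like the following. This is only a sketch: `compute_metric`, the CSV column names, and the CLI flag names are hypothetical placeholders, not the actual functions or options in cellseg_benchmark.

```python
import argparse
import csv
from pathlib import Path


def compute_metric(sample: str, method: str) -> float:
    """Hypothetical stand-in for a compute_ function; returns a dummy value."""
    return float(len(sample) + len(method))


def run(samples, methods, out_csv: Path, overwrite: bool = False) -> bool:
    """Write one CSV row per (sample, method).

    Returns False without touching the file if it exists and overwrite is off.
    """
    if out_csv.exists() and not overwrite:
        return False
    with out_csv.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["sample", "method", "value"])
        writer.writeheader()
        for sample in samples:
            for method in methods:
                writer.writerow(
                    {
                        "sample": sample,
                        "method": method,
                        "value": compute_metric(sample, method),
                    }
                )
    return True


def parse_args(argv=None):
    """Parse the (assumed) command-line options of a metric script."""
    parser = argparse.ArgumentParser(description="Compute a metric per sample and method.")
    parser.add_argument("--samples", nargs="+", default=[])
    # None means "run on all methods"; resolving that list is left to the caller.
    parser.add_argument("--methods", nargs="+", default=None)
    parser.add_argument("--out", type=Path, default=Path("metric.csv"))
    parser.add_argument("--overwrite", action="store_true")
    return parser.parse_args(argv)
```

With this layout, re-running a script without `--overwrite` leaves an existing CSV untouched, which keeps repeated runs over all methods cheap.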

Metrics to implement

Clean up:

  • remove unused notebooks.
  • remove unused functions in cellseg_benchmark/metrics

@hspitzer

Collaborator Author

@simonmfr please check implementation of clustering scores. I compute leiden clustering per sample. "leiden" is not in adata.obsm (using adata_integrated).

@simonmfr

simonmfr commented Feb 19, 2026

Owner

The reason why leiden clustering labels don't appear in adata_integrated is the following: clustering is currently computed during the cell type annotation step, per sample, with resolution=10 (here). The resulting cluster labels are written to adata_obs_annotated.csv, but they are not transferred into the downstream master_sdata object (see here).

I think a clustering resolution of 10 is not ideal for metric evaluation, so I would suggest keeping the current annotation behavior unchanged and computing a separate clustering for the metrics. This means running the clustering step twice, with different resolutions:

  1. leiden with res=10 for cell type annotation
  2. leiden with res=1 (or similar) for clustering metric evaluation

@hspitzer

hspitzer commented Mar 6, 2026

Collaborator Author

I just added scripts for general stats extraction & plotting (all present in adata obs). I noticed that intensities (DAPI / PolyT) were not computed for the following methods (aging cohort): vpt_2D_DAPI_nuclei, vpt_3D_DAPI_PolyT_nuclei, vpt_3D_DAPI_nuclei, Cellpose_1_Merlin, vpt_2D_DAPI_PolyT, vpt_2D_DAPI_PolyT_nuclei, vpt_3D_DAPI_PolyT

@simonmfr

simonmfr commented Mar 6, 2026

Owner

> I just added scripts for general stats extraction & plotting (all present in adata obs). I noticed that intensities (DAPI / PolyT) were not computed for the following methods (aging cohort): vpt_2D_DAPI_nuclei, vpt_3D_DAPI_PolyT_nuclei, vpt_3D_DAPI_nuclei, Cellpose_1_Merlin, vpt_2D_DAPI_PolyT, vpt_2D_DAPI_PolyT_nuclei, vpt_3D_DAPI_PolyT

Thanks for highlighting. These are all our VPT/Merlin methods, which contain a separate DAPI/PolyT quantification. We need to format these correctly; I've added it to our to-dos as #43.

@hspitzer

Collaborator Author

Re cleanup for this PR: I propose to remove the following notebooks:
metrics_cell_type_based, metrics_cell_types, metrics_general_old, metrics_morphology
All of these are re-implemented by my functions.

There might also be old metrics functions. Once the notebooks are removed, we can find them by checking which code in the repo still references them. If a function isn't used at all, it can go imo.

@simonmfr

Owner
  • I double checked this and removed the notebooks you mentioned: metrics_cell_type_based, metrics_cell_types, metrics_general_old, metrics_morphology
  • Also removed the outdated metric function files: metrics/wasserstein.py and metrics/specificity.py
  • Other outdated metric functions were defined in the removed notebooks

@hspitzer

hspitzer commented Apr 2, 2026

Collaborator Author

Todos:

  • Hannah: absolute number of cells -> track in cell type metrics computation; + remove previous implementation from metrics/general.py (n_cells)
  • Hannah: exclude Tanycytes and ABC for marker gene dict
  • Hannah: check negative marker purity computation (results are inverse of expected)
  • Hannah: compute MECR score per sample, and marker F1 per sample
  • Simon: Check distribution of cell types in 3D vs 2D models (background: 3D models have <50% assigned transcripts -> is there a bias in which cell types are segmented?)
  • Simon: quantify cell type proportions somehow. Idea: use the nuclei model as baseline + research papers quantifying neurons vs non-neurons -> is there something quantifying a coronal section similar to ours?
  • Simon: Select metrics: eg. CHS vs SH
  • Simon: Revise metric weights, in particular for assigned_transcripts
  • Simon: Check metric normalization: min-max? fix lower/upper boundaries for comparability? note: even with fixed boundaries, if all values are low, it may be valid to lower upper boundary. Check how other benchmarks did this, eg scib.
  • Jonas: finish vpt segmentations + run annotation/ovrlpy/ficture
  • Jonas: memory / runtime
  • Jonas: ficture f1 score
  • Jonas: DAPI/PolyT intensity for vpt samples
  • Jonas or Simon: read out sacct mem/runtime from previous cellpose jobs before proper logging was implemented; based on slurm job name (once only)
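
For the one-off sacct readout in the last item, a rough sketch could look like this. It assumes the jobs can be found by Slurm job name and that `Elapsed` and `MaxRSS` are the fields of interest; the function names and the returned keys are made up for illustration. Note that sacct by default only reports jobs since midnight of the current day, so a start time has to be passed for older jobs.

```python
import subprocess

SACCT_FIELDS = ["JobID", "JobName", "Elapsed", "MaxRSS"]
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def sacct_command(job_name: str, start: str) -> list:
    """Build a sacct call with pipe-separated output and no header line."""
    return [
        "sacct",
        f"--name={job_name}",
        f"--starttime={start}",
        f"--format={','.join(SACCT_FIELDS)}",
        "--parsable2",
        "--noheader",
    ]


def elapsed_to_seconds(elapsed: str) -> int:
    """Convert an Elapsed value such as 01:02:03 or 1-00:00:10 to seconds."""
    days, _, clock = elapsed.rpartition("-")
    parts = [int(x) for x in clock.split(":")]
    while len(parts) < 3:  # tolerate MM:SS
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return (int(days) if days else 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def maxrss_to_bytes(value: str):
    """Convert a MaxRSS value such as 2048K to bytes; empty fields give None."""
    if not value:
        return None
    suffix = value[-1].upper()
    if suffix in _UNITS:
        return int(float(value[:-1]) * _UNITS[suffix])
    return int(float(value))


def parse_sacct(output: str) -> list:
    """Parse --parsable2 --noheader output into one dict per job or job step."""
    rows = []
    for line in output.strip().splitlines():
        rec = dict(zip(SACCT_FIELDS, line.split("|")))
        rows.append(
            {
                "job_id": rec["JobID"],
                "job_name": rec["JobName"],
                "runtime_s": elapsed_to_seconds(rec["Elapsed"]),
                "max_rss_bytes": maxrss_to_bytes(rec["MaxRSS"]),
            }
        )
    return rows


def query_jobs(job_name: str, start: str) -> list:
    """Run sacct on the cluster and return the parsed rows."""
    result = subprocess.run(
        sacct_command(job_name, start), capture_output=True, text=True, check=True
    )
    return parse_sacct(result.stdout)
```

`query_jobs` only works where sacct is available; `parse_sacct` can be tried on saved output first.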

@hspitzer

hspitzer commented Apr 2, 2026

Collaborator Author

I updated the MECR and marker F1 computation to include samples. The plots now aggregate per gene, then per cell type, show samples as points, and are ordered by mean sample value. I realised that MECR is a lower-is-better metric, so we'll have to multiply it by -1 for the summary table.
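
A minimal sketch of the sign flip for a lower-is-better metric. The helper name is hypothetical, and the `bounded` variant (1 - x) is only an assumed alternative that keeps values in [0, 1] when the inputs already lie in [0, 1]; it is not something decided in this PR.

```python
def flip_lower_is_better(values, bounded=False):
    """Turn lower-is-better scores into higher-is-better scores.

    bounded=False: multiply by -1, as proposed above.
    bounded=True:  use 1 - x (assumes every input lies in [0, 1]).
    """
    return [1.0 - v if bounded else -v for v in values]
```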

@simonmfr

simonmfr commented Apr 9, 2026

Owner
  • I added marker F1 sample mean to table
  • Regarding metric normalization:
    • The total score is now computed as the mean of absolute metrics (all are in range [0, 1] without any normalization, except for silhouette score [-1, 1] which is rescaled to [0, 1]). In this way, the total score stays comparable between datasets.
    • For plotting, individual metrics (but not the total score) are min-max normalized given the dataset, as before and to highlight differences between methods; otherwise differences appear very small.
    • scib used the same approach, where metrics are in the range of [0, 1] based on theoretical min/max, which is then used for computing scores. On the plots, metrics are most likely min-max normalized (all range from [0, 1]), but I haven't found that in the plotting code yet.
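
The normalization scheme above can be sketched as follows. The function names and the `"silhouette"` key are hypothetical, and the total score is shown as an unweighted mean, which is an assumption (metric weights are still an open to-do).

```python
def rescale_silhouette(score: float) -> float:
    """Map a silhouette score from [-1, 1] to [0, 1]."""
    return (score + 1.0) / 2.0


def total_score(metrics: dict, silhouette_key: str = "silhouette") -> float:
    """Mean of absolute metric values; only the silhouette score is rescaled.

    Assumes every other metric already lies in [0, 1] and that all metrics
    are weighted equally.
    """
    values = [
        rescale_silhouette(value) if name == silhouette_key else value
        for name, value in metrics.items()
    ]
    return sum(values) / len(values)


def min_max(values: list) -> list:
    """Min-max normalize one metric across methods, for plotting only.

    If all methods have the same value, every entry maps to 0.0.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Because `total_score` never looks at the other methods in a dataset, it stays comparable across datasets, while `min_max` depends on the dataset and is therefore used for display only.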


4 participants