One problem with {scoringutils} as a user (a problem inherent to scores rather than to the software itself) is that outputting a large number of scores across different forecasting units (time horizons, locations, location levels, ...) produces an incredibly large amount of data for a human to process in order to make a choice about model selection. Of course, the "best" model (or best ensemble when combining different collections of models) depends on which scoring metric is used, and perhaps on other external factors. These external factors could be the quality of the prediction, but also the maintainability of a model or its runtime, and they may influence whether a model is worth pursuing. An easy-to-maintain but non-optimally-scoring model may be preferred in a fast-paced operational environment over a theoretically optimal model that is hard to implement or has many difficult-to-validate assumptions baked into it.
In essence, I am asking whether {scoringutils} could include methods that help users make pragmatic choices about models and ensembles.
{scoringutils} could be expanded to help users cut out less-than-ideal model choices. Here are some ideas off the top of my head; I don't necessarily endorse all of them, but they are potentially useful avenues, or at least fun to think about. Perhaps they have been explored elsewhere; some I have used myself, others are just speculation.
- A function, or suite of functions, for Pareto analysis of selected scores. See for example a {scoringutils}-{rPref} approach here; a minimal sketch is also given after this list. A Pareto front and a plot of scores could be output, although a graphic may be difficult for more than two scores. I have done this in practice, and for the example I demonstrate on it is fast and simple to implement (given a set of scores).
- Some kind of statistical model: could a Plackett-Luce model help us infer an overall ranking of models from the rankings induced by individual scores? No idea if this makes sense, but it's cool (see the sketch after this list).
- An idea based on electoral systems: treat each score as a preference vote for a model. Preferences can then be combined to "elect" a best model (or a set of good models); a Borda count is one simple instantiation (sketched after this list). Again, no idea if this makes sense, but it is an interesting intersection of scoring with other fields.
- Guiding a user towards a bespoke utility function (score) based on a weighted sum of other scores, for example combining WIS or RPS across different location levels/spatial granularities as in Kennedy et al (2026).
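A minimal sketch of the Pareto idea, assuming a per-model score summary (the values here are invented) such as {scoringutils} produces via `score()` followed by `summarise_scores(by = "model")`:

```r
# Pareto-front selection over two scores with {rPref}.
# The `scores` values are illustrative, not real model output.
library(rPref)

scores <- data.frame(
  model     = c("ensemble", "baseline", "mechanistic", "statistical"),
  wis       = c(12.1, 20.4, 14.8, 13.0),
  ae_median = c(9.7, 15.2, 9.1, 11.4)
)

# Lower is better for both metrics; `*` composes a Pareto preference.
p <- low(wis) * low(ae_median)

# Models on the Pareto front, i.e. not dominated on both scores at once.
psel(scores, p)

# Assign every model to a front level (adds a `.level` column),
# convenient for plotting successive fronts.
psel(scores, p, top_level = nrow(scores))
```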
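For the Plackett-Luce idea, the {PlackettLuce} package could fit a "worth" parameter per model if we treat the ranking induced by each metric as one judge. The rankings below are invented purely for illustration:

```r
# Fit a Plackett-Luce model where each scoring metric contributes one
# ranking of the models (1 = best). All ranks here are made up.
library(PlackettLuce)

R <- matrix(
  c(1, 2, 3, 4,   # ranking induced by WIS
    2, 1, 3, 4,   # ranking induced by AE of the median
    1, 3, 2, 4),  # ranking induced by interval coverage deviation
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("ensemble", "mechanistic", "statistical", "baseline"))
)

mod <- PlackettLuce(as.rankings(R))
coef(mod, log = FALSE)  # estimated "worth" per model; higher = better
```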
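One concrete instantiation of the electoral idea is a Borda-style count: each score (and, in the same spirit, each external factor such as runtime) ranks the models, and the ranks are summed. All numbers below are made up:

```r
# Borda-style election: every column after `model` acts as one voter.
# Lower values are assumed better for every column here.
votes <- data.frame(
  model         = c("ensemble", "baseline", "mechanistic", "statistical"),
  wis           = c(12.1, 20.4, 14.8, 13.0),
  ae_median     = c(9.7, 15.2, 9.1, 11.4),
  runtime_hours = c(6.0, 0.5, 12.0, 2.0)  # hypothetical external factor
)

ranks <- sapply(votes[-1], rank)  # rank models within each "voter"
borda <- rowSums(ranks)           # lower total = more preferred overall
votes$model[order(borda)]         # models from most to least preferred
```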
Note that a challenge with the fourth idea is combining scores with different scales; for example, the units of WIS + RPS are completely meaningless. Multiattribute value elicitation gives a method for doing this: it is a simple algorithm, but it imposes a large cognitive burden on the user. A minimal normalization sketch is given below.
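One simple way around the scale problem is to rescale each score to [0, 1] over the candidate models before taking the weighted sum. In this sketch the scores are invented and the weights are placeholders a user would elicit, e.g. via swing weighting:

```r
# Composite loss from scores on incompatible scales, after rescaling each
# to [0, 1] (0 = best observed, 1 = worst observed). Values are invented.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

u <- data.frame(
  model = c("ensemble", "baseline", "mechanistic", "statistical"),
  wis   = c(12.1, 20.4, 14.8, 13.0),
  rps   = c(0.21, 0.35, 0.19, 0.28)
)

w <- c(wis = 0.7, rps = 0.3)  # hypothetical elicited weights, summing to 1
composite <- rescale01(u$wis) * w["wis"] + rescale01(u$rps) * w["rps"]
u$model[order(composite)]     # lower composite loss = preferred
```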