One problem with {scoringutils} as a user (a problem inherent to scores rather than to the software itself) is that outputting a large number of scores across different forecasting units (time horizons, locations, location levels, ...) produces an incredibly large amount of data for a human to process in order to make a choice about model selection. Of course, the "best" model (or best ensemble when combining different collections of models) depends on which scoring metric is used, and perhaps on other external factors. These external factors could be the quality of the prediction, but also the maintainability of a model or its runtime, and they may influence whether a model is worth pursuing. An easy-to-maintain but non-optimally-scoring model may be preferred in a fast-paced operational environment over a theoretically optimal model that is hard to implement or has many difficult-to-validate assumptions baked into it.
In essence, I am asking whether {scoringutils} could include methods that help users make pragmatic choices about models and ensembles.
{scoringutils} could be expanded to help users cut out less-than-ideal model choices. Here are some ideas off the top of my head; I don't necessarily endorse all of them, but they are potentially useful avenues, or at least fun to think about. Perhaps they have been explored elsewhere; some I have used myself, others are just speculation.
- A function, or suite of functions, for Pareto analysis of selected scores. See for example a {scoringutils}-{rPref} approach here; a minimal sketch is also given after this list. A Pareto front and a plot of scores could be output, although a graphic may be difficult for more than two scores. I have done this in practice, and for the example I demonstrate on it is fast and simple to implement (given a set of scores).
- Some kind of statistical model: could a Plackett-Luce model help us infer an overall ranking of models from the rankings induced by individual scores? No idea if this makes sense, but it's cool (see the sketch after this list).
- An idea based on electoral systems: treat each score as a preference vote for a model. Preferences can then be combined to "elect" a best model (or a set of good models); a Borda count is one simple instantiation (sketched after this list). Again, no idea if this makes sense, but it is an interesting intersection of scoring with other fields.
- Guiding a user towards a bespoke utility function (score) based on a weighted sum of other scores, for example combining WIS or RPS across different location levels/spatial granularities as in Kennedy et al (2026).
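A minimal sketch of the Pareto idea, assuming a per-model score summary (the values here are invented) such as {scoringutils} produces via `score()` followed by `summarise_scores(by = "model")`:

```r
# Pareto-front selection over two scores with {rPref}.
# The `scores` values are illustrative, not real model output.
library(rPref)

scores <- data.frame(
  model     = c("ensemble", "baseline", "mechanistic", "statistical"),
  wis       = c(12.1, 20.4, 14.8, 13.0),
  ae_median = c(9.7, 15.2, 9.1, 11.4)
)

# Lower is better for both metrics; `*` composes a Pareto preference.
p <- low(wis) * low(ae_median)

# Models on the Pareto front, i.e. not dominated on both scores at once.
psel(scores, p)

# Assign every model to a front level (adds a `.level` column),
# convenient for plotting successive fronts.
psel(scores, p, top_level = nrow(scores))
```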
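For the Plackett-Luce idea, the {PlackettLuce} package could fit a "worth" parameter per model if we treat the ranking induced by each metric as one judge. The rankings below are invented purely for illustration:

```r
# Fit a Plackett-Luce model where each scoring metric contributes one
# ranking of the models (1 = best). All ranks here are made up.
library(PlackettLuce)

R <- matrix(
  c(1, 2, 3, 4,   # ranking induced by WIS
    2, 1, 3, 4,   # ranking induced by AE of the median
    1, 3, 2, 4),  # ranking induced by interval coverage deviation
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("ensemble", "mechanistic", "statistical", "baseline"))
)

mod <- PlackettLuce(as.rankings(R))
coef(mod, log = FALSE)  # estimated "worth" per model; higher = better
```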
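One concrete instantiation of the electoral idea is a Borda-style count: each score (and, in the same spirit, each external factor such as runtime) ranks the models, and the ranks are summed. All numbers below are made up:

```r
# Borda-style election: every column after `model` acts as one voter.
# Lower values are assumed better for every column here.
votes <- data.frame(
  model         = c("ensemble", "baseline", "mechanistic", "statistical"),
  wis           = c(12.1, 20.4, 14.8, 13.0),
  ae_median     = c(9.7, 15.2, 9.1, 11.4),
  runtime_hours = c(6.0, 0.5, 12.0, 2.0)  # hypothetical external factor
)

ranks <- sapply(votes[-1], rank)  # rank models within each "voter"
borda <- rowSums(ranks)           # lower total = more preferred overall
votes$model[order(borda)]         # models from most to least preferred
```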
Note that a challenge with the fourth idea is combining scores with different scales; for example, the units of WIS + RPS are completely meaningless. Multiattribute value elicitation gives a method for doing this: it is a simple algorithm, but it imposes a large cognitive burden on the user. A minimal normalization sketch is given below.
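One simple way around the scale problem is to rescale each score to [0, 1] over the candidate models before taking the weighted sum. In this sketch the scores are invented and the weights are placeholders a user would elicit, e.g. via swing weighting:

```r
# Composite loss from scores on incompatible scales, after rescaling each
# to [0, 1] (0 = best observed, 1 = worst observed). Values are invented.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

u <- data.frame(
  model = c("ensemble", "baseline", "mechanistic", "statistical"),
  wis   = c(12.1, 20.4, 14.8, 13.0),
  rps   = c(0.21, 0.35, 0.19, 0.28)
)

w <- c(wis = 0.7, rps = 0.3)  # hypothetical elicited weights, summing to 1
composite <- rescale01(u$wis) * w["wis"] + rescale01(u$rps) * w["rps"]
u$model[order(composite)]     # lower composite loss = preferred
```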