Observable-based model evaluation, Pareto optimization, and Bayesian stacking for scientific model criticism.
model-criticism provides a structured framework for evaluating scientific
simulation models against known ground truth via observable-based scoring,
multi-objective Pareto optimization, and Bayesian model stacking.
The core pattern (from MFAI §4–5):
- Model world — Generate synthetic ground truth with known latent state
- Observe — Apply a realistic observation model (noise, masks, bias)
- Score — Evaluate structured observables against known truth using proper scoring rules (not noisy observations)
- Pareto — Extract the multi-objective Pareto front across competing objectives (e.g. quality vs. cost)
- Stack — Combine models via Bayesian stacking weights or score-based optimization
Studies can be organized into hierarchical phases (discovery → refinement → benchmark), where each phase filters configs for the next.
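The "Score" step evaluates observables against the known latent truth with proper scoring rules. As a quick illustration, independent of this package's API, the empirical CRPS of an ensemble forecast can be computed in plain NumPy:

```python
import numpy as np

def crps_ensemble(samples, y):
    """Empirical CRPS of an ensemble forecast against scalar truth y.

    CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|  (proper; lower is better).
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.abs(samples - y).mean()
    term2 = np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - 0.5 * term2

rng = np.random.default_rng(0)
# A well-calibrated ensemble scores better than a biased one.
sharp = crps_ensemble(rng.normal(0.0, 1.0, 1000), 0.0)
biased = crps_ensemble(rng.normal(3.0, 1.0, 1000), 0.0)
```

Because the rule is proper, scoring against the noisy observations instead of the truth would reward models that fit the noise; scoring against the known latent state avoids that.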
model_criticism/
├── protocols.py   # ModelWorld, Scorer, Observable, Annotation, ResultsTable
├── design.py      # Factor screening (SALib) and grid construction (pyDOE3)
├── scoring.py     # Proper scoring rules (scoringrules) and calibration
├── pareto.py      # Non-dominated sorting and indicators (pymoo)
├── stacking.py    # Bayesian stacking (arviz) and score-based weights (scipy)
├── runner.py      # Grid execution (joblib) and adaptive search (optuna)
├── study.py       # Multi-phase orchestration with filtering
└── io.py          # Save/load results (npz + JSON)
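The protocol names above suggest roughly the following shapes. This is a hypothetical sketch using `typing.Protocol`; the actual signatures in `protocols.py` may differ:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ModelWorld(Protocol):
    """Generates synthetic ground truth plus a realistic observation of it."""

    def generate(self, config: dict, seed: int) -> tuple:
        """Return (truth, observed) for one configuration."""
        ...

@runtime_checkable
class Scorer(Protocol):
    """Maps a completed run to observable values keyed by observable name."""

    def score(self, truth, observed, config: dict) -> dict:
        ...
```

Structural typing means any class with matching methods satisfies the protocol, so user models need no inheritance from the package.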
| Tier | Role | Examples |
|---|---|---|
| Embedded | Must hold by construction | ELBO monotonicity, conservation laws |
| Penalized | Soft constraints in objective | Stopping criteria, convergence rate |
| Diagnostic | Post-hoc evaluation only | Coverage, CRPS, PIT, rank error |
| Cost | Resource axis for Pareto | Wall time, dollar cost, field teams |
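One common way to realize the penalized tier (a sketch under assumptions, not this package's API) is a hinge penalty added to the base score, leaving diagnostic observables out of the optimization entirely:

```python
def penalized_objective(base_score, penalties, lam=10.0):
    """Base score (to minimize) plus weighted hinge penalties.

    A positive value in `penalties` means the soft constraint is
    violated by that amount; non-positive values cost nothing.
    """
    return base_score + lam * sum(max(0.0, v) for v in penalties.values())

# No violation: the objective equals the base score.
clean = penalized_objective(1.5, {"convergence_gap": -0.2})
# A violated stopping criterion inflates the objective.
bad = penalized_objective(1.5, {"convergence_gap": 0.3})
```

The function name, `lam`, and the penalty keys here are illustrative only.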
- Grid mode: Full factorial, LHS, or fractional factorial via pyDOE3 → run all → post-hoc Pareto extraction via pymoo.
- Adaptive mode: Multi-objective Bayesian optimization via optuna NSGA-II when the full grid is too expensive.
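Post-hoc Pareto extraction in grid mode reduces to non-dominated sorting. The package delegates this to pymoo, but the core idea fits in a few lines of NumPy (with all objectives oriented to minimize):

```python
import numpy as np

def pareto_front(F):
    """Indices of non-dominated rows of F; all objectives minimized."""
    n = len(F)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # Row j dominates row i if j is <= everywhere and < somewhere.
        dominated_by = np.all(F <= F[i], axis=1) & np.any(F < F[i], axis=1)
        if dominated_by.any():
            keep[i] = False
    return np.nonzero(keep)[0]

# Quality-vs-cost trade-off: the last point is dominated by (2, 2).
F = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 1.0], [3.0, 3.0]])
front = pareto_front(F)
```

This O(n²) sweep is fine for post-hoc analysis of a few hundred configs; pymoo's implementations scale better and also provide indicators such as hypervolume.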
- Bayesian (arviz): for models with log-likelihoods. Implements Yao et al. (2018) stacking via arviz.compare(method='stacking').
- Score-based (scipy): for arbitrary score matrices. Optimizes simplex weights to minimize (or maximize) a composite score.
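For intuition, the Yao et al. (2018) stacking objective maximizes the summed log of the weighted pointwise predictive densities over the weight simplex. A minimal SciPy sketch (not this package's internal code) is:

```python
import numpy as np
from scipy.optimize import minimize

def stacking_weights(log_lik):
    """Simplex weights maximizing sum_i log(sum_k w_k * p_k(y_i)).

    log_lik: (n_obs, n_models) pointwise log predictive densities.
    """
    lik = np.exp(log_lik)
    n_models = lik.shape[1]
    res = minimize(
        lambda w: -np.log(lik @ w).sum(),   # negate to minimize
        np.full(n_models, 1.0 / n_models),  # start at uniform weights
        bounds=[(1e-9, 1.0)] * n_models,
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
        method="SLSQP",
    )
    return res.x

# Model 0 consistently assigns higher density to the held-out points,
# so it should receive nearly all of the weight.
log_lik = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]]))
w = stacking_weights(log_lik)
```

In practice the log-likelihoods would be leave-one-out (or otherwise held-out) values, which is what arviz.compare computes for you.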
from model_criticism import Study, Phase, Observable, Tier, Direction
from model_criticism.study import top_k_pareto_filter
from model_criticism.design import build_grid, Factor, FactorType
# Define observables
coverage = Observable("coverage_95", Tier.DIAGNOSTIC, Direction.MAXIMIZE)
rank_err = Observable("rank_error", Tier.EMBEDDED, Direction.MINIMIZE)
wall_time = Observable("wall_seconds", Tier.COST, Direction.MINIMIZE)
# Build design grid
factors = [
    Factor("alpha", FactorType.CONTINUOUS, bounds=(0.01, 0.10)),
    Factor("method", FactorType.CATEGORICAL, levels=["A", "B", "C"]),
]
grid = build_grid(factors, method="lhs", n_samples=500)
# Run hierarchical study
study = Study(
    world=MyModelWorld(),
    scorer=MyScorer(),
    observables=[coverage, rank_err, wall_time],
    phases=[
        Phase("discovery", grid=grid, filter_fn=top_k_pareto_filter(k=20)),
        Phase(
            "benchmark",
            grid="carry",  # top-20 from discovery
            filter_fn=None,
        ),
    ],
)
study.run(n_jobs=-1)
print(study.summary())

pip install model-criticism[all]

Or install only the extras you need:

pip install model-criticism[scoring,pareto]  # just scoring + Pareto

| Extra | Packages |
|---|---|
| scoring | scoringrules |
| pareto | pymoo |
| stacking | arviz, scipy |
| design | pyDOE3, SALib |
| adaptive | optuna |
| parallel | joblib |
| all | All of the above |
| Package | Domain | Use Case |
|---|---|---|
| VBPCApy | Bayesian PCA | Convergence sweep: 15-factor grid, coverage/RMSE/CRPS vs. wall time |
| pp-eigentest | Rank selection | 3-stage method selection: type-I/power/exact vs. robustness |
| TICCS | Surveillance design | relWIS/peak-timing vs. surveillance cost ($) |
| Package | Description |
|---|---|
| ModelCriticism.jl | Julia implementation of this same framework |
uv sync --extra dev
just ci

MIT