Observable-based model evaluation, Pareto optimization, and Bayesian stacking for scientific model criticism.
model-criticism provides a structured framework for evaluating scientific
simulation models against known ground truth via observable-based scoring,
multi-objective Pareto optimization, and Bayesian model stacking.
The core pattern (from MFAI §4–5):
- Model world — Generate synthetic ground truth with known latent state
- Observe — Apply a realistic observation model (noise, masks, bias)
- Score — Evaluate structured observables against known truth using proper scoring rules (not noisy observations)
- Pareto — Extract the multi-objective Pareto front across competing objectives (e.g. quality vs. cost)
- Stack — Combine models via Bayesian stacking weights or score-based optimization
Studies can be organized into hierarchical phases (discovery → refinement → benchmark), where each phase filters configs for the next.
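The "Score" step evaluates observables against the known latent truth with proper scoring rules. As a quick illustration, independent of this package's API, the empirical CRPS of an ensemble forecast can be computed in plain NumPy:

```python
import numpy as np

def crps_ensemble(samples, y):
    """Empirical CRPS of an ensemble forecast against scalar truth y.

    CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|  (proper; lower is better).
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.abs(samples - y).mean()
    term2 = np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - 0.5 * term2

rng = np.random.default_rng(0)
# A well-calibrated ensemble scores better than a biased one.
sharp = crps_ensemble(rng.normal(0.0, 1.0, 1000), 0.0)
biased = crps_ensemble(rng.normal(3.0, 1.0, 1000), 0.0)
```

Because the rule is proper, scoring against the noisy observations instead of the truth would reward models that fit the noise; scoring against the known latent state avoids that.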
model_criticism/
├── protocols.py   # ModelWorld, Scorer, Observable, Annotation, ResultsTable
├── design.py      # Factor screening (SALib) and grid construction (pyDOE3)
├── scoring.py     # Proper scoring rules (scoringrules) and calibration
├── pareto.py      # Non-dominated sorting and indicators (pymoo)
├── stacking.py    # Bayesian stacking (arviz) and score-based weights (scipy)
├── runner.py      # Grid execution (joblib) and adaptive search (optuna)
├── study.py       # Multi-phase orchestration with filtering
└── io.py          # Save/load results (npz + JSON)
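The protocol names above suggest roughly the following shapes. This is a hypothetical sketch using `typing.Protocol`; the actual signatures in `protocols.py` may differ:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ModelWorld(Protocol):
    """Generates synthetic ground truth plus a realistic observation of it."""

    def generate(self, config: dict, seed: int) -> tuple:
        """Return (truth, observed) for one configuration."""
        ...

@runtime_checkable
class Scorer(Protocol):
    """Maps a completed run to observable values keyed by observable name."""

    def score(self, truth, observed, config: dict) -> dict:
        ...
```

Structural typing means any class with matching methods satisfies the protocol, so user models need no inheritance from the package.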
| Tier | Role | Examples |
|---|---|---|
| Embedded | Must hold by construction | ELBO monotonicity, conservation laws |
| Penalized | Soft constraints in objective | Stopping criteria, convergence rate |
| Diagnostic | Post-hoc evaluation only | Coverage, CRPS, PIT, rank error |
| Cost | Resource axis for Pareto | Wall time, dollar cost, field teams |
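One common way to realize the penalized tier (a sketch under assumptions, not this package's API) is a hinge penalty added to the base score, leaving diagnostic observables out of the optimization entirely:

```python
def penalized_objective(base_score, penalties, lam=10.0):
    """Base score (to minimize) plus weighted hinge penalties.

    A positive value in `penalties` means the soft constraint is
    violated by that amount; non-positive values cost nothing.
    """
    return base_score + lam * sum(max(0.0, v) for v in penalties.values())

# No violation: the objective equals the base score.
clean = penalized_objective(1.5, {"convergence_gap": -0.2})
# A violated stopping criterion inflates the objective.
bad = penalized_objective(1.5, {"convergence_gap": 0.3})
```

The function name, `lam`, and the penalty keys here are illustrative only.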
- Grid mode: Full factorial, LHS, or fractional factorial via pyDOE3 → run all → post-hoc Pareto extraction via pymoo.
- Adaptive mode: Multi-objective Bayesian optimization via optuna NSGA-II when the full grid is too expensive.
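Post-hoc Pareto extraction in grid mode reduces to non-dominated sorting. The package delegates this to pymoo, but the core idea fits in a few lines of NumPy (with all objectives oriented to minimize):

```python
import numpy as np

def pareto_front(F):
    """Indices of non-dominated rows of F; all objectives minimized."""
    n = len(F)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # Row j dominates row i if j is <= everywhere and < somewhere.
        dominated_by = np.all(F <= F[i], axis=1) & np.any(F < F[i], axis=1)
        if dominated_by.any():
            keep[i] = False
    return np.nonzero(keep)[0]

# Quality-vs-cost trade-off: the last point is dominated by (2, 2).
F = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 1.0], [3.0, 3.0]])
front = pareto_front(F)
```

This O(n²) sweep is fine for post-hoc analysis of a few hundred configs; pymoo's implementations scale better and also provide indicators such as hypervolume.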
- Bayesian (arviz): for models with log-likelihoods. Implements Yao et al. (2018) stacking via arviz.compare(method='stacking').
- Score-based (scipy): for arbitrary score matrices. Optimizes simplex weights to minimize (or maximize) a composite score.
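For intuition, the Yao et al. (2018) stacking objective maximizes the summed log of the weighted pointwise predictive densities over the weight simplex. A minimal SciPy sketch (not this package's internal code) is:

```python
import numpy as np
from scipy.optimize import minimize

def stacking_weights(log_lik):
    """Simplex weights maximizing sum_i log(sum_k w_k * p_k(y_i)).

    log_lik: (n_obs, n_models) pointwise log predictive densities.
    """
    lik = np.exp(log_lik)
    n_models = lik.shape[1]
    res = minimize(
        lambda w: -np.log(lik @ w).sum(),   # negate to minimize
        np.full(n_models, 1.0 / n_models),  # start at uniform weights
        bounds=[(1e-9, 1.0)] * n_models,
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
        method="SLSQP",
    )
    return res.x

# Model 0 consistently assigns higher density to the held-out points,
# so it should receive nearly all of the weight.
log_lik = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]]))
w = stacking_weights(log_lik)
```

In practice the log-likelihoods would be leave-one-out (or otherwise held-out) values, which is what arviz.compare computes for you.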
from model_criticism import Study, Phase, Observable, Tier, Direction
from model_criticism.study import top_k_pareto_filter
from model_criticism.design import build_grid, Factor, FactorType
# Define observables
coverage = Observable("coverage_95", Tier.DIAGNOSTIC, Direction.MAXIMIZE)
rank_err = Observable("rank_error", Tier.EMBEDDED, Direction.MINIMIZE)
wall_time = Observable("wall_seconds", Tier.COST, Direction.MINIMIZE)
# Build design grid
factors = [
    Factor("alpha", FactorType.CONTINUOUS, bounds=(0.01, 0.10)),
    Factor("method", FactorType.CATEGORICAL, levels=["A", "B", "C"]),
]
grid = build_grid(factors, method="lhs", n_samples=500)
# Run hierarchical study
study = Study(
    world=MyModelWorld(),
    scorer=MyScorer(),
    observables=[coverage, rank_err, wall_time],
    phases=[
        Phase("discovery", grid=grid, filter_fn=top_k_pareto_filter(k=20)),
        Phase(
            "benchmark",
            grid="carry",  # top-20 from discovery
            filter_fn=None,
        ),
    ],
)
study.run(n_jobs=-1)
print(study.summary())

pip install model-criticism[all]

Or install only the extras you need:

pip install model-criticism[scoring,pareto]  # just scoring + Pareto

| Extra | Packages |
|---|---|
| scoring | scoringrules |
| pareto | pymoo |
| stacking | arviz, scipy |
| design | pyDOE3, SALib |
| adaptive | optuna |
| parallel | joblib |
| all | All of the above |
| Package | Domain | Use Case |
|---|---|---|
| VBPCApy | Bayesian PCA | Convergence sweep: 15-factor grid, coverage/RMSE/CRPS vs. wall time |
| pp-eigentest | Rank selection | 3-stage method selection: type-I/power/exact vs. robustness |
| TICCS | Surveillance design | relWIS/peak-timing vs. surveillance cost ($) |
| Package | Description |
|---|---|
| ModelCriticism.jl | Julia implementation of this same framework |
uv sync --extra dev
just ci

MIT