
Introduction

Fitting multiple models raises the question: which model best describes the data? bayesma provides two complementary tools — formal predictive model comparison via compare_models() and visual diagnostics via compare_plot() and ecdf_model_plot().

Leave-one-study-out cross-validation

The primary comparison criterion is leave-one-study-out cross-validation (LOSO-CV). For each study $i$, the model fitted to the remaining $k - 1$ studies is used to predict the withheld study's effect. The predictive quality is measured by the continuous ranked probability score (CRPS), averaged over studies.

CRPS is a proper scoring rule that rewards sharpness (narrow predictive distributions) and penalises miscalibration (predictive distributions that miss the true value). For a predictive distribution $F_i$ and observed value $y_i$:

$$\text{CRPS}(F_i, y_i) = \int_{-\infty}^{\infty} \left[F_i(x) - \mathbb{1}(x \geq y_i)\right]^2 \, dx$$

Lower CRPS is better. The overall LOSO-CRPS is the average over the $k$ studies:

$$\overline{\text{CRPS}} = \frac{1}{k} \sum_{i=1}^{k} \text{CRPS}(F_{-i}, y_i)$$
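
To make the definition concrete: when the predictive distribution is Gaussian, the integral has a well-known closed form. The sketch below assumes that special case; crps_gaussian and the toy input vectors are illustrative, not bayesma functions (as noted below, the package does not require Gaussian predictions).

# Closed-form CRPS for a Gaussian predictive distribution N(mu, sd^2)
crps_gaussian <- function(y, mu, sd) {
  z <- (y - mu) / sd
  sd * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
}

# Toy LOSO predictions for three studies (illustrative numbers only)
y       <- c(0.12, -0.05, 0.30)  # observed effects
mu_loso <- c(0.10,  0.02, 0.18)  # predictive means from fits excluding each study
sd_loso <- c(0.20,  0.25, 0.22)  # predictive sds

mean(crps_gaussian(y, mu_loso, sd_loso))  # mean LOSO-CRPS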

Advantages of LOSO-CV for meta-analysis

  • Cross-stage comparability. One-stage and two-stage models are fitted on different likelihoods but both produce predictive distributions on the effect-size scale. LOSO-CV on the effect scale makes them comparable.
  • Diagnostic value. Per-study CRPS scores identify individual studies that a model predicts poorly — often the same studies that drive sensitivity analysis results.
  • No distributional assumption. CRPS does not require the predictive distribution to be Gaussian.

LOO-IC and WAIC

Within-stage model comparison (e.g., comparing RE distributions for the same likelihood) can use LOO information criterion (LOO-IC) or the widely applicable information criterion (WAIC). These approximate leave-one-observation-out CV on the log-score scale.

$$\text{WAIC} = -2\left(\sum_i \log \overline{p}(y_i \mid \theta) - \sum_i \text{Var}_\theta\left[\log p(y_i \mid \theta)\right]\right)$$
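
As a sketch of how WAIC falls out of posterior draws: given an S × k matrix of pointwise log-likelihoods (S draws, k studies), the two sums above are a column-wise log-mean and a column-wise variance. The matrix below is a random stand-in, not real model output.

S <- 4000; k <- 8
log_lik <- matrix(rnorm(S * k, mean = -1), S, k)  # stand-in for real pointwise log-liks

lppd   <- sum(log(colMeans(exp(log_lik))))  # sum_i log p-bar(y_i | theta)
p_waic <- sum(apply(log_lik, 2, var))       # sum_i Var_theta[log p(y_i | theta)]
waic   <- -2 * (lppd - p_waic)              # use log-sum-exp in practice for stability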

LOO-IC is computed via Pareto-smoothed importance sampling and reports Pareto $\hat{k}$ diagnostics for each observation. Studies with $\hat{k} > 0.7$ are influential and their LOO contribution may be unreliable; LOSO-CV is preferable in these cases.
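
The same kind of log-likelihood matrix feeds the loo package directly; a minimal check of the $\hat{k}$ diagnostic (how the matrix is extracted depends on the fitted model and is not shown here):

loo_fit <- loo::loo(log_lik)                # PSIS-LOO; warns if r_eff is not supplied
loo::pareto_k_values(loo_fit)               # one k-hat per observation
which(loo::pareto_k_values(loo_fit) > 0.7)  # observations to treat with LOSO-CV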

Running model comparison

library(bayesma)

# Fit three candidate models to the same data
fit_ce  <- bayesma(data, model_type = "common_effect")
fit_re  <- bayesma(data, model_type = "random_effect")
fit_t   <- bayesma(data, model_type = "random_effect", re_dist = "t")

# LOSO-CV comparison of the three fits
comp <- compare_models(
  "Common-effect" = fit_ce,
  "RE Gaussian"   = fit_re,
  "RE Student-t"  = fit_t
)

compare_table(comp)  # ranked summary table
compare_plot(comp)   # calibration plot

Interpreting compare_table()

compare_table() returns a ranked tibble with columns:

Column        Description
model         Model label
loso_crps     Mean LOSO-CRPS (lower = better)
delta_crps    Difference from the best model's mean LOSO-CRPS
coverage_50   Empirical coverage of 50% prediction intervals
coverage_95   Empirical coverage of 95% prediction intervals
elpd_loo      LOO expected log-predictive density

Differences in loso_crps less than 0.05 on the log-OR scale are typically negligible. Prefer simpler models when differences are small.
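
The rule of thumb can be applied directly to the ranked table; a sketch using the columns documented above:

# Models within 0.05 mean LOSO-CRPS of the best; prefer the simplest of these
compare_table(comp) |>
  dplyr::filter(delta_crps < 0.05)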

Interpreting compare_plot()

compare_plot() shows prediction interval coverage at each nominal level (50%, 80%, 90%, 95%). A perfectly calibrated model lies on the diagonal. Models above the diagonal have intervals that are too wide (conservative); models below the diagonal have intervals that are too narrow (anti-conservative).
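
For a rough by-hand version of the same check, the two documented coverage columns already give two points on the calibration curve for a single model (compare_plot() evaluates more nominal levels):

tab <- compare_table(comp)
plot(c(0.50, 0.95), c(tab$coverage_50[1], tab$coverage_95[1]),
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Nominal level", ylab = "Empirical coverage")
abline(0, 1, lty = 2)  # perfect calibration lies on this diagonal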

Per-study diagnostics

The $study_scores element of a compare_models() result contains per-study CRPS and coverage statistics. Large per-study CRPS values flag studies that are systematically mispredicted — a useful diagnostic for identifying influential outliers or subgroups that require separate modelling.

# Five studies with the largest (worst) mean LOSO-CRPS
comp$study_scores |>
  dplyr::arrange(dplyr::desc(loso_crps)) |>
  head(5)

When not to compare models

LOSO-CV is not an appropriate tool for selecting between a model with and without a potential moderator — the moderator comparison is a question about explanation, not prediction over a fixed set of studies. Use posterior inclusion probabilities from meta_reg() or robma() for moderator selection.

Similarly, comparing a bias-corrected model to an uncorrected model with CRPS conflates model fit with bias correction — a model that perfectly reproduces biased data will outscore a correctly specified but bias-adjusted model. See Bias & Heterogeneity for appropriate diagnostics.