Bayesian hierarchical pooling

What it is

Across-model and across-region calibration is partially pooled in a hierarchical Bayesian model fit in PyMC. New models inherit informative priors from related families (Mistral 7B → Mistral Medium 3 → Mixtral 8×22B), narrowing intervals while remaining honest about novelty.

Why hierarchical pooling

Without pooling, every new model would start with maximally wide priors derived from generic family heuristics. With full pooling (treating all models as identical), the model-specific signal is lost. Partial pooling, the hierarchical Bayesian middle ground, is the right compromise: the new model's posterior is informed by the family's data, and the family's weight decreases as the new model accumulates its own observations.
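
The decreasing family weight can be illustrated with the conjugate Normal-Normal special case. This is a minimal sketch and not the fitted model: it assumes the observation scale and the family scale are known constants, and the names `pooling_weight` and `partially_pooled_mean` are hypothetical.

```python
def pooling_weight(n_obs: int, sigma_model: float, sigma_family: float) -> float:
    """Weight placed on the model's own sample mean in a Normal-Normal update.

    Assumes both scales are known: sigma_model is the observation noise and
    sigma_family is the spread of model biases around the family mean.
    """
    data_precision = n_obs / sigma_model**2    # grows with each observation
    prior_precision = 1.0 / sigma_family**2    # fixed contribution of the family
    return data_precision / (data_precision + prior_precision)


def partially_pooled_mean(
    model_mean: float,
    n_obs: int,
    family_mean: float,
    sigma_model: float,
    sigma_family: float,
) -> float:
    """Posterior mean of the model bias: a precision-weighted average."""
    w = pooling_weight(n_obs, sigma_model, sigma_family)
    return w * model_mean + (1.0 - w) * family_mean
```

With the illustrative values `sigma_model = 0.4` and `sigma_family = 0.2`, a model with no observations sits exactly at the family mean, four observations split the weight evenly, and a hundred observations put roughly 96% of the weight on the model's own data.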

Model structure

For tier-2 calibration residuals (the difference between tier-2 estimate and tier-3 measurement):

σ_family ~ HalfNormal(0.5)              # family-level dispersion
μ_family ~ Normal(0, σ_family)          # family bias prior

σ_model | family ~ HalfNormal(σ_family)         # model-level dispersion
μ_model | family ~ Normal(μ_family, σ_family)   # model bias

residual ~ Normal(μ_model, σ_model)             # observed residuals

The model is fit quarterly, jointly across all models and regions, using PyMC's NUTS sampler with 2,000 tuning steps and 4,000 draws across 4 chains.

What this gives us

A new model in the Mistral family inherits the family-level bias and dispersion estimates, shrunk toward the family mean. Because the priors are not flat, its first hundred tier-3 measurements update an informative starting point instead of carrying the whole estimate on their own.
Cross-region calibration is similarly pooled: Scaleway PAR-1 inherits from a "Western European EU-grid datacentre" prior, and atNorth STO-1 inherits from a "Nordic low-carbon datacentre" prior.
Posterior predictive intervals are propagated to tier 2 as the prior dispersion before conformal calibration.
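
As an illustration of the last point, a posterior predictive interval can be read off posterior draws of a model's bias and dispersion. This is a minimal sketch: the function name, the 90% level, and the assumption that draws arrive as two equal-length sequences are all hypothetical, and the hand-off to tier 2 and the conformal step are not shown.

```python
import random
import statistics


def predictive_interval_90(mu_draws, sigma_draws, seed=0):
    """Central 90% posterior predictive interval for a new residual.

    Draws one simulated residual per posterior draw of (mu, sigma), so the
    interval reflects both parameter uncertainty and observation noise.
    """
    rng = random.Random(seed)
    predictive = [rng.gauss(mu, sigma) for mu, sigma in zip(mu_draws, sigma_draws)]
    # n=20 yields cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(predictive, n=20)
    return cuts[0], cuts[-1]
```

Sampling a residual per draw, instead of plugging in posterior means, is what makes the interval widen for models with few observations.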

Inference cadence

The hierarchical model is re-fit quarterly. Between fits, new models pick up the most recent posterior as their prior. Re-fits are tagged in the methodology changelog as patch releases.

Limitations

The model assumes Gaussian residuals (after log-normal Monte Carlo at tier 2). For models with heavy-tailed residuals, this would underestimate tail risk; in practice, tier-2 residuals are well-approximated as Gaussian after the log-normal scaling, but we monitor this.
The hierarchy is two-level (family → model). A three-level hierarchy (family → architecture → model) would be more granular but does not currently improve the calibration on our calibration set; we revisit annually.
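
The Gaussian-residual assumption in the first limitation can be checked with simple moment diagnostics. The sketch below is one possible check, not necessarily the monitoring in use; the function name is hypothetical, and any alert threshold would need to be chosen separately.

```python
import statistics


def skew_and_excess_kurtosis(residuals):
    """Sample skewness and excess kurtosis from central moments.

    Both are close to 0 for Gaussian data; a clearly positive excess
    kurtosis points to tails heavier than the Normal likelihood assumes.
    """
    n = len(residuals)
    mean = statistics.fmean(residuals)
    centred = [r - mean for r in residuals]
    m2 = sum(c**2 for c in centred) / n
    m3 = sum(c**3 for c in centred) / n
    m4 = sum(c**4 for c in centred) / n
    return m3 / m2**1.5, m4 / m2**2 - 3.0
```

For reference, a Laplace distribution has an excess kurtosis of 3, so values in that range would suggest the Gaussian likelihood understates tail risk.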

Where this is implemented

methodology/uncertainty/bayesian_hierarchical.py

Citations

Gelman, A., et al. (2013). Bayesian Data Analysis, 3rd ed., chapter 5 on hierarchical models.
Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55.