Monte Carlo log-normal propagation

What it is

Uncertainty propagation through the calculation graph using Monte Carlo simulation over log-normal-distributed inputs. Following Lloyd & Ries (2007), each emission factor is treated as a log-normal random variable; the calculation is run 10,000 times with samples from each input distribution; the empirical 5th and 95th percentiles of the output distribution define the 90% interval before conformal calibration.

Why log-normal

Empirical work in life-cycle assessment (Lloyd & Ries 2007; Frischknecht et al. 2007) finds that log-normal distributions fit observed emission-factor variability better than normal distributions. The intuition: emission factors cannot be negative, and their variability typically scales with magnitude. Log-normal captures both.

Per-input distribution construction

For each emission factor:

  1. Median is the central estimate from the source database (ecoinvent, ADEME)
  2. Geometric standard deviation (GSD) is derived from the pedigree score using Ciroth's lookup
  3. Log-normal parameters are computed: μ = ln(median), σ = ln(GSD)
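
The sampling step in Python:

import numpy as np

def lognormal_sample(median, gsd, n=10000, rng=None):
    """Draw n log-normal samples given a median and geometric standard deviation."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.log(median)      # mean of the underlying normal
    sigma = np.log(gsd)      # standard deviation of the underlying normal
    return rng.lognormal(mean=mu, sigma=sigma, size=n)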
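
As a sanity check on this parameterisation, the 5th and 95th percentiles of a single log-normal input have a closed form: median × GSD^(±z), with z ≈ 1.645 for a 90% interval. The sketch below compares that closed form with empirical percentiles. The median of 0.4 and GSD of 1.5 are illustrative values, not taken from any database, and `lognormal_interval` is a hypothetical helper name.

```python
from statistics import NormalDist

import numpy as np

def lognormal_interval(median, gsd, coverage=0.90):
    """Closed-form central interval of a log-normal given its median and GSD."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)  # about 1.645 for 90%
    return median / gsd**z, median * gsd**z

# Illustrative values only (hypothetical emission factor).
median, gsd = 0.4, 1.5
lo, hi = lognormal_interval(median, gsd)

# Empirical percentiles from a large sample should agree closely.
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=np.log(median), sigma=np.log(gsd), size=100_000)
emp_lo, emp_hi = np.percentile(samples, [5, 95])
```

The interval is symmetric around the median on a log scale, which is why a larger GSD stretches the upper bound much further than the lower one in absolute terms.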

Per-query simulation

For a tier-2 receipt, the simulation runs:

co2e_samples = (
    parametric_gpu_energy(model, hardware, batch)  # log-normal
    * pue                                            # log-normal
    + host_share                                     # log-normal
    + idle_share                                     # log-normal
    + embodied_amortisation                          # log-normal
) * grid_intensity                                   # log-normal

Each operand is sampled 10,000 times; the resulting 10,000 co2e_samples form the empirical output distribution. Its 5th, 50th, and 95th percentiles become the lower bound, median, and upper bound of the 90% interval respectively.
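
A minimal runnable sketch of the expression above, assuming every operand is drawn independently from a log-normal. All medians, GSDs, and units here are hypothetical placeholders chosen for illustration; `draw` is a helper introduced for this example, not a function from the codebase.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000

def draw(median, gsd):
    """Draw N log-normal samples from a median and geometric standard deviation."""
    return rng.lognormal(mean=np.log(median), sigma=np.log(gsd), size=N)

# Hypothetical medians and GSDs; units are assumptions made so that the
# expression is dimensionally consistent (energy terms in kWh).
gpu_energy = draw(3.0e-4, 1.20)   # kWh per request
pue        = draw(1.2, 1.05)      # dimensionless
host_share = draw(5.0e-5, 1.30)   # kWh
idle_share = draw(2.0e-5, 1.50)   # kWh
embodied   = draw(1.0e-5, 1.80)   # kWh-equivalent
grid       = draw(400.0, 1.10)    # gCO2e per kWh

# Element-wise arithmetic propagates all N trials through the graph at once.
co2e_samples = (gpu_energy * pue + host_share + idle_share + embodied) * grid
lower, median, upper = np.percentile(co2e_samples, [5, 50, 95])
```

Because the output mixes sums and products of log-normals, it is not itself exactly log-normal, which is one reason to read the percentiles off the samples rather than from a closed form.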

Caching

Sampling 10,000 trials per request would be wasteful when most of the inputs are stable across requests. We cache the input distributions per (region, model, hour) tuple; the request-specific token-count adjustment is applied as a deterministic scaling. Effective wall-time per request: < 5 ms for the Monte Carlo step.

The cache is invalidated:

  • When the live grid intensity changes by more than 5%
  • When the methodology version increments
  • Hourly, regardless
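
The three invalidation rules can be sketched as a small staleness check. This is an illustration under stated assumptions, not the production cache: `CacheEntry`, `is_stale`, `get_samples`, and the `simulate` callback are hypothetical names, and for simplicity the cached value is a single sample array rather than whatever structure the real cache stores.

```python
import time
from dataclasses import dataclass

import numpy as np

@dataclass
class CacheEntry:
    samples: np.ndarray        # cached sample array (assumed shape: one value per trial)
    grid_intensity: float      # live grid intensity when the entry was built
    methodology_version: str
    created_at: float          # epoch seconds

_cache = {}  # maps (region, model, hour) to CacheEntry

def is_stale(entry, live_intensity, version, now, max_age_s=3600.0):
    """Apply the three invalidation rules: >5% drift, version bump, hourly expiry."""
    drift = abs(live_intensity - entry.grid_intensity) / entry.grid_intensity
    return (
        drift > 0.05
        or entry.methodology_version != version
        or now - entry.created_at >= max_age_s
    )

def get_samples(key, live_intensity, version, simulate, token_scale=1.0, now=None):
    """Return cached samples, rebuilding if stale, then scale deterministically."""
    now = time.time() if now is None else now
    entry = _cache.get(key)
    if entry is None or is_stale(entry, live_intensity, version, now):
        entry = CacheEntry(simulate(), live_intensity, version, now)
        _cache[key] = entry
    # The token-count adjustment is a plain multiplication, so no resampling is needed.
    return entry.samples * token_scale
```

Passing `now` explicitly keeps the staleness logic testable without waiting on the wall clock.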

Limitations

Three honest limitations:

  1. Independence assumption. We sample each emission factor independently. In reality, hardware energy and PUE are correlated (a hot day raises both). We are tracking this for a future revision; see Sobol for which correlations matter most.
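
One common way to relax the independence assumption is to sample the underlying normals jointly and then exponentiate. The sketch below does that for two inputs. It is not the planned revision: the correlation of 0.6, the medians and GSDs, and the `correlated_lognormal` helper are all hypothetical.

```python
import numpy as np

def correlated_lognormal(medians, gsds, corr, n=10_000, seed=0):
    """Draw jointly log-normal samples with a given correlation in log space."""
    mu = np.log(medians)
    sigma = np.log(gsds)
    cov = corr * np.outer(sigma, sigma)   # covariance of the underlying normals
    rng = np.random.default_rng(seed)
    return np.exp(rng.multivariate_normal(mu, cov, size=n))

# Hypothetical inputs: GPU energy (kWh) and PUE with log-space correlation 0.6.
medians = np.array([3.0e-4, 1.2])
gsds = np.array([1.20, 1.05])
corr = np.array([[1.0, 0.6], [0.6, 1.0]])

samples = correlated_lognormal(medians, gsds, corr)
facility_energy = samples[:, 0] * samples[:, 1]
```

A positive correlation between two multiplied inputs widens the spread of their product relative to independent sampling, so ignoring it tends to understate the interval.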

  2. Log-normal mis-fit at tails. Some emission factors have heavier tails than log-normal captures. The conformal interval is the backstop: even if the parametric tail is wrong, the conformal calibration adjusts the reported width so that the stated coverage still holds.

  3. Tail truncation. Log-normal in principle has support on (0, ∞). For the 90% interval this rarely matters, but for very high pedigree scores the upper tail is unreasonable. We truncate at the 99.5th percentile of the underlying distribution.
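
A minimal sketch of the truncation step, assuming it is implemented by capping samples at the analytic 99.5th percentile of the input's log-normal. Whether the real implementation caps, rejects, or resamples values above the threshold is not stated here, so the capping below is an assumption, and `truncated_lognormal_sample` is a hypothetical name.

```python
from statistics import NormalDist

import numpy as np

def truncated_lognormal_sample(median, gsd, n=10_000, upper_q=0.995, seed=0):
    """Log-normal samples with the upper tail capped at a given quantile."""
    mu, sigma = np.log(median), np.log(gsd)
    # Analytic quantile of the log-normal: exp(mu + sigma * z_q).
    cap = np.exp(mu + sigma * NormalDist().inv_cdf(upper_q))
    rng = np.random.default_rng(seed)
    samples = rng.lognormal(mean=mu, sigma=sigma, size=n)
    return np.minimum(samples, cap), cap

# Illustrative values: a wide distribution where the cap matters most.
samples, cap = truncated_lognormal_sample(median=1.0, gsd=3.0)
```

Since the cap sits above the 95th percentile, it leaves the reported 90% interval essentially unchanged while bounding the mean and any downstream sums.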

Where this is implemented

methodology/uncertainty/monte_carlo.py

Citations

  • Lloyd, S. M., & Ries, R. (2007). Characterizing, Propagating, and Analyzing Uncertainty in Life-Cycle Assessment. J. Industrial Ecology 11(1).
  • Frischknecht, R. et al. (2007). Overview and Methodology — Data v2.0. ecoinvent report no. 1.