Shapley attribution

What it is

In multi-tenant batched inference, the per-request energy cost depends on which other requests share the GPU at execution time. We use Shapley-based attribution to allocate the batch's total energy fairly across the concurrent requests. Following Han et al. (ISCA 2025), the Shapley value of a request is its average marginal contribution to the batch's energy: the extra energy consumed when the request joins the requests that precede it, averaged over all orderings (permutations) of the batch.
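Written out, the definition above takes the following form. The symbols are notation introduced here for illustration and are not taken from the cited paper: B is the set of requests in the batch, n = |B|, E(S) is the energy of running the subset S as a batch, Π(B) is the set of all orderings of B, and P_i^π is the set of requests that precede request i in ordering π.

```latex
\phi_i \;=\; \frac{1}{n!} \sum_{\pi \in \Pi(B)}
  \Bigl[ E\bigl(P_i^{\pi} \cup \{i\}\bigr) - E\bigl(P_i^{\pi}\bigr) \Bigr],
\qquad
\sum_{i \in B} \phi_i \;=\; E(B) - E(\emptyset)
```

The second identity is the efficiency property: within any single ordering the marginal contributions telescope, so the shares always add up to the batch total when an empty batch costs nothing.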

Why this matters

Naive allocation strategies are biased:

  • Per-token allocation (divide batch energy by total tokens) under-charges concurrency-tolerant requests and over-charges concurrency-sensitive ones
  • Per-request allocation (divide by request count) ignores prompt length entirely
  • Provisioned-share allocation (each request pays its share of the provisioned GPU capacity, regardless of what else is in the batch) is biased toward concurrency-tolerant workloads

Shapley attribution is the unique allocation rule satisfying efficiency (the shares sum to the batch total), symmetry (identical requests get the same charge), dummy (a request that contributes nothing pays nothing), and additivity. On fairness grounds it is the principled choice; its drawback is computational cost.
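The first three axioms can be checked numerically on a toy batch. The sketch below computes exact Shapley values by enumerating every ordering. The energy model `toy_energy`, the token counts in `TOKENS`, and the function name `exact_shapley` are all hypothetical and chosen only for illustration; this is not the code in methodology/attribution/shapley.py.

```python
from itertools import permutations

def exact_shapley(requests, energy):
    """Average each request's marginal energy over every ordering."""
    totals = {r: 0.0 for r in requests}
    count = 0
    for order in permutations(requests):
        prefix = frozenset()
        prev = energy(prefix)
        for r in order:
            prefix = prefix | {r}
            cur = energy(prefix)
            totals[r] += cur - prev  # marginal contribution of r in this ordering
            prev = cur
        count += 1
    return {r: t / count for r, t in totals.items()}

# Hypothetical token counts; "idle" stands in for a request that adds no work.
TOKENS = {"a": 100, "b": 100, "c": 400, "idle": 0}

def toy_energy(subset):
    """Assumed model: fixed overhead plus a sublinear term in total tokens (Wh)."""
    tokens = sum(TOKENS[r] for r in subset)
    if tokens == 0:
        return 0.0
    return 0.5 + 0.01 * tokens ** 0.9

shares = exact_shapley(list(TOKENS), toy_energy)
batch_total = toy_energy(frozenset(TOKENS))
for rid, wh in shares.items():
    print(f"{rid}: {wh:.4f} Wh")
print(f"sum of shares: {sum(shares.values()):.4f} Wh, batch total: {batch_total:.4f} Wh")
```

In this toy model "a" and "b" receive the same share (symmetry), "idle" receives zero (dummy), and the shares sum to the batch total (efficiency). The loop visits n! orderings, which is why exact computation does not scale beyond very small batches.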

Approximation

Exact Shapley computation is exponential in batch size and therefore intractable for production. We use truncated Shapley sampling (Han et al., 2025): randomly sample K permutations of the batch, compute the marginal contribution of each request in each permutation, and average those contributions per request.

For our workloads, K=200 gives Shapley estimates within 2% of the exact values at batch size 32, at a cost of roughly 1 ms per request. The computation runs offline at receipt aggregation, not on the critical inference path.
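A minimal sketch of the sampling step follows. It implements plain permutation sampling only; the truncation rule from the cited paper is not described here, so it is not reproduced. The energy model `toy_energy`, the synthetic token counts, and the name `sampled_shapley` are assumptions made for illustration.

```python
import random

def sampled_shapley(requests, energy, k=200, seed=0):
    """Estimate Shapley values by averaging marginals over k random orderings."""
    rng = random.Random(seed)
    order = list(requests)
    totals = {r: 0.0 for r in order}
    for _ in range(k):
        rng.shuffle(order)
        prefix = set()
        prev = energy(prefix)
        for r in order:
            prefix.add(r)
            cur = energy(prefix)
            totals[r] += cur - prev  # marginal contribution in this ordering
            prev = cur
    return {r: t / k for r, t in totals.items()}

# Hypothetical batch of 32 requests with increasing prompt lengths.
TOKENS = {f"req-{i}": 50 + 25 * i for i in range(32)}

def toy_energy(subset):
    """Assumed model: fixed overhead plus a sublinear term in total tokens (Wh)."""
    tokens = sum(TOKENS[r] for r in subset)
    if tokens == 0:
        return 0.0
    return 0.5 + 0.01 * tokens ** 0.9

estimates = sampled_shapley(TOKENS, toy_energy, k=200)
batch_total = toy_energy(set(TOKENS))
print(f"sum of estimates: {sum(estimates.values()):.4f} Wh")
print(f"batch total:      {batch_total:.4f} Wh")
```

Because the marginals within each sampled ordering telescope to the batch total, the estimates sum to the total for any K. Sampling error only affects how the total is split between requests.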

When Shapley is applied

  • Tier 2 receipts: Shapley attribution is applied at receipt aggregation time. The per-receipt energy estimate uses the batch composition at execution; the share of batch energy allocated to a request is its truncated Shapley value.
  • Tier 3 receipts: with telemetry available, we have the actual measured GPU energy for the batch. Shapley distributes that measured total across the requests.
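How the measured Tier 3 total is distributed is not spelled out above. One possible approach, shown here purely as an assumption, is to compute Shapley shares from the energy model and then rescale them proportionally so they sum to the measured batch energy. The function name `rescale_to_measured` and the request IDs are hypothetical.

```python
def rescale_to_measured(model_shares_wh, measured_total_wh):
    """Scale model-based shares proportionally so they sum to the measured total.

    Proportional scaling is an assumption of this sketch, not a documented rule.
    """
    model_total = sum(model_shares_wh.values())
    if model_total <= 0:
        raise ValueError("model-based shares must sum to a positive value")
    scale = measured_total_wh / model_total
    return {rid: share * scale for rid, share in model_shares_wh.items()}

# Hypothetical model-based shares (Wh) and a hypothetical telemetry reading.
model_shares = {"req-a": 0.20, "req-b": 0.30, "req-c": 0.50}
measured = rescale_to_measured(model_shares, measured_total_wh=1.2)
print(measured)
```

Proportional rescaling preserves the ratio between any two requests' shares, so the ranking produced by the Shapley estimate is unchanged.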

What the receipt shows

The Shapley share is implicit in the per-receipt energy and gCO₂e numbers. For transparency, the receipt's lineage_url JSON-LD blob includes the full attribution detail:

{
  "batch_id": "batch-vi-9f3c2e",
  "batch_size": 28,
  "batch_total_wh": 8.4,
  "this_request_share_wh": 0.231,
  "attribution_method": "shapley_truncated_K200",
  "shapley_sample_count": 200
}
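A consumer of the receipt can use these fields to compare the attributed share with a naive equal split. The sketch below parses the sample blob inline rather than fetching it from lineage_url; the field names come from the sample above, while the variable names are illustrative.

```python
import json

# The sample attribution detail from above, inlined instead of fetched.
lineage_blob = """
{
  "batch_id": "batch-vi-9f3c2e",
  "batch_size": 28,
  "batch_total_wh": 8.4,
  "this_request_share_wh": 0.231,
  "attribution_method": "shapley_truncated_K200",
  "shapley_sample_count": 200
}
"""

detail = json.loads(lineage_blob)

# Fraction of the batch's energy attributed to this request.
share_fraction = detail["this_request_share_wh"] / detail["batch_total_wh"]

# What a per-request (equal split) allocation would have charged instead.
equal_split_wh = detail["batch_total_wh"] / detail["batch_size"]

print(f"Shapley share: {share_fraction:.2%} of the batch")
print(f"Equal split would be {equal_split_wh:.3f} Wh")
```

For the sample values, the request is attributed 0.231 Wh, which is 2.75% of the batch total, compared with 0.3 Wh under an equal split across 28 requests.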

Limitations

  • Within-batch correlations. When the requests in a batch have correlated computational profiles (similar lengths, similar models), the sampled Shapley estimate is well-behaved. Pathological batches (for example, one extremely long request mixed with many short ones) can produce noisy estimates at K=200; in those cases we increase K adaptively based on the variance of the marginal-contribution distribution.
  • Cross-batch interactions are not attributed. A request that triggered a model load (cold-start cost) still pays only its Shapley share of the batch it ran in; the cold-start amortisation is allocated separately to the first N requests after a load event.
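The adaptive increase of K mentioned in the first limitation could be sketched as follows. The stopping rule used here (double K until the largest standard error falls below a fraction of the mean share, up to a cap) is an assumption, as are the parameter names, the energy model, and the synthetic batch of one long request among many short ones.

```python
import random
import statistics

def adaptive_shapley(requests, energy, k_start=200, k_max=3200, rel_tol=0.02, seed=0):
    """Permutation sampling that doubles K while the estimates remain noisy."""
    rng = random.Random(seed)
    order = list(requests)
    samples = {r: [] for r in order}
    mean_share = energy(set(order)) / len(order)
    drawn, k_target = 0, k_start
    while True:
        while drawn < k_target:
            rng.shuffle(order)
            prefix = set()
            prev = energy(prefix)
            for r in order:
                prefix.add(r)
                cur = energy(prefix)
                samples[r].append(cur - prev)
                prev = cur
            drawn += 1
        # Largest standard error of the mean across all requests.
        worst_se = max(statistics.stdev(v) / drawn ** 0.5 for v in samples.values())
        if worst_se <= rel_tol * mean_share or k_target >= k_max:
            break
        k_target = min(2 * k_target, k_max)
    return {r: statistics.fmean(v) for r, v in samples.items()}, drawn

# Hypothetical pathological batch: one very long request among 15 short ones.
TOKENS = {f"short-{i}": 64 for i in range(15)}
TOKENS["long-0"] = 16000

def toy_energy(subset):
    """Assumed model: fixed overhead plus a sublinear term in total tokens (Wh)."""
    tokens = sum(TOKENS[r] for r in subset)
    if tokens == 0:
        return 0.0
    return 0.5 + 0.01 * tokens ** 0.9

estimates, k_used = adaptive_shapley(TOKENS, toy_energy)
print(f"permutations used: {k_used}")
print(f"long request share: {estimates['long-0']:.3f} Wh")
```

The cap `k_max` keeps the loop bounded even when the tolerance is never met, which matters for a job that runs at receipt aggregation time.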

Where this is implemented

methodology/attribution/shapley.py

Citations

  • Han, X., et al. (2025). Fair Energy Attribution for Multi-Tenant LLM Inference. Proc. ISCA 2025.
  • Shapley, L. S. (1953). A Value for n-Person Games. Contributions to the Theory of Games II.
  • Štrumbelj, E., & Kononenko, I. (2014). Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41.