Calibration Error

Calibration error measures how much a model's estimated probabilities differ from its observed accuracy. For QA tasks, it indicates how far a model's confidence scores can be trusted.

Examples

A model gives an average confidence of 80% across 400 examples and answers 320 of those correctly. Its observed accuracy over the 400 examples is 80% (320/400), matching its average confidence, so this model is well calibrated.

Another model gives an average confidence of 90%, but is only correct 20% of the time. This model is overconfident.

Finally, a model that always produces the correct answer but has an average confidence of 20% would be said to be underconfident.
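As a minimal sketch, the confidence-accuracy gaps in the three examples above can be computed directly; the numbers below simply restate those examples in code.

```python
# Gap between mean confidence and observed accuracy for the three
# example models above (positive = overconfident, negative = underconfident).
examples = {
    "well calibrated": (0.80, 320 / 400),  # confidence 80%, 320/400 correct
    "overconfident":   (0.90, 0.20),       # confidence 90%, correct 20% of the time
    "underconfident":  (0.20, 1.00),       # confidence 20%, always correct
}

for name, (mean_confidence, accuracy) in examples.items():
    print(f"{name}: confidence - accuracy = {mean_confidence - accuracy:+.2f}")
```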

Calculation

In practice, calibration error is calculated over bins of the evaluation set and then aggregated across bins, either as a weighted mean (Expected Calibration Error) or as the maximum bin error (Maximum Calibration Error). We calculate the Expected Calibration Error over \(M\) bins as:

ECE = \(\sum_{m=1}^{M}\frac{|B_m|}{n}\left|\text{acc}(B_m) - \text{conf}(B_m)\right|\)

where the \(B_m\) are \(M\) equally spaced confidence bins, \(n\) is the total number of examples, \(\text{conf}(B_m)\) is the mean estimated probability over a bin, and \(\text{acc}(B_m)\) is the mean accuracy over a bin.
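A minimal sketch of this calculation is given below. The function name, the NumPy dependency, and the choice of half-open bins \((\text{lo}, \text{hi}]\) are illustrative assumptions, not part of any particular library; other implementations may bin differently at the edges.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE over equally spaced confidence bins, following the formula above.

    confidences: predicted probabilities in [0, 1]
    correct:     0/1 indicators of whether each answer was right
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)

    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # B_m: examples whose confidence falls in (lo, hi]
        # (one common binning convention; empty bins contribute nothing).
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        acc = correct[in_bin].mean()        # acc(B_m)
        conf = confidences[in_bin].mean()   # conf(B_m)
        ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece

# Usage: a model whose accuracy runs about 10 points below its confidence
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=400)
corr = (rng.uniform(size=400) < conf - 0.1).astype(int)
print(expected_calibration_error(conf, corr))  # roughly 0.1
```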

References

https://en.wikipedia.org/wiki/Calibration_(statistics)