Calibration Error
Calibration error measures how far a model's estimated probabilities diverge from its observed accuracy. For QA tasks, this reflects how well a model's confidence scores can be trusted.
Examples
A model gives an average confidence of 80% across 400 examples and answers 320 of them correctly. The observed accuracy over those 400 examples is 320/400 = 80%, matching the average confidence, so this model is well calibrated.
Another model gives an average confidence of 90%, but is only correct 20% of the time. This model is overconfident.
Finally, a model which always produces the correct answer but has an average confidence of 20% would be said to be underconfident.
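As a minimal sketch of the comparison these examples describe (using the numbers from the first example above; not a library implementation), the sign of the gap between mean confidence and accuracy indicates the direction of miscalibration:

```python
# Toy check using the first example above: 400 answers, 320 correct,
# mean confidence of 0.80.
mean_confidence = 0.80
accuracy = 320 / 400            # 0.80

gap = mean_confidence - accuracy
if abs(gap) < 1e-6:
    print("well calibrated")    # this case: gap == 0.0
elif gap > 0:
    print("overconfident")      # confidence exceeds accuracy
else:
    print("underconfident")     # accuracy exceeds confidence
```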
Calculation
In practice, calibration error is computed over bins of the evaluation set and aggregated across bins, either by taking the mean error (expected calibration error) or the maximum error (maximum calibration error). We calculate the Expected Calibration Error (ECE) over \(M\) bins as:
ECE = \(\sum_{m=1}^{M}\frac{|B_m|}{n}\left|\text{acc}(B_m) - \text{conf}(B_m)\right|\)
where the \(B_m\) are \(M\) equally spaced confidence bins, \(n\) is the total number of examples, \(\text{conf}(B_m)\) is the mean estimated probability within a bin, and \(\text{acc}(B_m)\) is the mean accuracy within a bin.
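A minimal sketch of this calculation, assuming per-example confidence scores in [0, 1] and per-example correctness labels are available as arrays (the function name and default bin count below are illustrative, not part of any particular library):

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE over equally spaced confidence bins (illustrative sketch)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)

    # M equally spaced bin edges over [0, 1]
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # B_m: examples whose confidence falls in (lo, hi]
        in_bin = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            in_bin |= confidences == 0.0  # include exact zeros in the first bin
        if not in_bin.any():
            continue
        conf_bm = confidences[in_bin].mean()   # conf(B_m)
        acc_bm = correct[in_bin].mean()        # acc(B_m)
        ece += (in_bin.sum() / n) * abs(acc_bm - conf_bm)
    return ece

# Example: 400 answers at 80% confidence, 320 correct -> ECE of 0
conf = np.array([0.8] * 400)
hits = np.array([1] * 320 + [0] * 80)
print(expected_calibration_error(conf, hits))  # 0.0
```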