Reference

Jeremy Nixon, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. "Measuring Calibration in Deep Learning." arXiv:1904.01685 [cs, stat] (2020)

Abstract

The reliability of a machine learning model’s confidence in its predictions is critical for high-risk applications. Calibration—the idea that a model’s predicted probabilities of outcomes reflect the true probabilities of those outcomes—formalizes this notion. Current calibration metrics fail to consider all of the predictions made by machine learning models, and are inefficient in their estimation of the calibration error. We design the Adaptive Calibration Error (ACE) metric to resolve these pathologies and show that it outperforms other metrics, especially in settings where the predicted probabilities beyond the top-scoring class (the one chosen as the output) matter.
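To make the idea concrete, below is a minimal NumPy sketch of the equal-mass (adaptive) binning behind ACE: predictions are sorted by confidence per class, split into ranges containing roughly the same number of predictions, and the gap between empirical accuracy and average confidence is averaged over all classes and ranges. The function name, the default of 15 ranges, and the per-class loop are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def adaptive_calibration_error(probs, labels, num_ranges=15):
    """Sketch of Adaptive Calibration Error (ACE) with equal-mass bins.

    probs:      (N, K) array of predicted class probabilities.
    labels:     (N,) array of integer class labels.
    num_ranges: R, the number of adaptive (equal-population) ranges.
    """
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    n, k = probs.shape
    total = 0.0
    for c in range(k):                       # measure calibration per class
        conf = probs[:, c]
        correct = (labels == c).astype(float)
        order = np.argsort(conf)             # sort predictions by confidence
        # Split the sorted predictions into R ranges of roughly equal size,
        # so every range covers the same number of predictions.
        for idx in np.array_split(order, num_ranges):
            if idx.size == 0:
                continue
            acc = correct[idx].mean()        # empirical frequency in this range
            avg_conf = conf[idx].mean()      # average predicted probability
            total += abs(acc - avg_conf)
    return total / (k * num_ranges)
```

Because every class's probabilities enter the sum (not just the argmax), the metric reflects the paper's point that predictions beyond the chosen output class matter; equal-mass ranges also avoid the sparsely populated fixed-width bins that make standard calibration-error estimates inefficient.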