Local calibration: metrics and recalibration

A calibration method that takes sample similarity into account, automatically providing group calibration even when groups are unknown.

To assess the calibration of a classifier, one typically looks only at the confidences it outputs, independently of the properties of the samples considered: calibration is, so to speak, “global in feature space” (for an introduction to calibration, consider reading our review of the topic or attending our training). However, when one is interested in fair outcomes, calibration matters within groups of interest like ethnicity or gender: one wants to avoid being consistently underconfident in one subgroup (e.g. a rare disease) and overconfident in another (e.g. the flu). Under some conditions one can expect such imbalances to be avoided automatically, essentially when prediction error is very low, as investigated in [Liu19I]. Conversely, group calibration is in itself insufficient for fairness: one can simply predict class frequencies within subgroups and obtain a calibrated but terrible predictor.

Nevertheless, in practice it often makes sense to consider calibration conditioned on subgroups, e.g. when the model is not accurate enough to auto-calibrate. One then wishes to minimise the Expected Calibration Error (ECE), i.e. the average discrepancy between confidence and accuracy, within subsets of the population. [Heb18M] put forth an algorithm for so-called multi-calibration, i.e. calibration with respect to all groups in a given concept class, not only those explicitly encoded as categorical features, but its computational cost is high.
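
For reference, here is a minimal sketch of the standard binned ECE computation (a textbook formulation, not code from any of the papers discussed):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: population-weighted average over equal-width confidence
    bins of |empirical accuracy - mean confidence| within each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.sum() / n * gap
    return ece
```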

In order to achieve calibration for intrinsically defined groups, as opposed to groups explicitly defined by some categorical input feature, the authors of [Luo22L] (presented this August at UAI 2022) propose to use soft clusters and a modified calibration error. First, they define a similarity measure between samples using a Laplacian kernel over learned features. In their experiments, the features are taken from intermediate layers of a number of networks, the rationale being that Euclidean distances are semantically meaningful in these feature spaces. Second, ECE is extended to include a weighting by this kernel.
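
Concretely, and eliding the confidence binning that the full definition also uses, one natural way to write down the idea (not necessarily the paper’s exact notation) is as a kernel-weighted discrepancy between correctness and confidence at each sample $x$:

$$\mathrm{LCE}(x) = \left| \frac{\sum_i K(x, x_i)\, \mathbf{1}[\hat{y}_i = y_i]}{\sum_i K(x, x_i)} - \frac{\sum_i K(x, x_i)\, \hat{c}_i}{\sum_i K(x, x_i)} \right|, \qquad K(x, x') = \exp\!\left( -\frac{\lVert \phi(x) - \phi(x') \rVert}{\gamma} \right),$$

where $\phi$ denotes the learned features, $\hat{y}_i$ the predictions, $\hat{c}_i$ the predicted confidences and $\gamma$ the kernel bandwidth. Taking the maximum over samples, $\mathrm{MLCE} = \max_x \mathrm{LCE}(x)$, yields the quantity discussed below.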

The resulting definition of Local Calibration Error (LCE) is stricter than the conventional Maximum Calibration Error (MCE, itself stricter than ECE): minimising the Maximum Local Calibration Error (MLCE) leads to significant improvements in group-wise MCE, despite the fact that the method proposed for this optimisation (LoRe, Local Recalibration) is unaware of the actual groups.
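
In code, the pointwise LCE and its maximum could be computed roughly as follows (a sketch under the formalisation above, with hypothetical names; the choice of norm in the kernel is an assumption, and the confidence binning is again elided):

```python
import numpy as np

def laplacian_kernel(features, gamma):
    """Pairwise Laplacian kernel exp(-||phi(x) - phi(x')||_1 / gamma)
    over learned features."""
    dists = np.abs(features[:, None, :] - features[None, :, :]).sum(-1)
    return np.exp(-dists / gamma)

def local_calibration_errors(confidences, correct, kernel):
    """Pointwise LCE: kernel-weighted accuracy minus kernel-weighted
    confidence around each sample."""
    weights = kernel / kernel.sum(axis=1, keepdims=True)
    return np.abs(weights @ correct - weights @ confidences)

# MLCE is the worst local miscalibration over the evaluation set:
# mlce = local_calibration_errors(conf, corr, laplacian_kernel(feats, 0.2)).max()
```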

LoRe is a generalised form of histogram binning in which samples in the same confidence bin are weighted by the kernel: predicted confidences are corrected to match a kernel-weighted average of correctness over similar samples. This yields strong results in group-wise calibration at the cost of some global calibration error.
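
A minimal sketch of such a recalibration step, under our reading of the description above (names, binning scheme and kernel details are assumptions, not the authors’ code):

```python
import numpy as np

def lore_recalibrate(test_conf, test_feats, cal_conf, cal_correct, cal_feats,
                     gamma=0.2, n_bins=10):
    """Kernel-weighted histogram binning: the corrected confidence of a test
    point is the kernel-weighted accuracy of the calibration samples falling
    in the same confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    test_bin = np.clip(np.digitize(test_conf, edges) - 1, 0, n_bins - 1)
    cal_bin = np.clip(np.digitize(cal_conf, edges) - 1, 0, n_bins - 1)
    corrected = test_conf.astype(float).copy()
    for j, b in enumerate(test_bin):
        same_bin = cal_bin == b
        if not same_bin.any():
            continue  # no calibration data in this bin: keep the raw confidence
        dists = np.abs(cal_feats[same_bin] - test_feats[j]).sum(axis=1)
        w = np.exp(-dists / gamma)  # Laplacian kernel weights
        corrected[j] = w @ cal_correct[same_bin] / w.sum()
    return corrected
```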

Table 1. Performance on downstream fairness, as measured by maximum group-wise MCE (lower is better). Experimental settings as described in Section 5.3. Mean and standard deviations are computed over 60 random seeds for settings 1 and 4, and 20 for settings 2 and 3. Best results are bold.

It is interesting to consider what the bandwidth of the kernel means. As it grows towards infinity, every sample becomes equally similar to every other and MLCE converges to MCE. As it decreases towards 0, the effective neighbourhood of a point shrinks to the point itself, and MLCE becomes the discrepancy between the predicted confidence and the 0–1 correctness of a single sample.
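
In formulas, with $K(x, x') = \exp(-\lVert\phi(x)-\phi(x')\rVert/\gamma)$ as above:

$$\gamma \to \infty: \; K \to 1 \;\Rightarrow\; \mathrm{LCE}(x) \to |\bar{a} - \bar{c}|, \qquad \gamma \to 0: \; K \to \mathbf{1}[x = x'] \;\Rightarrow\; \mathrm{LCE}(x) \to \left|\mathbf{1}[\hat{y} = y] - \hat{c}\right|,$$

where $\bar{a}$ and $\bar{c}$ denote the (bin-wise) average accuracy and confidence.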

Figure 4. MLCE vs. kernel bandwidth $\gamma$ for ImageNet. LoRe (with t-SNE and $\gamma = 0.2$) achieves the lowest MLCE for a wide range of $\gamma$. This suggests that LoRe leads to lower LCE values across the whole dataset.
