Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value

The out-of-bag (OOB) error estimate provides a simple and scalable approach to data valuation. Unlike marginal-contribution methods, Data-OOB can reuse the weak learners already trained in a bagging process, providing a statistically interpretable measure of data value.

The main limitation of most data valuation methods is computational cost. The popular family based on the marginal contribution of samples to subsets of the training data typically suffers from an exponential cost (see Data valuation for a quick summary of the different approaches).[1]

[1] In the case of Leave-One-Out, the complexity is instead $\mathcal{O}(n)$, but the contribution of a single sample to the rest of the dataset can be small enough that the signal is lost in the noise, making the method unreliable (although surprisingly performant in some situations).

Despite attempts to reduce variance and improve convergence of Monte Carlo estimates, like stratified sampling [Mal14B, Cas17I, Wu23V], or different approximation strategies, like the Lasso-based one introduced in AME [Lin22M], this family of methods struggles to scale to larger datasets. Several forms of antithetic sampling, which draw correlated instead of independent samples for the Monte Carlo estimate, have also been proposed, with mixed results.[2]

[2] The reason is a No-Free-Lunch theorem for antithetics in [Ren19A]: for every fixed sampling strategy, there is a utility function for which the variance is not reduced.

These issues motivate looking for alternative definitions of value: Data-OOB [Kwo23D] addresses them with an out-of-bag test-error estimate. When the model of interest for valuation is a bagging estimator, this enables very fast valuation of large datasets by reusing the already trained weak learners (of course, one can also run the bagging with another model specifically for valuation). It is even faster than KNN-Shapley [Jia19aE], which exploits the local structure of KNN to provide a closed-form expression for the Shapley values.

Say one has trained weak learners $f_1, \dots, f_B$ and wants to compute a value for a training sample $z_i = (x_i, y_i)$. A subset $\{f_j : j \in I_i\}$ of the learners won’t have seen this sample in their bootstrap datasets. Data-OOB defines the value to be

$$v(z_i) := \frac{1}{|I_i|} \sum_{j \in I_i} s(f_j(x_i), y_i),$$

for some scoring function $s$, like the 0-1 score for classification or the negative MSE for regression. Note that the classical OOB error estimate is the mean of the $v(z_i)$. Data-OOB is also statistically interpretable: under certain assumptions, it identifies important data points consistently with the infinitesimal jackknife influence function, see Proposition 3.1 of [Kwo23D].[3]

[3] Like the jackknife, its infinitesimal variant is used to reduce bias and estimate variance, but it does so using a Taylor expansion to approximate the effect of removing samples. The (empirical) influence function is essentially the first-order term in this expansion: a directional derivative of the statistic in the direction of a sample of interest. This is however a more fundamental concept than the influence function as popularised in the ML literature by [Koh17U], where the chain rule is used to compute influence over test points.
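To make the definition concrete, here is a minimal sketch (ours, not the authors’ implementation) that computes Data-OOB values from a scikit-learn `BaggingClassifier`, reusing its trained weak learners and the bootstrap indices stored in `estimators_samples_`, with the 0-1 score as $s$:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset and a bagging model with B = 100 weak learners.
X, y = make_classification(n_samples=1000, random_state=0)
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
model.fit(X, y)

n = len(X)
score_sums = np.zeros(n)  # running sum of s(f_j(x_i), y_i) over OOB learners
oob_counts = np.zeros(n)  # |I_i|: number of learners that did not see z_i

for f_j, sampled in zip(model.estimators_, model.estimators_samples_):
    oob = np.setdiff1d(np.arange(n), sampled)  # indices outside this bootstrap sample
    score_sums[oob] += (f_j.predict(X[oob]) == y[oob]).astype(float)  # 0-1 score
    oob_counts[oob] += 1

# Data-OOB value: mean OOB score per sample (NaN if a sample was never left out).
values = np.where(oob_counts > 0, score_sums / np.maximum(oob_counts, 1), np.nan)
```

Ranking the training points by these values is then all that is needed for downstream tasks such as mislabeled-data detection or low-value data removal.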

Figure 1. [Kwo23D], fig. 4. Test accuracy curves as a function of the percentage of low-value data removed. Datasets with $n=1000$ (top) and $n = 10000$ (bottom) samples. Higher curves indicate better performance for data valuation. The error bar indicates a 95% confidence interval based on 50 independent experiments.

In their experiments, the authors show Data-OOB to vastly outperform AME and KNN-Shapley in speed, and KNN-Shapley, Data Shapley and Beta Shapley in mislabeled-data detection. Results are not as dramatic for the data removal task, where the points of lowest value are successively removed from the dataset and test performance is measured after each removal, but the method is at least on par with the rest while being much faster (given a pre-trained bagged model).
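As a rough illustration of that removal benchmark (our own sketch under assumed train/test splits `X_tr, y_tr, X_te, y_te`, not the paper’s exact protocol), one can retrain a model after dropping increasing fractions of the lowest-valued points and track test accuracy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumes `values` from the previous sketch were computed on the training split
# (X_tr, y_tr), with (X_te, y_te) held out for evaluation.
order = np.argsort(values)  # ascending: lowest-valued points first
removal_curve = []
for frac in np.linspace(0.0, 0.5, 11):
    keep = order[int(frac * len(order)):]  # drop the lowest-valued fraction
    clf = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])
    removal_curve.append((frac, clf.score(X_te, y_te)))
```

A good valuation method should yield a curve that stays flat or even improves as the lowest-valued points are removed first.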

Figure 2. [Kwo23D], fig. 8. Test accuracy curves as a function of the percentage of low-value data removed. Datasets with $n=1000$ samples.

The method performs best for the datasets depicted in Figure 1. For the remaining 8 datasets, seen in Figure 2 and Figure 3, the comparison is not so clear-cut, and at times a bit difficult to interpret. Still, these are impressive results that warrant thorough reproduction! You can try the authors’ code (link in the references) or use TransferLab’s pyDVL, the Python Data Valuation Library, where you will also find many other methods and examples. For Data-OOB you can use any model for the bagging.

Figure 3. [Kwo23D], fig. 9. Test accuracy curves as a function of the percentage of low-value data removed. Datasets with $n=10000$ samples.

References
