Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value

The out-of-bag (OOB) error estimate provides a simple and scalable approach to data valuation. Unlike marginal-contribution methods, Data-OOB can reuse the weak learners already trained in a bagging process, providing a statistically interpretable measure of data value.

The main limitation of most data valuation methods is computational cost. The popular family based on the marginal contribution of samples to subsets of the training data typically suffers from an exponential cost (see Data valuation for a quick summary of the different approaches).[1]

[1] In the case of Leave-One-Out, the complexity is instead $\mathcal{O}(n)$, but the contribution of a single sample to the rest of the dataset can be small enough that the signal is lost in the noise, making the method unreliable (although surprisingly performant in some situations).

Despite attempts to reduce variance and improve convergence of Monte Carlo estimates, like stratified sampling [Mal14B, Cas17I, Wu23V], or different approximation strategies, like the Lasso-based one introduced in AME [Lin22M], this family of methods struggles to scale to larger datasets. Several forms of antithetic sampling, which draw correlated instead of independent samples for the Monte Carlo estimate, have also been proposed, with mixed results.[2]

[2] The reason is a No-Free-Lunch theorem for antithetics in [Ren19A]: for every fixed sampling strategy, there is a utility function for which the variance is not reduced.

These issues motivate looking for alternative definitions of value: Data-OOB [Kwo23D] addresses them with an out-of-bag test-error estimate. When the model of interest for valuation is a bagging estimator, this enables very fast valuation of large datasets by reusing the already trained weak learners (of course, one can also run the bagging with another model specifically for valuation). It is even faster than KNN-Shapley [Jia19aE], which exploits the local structure of KNN to provide a closed-form expression for the Shapley values.

Say one has trained weak learners $f_1, \dots, f_B$ and wants to compute a value for a training sample $z_i = (x_i, y_i)$. A subset $\{f_j : j \in I_i\}$ of the learners won’t have seen this sample in their bootstrap datasets. Data-OOB defines the value to be

$$v(z_i) := \frac{1}{|I_i|} \sum_{j \in I_i} s(f_j(x_i), y_i),$$

for some scoring function $s$, like the 0-1 score for classification or the negative MSE for regression. Note that the classical OOB error estimate is the mean of the $v(z_i)$. Data-OOB is also statistically interpretable: under certain assumptions, it identifies important data points consistently with the infinitesimal jackknife influence function, see Proposition 3.1 of [Kwo23D].[3]

[3] Like the jackknife, its infinitesimal variant is used to reduce bias and estimate variance, but it does so using a Taylor expansion to approximate the effect of removing samples. The (empirical) influence function is essentially the first-order term in this expansion: a directional derivative of the statistic in the direction of a sample of interest. This is however a more fundamental concept than the influence function as popularised in the ML literature by [Koh17U], where the chain rule is used to compute influence over test points.
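To make the definition concrete, here is a minimal sketch (ours, not the authors’ implementation) that computes Data-OOB values from a scikit-learn `BaggingClassifier`, reusing its trained weak learners and the bootstrap indices stored in `estimators_samples_`, with the 0-1 score as $s$:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset and a bagging model with B = 100 weak learners.
X, y = make_classification(n_samples=1000, random_state=0)
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
model.fit(X, y)

n = len(X)
score_sums = np.zeros(n)  # running sum of s(f_j(x_i), y_i) over OOB learners
oob_counts = np.zeros(n)  # |I_i|: number of learners that did not see z_i

for f_j, sampled in zip(model.estimators_, model.estimators_samples_):
    oob = np.setdiff1d(np.arange(n), sampled)  # indices outside this bootstrap sample
    score_sums[oob] += (f_j.predict(X[oob]) == y[oob]).astype(float)  # 0-1 score
    oob_counts[oob] += 1

# Data-OOB value: mean OOB score per sample (NaN if a sample was never left out).
values = np.where(oob_counts > 0, score_sums / np.maximum(oob_counts, 1), np.nan)
```

Ranking the training points by these values is then all that is needed for downstream tasks such as mislabeled-data detection or low-value data removal.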

Figure 1. [Kwo23D], fig. 4. Test accuracy curves as a function of the percentage of low-value data removed. Datasets with $n=1000$ (top) and $n = 10000$ (bottom) samples. Higher curves indicate better performance for data valuation. The error bar indicates a 95% confidence interval based on 50 independent experiments.

In their experiments, the authors show Data-OOB to vastly outperform AME and KNN-Shapley in speed, and KNN-Shapley, Data Shapley and Beta Shapley in mislabeled-data detection. Results are not as dramatic for the data removal task, where the points of lowest value are successively removed from the dataset and test performance is measured after each removal, but the method is at least on par with the rest while being much faster (given a pre-trained bagged model).
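As a rough illustration of that removal benchmark (our own sketch under assumed train/test splits `X_tr, y_tr, X_te, y_te`, not the paper’s exact protocol), one can retrain a model after dropping increasing fractions of the lowest-valued points and track test accuracy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumes `values` from the previous sketch were computed on the training split
# (X_tr, y_tr), with (X_te, y_te) held out for evaluation.
order = np.argsort(values)  # ascending: lowest-valued points first
removal_curve = []
for frac in np.linspace(0.0, 0.5, 11):
    keep = order[int(frac * len(order)):]  # drop the lowest-valued fraction
    clf = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])
    removal_curve.append((frac, clf.score(X_te, y_te)))
```

A good valuation method should yield a curve that stays flat or even improves as the lowest-valued points are removed first.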

Figure 2. [Kwo23D], fig. 8. Test accuracy curves as a function of the percentage of low-value data removed. Datasets with $n=1000$ samples.

The method performs best for the datasets depicted in Figure 1. For the remaining 8 datasets, seen in Figure 2 and Figure 3, the comparison is not so clear-cut, and at times a bit difficult to interpret. Still, these are impressive results that warrant thorough reproduction! You can try the authors’ code (link in the references) or use TransferLab’s pyDVL, the Python Data Valuation Library, where you will also find many other methods and examples. For Data-OOB you can use any model for the bagging.

Figure 3. [Kwo23D], fig. 9. Test accuracy curves as a function of the percentage of low-value data removed. Datasets with $n=10000$ samples.

References
