Data Banzhaf: A Robust Data Valuation Framework for Machine Learning

A definition of data value with much higher rank stability than Shapley values, provably the most robust among all semi-values, which mitigates the pressing issue of utility variance in ML applications.

In our recent pill Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning [Kwo22B], we discussed how relaxing the Shapley axioms and working with semi-values (i.e. allowing for certain choices of weights for the marginal utilities) makes the resulting values more useful in ML applications.

One such semi-value is the Banzhaf index, which uses a constant weight across all subsets:

$$ v_{\text{Bzf}}(x_i) = \frac{1}{2^{n-1}} \sum_{S \subseteq D \setminus \{x_i\}} [u(S \cup \{x_i\}) - u(S)]. $$
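For intuition, here is a minimal brute-force sketch of this definition (the points, the toy utility and all names are made up for illustration; a real utility would train and score a model on each subset, and the enumeration is exponential in $n$, so this is only viable for tiny examples):

```python
from itertools import combinations


def exact_banzhaf(points, u):
    """Banzhaf value of each point: the plain average of its marginal
    contributions over all 2^(n-1) subsets of the remaining points."""
    n = len(points)
    values = {}
    for i, x in enumerate(points):
        rest = points[:i] + points[i + 1:]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            for s in combinations(rest, k):
                total += u(set(s) | {x}) - u(set(s))
        values[x] = total / 2 ** (n - 1)
    return values


# Toy utility with diminishing returns in the subset size.
print(exact_banzhaf(["a", "b", "c"], lambda s: len(s) ** 0.5))
```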

This constant weighting turns out to be very helpful in mitigating the bane of Shapley values for supervised ML: noise. For stochastic utilities, e.g. models trained with a stochastic method like SGD, there will be sets $S$ for which $u(S)$ has high variance, e.g. sets of tiny size, although in general there will be some variance for all sets.

If the coefficients weighting the marginal utilities happen to be highest for exactly those sets (as is the case for Shapley), the resulting values can be rather uninformative. Mitigation strategies, like averaging multiple evaluations of $u(S)$, or generalized weights like those we saw in Beta Shapley, improve the situation, but they either increase the computational burden considerably or are ineffective in some contexts.

[Wan23D] addresses exactly this issue of noise in the estimate of the utility. The authors begin by defining sensitivity to variance in the utility in a worst-case sense: semi-values are ranked by the largest perturbation of the utility they can tolerate before the order of any two samples, sorted by value, flips.

They then show that the Banzhaf value performs best with respect to this definition, i.e. it ensures maximal rank stability, and that it also achieves the best Lipschitz constant among all semi-values, both in $\ell_2$ and $\ell_\infty$. This results in an impressive improvement in Spearman correlation across runs with respect to vanilla Shapley. The intuition is simple: for any given choice of weights for a semi-value, one can construct a utility whose variance is high exactly where the weights are highest. In such a worst-case scenario, the best one can do is to take a constant weight.

Figure from [Wan23D]: box plots of the estimated (a) LOO, (b) Shapley, and (c) Banzhaf values of 20 randomly selected CIFAR10 images, 5 of which are mislabeled (shown in red; clean images in green). The variance is due only to the stochasticity in utility evaluation. "SP" is the average Spearman index across different runs.
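The experiment can be reproduced in spirit with a small self-contained simulation (a toy setup of our own, not the paper's): a deterministic utility over a handful of points is perturbed with fresh Gaussian noise at every evaluation, semi-values are computed exactly with Shapley and with Banzhaf weights, and the Spearman correlation of the resulting rankings across repeated runs is compared. The quality vector, noise level and utility are all invented for illustration.

```python
from itertools import combinations
from math import comb

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n = 8
quality = rng.uniform(size=n)  # hidden "true" quality of each data point


def noisy_utility(subset, sigma=0.2):
    """Toy utility with diminishing returns, plus fresh Gaussian noise
    at every evaluation (mimicking the stochasticity of training)."""
    base = np.sqrt(quality[list(subset)].sum()) if subset else 0.0
    return base + rng.normal(0.0, sigma)


def semivalues(weight):
    """Exact semi-values for a given weight function of the subset size k."""
    vals = np.zeros(n)
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        for k in range(n):
            for s in combinations(rest, k):
                vals[i] += weight(k) * (noisy_utility(s + (i,)) - noisy_utility(s))
    return vals


weights = {
    "Shapley": lambda k: 1 / (n * comb(n - 1, k)),  # largest for tiny/huge sets
    "Banzhaf": lambda k: 1 / 2 ** (n - 1),          # constant
}

for name, w in weights.items():
    runs = [semivalues(w) for _ in range(5)]
    rhos = [spearmanr(runs[a], runs[b])[0] for a in range(5) for b in range(a + 1, 5)]
    print(f"{name}: mean Spearman correlation across runs = {np.mean(rhos):.2f}")
```

Because each noisy evaluation enters the Shapley estimate with large weights at the extreme set sizes, its values fluctuate more between runs than the uniformly-weighted Banzhaf values, and the printed correlations reflect this.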

The authors also tweak the standard Monte Carlo approximation to shave off a factor of $n$ from the sample complexity (although this can also be achieved via caching).
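The trick, called Maximum Sample Reuse (MSR) in the paper, exploits the constant weights: if one samples subsets $S$ uniformly from all of $2^D$, then the Banzhaf value of $x_i$ equals $\mathbb{E}[u(S) \mid x_i \in S] - \mathbb{E}[u(S) \mid x_i \notin S]$, so a single evaluation of $u(S)$ updates the estimates of all $n$ points simultaneously. A minimal sketch of the idea:

```python
import numpy as np


def banzhaf_msr(n, u, n_samples=1000, seed=0):
    """Monte Carlo Banzhaf values where every sampled subset updates the
    estimates of all n points at once (one utility call per sample)."""
    rng = np.random.default_rng(seed)
    sums = np.zeros((2, n))    # accumulated utility: row 0 = i not in S, row 1 = i in S
    counts = np.zeros((2, n))  # sample counts for each bucket
    for _ in range(n_samples):
        mask = rng.random(n) < 0.5      # include each point with probability 1/2
        util = u(np.flatnonzero(mask))  # single (expensive) utility call
        rows = mask.astype(int)
        sums[rows, np.arange(n)] += util
        counts[rows, np.arange(n)] += 1
    means = sums / np.maximum(counts, 1)
    return means[1] - means[0]


# Toy check against the exact values from the snippet above.
print(banzhaf_msr(3, lambda s: len(s) ** 0.5, n_samples=20_000))
```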

If you are interested in using data valuation algorithms: as of version 0.7.0, TransferLab's library pyDVL provides an implementation of Data Banzhaf, as well as most game-theoretic methods for data valuation, all with exact and several Monte Carlo approximations, parallelisation and distributed caching. To try it out, install with `pip install pydvl` and read the documentation.
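For illustration, usage looks roughly like the following. This is a sketch based on the 0.7.x API as we recall it; names such as `compute_banzhaf_semivalues`, `MaxUpdates` or `Dataset.from_sklearn` may differ between versions, so do check the documentation before copying.

```python
# NOTE: names follow pyDVL 0.7.x as we recall them; verify against the docs.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import MaxUpdates, compute_banzhaf_semivalues

# Utility u(S): score of a model trained on the subset S of the training data.
data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
utility = Utility(LogisticRegression(max_iter=1000), data)

# Monte Carlo Banzhaf values, stopping after a fixed number of value updates.
values = compute_banzhaf_semivalues(utility, done=MaxUpdates(1000))
print(values.to_dataframe())
```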

References

[Kwo22B] Kwon, Yongchan, and James Zou. Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning. AISTATS 2022.

[Wan23D] Wang, Jiachen T., and Ruoxi Jia. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. AISTATS 2023.
