Variance Reduced Shapley Value Estimation for Trustworthy Data Valuation

VRDS, an effective method to reduce the variance of Data Shapley, based on a stratified sampling technique with precomputed coefficients.

In two of our previous paper pills about Data Valuation methods, namely Beta-Shapley and Data Banzhaf, we discussed how to make the computed values more robust to noise in the utility evaluation.

Whereas [Kwo22B] and [Wan23D] address the noise in stochastic utilities, in the paper we study here, [Wu23V], the authors improve the quality of the values by tackling a different problem. They reduce the variance of the Monte Carlo estimate using stratified sampling, similarly to previous work in [Mal14B] and [Cas17I]. They call their method Variance Reduced Data Shapley (VRDS).

Starting with the combinatorial formulation of Shapley values, one can divide the coalitions into strata according to their size when calculating the marginal contributions:

$$\begin{align} v(x_i) &= \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}} \binom{n-1}{|S|}^{-1} \left[u(S \cup \{x_i\}) − u(S)\right]\\ &= \frac{1}{n} \sum_{k=0}^{n-1} \binom{n-1}{k}^{-1} \sum_{S \subseteq D \setminus \{x_i\}, |S| = k} \left[u(S \cup \{x_i\}) − u(S)\right]. \end{align}$$

Figure 1: Example of stratified sampling for the set $\{1, 2, 3, 4\}.$

Grouping the coalitions which do not contain the data point $x_i$ by size, we obtain $n$ strata $S_0, S_1, \dots , S_{n−1}$ such that $S_k = \{S \subseteq D \setminus \{x_i\}, |S| = k\}$ contains all the coalitions of size $k.$

The marginal contribution of data point $i$ within stratum $S_k$ is:

$$ v^k(x_i) = \binom{n-1}{k}^{-1} \sum_{S \subseteq D \setminus \{x_i\}, |S| = k} \left[u(S \cup \{x_i\}) − u(S)\right], $$

and with it, the Shapley value of point $i$ can be written as:

$$ v(x_i) = \frac{1}{n} \sum_{k=0}^{n-1} v^k(x_i). $$
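To make the decomposition concrete, here is a minimal brute-force sketch (our own illustration, not from the paper, and only feasible for very small $n$) that computes the per-stratum contributions $v^k(x_i)$ and the exact Shapley value for a user-supplied utility defined on sets of indices:

```python
from itertools import combinations
from math import comb


def exact_shapley_by_strata(u, n, i):
    """Brute-force Shapley value of point i, computed stratum by stratum.

    u: utility function taking a frozenset of indices from {0, ..., n-1}.
    Enumerates all coalitions, so this is only usable for tiny n.
    """
    others = [j for j in range(n) if j != i]
    v_k = []  # per-stratum contributions v^k(x_i)
    for k in range(n):  # stratum k: coalitions of size k not containing i
        total = sum(
            u(frozenset(S) | {i}) - u(frozenset(S))
            for S in combinations(others, k)
        )
        v_k.append(total / comb(n - 1, k))
    return sum(v_k) / n, v_k


# Toy additive utility (hypothetical): each point contributes its own weight,
# so the Shapley value of a point is exactly that weight.
weights = [1, 5, 2, 2]
u = lambda S: sum(weights[j] for j in S)
value, per_stratum = exact_shapley_by_strata(u, n=4, i=1)
print(value)  # 5.0
```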

A standard Monte Carlo approximation (to reduce the number of evaluations of $u$) is then:

$$ \hat{v}(x_i) = \frac{1}{n} \sum_{k=0}^{n-1} \hat{v}^k(x_i) = \frac{1}{n} \sum_{k=0}^{n-1} \frac{1}{m_k} \sum_{j=1}^{m_k} \left[u(S^k_j \cup \{x_i\}) − u(S^k_j)\right], $$

where $S^k_j$ is the $j$th sample coalition in stratum $k$ and $m_k$ is the number of samples in the stratum. Grouping coalitions by size into strata allows us to optimise the number of samples taken from each stratum.
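Below is a minimal sketch of this stratified estimator (again our own illustration), assuming the per-stratum sample counts $m_k$ are given; their choice is discussed next:

```python
import random


def stratified_shapley_estimate(u, n, i, m_per_stratum, seed=42):
    """Stratified Monte Carlo estimate of the Shapley value of point i.

    u: utility function taking a frozenset of indices from {0, ..., n-1}.
    m_per_stratum: sample counts m_k for the strata k = 0, ..., n-1.
    """
    rng = random.Random(seed)
    others = [j for j in range(n) if j != i]
    estimate = 0.0
    for k, m_k in enumerate(m_per_stratum):
        if m_k == 0:
            continue
        deltas = []
        for _ in range(m_k):
            # Sample a coalition of size k uniformly from the stratum.
            S = frozenset(rng.sample(others, k))
            deltas.append(u(S | {i}) - u(S))
        estimate += sum(deltas) / m_k  # \hat{v}^k(x_i)
    return estimate / n
```

With the additive toy utility from the previous snippet every marginal contribution is identical, so any allocation of samples is exact; for a learned utility, the variance of the estimate depends on how the $m_k$ are chosen.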

In Theorem 4.1 of [Wu23V], the authors show that the choice of $m_k$ which optimally reduces variance for a given number of samples $m$ is proportional to $\sigma_{i,k},$ the standard deviation of $\hat{v}^k(x_i)$:

$$ \tilde{m}_k = m\frac{\sigma_{i,k}}{\sum_{j=0}^{n−1} \sigma_{i,j}}, \quad k = 0, \dots , n − 1. $$
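As a toy illustration with made-up numbers: for $n = 3$ and a budget of $m = 100$ samples, standard deviations $(\sigma_{i,0}, \sigma_{i,1}, \sigma_{i,2}) = (3, 2, 1)$ yield $\tilde{m}_0 = 50,$ $\tilde{m}_1 \approx 33.3$ and $\tilde{m}_2 \approx 16.7$: noisier strata receive proportionally more samples.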

Because the variance of the estimators $\hat{v}^k(x_i)$ is unknown, the authors suggest substituting a general $f(k)$ for $\sigma_{i,k}$ and propose a few different possibilities. They pick

$$ f(k) = (k+1)^a, \quad a < 0, $$

a choice that, when plugged into the formula above, generalizes the optimal value obtained in [Mal14B].¹

¹ Contrary to VRDS, the allocation in [Mal14B] is derived from a Hoeffding bound on the quality of the estimator, after solving a simple minimization problem, leading to the specific choice of exponent $a = -\frac{2}{3},$ cf. [Mal14B], Eq. (14).

Taking into account that the sample sizes must be integers, we can set $m_k = \max(1, \lfloor \tilde{m}_k \rfloor).$ However, this implies that some samples may be left unused, since $\sum_{k=0}^{n-1} m_k \le m.$ In that case, the authors sequentially increment $m_k$ for $k = 0, \dots, n − 1$ until all $m$ samples are allocated.
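A rough sketch of this allocation step, as we read it (this is not the authors' code), using the heuristic $f(k) = (k+1)^a$ in place of the unknown standard deviations:

```python
def allocate_samples(m, n, a=-0.5):
    """Heuristic per-stratum sample counts m_k for a total budget of m samples.

    f(k) = (k + 1)**a stands in for the unknown standard deviations sigma_{i,k}.
    """
    f = [(k + 1) ** a for k in range(n)]
    total_f = sum(f)
    m_tilde = [m * f_k / total_f for f_k in f]
    # Integer counts with at least one sample per stratum.
    m_k = [max(1, int(t)) for t in m_tilde]
    # Hand out any leftover budget one sample at a time, from k = 0 upwards.
    k = 0
    while sum(m_k) < m:
        m_k[k] += 1
        k = (k + 1) % n
    return m_k


print(allocate_samples(m=100, n=10, a=-0.5))
```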

We note that, previously, [Cas17I] instead used estimates $\tilde{\sigma}_{i,k}$ of the per-stratum standard deviations, allocating samples proportionally to the estimated variances:

$$ \tilde{m}_{i,k} = m \frac{\tilde{\sigma}_{i,k}^2}{ \sum_{j=0}^{n-1} \sum_{l=0}^{n-1} \tilde{\sigma}_{j,l}^2}, \quad k = 0, \dots , n − 1. $$
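For comparison, a minimal sketch of this allocation rule (our own illustration), assuming the variance estimates $\tilde{\sigma}_{i,k}^2$ are already available, e.g. from a pilot round of sampling, which is not shown:

```python
def castro_allocation(m, sigma_sq):
    """Allocate a budget of m samples across all (point, stratum) pairs.

    sigma_sq[i][k] is the estimated variance for point i and stratum size k.
    Returns real-valued allocations; rounding to integers is handled separately.
    """
    total = sum(sum(row) for row in sigma_sq)
    return [[m * s / total for s in row] for row in sigma_sq]


sigma_sq = [[4.0, 1.0], [2.0, 1.0]]  # hypothetical estimates for n = 2 points
print(castro_allocation(m=80, sigma_sq=sigma_sq))  # [[40.0, 10.0], [20.0, 10.0]]
```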

Sadly, all of these approaches assume that the total number of samples $m$ is known beforehand, which is almost never the case in practice.

To empirically demonstrate that the stratified sampling algorithm of VRDS estimates the exact data Shapley values with a smaller variance than the permutation sampling algorithm, the authors run several experiments on different datasets:

Comparison of the variance of VRDS and permutation sampling:

Figure 2: Data Shapley values of 20 randomly selected data points from the FashionMNIST dataset and their estimates. The experiment is repeated 30 times to estimate values.

From Figure 2, we can clearly see that VRDS’s variance is lower than that of permutation sampling. However, in many data valuation applications we are more interested in the ranking of the points than in their actual values, and it is not clear whether the ranking induced by VRDS remains stable across experiments. The authors should have added an experiment to evaluate the ranking stability of VRDS, similar to what was done for Data Banzhaf [Wan23D].

Performance of VRDS using different exponents $a:$

Figure 3: Variance of VRDS when $a$ takes different values in the range $[−3, 3].$ The dataset is FashionMNIST, the models are KNN and LR and the data sizes are 100 and 30, respectively.

Figure 3 shows that VRDS consistently achieves lower variance than permutation sampling, and that $a=-1$ (black, bottom line) seems to yield the best result. It would have been interesting to see how the choice in [Mal14B] of $a=-\frac{2}{3}$ would have performed.

Removal of best and worst data groups:

Figure 4: Data group removal experiment on the FashionMNIST data set. There are 20 groups and each group has 100 data points. VRDS ($a = −\frac{1}{2}$) and permutation sampling are used to calculate the value of the data groups. (a) shows the removal of the most valuable data groups according to the estimated data Shapley values. (b) shows the removal of the most valuable data groups after shifting the estimated data Shapley values $v_i$ by $−100 \sigma^2_i.$

The plots in Figure 4 don’t clearly show that VRDS performs better than permutation sampling for removal tasks and, crucially, they lack confidence intervals. The similar performance could be due to the fact that, even though VRDS’s variance is lower, it does not necessarily converge to the correct values, which may indicate that the estimate is biased. It would also have been beneficial to compare against other methods such as Data Banzhaf.

Alas, only the first experiment includes confidence intervals. Without more experimental detail and a study of the quality of the approximation, possibly using rank-stability statistics, it is hard to judge the usefulness of VRDS in practice. The experiments generally lack detail, and no code was provided to reproduce the results.

However, as the authors note in their conclusion, stratified sampling seems like a worthwhile addition to data Shapley methods: it would be interesting to apply it to other data valuation methods that rely on sampling to estimate their values (e.g. Beta Shapley, Data Banzhaf, Truncated Monte Carlo Shapley).

If you are interested in testing out data valuation algorithms, consider using TransferLab’s pyDVL package. To try it out, install it with pip install pydvl and read the documentation.

References
