Good data, rather than better models or more data, is very often the key to a successful application of machine learning. Sophisticated models can only go so far and, almost invariably in real business applications, improving data acquisition, annotation and cleaning is a much better investment of resources than researching complex models. As part of our mission to help practitioners make the most of their time and their data, we have developed pyDVL, the Python Data Valuation Library.

As of version 0.8.1, pyDVL provides robust, parallel implementations of the most popular methods for data valuation. We are also developing a robust framework for the computation of influence functions, with lazy evaluation of influence factors, out-of-core computation and parallelization, which enables the computation of influence functions for large models and datasets.

Finally, we are also working on a benchmarking suite to compare all methods. In the documentation, we provide analyses of the strengths and weaknesses of key methods, as well as detailed examples for most of them.

Install with pip install pydvl, or check out the documentation.

Methods for data valuation

Leave One Out
Data Shapley [Gho19D] values with different sampling methods
Truncated Monte Carlo Shapley [Gho19D]
Exact Data Shapley for KNN [Jia19aE]
Owen sampling [Okh21M]
Group testing Shapley [Jia19aE]
Least Core [Yan21I]
Data Utility Learning [Wan22I]
Data Banzhaf [Wan23D]
Beta Shapley [Kwo22B]
Generalized semi-values, subsuming Shapley, Banzhaf and Beta Shapley in one framework
Data-OOB [Kwo23D]
Class-Wise Shapley [Sch22C]
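
Several of the methods listed above (Leave One Out, Data Shapley, Data Banzhaf) can be read as weighted averages of marginal contributions, which is the idea behind semi-values. The following is a minimal sketch of that idea in plain Python, computed exactly on a five-point toy dataset. It does not use pyDVL's API: the names `TRAIN`, `VALID`, `utility`, `semivalue` and the weight functions are hypothetical, the utility is assumed to be the validation accuracy of a 1-nearest-neighbour classifier, and the empty training set is assumed to have utility 0.

```python
from itertools import combinations
from math import comb

# Toy 1-D dataset of (feature, label) pairs.
# The last training point is deliberately mislabeled.
TRAIN = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1), (4.5, 0)]
VALID = [(0.4, 0), (4.4, 1), (4.6, 1), (5.2, 1)]


def utility(subset):
    """Validation accuracy of a 1-NN classifier trained on the given indices."""
    if not subset:
        return 0.0  # assumption: an empty training set has zero utility
    correct = 0
    for x_val, y_val in VALID:
        nearest = min(subset, key=lambda i: abs(TRAIN[i][0] - x_val))
        correct += TRAIN[nearest][1] == y_val
    return correct / len(VALID)


def semivalue(weight):
    """Exact semi-value: weighted sum of marginal contributions over all subsets."""
    n = len(TRAIN)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # k is the size of a subset S that excludes point i
            for s in combinations(others, k):
                total += weight(n, k) * (utility(s + (i,)) - utility(s))
        values.append(total)
    return values


def shapley_weight(n, k):
    # Every subset size is equally likely, as is every subset of a given size
    return 1.0 / (n * comb(n - 1, k))


def banzhaf_weight(n, k):
    # Every subset of the remaining n - 1 points is equally likely
    return 1.0 / 2 ** (n - 1)


def loo_weight(n, k):
    # Leave One Out only uses the marginal contribution to the full complement
    return 1.0 if k == n - 1 else 0.0


shapley = semivalue(shapley_weight)
banzhaf = semivalue(banzhaf_weight)
loo = semivalue(loo_weight)

for name, vals in [("shapley", shapley), ("banzhaf", banzhaf), ("loo", loo)]:
    print(name, [round(v, 4) for v in vals])
```

In this toy setting all three weightings assign the lowest value to the mislabeled point (index 4), and the Shapley values sum to the utility of the full dataset. Exact enumeration costs 2^(n-1) utility evaluations per point, which is why practical implementations rely on sampling or truncation instead.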

Methods for influence functions

Exact computation
Conjugate Gradient [Koh17U]
Linear (time) Stochastic Second-Order Algorithm (LiSSA) [Aga17S]
Arnoldi iteration [Sch22S]
Kronecker-factored Approximate Curvature [Mar15O]
Eigenvalue-corrected Kronecker-Factored Approximate Curvature [Geo18F, Gro23S]
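
The methods listed above mostly differ in how they handle the inverse Hessian that appears in the influence function. As a rough illustration, the sketch below compares exact computation with a hand-written conjugate gradient solver on a small ridge regression problem. It uses NumPy only, not pyDVL's API. The data, the helper `conjugate_gradient`, and the sign convention (the one from [Koh17U], where influence is minus the test gradient times the inverse Hessian times the training gradient) are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1  # samples, features, L2 regularisation strength

# Synthetic regression data (assumed, for illustration only)
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_true + 0.1 * rng.normal(size=n)
x_test = rng.normal(size=d)
y_test = x_test @ theta_true

# Objective: (1/n) * sum_i 0.5 * (x_i @ theta - y_i)**2 + 0.5 * lam * ||theta||^2
H = X.T @ X / n + lam * np.eye(d)        # Hessian of the objective
theta = np.linalg.solve(H, X.T @ y / n)  # its minimiser

grad_train = X * (X @ theta - y)[:, None]       # per-sample gradients, shape (n, d)
grad_test = x_test * (x_test @ theta - y_test)  # gradient of the test loss


def conjugate_gradient(hvp, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()   # residual b - A x, with x = 0
    p = r.copy()   # current search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = hvp(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


# Inverse-Hessian-vector product: CG needs only Hessian-vector products,
# so H^{-1} is never formed. For large models the product would come from
# automatic differentiation instead of an explicit matrix.
s_cg = conjugate_gradient(lambda v: H @ v, grad_test)
s_exact = np.linalg.solve(H, grad_test)

# Influence of each training point on the test loss (sign as in [Koh17U])
influences_cg = -grad_train @ s_cg
influences_exact = -grad_train @ s_exact

print(np.max(np.abs(influences_cg - influences_exact)))
```

The explicit Hessian is affordable here only because the model has five parameters. The iterative and factored approximations in the list exist precisely because that stops being true for large models.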

Roadmap

We are currently implementing or plan to implement:

Standardized benchmarks for all valuation methods (v0.10)
Improved parallelization strategies (v0.9)
LAVA [Jus23L]
$\delta$-Shapley [Wat23A]
Neural Tangent Kernel scorer [Wu22D]
Variance-reduced sampling methods for Shapley values
(Approximate) Maximum Influence Perturbation [Bro21A]

To see what new methods, features and improvements are coming, check out the issues on GitHub.

References

[Aga17S]

Second-Order Stochastic Optimization for Machine Learning in Linear Time, Naman Agarwal, Brian Bullins, Elad Hazan.

2017

First-order stochastic methods are the state-of-the-art in large-scale machine learning optimization owing to efficient per-iteration complexity. Second-order methods, while able to provide faster convergence, have been much less explored due to the high cost of computing the second-order information. In this paper we develop second-order stochastic methods for optimization problems in machine …

[Bro21A]

An Automatic Finite-Sample Robustness Metric: When Can Dropping a Little Data Make a Big Difference?, Tamara Broderick, Ryan Giordano, Rachael Meager.

Nov 2021

We propose a method to assess the sensitivity of econometric analyses to the removal of a small fraction of the data. Manually checking the influence of all possible small subsets is computationally infeasible, so we provide an approximation to find the most influential subset. Our metric, the "Approximate Maximum Influence Perturbation," is automatically computable for common methods including …

[Geo18F]

Fast Approximate Natural Gradient Descent in a Kronecker Factored Eigenbasis, Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, Pascal Vincent.

2018

Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, the covariance matrix they are based on becomes gigantic, making them inapplicable in their original form. This has motivated research into both simple diagonal …

[Gho19D]

Data Shapley: Equitable Valuation of Data for Machine Learning, Amirata Ghorbani, James Zou.

May 2019

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data. In this work, …

[Gro23S]

Studying Large Language Model Generalization with Influence Functions, Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman.

Aug 2023

When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? …

[Jia19aE]

Efficient task-specific data valuation for nearest neighbor algorithms, Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, Dawn Song.

Jul 2019

Given a data set D containing millions of data points and a data consumer who is willing to pay \$X to train a machine learning (ML) model over D, how should we distribute this \$X to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, …

[Jus23L]

LAVA: Data Valuation without Pre-Specified Learning Algorithms, Hoang Anh Just, Feiyang Kang, Tianhao Wang, Yi Zeng, Myeongseob Ko, Ming Jin, Ruoxi Jia.

Feb 2023

Traditionally, data valuation is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many use cases of data valuation, such as setting priorities over different data sources in a …

[Koh17U]

Understanding Black-box Predictions via Influence Functions, Pang Wei Koh, Percy Liang.

Jul 2017

How can we explain the predictions of a black-box model? In this paper, we use influence functions — a classic technique from robust statistics — to trace a model’s prediction through the learning algorithm and back to its training data, thereby identifying training points most responsible for a given prediction. To scale up influence functions to modern machine learning settings, we develop a …

[Kwo22B]

Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning, Yongchan Kwon, James Zou.

Jan 2022

Data Shapley has recently been proposed as a principled framework to quantify the contribution of individual datum in machine learning. It can effectively identify helpful or harmful data points for a learning algorithm. In this paper, we propose Beta Shapley, which is a substantial generalization of Data Shapley. Beta Shapley arises naturally by relaxing the efficiency axiom of the Shapley value, …

[Kwo23D]

Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value, Yongchan Kwon, James Zou.

Jul 2023

Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks, however, they are well known to be computationally challenging as it requires training a large number of models. As a result, it has been recognized as …

[Mar15O]

Optimizing Neural Networks with Kronecker-factored Approximate Curvature, James Martens, Roger Grosse.

Jun 2015

We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network’s Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks …

[Okh21M]

A Multilinear Sampling Algorithm to Estimate Shapley Values, Ramin Okhrati, Aldo Lipani.

Jan 2021

Shapley values are great analytical tools in game theory to measure the importance of a player in a game. Due to their axiomatic and desirable properties such as efficiency, they have become popular for feature importance analysis in data science and machine learning. However, the time complexity to compute Shapley values based on the original formula is exponential, and as the number of features …

[Sch22S]

Scaling Up Influence Functions, Andrea Schioppa, Polina Zablotskaia, David Vilar, Artem Sokolov.

Jun 2022

We address efficient calculation of influence functions for tracking predictions back to the training data. We propose and analyze a new approach to speeding up the inverse Hessian calculation based on Arnoldi iteration. With this improvement, we achieve, to the best of our knowledge, the first successful implementation of influence functions that scales to full-size (language and vision) …

[Sch22C]

CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification, Stephanie Schoch, Haifeng Xu, Yangfeng Ji.

Oct 2022

Data valuation, or the valuation of individual datum contributions, has seen growing interest in machine learning due to its demonstrable efficacy for tasks such as noisy label detection. In particular, due to the desirable axiomatic properties, several Shapley value approximations have been proposed. In these methods, the value function is usually defined as the predictive accuracy over the …

[Wan23D]

Data Banzhaf: A Robust Data Valuation Framework for Machine Learning, Jiachen T. Wang, Ruoxi Jia.

Apr 2023

Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the …

[Wan22I]

Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning, Tianhao Wang, Yu Yang, Ruoxi Jia.

Apr 2022

The Shapley value (SV) and Least core (LC) are classic methods in cooperative game theory for cost/profit sharing problems. Both methods have recently been proposed as a principled solution for data valuation tasks, i.e., quantifying the contribution of individual datum in machine learning. However, both SV and LC suffer computational challenges due to the need for retraining models on …

[Wat23A]

Accelerated Shapley Value Approximation for Data Evaluation, Lauren Watson, Zeno Kujawa, Rayna Andreeva, Hao-Tsung Yang, Tariq Elahi, Rik Sarkar.

Nov 2023

Data valuation has found various applications in machine learning, such as data filtering, efficient learning and incentives for data sharing. The most popular current approach to data valuation is the Shapley value. While popular for its various applications, Shapley value is computationally expensive even to approximate, as it requires repeated iterations of training models on different subsets …

[Wu22D]

DAVINZ: Data Valuation using Deep Neural Networks at Initialization, Zhaoxuan Wu, Yao Shu, Bryan Kian Hsiang Low.

Jun 2022

Recent years have witnessed a surge of interest in developing trustworthy methods to evaluate the value of data in many real-world applications (e.g., collaborative machine learning, data marketplaces). Existing data valuation methods typically valuate data using the generalization performance of converged machine learning models after their long-term model training, hence making data valuation on …

[Yan21I]

If You Like Shapley Then You’ll Love the Core, Tom Yan, Ariel D. Procaccia.

May 2021

The prevalent approach to problems of credit assignment in machine learning — such as feature and data valuation— is to model the problem at hand as a cooperative game and apply the Shapley value. But cooperative game theory offers a rich menu of alternative solution concepts, which famously includes the core and its variants. Our goal is to challenge the machine learning community’s current …
