Data (e)valuation and model interpretation: a game-theoretic approach

Attributing a “fair” value (for some definition of fair) to training samples has multiple applications. It can be used to investigate or improve data sources, to detect outliers, and to study how certain features and their values influence global model performance. It can also improve model performance: removing points of low value can decrease model error. Finally, in problems where data are scarce and costly (e.g. some types of medical images), a fair value can be used to compensate providers for their data.
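To make the idea concrete, here is a minimal sketch of valuing training samples with a plain permutation-sampling estimate of the Shapley value. It is not the specific algorithm of any paper listed under References. Everything in it is an assumption made for illustration: the toy "model" that predicts the mean of its training subset, the utility (negative mean squared error on a validation set), the data, and the hypothetical names `utility` and `permutation_shapley`.

```python
import random
from statistics import mean


def utility(subset, validation):
    """Negative MSE of a toy model that predicts the mean of `subset`.

    Assumption: with no training data the model predicts 0.0.
    """
    prediction = mean(subset) if subset else 0.0
    return -mean((v - prediction) ** 2 for v in validation)


def permutation_shapley(train, validation, n_permutations=500, seed=0):
    """Estimate each training point's Shapley value by sampling permutations.

    For every random ordering, a point is credited with the change in
    utility observed when it is added to the points preceding it.
    """
    rng = random.Random(seed)
    indices = list(range(len(train)))
    values = [0.0] * len(train)
    for _ in range(n_permutations):
        rng.shuffle(indices)
        subset = []
        previous = utility(subset, validation)
        for i in indices:
            subset.append(train[i])
            current = utility(subset, validation)
            values[i] += current - previous  # marginal contribution of point i
            previous = current
    return [v / n_permutations for v in values]


# Hypothetical data: the last training point is an injected outlier.
train = [1.0, 1.1, 0.9, 1.05, 8.0]
validation = [1.0, 0.95, 1.05]

values = permutation_shapley(train, validation)
worst = min(range(len(train)), key=values.__getitem__)

print("estimated values:", [round(v, 3) for v in values])
print("lowest-valued index:", worst)
print("utility with all points:", utility(train, validation))
print("utility without lowest-valued point:",
      utility([x for i, x in enumerate(train) if i != worst], validation))
```

In this toy setting the outlier receives the lowest value, and dropping it raises the utility, which mirrors the claim that removing points of low value can decrease model error. With a real model each utility evaluation means retraining, so the number of permutations becomes the main cost.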

Alternatively, attributing global values to features (as opposed to local ones, i.e. around single predictions) can help steer data collection and cleaning towards the features with the highest impact, saving time and resources. By identifying the most valuable features, companies can also gain insight into their businesses from their data.

References

[Gho19D]

Data Shapley: Equitable Valuation of Data for Machine Learning, Amirata Ghorbani, James Zou.

May 2019

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data. In this work, …

[Gho20D]

A Distributional Framework For Data Valuation, Amirata Ghorbani, Michael Kim, James Zou.

Nov 2020

Shapley value is a classic notion from game theory, historically used to quantify the contributions of individuals within groups, and more recently applied to assign values to data points when training machine learning models. Despite its foundational role, a key limitation of the data Shapley framework is that it only provides valuations for points within a fixed data set. It does not account for …

[Yoo20D]

Data Valuation using Reinforcement Learning, Jinsung Yoon, Sercan Ö Arık, Tomas Pfister.

2020

Quantifying the value of data is a fundamental problem in machine learning and has multiple important use cases: (1) building insights about the dataset and task, (2) domain adaptation, (3) corrupted sample discovery, and (4) robust learning. We propose Data Valuation using Reinforcement Learning (DVRL), to adaptively learn data values jointly with the predictor model. DVRL uses a data value …

[Mal14B]

Bounding the Estimation Error of Sampling-based Shapley Value Approximation, Sasan Maleki, Long Tran-Thanh, Greg Hines, Talal Rahwan, Alex Rogers.

Feb 2014

The Shapley value is arguably the most central normative solution concept in cooperative game theory. It specifies a unique way in which the reward from cooperation can be "fairly" divided among players. While it has a wide range of real world applications, its use is in many cases hampered by the hardness of its computation. A number of researchers have tackled this problem by (i) focusing on …
