Merge branch 'release/v0.6.0'
mdbenito committed Mar 16, 2023
2 parents e1d28ef + e26eee2 commit f8e07cc
Showing 50 changed files with 2,285 additions and 1,031 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.5.0
current_version = 0.6.0
commit = False
tag = False
allow_dirty = False
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
notebooks/*.ipynb -linguist-detectable
32 changes: 32 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,37 @@
# Changelog

## 0.6.0 - 🆕 New algorithms, cleanup and bug fixes 🏗

- Fixes in `ValuationResult`: bugs around data names, semantics of
`empty()`, new method `zeros()` and normalised random values
[PR #327](https://github.com/appliedAI-Initiative/pyDVL/pull/327)
- **New method**: Implements generalised semi-values for data valuation,
including Data Banzhaf and Beta Shapley, with configurable sampling strategies
[PR #319](https://github.com/appliedAI-Initiative/pyDVL/pull/319)
- Adds a kwargs parameter to the `from_array` and `from_sklearn`
  `Dataset` and `GroupedDataset` class methods
[PR #316](https://github.com/appliedAI-Initiative/pyDVL/pull/316)
- PEP-561 conformance: added `py.typed`
[PR #307](https://github.com/appliedAI-Initiative/pyDVL/pull/307)
- Removed the default non-negativity constraint on the least core subsidy
  and added a `non_negative_subsidy` boolean flag instead.
  Renamed `options` to `solver_options`, which is now passed as a dict.
  Changed the default least-core solver to SCS with 10000 `max_iters`
  (see the sketch after this list).
  [PR #304](https://github.com/appliedAI-Initiative/pyDVL/pull/304)
- Cleanup: removed unnecessary decorator `@unpackable`
[PR #233](https://github.com/appliedAI-Initiative/pyDVL/pull/233)
- Stopping criteria: fixed a problem with `StandardError` and enabled proper
composition of index convergence statuses. Fixed a bug with `n_jobs` in
`truncated_montecarlo_shapley`.
[PR #300](https://github.com/appliedAI-Initiative/pyDVL/pull/300) and
[PR #305](https://github.com/appliedAI-Initiative/pyDVL/pull/305)
- Moved code around to allow for simpler user imports, plus some cleanup and
documentation fixes.
[PR #284](https://github.com/appliedAI-Initiative/pyDVL/pull/284)
- **Bug fix**: Warn instead of raising an error when `n_iterations`
is less than the size of the dataset in Monte Carlo Least Core
[PR #281](https://github.com/appliedAI-Initiative/pyDVL/pull/281)
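
A minimal sketch of how the renamed least-core options above might be passed
(whether `non_negative_subsidy` and `solver_options` are keyword arguments of
`compute_least_core_values` is an assumption here, not confirmed by this
changelog):

```python
from pydvl.utils import Dataset, Utility
from pydvl.value import compute_least_core_values

model = ...
data = Dataset(...)
n_iterations = ...
utility = Utility(model, data)
values = compute_least_core_values(
    utility,
    mode="montecarlo",
    n_iterations=n_iterations,
    non_negative_subsidy=False,           # constraint is now opt-in (PR #304)
    solver_options={"max_iters": 10000},  # renamed from `options`, passed as a dict
)
```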

## 0.5.0 - 💥 Fixes, nicer interfaces and... more breaking changes 😒

- Fixed parallel and antithetic Owen sampling for Shapley values. Simplified
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -21,10 +21,10 @@ Consider installing any of [black's IDE
integrations](https://black.readthedocs.io/en/stable/integrations/editors.html)
to make your life easier.

Run the following command to set up the pre-commit git hook:
Run the following to set up the pre-commit git hook to run before pushes:

```shell script
pre-commit install
pre-commit install --hook-type pre-push
```

## Setting up your environment
28 changes: 16 additions & 12 deletions README.md
@@ -54,6 +54,13 @@ methods from the following papers:
[Towards Efficient Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
In 22nd International Conference on Artificial Intelligence and Statistics,
1167–76. PMLR, 2019.
- Wang, Jiachen T., and Ruoxi Jia.
[Data Banzhaf: A Robust Data Valuation Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
arXiv, October 22, 2022.
- Kwon, Yongchan, and James Zou.
[Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
In Proceedings of the 25th International Conference on Artificial Intelligence
and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.

Influence Functions compute the effect that single points have on an estimator /
model. We implement methods from the following papers:
@@ -97,21 +104,18 @@ This is how it looks for *Truncated Montecarlo Shapley*, an efficient method for
Data Shapley values:

```python
import numpy as np
from pydvl.utils import Dataset, Scorer, Utility
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from pydvl.value import *
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape((50, 2)), np.arange(50)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=16
)
dataset = Dataset(X_train, y_train, X_test, y_test)
model = LinearRegression()
utility = Utility(model, dataset)
data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
model = LogisticRegression()
u = Utility(model, data, Scorer("accuracy", default=0.0))
values = compute_shapley_values(
u=utility, mode="truncated_montecarlo", done=MaxUpdates(100)
u,
mode=ShapleyMode.TruncatedMontecarlo,
done=MaxUpdates(100) | AbsoluteStandardError(threshold=0.01),
truncation=RelativeTruncation(u, rtol=0.01),
)
```

142 changes: 119 additions & 23 deletions docs/30-data-valuation.rst
@@ -241,6 +241,7 @@ v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}}
.. code-block:: python
from pydvl.value import compute_shapley_values
utility = Utility(...)
values = compute_shapley_values(utility, mode="combinatorial_exact")
df = values.to_dataframe(column='value')
@@ -264,7 +265,8 @@ same pattern:
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values
from pydvl.value import compute_shapley_values
model = ...
data = Dataset(...)
utility = Utility(model, data)
@@ -303,7 +305,8 @@ values in pyDVL. First construct the dataset and utility, then call
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values
from pydvl.value import compute_shapley_values
model = ...
dataset = Dataset(...)
utility = Utility(model, dataset)
@@ -329,11 +332,11 @@ It uses permutations over indices instead of subsets:

$$
v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)}
[u(\sigma_{i-1} \cup {i}) − u(\sigma_{i})]
[u(\sigma_{:i} \cup \{i\}) − u(\sigma_{:i})]
,$$

where $\sigma_i$ denotes the set of indices in permutation sigma up until the
position of index $i$. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
where $\sigma_{:i}$ denotes the set of indices in permutation sigma before the
position where $i$ appears. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
one uses Monte Carlo sampling of permutations, something which has surprisingly
low sample complexity. By adding early stopping, the result is the so-called
**Truncated Monte Carlo Shapley** (:footcite:t:`ghorbani_data_2019`), which is
@@ -342,7 +345,7 @@ efficient enough to be useful in some applications.
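
For intuition, here is a bare-bones sketch of permutation sampling with early
stopping (illustrative only: the callable ``utility`` over index sets, the
function name and the absolute tolerance are assumptions, not pyDVL's
implementation, whose interface is shown right below):

.. code-block:: python

   import numpy as np

   def truncated_permutation_shapley(utility, n_points, n_permutations, atol=1e-4):
       """Average marginal contributions over random permutations, stopping a
       permutation early once the running score is within ``atol`` of the
       full-data utility."""
       values = np.zeros(n_points)
       total_score = utility(frozenset(range(n_points)))
       for _ in range(n_permutations):
           permutation = np.random.permutation(n_points)
           prev_score = utility(frozenset())
           for position, index in enumerate(permutation):
               if abs(total_score - prev_score) < atol:
                   break  # remaining marginal contributions are treated as zero
               score = utility(frozenset(permutation[: position + 1]))
               values[index] += score - prev_score
               prev_score = score
       return values / n_permutations
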
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values
from pydvl.value import compute_shapley_values
model = ...
data = Dataset(...)
@@ -364,7 +367,7 @@ and can be used in pyDVL with:
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values
from pydvl.value import compute_shapley_values
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
@@ -410,7 +413,7 @@ its variance.
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values
from pydvl.value import compute_shapley_values
model = ...
data = Dataset(...)
Expand Down Expand Up @@ -449,7 +452,7 @@ It satisfies the following 2 properties:
The sum of payoffs to the agents in any coalition S is at
least as large as the amount that these agents could earn by
forming a coalition on their own.
$$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subseteq D\,$$
$$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subset D\,$$

The second property states that the sum of payoffs to the agents
in any subcoalition $S$ is at least as large as the amount that
@@ -463,7 +466,7 @@ By relaxing the coalitional rationality property by a subsidy $e \gt 0$,
we are then able to find approximate payoffs:

$$
\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subseteq D\
\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subset D, S \neq \emptyset \
,$$

The least core value $v$ of the $i$-th sample in dataset $D$ wrt.
@@ -473,7 +476,7 @@ $$
\begin{array}{lll}
\text{minimize} & e & \\
\text{subject to} & \sum_{x_i\in D} v_u(x_i) = u(D) & \\
& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subseteq D \\
& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subset D, S \neq \emptyset \\
\end{array}
$$
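
For illustration, this linear program can be written down directly with
``cvxpy`` by enumerating all non-empty proper subsets (a toy sketch assuming a
callable ``utility`` over index sets; it is not pyDVL's implementation, which
is used in the snippets below):

.. code-block:: python

   from itertools import chain, combinations

   import cvxpy as cp

   def exact_least_core_sketch(utility, n_points):
       """Solve the least-core LP above by brute-force subset enumeration."""
       indices = tuple(range(n_points))
       v = cp.Variable(n_points)  # payoffs v_u(x_i)
       e = cp.Variable()          # subsidy
       constraints = [cp.sum(v) == utility(frozenset(indices))]
       # All non-empty proper subsets S of the index set.
       proper_subsets = chain.from_iterable(
           combinations(indices, k) for k in range(1, n_points)
       )
       for subset in proper_subsets:
           constraints.append(
               cp.sum(v[list(subset)]) + e >= utility(frozenset(subset))
           )
       problem = cp.Problem(cp.Minimize(e), constraints)
       problem.solve()
       return v.value, e.value
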

@@ -487,11 +490,12 @@ As such it returns as exact a value as the utility function allows
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.least_core import exact_least_core
from pydvl.value import compute_least_core_values
model = ...
dataset = Dataset(...)
utility = Utility(model, dataset)
values = exact_least_core(utility)
values = compute_least_core_values(utility, mode="exact")
Monte Carlo Least Core
----------------------
@@ -515,16 +519,20 @@ where $e^{*}$ is the optimal least core subsidy.
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.least_core import montecarlo_least_core
from pydvl.value import compute_least_core_values
model = ...
dataset = Dataset(...)
n_iterations = ...
utility = Utility(model, dataset)
values = montecarlo_least_core(utility, n_iterations=n_iterations)
values = compute_least_core_values(
utility, mode="montecarlo", n_iterations=n_iterations
)
.. note::

``n_iterations`` needs to be at least equal to the number of data points.
Although any number is supported, it is best to choose ``n_iterations`` to be
at least equal to the number of data points.

Because computing the Least Core values requires the solution of a linear and a
quadratic problem *after* computing all the utility values, we offer the
@@ -538,6 +546,7 @@ list of problems to solve, then solve them in parallel with
from pydvl.utils import Dataset, Utility
from pydvl.value.least_core import mclc_prepare_problem, lc_solve_problems
model = ...
dataset = Dataset(...)
n_iterations = ...
@@ -548,15 +557,102 @@ list of problems to solve, then solve them in parallel with
values = lc_solve_problems(problems)
Other methods
=============
Semi-values
===========

Shapley values are a particular case of a more general concept called semi-value,
which allows for different weighting schemes over coalition sizes. A **semi-value**
is any valuation function of the form:

$$
v\_\text{semi}(i) = \sum_{k=1}^n w(k)
\sum_{S \subset D\_{-i}^{(k)}} [U(S\_{+i})-U(S)],
$$

where the coefficients $w(k)$ satisfy the property:

$$\sum_{k=1}^n w(k) = 1.$$

Two instances of this are **Banzhaf indices** (:footcite:t:`wang_data_2022`)
and **Beta Shapley** (:footcite:t:`kwon_beta_2022`), which offer better numerical and
rank stability in certain situations.
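
To make the definition concrete, here is a brute-force sketch of the sum above
for a single index (the callables ``weight`` and ``utility`` and the function
name are placeholders for illustration; pyDVL's ``compute_semivalues`` is shown
in the subsections below):

.. code-block:: python

   from itertools import combinations

   def semivalue_sketch(i, n_points, weight, utility):
       """Evaluate the semi-value of index ``i`` by enumerating all subsets of D_{-i}."""
       rest = [j for j in range(n_points) if j != i]
       value = 0.0
       for k in range(len(rest) + 1):
           # All subsets S of D_{-i} with exactly k elements.
           for subset in combinations(rest, k):
               marginal = utility(frozenset(subset) | {i}) - utility(frozenset(subset))
               value += weight(k) * marginal
       return value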

.. note::

Shapley values are a particular case of semi-values and can therefore also be
computed with the methods described here. However, as of version 0.6.0, we
recommend using :func:`~pydvl.value.shapley.compute_shapley_values` instead,
in particular because it implements truncated Monte Carlo sampling for faster
computation.


Beta Shapley
^^^^^^^^^^^^

For some machine learning applications, where the utility is typically the
performance when trained on a set $S \subset D$, diminishing returns are often
observed when computing the marginal utility of adding a new data point.

Beta Shapley is a weighting scheme that uses the Beta function to place more
weight on subsets deemed to be more informative. The weights are defined as:

$$
w(k) := \frac{B(k+\beta, n-k+1+\alpha)}{B(\alpha, \beta)},
$$

where $B$ is the `Beta function <https://en.wikipedia.org/wiki/Beta_function>`_,
and $\alpha$ and $\beta$ are parameters that control the weighting of the
subsets. Setting both to 1 recovers Shapley values, and setting $\alpha = 1$ and
$\beta = 16$ is reported in :footcite:t:`kwon_beta_2022` to be a good choice for
some applications. See however :ref:`banzhaf indices` for an alternative choice
of weights which is reported to work better.
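
To get a feel for these coefficients, one can evaluate the formula above
directly (a small sketch using ``scipy``, not part of pyDVL):

.. code-block:: python

   from scipy.special import beta as beta_fn

   def beta_shapley_weight(k, n, alpha=1.0, beta=16.0):
       """w(k) as defined above, for subsets of size k drawn from n points."""
       return beta_fn(k + beta, n - k + 1 + alpha) / beta_fn(alpha, beta)

   # Example: coefficients for a dataset of 100 points with the values from the text.
   weights = [beta_shapley_weight(k, n=100) for k in range(1, 101)]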

.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value import compute_semivalues
model = ...
data = Dataset(...)
utility = Utility(model, data)
values = compute_semivalues(
u=utility, mode="beta_shapley", done=MaxUpdates(500), alpha=1, beta=16
)
.. _banzhaf indices:

There are other game-theoretic concepts in pyDVL's roadmap, based on the notion
of semivalue, which is a generalization to different weighting schemes:
in particular **Banzhaf indices** and **Beta Shapley**, with better numerical
and rank stability in certain situations.
Banzhaf indices
^^^^^^^^^^^^^^^

Contributions are welcome!
As noted below in :ref:`problems of data values`, the Shapley value can be very
sensitive to variance in the utility function. For machine learning applications,
where the utility is typically the performance when trained on a set $S \subset
D$, this variance is often largest for smaller subsets $S$. It is therefore
reasonable to try reducing the relative contribution of these subsets with
adequate weights.

One such choice of weights is the Banzhaf index, which is defined as the
constant:

$$w(k) := \frac{1}{2^{n-1}},$$

for all set sizes $k$. The intuition for picking a constant weight is that for
any choice of weight function $w$, one can always construct a utility with
higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
one can do is to pick a constant weight.

The authors of :footcite:t:`wang_data_2022` show that Banzhaf indices are more
robust to variance in the utility function than Shapley and Beta Shapley values.

.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value import compute_semivalues
model = ...
data = Dataset(...)
utility = Utility(model, data)
values = compute_semivalues(u=utility, mode="banzhaf", done=MaxUpdates(500))
.. _problems of data values:
13 changes: 11 additions & 2 deletions docs/conf.py
@@ -43,7 +43,6 @@
"sphinx.ext.extlinks",
"sphinx_math_dollar",
"sphinx.ext.todo",
"sphinx_rtd_theme",
"hoverxref.extension", # This only works on read the docs
"sphinx_design",
"sphinxcontrib.bibtex",
@@ -98,6 +97,16 @@
.nboutput .prompt {
display: none;
}
@media not print {
[data-theme='dark'] .output_area img {
filter: invert(0.9);
}
@media (prefers-color-scheme: dark) {
:root:not([data-theme="light"]) .output_area img {
filter: invert(0.9);
}
}
}
</style>
"""

@@ -325,7 +334,7 @@ def lineno_from_object_name(source_file, object_name):

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
html_show_copyright = True
copyright = "2022 AppliedAI Institute gGmbH"
copyright = "AppliedAI Institute gGmbH"

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
