Merge branch 'release/v0.6.0'
mdbenito committed Mar 16, 2023
2 parents e1d28ef + e26eee2 commit f8e07cc
Showing 50 changed files with 2,285 additions and 1,031 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.5.0
current_version = 0.6.0
commit = False
tag = False
allow_dirty = False
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
notebooks/*.ipynb -linguist-detectable
32 changes: 32 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,37 @@
# Changelog

## 0.6.0 - 🆕 New algorithms, cleanup and bug fixes 🏗

- Fixes in `ValuationResult`: bugs around data names, semantics of
`empty()`, new method `zeros()` and normalised random values
[PR #327](https://github.com/appliedAI-Initiative/pyDVL/pull/327)
- **New method**: Implements generalised semi-values for data valuation,
including Data Banzhaf and Beta Shapley, with configurable sampling strategies
[PR #319](https://github.com/appliedAI-Initiative/pyDVL/pull/319)
- Adds a kwargs parameter to the `from_array` and `from_sklearn`
  `Dataset` and `GroupedDataset` class methods
[PR #316](https://github.com/appliedAI-Initiative/pyDVL/pull/316)
- PEP-561 conformance: added `py.typed`
[PR #307](https://github.com/appliedAI-Initiative/pyDVL/pull/307)
- Removed the default non-negativity constraint on the least core subsidy
  and added a `non_negative_subsidy` boolean flag instead.
  Renamed `options` to `solver_options`, which is now passed as a dict.
  Changed the default least-core solver to SCS with 10000 `max_iters`
  (see the sketch after this list).
  [PR #304](https://github.com/appliedAI-Initiative/pyDVL/pull/304)
- Cleanup: removed unnecessary decorator `@unpackable`
[PR #233](https://github.com/appliedAI-Initiative/pyDVL/pull/233)
- Stopping criteria: fixed a problem with `StandardError` and enabled proper
composition of index convergence statuses. Fixed a bug with `n_jobs` in
`truncated_montecarlo_shapley`.
[PR #300](https://github.com/appliedAI-Initiative/pyDVL/pull/300) and
[PR #305](https://github.com/appliedAI-Initiative/pyDVL/pull/305)
- Moved code around to allow for simpler user imports, plus some cleanup and
documentation fixes.
[PR #284](https://github.com/appliedAI-Initiative/pyDVL/pull/284)
- **Bug fix**: Warn instead of raising an error when `n_iterations`
is less than the size of the dataset in Monte Carlo Least Core
[PR #281](https://github.com/appliedAI-Initiative/pyDVL/pull/281)
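
A minimal sketch of how the renamed least-core options above might be passed
(whether `non_negative_subsidy` and `solver_options` are keyword arguments of
`compute_least_core_values` is an assumption here, not confirmed by this
changelog):

```python
from pydvl.utils import Dataset, Utility
from pydvl.value import compute_least_core_values

model = ...
data = Dataset(...)
n_iterations = ...
utility = Utility(model, data)
values = compute_least_core_values(
    utility,
    mode="montecarlo",
    n_iterations=n_iterations,
    non_negative_subsidy=False,           # constraint is now opt-in (PR #304)
    solver_options={"max_iters": 10000},  # renamed from `options`, passed as a dict
)
```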

## 0.5.0 - 💥 Fixes, nicer interfaces and... more breaking changes 😒

- Fixed parallel and antithetic Owen sampling for Shapley values. Simplified
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -21,10 +21,10 @@ Consider installing any of [black's IDE
integrations](https://black.readthedocs.io/en/stable/integrations/editors.html)
to make your life easier.

Run the following command to set up the pre-commit git hook:
Run the following to set up the pre-commit git hook to run before pushes:

```shell script
pre-commit install
pre-commit install --hook-type pre-push
```

## Setting up your environment
28 changes: 16 additions & 12 deletions README.md
@@ -54,6 +54,13 @@ methods from the following papers:
[Towards Efficient Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
In 22nd International Conference on Artificial Intelligence and Statistics,
1167–76. PMLR, 2019.
- Wang, Jiachen T., and Ruoxi Jia.
[Data Banzhaf: A Robust Data Valuation Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
arXiv, October 22, 2022.
- Kwon, Yongchan, and James Zou.
[Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
In Proceedings of the 25th International Conference on Artificial Intelligence
and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.

Influence Functions compute the effect that single points have on an estimator /
model. We implement methods from the following papers:
@@ -97,21 +104,18 @@ This is how it looks for *Truncated Montecarlo Shapley*, an efficient method for
Data Shapley values:

```python
import numpy as np
from pydvl.utils import Dataset, Scorer, Utility
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from pydvl.value import *
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape((50, 2)), np.arange(50)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=16
)
dataset = Dataset(X_train, y_train, X_test, y_test)
model = LinearRegression()
utility = Utility(model, dataset)
data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
model = LogisticRegression()
u = Utility(model, data, Scorer("accuracy", default=0.0))
values = compute_shapley_values(
u=utility, mode="truncated_montecarlo", done=MaxUpdates(100)
u,
mode=ShapleyMode.TruncatedMontecarlo,
done=MaxUpdates(100) | AbsoluteStandardError(threshold=0.01),
truncation=RelativeTruncation(u, rtol=0.01),
)
```

142 changes: 119 additions & 23 deletions docs/30-data-valuation.rst
@@ -241,6 +241,7 @@ v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}}
.. code-block:: python
from pydvl.value import compute_shapley_values
utility = Utility(...)
values = compute_shapley_values(utility, mode="combinatorial_exact")
df = values.to_dataframe(column='value')
@@ -264,7 +265,8 @@ same pattern:
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values
from pydvl.value import compute_shapley_values
model = ...
data = Dataset(...)
utility = Utility(model, data)
@@ -303,7 +305,8 @@ values in pyDVL. First construct the dataset and utility, then call
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values
from pydvl.value import compute_shapley_values
model = ...
dataset = Dataset(...)
utility = Utility(model, dataset)
@@ -329,11 +332,11 @@ It uses permutations over indices instead of subsets:

$$
v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)}
[u(\sigma_{i-1} \cup {i}) − u(\sigma_{i})]
[u(\sigma_{:i} \cup \{i\}) − u(\sigma_{:i})]
,$$

where $\sigma_i$ denotes the set of indices in permutation sigma up until the
position of index $i$. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
where $\sigma_{:i}$ denotes the set of indices in permutation sigma before the
position where $i$ appears. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
one uses Monte Carlo sampling of permutations, something which has surprisingly
low sample complexity. By adding early stopping, the result is the so-called
**Truncated Monte Carlo Shapley** (:footcite:t:`ghorbani_data_2019`), which is
@@ -342,7 +345,7 @@ efficient enough to be useful in some applications.
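
For intuition, here is a bare-bones sketch of permutation sampling with early
stopping (illustrative only: the callable ``utility`` over index sets, the
function name and the absolute tolerance are assumptions, not pyDVL's
implementation, whose interface is shown right below):

.. code-block:: python

   import numpy as np

   def truncated_permutation_shapley(utility, n_points, n_permutations, atol=1e-4):
       """Average marginal contributions over random permutations, stopping a
       permutation early once the running score is within ``atol`` of the
       full-data utility."""
       values = np.zeros(n_points)
       total_score = utility(frozenset(range(n_points)))
       for _ in range(n_permutations):
           permutation = np.random.permutation(n_points)
           prev_score = utility(frozenset())
           for position, index in enumerate(permutation):
               if abs(total_score - prev_score) < atol:
                   break  # remaining marginal contributions are treated as zero
               score = utility(frozenset(permutation[: position + 1]))
               values[index] += score - prev_score
               prev_score = score
       return values / n_permutations
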
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values
from pydvl.value import compute_shapley_values
model = ...
data = Dataset(...)
@@ -364,7 +367,7 @@ and can be used in pyDVL with:
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values
from pydvl.value import compute_shapley_values
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
@@ -410,7 +413,7 @@ its variance.
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.shapley import compute_shapley_values
from pydvl.value import compute_shapley_values
model = ...
data = Dataset(...)
Expand Down Expand Up @@ -449,7 +452,7 @@ It satisfies the following 2 properties:
The sum of payoffs to the agents in any coalition S is at
least as large as the amount that these agents could earn by
forming a coalition on their own.
$$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subseteq D\,$$
$$\sum_{x_i\in S} v_u(x_i) \geq u(S), \forall S \subset D\,$$

The second property states that the sum of payoffs to the agents
in any subcoalition $S$ is at least as large as the amount that
@@ -463,7 +466,7 @@ By relaxing the coalitional rationality property by a subsidy $e \gt 0$,
we are then able to find approximate payoffs:

$$
\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subseteq D\
\sum_{x_i\in S} v_u(x_i) + e \geq u(S), \forall S \subset D, S \neq \emptyset \
,$$

The least core value $v$ of the $i$-th sample in dataset $D$ wrt.
@@ -473,7 +476,7 @@ $$
\begin{array}{lll}
\text{minimize} & e & \\
\text{subject to} & \sum_{x_i\in D} v_u(x_i) = u(D) & \\
& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subseteq D \\
& \sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subset D, S \neq \emptyset \\
\end{array}
$$
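
For illustration, this linear program can be written down directly with
``cvxpy`` by enumerating all non-empty proper subsets (a toy sketch assuming a
callable ``utility`` over index sets; it is not pyDVL's implementation, which
is used in the snippets below):

.. code-block:: python

   from itertools import chain, combinations

   import cvxpy as cp

   def exact_least_core_sketch(utility, n_points):
       """Solve the least-core LP above by brute-force subset enumeration."""
       indices = tuple(range(n_points))
       v = cp.Variable(n_points)  # payoffs v_u(x_i)
       e = cp.Variable()          # subsidy
       constraints = [cp.sum(v) == utility(frozenset(indices))]
       # All non-empty proper subsets S of the index set.
       proper_subsets = chain.from_iterable(
           combinations(indices, k) for k in range(1, n_points)
       )
       for subset in proper_subsets:
           constraints.append(
               cp.sum(v[list(subset)]) + e >= utility(frozenset(subset))
           )
       problem = cp.Problem(cp.Minimize(e), constraints)
       problem.solve()
       return v.value, e.value
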

@@ -487,11 +490,12 @@ As such it returns as exact a value as the utility function allows
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.least_core import exact_least_core
from pydvl.value import compute_least_core_values
model = ...
dataset = Dataset(...)
utility = Utility(model, dataset)
values = exact_least_core(utility)
values = compute_least_core_values(utility, mode="exact")
Monte Carlo Least Core
----------------------
@@ -515,16 +519,20 @@ where $e^{*}$ is the optimal least core subsidy.
.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value.least_core import montecarlo_least_core
from pydvl.value import compute_least_core_values
model = ...
dataset = Dataset(...)
n_iterations = ...
utility = Utility(model, dataset)
values = montecarlo_least_core(utility, n_iterations=n_iterations)
values = compute_least_core_values(
utility, mode="montecarlo", n_iterations=n_iterations
)
.. note::

``n_iterations`` needs to be at least equal to the number of data points.
Although any number is supported, it is best to choose ``n_iterations`` to be
at least equal to the number of data points.

Because computing the Least Core values requires the solution of a linear and a
quadratic problem *after* computing all the utility values, we offer the
@@ -538,6 +546,7 @@ list of problems to solve, then solve them in parallel with
from pydvl.utils import Dataset, Utility
from pydvl.value.least_core import mclc_prepare_problem, lc_solve_problems
model = ...
dataset = Dataset(...)
n_iterations = ...
@@ -548,15 +557,102 @@ list of problems to solve, then solve them in parallel with
values = lc_solve_problems(problems)
Other methods
=============
Semi-values
===========

Shapley values are a particular case of a more general concept called semi-value,
which allows for different weighting schemes over coalition sizes. A **semi-value**
is any valuation function of the form:

$$
v\_\text{semi}(i) = \sum_{k=1}^n w(k)
\sum_{S \subset D\_{-i}^{(k)}} [U(S\_{+i})-U(S)],
$$

where the coefficients $w(k)$ satisfy the property:

$$\sum_{k=1}^n w(k) = 1.$$

Two instances of this are **Banzhaf indices** (:footcite:t:`wang_data_2022`)
and **Beta Shapley** (:footcite:t:`kwon_beta_2022`), which offer better numerical and
rank stability in certain situations.
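
To make the definition concrete, here is a brute-force sketch of the sum above
for a single index (the callables ``weight`` and ``utility`` and the function
name are placeholders for illustration; pyDVL's ``compute_semivalues`` is shown
in the subsections below):

.. code-block:: python

   from itertools import combinations

   def semivalue_sketch(i, n_points, weight, utility):
       """Evaluate the semi-value of index ``i`` by enumerating all subsets of D_{-i}."""
       rest = [j for j in range(n_points) if j != i]
       value = 0.0
       for k in range(len(rest) + 1):
           # All subsets S of D_{-i} with exactly k elements.
           for subset in combinations(rest, k):
               marginal = utility(frozenset(subset) | {i}) - utility(frozenset(subset))
               value += weight(k) * marginal
       return value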

.. note::

Shapley values are a particular case of semi-values and can therefore also be
computed with the methods described here. However, as of version 0.6.0, we
recommend using :func:`~pydvl.value.shapley.compute_shapley_values` instead,
in particular because it implements truncated Monte Carlo sampling for faster
computation.


Beta Shapley
^^^^^^^^^^^^

For some machine learning applications, where the utility is typically the
performance when trained on a set $S \subset D$, diminishing returns are often
observed when computing the marginal utility of adding a new data point.

Beta Shapley is a weighting scheme that uses the Beta function to place more
weight on subsets deemed to be more informative. The weights are defined as:

$$
w(k) := \frac{B(k+\beta, n-k+1+\alpha)}{B(\alpha, \beta)},
$$

where $B$ is the `Beta function <https://en.wikipedia.org/wiki/Beta_function>`_,
and $\alpha$ and $\beta$ are parameters that control the weighting of the
subsets. Setting both to 1 recovers Shapley values, and setting $\alpha = 1$ and
$\beta = 16$ is reported in :footcite:t:`kwon_beta_2022` to be a good choice for
some applications. See however :ref:`banzhaf indices` for an alternative choice
of weights which is reported to work better.
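
To get a feel for these coefficients, one can evaluate the formula above
directly (a small sketch using ``scipy``, not part of pyDVL):

.. code-block:: python

   from scipy.special import beta as beta_fn

   def beta_shapley_weight(k, n, alpha=1.0, beta=16.0):
       """w(k) as defined above, for subsets of size k drawn from n points."""
       return beta_fn(k + beta, n - k + 1 + alpha) / beta_fn(alpha, beta)

   # Example: coefficients for a dataset of 100 points with the values from the text.
   weights = [beta_shapley_weight(k, n=100) for k in range(1, 101)]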

.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value import compute_semivalues
model = ...
data = Dataset(...)
utility = Utility(model, data)
values = compute_semivalues(
u=utility, mode="beta_shapley", done=MaxUpdates(500), alpha=1, beta=16
)
.. _banzhaf indices:

There are other game-theoretic concepts in pyDVL's roadmap, based on the notion
of semivalue, which is a generalization to different weighting schemes:
in particular **Banzhaf indices** and **Beta Shapley**, with better numerical
and rank stability in certain situations.
Banzhaf indices
^^^^^^^^^^^^^^^

Contributions are welcome!
As noted below in :ref:`problems of data values`, the Shapley value can be very
sensitive to variance in the utility function. For machine learning applications,
where the utility is typically the performance when trained on a set $S \subset
D$, this variance is often largest for smaller subsets $S$. It is therefore
reasonable to try reducing the relative contribution of these subsets with
adequate weights.

One such choice of weights is the Banzhaf index, which is defined as the
constant:

$$w(k) := \frac{1}{2^{n-1}},$$

for all set sizes $k$. The intuition for picking a constant weight is that for
any choice of weight function $w$, one can always construct a utility with
higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
one can do is to pick a constant weight.

The authors of :footcite:t:`wang_data_2022` show that Banzhaf indices are more
robust to variance in the utility function than Shapley and Beta Shapley values.

.. code-block:: python
from pydvl.utils import Dataset, Utility
from pydvl.value import compute_semivalues
model = ...
data = Dataset(...)
utility = Utility(model, data)
values = compute_semivalues(u=utility, mode="banzhaf", done=MaxUpdates(500))
.. _problems of data values:
13 changes: 11 additions & 2 deletions docs/conf.py
@@ -43,7 +43,6 @@
"sphinx.ext.extlinks",
"sphinx_math_dollar",
"sphinx.ext.todo",
"sphinx_rtd_theme",
"hoverxref.extension", # This only works on read the docs
"sphinx_design",
"sphinxcontrib.bibtex",
@@ -98,6 +97,16 @@
.nboutput .prompt {
display: none;
}
@media not print {
[data-theme='dark'] .output_area img {
filter: invert(0.9);
}
@media (prefers-color-scheme: dark) {
:root:not([data-theme="light"]) .output_area img {
filter: invert(0.9);
}
}
}
</style>
"""

@@ -325,7 +334,7 @@ def lineno_from_object_name(source_file, object_name):

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
html_show_copyright = True
copyright = "2022 AppliedAI Institute gGmbH"
copyright = "AppliedAI Institute gGmbH"

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
