Merge branch 'release/v0.8.1'

aai-institute · Jan 26, 2024 · ac4ac7f
2 parents 70df031 + 63753a2
Showing 62 changed files with 4,575 additions and 1,227 deletions.
diff --git a/.bumpversion.cfg b/.bumpversion.cfg
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.8.0
+current_version = 0.8.1
 commit = False
 tag = False
 allow_dirty = False

diff --git a/.test_durations b/.test_durations
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,31 @@
 # Changelog
 
+## 0.8.1 - 🆕 🏗  New method and notebook, Games with exact Shapley values, bug fixes and cleanup
+
+### Added
+
+- Implement new method: `EkfacInfluence`
+  [PR #451](https://github.com/aai-institute/pyDVL/issues/451)
+- New notebook to showcase ekfac for LLMs
+  [PR #483](https://github.com/aai-institute/pyDVL/pull/483)
+- Implemented exact games in Castro et al. 2009 and 2017
+  [PR #341](https://github.com/appliedAI-Initiative/pyDVL/pull/341)
+
+### Fixed
+
+- Bug in using `DaskInfluenceCalculator` with `TorchNumpyConverter`
+  for single-dimensional arrays [PR #485](https://github.com/aai-institute/pyDVL/pull/485)
+- Fix the `to` methods of `TorchInfluenceFunctionModel` implementations
+  [PR #487](https://github.com/aai-institute/pyDVL/pull/487)
+- Fixed bug with checking for converged values in semivalues
+  [PR #341](https://github.com/appliedAI-Initiative/pyDVL/pull/341)
+
+### Docs
+
+- Add applications of data valuation section, display examples more prominently,
+  make all sections visible in table of contents, use mkdocs material cards
+  in the home page [PR #492](https://github.com/aai-institute/pyDVL/pull/492)
+
 ## 0.8.0 - 🆕 New interfaces, scaling computation, bug fixes and improvements 🎁
 
 ### Added

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -23,7 +23,7 @@ to make your life easier.
 
 Run the following to set up the pre-commit git hook to run before pushes:
 
-```shell script
+```shell
 pre-commit install --hook-type pre-push
 ```
 
@@ -32,15 +32,15 @@ pre-commit install --hook-type pre-push
 We strongly suggest using some form of virtual environment for working with the
 library. E.g. with venv:
 
-```shell script
+```shell
 python -m venv ./venv
 . venv/bin/activate  # `venv\Scripts\activate` on Windows
 pip install -r requirements-dev.txt -r requirements-docs.txt
 ```
 
 With conda:
 
-```shell script
+```shell
 conda create -n pydvl python=3.8
 conda activate pydvl
 pip install -r requirements-dev.txt -r requirements-docs.txt
@@ -49,7 +49,7 @@ pip install -r requirements-dev.txt -r requirements-docs.txt
 A very convenient way of working with your library during development is to
 install it in editable mode into your environment by running
 
-```shell script
+```shell
 pip install -e .
 ```
 
@@ -58,7 +58,7 @@ suite) [pandoc](https://pandoc.org/) is required. Except for OSX, it should be i
 automatically as a dependency with `requirements-docs.txt`. Under OSX you can
 install pandoc (you'll need at least version 2.11) with:
 
-```shell script
+```shell
 brew install pandoc
 ```
 
@@ -152,11 +152,11 @@ Two important markers are:
 To test the notebooks separately, run (see [below](#notebooks) for details):
 
 ```shell
-tox -e tests -- notebooks/
+tox -e notebook-tests
 ```
 
 To create a package locally, run:
-```shell script
+```shell
 python setup.py sdist bdist_wheel
 ```
 
@@ -343,8 +343,12 @@ runs](#skipping-ci-runs)).
 3. We split the tests based on their duration into groups and run them in parallel.
   
    For that we use [pytest-split](https://jerry-git.github.io/pytest-split)
-   to first store the duration of all tests with `pytest --store-durations pytest --slow-tests`
+   to first store the duration of all tests with
+   `tox -e tests -- --store-durations --slow-tests`
    in a `.test_durations` file.
+   
+   Alternatively, we can use pytest directly:
+   `pytest --store-durations --slow-tests`.
 
    > **Note** This does not have to be done each time a new test or test case
    > is added. For new tests and test cases pytest-split assumes
@@ -359,11 +363,14 @@ runs](#skipping-ci-runs)).
    Then we can have as many splits as we want:
 
    ```shell
-   pytest --splits 3 --group 1
-   pytest --splits 3 --group 2
-   pytest --splits 3 --group 3
+   tox -e tests -- --splits 3 --group 1
+   tox -e tests -- --splits 3 --group 2
+   tox -e tests -- --splits 3 --group 3
    ```
    
+   Alternatively, we can use pytest directly:
+   `pytest --splits 3 --group 1`.
+   
    Each one of these commands should be run in a separate shell/job
    to run the test groups in parallel and decrease the total runtime.
 
@@ -510,13 +517,13 @@ Then, a new release can be created using the script
 `bumpversion` automatically derive the next release version by bumping the patch
 part):
 
-```shell script
+```shell
 build_scripts/release-version.sh 0.1.6
 ```
 
 To find out how to use the script, pass the `-h` or `--help` flags:
 
-```shell script
+```shell
 build_scripts/release-version.sh --help
 ```
 
@@ -542,7 +549,7 @@ create a new release manually by following these steps:
 2. When ready to release: From the develop branch create the release branch and
    perform release activities (update changelog, news, ...). For your own
    convenience, define an env variable for the release version
-    ```shell script
+    ```shell
     export RELEASE_VERSION="vX.Y.Z"
     git checkout develop
     git branch release/${RELEASE_VERSION} && git checkout release/${RELEASE_VERSION}
@@ -553,7 +560,7 @@ create a new release manually by following these steps:
    (the `release` part is ignored but required by bumpversion :rolling_eyes:).
 4. Merge the release branch into `master`, tag the merge commit, and push back to the repo. 
    The CI pipeline publishes the package based on the tagged commit.
-    ```shell script
+    ```shell
     git checkout master
     git merge --no-ff release/${RELEASE_VERSION}
     git tag -a ${RELEASE_VERSION} -m"Release ${RELEASE_VERSION}"
@@ -564,7 +571,7 @@ create a new release manually by following these steps:
    always strictly more recent than the last published release version from 
    `master`.
 6. Merge the release branch into `develop`:
-    ```shell script
+    ```shell
     git checkout develop
     git merge --no-ff release/${RELEASE_VERSION}
     git push origin develop
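
Note on the test-splitting change in CONTRIBUTING.md above: the guide says tests are split into groups based on their stored durations so that the groups can run in parallel. The block below is a minimal sketch of how duration-based grouping can work in principle, assuming the durations are available as a mapping from test id to seconds. It is a greedy illustration, not pytest-split's actual implementation; the function name `split_by_duration` and the example test names are hypothetical.

```python
import heapq
from typing import Dict, List


def split_by_duration(durations: Dict[str, float], splits: int) -> List[List[str]]:
    """Greedily assign tests to `splits` groups, longest test first,
    always adding to the group with the smallest total so far."""
    if splits < 1:
        raise ValueError("splits must be at least 1")
    groups: List[List[str]] = [[] for _ in range(splits)]
    # Heap of (total duration so far, group index).
    heap = [(0.0, index) for index in range(splits)]
    heapq.heapify(heap)
    # Sort by descending duration, breaking ties by test id for determinism.
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for test, seconds in ordered:
        total, index = heapq.heappop(heap)
        groups[index].append(test)
        heapq.heappush(heap, (total + seconds, index))
    return groups


# Hypothetical test names and durations, for illustration only.
example = {"test_a": 9.0, "test_b": 5.0, "test_c": 4.0, "test_d": 1.0}
for number, group in enumerate(split_by_duration(example, 3), start=1):
    print(f"group {number}: {group}")
```

Each resulting group corresponds to one `--group N` invocation, which is why the commands in the guide are meant to run in separate shells or CI jobs.
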

diff --git a/README.md b/README.md
@@ -7,27 +7,13 @@
 </p>
 
 <p align="center" style="text-align:center;">
-    <a href="https://pypi.org/project/pydvl/">
-        <img src="https://img.shields.io/pypi/v/pydvl.svg" alt="PyPI">
-    </a>
-    <a href="https://pypi.org/project/pydvl/">
-        <img src="https://img.shields.io/pypi/pyversions/pydvl.svg" alt="Version">
-    </a>
-    <a href="https://pydvl.org">
-        <img src="https://img.shields.io/badge/docs-All%20versions-009485" alt="documentation">
-    </a>
-    <a href="https://raw.githubusercontent.com/aai-institute/pyDVL/master/LICENSE">
-        <img alt="License" src="https://img.shields.io/pypi/l/pydvl">
-    </a>
-    <a href="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml">
-        <img src="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml/badge.svg" alt="Build status" >
-    </a>
-    <a href="https://codecov.io/gh/aai-institute/pyDVL">
-      <img src="https://codecov.io/gh/aai-institute/pyDVL/graph/badge.svg?token=VN7DNDE0FV"/>
-    </a>
-    <a href="https://zenodo.org/badge/latestdoi/354117916">
-        <img src="https://zenodo.org/badge/354117916.svg" alt="DOI">
-    </a>
+    <a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/v/pydvl.svg" alt="PyPI"></a>
+    <a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/pyversions/pydvl.svg" alt="Version"></a>
+    <a href="https://pydvl.org"><img src="https://img.shields.io/badge/docs-All%20versions-009485" alt="documentation"></a>
+    <a href="https://raw.githubusercontent.com/aai-institute/pyDVL/master/LICENSE"><img alt="License" src="https://img.shields.io/pypi/l/pydvl"></a>
+    <a href="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml"><img src="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml/badge.svg" alt="Build status" ></a>
+    <a href="https://codecov.io/gh/aai-institute/pyDVL"><img src="https://codecov.io/gh/aai-institute/pyDVL/graph/badge.svg?token=VN7DNDE0FV"/></a>
+    <a href="https://zenodo.org/badge/latestdoi/354117916"><img src="https://zenodo.org/badge/354117916.svg" alt="DOI"></a>
 </p>
 
 **pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.
@@ -332,7 +318,8 @@ We currently implement the following papers:
 - Schioppa, Andrea, Polina Zablotskaia, David Vilar, and Artem Sokolov. 
   [Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052). 
   In Proceedings of the AAAI-22. arXiv, 2021.
-
+- James Martens, Roger Grosse, [Optimizing Neural Networks with Kronecker-factored Approximate Curvature](https://arxiv.org/abs/1503.05671), International Conference on Machine Learning (ICML), 2015.
+- George, Thomas, César Laurent, Xavier Bouthillier, Nicolas Ballas, Pascal Vincent, [Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis](https://arxiv.org/abs/1806.03884), Advances in Neural Information Processing Systems 31, 2018.
   
 # License
 

diff --git a/build_scripts/copy_contributing_guide.py b/build_scripts/copy_contributing_guide.py
@@ -0,0 +1,38 @@
+import logging
+import os
+from pathlib import Path
+
+import mkdocs.plugins
+
+logger = logging.getLogger(__name__)
+
+root_dir = Path(__file__).parent.parent
+docs_dir = root_dir / "docs"
+contributing_file = root_dir / "CONTRIBUTING.md"
+target_filepath = docs_dir / contributing_file.name
+
+
+@mkdocs.plugins.event_priority(100)
+def on_pre_build(config):
+    logger.info("Temporarily copying contributing guide to docs directory")
+    try:
+        if os.path.getmtime(contributing_file) <= os.path.getmtime(target_filepath):
+            logger.info(
+                f"Contributing guide '{os.fspath(contributing_file)}' hasn't been updated, skipping."
+            )
+            return
+    except FileNotFoundError:
+        pass
+    logger.info(
+        f"Creating symbolic link for '{os.fspath(contributing_file)}' "
+        f"at '{os.fspath(target_filepath)}'"
+    )
+    target_filepath.symlink_to(contributing_file)
+
+    logger.info("Finished copying contributing guide to docs directory")
+
+
+@mkdocs.plugins.event_priority(-100)
+def on_shutdown():
+    logger.info("Removing temporary contributing guide in docs directory")
+    target_filepath.unlink()
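
Note on `build_scripts/copy_contributing_guide.py` above: as written, `symlink_to` raises `FileExistsError` when the target already exists but is stale or is a dangling link, and `unlink` raises `FileNotFoundError` when the target is already gone. The block below is a minimal sketch of a more defensive variant of the same link-and-clean-up logic, using only `pathlib`. It is not part of this commit; the helper names `link_guide` and `remove_guide` are hypothetical, and the MkDocs hook wiring is deliberately left out.

```python
from pathlib import Path


def link_guide(source: Path, target: Path) -> bool:
    """Point `target` at `source` via a symlink.

    Returns True if a link was (re)created and False if an
    up-to-date target already existed.
    """
    try:
        # stat() follows symlinks, so an existing link to `source`
        # reports the same modification time as `source` itself.
        if source.stat().st_mtime <= target.stat().st_mtime:
            return False
    except FileNotFoundError:
        # Either file is missing, or the target is a dangling symlink.
        pass
    # Remove a stale file or dangling link; symlink_to would fail otherwise.
    if target.is_symlink() or target.exists():
        target.unlink()
    target.symlink_to(source)
    return True


def remove_guide(target: Path) -> None:
    """Delete the link if present; do nothing when it is already gone."""
    if target.is_symlink() or target.exists():
        target.unlink()
```

In a hook file, `on_pre_build` would call `link_guide(contributing_file, target_filepath)` and `on_shutdown` would call `remove_guide(target_filepath)`, so repeated builds and interrupted runs do not fail on leftover state.
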
diff --git a/docs/assets/pydvl.bib b/docs/assets/pydvl.bib
@@ -342,4 +342,21 @@ @InProceedings{kwon_data_2023
   pdf = 	 {https://proceedings.mlr.press/v202/kwon23e/kwon23e.pdf},
   url = 	 {https://proceedings.mlr.press/v202/kwon23e.html},
   abstract = 	 {Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks, however, they are well known to be computationally challenging as it requires training a large number of models. As a result, it has been recognized as infeasible to apply to large datasets. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data by reusing trained weak learners. Specifically, Data-OOB takes less than $2.25$ hours on a single CPU processor when there are $10^6$ samples to evaluate and the input dimension is $100$. Furthermore, Data-OOB has solid theoretical interpretations in that it identifies the same important data point as the infinitesimal jackknife influence function when two different points are compared. We conduct comprehensive experiments using 12 classification datasets, each with thousands of sample sizes. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.}
+}
+
+@article{george2018fast,
+  title={Fast approximate natural gradient descent in a kronecker factored eigenbasis},
+  author={George, Thomas and Laurent, C{\'e}sar and Bouthillier, Xavier and Ballas, Nicolas and Vincent, Pascal},
+  journal={Advances in Neural Information Processing Systems},
+  volume={31},
+  year={2018}
+}
+
+@inproceedings{martens2015optimizing,
+  title={Optimizing neural networks with kronecker-factored approximate curvature},
+  author={Martens, James and Grosse, Roger},
+  booktitle={International conference on machine learning},
+  pages={2408--2417},
+  year={2015},
+  organization={PMLR}
 }
diff --git a/docs/css/extra.css b/docs/css/extra.css
@@ -69,6 +69,7 @@ a.autorefs-external:hover::after {
 .nt-card-image:focus {
   filter: invert(32%) sepia(93%) saturate(1535%) hue-rotate(220deg) brightness(102%) contrast(99%);
 }
+
 .md-header__button.md-logo {
     padding: 0;
 }

diff --git a/docs/css/grid-cards.css b/docs/css/grid-cards.css
@@ -0,0 +1,22 @@
+/* Shadow and Hover     */
+.grid.cards > ul > li {
+    box-shadow: 0 2px 2px 0 rgb(0 0 0 / 14%), 0 3px 1px -2px rgb(0 0 0 / 20%), 0 1px 5px 0 rgb(0 0 0 / 12%);
+
+    &:hover {
+        transform: scale(1.05);
+        z-index: 999;
+        background-color: rgba(0, 0, 0, 0.05);
+    }
+}
+
+[data-md-color-scheme="slate"] {
+    .grid.cards > ul > li {
+        box-shadow: 0 2px 2px 0 rgb(4 40 33 / 14%), 0 3px 1px -2px rgb(40 86 94 / 47%), 0 1px 5px 0 rgb(139 252 255 / 64%);
+
+        &:hover {
+            transform: scale(1.05);
+            z-index: 999;
+            background-color: rgba(139, 252, 255, 0.05);
+        }
+    }
+}
diff --git a/docs/css/neoteroi.css b/docs/css/neoteroi.css
diff --git a/docs/getting-started/first-steps.md b/docs/getting-started/first-steps.md
@@ -1,11 +1,11 @@
 ---
-title: Getting Started
+title: First Steps
 alias: 
-  name: getting-started
-  text: Getting Started
+  name: first-steps
+  text: First Steps
 ---
 
-# Getting started
+# First Steps
 
 !!! Warning
     Make sure you have read [[installation]] before using the library. 

diff --git a/docs/index.md b/docs/index.md
@@ -9,26 +9,39 @@ It runs most of them in parallel either locally or in a cluster and supports
 distributed caching of results.
 
 If you're a first time user of pyDVL, we recommend you to go through the
-[[getting-started]] and [[installation]] guides.
+[[installation]] and [[first-steps]] guides in the Getting Started section.
 
-::cards:: cols=2
+<div class="grid cards" markdown>
 
-- title: Installation
-  content: Steps to install and requirements
-  url: getting-started/installation.md
+-   :fontawesome-solid-toolbox:{ .lg .middle } __Installation__
+
+    ---
+    Steps to install and requirements
+
+    [[installation|:octicons-arrow-right-24: Installation]]
+
+-   :fontawesome-solid-scale-unbalanced:{ .lg .middle } __Data valuation__
+
+    ---
 
-- title: Data valuation
-  content: >
     Basics of data valuation and description of the main algorithms
-  url: value/
 
-- title: Influence Function
-  content: >
+    [[data-valuation|:octicons-arrow-right-24: Data Valuation]]
+
+-   :fontawesome-solid-scale-unbalanced-flip:{ .lg .middle } __Influence Function__
+
+    ---
+
     An introduction to the influence function and its computation with pyDVL
-  url: influence/
 
-- title: Browse the API
-  content: Full documentation of the API
-  url: api/pydvl/
+    [[influence-values|:octicons-arrow-right-24: Influence Values]]
+
+-   :fontawesome-regular-file-code:{ .lg .middle } __API Reference__
+
+    ---
+
+    Full documentation of the API
+
+    [:octicons-arrow-right-24: API Reference](api/pydvl/)
 
-::/cards::
+</div>