Merge branch 'release/v0.8.1'
AnesBenmerzoug committed Jan 26, 2024
2 parents 70df031 + 63753a2 commit ac4ac7f
Showing 62 changed files with 4,575 additions and 1,227 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.8.0
current_version = 0.8.1
commit = False
tag = False
allow_dirty = False
1,015 changes: 696 additions & 319 deletions .test_durations

Large diffs are not rendered by default.

26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,31 @@
# Changelog

## 0.8.1 - 🆕 🏗 New method and notebook, Games with exact Shapley values, bug fixes and cleanup

### Added

- Implement new method: `EkfacInfluence`
[PR #451](https://github.com/aai-institute/pyDVL/issues/451)
- New notebook to showcase ekfac for LLMs
[PR #483](https://github.com/aai-institute/pyDVL/pull/483)
- Implemented exact games in Castro et al. 2009 and 2017
[PR #341](https://github.com/appliedAI-Initiative/pyDVL/pull/341)

### Fixed

- Bug in using `DaskInfluenceCalculator` with `TorchNumpyConverter`
for single dimensional arrays [PR #485](https://github.com/aai-institute/pyDVL/pull/485)
- Fix `to` methods of `TorchInfluenceFunctionModel` implementations
[PR #487](https://github.com/aai-institute/pyDVL/pull/487)
- Fixed bug with checking for converged values in semivalues
[PR #341](https://github.com/appliedAI-Initiative/pyDVL/pull/341)

### Docs

- Add applications of data valuation section, display examples more prominently,
make all sections visible in table of contents, use mkdocs material cards
in the home page [PR #492](https://github.com/aai-institute/pyDVL/pull/492)

## 0.8.0 - 🆕 New interfaces, scaling computation, bug fixes and improvements 🎁

### Added
39 changes: 23 additions & 16 deletions CONTRIBUTING.md
@@ -23,7 +23,7 @@ to make your life easier.

Run the following to set up the pre-commit git hook to run before pushes:

```shell script
```shell
pre-commit install --hook-type pre-push
```

@@ -32,15 +32,15 @@ pre-commit install --hook-type pre-push
We strongly suggest using some form of virtual environment for working with the
library. E.g. with venv:

```shell script
```shell
python -m venv ./venv
. venv/bin/activate # `venv\Scripts\activate` in windows
pip install -r requirements-dev.txt -r requirements-docs.txt
```

With conda:

```shell script
```shell
conda create -n pydvl python=3.8
conda activate pydvl
pip install -r requirements-dev.txt -r requirements-docs.txt
@@ -49,7 +49,7 @@ pip install -r requirements-dev.txt -r requirements-docs.txt
A very convenient way of working with your library during development is to
install it in editable mode into your environment by running

```shell script
```shell
pip install -e .
```

@@ -58,7 +58,7 @@ suite) [pandoc](https://pandoc.org/) is required. Except for OSX, it should be i
automatically as a dependency with `requirements-docs.txt`. Under OSX you can
install pandoc (you'll need at least version 2.11) with:

```shell script
```shell
brew install pandoc
```

@@ -152,11 +152,11 @@ Two important markers are:
To test the notebooks separately, run (see [below](#notebooks) for details):

```shell
tox -e tests -- notebooks/
tox -e notebook-tests
```

To create a package locally, run:
```shell script
```shell
python setup.py sdist bdist_wheel
```

@@ -343,8 +343,12 @@ runs](#skipping-ci-runs)).
3. We split the tests based on their duration into groups and run them in parallel.
For that we use [pytest-split](https://jerry-git.github.io/pytest-split)
to first store the duration of all tests with `pytest --store-durations pytest --slow-tests`
to first store the duration of all tests with
`tox -e tests -- --store-durations --slow-tests`
in a `.test_durations` file.
Alternatively, we can use pytest directly:
`pytest --store-durations --slow-tests`.
> **Note** This does not have to be done each time a new test or test case
> is added. For new tests and test cases pytest-split assumes
@@ -359,11 +363,14 @@ runs](#skipping-ci-runs)).
Then we can have as many splits as we want:
```shell
pytest --splits 3 --group 1
pytest --splits 3 --group 2
pytest --splits 3 --group 3
tox -e tests -- --splits 3 --group 1
tox -e tests -- --splits 3 --group 2
tox -e tests -- --splits 3 --group 3
```
Alternatively, we can use pytest directly:
`pytest --splits 3 --group 1`.
Each one of these commands should be run in a separate shell/job
to run the test groups in parallel and decrease the total runtime.
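
To make the idea of duration-based splitting concrete, here is a minimal sketch that balances tests across groups using stored durations. It is an illustration only, not pytest-split's actual algorithm: the function name `split_by_duration`, the greedy longest-first strategy, and the example durations are all assumptions made for this sketch.

```python
import heapq
from typing import Dict, List


def split_by_duration(durations: Dict[str, float], n_groups: int) -> List[List[str]]:
    """Greedily assign tests to groups so that total durations stay balanced.

    Hypothetical helper for illustration; not part of pytest-split or pyDVL.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")

    # Min-heap of (accumulated seconds, group index): the lightest group is on top.
    heap = [(0.0, index) for index in range(n_groups)]
    heapq.heapify(heap)
    groups: List[List[str]] = [[] for _ in range(n_groups)]

    # Place the longest tests first; ties are broken by test name for determinism.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        total, index = heapq.heappop(heap)
        groups[index].append(name)
        heapq.heappush(heap, (total + seconds, index))
    return groups


# Made-up durations in seconds, standing in for the contents of a durations file.
example_durations = {
    "test_slow_a": 9.0,
    "test_slow_b": 7.0,
    "test_mid": 4.0,
    "test_fast_a": 1.0,
    "test_fast_b": 1.0,
}
groups = split_by_duration(example_durations, n_groups=3)
for number, group in enumerate(groups, start=1):
    print(f"group {number}: {group}")
```

Each resulting group would then correspond to one `--group` value, run in its own shell or CI job.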
@@ -510,13 +517,13 @@ Then, a new release can be created using the script
`bumpversion` automatically derive the next release version by bumping the patch
part):
```shell script
```shell
build_scripts/release-version.sh 0.1.6
```
To find out how to use the script, pass the `-h` or `--help` flags:
```shell script
```shell
build_scripts/release-version.sh --help
```
@@ -542,7 +549,7 @@ create a new release manually by following these steps:
2. When ready to release: From the develop branch create the release branch and
perform release activities (update changelog, news, ...). For your own
convenience, define an env variable for the release version
```shell script
```shell
export RELEASE_VERSION="vX.Y.Z"
git checkout develop
git branch release/${RELEASE_VERSION} && git checkout release/${RELEASE_VERSION}
@@ -553,7 +560,7 @@ create a new release manually by following these steps:
(the `release` part is ignored but required by bumpversion :rolling_eyes:).
4. Merge the release branch into `master`, tag the merge commit, and push back to the repo.
The CI pipeline publishes the package based on the tagged commit.
```shell script
```shell
git checkout master
git merge --no-ff release/${RELEASE_VERSION}
git tag -a ${RELEASE_VERSION} -m"Release ${RELEASE_VERSION}"
@@ -564,7 +571,7 @@ create a new release manually by following these steps:
always strictly more recent than the last published release version from
`master`.
6. Merge the release branch into `develop`:
```shell script
```shell
git checkout develop
git merge --no-ff release/${RELEASE_VERSION}
git push origin develop
31 changes: 9 additions & 22 deletions README.md
@@ -7,27 +7,13 @@
</p>

<p align="center" style="text-align:center;">
<a href="https://pypi.org/project/pydvl/">
<img src="https://img.shields.io/pypi/v/pydvl.svg" alt="PyPI">
</a>
<a href="https://pypi.org/project/pydvl/">
<img src="https://img.shields.io/pypi/pyversions/pydvl.svg" alt="Version">
</a>
<a href="https://pydvl.org">
<img src="https://img.shields.io/badge/docs-All%20versions-009485" alt="documentation">
</a>
<a href="https://raw.githubusercontent.com/aai-institute/pyDVL/master/LICENSE">
<img alt="License" src="https://img.shields.io/pypi/l/pydvl">
</a>
<a href="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml">
<img src="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml/badge.svg" alt="Build status" >
</a>
<a href="https://codecov.io/gh/aai-institute/pyDVL">
<img src="https://codecov.io/gh/aai-institute/pyDVL/graph/badge.svg?token=VN7DNDE0FV"/>
</a>
<a href="https://zenodo.org/badge/latestdoi/354117916">
<img src="https://zenodo.org/badge/354117916.svg" alt="DOI">
</a>
<a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/v/pydvl.svg" alt="PyPI"></a>
<a href="https://pypi.org/project/pydvl/"><img src="https://img.shields.io/pypi/pyversions/pydvl.svg" alt="Version"></a>
<a href="https://pydvl.org"><img src="https://img.shields.io/badge/docs-All%20versions-009485" alt="documentation"></a>
<a href="https://raw.githubusercontent.com/aai-institute/pyDVL/master/LICENSE"><img alt="License" src="https://img.shields.io/pypi/l/pydvl"></a>
<a href="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml"><img src="https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml/badge.svg" alt="Build status" ></a>
<a href="https://codecov.io/gh/aai-institute/pyDVL"><img src="https://codecov.io/gh/aai-institute/pyDVL/graph/badge.svg?token=VN7DNDE0FV"/></a>
<a href="https://zenodo.org/badge/latestdoi/354117916"><img src="https://zenodo.org/badge/354117916.svg" alt="DOI"></a>
</p>

**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.
@@ -332,7 +318,8 @@ We currently implement the following papers:
- Schioppa, Andrea, Polina Zablotskaia, David Vilar, and Artem Sokolov.
[Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052).
In Proceedings of the AAAI-22. arXiv, 2021.
- James Martens, Roger Grosse, [Optimizing Neural Networks with Kronecker-factored Approximate Curvature](https://arxiv.org/abs/1503.05671), International Conference on Machine Learning (ICML), 2015.
- George, Thomas, César Laurent, Xavier Bouthillier, Nicolas Ballas, Pascal Vincent, [Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis](https://arxiv.org/abs/1806.03884), Advances in Neural Information Processing Systems 31, 2018.
# License
38 changes: 38 additions & 0 deletions build_scripts/copy_contributing_guide.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
import logging
import os
from pathlib import Path

import mkdocs.plugins

logger = logging.getLogger(__name__)

root_dir = Path(__file__).parent.parent
docs_dir = root_dir / "docs"
contributing_file = root_dir / "CONTRIBUTING.md"
target_filepath = docs_dir / contributing_file.name


@mkdocs.plugins.event_priority(100)
def on_pre_build(config):
    logger.info("Temporarily copying contributing guide to docs directory")
    try:
        if os.path.getmtime(contributing_file) <= os.path.getmtime(target_filepath):
            logger.info(
                f"Contributing guide '{os.fspath(contributing_file)}' hasn't been updated, skipping."
            )
            return
    except FileNotFoundError:
        pass
    logger.info(
        f"Creating symbolic link for '{os.fspath(contributing_file)}' "
        f"at '{os.fspath(target_filepath)}'"
    )
    target_filepath.symlink_to(contributing_file)

    logger.info("Finished copying contributing guide to docs directory")


@mkdocs.plugins.event_priority(-100)
def on_shutdown():
    logger.info("Removing temporary contributing guide in docs directory")
    target_filepath.unlink()
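
The link handling in the hook above can be looked at in isolation. The following is a sketch under stated assumptions, not the code in this commit: `ensure_symlink` and `remove_symlink` are hypothetical helper names, and the sketch checks the link target directly instead of comparing modification times. That choice is made because `os.path.getmtime` follows symbolic links, so an existing valid link reports the same timestamp as the file it points to.

```python
from pathlib import Path


def ensure_symlink(source: Path, link: Path) -> bool:
    """Create or refresh `link` so that it points at `source`.

    Hypothetical helper for illustration. Returns True if the link was
    (re)created and False if a valid link already existed. A regular file
    at `link` is left alone, so `symlink_to` raises FileExistsError.
    """
    if link.is_symlink():
        # `exists()` follows the link, so it is False for a dangling link.
        if link.exists() and link.resolve() == source.resolve():
            return False
        link.unlink()  # dangling, or pointing somewhere else
    link.symlink_to(source)
    return True


def remove_symlink(link: Path) -> None:
    """Remove `link` if it is a symbolic link; do nothing otherwise."""
    if link.is_symlink():
        link.unlink()
```

A hook could call `ensure_symlink(contributing_file, target_filepath)` before the build and `remove_symlink(target_filepath)` on shutdown, which would also tolerate a link that is already gone.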
17 changes: 17 additions & 0 deletions docs/assets/pydvl.bib
@@ -342,4 +342,21 @@ @InProceedings{kwon_data_2023
pdf = {https://proceedings.mlr.press/v202/kwon23e/kwon23e.pdf},
url = {https://proceedings.mlr.press/v202/kwon23e.html},
abstract = {Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks, however, they are well known to be computationally challenging as it requires training a large number of models. As a result, it has been recognized as infeasible to apply to large datasets. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data by reusing trained weak learners. Specifically, Data-OOB takes less than $2.25$ hours on a single CPU processor when there are $10^6$ samples to evaluate and the input dimension is $100$. Furthermore, Data-OOB has solid theoretical interpretations in that it identifies the same important data point as the infinitesimal jackknife influence function when two different points are compared. We conduct comprehensive experiments using 12 classification datasets, each with thousands of sample sizes. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.}
}

@article{george2018fast,
  title={Fast approximate natural gradient descent in a kronecker factored eigenbasis},
  author={George, Thomas and Laurent, C{\'e}sar and Bouthillier, Xavier and Ballas, Nicolas and Vincent, Pascal},
  journal={Advances in Neural Information Processing Systems},
  volume={31},
  year={2018}
}

@inproceedings{martens2015optimizing,
  title={Optimizing neural networks with kronecker-factored approximate curvature},
  author={Martens, James and Grosse, Roger},
  booktitle={International conference on machine learning},
  pages={2408--2417},
  year={2015},
  organization={PMLR}
}
1 change: 1 addition & 0 deletions docs/css/extra.css
@@ -69,6 +69,7 @@ a.autorefs-external:hover::after {
.nt-card-image:focus {
  filter: invert(32%) sepia(93%) saturate(1535%) hue-rotate(220deg) brightness(102%) contrast(99%);
}

.md-header__button.md-logo {
  padding: 0;
}
22 changes: 22 additions & 0 deletions docs/css/grid-cards.css
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
/* Shadow and Hover */
.grid.cards > ul > li {
  box-shadow: 0 2px 2px 0 rgb(0 0 0 / 14%), 0 3px 1px -2px rgb(0 0 0 / 20%), 0 1px 5px 0 rgb(0 0 0 / 12%);

  &:hover {
    transform: scale(1.05);
    z-index: 999;
    background-color: rgba(0, 0, 0, 0.05);
  }
}

[data-md-color-scheme="slate"] {
  .grid.cards > ul > li {
    box-shadow: 0 2px 2px 0 rgb(4 40 33 / 14%), 0 3px 1px -2px rgb(40 86 94 / 47%), 0 1px 5px 0 rgb(139 252 255 / 64%);

    &:hover {
      transform: scale(1.05);
      z-index: 999;
      background-color: rgba(139, 252, 255, 0.05);
    }
  }
}
1 change: 0 additions & 1 deletion docs/css/neoteroi.css

This file was deleted.

8 changes: 4 additions & 4 deletions docs/getting-started/first-steps.md
@@ -1,11 +1,11 @@
---
title: Getting Started
title: First Steps
alias:
name: getting-started
text: Getting Started
name: first-steps
text: First Steps
---

# Getting started
# First Steps

!!! Warning
    Make sure you have read [[installation]] before using the library.
43 changes: 28 additions & 15 deletions docs/index.md
@@ -9,26 +9,39 @@ It runs most of them in parallel either locally or in a cluster and supports
distributed caching of results.

If you're a first time user of pyDVL, we recommend you to go through the
[[getting-started]] and [[installation]] guides.
[[installation]] and [[first-steps]] guides in the Getting Started section.

::cards:: cols=2
<div class="grid cards" markdown>

- title: Installation
content: Steps to install and requirements
url: getting-started/installation.md
- :fontawesome-solid-toolbox:{ .lg .middle } __Installation__

---
Steps to install and requirements

[[installation|:octicons-arrow-right-24: Installation]]

- :fontawesome-solid-scale-unbalanced:{ .lg .middle } __Data valuation__

---

- title: Data valuation
content: >
Basics of data valuation and description of the main algorithms
url: value/

- title: Influence Function
content: >
[[data-valuation|:octicons-arrow-right-24: Data Valuation]]

- :fontawesome-solid-scale-unbalanced-flip:{ .lg .middle } __Influence Function__

---

An introduction to the influence function and its computation with pyDVL
url: influence/

- title: Browse the API
content: Full documentation of the API
url: api/pydvl/
[[influence-values|:octicons-arrow-right-24: Influence Values]]

- :fontawesome-regular-file-code:{ .lg .middle } __API Reference__

---

Full documentation of the API

[:octicons-arrow-right-24: API Reference](api/pydvl/)

::/cards::
</div>