
Create abstraction for caching #458

Merged: 32 commits into develop on Dec 19, 2023

Conversation

AnesBenmerzoug (Collaborator) commented on Nov 23, 2023

Description

This PR closes #189 and closes #124.

At first I tried to rely on joblib's Memory class by just implementing a new backend for Memcached, but that proved too cumbersome because Memory relies on I/O operations (opening files, writing to files, etc.). Instead, I took inspiration from their implementation while keeping some details from our previous one (e.g. CacheStats, repeated evaluations).

I created a separate issue (#459) for a notebook showcasing the use of caching and its benefits.

Changes

  • Replace caching module with caching package.
  • Create CacheBackend base class for caching backend implementations (a rough, illustrative sketch follows this list).
  • Implement InMemoryCacheBackend, DiskCacheBackend and MemcachedCacheBackend.
  • Create CachedFunc class to wrap cached functions and methods.
  • Change default caching time_threshold from 0.3 to 0.0.
  • Create new memcached extra and thus make pymemcache an optional dependency.
  • Rename config class MemcachedConfig to CachedFuncConfig, remove memcached client config from it.
  • Move caching configuration to separate config module inside the caching package.
  • Adapt Utility to caching changes.
  • Update existing tests and add new ones.
  • Update and improve installation documentation with a clear definition and description of all the extra dependencies.
  • Move all documentation related to caching, memcached usage and parallelization to the first-steps page.
  • Remove caching section from readme.
  • Add link to extras section in documentation to readme.
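
To make the new layout concrete, here is a rough, hedged sketch of what such a backend abstraction could look like. The method names (get, set, clear) and the dict-backed in-memory store are assumptions for illustration, not the PR's actual signatures:

```python
# Illustrative sketch only: get/set/clear and the dict-backed store are
# assumptions for this example, not pyDVL's actual interfaces.
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class CacheBackend(ABC):
    """Minimal key-value interface that concrete backends would implement."""

    @abstractmethod
    def get(self, key: str) -> Optional[Any]:
        ...

    @abstractmethod
    def set(self, key: str, value: Any) -> None:
        ...

    @abstractmethod
    def clear(self) -> None:
        ...


class InMemoryCacheBackend(CacheBackend):
    """Backend backed by a plain dict, local to the current process."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        return self._store.get(key)

    def set(self, key: str, value: Any) -> None:
        self._store[key] = value

    def clear(self) -> None:
        self._store.clear()
```

A DiskCacheBackend or MemcachedCacheBackend would then implement the same interface against a directory or a memcached client, which is what makes pymemcache an optional dependency.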

Checklist

  • Wrote Unit tests (if necessary)
  • Updated Documentation (if necessary)
  • Updated Changelog
  • If notebooks were added/changed, ensured that added boilerplate cells are tagged with "tags": ["hide"] or "tags": ["hide-input"]

AnesBenmerzoug self-assigned this on Nov 23, 2023
mdbenito (Collaborator) commented on Nov 28, 2023

A high-level question: what is the use case for the in-memory cache if it doesn't use inter-process shared memory or, more simply, a multiprocessing manager?

It kicks in when submitting batches of samples to a worker, correct? But what about single samples?

AnesBenmerzoug (Collaborator, Author) replied:

@mdbenito Yes, it would be used in a single step of computation in a worker, e.g. a single permutation.
It is really only useful for permutation-based Shapley computation.
I see the InMemoryCacheBackend as very similar to @lru_cache, so if we ever deem the caching system not that useful we can just use the latter directly.
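
To illustrate the analogy, a minimal standard-library sketch; the utility function here is a hypothetical stand-in for model training and scoring, not pyDVL code:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def utility(subset: frozenset) -> float:
    # stand-in for an expensive train-and-score step
    return sum(subset) / (len(subset) or 1)


utility(frozenset({1, 2, 3}))  # computed
utility(frozenset({1, 2, 3}))  # served from the in-process cache
```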

mdbenito (Collaborator) commented on Dec 1, 2023

> @mdbenito Yes, it would be used in a single step of computation in a worker, e.g. a single permutation. It is really only useful for permutation-based Shapley computation. I see the InMemoryCacheBackend as very similar to @lru_cache, so if we ever deem the caching system not that useful we can just use the latter directly.

Sorry, I think I wasn't clear enough. A process will only benefit from the InMemoryCache (irrespective of the sampling method) if it computes more than one marginal utility and there is a hit. For permutation sampling this can be achieved by batching two or more computations. This is not default behaviour (remember that Markus implemented it, and we left it in as a temporary hack), and the user needs to batch the samples explicitly. Otherwise, different futures will be executed in different processes, and only the use of System V-style shared memory or a managed dict solves the issue. My question is: why not go for the shared memory? (A sketch of the managed-dict alternative follows.)
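
For illustration only, a minimal sketch of that managed-dict alternative, assuming marginals are dispatched to a process pool; the marginal function and its cost are hypothetical:

```python
from multiprocessing import Manager, Pool


def marginal(args):
    shared_cache, key = args
    if key in shared_cache:
        return shared_cache[key]  # hit: computed earlier by any worker
    value = key * key  # stand-in for an expensive utility evaluation
    shared_cache[key] = value
    return value


if __name__ == "__main__":
    with Manager() as manager:
        cache = manager.dict()  # one dict shared by all worker processes
        with Pool(4) as pool:
            # repeated keys exercise the cross-process cache
            print(pool.map(marginal, [(cache, k) for k in (1, 2, 1, 3, 2)]))
```

Note that two workers may still race to compute the same key; for a cache that only costs a duplicated evaluation, not incorrect results.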

AnesBenmerzoug linked an issue on Dec 4, 2023 that may be closed by this pull request
mdbenito (Collaborator) left a review comment


Left a couple of comments. But my main concern is the in-memory cache, which won't be exploited at all with the default config, since worker processes only compute one marginal each. The exception is semivalues, which can batch, but this is not default behaviour and it isn't documented that one should use it.

Review comments were left on the following files (all since resolved):

  • src/pydvl/utils/caching/config.py
  • src/pydvl/utils/utility.py
  • src/pydvl/utils/caching/disk.py
  • src/pydvl/utils/caching/memcached.py
AnesBenmerzoug (Collaborator, Author) replied:

@mdbenito Thanks for the review! I addressed the code-related comments.

As for the InMemoryCacheBackend class, I think you're wrong to say that it isn't useful because only one marginal is computed. Except for the semivalues module, which submits one marginal computation at a time, all the others (e.g. least core, permutation and combinatorial Monte Carlo) make multiple calls to the utility in a single worker, so it is useful in those cases (a sketch of why follows this comment).

I thought about using what you suggested above, inter-process shared memory or, more simply, a multiprocessing manager, but I wasn't sure about investing more time in that before making sure that caching is actually useful (#459).

If you still think that the current implementation of InMemoryCacheBackend is not useful, I can remove it.
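
As an illustration of that point for permutation sampling, a sketch with a hypothetical utility callable and a plain dict standing in for InMemoryCacheBackend: each marginal needs both u(prefix ∪ {i}) and u(prefix), and the first of these becomes the second on the next step, so every step after the first reuses a cached value.

```python
from typing import Callable, FrozenSet, List


def permutation_marginals(
    permutation: List[int], utility: Callable[[FrozenSet[int]], float]
) -> List[float]:
    cache: dict = {}  # stands in for InMemoryCacheBackend

    def cached_utility(subset: FrozenSet[int]) -> float:
        if subset not in cache:
            cache[subset] = utility(subset)
        return cache[subset]

    marginals = []
    prefix: FrozenSet[int] = frozenset()
    for i in permutation:
        with_i = prefix | {i}
        # cached_utility(prefix) hits the cache on every step after the first
        marginals.append(cached_utility(with_i) - cached_utility(prefix))
        prefix = with_i
    return marginals
```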

mdbenito (Collaborator) commented on Dec 14, 2023

As discussed in our meeting, this is currently not working:

```python
u1 = Utility(model, data, scorer)
u2 = Utility(model2, data, scorer)

v1 = compute_values(u1)  # misses cache
v2 = compute_values(u2)  # misses cache
v3 = compute_values(u1)  # hits cache
```

AnesBenmerzoug merged commit d6cb5e6 into develop on Dec 19, 2023. 18 checks passed.
mdbenito deleted the feature/create-abstraction-for-cache branch on February 13, 2024.
Successfully merging this pull request may close these issues:

  • Create abstraction for Cache
  • Offer alternative to memcache outside docker