
Create abstraction for caching #458

Merged: 32 commits into develop on Dec 19, 2023

Conversation

AnesBenmerzoug (Collaborator) commented on Nov 23, 2023

Description

This PR closes #189 and closes #124.

At first I tried to rely on joblib's Memory class by just implementing a new backend for Memcached, but that proved too cumbersome because Memory relies on I/O operations (opening files, writing to files, etc.). Instead, I took inspiration from their implementation while keeping some details from our previous one (e.g. CacheStats, repeated evaluations).

I created a separate issue (#459) for a notebook showcasing the use of caching and its benefits.

Changes

  • Replace caching module with caching package.
  • Create CacheBackend base class for caching backend implementations (a rough, illustrative sketch follows this list).
  • Implement InMemoryCacheBackend, DiskCacheBackend and MemcachedCacheBackend.
  • Create CachedFunc class to wrap cached functions and methods.
  • Change default caching time_threshold from 0.3 to 0.0.
  • Create new memcached extra and thus make pymemcache an optional dependency.
  • Rename config class MemcachedConfig to CachedFuncConfig, remove memcached client config from it.
  • Move caching configuration to separate config module inside the caching package.
  • Adapt Utility to caching changes.
  • Update existing tests and add new ones.
  • Update and improve installation documentation with a clear definition and description of all the extra dependencies.
  • Move all documentation related to caching, memcached usage and parallelization to the first-steps page.
  • Remove caching section from readme.
  • Add link to extras section in documentation to readme.
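
To make the new layout concrete, here is a rough, hedged sketch of what such a backend abstraction could look like. The method names (get, set, clear) and the dict-backed in-memory store are assumptions for illustration, not the PR's actual signatures:

```python
# Illustrative sketch only: get/set/clear and the dict-backed store are
# assumptions for this example, not pyDVL's actual interfaces.
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class CacheBackend(ABC):
    """Minimal key-value interface that concrete backends would implement."""

    @abstractmethod
    def get(self, key: str) -> Optional[Any]:
        ...

    @abstractmethod
    def set(self, key: str, value: Any) -> None:
        ...

    @abstractmethod
    def clear(self) -> None:
        ...


class InMemoryCacheBackend(CacheBackend):
    """Backend backed by a plain dict, local to the current process."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        return self._store.get(key)

    def set(self, key: str, value: Any) -> None:
        self._store[key] = value

    def clear(self) -> None:
        self._store.clear()
```

A DiskCacheBackend or MemcachedCacheBackend would then implement the same interface against a directory or a memcached client, which is what makes pymemcache an optional dependency.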

Checklist

  • Wrote Unit tests (if necessary)
  • Updated Documentation (if necessary)
  • Updated Changelog
  • If notebooks were added/changed, ensured that added boilerplate cells are tagged with "tags": ["hide"] or "tags": ["hide-input"]

AnesBenmerzoug self-assigned this on Nov 23, 2023
mdbenito (Collaborator) commented on Nov 28, 2023

A high-level question: what is the use case for the in-memory cache if it doesn't use inter-process shared memory or, more simply, a multiprocessing manager?

It kicks in when submitting batches of samples to a worker, correct? But what about single samples?

AnesBenmerzoug (Collaborator, Author) replied:

@mdbenito Yes, it would be used in a single step of computation in a worker, e.g. a single permutation.
It is really only useful for permutation-based Shapley computation.
I see the InMemoryCacheBackend as very similar to @lru_cache, so if we ever deem the caching system not that useful we can just use the latter directly.
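
To illustrate the analogy, a minimal standard-library sketch; the utility function here is a hypothetical stand-in for model training and scoring, not pyDVL code:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def utility(subset: frozenset) -> float:
    # stand-in for an expensive train-and-score step
    return sum(subset) / (len(subset) or 1)


utility(frozenset({1, 2, 3}))  # computed
utility(frozenset({1, 2, 3}))  # served from the in-process cache
```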

mdbenito (Collaborator) commented on Dec 1, 2023

> @mdbenito Yes, it would be used in a single step of computation in a worker, e.g. a single permutation. It is really only useful for permutation-based Shapley computation. I see the InMemoryCacheBackend as very similar to @lru_cache, so if we ever deem the caching system not that useful we can just use the latter directly.

Sorry, I think I wasn't clear enough. A process will only benefit from the InMemoryCache (irrespective of the sampling method) if it computes more than one marginal utility and there is a hit. For permutation sampling this can be achieved by batching two or more computations. This is not default behaviour (remember that Markus implemented it, and we left it in as a temporary hack), and the user needs to batch the samples explicitly. Otherwise, different futures will be executed in different processes, and only the use of System V-style shared memory or a managed dict solves the issue. My question is: why not go for the shared memory? (A sketch of the managed-dict alternative follows.)
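
For illustration only, a minimal sketch of that managed-dict alternative, assuming marginals are dispatched to a process pool; the marginal function and its cost are hypothetical:

```python
from multiprocessing import Manager, Pool


def marginal(args):
    shared_cache, key = args
    if key in shared_cache:
        return shared_cache[key]  # hit: computed earlier by any worker
    value = key * key  # stand-in for an expensive utility evaluation
    shared_cache[key] = value
    return value


if __name__ == "__main__":
    with Manager() as manager:
        cache = manager.dict()  # one dict shared by all worker processes
        with Pool(4) as pool:
            # repeated keys exercise the cross-process cache
            print(pool.map(marginal, [(cache, k) for k in (1, 2, 1, 3, 2)]))
```

Note that two workers may still race to compute the same key; for a cache that only costs a duplicated evaluation, not incorrect results.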

AnesBenmerzoug linked an issue on Dec 4, 2023 that may be closed by this pull request
mdbenito (Collaborator) left a review comment


Left a couple of comments. But my main concern is the in-memory cache, which won't be exploited at all with the default config, since worker processes only compute one marginal each. The exception is semivalues, which can batch, but this is not default behaviour and it isn't documented that one should use it.

Review comments were left on the following files (all since resolved):

  • src/pydvl/utils/caching/config.py
  • src/pydvl/utils/utility.py
  • src/pydvl/utils/caching/disk.py
  • src/pydvl/utils/caching/memcached.py
AnesBenmerzoug (Collaborator, Author) replied:

@mdbenito Thanks for the review! I addressed the code-related comments.

As for the InMemoryCacheBackend class, I think you're wrong to say that it isn't useful because only one marginal is computed. Except for the semivalues module, which submits one marginal computation at a time, all the others (e.g. least core, permutation and combinatorial Monte Carlo) make multiple calls to the utility in a single worker, so it is useful in those cases (a sketch of why follows this comment).

I thought about using what you suggested above, inter-process shared memory or, more simply, a multiprocessing manager, but I wasn't sure about investing more time in that before making sure that caching is actually useful (#459).

If you still think that the current implementation of InMemoryCacheBackend is not useful, I can remove it.
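
As an illustration of that point for permutation sampling, a sketch with a hypothetical utility callable and a plain dict standing in for InMemoryCacheBackend: each marginal needs both u(prefix ∪ {i}) and u(prefix), and the first of these becomes the second on the next step, so every step after the first reuses a cached value.

```python
from typing import Callable, FrozenSet, List


def permutation_marginals(
    permutation: List[int], utility: Callable[[FrozenSet[int]], float]
) -> List[float]:
    cache: dict = {}  # stands in for InMemoryCacheBackend

    def cached_utility(subset: FrozenSet[int]) -> float:
        if subset not in cache:
            cache[subset] = utility(subset)
        return cache[subset]

    marginals = []
    prefix: FrozenSet[int] = frozenset()
    for i in permutation:
        with_i = prefix | {i}
        # cached_utility(prefix) hits the cache on every step after the first
        marginals.append(cached_utility(with_i) - cached_utility(prefix))
        prefix = with_i
    return marginals
```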

mdbenito (Collaborator) commented on Dec 14, 2023

As discussed in our meeting, this is currently not working:

```python
u1 = Utility(model, data, scorer)
u2 = Utility(model2, data, scorer)

v1 = compute_values(u1)  # misses cache
v2 = compute_values(u2)  # misses cache
v3 = compute_values(u1)  # hits cache
```

AnesBenmerzoug merged commit d6cb5e6 into develop on Dec 19, 2023. 18 checks passed.
mdbenito deleted the feature/create-abstraction-for-cache branch on February 13, 2024.
Successfully merging this pull request may close these issues:

  • Create abstraction for Cache
  • Offer alternative to memcache outside docker