Simplify caching #600

Open
janosg opened this issue Jun 13, 2024 · 1 comment
Labels
cleanup when code is ugly or unreadable and needs restyling design-problem Problems with internal architecture

Comments

@janosg
Collaborator

janosg commented Jun 13, 2024

The new design of data valuation methods avoids repeated computations of the utility function without relying on caching. We could therefore get rid of our current memcached-based caching implementation, which seems overpowered for what we still need. This would close several caching-related issues (e.g. #517, #475, #464 and #459). It could also solve the problems caused by the many files the current caching solution creates.

The only situation where caching is still really important is benchmarking multiple algorithms: there, caching keeps randomness as constant as possible between algorithms and saves runtime. We should therefore create an entry point that lets benchmarking frameworks enable caching. I see two possible solutions:

  1. Use a simple shared-memory cache to store all utility evaluations and return them as part of the ValuationResult. A benchmarking library could then use these evaluations to build up a cache. All logic to wrap Utility with a cached version would be in the benchmarking library.
  2. We could keep the cache_backend abstraction in the Utility but only implement a much simpler shared-memory backend in pydvl. Users with advanced caching needs could then build their own backends.
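
To make the first option concrete, here is a minimal sketch of how a benchmarking library could record utility evaluations and reuse them in a later run. Everything in it is an assumption for illustration: `RecordingUtility` and `toy_utility` are hypothetical names, not pydvl APIs, and a plain in-process dict stands in for the shared-memory store.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Optional

Subset = FrozenSet[int]


class RecordingUtility:
    """Hypothetical wrapper that records every utility evaluation."""

    def __init__(
        self,
        fn: Callable[[Subset], float],
        cache: Optional[Dict[Subset, float]] = None,
    ):
        self.fn = fn
        # A benchmarking library may pass in evaluations from an earlier run.
        self.cache: Dict[Subset, float] = dict(cache or {})
        self.n_calls = 0  # number of real (uncached) evaluations

    def __call__(self, indices: Iterable[int]) -> float:
        key = frozenset(indices)  # order of indices does not matter
        if key not in self.cache:
            self.n_calls += 1
            self.cache[key] = self.fn(key)
        return self.cache[key]


def toy_utility(subset: Subset) -> float:
    # Stand-in for an expensive "fit model on subset and score it" step.
    return float(sum(subset))


# First algorithm: evaluations are recorded as they happen.
u1 = RecordingUtility(toy_utility)
first = [u1([0, 1]), u1([1, 0]), u1([2])]

# Second algorithm: the benchmark seeds it with the recorded evaluations.
u2 = RecordingUtility(toy_utility, cache=u1.cache)
second = [u2([0, 1]), u2([2])]
```

In this sketch all wrapping logic lives outside the valuation code, which is the point of option 1. Option 2 would instead keep a backend interface inside the library and ship only a simple default like the dict above.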
@janosg janosg added cleanup when code is ugly or unreadable and needs restyling design-problem Problems with internal architecture labels Jun 13, 2024
@AnesBenmerzoug
Collaborator

Now that we only use joblib for the parallelization of data valuation algorithms, we could also leverage its caching mechanism through the Memory class and maybe offer only one extension to support caching in a distributed setting.

I tried using it when I refactored the caching backends and couldn't really make it work with memcached, because joblib's cache is implemented as a file-based cache. So I gave up on basing our code on it, but I still took heavy inspiration from its interface, so perhaps we could consider it again.
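
For reference, a minimal sketch of what file-based caching with `joblib.Memory` looks like. The function `toy_score` and the `calls` list are hypothetical and only illustrate the mechanism; this is not how pydvl wires things up.

```python
import tempfile

from joblib import Memory

calls = []  # records each real (uncached) evaluation


def toy_score(indices):
    # Stand-in for an expensive utility evaluation.
    calls.append(indices)
    return float(sum(indices))


with tempfile.TemporaryDirectory() as tmp:
    # Memory stores results as files under `location`.
    memory = Memory(location=tmp, verbose=0)
    cached_score = memory.cache(toy_score)

    a = cached_score((0, 1, 2))  # computed and written to disk
    b = cached_score((0, 1, 2))  # loaded from disk, toy_score is not called
```

Because results are keyed on the function and its arguments and stored on disk, this works across processes on one machine, but a distributed setting would still need a shared location or a separate extension.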

@janosg janosg mentioned this issue Jul 9, 2024
6 tasks