Cache batches #109
Comments
I am just starting to work with xbatcher on some fairly large datasets (terabyte scale) and have run into a similar problem. I like the idea of caching after generating the batches, but am wondering if only the indices need to be stored; making a second copy of the data as batches is not reasonable at this scale. Ideally we would not need to iterate through the batches to fill BatchGenerator (the indices may be computable directly). I see there are a couple of comments in the code about lazy loading compared to the current eager implementation. I'm happy to contribute on this.
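To illustrate the index-only idea, here is a standalone sketch (not xbatcher code; the helper function and its arguments are hypothetical) that precomputes slice dictionaries instead of materializing batch data:

```python
import itertools

def batch_slices(sizes, input_dims):
    """Yield dicts of slices covering a dataset with dimension lengths ``sizes``."""
    per_dim = []
    for dim, length in input_dims.items():
        starts = range(0, sizes[dim] - length + 1, length)
        per_dim.append([(dim, slice(s, s + length)) for s in starts])
    for combo in itertools.product(*per_dim):
        yield dict(combo)

# e.g. list(batch_slices({"x": 128, "y": 128}, {"x": 64, "y": 64}))
# -> four index dicts, each selecting a 64x64 tile lazily via ds.isel(**d)
```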
Hi @tjvandal, thanks for trying out xbatcher and commenting on this issue! I agree with your points; in fact, this topic has come up a couple of times before. Since this feature matters for more than just caching, I opened #111 for further discussion. Any contributions would be more than welcome!
Is your feature request related to a problem?
Generating batches can be slow. A typical dataset used in Xbatcher originates in either cloud storage or a local file system, is loaded into Xarray lazily using Dask, and includes some lazy processing. Xbatcher then does some indexing/transposing/concatenation on the underlying arrays and loads the data into what we think of as a batch. When batching through a dataset multiple times, it is often desirable to cache batches to speed up loading them later on (e.g. while training an ML model).
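For concreteness, a minimal sketch of that workflow (the path, preprocessing, and dimension sizes are illustrative assumptions, not taken from a real dataset):

```python
import xarray as xr
import xbatcher

# Lazily open the source data as Dask-backed arrays (cloud store or local file).
ds = xr.open_dataset("gs://bucket/data.zarr", engine="zarr", chunks={})
ds = (ds - ds.mean()) * 0.5  # some lazy preprocessing

# Slice the dataset into fixed-size patches along x and y.
bgen = xbatcher.BatchGenerator(ds, input_dims={"x": 64, "y": 64})

# Every pass through the generator (e.g. one training epoch) repeats the
# indexing/transposing/concatenation work, which is what caching would avoid.
for epoch in range(10):
    for batch in bgen:
        ...  # feed the batch to a model
```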
It would be nice if Xbatcher included some features that made the process of generating and caching batches less painful.
Describe the solution you'd like
After discussing this with @maxrjones and @norlandrhagen, I think there are two leading approaches to caching that Xbatcher should explore.
The first option is similar to what @leifdenby proposed in #40. It would require users to explicitly pre-process their generator into something optimized for batch generation. In practice, it would look something like this:
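A rough sketch of this option (the `to_zarr`/`open_batches`-style names below are hypothetical placeholders, not an existing xbatcher API):

```python
import xarray as xr
import xbatcher

ds = xr.open_dataset("source.zarr", engine="zarr", chunks={})
bgen = xbatcher.BatchGenerator(ds, input_dims={"x": 64, "y": 64})

# Explicit pre-processing step: materialize every batch into one zarr store.
bgen.to_zarr("batch-cache.zarr")  # hypothetical method

# Later (or in another process), reopen the pre-generated batches directly.
cached = xbatcher.open_batches("batch-cache.zarr")  # hypothetical function
for batch in cached:
    ...
```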
I like this approach because it produces a zarr store that could be easily interrogated by an interested user. The main downside is that it requires an explicit cache dump step.
The second option would push cache management inside the generator itself, so that batch generation would first check whether the batch exists in a configurable cache. The top-level API could look something like this:
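A rough sketch, assuming a `cache` keyword that accepts a mutable mapping (both the keyword and the choice of a zarr directory store are illustrative, not a committed API):

```python
import xarray as xr
import xbatcher
import zarr

ds = xr.open_dataset("source.zarr", engine="zarr", chunks={})

# Any MutableMapping could serve as the cache; here, a local zarr directory store.
cache = zarr.storage.DirectoryStore("batch-cache")

bgen = xbatcher.BatchGenerator(
    ds,
    input_dims={"x": 64, "y": 64},
    cache=cache,  # hypothetical keyword: check here before regenerating a batch
)

# The first pass populates the cache; subsequent passes read batches back from it.
for epoch in range(3):
    for batch in bgen:
        ...
```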
I like this approach because it has the potential to be highly configurable (when needed) and does not require a manual cache dump. The main downside I see is that the cache will be split into a bunch of small datasets (probably zarr stores).
Describe alternatives you've considered
We should consider not supporting caching in Xbatcher at all, and instead developing a few recipes for how to use Dask caching (e.g. https://docs.dask.org/en/stable/caching.html or https://github.com/radix-ai/graphchain) or caching at the data loader level (e.g. https://www.tensorflow.org/api_docs/python/tf/data/Dataset#cache).
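For example, Dask's opportunistic cache can be enabled with a couple of lines (this uses the documented `dask.cache.Cache`, which requires the optional `cachey` dependency), leaving xbatcher itself cache-free:

```python
from dask.cache import Cache

cache = Cache(2e9)  # hold up to ~2 GB of intermediate Dask results
cache.register()    # activate globally for subsequent computations

# Any Dask-backed batches generated afterwards can reuse cached intermediates.
```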
See also this interesting conversation where this feature is discussed in the Pytorch context: pytorch/pytorch#35642
Additional context
No response