Search before asking
- [x] I have searched the issues and found no similar feature requests.
Description
Currently, Datasets are immutable: each operation (e.g., .map) creates new data blocks and leaves the original block references untouched.
However, this can lead to 2x+ memory usage in the cluster, since Ray cannot safely delete the old blocks, and both the old and new blocks remain referenced at least until the operation finishes (e.g., in ds = ds.map(f)).
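For example, this pattern (a sketch against the public Ray Data API; the identity function stands in for any transform f) holds two full copies of the data until the assignment completes:

```python
import ray

ds = ray.data.range(1000)

# While map() runs, `ds` still pins the original blocks and the operation
# materializes a second, transformed set of blocks, so peak object-store
# usage is roughly 2x the dataset size until the old reference is dropped.
ds = ds.map(lambda x: x)  # identity transform, standing in for any f
```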
There are a few options to resolve this:
Option 1: Have all operations work in place (mutate the dataset). Add a .clone() method for users who want the existing behavior.
```python
ds = ray.data.range(1000)
ds.map(f)
ds.random_shuffle()  # assignment to ds is optional, the original copy is mutated
```
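For concreteness, here is a minimal pure-Python mock (not Ray code) of what Option 1's semantics would mean; both the in-place map() and .clone() are hypothetical:

```python
import copy

class InplaceDataset:
    """Mock of Option 1: operations mutate the dataset's own blocks."""

    def __init__(self, blocks):
        self._blocks = blocks  # lists of items, standing in for object-store blocks

    def map(self, fn):
        # Overwrite each block in place: the pre-map blocks become garbage
        # immediately instead of lingering behind a second reference.
        self._blocks = [[fn(x) for x in block] for block in self._blocks]
        return self

    def clone(self):
        # Hypothetical escape hatch: pay for a full copy to keep a snapshot.
        return InplaceDataset(copy.deepcopy(self._blocks))

ds = InplaceDataset([[1, 2], [3, 4]])
snapshot = ds.clone()    # preserves the original values
ds.map(lambda x: x + 1)  # mutates ds; snapshot is unchanged
```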
Option 2: Have a lazy mode (ds.lazy()) where operations are queued and only executed when results are consumed.
```python
ds = ray.data.range(1000)
ds = ds.map(f)
ds = ds.random_shuffle()
ds.take()  # actually executes the lazy operation
```
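A minimal pure-Python mock (not the proposed Ray API) of how such a lazy mode could work: transformations are queued, and a consuming call like take() runs the whole pipeline at once, so intermediate blocks never need to be pinned:

```python
import random

class LazyDataset:
    """Mock of Option 2: operations queue up and run only on consumption."""

    def __init__(self, items):
        self._items = items
        self._ops = []  # queued transformations, in application order

    def map(self, fn):
        self._ops.append(lambda items: [fn(x) for x in items])
        return self

    def random_shuffle(self):
        self._ops.append(lambda items: random.sample(items, len(items)))
        return self

    def take(self, n=5):
        # Execute the queued pipeline now; each intermediate result is a
        # plain local that can be freed as soon as the next stage runs.
        items = self._items
        for op in self._ops:
            items = op(items)
        return items[:n]

ds = LazyDataset(list(range(1000)))
ds = ds.map(lambda x: x + 1).random_shuffle()
print(ds.take())  # execution happens here
```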
Option 3: Allow users to select, per operation, whether it runs in place / moves blocks.
```python
ds = ray.data.range(1000)
ds.map_(f)
ds.random_shuffle_()  # PyTorch-like "in-place" syntax
```
Option 4: Rely on lineage reconstruction, and automatically delete past blocks by default.
```python
ds0 = ray.data.range(1000)
ds = ds0.map(f)            # drops ds0's blocks from memory
ds = ds0.random_shuffle()  # reconstructs the original ds0 before shuffling
```
Currently the thinking is that Option 4 is best, but it relies on lineage reconstruction being enabled by default, as well as proper ray.put() support (cc @stephanie-wang ).
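To make the Option 4 mechanics concrete, here is a minimal pure-Python mock (not Ray's actual lineage reconstruction): each dataset records the operation that produced its blocks, so upstream blocks can be dropped eagerly and transparently recomputed when a later operation needs them:

```python
class LineageDataset:
    """Mock of Option 4: blocks are droppable because lineage can rebuild them."""

    def __init__(self, op, parent=None):
        self._op = op          # function: parent blocks (or None) -> blocks
        self._parent = parent  # upstream dataset in the lineage graph
        self._blocks = None    # None means "dropped / not yet materialized"

    def _materialize(self):
        if self._blocks is None:  # reconstruct from lineage on demand
            upstream = self._parent._materialize() if self._parent else None
            self._blocks = self._op(upstream)
        return self._blocks

    def map(self, fn):
        child = LineageDataset(
            op=lambda blocks: [[fn(x) for x in b] for b in blocks],
            parent=self,
        )
        child._materialize()
        self._blocks = None  # eagerly drop upstream blocks, as in Option 4
        return child

def mock_range(n, block_size=100):
    # Root dataset whose lineage is "regenerate the range from scratch".
    return LineageDataset(
        op=lambda _: [list(range(i, min(i + block_size, n)))
                      for i in range(0, n, block_size)]
    )

ds0 = mock_range(1000)
ds = ds0.map(lambda x: x + 1)   # drops ds0's blocks after producing ds
ds2 = ds0.map(lambda x: x * 2)  # ds0 is transparently reconstructed first
```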
Use case
No response
Related issues
No response
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!