Skip to content

Optimize memory usage of datasets  #18791

@ericl

Description

@ericl

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Currently, Datasets are considered immutable. Each operation (e.g., .map) creates new data blocks, leaving the original data block references untouched.

However, this can lead to 2x+ memory usage in the cluster, since Ray cannot safely delete the old blocks, and both block references are at least referenced until the operation finishes (e.g., in ds = ds.map(f)).

There are a few options to resolve this:

Option 1: Have all operations work in-place (mutate the dataset). Add a .clone() method if users want to get the existing behavior.

ds = ray.data.range(1000)
ds.map(f)
ds.random_shuffle()  # assignment to ds is optional, the original copy is mutated

Option 2: Have a lazy mode (ds.lazy()) where operations are queued and executed lazily.

ds = ray.data.range(1000)
ds = ds.map(f)
ds = ds.random_shuffle()
ds.take()  # actually executes the lazy operation

Option 3: Allow users to select whether operations are in-place/move blocks per operation.

ds = ray.data.range(1000)
ds.map_(f)
ds.random_shuffle_()  # pytorch-like "inplace" syntax

Option 4: Rely on lineage reconstruction, and auto delete past blocks by default.

ds0 = ray.data.range(1000)
ds = ds0.map(f)  # drops ds0 blocks from memory
ds = ds0.random_shuffle()  # reconstructs original ds before shuffling

Currently the thought is that Option 4 is the best, but relies on lineage reconstruction to be on by default, as well as proper ray.put() support (cc @stephanie-wang )

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Metadata

Metadata

Assignees

Labels

P1Issue that should be fixed within a few weeksdataRay Data-related issuesenhancementRequest for new feature and/or capability

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions