Search before asking
- [x] I have searched the issues and found no similar feature requests.
Description
Currently, Datasets are immutable: each operation (e.g., .map) creates new data blocks and leaves the original block references untouched.
However, this can lead to 2x+ memory usage in the cluster, since Ray cannot safely delete the old blocks, and both the old and new blocks remain referenced at least until the operation finishes (e.g., in ds = ds.map(f)).
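For example, this pattern (a sketch against the public Ray Data API; the identity function stands in for any transform f) holds two full copies of the data until the assignment completes:

```python
import ray

ds = ray.data.range(1000)

# While map() runs, `ds` still pins the original blocks and the operation
# materializes a second, transformed set of blocks, so peak object-store
# usage is roughly 2x the dataset size until the old reference is dropped.
ds = ds.map(lambda x: x)  # identity transform, standing in for any f
```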
There are a few options to resolve this:
Option 1: Have all operations work in place (mutate the dataset). Add a .clone() method for users who want the existing behavior.
```python
ds = ray.data.range(1000)
ds.map(f)
ds.random_shuffle()  # assignment to ds is optional, the original copy is mutated
```
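For concreteness, here is a minimal pure-Python mock (not Ray code) of what Option 1's semantics would mean; both the in-place map() and .clone() are hypothetical:

```python
import copy

class InplaceDataset:
    """Mock of Option 1: operations mutate the dataset's own blocks."""

    def __init__(self, blocks):
        self._blocks = blocks  # lists of items, standing in for object-store blocks

    def map(self, fn):
        # Overwrite each block in place: the pre-map blocks become garbage
        # immediately instead of lingering behind a second reference.
        self._blocks = [[fn(x) for x in block] for block in self._blocks]
        return self

    def clone(self):
        # Hypothetical escape hatch: pay for a full copy to keep a snapshot.
        return InplaceDataset(copy.deepcopy(self._blocks))

ds = InplaceDataset([[1, 2], [3, 4]])
snapshot = ds.clone()    # preserves the original values
ds.map(lambda x: x + 1)  # mutates ds; snapshot is unchanged
```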
Option 2: Have a lazy mode (ds.lazy()) where operations are queued and only executed when results are consumed.
```python
ds = ray.data.range(1000)
ds = ds.map(f)
ds = ds.random_shuffle()
ds.take()  # actually executes the lazy operation
```
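A minimal pure-Python mock (not the proposed Ray API) of how such a lazy mode could work: transformations are queued, and a consuming call like take() runs the whole pipeline at once, so intermediate blocks never need to be pinned:

```python
import random

class LazyDataset:
    """Mock of Option 2: operations queue up and run only on consumption."""

    def __init__(self, items):
        self._items = items
        self._ops = []  # queued transformations, in application order

    def map(self, fn):
        self._ops.append(lambda items: [fn(x) for x in items])
        return self

    def random_shuffle(self):
        self._ops.append(lambda items: random.sample(items, len(items)))
        return self

    def take(self, n=5):
        # Execute the queued pipeline now; each intermediate result is a
        # plain local that can be freed as soon as the next stage runs.
        items = self._items
        for op in self._ops:
            items = op(items)
        return items[:n]

ds = LazyDataset(list(range(1000)))
ds = ds.map(lambda x: x + 1).random_shuffle()
print(ds.take())  # execution happens here
```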
Option 3: Allow users to select, per operation, whether it runs in place / moves blocks.
```python
ds = ray.data.range(1000)
ds.map_(f)
ds.random_shuffle_()  # PyTorch-like "in-place" syntax
```
Option 4: Rely on lineage reconstruction, and automatically delete past blocks by default.
```python
ds0 = ray.data.range(1000)
ds = ds0.map(f)            # drops ds0's blocks from memory
ds = ds0.random_shuffle()  # reconstructs the original ds0 before shuffling
```
Currently the thinking is that Option 4 is best, but it relies on lineage reconstruction being enabled by default, as well as proper ray.put() support (cc @stephanie-wang ).
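To make the Option 4 mechanics concrete, here is a minimal pure-Python mock (not Ray's actual lineage reconstruction): each dataset records the operation that produced its blocks, so upstream blocks can be dropped eagerly and transparently recomputed when a later operation needs them:

```python
class LineageDataset:
    """Mock of Option 4: blocks are droppable because lineage can rebuild them."""

    def __init__(self, op, parent=None):
        self._op = op          # function: parent blocks (or None) -> blocks
        self._parent = parent  # upstream dataset in the lineage graph
        self._blocks = None    # None means "dropped / not yet materialized"

    def _materialize(self):
        if self._blocks is None:  # reconstruct from lineage on demand
            upstream = self._parent._materialize() if self._parent else None
            self._blocks = self._op(upstream)
        return self._blocks

    def map(self, fn):
        child = LineageDataset(
            op=lambda blocks: [[fn(x) for x in b] for b in blocks],
            parent=self,
        )
        child._materialize()
        self._blocks = None  # eagerly drop upstream blocks, as in Option 4
        return child

def mock_range(n, block_size=100):
    # Root dataset whose lineage is "regenerate the range from scratch".
    return LineageDataset(
        op=lambda _: [list(range(i, min(i + block_size, n)))
                      for i in range(0, n, block_size)]
    )

ds0 = mock_range(1000)
ds = ds0.map(lambda x: x + 1)   # drops ds0's blocks after producing ds
ds2 = ds0.map(lambda x: x * 2)  # ds0 is transparently reconstructed first
```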
Use case
No response
Related issues
No response
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!