-
Notifications
You must be signed in to change notification settings - Fork 7k
[Data] map_batches support limit_pushdown
#57880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] map_batches support limit_pushdown
#57880
Conversation
Signed-off-by: You-Cheng Lin <[email protected]>
map_batches support limit_pushdownmap_batches support limit_pushdown
Signed-off-by: You-Cheng Lin <[email protected]>
| fn_constructor_kwargs: Optional[Dict[str, Any]] = None, | ||
| min_rows_per_bundled_input: Optional[int] = None, | ||
| compute: Optional[ComputeStrategy] = None, | ||
| preserve_row_count: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
preserves_row_count
| assert result_with == expected | ||
|
|
||
|
|
||
| def test_limit_pushdown_preserve_row_count_with_map_batches( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this a parameterized test?
python/ray/data/dataset.py
Outdated
| worker. | ||
| memory: The heap memory in bytes to reserve for each parallel map worker. | ||
| concurrency: This argument is deprecated. Use ``compute`` argument. | ||
| preserve_row_count: Set to True only if the UDF always emits the same number of records it receives (no drops or duplicates). When true, the optimizer can push downstream limits past this transform for better pruning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the 2nd sentence: When set to True, the logical optimizer, in the presence of a limit(limit=k), will only scan k rows prior to executing the UDF, thereby saving on compute resources.
goutamvenkat-anyscale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. Just a few cosmetic changes
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
2bc26a1 to
93e8111
Compare
Signed-off-by: You-Cheng Lin <[email protected]>
## Description This PR adds a `preserve_row` option to `map_batches`. When `preserve_row` is true, the limit operator can be pushed down through this `map_batches` call for optimization. Note: `map_group` is built on `map_batches`, but limit pushdown support for `map_group` is out of scope for this PR, so `preserve_row_count` is set to false for it. ## Related issues ## Additional information --------- Signed-off-by: You-Cheng Lin <[email protected]> Signed-off-by: You-Cheng Lin <[email protected]> Co-authored-by: You-Cheng Lin <[email protected]> Signed-off-by: xgui <[email protected]>
## Description This PR adds a `preserve_row` option to `map_batches`. When `preserve_row` is true, the limit operator can be pushed down through this `map_batches` call for optimization. Note: `map_group` is built on `map_batches`, but limit pushdown support for `map_group` is out of scope for this PR, so `preserve_row_count` is set to false for it. ## Related issues ## Additional information --------- Signed-off-by: You-Cheng Lin <[email protected]> Signed-off-by: You-Cheng Lin <[email protected]> Co-authored-by: You-Cheng Lin <[email protected]>
## Description This PR adds a `preserve_row` option to `map_batches`. When `preserve_row` is true, the limit operator can be pushed down through this `map_batches` call for optimization. Note: `map_group` is built on `map_batches`, but limit pushdown support for `map_group` is out of scope for this PR, so `preserve_row_count` is set to false for it. ## Related issues ## Additional information --------- Signed-off-by: You-Cheng Lin <[email protected]> Signed-off-by: You-Cheng Lin <[email protected]> Co-authored-by: You-Cheng Lin <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
## Description This PR adds a `preserve_row` option to `map_batches`. When `preserve_row` is true, the limit operator can be pushed down through this `map_batches` call for optimization. Note: `map_group` is built on `map_batches`, but limit pushdown support for `map_group` is out of scope for this PR, so `preserve_row_count` is set to false for it. ## Related issues ## Additional information --------- Signed-off-by: You-Cheng Lin <[email protected]> Signed-off-by: You-Cheng Lin <[email protected]> Co-authored-by: You-Cheng Lin <[email protected]> Signed-off-by: Future-Outlier <[email protected]>
Description
This PR adds a
preserve_rowoption to map_batches. When preserve_rowis true, the limit operator can be pushed down through this map_batchescall for optimization.Note:
map_groupis built on map_batches, but limit pushdown support for map_groupis out of scope for this PR, so preserve_row_countis set to false for it.Related issues
Additional information