Conversation

@owenowenisme (Member) commented Oct 18, 2025

Description

This PR adds a `preserve_row_count` option to `map_batches`. When `preserve_row_count` is true, the optimizer can push a downstream limit operator through this `map_batches` call, so fewer rows reach the UDF.

Note: `map_groups` is built on `map_batches`, but limit pushdown support for `map_groups` is out of scope for this PR, so `preserve_row_count` is set to false for it.
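
To illustrate why the flag matters, here is a minimal plain-Python sketch (no Ray involved) that compares "map, then limit" with "limit, then map". The helper names `map_then_limit`, `limit_then_map`, `add_one`, and `keep_even` are hypothetical and only stand in for the two plan shapes and two kinds of UDF; it assumes the UDF treats each row independently.

```python
from typing import Iterator, List

Batch = List[int]


def batches(rows: List[int], batch_size: int) -> Iterator[Batch]:
    """Split rows into fixed-size batches (the last one may be shorter)."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def map_then_limit(rows, udf, k, batch_size=4):
    """Reference plan: run the UDF on every batch, then keep the first k rows."""
    out = [row for batch in batches(rows, batch_size) for row in udf(batch)]
    return out[:k]


def limit_then_map(rows, udf, k, batch_size=4):
    """Pushed-down plan: keep the first k input rows, then run the UDF."""
    return [row for batch in batches(rows[:k], batch_size) for row in udf(batch)]


def add_one(batch: Batch) -> Batch:
    # One output row per input row, so the row count is preserved.
    return [x + 1 for x in batch]


def keep_even(batch: Batch) -> Batch:
    # Drops rows, so the output count can differ from the input count.
    return [x for x in batch if x % 2 == 0]


rows = list(range(10))

# Row-preserving UDF: both plans agree, and the pushed-down plan feeds
# only 3 rows to the UDF instead of 10.
same = map_then_limit(rows, add_one, 3) == limit_then_map(rows, add_one, 3)

# Filtering UDF: the plans disagree, so pushdown would change the result.
differs = map_then_limit(rows, keep_even, 3) != limit_then_map(rows, keep_even, 3)
```

With `keep_even`, limiting first yields only two rows where the reference plan yields three, which is why the flag should stay false for any UDF that drops or duplicates rows.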

Related issues

Additional information

Signed-off-by: You-Cheng Lin <[email protected]>
@owenowenisme added the "go" label (add ONLY when ready to merge, run all tests) Oct 18, 2025
@owenowenisme changed the title from "[Data map_batches support limit_pushdown" to "[Data] map_batches support limit_pushdown" Oct 18, 2025
@owenowenisme marked this pull request as ready for review October 20, 2025 05:53
@owenowenisme requested a review from a team as a code owner October 20, 2025 05:53
@ray-gardener (bot) added the "data" label (Ray Data-related issues) Oct 20, 2025
fn_constructor_kwargs: Optional[Dict[str, Any]] = None,
min_rows_per_bundled_input: Optional[int] = None,
compute: Optional[ComputeStrategy] = None,
preserve_row_count: bool = False,

Contributor commented:

preserves_row_count

assert result_with == expected


def test_limit_pushdown_preserve_row_count_with_map_batches(

Contributor commented:

Can we make this a parameterized test?
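
One possible shape for such a test is sketched below. In a pytest suite this would typically use `pytest.mark.parametrize`; the sketch uses the standard library's `unittest.subTest` only to stay dependency-free. The UDF `double`, the helper `apply_in_batches`, and the case values are hypothetical and do not come from the PR's test file.

```python
import unittest


def double(batch):
    # Row-count-preserving stand-in UDF: one output per input row.
    return [x * 2 for x in batch]


def apply_in_batches(rows, udf, batch_size):
    """Apply the UDF batch by batch and concatenate the outputs."""
    out = []
    for start in range(0, len(rows), batch_size):
        out.extend(udf(rows[start:start + batch_size]))
    return out


class LimitPushdownEquivalence(unittest.TestCase):
    # (num_rows, batch_size, limit) cases; illustrative values only.
    CASES = [
        (10, 4, 3),
        (10, 4, 10),
        (10, 3, 0),
        (5, 8, 7),  # limit larger than the dataset
    ]

    def test_limit_commutes_with_row_preserving_udf(self):
        for num_rows, batch_size, limit in self.CASES:
            with self.subTest(num_rows=num_rows, batch_size=batch_size, limit=limit):
                rows = list(range(num_rows))
                # Reference: map everything, then limit.
                expected = apply_in_batches(rows, double, batch_size)[:limit]
                # Pushed down: limit the input, then map.
                pushed = apply_in_batches(rows[:limit], double, batch_size)
                self.assertEqual(pushed, expected)
```

Each tuple in `CASES` runs as its own sub-test, so a failure reports the exact parameters that broke the equivalence.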

worker.
memory: The heap memory in bytes to reserve for each parallel map worker.
concurrency: This argument is deprecated. Use ``compute`` argument.
preserve_row_count: Set to True only if the UDF always emits the same number of records it receives (no drops or duplicates). When true, the optimizer can push downstream limits past this transform for better pruning.

Contributor commented:

For the 2nd sentence: When set to True, the logical optimizer, in the presence of a limit(limit=k), will only scan k rows prior to executing the UDF, thereby saving on compute resources.
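
The plan rewrite described here can be sketched with a toy logical plan. The classes `Node`, `Read`, `MapBatches`, and `Limit` and the function `push_limit_down` below are hypothetical stand-ins, not Ray Data's actual operator or rule classes; the sketch only shows the idea of moving a limit below maps that are flagged as row-count preserving.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Node:
    """Hypothetical logical-plan node; `child` is the upstream operator."""
    name: str
    child: Optional["Node"] = None


@dataclass
class Read(Node):
    pass


@dataclass
class MapBatches(Node):
    preserve_row_count: bool = False


@dataclass
class Limit(Node):
    limit: int = 0


def push_limit_down(plan: Optional[Node]) -> Optional[Node]:
    """Move each Limit below directly upstream row-count-preserving maps."""
    if plan is None:
        return None
    if (
        isinstance(plan, Limit)
        and isinstance(plan.child, MapBatches)
        and plan.child.preserve_row_count
    ):
        upstream_map = plan.child
        # Swap the two operators, then keep pushing the limit further down.
        pushed = Limit(plan.name, upstream_map.child, plan.limit)
        return MapBatches(upstream_map.name, push_limit_down(pushed), True)
    # Otherwise leave this operator in place and optimize its input.
    return replace(plan, child=push_limit_down(plan.child))


def describe(plan: Optional[Node]) -> str:
    """Render the plan from the final operator back to the source."""
    parts = []
    while plan is not None:
        parts.append(plan.name)
        plan = plan.child
    return " <- ".join(parts)


plan = Limit(
    "limit[5]",
    MapBatches("udf_b", MapBatches("udf_a", Read("read"), False), True),
    5,
)
optimized = push_limit_down(plan)
```

In this example the limit moves below `udf_b`, which is flagged, and stops above `udf_a`, which is not, so `udf_b` would see at most 5 rows while `udf_a` still processes its full input.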

@goutamvenkat-anyscale (Contributor) left a comment

LGTM. Just a few cosmetic changes.

Signed-off-by: You-Cheng Lin <[email protected]>
cursor[bot]

This comment was marked as outdated.

@owenowenisme force-pushed the data/map-batches-support-limit branch from 2bc26a1 to 93e8111 October 24, 2025 04:26
Signed-off-by: You-Cheng Lin <[email protected]>
@alexeykudinkin merged commit 978ca10 into ray-project:master Oct 24, 2025
6 checks passed
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
## Description

This PR adds a `preserve_row_count` option to `map_batches`. When
`preserve_row_count` is true, the limit operator can be pushed down
through this `map_batches` call for optimization.

Note: `map_groups` is built on `map_batches`, but limit pushdown
support for `map_groups` is out of scope for this PR, so
`preserve_row_count` is set to false for it.

## Related issues

## Additional information

---------

Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Co-authored-by: You-Cheng Lin <[email protected]>
Signed-off-by: xgui <[email protected]>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Aydin Abiar <[email protected]>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: Future-Outlier <[email protected]>
Labels

data (Ray Data-related issues)
go (add ONLY when ready to merge, run all tests)
