@SubhamSinghal

What changes were proposed in this pull request?

This PR adds a new coalesce API to Dataset: users can specify a custom coalescer that reduces an input Dataset to fewer partitions. The coalescer implementation is the same as the one in RDD#coalesce, added in #11865 (SPARK-14042).

This is a rework of #18861.

How was this patch tested?

Added tests in DatasetSuite.

@SubhamSinghal changed the title from "[SPARK-19426][SQL][WIP] Custom coalescer for Dataset" to "[SPARK-19426][SQL] Custom coalescer for Dataset" on May 13, 2024
@hvanhovell
Contributor

Can you walk me through the actual use case for this? Coalesce has historically been incredibly hard to use for most end users, so before adding this I'd like to understand why it is needed.

@SubhamSinghal
Author

SubhamSinghal commented May 14, 2024

Coalesce does not enforce a uniform data distribution across partitions. We would like to pass a custom size-based coalescer to get a more uniform distribution. In some places this would let us avoid repartition and the shuffle it incurs.
Custom coalescer support is already available for RDDs, and it would be good to have it for DataFrames as well.
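
As a rough illustration of the idea (not code from this PR, and not Spark's actual coalescer logic), the sketch below compares grouping partitions into contiguous runs with a size-aware greedy grouping. It is plain Python with no Spark dependency; the function names and the assumption that per-partition sizes are known up front are hypothetical and exist only for this example.

```python
from typing import List


def contiguous_groups(sizes: List[int], max_groups: int) -> List[List[int]]:
    """Split partition indices into contiguous runs of near-equal count.

    Sizes are ignored, so one group can end up much heavier than another.
    This is a simplified stand-in, not Spark's default coalescer.
    """
    n = len(sizes)
    k = max(1, min(max_groups, n))
    return [list(range(g * n // k, (g + 1) * n // k)) for g in range(k)]


def size_based_groups(sizes: List[int], max_groups: int) -> List[List[int]]:
    """Greedy size-aware grouping (hypothetical): take partitions from
    largest to smallest and put each into the currently lightest group."""
    k = max(1, min(max_groups, len(sizes)))
    groups: List[List[int]] = [[] for _ in range(k)]
    loads = [0] * k
    by_size_desc = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for idx in by_size_desc:
        lightest = min(range(k), key=lambda g: loads[g])
        groups[lightest].append(idx)
        loads[lightest] += sizes[idx]
    return groups


def group_loads(sizes: List[int], groups: List[List[int]]) -> List[int]:
    """Total size assigned to each group."""
    return [sum(sizes[i] for i in group) for group in groups]


if __name__ == "__main__":
    # Assumed per-partition sizes (e.g. in MB), skewed toward the first two.
    sizes = [100, 90, 10, 10, 10, 10]
    print(group_loads(sizes, contiguous_groups(sizes, 2)))
    print(group_loads(sizes, size_based_groups(sizes, 2)))
```

In both cases partitions are only merged into groups, never split or re-hashed, which is why a coalescer can balance output without a shuffle. How partition sizes would actually be obtained inside a coalescer is left open here.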

@SubhamSinghal
Author

SubhamSinghal commented May 21, 2024

@hvanhovell would you be able to review this, or tag the relevant folks?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Aug 30, 2024
@github-actions github-actions bot closed this Aug 31, 2024