[SPARK-19426][SQL] Custom coalescer for Dataset #46541

SubhamSinghal · 2024-05-12T14:36:15Z

What changes were proposed in this pull request?

This PR adds a new API for coalesce in Dataset: users can specify a custom coalescer that reduces an input Dataset into fewer partitions. The coalescer implementation is the same as the one in RDD#coalesce added in #11865 (SPARK-14042).

This is a rework of #18861.

How was this patch tested?

Added tests in DatasetSuite.

hvanhovell · 2024-05-14T14:07:43Z

Can you walk me through the actual use case for this? Coalesce has historically been incredibly hard for most end users to use, so before adding this I'd like to understand why.

SubhamSinghal · 2024-05-14T15:59:20Z

Coalesce does not enforce a uniform data distribution across partitions. We would like to pass a custom size-based coalescer to get a more uniform distribution, which would let us avoid repartition and its shuffle in some places.
Custom coalescer support is already available in RDD, and it would be good to have it in DataFrame as well.
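
For illustration only, here is a minimal Python sketch of the size-based grouping idea described above. It is not the code from this PR and does not use any Spark API: the function name `group_partitions_by_size` and the `sizes` mapping (partition id to estimated bytes) are hypothetical. It uses a greedy heuristic, assigning each partition, largest first, to the group with the smallest running total, so group sizes come out more even than an assignment that ignores sizes.

```python
import heapq
from typing import Dict, List


def group_partitions_by_size(sizes: Dict[int, int], max_groups: int) -> List[List[int]]:
    """Group partition ids into at most `max_groups` groups of similar total size.

    `sizes` maps a (hypothetical) partition id to its estimated size in bytes.
    Greedy heuristic: take partitions largest first and put each one into the
    group whose running total is currently the smallest.
    """
    if max_groups <= 0:
        raise ValueError("max_groups must be positive")

    num_groups = min(max_groups, len(sizes))
    if num_groups == 0:
        return []

    groups: List[List[int]] = [[] for _ in range(num_groups)]
    # Min-heap of (running_total_bytes, group_index).
    heap = [(0, g) for g in range(num_groups)]

    # Sort by size descending; break ties by partition id for determinism.
    for pid, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, g = heapq.heappop(heap)
        groups[g].append(pid)
        heapq.heappush(heap, (total + size, g))

    return groups


if __name__ == "__main__":
    example_sizes = {0: 100, 1: 10, 2: 90, 3: 20, 4: 80}
    for group in group_partitions_by_size(example_sizes, max_groups=2):
        print(group, sum(example_sizes[p] for p in group))
```

In a real coalescer, each resulting group would become one output partition that reads its parent partitions directly, so no shuffle is needed. The sketch ignores data locality, which an actual implementation would likely also want to consider.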

SubhamSinghal · 2024-05-21T05:54:49Z

@hvanhovell, could you review this or tag the relevant folks?

github-actions · 2024-08-30T00:22:14Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

subham611 added 2 commits May 12, 2024 10:29

Add support for custom partitionCoalescer

bb43535

fix uts

db71631

github-actions bot added SQL CORE PYTHON labels May 12, 2024

subham611 added 6 commits May 12, 2024 20:27

fix scala lint

e8f089d

Fix lint issue

d52473d

Fix UT

ba3c963

Fix UT

192f2f7

Adds UT in CollapseRepartitionSuite

fa6ed6f

Fix lint

f3ebeb2

SubhamSinghal changed the title ~~[SPARK-19426][SQL][WIP] Custom coalescer for Dataset~~ [SPARK-19426][SQL] Custom coalescer for Dataset May 13, 2024

github-actions bot added the Stale label Aug 30, 2024

github-actions bot closed this Aug 31, 2024


3 participants
