Skip to content

Choose broadcast join when one side is small the other side is unknown#19978

Merged
feilong-liu merged 1 commit intoprestodb:masterfrom
feilong-liu:join_bcast
Jul 3, 2023
Merged

Choose broadcast join when one side is small the other side is unknown#19978
feilong-liu merged 1 commit intoprestodb:masterfrom
feilong-liu:join_bcast

Conversation

@feilong-liu
Copy link
Contributor

In current cost based join type check, when the estimated size of one side of the join is small, i.e. within broadcast limit, but the other side is unknown, we will end up with partitioned join following syntactic order. This PR adds an option to choose broadcast join with the smaller side to be build input.

It's controlled by a session parameter which is default to false.

Test plan - (Please fill in how you tested your changes)

Test locally end to end.

== RELEASE NOTES ==

General Changes
* Add a session parameter `use_broadcast_when_buildsize_small_probeside_unknown` to choose join distribution type
   This session is default to false. When enabled, broadcast join will be chosen when one side of input is within broadcast limit and the other side is unknow.

Copy link
Contributor

@kaikalur kaikalur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add some tests with real query or at least plantest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking of default to false, and roll out gradually.

@feilong-liu
Copy link
Contributor Author

Also add some tests with real query or at least plantest?

Add plan test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different from sizeBasedJoin here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size based join only considers the size of input table, even when the estimation of immediate probe/build input is available.

For example, if I have a query like
with B as (select * from t1 join t2 using (key)) select * from A Join B using (key),
if we have estimation of A unknown, but estimation of B is small, we will not have broadcast with size based join. As in size based join, the size of t1 and t2 considered not representative of size of B after join operation. This is one of the case this PR is trying to solve.

I also thought about patching to the size based join, however, the size based join produces query plans for cases which I do not need here, and I will still need to have a separate session parameter inside the size based join implementation to control it. And it also makes the logic of size based join more complex. Hence I chose to have it as a separate part here.

@feilong-liu feilong-liu requested a review from pranjalssh June 27, 2023 00:03
Copy link
Contributor

@pranjalssh pranjalssh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good, but the only thing I don't like about this is we have too many session parameters now.

Ideally, this should be enabled by default - and all breaking queries should override necesarry params. But migration is hard. For example, size-based-join is disabled in our clusters as well.

Can we rename this to something like experimental-join-distribution-type-enabled, so its clear we intend to fully roll this out and then eventually remove this?

@pranjalssh pranjalssh self-requested a review June 28, 2023 03:37
Copy link
Contributor

@pranjalssh pranjalssh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepting in advance

@feilong-liu
Copy link
Contributor Author

It looks good, but the only thing I don't like about this is we have too many session parameters now.

Ideally, this should be enabled by default - and all breaking queries should override necesarry params. But migration is hard. For example, size-based-join is disabled in our clusters as well.

Can we rename this to something like experimental-join-distribution-type-enabled, so its clear we intend to fully roll this out and then eventually remove this?

Do you mean to change the session param name to something like "experiment_broadcast_when_buildsize_small_probeside_unknown". Will it be better to add a TODO comment than changing the name? I feel that this naming could confuse users. But definitely agree that we need to simplify our session params, and will make it a default behavior after fully rolling it out.

@feilong-liu feilong-liu force-pushed the join_bcast branch 2 times, most recently from c4c1a63 to 13d4275 Compare July 3, 2023 17:15
In current cost based join type check, when the estimated size of one
side of the join is small, i.e. within broadcast limit, but the other
side is unknown, we will end up with partitioned join following syntactic
order. This PR adds an option to choose broadcast join with the smaller
side to be build input.

It's controlled by a session parameter which is default to false.
@feilong-liu feilong-liu merged commit 56b3dfb into prestodb:master Jul 3, 2023
@feilong-liu feilong-liu deleted the join_bcast branch July 3, 2023 18:33
@wanglinsong wanglinsong mentioned this pull request Jul 27, 2023
28 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants