Skip to content

Fix merge join plan generation for scenarios of multi partitions and grouped-execution#17699

Closed
kewang1024 wants to merge 2 commits intoprestodb:masterfrom
kewang1024:fix_merge_join
Closed

Fix merge join plan generation for scenarios of multi partitions and grouped-execution#17699
kewang1024 wants to merge 2 commits intoprestodb:masterfrom
kewang1024:fix_merge_join

Conversation

@kewang1024
Copy link
Copy Markdown
Collaborator

@kewang1024 kewang1024 commented Apr 27, 2022

== NO RELEASE NOTE ==

This PR fixes three issues

  1. When grouped-execution is not enabled, merge join shouldn't be enabled either.
  • We fix it by adding session property check before generating the merge join node
  • During fragmenting phase, we check if merge join node and its children nodes are capable of grouped execution
  1. When join keys don't contain partition key and we query multiple partitions of data, we can't enable merge join.
    We fix it by adding stream property check

  2. When grouped execution is enabled, its GroupedExecutionProperties is missing when we generate Merge join node
    We fix it by adding the visitMergeJoin function in GroupedExecutionTagger

Next step

  1. Currently Merge join feature only supports cases where the sorting order is ASC and in case of multiple keys, the order of the keys is the same as in "criteria", next step would be introducing additional fields in MergeJoinNode to support more scenarios

@kewang1024 kewang1024 requested a review from yuanzhanhku April 29, 2022 16:43
@kewang1024 kewang1024 requested a review from a team as a code owner May 2, 2022 17:34
@kewang1024 kewang1024 force-pushed the fix_merge_join branch 3 times, most recently from 77c8f9b to 5a86d2e Compare May 5, 2022 23:30
@kewang1024 kewang1024 changed the title Fix merge join Fix merge join plan generation for multi partitions and grouped-execution May 5, 2022
@kewang1024 kewang1024 changed the title Fix merge join plan generation for multi partitions and grouped-execution Fix merge join plan generation for scenarios of multi partitions and grouped-execution May 5, 2022
@prestodb prestodb deleted a comment from yuanzhanhku May 10, 2022
1. Merge join requires grouped execution, thus add check if session property of
grouped_execution is turned on before generating MergeJoinNode
2. Merge join also requires data to be sharded between splits/files, thus add check of
left and right side stream properties
visitMergeJoin function was missing from GroupedExecutionTagger which makes it
impossible to obtain the grouped execution property for MergeJoinNode
@rschlussel
Copy link
Copy Markdown
Contributor

Can you explain why grouped execution is needed for merge join? I thought we just needed sorted input? Also, does that mean the table needs to be bucketed too?

@yuanzhanhku
Copy link
Copy Markdown
Contributor

Can you explain why grouped execution is needed for merge join? I thought we just needed sorted input? Also, does that mean the table needs to be bucketed too?

The current implementation of Merge Join only works for bucketed and sorted tables. We need to make sure the two join inputs have the same key spaces to make the merge join work.

@kewang1024 kewang1024 closed this May 14, 2022
@kewang1024 kewang1024 deleted the fix_merge_join branch March 12, 2023 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants