Skip to content

Fix merge join#17673

Closed
kewang1024 wants to merge 1 commit intoprestodb:masterfrom
kewang1024:merge_join_fix
Closed

Fix merge join#17673
kewang1024 wants to merge 1 commit intoprestodb:masterfrom
kewang1024:merge_join_fix

Conversation

@kewang1024
Copy link
Copy Markdown
Collaborator

@kewang1024 kewang1024 commented Apr 20, 2022

Test plan - (Please fill in how you tested your changes)
Unit tests

== NO RELEASE NOTE ==

depends on fix from #17458 for the unit test to pass

@kewang1024 kewang1024 marked this pull request as ready for review April 20, 2022 23:00
@kewang1024 kewang1024 requested a review from yuanzhanhku April 20, 2022 23:00
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else is not needed as the if above returns.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else is not needed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could partitioningColumns be empty?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also address Masha's comment in the merge join PR by adding comments about these assumptions to the merge join node:
https://github.com/facebookexternal/presto_cpp/pull/692

MergeJoinNode seems to be missing information about the sort order of the join keys. Is there an assumption that the sorting order is ASC and in case of multiple keys the order of the keys is the same as in "criteria"? 

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are cases where merge join can be planned but they cannot pass this check.
For example, if the two tables are both sorted by [A, B], and we join them on [B, A]. In this case, merge join should work.
The order in the leftJoinColumns and buildHashVariables doesn't matter.
We may need to add another field preSortedColumns to the merge join node(as Masha suggested) to indicate which columns are pre-sorted. The order of these columns could be different from the join key orders. In the above example, the join keys are [B, A], the preSortedColumns are [A, B].
Adding this field will also make it possible to support partial merge joins.

This can be done in a separate PR as it is a new feature.

Copy link
Copy Markdown
Collaborator Author

@kewang1024 kewang1024 Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed this comment by adding new logic; Can you help review if the new logic makes sense?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GroupedExecutionEnabled session property is just a hint right? Are there cases where it is set to true but group execution is not enabled?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not be supported at execution time. Currently we don't have a preSortedColumns to indicate the sort by columns. We are using the order of the join keys as the order of the sorted keys at the execution time. This plan means the left table is order by nationkey, custkey which is incorrect.

We need to either add a PreSortedColumns field or disable this case(join order doesn't match sort by column order) for now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I made this comment in the test file, this is actually about the new logic.

Copy link
Copy Markdown
Collaborator Author

@kewang1024 kewang1024 Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry, I only added the new verification logic and didn't add the PreSortedColumns because I wanted to discuss with you about the official format in our 1-1

But just add two fields in the PR as initial plan, let me know what you think

@yuanzhanhku yuanzhanhku requested a review from mbasmanova April 26, 2022 01:17
@mbasmanova
Copy link
Copy Markdown
Contributor

@kewang1024 Ke, would you update commit message and PR description to explain the problem you are fixing and the solution you chose.

@kewang1024
Copy link
Copy Markdown
Collaborator Author

@yuanzhanhku (the PR is not finalized) Can you help review if the new logic makes sense

@kewang1024 kewang1024 marked this pull request as draft April 26, 2022 01:22
@JsonProperty ("left") PlanNode left,
@JsonProperty ("right") PlanNode right,
@JsonProperty ("criteria") List<JoinNode.EquiJoinClause> criteria,
@JsonProperty ("leftSortingProperties") List<SortingProperty<VariableReferenceExpression>> leftSortingProperties,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, let's separate this into two PRs:
One PR to fix the correctness issues. Make sure we only plan merge join for the supported use cases. Disable the merge join planning for cases like the join column order doesn't match sort column order and the cases with DESC orders. This PR should not introduce any new fields so it doesn't require CPP side changes. This can unblock the CPP benchmark.

Another PR to support more use cases as mentioned above by adding these additional fields. These are nice-to-have features(low priority) and are not blockers for the benchmark. It requires CPP side change to update the protocols.

@kewang1024 kewang1024 closed this May 14, 2022
@kewang1024 kewang1024 deleted the merge_join_fix branch March 12, 2023 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants