[SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join #41398

Closed

szehon-ho
Contributor

@szehon-ho szehon-ho commented May 31, 2023

What changes were proposed in this pull request?

Add support for shuffled hash join in the following scenarios:

  • left outer join with left-side build
  • right outer join with right-side build

The algorithm is similar to SPARK-32399, which supports shuffle-hash join for full outer join.

The existing methods fullOuterJoinWithUniqueKey and fullOuterJoinWithNonUniqueKey are extended to support the new cases. They are called after the HashedRelation has been built from the build side, and perform two iterations:

  1. Iterate over the stream side.
    a. If a match is found on the build side, mark it.
    b. If there is no match on the build side, join with a null build-side row and add to the result.
  2. Iterate over the build side.
    a. If the row was marked as matched, add the joined row to the result.
    b. If the row was not marked, join with a null stream-side row.

Left outer join with left-side build and right outer join with right-side build need only a subset of this logic: step 1b above becomes a no-op.
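
The two iterations can be sketched in plain Scala, without any Spark dependency. This is a toy model rather than the code in this PR: the `Row` alias, the `leftOuterBuildLeft` name, and the use of a `Map` plus a boolean array in place of HashedRelation and its match markers are all assumptions made for illustration. The sketch also emits each matched pair during the stream pass, which is a simplification chosen here; the resulting set of rows is the same either way.

```scala
// Toy model of a left outer join that builds its hash table on the left side.
// Not Spark's implementation; all names here are hypothetical.
object BuildSideOuterJoinSketch {
  // Hypothetical row type: (join key, payload).
  type Row = (Int, String)

  def leftOuterBuildLeft(build: Seq[Row], stream: Seq[Row]): Seq[(Row, Option[Row])] = {
    val b = build.toVector
    // Stand-in for HashedRelation: key -> indices of build rows with that key.
    val relation: Map[Int, Seq[Int]] = b.indices.groupBy(i => b(i)._1)
    // One "matched" flag per build row.
    val matched = Array.fill(b.length)(false)
    val out = Vector.newBuilder[(Row, Option[Row])]

    // Pass 1: iterate the stream (right) side.
    for (s <- stream; i <- relation.getOrElse(s._1, Seq.empty)) {
      matched(i) = true            // 1a: mark the build row
      out += ((b(i), Some(s)))     // assumption: emit the joined row right away
    }
    // 1b is a no-op: stream rows without a match are simply dropped.

    // Pass 2: iterate the build (left) side; unmarked rows get a null stream side.
    for (i <- b.indices if !matched(i)) out += ((b(i), None))

    out.result()
  }
}
```

With build rows (1,a), (2,b), (3,c) and stream rows (1,x), (1,y), (4,z), the sketch returns two joined rows for key 1 followed by (2,b) and (3,c) padded with `None`; the stream row with key 4 does not appear, which is what makes step 1b a no-op for this join type.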

Codegen is left for a follow-up PR.

Why are the changes needed?

For these join types, shuffled hash join can outperform sort-merge join, especially when the larger table is very large, because it skips the expensive sort of that table.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test in JoinSuite.scala

@github-actions github-actions bot added the SQL label May 31, 2023
@szehon-ho szehon-ho closed this May 31, 2023
@szehon-ho szehon-ho reopened this May 31, 2023
@szehon-ho szehon-ho force-pushed the same_side_outer_build_join_master branch from 9d44062 to ea32dff Compare May 31, 2023 03:53
@dongjoon-hyun
Member

Thank you for making a PR, @szehon-ho .

@szehon-ho szehon-ho force-pushed the same_side_outer_build_join_master branch from 937f1ee to 532964b Compare May 31, 2023 04:52
@dongjoon-hyun
Member

Also, cc @viirya , @huaxingao , @sunchao , too.

@szehon-ho szehon-ho force-pushed the same_side_outer_build_join_master branch from dbd8960 to 6089beb Compare May 31, 2023 16:00
@huaxingao
Contributor

@szehon-ho Thanks for the PR! The change looks reasonable to me. I have left a few minor comments.

@cloud-fan
Contributor

cc @maryannxue

@@ -57,6 +57,8 @@ case class ShuffledHashJoinExec(

override def outputOrdering: Seq[SortOrder] = joinType match {
case FullOuter => Nil
case LeftOuter if buildSide == BuildLeft => Nil
case RightOuter if buildSide == BuildRight => Nil
Contributor

can we add some comments to explain why the ordering can't be preserved?

Contributor Author

Sure, added a comment as per my understanding (please let me know if I misunderstand something)

My thought was, because the second iteration on the build-side (for outer-join semantic) is on a hashedRelation, the result cannot be in order.
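
A small sketch of that reasoning, reduced to join keys only. It is an illustration under stated assumptions, not Spark code: `outputKeys` is a hypothetical helper, build keys are assumed unique, and a `mutable.HashMap` stands in for the hashed relation.

```scala
import scala.collection.mutable

// Toy illustration of why the stream-side ordering is lost once build rows
// are appended in a second pass. All names are hypothetical.
object OrderingSketch {
  // Returns (build key, matching stream key or None) in emission order.
  def outputKeys(buildKeys: Seq[Int], sortedStreamKeys: Seq[Int]): Vector[(Int, Option[Int])] = {
    // key -> "has this build key been matched?"
    val matched = mutable.HashMap[Int, Boolean]()
    buildKeys.foreach(k => matched(k) = false)

    val out = Vector.newBuilder[(Int, Option[Int])]
    // Pass 1 follows the stream side, so these rows keep the stream ordering.
    for (s <- sortedStreamKeys if matched.contains(s)) {
      matched(s) = true
      out += ((s, Some(s)))
    }
    // Pass 2 walks the hash table; its iteration order is unrelated to the
    // stream ordering, and its rows are appended after everything from pass 1.
    for ((k, wasMatched) <- matched if !wasMatched) out += ((k, None))
    out.result()
  }
}
```

For build keys 0, 3, 5 and a sorted stream of 3, 5, 7, the emitted build keys are 3, 5, 0: the unmatched key 0 arrives last even though it sorts first, so no ordering can be reported for the output.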

@szehon-ho szehon-ho force-pushed the same_side_outer_build_join_master branch from 86c86d0 to 505e234 Compare June 1, 2023 15:25
@huaxingao
Contributor

LGTM

@huaxingao
Contributor

@szehon-ho Could you re-trigger the failed CI pipeline?

@szehon-ho szehon-ho closed this Jun 2, 2023
@szehon-ho szehon-ho reopened this Jun 2, 2023
@szehon-ho
Contributor Author

Yea I couldn't reproduce errors, trying again.

2023-06-01T19:34:30.5788895Z [error] org.apache.spark.sql.errors.QueryCompilationErrorsSuite
2023-06-01T19:34:30.5791234Z [error] org.apache.spark.sql.errors.QueryExecutionErrorsSuite

@@ -489,10 +489,16 @@ class JoinHintSuite extends PlanTest with SharedSparkSession with AdaptiveSparkP
assertShuffleHashJoin(
sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil)), BuildLeft)
assertShuffleHashJoin(
- sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil, "left")), BuildRight)
+ sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil, "left")), BuildLeft)
Member

Why change this?

Contributor Author

In this situation, t1 is smaller than t2, so it now picks t1. Before it was not possible to pick t1 and so t2 was picked.

Member

@viirya viirya Jun 3, 2023

Yes, I meant that the original test coverage (BuildRight) is removed and lost.
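
The exchange above can be modelled with a short sketch. It is a deliberately simplified stand-in for Spark's build-side selection, covering only the two outer join types discussed here; `eligibleSides`, `choose`, the `sameSideBuild` flag, and the tie-break toward the left side are assumptions for illustration, not Spark APIs.

```scala
// Toy model of picking a build side when both sides carry a SHUFFLE_HASH hint.
// Not Spark's planner code; every name below is hypothetical.
object BuildSideChoiceSketch {
  sealed trait Side
  case object BuildLeft extends Side
  case object BuildRight extends Side

  sealed trait OuterJoin
  case object LeftOuter extends OuterJoin
  case object RightOuter extends OuterJoin

  // sameSideBuild = false models the behaviour before this PR, true the behaviour after it.
  def eligibleSides(join: OuterJoin, sameSideBuild: Boolean): Set[Side] = join match {
    case LeftOuter  => if (sameSideBuild) Set(BuildLeft, BuildRight) else Set(BuildRight)
    case RightOuter => if (sameSideBuild) Set(BuildLeft, BuildRight) else Set(BuildLeft)
  }

  // Prefer the smaller side when it is eligible, otherwise fall back to the only legal side.
  def choose(join: OuterJoin, leftSize: Long, rightSize: Long, sameSideBuild: Boolean): Side = {
    val sides = eligibleSides(join, sameSideBuild)
    val smaller: Side = if (leftSize <= rightSize) BuildLeft else BuildRight // assumed tie-break
    if (sides.contains(smaller)) smaller else sides.head
  }
}
```

Under this model, a left outer join with a smaller left table moves from `BuildRight` to `BuildLeft` once same-side build is allowed, which matches the changed assertion. A case where the right table is the smaller one would still resolve to `BuildRight`, which is one way the original coverage could be kept.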

Comment on lines +1320 to +1323
val shjDF = df2.join(df1.hint("SHUFFLE_HASH"), joinExprs, "rightouter")
assert(collect(shjDF.queryExecution.executedPlan) {
case _: ShuffledHashJoinExec => true
}.size === 1)
Member

Do we need to verify build side of ShuffledHashJoinExec is BuildLeft here? Or hint is always working?

Member

@viirya viirya left a comment

Only two minor comments left, otherwise looks good to me.

@huaxingao
Contributor

> Only two minor comments left, otherwise looks good to me.

@viirya Thanks a lot for taking a look! Since these are minor comments for tests, I will merge this PR first, we will follow up after @szehon-ho comes back from vacation.

@huaxingao huaxingao closed this in 0effbec Jun 2, 2023
@huaxingao
Contributor

Merged to master. Thanks @szehon-ho et al.

@sunchao
Member

sunchao commented Jun 2, 2023

cc @c21 too

@szehon-ho
Contributor Author

Thanks everyone for the warm welcome to Spark, and really fast reviews!

As I'm out of town, I will look at any follow up improvements when I'm back.

czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
…join build right in shuffled hash join

Closes apache#41398 from szehon-ho/same_side_outer_build_join_master.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: huaxingao <[email protected]>
szehon-ho added a commit to szehon-ho/spark that referenced this pull request Jun 14, 2023
 ### What changes were proposed in this pull request?
Codegen of shuffled hash join of build side outer join (ie, left outer join build left or right outer join build right)

The implementation of apache#41398 was only for non-codegen version, and codegen was disabled in this scenario.

No

New unit test in WholeStageCodegenSuite
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…join build right in shuffled hash join

(cherry picked from commit 0effbec)
szehon-ho added a commit to szehon-ho/spark that referenced this pull request Jun 30, 2023
huaxingao pushed a commit that referenced this pull request Jul 1, 2023
### What changes were proposed in this pull request?
Codegen of shuffled hash join of build side outer join (ie, left outer join build left or right outer join build right)

 ### Why are the changes needed?
The implementation of #41398 was only for non-codegen version, and codegen was disabled in this scenario.

 ### Does this PR introduce _any_ user-facing change?
No

 ### How was this patch tested?
New unit test in WholeStageCodegenSuite

Closes #41614 from szehon-ho/same_side_outer_join_codegen_master.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: huaxingao <[email protected]>
cloud-fan added a commit that referenced this pull request Aug 29, 2024
…without codegen

### What changes were proposed in this pull request?

This is a resubmission of #43938 to fix a join correctness bug caused by #41398. Credits go to mcdull-zhang.

### Why are the changes needed?

correctness fix

### Does this PR introduce _any_ user-facing change?

Yes, the query result will be corrected.

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47905 from cloud-fan/join.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit that referenced this pull request Aug 29, 2024
…without codegen

(cherry picked from commit af5e0a2)
Signed-off-by: Wenchen Fan <[email protected]>
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
…without codegen

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…without codegen

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…without codegen

6 participants