-
Notifications
You must be signed in to change notification settings - Fork 29k
SPARK-22345: Fix sort-merge joins with conditions and codegen. #19568
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,7 +18,8 @@ | |
| package org.apache.spark.sql.execution.joins | ||
|
|
||
| import org.apache.spark.sql.{DataFrame, Row} | ||
| import org.apache.spark.sql.catalyst.expressions.Expression | ||
| import org.apache.spark.sql.catalyst.expressions.{And, BinaryExpression, Expression, Predicate} | ||
| import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback | ||
| import org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys | ||
| import org.apache.spark.sql.catalyst.plans.Inner | ||
| import org.apache.spark.sql.catalyst.plans.logical.Join | ||
|
|
@@ -124,7 +125,8 @@ class InnerJoinSuite extends SparkPlanTest with SharedSQLContext { | |
| rightPlan: SparkPlan) = { | ||
| val sortMergeJoin = joins.SortMergeJoinExec(leftKeys, rightKeys, Inner, boundCondition, | ||
| leftPlan, rightPlan) | ||
| EnsureRequirements(spark.sessionState.conf).apply(sortMergeJoin) | ||
| EnsureRequirements(spark.sessionState.conf) | ||
| .apply(ProjectExec(sortMergeJoin.output, sortMergeJoin)) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why we need to change this?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In 2.1.1, an extra project causes I guess I could add a testing case to |
||
| } | ||
|
|
||
| test(s"$testName using BroadcastHashJoin (build=left)") { | ||
|
|
@@ -228,6 +230,27 @@ class InnerJoinSuite extends SparkPlanTest with SharedSQLContext { | |
| ) | ||
| ) | ||
|
|
||
| testInnerJoin( | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It still can pass without the changes in this PR. What is the purpose of this test case?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This test fails in 2.1.1 and versions before 6b6dd68. I'm not sure how to exercise the code generated by CodegenFallback with that fix, but this test is valid for the 2.1.1 branch. |
||
| "inner join with CodegenFallback filter", | ||
| myUpperCaseData, | ||
| myLowerCaseData, | ||
| () => { | ||
| // add a second equality check that is implemented with a CodegenFallback | ||
| // this expression is in the test so that no one implements codegen for it | ||
| And( | ||
| (myUpperCaseData.col("N") === myLowerCaseData.col("n")).expr, | ||
| EqNoCodegen( | ||
| org.apache.spark.sql.functions.lower(myUpperCaseData.col("L")).expr, | ||
| myLowerCaseData.col("l").expr)) | ||
| }, | ||
| Seq( | ||
| (1, "A", 1, "a"), | ||
| (2, "B", 2, "b"), | ||
| (3, "C", 3, "c"), | ||
| (4, "D", 4, "d") | ||
| ) | ||
| ) | ||
|
|
||
| { | ||
| lazy val left = myTestData1.where("a = 1") | ||
| lazy val right = myTestData2.where("a = 1") | ||
|
|
@@ -287,3 +310,10 @@ class InnerJoinSuite extends SparkPlanTest with SharedSQLContext { | |
| (Row(2, 2), "L2", Row(2, 2), "R2"))) | ||
| } | ||
| } | ||
|
|
||
| case class EqNoCodegen(left: Expression, right: Expression) extends BinaryExpression | ||
| with CodegenFallback with Serializable with Predicate { | ||
| override protected def nullSafeEval(left: Any, right: Any): Boolean = { | ||
| left == right | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we only need to do this when there is
CodegenFallbackin the condition expressions.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The joined row should always be used for correctness. We don't know what code the expression will generate, so we should plan on always passing the correct input row. Setting left and right on a joined row is a cheap operation, so I'd rather do it correctly than rely on something brittle like
isInstanceOf[CodegenFallback].Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It ended up being a bit more complicated. There are two problems (in 2.0.0 and 2.1.1 at least). The first is what this fixes, which is that the INPUT_ROW in the codegen context points to the wrong row. This is fixed and now has a test that fails if you uncomment the line that sets INPUT_ROW.
The second problem is in the check for
CodegenFallbackfails to check whether the condition supports codegen in some plans. To get the test to fail, I had to add a projection to exercise the path where this happens. I'll add a second commit for this problem.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second problem was fixed in this commit: 6b6dd68
I still think that the codegen problem should be fixed. Detecting
CodgenFallbackis imperfect, but will still generate code and run it. I think we should either remove codegen fromCodegenFallbackor add this fix to ensure that code works, even if we don't expect to run it.