[SPARK-31377][SQL][TEST] Added unit tests to 'number of output rows metric' for some joins in SQLMetricSuite #28330
…etric' for some joins in SQLMetricSuite
|
ok to test |
| } | ||
|
|
||
| test("ShuffledHashJoin(outer) metrics") { | ||
| withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "40", |
40 -> -1?
Setting it to -1 would trigger SortMergeJoin instead of ShuffledHashJoin based on the rules in Spark Strategies.
How about using a hint to control join physical plans?
Fixed in the latest commit.
|
|
||
| Seq(("right_outer", 0L, df1, df2, false), ("left_outer", 0L, df2, df1, false), | ||
| ("right_outer", 0L, df1, df2, true), ("left_outer", 0L, df2, df1, true)) | ||
| .foreach { case (joinType, nodeId, df1, df2, enableWholeStage) => |
nit: df1 -> leftDf and df2 -> rightDf
fixed in the latest commit
|
|
||
| test("ShuffledHashJoin(outer) metrics") { | ||
| withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "40", | ||
| SQLConf.SHUFFLE_PARTITIONS.key -> "2", |
Do we need to set this value for the test?
In my latest commit I combined three different tests into one. Without this parameter, the planner does not pick the intended join type under the rules in the Spark Strategies file.
ditto: #28330 (comment)
Fixed in the latest commit.
| val df = df2.join(df1.hint("shuffle_hash"), $"key" === $"key2", "left_semi") | ||
| testSparkPlanMetrics(df, 1, Map( | ||
| nodeId -> (("ShuffledHashJoin", Map( | ||
| "number of output rows" -> 2L)))), |
Do we need to split this join test into three parts? It seems only the metric value differs between them.
fixed in the latest commit
|
Test build #121822 has finished for PR 28330 at commit
|
|
Test build #121850 has finished for PR 28330 at commit
|
|
Test build #121849 has finished for PR 28330 at commit
|
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
| nodeId -> (("BroadcastHashJoin", Map( | ||
| "number of output rows" -> numRows)))), | ||
| enableWholeStage | ||
| ) |
nit: wrong indents
Fixed in latest commit
|
Test build #121909 has finished for PR 28330 at commit
|
|
Test build #121910 has finished for PR 28330 at commit
|
| nodeId3 -> (("Exchange", Map( | ||
| "shuffle records written" -> 10L, | ||
| "records read" -> 10L)))), | ||
| enableWholeStage |
unnecessary changes?
I think this indentation is correct. I know it is just a small cosmetic change and probably doesn't need to be included in this PR. I will remove it if you think this should not be there.
| } | ||
| } | ||
|
|
||
| test("ShuffledHashJoin(left,outer) metrics") { |
nit: (left, outer)
fixed in the latest commit.
| } | ||
|
|
||
| test("ShuffledHashJoin(left,outer) metrics") { | ||
| withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "2", |
Is this setting still needed, even though the hint is used?
You are right. I should have removed these in the last commit itself. Fixed in the latest commit.
| testSparkPlanMetrics(df, 2, Map( | ||
| nodeId -> (("BroadcastHashJoin", Map( | ||
| "number of output rows" -> numRows)))), | ||
| nodeId -> (("BroadcastHashJoin", Map("number of output rows" -> numRows)))), |
unnecessary changes?
Fixed in latest commit
| val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value") | ||
| val df2 = Seq((1, "1"), (2, "2"), (3, "3"), (4, "4")).toDF("key2", "value") | ||
| // Assume the execution plan is | ||
| // ... -> BroadcastHashJoin(nodeId = 1) |
Do we need this comment? I think the code below is clear without it.
Fixed in latest commit
|
Test build #122012 has finished for PR 28330 at commit
|
| "number of output rows" -> 12L)))), | ||
| enableWholeStage | ||
| ) | ||
| } |
nit: format(wrong indents)
Seq((leftQuery, false), (rightQuery, false), (leftQuery, true), (rightQuery, true))
  .foreach { case (query, enableWholeStage) =>
    val df = spark.sql(query)
    testSparkPlanMetrics(df, 2, Map(
      0L -> (("BroadcastNestedLoopJoin", Map(
        "number of output rows" -> 12L)))),
      enableWholeStage
    )
  }
Fixed in latest commit.
|
Test build #122087 has finished for PR 28330 at commit
|
|
@maropu Could you please let me know if any other changes are needed in this PR? If not, could you merge it? Thank you.
|
@maropu Could you please take a look at this when you have a moment?
| SQLConf.SHUFFLE_PARTITIONS.key -> "2", | ||
| SQLConf.PREFER_SORTMERGEJOIN.key -> "false") { | ||
| SQLConf.SHUFFLE_PARTITIONS.key -> "2", | ||
| SQLConf.PREFER_SORTMERGEJOIN.key -> "false") { |
nit: please avoid unnecessary changes where possible.
maropu
left a comment
|
retest this please |
|
Test build #123049 has finished for PR 28330 at commit
|
|
retest this please |
|
Test build #123057 has finished for PR 28330 at commit
|
|
Merged to master. |
What changes were proposed in this pull request?
Add unit tests for the 'number of output rows' metric for some join types in SQLMetricsSuite. A list of the unit tests added is as follows.
Why are the changes needed?
For some combinations of JoinType and Join algorithm there is no test coverage for the 'number of output rows' metric.
Does this PR introduce any user-facing change?
No
How was this patch tested?
I added debug statements in the code to ensure that the correct combination of JoinType and join algorithm is triggered.
I also used the IntelliJ debugger to verify the same.
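As a complement to the manual checks, the selected join operator can also be verified programmatically. The following is only a minimal sketch, not the code from this patch: it assumes Spark 3.0 or later on the classpath (where the `shuffle_hash` join hint is available), and the object name `JoinHintSketch`, the local session settings, and the sample data are illustrative. It requests a shuffled hash join through a hint, as suggested in the review, and then inspects the planned physical plan for a `ShuffledHashJoinExec` node.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.joins.ShuffledHashJoinExec

// Hypothetical standalone object; not part of SQLMetricsSuite.
object JoinHintSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("join-hint-sketch")
      // Assumption: adaptive execution is turned off so that the planned
      // join operator is not wrapped in an adaptive plan node.
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    try {
      val leftDf = Seq((1, "1"), (2, "2")).toDF("key", "value")
      val rightDf = Seq((1, "1"), (2, "2"), (3, "3"), (4, "4")).toDF("key2", "value2")

      // The hint asks the planner for a shuffled hash join directly, instead of
      // steering it indirectly through broadcast or shuffle-partition settings.
      val joined = leftDf.join(rightDf.hint("shuffle_hash"), $"key" === $"key2", "inner")

      // Collect every ShuffledHashJoinExec node from the planned physical plan.
      val shuffledHashJoins = joined.queryExecution.sparkPlan.collect {
        case j: ShuffledHashJoinExec => j
      }
      assert(shuffledHashJoins.nonEmpty,
        s"expected a ShuffledHashJoin in:\n${joined.queryExecution.sparkPlan}")
    } finally {
      spark.stop()
    }
  }
}
```

A check of this kind fails as soon as the planner picks a different operator, so it does not depend on reading debug output.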