[SPARK-31377][SQL][TEST] Added unit tests to 'number of output rows metric' for some joins in SQLMetricSuite #28330

sririshindra · 2020-04-24T19:40:48Z

What changes were proposed in this pull request?

Add unit tests for the 'number of output rows' metric for some join types in SQLMetricsSuite. The tests added are as follows.

ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi
BroadcastNestedLoopJoin: RightOuter
BroadcastHashJoin: LeftAnti
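
To make the expected values of the 'number of output rows' metric concrete, here is a minimal plain-Scala sketch (no Spark involved) that counts the rows each of the listed join types would emit for two small key sets. The object name, the function names, and the key sets are illustrative assumptions, not code from this PR.

```scala
// Plain-Scala model of join output row counts; it does not use Spark.
// All names and the sample keys are hypothetical, chosen for illustration.
object JoinRowCounts {
  val leftKeys: Seq[Int] = Seq(1, 2, 3, 4)
  val rightKeys: Seq[Int] = Seq(1, 2)

  // Number of rows on the other side whose key equals k.
  private def matches(k: Int, other: Seq[Int]): Int = other.count(_ == k)

  // Left outer: one row per match, or a single null-padded row when there is none.
  def leftOuter(l: Seq[Int], r: Seq[Int]): Long =
    l.map(k => math.max(1, matches(k, r))).sum.toLong

  // Right outer is a left outer join with the sides swapped.
  def rightOuter(l: Seq[Int], r: Seq[Int]): Long = leftOuter(r, l)

  // Left semi: each left row with at least one match is emitted once.
  def leftSemi(l: Seq[Int], r: Seq[Int]): Long =
    l.count(k => matches(k, r) > 0).toLong

  // Left anti: each left row without any match is emitted once.
  def leftAnti(l: Seq[Int], r: Seq[Int]): Long =
    l.count(k => matches(k, r) == 0).toLong
}
```

With these inputs the left semi count is 2, which agrees with the "number of output rows" -> 2L expectation in the left_semi snippet quoted later in this review.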

Why are the changes needed?

For some combinations of JoinType and join algorithm, there is no test coverage for the 'number of output rows' metric.

Does this PR introduce any user-facing change?

No

How was this patch tested?

I added debug statements in the code to ensure that the correct combination of JoinType and join algorithm is triggered.
I also used the IntelliJ debugger to verify the same.

maropu · 2020-04-26T00:42:31Z

ok to test

maropu · 2020-04-26T00:44:10Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

  }

+  test("ShuffledHashJoin(outer) metrics") {
+    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "40",


40 -> -1?

Setting it to -1 would trigger SortMergeJoin instead of ShuffledHashJoin, based on the rules in SparkStrategies.

How about using a hint to control join physical plans?

Fixed in the latest commit.
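
The hint-based approach suggested in this thread might look roughly like the sketch below. It assumes Spark 3.0 or later on the classpath; the object name, the DataFrame names, and the local session setup are hypothetical and are not the code that was committed. The sketch inspects `queryExecution.sparkPlan` to see whether the planner picked a `ShuffledHashJoinExec` node.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.joins.ShuffledHashJoinExec

// Hypothetical standalone sketch; assumes Spark 3.0+ is available.
object ShuffleHashHintSketch {
  // Returns (whether a ShuffledHashJoin was planned, number of result rows).
  def run(spark: SparkSession): (Boolean, Long) = {
    import spark.implicits._

    val small = Seq((1, "1"), (2, "2")).toDF("key", "value")
    val large = Seq((1, "1"), (2, "2"), (3, "3"), (4, "4")).toDF("key2", "value2")

    // The hint requests a shuffled hash join with `small` as the build side,
    // instead of steering the planner through the broadcast threshold.
    val joined = large.join(small.hint("shuffle_hash"), $"key" === $"key2", "left_semi")

    // sparkPlan is the physical plan produced by the planner strategies.
    val planned = joined.queryExecution.sparkPlan.collect {
      case j: ShuffledHashJoinExec => j
    }.nonEmpty

    (planned, joined.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shuffle-hash-hint-sketch")
      .getOrCreate()
    try {
      val (planned, rows) = run(spark)
      println(s"ShuffledHashJoin planned: $planned, rows: $rows")
    } finally {
      spark.stop()
    }
  }
}
```

Because the hint names the join strategy directly, a test written this way should not need to tune `AUTO_BROADCASTJOIN_THRESHOLD` just to reach the desired operator.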

maropu · 2020-04-26T00:45:40Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

+
+      Seq(("right_outer", 0L, df1, df2, false), ("left_outer", 0L, df2, df1, false),
+        ("right_outer", 0L, df1, df2, true), ("left_outer", 0L, df2, df1, true))
+        .foreach { case (joinType, nodeId, df1, df2, enableWholeStage) =>


nit: df1 -> leftDf and df2 -> rightDf

fixed in the latest commit

maropu · 2020-04-26T00:46:15Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala


+  test("ShuffledHashJoin(outer) metrics") {
+    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "40",
+      SQLConf.SHUFFLE_PARTITIONS.key -> "2",


We need to set this value for the test?

In my latest commit I combined three different tests into one. Without this parameter, the correct join type is not triggered, based on the rules in the SparkStrategies file.

ditto: #28330 (comment)

Fixed in the latest commit.

maropu · 2020-04-26T00:52:49Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

+        val df = df2.join(df1.hint("shuffle_hash"), $"key" === $"key2", "left_semi")
+        testSparkPlanMetrics(df, 1, Map(
+          nodeId -> (("ShuffledHashJoin", Map(
+            "number of output rows" -> 2L)))),


Do we need to split this join test into three parts? It seems that only the metric value differs between them.

fixed in the latest commit

SparkQA · 2020-04-26T05:21:46Z

Test build #121822 has finished for PR 28330 at commit 2af93ea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-26T21:03:18Z

Test build #121850 has finished for PR 28330 at commit 7565906.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-26T22:18:45Z

Test build #121849 has finished for PR 28330 at commit 0d16bdf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-04-27T00:10:25Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

+          nodeId -> (("BroadcastHashJoin", Map(
+            "number of output rows" -> numRows)))),
+          enableWholeStage
+        )


nit: wrong indents

Fixed in latest commit

SparkQA · 2020-04-27T20:04:14Z

Test build #121909 has finished for PR 28330 at commit 882ba72.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-27T20:27:27Z

Test build #121910 has finished for PR 28330 at commit 42d7ec4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-04-27T23:23:44Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

+            nodeId3 -> (("Exchange", Map(
+              "shuffle records written" -> 10L,
+              "records read" -> 10L)))),
+            enableWholeStage


unnecessary changes?

I think this indentation is correct. I know it is just a small cosmetic change and probably doesn't need to be included in this PR. I will remove it if you think this should not be there.

maropu · 2020-04-27T23:24:14Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

    }
  }

+  test("ShuffledHashJoin(left,outer) metrics") {


nit: (left, outer)

fixed in the latest commit.

maropu · 2020-04-27T23:25:19Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

  }

+  test("ShuffledHashJoin(left,outer) metrics") {
+    withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "2",


Is this setting still needed, even though the hint is used?

You are right. I should have removed these in the last commit itself. Fixed in the latest commit.

maropu · 2020-04-27T23:26:22Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

      testSparkPlanMetrics(df, 2, Map(
-        nodeId -> (("BroadcastHashJoin", Map(
-          "number of output rows" -> numRows)))),
+        nodeId -> (("BroadcastHashJoin", Map("number of output rows" -> numRows)))),


unnecessary changes?

Fixed in latest commit

maropu · 2020-04-27T23:27:55Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

+    val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
+    val df2 = Seq((1, "1"), (2, "2"), (3, "3"), (4, "4")).toDF("key2", "value")
+    // Assume the execution plan is
+    // ... -> BroadcastHashJoin(nodeId = 1)


Need this comment? I think the code below is clear without this comment.

Fixed in latest commit

SparkQA · 2020-04-28T22:38:01Z

Test build #122012 has finished for PR 28330 at commit 9f7f98e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-04-28T23:47:25Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

+                "number of output rows" -> 12L)))),
+              enableWholeStage
+            )
+          }


nit: format(wrong indents)

    Seq((leftQuery, false), (rightQuery, false), (leftQuery, true), (rightQuery, true))
      .foreach { case (query, enableWholeStage) =>
        val df = spark.sql(query)
        testSparkPlanMetrics(df, 2, Map(
          0L -> (("BroadcastNestedLoopJoin", Map(
            "number of output rows" -> 12L)))),
          enableWholeStage
        )
      }

Fixed in latest commit.

SparkQA · 2020-04-30T00:52:08Z

Test build #122087 has finished for PR 28330 at commit 830dfbf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sririshindra · 2020-05-04T15:23:35Z

@maropu Could you please let me know if any other changes are needed in this PR? If not, could you merge it? Thank you.

sririshindra · 2020-05-18T17:52:19Z

@maropu Could you please take a look at this when you have a moment?

maropu · 2020-05-18T23:53:48Z

sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala

-        SQLConf.SHUFFLE_PARTITIONS.key -> "2",
-        SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
+      SQLConf.SHUFFLE_PARTITIONS.key -> "2",
+      SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {


nit: please avoid unnecessary changes where possible.

maropu

cc: @HyukjinKwon @dongjoon-hyun

HyukjinKwon · 2020-05-24T06:08:27Z

retest this please

SparkQA · 2020-05-24T07:05:02Z

Test build #123049 has finished for PR 28330 at commit 830dfbf.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-05-24T08:49:12Z

retest this please

SparkQA · 2020-05-24T13:25:06Z

Test build #123057 has finished for PR 28330 at commit 830dfbf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-05-25T03:43:53Z

Merged to master.

[SPARK-31377][SQL][TEST] Added unit tests to 'number of output rows m…

2af93ea

…etric' for some joins in SQLMetricSuite

probot-autolabeler bot added the SQL label Apr 24, 2020

maropu reviewed Apr 26, 2020

sririshindra added 2 commits April 26, 2020 10:51

incorporating code review comments

0d16bdf

incorporating code review comments

7565906

sririshindra requested a review from maropu April 26, 2020 18:29

maropu reviewed Apr 27, 2020

maropu reviewed Apr 27, 2020

sririshindra added 2 commits April 27, 2020 07:28

incorporating code review comments

882ba72

incorporating code review comments

42d7ec4

sririshindra requested a review from maropu April 27, 2020 19:53

maropu reviewed Apr 27, 2020

incorporating code review comments

9f7f98e

sririshindra requested a review from maropu April 28, 2020 17:52

maropu reviewed Apr 28, 2020

sririshindra requested a review from maropu April 29, 2020 14:27

incorporating code review comments

830dfbf

maropu reviewed May 18, 2020

maropu approved these changes May 18, 2020

HyukjinKwon closed this in b90e10c May 25, 2020
