Skip to content

Conversation

@sririshindra
Copy link
Contributor

@sririshindra sririshindra commented Apr 24, 2020

What changes were proposed in this pull request?

Add unit tests to the 'number of output rows metric' for some join types in the SQLMetricSuite. A list of unit tests added are as follows.

  • ShuffledHashJoin: leftOuter, RightOuter, LeftAnti, LeftSemi
  • BroadcastNestedLoopJoin: RightOuter
  • BroadcastHashJoin: LeftAnti

Why are the changes needed?

For some combinations of JoinType and Join algorithm there is no test coverage for the 'number of output rows' metric.

Does this PR introduce any user-facing change?

No

How was this patch tested?

I added debug statements in the code to ensure the correct combination if JoinType and Join algorithms are triggered.
I further used Intellij debugger to test the same.

@maropu
Copy link
Member

maropu commented Apr 26, 2020

ok to test

}

test("ShuffledHashJoin(outer) metrics") {
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "40",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

40 -> -1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting it to -1 would trigger SortMergeJoin instead of ShuffledHashJoin based on the rules in Spark Strategies.

Copy link
Member

@maropu maropu Apr 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using a hint to control join physical plans?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the latest commit.


Seq(("right_outer", 0L, df1, df2, false), ("left_outer", 0L, df2, df1, false),
("right_outer", 0L, df1, df2, true), ("left_outer", 0L, df2, df1, true))
.foreach { case (joinType, nodeId, df1, df2, enableWholeStage) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: df1 -> leftDf and df2 -> rightDf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in the latest commit


test("ShuffledHashJoin(outer) metrics") {
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "40",
SQLConf.SHUFFLE_PARTITIONS.key -> "2",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to set this value for the test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my latest commit combined three different tests into one. Not setting this parameter will not trigger the correct join type based on the rules in the Spark Strategies file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the latest commit.

val df = df2.join(df1.hint("shuffle_hash"), $"key" === $"key2", "left_semi")
testSparkPlanMetrics(df, 1, Map(
nodeId -> (("ShuffledHashJoin", Map(
"number of output rows" -> 2L)))),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to split this join test into three parts? It seems the only metric value is different between them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in the latest commit

@SparkQA
Copy link

SparkQA commented Apr 26, 2020

Test build #121822 has finished for PR 28330 at commit 2af93ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sririshindra sririshindra requested a review from maropu April 26, 2020 18:29
@SparkQA
Copy link

SparkQA commented Apr 26, 2020

Test build #121850 has finished for PR 28330 at commit 7565906.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 26, 2020

Test build #121849 has finished for PR 28330 at commit 0d16bdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

nodeId -> (("BroadcastHashJoin", Map(
"number of output rows" -> numRows)))),
enableWholeStage
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wrong indents

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in latest commit

@sririshindra sririshindra requested a review from maropu April 27, 2020 19:53
@SparkQA
Copy link

SparkQA commented Apr 27, 2020

Test build #121909 has finished for PR 28330 at commit 882ba72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 27, 2020

Test build #121910 has finished for PR 28330 at commit 42d7ec4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

nodeId3 -> (("Exchange", Map(
"shuffle records written" -> 10L,
"records read" -> 10L)))),
enableWholeStage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this indentation is correct. I know it is just a small cosmetic change and probably doesn't need to be included in this PR. I will remove it if you think this should not be there.

}
}

test("ShuffledHashJoin(left,outer) metrics") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: (left, outer)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in the latest commit.

}

test("ShuffledHashJoin(left,outer) metrics") {
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "2",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still needs this setting, even though the hint used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I should have removed these in the last commit itself. Fixed in the latest commit.

testSparkPlanMetrics(df, 2, Map(
nodeId -> (("BroadcastHashJoin", Map(
"number of output rows" -> numRows)))),
nodeId -> (("BroadcastHashJoin", Map("number of output rows" -> numRows)))),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in latest commit

val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
val df2 = Seq((1, "1"), (2, "2"), (3, "3"), (4, "4")).toDF("key2", "value")
// Assume the execution plan is
// ... -> BroadcastHashJoin(nodeId = 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need this comment? I think the code below is clear without this comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in latest commit

@sririshindra sririshindra requested a review from maropu April 28, 2020 17:52
@SparkQA
Copy link

SparkQA commented Apr 28, 2020

Test build #122012 has finished for PR 28330 at commit 9f7f98e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"number of output rows" -> 12L)))),
enableWholeStage
)
}
Copy link
Member

@maropu maropu Apr 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: format(wrong indents)

        Seq((leftQuery, false), (rightQuery, false), (leftQuery, true), (rightQuery, true))
          .foreach { case (query, enableWholeStage) =>
          val df = spark.sql(query)
          testSparkPlanMetrics(df, 2, Map(
            0L -> (("BroadcastNestedLoopJoin", Map(
              "number of output rows" -> 12L)))),
            enableWholeStage
          )
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in latest commit.

@sririshindra sririshindra requested a review from maropu April 29, 2020 14:27
@SparkQA
Copy link

SparkQA commented Apr 30, 2020

Test build #122087 has finished for PR 28330 at commit 830dfbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sririshindra
Copy link
Contributor Author

@maropu Could you please let me know if there are any other changes needed in this PR. If not could you merge this PR. Thank you.

@sririshindra
Copy link
Contributor Author

@maropu Could you please take a look at this when you have a moment.

SQLConf.SHUFFLE_PARTITIONS.key -> "2",
SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
SQLConf.SHUFFLE_PARTITIONS.key -> "2",
SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: plz avoid unnecesary changes where possible.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented May 24, 2020

Test build #123049 has finished for PR 28330 at commit 830dfbf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented May 24, 2020

Test build #123057 has finished for PR 28330 at commit 830dfbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented May 25, 2020

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants