[SPARK-30528][SQL] Turn off DPP subquery duplication by default#27551

Closed
maryannxue wants to merge 2 commits into apache:master from maryannxue:spark-30528

Conversation


@maryannxue (Contributor) commented Feb 12, 2020

What changes were proposed in this pull request?

This PR adds a config for Dynamic Partition Pruning (DPP) subquery duplication and turns the feature off by default because of its potential to regress performance.
When planning a DPP filter, Spark tries to reuse the broadcast exchange relation if the corresponding join is a broadcast hash join (BHJ) with the filtering relation on the build side. Otherwise, based on a cost estimate, it either opts out or plans the filter as a non-reusable duplicated subquery. That cost estimate is inaccurate, however: it accounts only for the table scan overhead, so adding a non-reusable duplicated DPP subquery can sometimes cause a performance regression.
This PR turns off the subquery-duplication DPP filter by:

  1. adding a config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` and setting it to `true` by default;
  2. removing the existing config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcast`, which is now meaningless since broadcast results should always be reused when possible.
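The planning decision described above can be sketched as a small decision function. This is an illustrative sketch only: `plan_dpp_filter`, its parameters, and the cost formula are hypothetical stand-ins for Spark's internal planner logic, which (as noted) bases its estimate solely on table scan sizes.

```python
def plan_dpp_filter(can_reuse_broadcast: bool,
                    reuse_broadcast_only: bool,
                    filter_ratio: float,
                    pruned_side_scan_bytes: int,
                    filter_side_scan_bytes: int) -> str:
    """Sketch of the DPP planning choice described in this PR.

    All names and the cost formula are illustrative assumptions,
    not Spark's actual implementation.
    """
    if can_reuse_broadcast:
        # The join is a BHJ with the filtering relation on the build side:
        # reuse its broadcast exchange, which adds almost no extra cost.
        return "reuse-broadcast"
    if reuse_broadcast_only:
        # New default behavior (reuseBroadcastOnly = true):
        # never plan a duplicated, non-reusable subquery.
        return "opt-out"
    # Old behavior: duplicate the filter as a standalone subquery when the
    # estimated scan savings on the pruned side exceed the extra scan of
    # the filtering side. Because this estimate ignores everything except
    # scan cost, it could choose duplication that regresses performance.
    savings = filter_ratio * pruned_side_scan_bytes
    return "duplicate-subquery" if savings > filter_side_scan_bytes else "opt-out"
```

With the new default, the third branch is unreachable, which is exactly the behavioral change this PR makes.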

Why are the changes needed?

This is to fix a potential performance regression caused by DPP.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Updated DynamicPartitionPruningSuite to test the new configuration.

@maryannxue

cc @cloud-fan @gatorsmile

```diff
 withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",
-  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "true") {
+  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "true",
+  SQLConf.EXCHANGE_REUSE_ENABLED.key -> "false") {
```
@cloud-fan (Contributor) commented Feb 12, 2020


shall we move it to the outer `withSQLConf` and just below `SQLConf.DYNAMIC_PARTITION_PRUNING_REUSE_BROADCAST_ONLY.key -> "false"`?

```diff
 withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",
-  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "false") {
+  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "false",
+  SQLConf.EXCHANGE_REUSE_ENABLED.key -> "false") {
```
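The nesting question in the review comment turns on how `withSQLConf` behaves: it applies the given entries for the duration of the block and restores the previous values on exit, so moving a key to the outer block widens its scope to every nested block. A rough Python analogue of that behavior (the conf map and helper here are illustrative stand-ins, not Spark code):

```python
from contextlib import contextmanager

# Illustrative stand-in for a Spark session's SQL conf; not Spark's API.
sql_conf = {"spark.sql.exchange.reuse": "true"}

@contextmanager
def with_sql_conf(pairs):
    """Set the given conf entries for the body of the `with` block,
    then restore the previous values on exit, mirroring the behavior
    of Spark's withSQLConf test helper."""
    saved = {k: sql_conf.get(k) for k in pairs}
    sql_conf.update(pairs)
    try:
        yield
    finally:
        for k, old in saved.items():
            if old is None:
                sql_conf.pop(k, None)
            else:
                sql_conf[k] = old

# The inner setting wins inside its block, and each block restores
# whatever value it found when it exits.
with with_sql_conf({"spark.sql.exchange.reuse": "false"}):
    assert sql_conf["spark.sql.exchange.reuse"] == "false"
assert sql_conf["spark.sql.exchange.reuse"] == "true"
```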

ditto


SparkQA commented Feb 12, 2020

Test build #118310 has finished for PR 27551 at commit 3514cf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue changed the title from [SPARK-30528] Turn off DPP subquery duplication by default to [SPARK-30528][SQL] Turn off DPP subquery duplication by default on Feb 12, 2020

SparkQA commented Feb 12, 2020

Test build #118317 has finished for PR 27551 at commit 29f5ae8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan closed this in 453d526 on Feb 13, 2020
@cloud-fan

LGTM, merging to master/3.0!

cloud-fan pushed a commit that referenced this pull request Feb 13, 2020
Closes #27551 from maryannxue/spark-30528.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 453d526)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27551 from maryannxue/spark-30528.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
