[SPARK-30528][SQL] Turn off DPP subquery duplication by default#27551

Closed
maryannxue wants to merge 2 commits into apache:master from maryannxue:spark-30528

Conversation


@maryannxue (Contributor) commented Feb 12, 2020

What changes were proposed in this pull request?

This PR adds a config for Dynamic Partition Pruning (DPP) subquery duplication and turns the feature off by default because of its potential to regress performance.
When planning a DPP filter, Spark tries to reuse the broadcast exchange relation if the corresponding join is a broadcast hash join (BHJ) with the filtering relation on the build side. Otherwise, based on a cost estimate, it either opts out or plans the filter as a non-reusable duplicated subquery. That cost estimate is inaccurate, however: it accounts only for the table scan overhead, so adding a non-reusable duplicated DPP subquery can sometimes cause a performance regression.
This PR turns off the subquery-duplication DPP filter by:

  1. adding a config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` and setting it to `true` by default;
  2. removing the existing config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcast`, which is now meaningless since broadcast results should always be reused when possible.
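The planning decision described above can be sketched as a small decision function. This is an illustrative sketch only: `plan_dpp_filter`, its parameters, and the cost formula are hypothetical stand-ins for Spark's internal planner logic, which (as noted) bases its estimate solely on table scan sizes.

```python
def plan_dpp_filter(can_reuse_broadcast: bool,
                    reuse_broadcast_only: bool,
                    filter_ratio: float,
                    pruned_side_scan_bytes: int,
                    filter_side_scan_bytes: int) -> str:
    """Sketch of the DPP planning choice described in this PR.

    All names and the cost formula are illustrative assumptions,
    not Spark's actual implementation.
    """
    if can_reuse_broadcast:
        # The join is a BHJ with the filtering relation on the build side:
        # reuse its broadcast exchange, which adds almost no extra cost.
        return "reuse-broadcast"
    if reuse_broadcast_only:
        # New default behavior (reuseBroadcastOnly = true):
        # never plan a duplicated, non-reusable subquery.
        return "opt-out"
    # Old behavior: duplicate the filter as a standalone subquery when the
    # estimated scan savings on the pruned side exceed the extra scan of
    # the filtering side. Because this estimate ignores everything except
    # scan cost, it could choose duplication that regresses performance.
    savings = filter_ratio * pruned_side_scan_bytes
    return "duplicate-subquery" if savings > filter_side_scan_bytes else "opt-out"
```

With the new default, the third branch is unreachable, which is exactly the behavioral change this PR makes.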

Why are the changes needed?

This is to fix a potential performance regression caused by DPP.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Updated DynamicPartitionPruningSuite to test the new configuration.

@maryannxue

cc @cloud-fan @gatorsmile

```diff
 withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",
-  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "true") {
+  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "true",
+  SQLConf.EXCHANGE_REUSE_ENABLED.key -> "false") {
```
@cloud-fan (Contributor) commented Feb 12, 2020


shall we move it to the outer `withSQLConf` and just below `SQLConf.DYNAMIC_PARTITION_PRUNING_REUSE_BROADCAST_ONLY.key -> "false"`?

```diff
 withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",
-  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "false") {
+  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "false",
+  SQLConf.EXCHANGE_REUSE_ENABLED.key -> "false") {
```
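The nesting question in the review comment turns on how `withSQLConf` behaves: it applies the given entries for the duration of the block and restores the previous values on exit, so moving a key to the outer block widens its scope to every nested block. A rough Python analogue of that behavior (the conf map and helper here are illustrative stand-ins, not Spark code):

```python
from contextlib import contextmanager

# Illustrative stand-in for a Spark session's SQL conf; not Spark's API.
sql_conf = {"spark.sql.exchange.reuse": "true"}

@contextmanager
def with_sql_conf(pairs):
    """Set the given conf entries for the body of the `with` block,
    then restore the previous values on exit, mirroring the behavior
    of Spark's withSQLConf test helper."""
    saved = {k: sql_conf.get(k) for k in pairs}
    sql_conf.update(pairs)
    try:
        yield
    finally:
        for k, old in saved.items():
            if old is None:
                sql_conf.pop(k, None)
            else:
                sql_conf[k] = old

# The inner setting wins inside its block, and each block restores
# whatever value it found when it exits.
with with_sql_conf({"spark.sql.exchange.reuse": "false"}):
    assert sql_conf["spark.sql.exchange.reuse"] == "false"
assert sql_conf["spark.sql.exchange.reuse"] == "true"
```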

ditto


SparkQA commented Feb 12, 2020

Test build #118310 has finished for PR 27551 at commit 3514cf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue changed the title from [SPARK-30528] Turn off DPP subquery duplication by default to [SPARK-30528][SQL] Turn off DPP subquery duplication by default on Feb 12, 2020

SparkQA commented Feb 12, 2020

Test build #118317 has finished for PR 27551 at commit 29f5ae8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan closed this in 453d526 on Feb 13, 2020
@cloud-fan

LGTM, merging to master/3.0!

cloud-fan pushed a commit that referenced this pull request Feb 13, 2020
Closes #27551 from maryannxue/spark-30528.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 453d526)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27551 from maryannxue/spark-30528.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
