[SPARK-30528][SQL] Turn off DPP subquery duplication by default #27551
Closed
maryannxue wants to merge 2 commits into apache:master from
Conversation
cloud-fan reviewed Feb 12, 2020
```diff
 withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",
-  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "true") {
+  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "true",
+  SQLConf.EXCHANGE_REUSE_ENABLED.key -> "false") {
```
shall we move it to the outer `withSQLConf` and just below `SQLConf.DYNAMIC_PARTITION_PRUNING_REUSE_BROADCAST_ONLY.key -> "false"`?
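A sketch of what the reviewer's suggested restructure might look like: lift `EXCHANGE_REUSE_ENABLED` into the outer `withSQLConf` so every nested case runs with exchange reuse disabled. The test bodies below are placeholders, not the actual `DynamicPartitionPruningSuite` code:

```scala
// Hypothetical restructure, per the review comment. Only the config keys
// come from the PR; the nested test bodies are illustrative placeholders.
withSQLConf(
  SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",
  SQLConf.DYNAMIC_PARTITION_PRUNING_REUSE_BROADCAST_ONLY.key -> "false",
  SQLConf.EXCHANGE_REUSE_ENABLED.key -> "false") {
  withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "true") {
    // ... assertions that exercise stats-based DPP ...
  }
  withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "false") {
    // ... assertions for the no-stats path ...
  }
}
```

Setting the flag once in the outer scope avoids repeating it in each nested `withSQLConf` and keeps the per-case blocks down to the single config they actually vary.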
cloud-fan reviewed Feb 12, 2020
```diff
 withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",
-  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "false") {
+  SQLConf.DYNAMIC_PARTITION_PRUNING_USE_STATS.key -> "false",
+  SQLConf.EXCHANGE_REUSE_ENABLED.key -> "false") {
```
Test build #118310 has finished for PR 27551 at commit
Test build #118317 has finished for PR 27551 at commit
LGTM, merging to master/3.0!
cloud-fan pushed a commit that referenced this pull request Feb 13, 2020
### What changes were proposed in this pull request?

This PR adds a config for Dynamic Partition Pruning subquery duplication and turns it off by default due to its potential performance regression.

When planning a DPP filter, it seeks to reuse the broadcast exchange relation if the corresponding join is a BHJ with the filter relation being on the build side; otherwise it will either opt out or plan the filter as an un-reusable subquery duplication based on the cost estimate. However, the cost estimate is not accurate and only takes into account the table scan overhead, thus adding an un-reusable subquery duplication DPP filter can sometimes cause perf regression.

This PR turns off the subquery duplication DPP filter by:
1. adding a config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` and setting it `true` by default.
2. removing the existing meaningless config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcast` since we always want to reuse broadcast results if possible.

### Why are the changes needed?

This is to fix a potential performance regression caused by DPP.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Updated DynamicPartitionPruningSuite to test the new configuration.

Closes #27551 from maryannxue/spark-30528.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 453d526)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

This PR adds a config for Dynamic Partition Pruning subquery duplication and turns it off by default due to its potential performance regression.

When planning a DPP filter, it seeks to reuse the broadcast exchange relation if the corresponding join is a BHJ with the filter relation being on the build side; otherwise it will either opt out or plan the filter as an un-reusable subquery duplication based on the cost estimate. However, the cost estimate is not accurate and only takes into account the table scan overhead, thus adding an un-reusable subquery duplication DPP filter can sometimes cause perf regression.

This PR turns off the subquery duplication DPP filter by:
1. adding a config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` and setting it `true` by default.
2. removing the existing meaningless config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcast` since we always want to reuse broadcast results if possible.

### Why are the changes needed?

This is to fix a potential performance regression caused by DPP.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Updated DynamicPartitionPruningSuite to test the new configuration.
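For users who do want subquery-duplication DPP filters back after this change, the new flag can be flipped at the session level. A minimal sketch, assuming a generic Spark session setup (only the config key and its default come from this PR; the rest is illustrative):

```scala
// Illustrative session setup; not code from this PR.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dpp-demo").getOrCreate()

// Default after this PR: DPP filters are planned only when the broadcast
// exchange of a broadcast hash join can be reused.
spark.conf.set(
  "spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "true")

// Opt back in to duplicated-subquery DPP filters. Note this may regress
// performance, since the cost estimate only accounts for table scan overhead.
spark.conf.set(
  "spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "false")
```

The same key can also be set via `--conf` at submit time or in `spark-defaults.conf`; the old `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcast` key no longer exists after this PR.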