[SPARK-42480][SQL] Improve the performance of drop partitions #40069
import org.apache.spark.unsafe.types.UTF8String

private[sql] object PartitioningUtils {
  private val PATTERN_FOR_KEY_EQ_VAL = "(.+)=(.+)".r
Too idealistic; not all partitioned tables follow this rule. For example, we can use
`ALTER TABLE ... PARTITION (...) SET LOCATION ...` to relocate a partition to any directory.
So if the data for partition a=1 is stored in directory /1/, will this PR hit a bad case?
@LuciferYang Thanks for your review. The partition name always follows this rule; see Hive's makePartName.
The partition name depends only on the partition keys and values; other partition fields, such as location, do not affect it.
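To illustrate the point above, here is a minimal sketch of how a partition name can be derived from keys and values alone and then parsed back with the same `(.+)=(.+)` pattern shown in the diff. `PartitionNameSketch`, `PartitionInfo`, `makePartName`, and `parsePartName` are hypothetical names for illustration, not Hive's or Spark's actual code, and escaping of special characters is deliberately left out.

```scala
object PartitionNameSketch {
  // Same key=value pattern as PATTERN_FOR_KEY_EQ_VAL in the diff.
  private val KeyEqVal = "(.+)=(.+)".r

  // Hypothetical stand-in for a catalog partition: an ordered spec plus a location.
  final case class PartitionInfo(spec: Seq[(String, String)], location: String)

  // Build a name such as "a=1/b=x" from ordered (key, value) pairs.
  def makePartName(spec: Seq[(String, String)]): String =
    spec.map { case (k, v) => s"$k=$v" }.mkString("/")

  // The name is derived from the spec only; the location is never consulted.
  def partitionName(p: PartitionInfo): String = makePartName(p.spec)

  // Parse a name back into ordered (key, value) pairs.
  def parsePartName(name: String): Seq[(String, String)] =
    name.split("/").toSeq.map {
      case KeyEqVal(k, v) => (k, v)
      case other =>
        throw new IllegalArgumentException(s"Not a key=value fragment: $other")
    }
}
```

Under these assumptions, a partition `a=1` relocated to `/1/` still has the name `a=1`, which is why the stored directory does not matter when matching by name.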
LuciferYang left a comment:
Please don't add any Hive-related dependencies to the catalyst module.
Addressed comments. @LuciferYang
sunchao left a comment:
Looks OK, but I wonder if we can add a few more tests for it. Scenarios I can think of:
- partitions added with external tables, e.g., `ALTER TABLE ... ADD PARTITION ... LOCATION`
- partition names with special characters, like `%`, `=`, etc.

We should also add a config to turn this feature on/off, in case there are edge cases that we haven't thought of, so users can fall back to the old behavior.
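The special-character scenario can be sketched as a round trip: escape each key and value, join them into a name, then parse the name back. This is a simplified illustration only. The escape set below is an assumed subset (the real Hive and Spark lists are longer), and `PartitionEscapeSketch` and its methods are hypothetical names.

```scala
import scala.util.Try

object PartitionEscapeSketch {
  private val KeyEqVal = "(.+)=(.+)".r

  // Assumption: a reduced set of characters that must be percent-escaped.
  private val CharsToEscape: Set[Char] = Set('%', '=', '/', ':', '#', '?', '\\')

  // Replace each special character with '%' followed by its two-digit hex code.
  def escape(s: String): String = {
    val sb = new StringBuilder
    s.foreach { c =>
      if (CharsToEscape.contains(c)) sb.append("%" + "%02X".format(c.toInt))
      else sb.append(c)
    }
    sb.toString
  }

  // Reverse of escape; a '%' not followed by two hex digits is kept as-is.
  def unescape(s: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < s.length) {
      val decoded =
        if (s.charAt(i) == '%' && i + 2 < s.length)
          Try(Integer.parseInt(s.substring(i + 1, i + 3), 16)).toOption
        else None
      decoded match {
        case Some(code) => sb.append(code.toChar); i += 3
        case None => sb.append(s.charAt(i)); i += 1
      }
    }
    sb.toString
  }

  def makePartName(spec: Seq[(String, String)]): String =
    spec.map { case (k, v) => s"${escape(k)}=${escape(v)}" }.mkString("/")

  def parsePartName(name: String): Seq[(String, String)] =
    name.split("/").toSeq.map {
      case KeyEqVal(k, v) => (unescape(k), unescape(v))
      case other =>
        throw new IllegalArgumentException(s"Not a key=value fragment: $other")
    }
}
```

Because `=` and `/` inside values are escaped, each fragment contains exactly one raw `=`, so the greedy `(.+)=(.+)` pattern splits it unambiguously.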
Add a conf
sunchao left a comment:
LGTM
Merged to master, thanks!
buildConf("spark.sql.hive.dropPartitionByName.enabled")
  .doc("When true, Spark will get partition name rather than partition object " +
    "to drop partition, which can improve the performance of drop partition.")
  .version("3.4.0")
Hi, @sunchao . You need to backport this to branch-3.4.
You can still backport this if you need it. Otherwise, we need to change this to 3.5.0.
Thanks for pointing that out, @dongjoon-hyun! Yes, let me backport this to 3.4.0 too and update the JIRA accordingly.
I think it's pretty safe to backport to branch-3.4 since the feature is turned off by default.
Thank you for the decision. I also support your decision. Here is my +1.
### What changes were proposed in this pull request?
1. Change to get matching partition names rather than partition objects when dropping partitions.

### Why are the changes needed?
1. Partition names are enough to drop partitions.
2. It can reduce the time overhead and driver memory overhead.

### Does this PR introduce _any_ user-facing change?
Yes, we have added a new SQL conf to enable this feature: `spark.sql.hive.dropPartitionByName.enabled`

### How was this patch tested?
Added new tests.

Closes #40069 from wecharyu/SPARK-42480.

Authored-by: wecharyu <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
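For readers who want to try the new conf, the following is a minimal usage sketch. It assumes a Hive-enabled `SparkSession` and that the conf can be set at session level; the table `events` and its partition column `dt` are hypothetical names. Since the feature is off by default, it has to be enabled explicitly.

```scala
import org.apache.spark.sql.SparkSession

object DropPartitionByNameUsage {
  // `events` and `dt` are hypothetical; substitute a real partitioned Hive table.
  def dropOldPartition(spark: SparkSession): Unit = {
    // Opt in to dropping partitions by name (assumed to be settable per session).
    spark.conf.set("spark.sql.hive.dropPartitionByName.enabled", "true")
    spark.sql("ALTER TABLE events DROP IF EXISTS PARTITION (dt = '2023-02-17')")
  }
}
```

Setting the conf back to `false` should restore the previous behavior of fetching partition objects, which is the fallback the reviewers asked for.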