[SPARK-42480][SQL] Improve the performance of drop partitions #40069

wecharyu · 2023-02-17T18:24:46Z

What changes were proposed in this pull request?

Retrieve only the matching partition names, rather than full partition objects, when dropping partitions.

Why are the changes needed?

Partition names are sufficient to drop partitions.
Fetching only the names reduces both the time overhead and the driver memory overhead.

Does this PR introduce any user-facing change?

Yes, this PR adds a new SQL conf to enable this feature: spark.sql.hive.dropPartitionByName.enabled

How was this patch tested?

Add new tests.
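
The difference between the two code paths can be illustrated with a minimal sketch. Everything below is hypothetical: `FakeMetastore`, `PartitionObject`, and `DropPartitionsSketch` are in-memory stand-ins invented for illustration, not Spark's or Hive's actual classes, and prefix matching is a simplification of matching a partial partition spec. The sketch only shows that the drop itself needs names, so the name-only path never materializes partition objects.

```scala
// Hypothetical stand-in for a full partition object (location, parameters, ...).
final case class PartitionObject(
    name: String,
    location: String,
    parameters: Map[String, String])

// Hypothetical in-memory metastore; NOT the Hive client API.
final class FakeMetastore(initial: Seq[PartitionObject]) {
  private var partitions: Map[String, PartitionObject] =
    initial.map(p => p.name -> p).toMap

  // Counts how many full partition objects callers have fetched.
  var objectsFetched: Int = 0

  // Cheap path: returns names only.
  def listPartitionNames(prefix: String): Seq[String] =
    partitions.keys.filter(_.startsWith(prefix)).toSeq.sorted

  // Expensive path: returns full objects.
  def listPartitions(prefix: String): Seq[PartitionObject] = {
    val matched = partitions.values.filter(_.name.startsWith(prefix)).toSeq
    objectsFetched += matched.size
    matched
  }

  def dropByName(name: String): Unit = partitions -= name

  def remaining: Set[String] = partitions.keySet
}

object DropPartitionsSketch {
  // `dropByName` plays the role of the new conf: true uses the name-only path,
  // false falls back to fetching full objects first.
  def dropPartitions(ms: FakeMetastore, prefix: String, dropByName: Boolean): Int = {
    val names =
      if (dropByName) ms.listPartitionNames(prefix)
      else ms.listPartitions(prefix).map(_.name)
    names.foreach(ms.dropByName)
    names.size
  }
}
```

In this sketch both paths drop the same partitions; only the amount of data pulled from the metastore differs.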

LuciferYang · 2023-02-17T19:17:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala

 import org.apache.spark.unsafe.types.UTF8String

 private[sql] object PartitioningUtils {
+  private val PATTERN_FOR_KEY_EQ_VAL = "(.+)=(.+)".r


This seems too idealistic; not all partitioned tables follow this rule. For example, we can use
alter table ... partition(...) set location ... to relocate a partition to any directory.

So if the data for partition a=1 is stored in the directory /1/, will this PR hit a bad case?

@LuciferYang Thanks for your review. The partition name always follows this rule; see Hive's makePartName.
The partition name depends only on the partition keys and values; other partition fields such as location do not affect it.
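
A rough sketch of why the `key=value` pattern holds even for values with special characters. This is not Hive's `makePartName` nor Spark's code: `PartitionNameSketch` and its helpers are hypothetical, the escaped character set is an illustrative subset, and empty or null values (which Hive maps to a default partition name) are not handled. The idea it assumes is that keys and values are percent-escaped before being joined, so a literal `=` or `/` inside a value cannot confuse the parser, and the location never enters the name.

```scala
object PartitionNameSketch {
  // Illustrative subset of characters to escape; the real list is longer.
  private val charsToEscape: Set[Char] =
    Set('%', '=', '/', ':', '#', '?', '\\', '"', '\'', '*')

  // Same shape as the pattern in the diff above.
  private val KeyEqVal = "(.+)=(.+)".r

  // Percent-encode special and control characters as %XX.
  def escape(s: String): String =
    s.map { c =>
      if (charsToEscape.contains(c) || c < ' ') "%" + "%02X".format(c.toInt)
      else c.toString
    }.mkString

  // Inverse of `escape`; assumes every '%' followed by two characters is a valid %XX sequence.
  def unescape(s: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < s.length) {
      if (s.charAt(i) == '%' && i + 2 < s.length) {
        sb.append(Integer.parseInt(s.substring(i + 1, i + 3), 16).toChar)
        i += 3
      } else {
        sb.append(s.charAt(i))
        i += 1
      }
    }
    sb.toString
  }

  // Build "k1=v1/k2=v2" from an ordered partition spec.
  def makePartName(spec: Seq[(String, String)]): String =
    spec.map { case (k, v) => escape(k) + "=" + escape(v) }.mkString("/")

  // Split a partition name back into its ordered key/value pairs.
  def parsePartName(name: String): Seq[(String, String)] =
    name.split("/").toSeq.map {
      case KeyEqVal(k, v) => unescape(k) -> unescape(v)
      case other =>
        throw new IllegalArgumentException(s"Not a key=value fragment: $other")
    }
}
```

For example, the spec `Seq("dt" -> "2023-02-17", "kv" -> "a=b%c")` round-trips through `makePartName` and `parsePartName` unchanged, because the `=` and `%` inside the value are escaped before the fragments are joined.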

I remember seeing similar cases in a production environment, but I can't recall the details. We need tests that cover the corner cases we can think of.

cc @wangyum @sunchao FYI

LuciferYang

Please don't try to add any hive-related dependencies to the catalyst module

wecharyu · 2023-02-22T06:08:37Z

Addressed comments. @LuciferYang
And gentle ping @wangyum @sunchao: could you also take a look?

sunchao

Looks OK, but I wonder if we can add a few more tests for it. Scenarios I can think of:

partitions added with external tables, e.g., ALTER TABLE ... ADD PARTITION ... LOCATION
partition names with special characters, like %, =, etc.

We should also add a config to turn this feature on and off, in case there are edge cases we haven't thought of, so users can fall back to the old behavior.

wecharyu · 2023-02-23T18:30:14Z

Added the conf spark.sql.hive.dropPartitionByName.enabled and two tests. cc: @sunchao

sunchao

LGTM

sunchao · 2023-03-09T00:30:22Z

Merged to master, thanks!

dongjoon-hyun · 2023-03-09T00:32:45Z

Thank you so much, @wecharyu, @sunchao and all!

dongjoon-hyun · 2023-03-09T00:33:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+    buildConf("spark.sql.hive.dropPartitionByName.enabled")
+      .doc("When true, Spark will get partition name rather than partition object " +
+           "to drop partition, which can improve the performance of drop partition.")
+      .version("3.4.0")


Hi, @sunchao . You need to backport this to branch-3.4.

You can still backport it if you need this. Otherwise, we need to change this version to 3.5.0.

Thanks for pointing this out, @dongjoon-hyun! Yes, let me backport this to 3.4.0 too and update the JIRA accordingly.

I think it's pretty safe to backport to branch-3.4 since the feature is turned off by default.

Thank you for the decision. I also support your decision. Here is my +1.

### What changes were proposed in this pull request?
1. Retrieve only the matching partition names, rather than full partition objects, when dropping partitions.

### Why are the changes needed?
1. Partition names are sufficient to drop partitions.
2. It can reduce the time overhead and driver memory overhead.

### Does this PR introduce _any_ user-facing change?
Yes, this PR adds a new SQL conf to enable this feature: `spark.sql.hive.dropPartitionByName.enabled`

### How was this patch tested?
Add new tests.

Closes #40069 from wecharyu/SPARK-42480.

Authored-by: wecharyu <[email protected]>
Signed-off-by: Chao Sun <[email protected]>

[SPARK-42480][SQL] Improve the performance of drop partitions

ac729f6

github-actions bot added the SQL label Feb 17, 2023

LuciferYang reviewed Feb 17, 2023

resolve partition name should support special chars

2c7446e

github-actions bot added the BUILD label Feb 18, 2023

LuciferYang requested changes Feb 19, 2023

remove hive dependency from catalyst module and fix test

8101e4b

github-actions bot removed the BUILD label Feb 19, 2023

sunchao reviewed Feb 22, 2023

add spark.sql.hive.dropPartitionByName.enabled conf

a4acae8

add rename partition test

6a806a6

sunchao approved these changes Mar 7, 2023

sunchao closed this in 153ace7 Mar 9, 2023

dongjoon-hyun reviewed Mar 9, 2023
