Skip to content

Conversation

@wecharyu
Copy link
Contributor

@wecharyu wecharyu commented Feb 17, 2023

What changes were proposed in this pull request?

  1. Change to get matching partition names rather than partition objects when drop partitions

Why are the changes needed?

  1. Partition names are enough to drop partitions
  2. It can reduce the time overhead and driver memory overhead.

Does this PR introduce any user-facing change?

Yes, we have add a new sql conf to enable this feature: spark.sql.hive.dropPartitionByName.enabled

How was this patch tested?

Add new tests.

@github-actions github-actions bot added the SQL label Feb 17, 2023
import org.apache.spark.unsafe.types.UTF8String

private[sql] object PartitioningUtils {
private val PATTERN_FOR_KEY_EQ_VAL = "(.+)=(.+)".r
Copy link
Contributor

@LuciferYang LuciferYang Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too idealistic, not all partition tables follow this rule. For example, we can use
alter table ... partition(...) set location ... to relocate the partition to any directory

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the data corresponding to the partition a=1 is stored in dir /1/, will there be a bad case with this pr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LuciferYang Thanks for your review, partition name is always followed this rule in Hive makePartName.
Partition name is only related to partition keys and values, other partition fields like location will not affect it.

Copy link
Contributor

@LuciferYang LuciferYang Feb 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember seeing similar cases in the production environment, but I can't remember the details. Need to have tests to check the corner scenes we can think of

cc @wangyum @sunchao FYI

@github-actions github-actions bot added the BUILD label Feb 18, 2023
Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't try to add any hive-related dependencies to the catalyst module

@github-actions github-actions bot removed the BUILD label Feb 19, 2023
@wecharyu
Copy link
Contributor Author

Addressed comments. @LuciferYang
And gentle ping @wangyum @sunchao: could you also take a look?

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK, but I wonder if we can add a few more tests for it. Scenarios I can think of:

  • partitions added with external tables, e.g., ALTER TABLE ... ADD PARTITION ... LOCATION
  • partition names with special characters, like %, =, etc.

We should also add a config to turn on/off this feature, in case there are edge cases that we haven't thought of, so users can fallback to the old behavior.

@wecharyu
Copy link
Contributor Author

Add a conf spark.sql.hive.dropPartitionByName.enabled and two tests. cc: @sunchao

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sunchao sunchao closed this in 153ace7 Mar 9, 2023
@sunchao
Copy link
Member

sunchao commented Mar 9, 2023

Merged to master, thanks!

@dongjoon-hyun
Copy link
Member

Thank youso much, @wecharyu, @sunchao and all!

buildConf("spark.sql.hive.dropPartitionByName.enabled")
.doc("When true, Spark will get partition name rather than partition object " +
"to drop partition, which can improve the performance of drop partition.")
.version("3.4.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @sunchao . You need to backport this to branch-3.4.

Copy link
Member

@dongjoon-hyun dongjoon-hyun Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do backporting still if you need this. Otherwise, we need to change this to 3.5.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for pointing out @dongjoon-hyun ! yes, let me backport this to 3.4.0 too and update the JIRA accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's pretty safe to backport to branch-3.4 since the feature is turned off by default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the decision. I also support your decision. Here is my +1.

sunchao pushed a commit that referenced this pull request Mar 9, 2023
### What changes were proposed in this pull request?
1. Change to get matching partition names rather than partition objects when drop partitions

### Why are the changes needed?
1. Partition names are enough to drop partitions
2. It can reduce the time overhead and driver memory overhead.

### Does this PR introduce _any_ user-facing change?
Yes, we have add a new sql conf to enable this feature: `spark.sql.hive.dropPartitionByName.enabled`

### How was this patch tested?
Add new tests.

Closes #40069 from wecharyu/SPARK-42480.

Authored-by: wecharyu <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
### What changes were proposed in this pull request?
1. Change to get matching partition names rather than partition objects when drop partitions

### Why are the changes needed?
1. Partition names are enough to drop partitions
2. It can reduce the time overhead and driver memory overhead.

### Does this PR introduce _any_ user-facing change?
Yes, we have add a new sql conf to enable this feature: `spark.sql.hive.dropPartitionByName.enabled`

### How was this patch tested?
Add new tests.

Closes apache#40069 from wecharyu/SPARK-42480.

Authored-by: wecharyu <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants