Conversation

@vinson0526
Contributor

With v1 metadata, we cannot add a new partition field that previously existed. Furthermore, an exception is thrown when we update the partition spec after dropping partition transforms more than once on the same column.

I am using the Iceberg master branch and Spark 3.0.2. The issue can be reproduced with the following steps:

create table test(create_time timestamp) using iceberg;
alter table test add partition field years(create_time);
alter table test drop partition field years(create_time);
alter table test add partition field months(create_time);
alter table test drop partition field months(create_time);
alter table test add partition field days(create_time);

and an exception is thrown:

Multiple entries with same key: (1, void)=1001: create_time_month: void(1) and (1, void)=1000: create_time_year: void(1)
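
For reference, the same evolution can also be expressed through Iceberg's Java Table API. Below is a minimal sketch, assuming an already-loaded org.apache.iceberg.Table handle (catalog and table-loading code elided). The collision arises because v1 specs keep dropped fields as void transforms, so two drops on the same source column leave two entries under the same (source id, void) key, as the message above shows.

import static org.apache.iceberg.expressions.Expressions.day;
import static org.apache.iceberg.expressions.Expressions.month;
import static org.apache.iceberg.expressions.Expressions.year;

import org.apache.iceberg.Table;

class PartitionEvolutionRepro {
  // "table" is an assumed, already-loaded Iceberg Table.
  static void reproduce(Table table) {
    table.updateSpec().addField(year("create_time")).commit();
    table.updateSpec().removeField(year("create_time")).commit();   // v1 keeps the dropped field as a void transform
    table.updateSpec().addField(month("create_time")).commit();
    table.updateSpec().removeField(month("create_time")).commit();  // second void field on the same source column
    table.updateSpec().addField(day("create_time")).commit();       // threw "Multiple entries with same key" before this fix
  }
}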

I try to fix it with this PR.

@pvary
Contributor

pvary commented Jun 10, 2021

Started the tests. Let's see if we break something, or not 😄

Left one question in review, but generally looks good to me

@github-actions bot added the spark label Jun 10, 2021
@pvary
Contributor

pvary commented Jun 11, 2021

@marton-bod, @lcspinter: Could you please take a look as well?

Thanks,
Peter

@marton-bod (Collaborator) left a comment

Looks good to me, just one question

partitionName = PartitionSpecVisitor.visit(schema, newField, PartitionNameGenerator.INSTANCE);
}

adds.add(Pair.of(newField, partitionName));
Contributor

Why not create a new PartitionField with its name set to partitionName? Then you wouldn't need to change as much, including the type of adds.

Contributor Author

If the name is null, we generate a default name for it at the end of the apply() method. We need the names of all added fields before detecting conflicts, so I changed the type of adds to store the generated name and avoid generating it twice.
If I don't change the type of adds, I think there are four other options:

  1. Just generate the name twice.
  2. Add a setter setName(String name) to PartitionField.
  3. Recreate newField with the generated name, like:
if (newField.name() == null) {
    String partitionName = PartitionSpecVisitor.visit(schema, newField, PartitionNameGenerator.INSTANCE);
    newField = new PartitionField(
        newField.sourceId(), newField.fieldId(), partitionName, newField.transform());
}

adds.add(newField);
  4. Write a new static visit method in PartitionSpecVisitor with a signature like:
static <R> R visit(Schema schema, int sourceId, Transform<?, ?> transform, PartitionSpecVisitor<R> visitor)

Which do you think is better?

Contributor

The PartitionField is created in this method, as is partitionName. I think that by reordering a couple of statements here, you can use the correct name on the field from the start and avoid more changes.

Contributor Author

Regenerated the PartitionField with partitionName when the name is null.

}
}

private boolean isVoidTransform(PartitionField field) {
Contributor

+1 for this approach to detect void transforms.
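
For readers without the full diff, here is a minimal sketch of what such a check could look like; the actual method body is not visible in this hunk, and the use of Transforms.alwaysNull() is an assumption, not necessarily the PR's exact implementation.

// Hypothetical sketch; the real body is not shown in this review hunk.
// A field dropped from a v1 spec is retained with a void transform,
// so comparing against the always-null (void) transform identifies it.
private boolean isVoidTransform(PartitionField field) {
  return field.transform().equals(Transforms.alwaysNull());
}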

PartitionField existingField = nameToField.get(newName);
if (existingField != null && isVoidTransform(existingField)) {
// rename the old deleted field that is being replaced by the new field
renameField(existingField.name(), existingField.name() + "_" + UUID.randomUUID());
Contributor

Why use UUID.randomUUID() instead of the partition field ID? I think it makes more sense to use existingField.fieldId().
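
For illustration, the suggested change would amount to roughly the following, reusing the renameField call from the hunk above:

// Suffix the deleted field's name with its stable field ID instead of a random UUID.
renameField(existingField.name(), existingField.name() + "_" + existingField.fieldId());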

Contributor Author

Changed to existingField.fieldId().

if (existingField != null) {
if (isVoidTransform(existingField)) {
// rename the old deleted field that is being replaced by the new field
renameField(existingField.name(), existingField.name() + "_" + UUID.randomUUID());
Contributor

Here as well, I'd prefer not to use a UUID. This should be able to use the existing field's ID instead.

Contributor Author

Used the existing field's ID instead.

@rdblue
Contributor

rdblue commented Jun 18, 2021

Thanks for the update, @vinson0526! I kicked off the tests and will merge once they pass. The changes look great!

@pvary merged commit 619603c into apache:master Jun 21, 2021
@pvary
Contributor

pvary commented Jun 21, 2021

Since every test passed and the change got a +1 from @rdblue, I merged it.

Thanks for the fix, @vinson0526, and thanks to @rdblue and @marton-bod for the review!

@vinson0526 deleted the fix_partition_evo_failed branch June 22, 2021 06:37