Conversation

@vinson0526
Contributor

With v1 metadata, we cannot add a new partition field that previously existed. Furthermore, an exception is thrown when we update the partition spec after dropping partition transforms more than once on the same column.

I am using the Iceberg master branch and Spark 3.0.2. The issue can be reproduced with the following steps:

create table test(create_time timestamp) using iceberg;
alter table test add partition field years(create_time);
alter table test drop partition field years(create_time);
alter table test add partition field months(create_time);
alter table test drop partition field months(create_time);
alter table test add partition field days(create_time);

and an exception is thrown:

Multiple entries with same key: (1, void)=1001: create_time_month: void(1) and (1, void)=1000: create_time_year: void(1)
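
For reference, the same evolution can also be expressed through Iceberg's Java Table API. Below is a minimal sketch, assuming an already-loaded org.apache.iceberg.Table handle (catalog and table-loading code elided). The collision arises because v1 specs keep dropped fields as void transforms, so two drops on the same source column leave two entries under the same (source id, void) key, as the message above shows.

import static org.apache.iceberg.expressions.Expressions.day;
import static org.apache.iceberg.expressions.Expressions.month;
import static org.apache.iceberg.expressions.Expressions.year;

import org.apache.iceberg.Table;

class PartitionEvolutionRepro {
  // "table" is an assumed, already-loaded Iceberg Table.
  static void reproduce(Table table) {
    table.updateSpec().addField(year("create_time")).commit();
    table.updateSpec().removeField(year("create_time")).commit();   // v1 keeps the dropped field as a void transform
    table.updateSpec().addField(month("create_time")).commit();
    table.updateSpec().removeField(month("create_time")).commit();  // second void field on the same source column
    table.updateSpec().addField(day("create_time")).commit();       // threw "Multiple entries with same key" before this fix
  }
}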

I try to fix it with this PR.

@pvary
Contributor

pvary commented Jun 10, 2021

Started the tests. Let's see if we break something, or not 😄

Left one question in review, but generally looks good to me

@github-actions bot added the spark label Jun 10, 2021
@pvary
Contributor

pvary commented Jun 11, 2021

@marton-bod, @lcspinter: Could you please take a look as well?

Thanks,
Peter

@marton-bod (Collaborator) left a comment

Looks good to me, just one question

partitionName = PartitionSpecVisitor.visit(schema, newField, PartitionNameGenerator.INSTANCE);
}

adds.add(Pair.of(newField, partitionName));
Contributor

Why not create a new PartitionField with its name set to partitionName? Then you wouldn't need to change as much, including the type of adds.

Contributor Author

If the name is null, we generate a default name for it at the end of the apply() method. We need the names of all added fields before detecting conflicts, so I changed the type of adds to store the generated name and avoid generating it twice.
If I don't change the type of adds, I think there are four other options:

  1. Just generate the name twice.
  2. Add a setter setName(String name) to PartitionField.
  3. Recreate newField with the generated name, like:
if (newField.name() == null) {
    String partitionName = PartitionSpecVisitor.visit(schema, newField, PartitionNameGenerator.INSTANCE);
    newField = new PartitionField(
        newField.sourceId(), newField.fieldId(), partitionName, newField.transform());
}

adds.add(newField);
  4. Write a new static visit method in PartitionSpecVisitor with a signature like:
static <R> R visit(Schema schema, int sourceId, Transform<?, ?> transform, PartitionSpecVisitor<R> visitor)

Which do you think is better?

Contributor

The PartitionField is created in this method, as is partitionName. I think that by reordering a couple of statements here, you can use the correct name on the field from the start and avoid more changes.

Contributor Author

Regenerated the PartitionField with partitionName when the name is null.

}
}

private boolean isVoidTransform(PartitionField field) {
Contributor

+1 for this approach to detect void transforms.
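
For readers without the full diff, here is a minimal sketch of what such a check could look like; the actual method body is not visible in this hunk, and the use of Transforms.alwaysNull() is an assumption, not necessarily the PR's exact implementation.

// Hypothetical sketch; the real body is not shown in this review hunk.
// A field dropped from a v1 spec is retained with a void transform,
// so comparing against the always-null (void) transform identifies it.
private boolean isVoidTransform(PartitionField field) {
  return field.transform().equals(Transforms.alwaysNull());
}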

PartitionField existingField = nameToField.get(newName);
if (existingField != null && isVoidTransform(existingField)) {
// rename the old deleted field that is being replaced by the new field
renameField(existingField.name(), existingField.name() + "_" + UUID.randomUUID());
Contributor

Why use UUID.randomUUID() instead of the partition field ID? I think it makes more sense to use existingField.fieldId().
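
For illustration, the suggested change would amount to roughly the following, reusing the renameField call from the hunk above:

// Suffix the deleted field's name with its stable field ID instead of a random UUID.
renameField(existingField.name(), existingField.name() + "_" + existingField.fieldId());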

Contributor Author

Changed to existingField.fieldId().

if (existingField != null) {
if (isVoidTransform(existingField)) {
// rename the old deleted field that is being replaced by the new field
renameField(existingField.name(), existingField.name() + "_" + UUID.randomUUID());
Contributor

Here as well, I'd prefer not to use a UUID. This should be able to use the existing field's ID instead.

Contributor Author

Used the existing field's ID instead.

@rdblue
Contributor

rdblue commented Jun 18, 2021

Thanks for the update, @vinson0526! I kicked off the tests and will merge once they pass. The changes look great!

@pvary merged commit 619603c into apache:master Jun 21, 2021
@pvary
Contributor

pvary commented Jun 21, 2021

Since every test passed and the change got a +1 from @rdblue, I merged it.

Thanks for the fix, @vinson0526, and thanks to @rdblue and @marton-bod for the review!

@vinson0526 deleted the fix_partition_evo_failed branch June 22, 2021 06:37