-
Notifications
You must be signed in to change notification settings - Fork 2.9k
fix: add and remove partition transform on same column failed when use v1 metadata #2691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java
Outdated
Show resolved
Hide resolved
|
Started the tests. Let's see if we break something, or not 😄 Left one question in review, but generally looks good to me |
|
@marton-bod, @lcspinter: Could you please take a look as well? Thanks, |
marton-bod
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just one question
core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java
Outdated
Show resolved
Hide resolved
| partitionName = PartitionSpecVisitor.visit(schema, newField, PartitionNameGenerator.INSTANCE); | ||
| } | ||
|
|
||
| adds.add(Pair.of(newField, partitionName)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not create a new PartitionField with its name set to partitionName? Then you wouldn't need to change as much, including the type of adds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If name is null, we generate default name for it at the end of apply() method. We need all name of added field before detect the conflict. So i change the type of adds to save generated name , avoid generate it twice.
if i don't change the type of adds, i think there are four other options:
- just generate name twice
- add a set method
setName(String name)toPartitionField - generate newField again with generated name: just like
if (newField.name() == null) {
String partitionName = PartitionSpecVisitor.visit(schema, newField, PartitionNameGenerator.INSTANCE);
newField = new PartitionField(
newField.sourceId(), newField.fieldId(), partitionName, newField.transform());
}
adds.add(newField);- write a new static
visitmethod inPartitionSpecVisitor, the signature is like:
static <R> R visit(Schema schema, int sourceId, Transform<?, ?> transform, PartitionSpecVisitor<R> visitor)Which do you think is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PartitionField is created in this method, as is partitionName. I think that by reordering a couple of statements here, you can use the correct name on the field from the start and avoid more changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regenerate PartitionField with partitionName if name is null.
core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java
Outdated
Show resolved
Hide resolved
| } | ||
| } | ||
|
|
||
| private boolean isVoidTransform(PartitionField field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for this approach to detect void transforms.
core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java
Outdated
Show resolved
Hide resolved
| PartitionField existingField = nameToField.get(newName); | ||
| if (existingField != null && isVoidTransform(existingField)) { | ||
| // rename the old deleted field that is being replaced by the new field | ||
| renameField(existingField.name(), existingField.name() + "_" + UUID.randomUUID()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use UUID.randomUUID() instead of the partition field ID? I think it makes more sense to use existingField.fieldId().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to existingField.fieldId()
| if (existingField != null) { | ||
| if (isVoidTransform(existingField)) { | ||
| // rename the old deleted field that is being replaced by the new field | ||
| renameField(existingField.name(), existingField.name() + "_" + UUID.randomUUID()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as well, I'd prefer not to use a UUID. This should be able to use the existing field's ID instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the existing field's ID instead.
|
Thanks for the update, @vinson0526! I kicked off the tests and will merge once they pass. The changes look great! |
|
Since every test passed and got +1 from @rdblue, merged the change. Thanks for the fix @vinson0526 and @rdblue, @marton-bod for the review! |
When use v1 metadata, we cannot add a new partition that previously existed.
Further more, exception will thrown when we update partition spec after dropped partition transform more than once on same column.
I use iceberg master branch and spark 3.0.2.
we can reproduce follow these step.
and exception thrown
I try to fix it by this PR.