Add UpdatePartitionSpec implementation #1919
Conversation
    return add(sourceId, fieldId, name, Transforms.fromString(column.type(), transform));
  }

  Builder add(int sourceId, int fieldId, String name, Transform<?, ?> transform) {
Needed to add fields using Transform, not a string.
  public Builder alwaysNull(String sourceName, String targetName) {
    checkAndAddPartitionName(targetName);
    Types.NestedField sourceColumn = findSourceColumn(sourceName);
    checkAndAddPartitionName(targetName, sourceColumn.fieldId()); // can duplicate a source column name
Needed to allow identity partitions that have been deleted.
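For context on why a deleted identity partition's name may be reused: in v1 table metadata, partition field positions are fixed, so a delete cannot simply drop the field; it is replaced in place with a void (always-null) transform. A minimal stand-alone sketch of that idea (the class and field layout here are illustrative, not Iceberg's internals):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: deleting a v1 partition field keeps its position
// but nulls out its transform, which is why alwaysNull(...) may reuse
// the name of a deleted identity partition.
class SpecFieldsSketch {
    final List<String[]> fields = new ArrayList<>(); // {name, transform}

    void add(String name, String transform) {
        fields.add(new String[] {name, transform});
    }

    void remove(String name) {
        for (String[] f : fields) {
            if (f[0].equals(name)) {
                f[1] = "void"; // keep the slot, produce null values
            }
        }
    }
}
```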
 * will not be resolved and will result in a {@link CommitFailedException}.
 */
public interface UpdatePartitionSpec extends PendingUpdate<PartitionSpec> {
  UpdatePartitionSpec addField(String sourceName);
Will add Javadoc for these methods after there is agreement on this API.
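To make the intended shape of the API concrete, a hypothetical usage sketch (not runnable as-is: this PR does not yet wire UpdatePartitionSpec into Table or Transaction, and the exact method signatures may differ from what is shown):

```java
// Hypothetical fluent usage once the API is exposed on Table:
table.updateSpec()
    .addField("category")                    // identity partition
    .addField(Expressions.bucket("id", 16))  // transform partition
    .removeField("level")
    .commit();
```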
    return results;
  }

  static <R> R visit(Schema schema, PartitionField field, PartitionSpecVisitor<R> visitor) {
Refactored this out of the method above to visit individual fields, for setting default partition names based on the transform.
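Visiting a single field makes it possible to derive a default name from that field's transform. A minimal stand-alone sketch of such a naming helper (the helper itself is hypothetical, though the resulting names follow conventions like ts_day and id_bucket):

```java
// Hypothetical helper: build a default partition field name from the
// source column name and the transform's string form, e.g. "bucket[16]".
static String defaultName(String sourceName, String transform) {
    if (transform.equals("identity")) {
        return sourceName;                   // e.g. "category"
    } else if (transform.startsWith("bucket")) {
        return sourceName + "_bucket";       // e.g. "id_bucket"
    } else if (transform.startsWith("truncate")) {
        return sourceName + "_trunc";        // e.g. "data_trunc"
    } else {
        return sourceName + "_" + transform; // e.g. "ts_day"
    }
}
```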
(force-pushed 9d39ee1 to b141533)
@jun-he, could you look at this? It implements what we discussed for partition spec evolution.
(force-pushed b141533 to d1bd036)
import static org.apache.iceberg.expressions.Expressions.truncate;
import static org.apache.iceberg.expressions.Expressions.year;

public class TestUpdatePartitionSpecV1 {
Maybe add a test for .removeField("myField").addField("myField")? What's the expected outcome for that?
The expected result of any combination should be the same as if the two were done as separate changes, except for cases that signal some problem. For example, renaming a field and deleting the original field shows that there are two conflicting changes, even though we could apply one and then fail the second.
I agree that this should have a test for that case, as well as possibly some additional checks for consistency with rename.
Cool, so would order matter? Is .removeField("myField").addField("myField") different from .addField("myField").removeField("myField")?
Yes. Imagine I have partition shard=bucket(id, 16) and ran .removeField("shard").addField("shard", bucket(id, 32)). That replaces a 16-bin bucket scheme with a 32-bin bucket scheme. The opposite would add a bucket partition and then remove it, which looks like a mistake. So the second one would be rejected, the first would be allowed.
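The ordering rule above can be sketched with a tiny change tracker (purely illustrative, not Iceberg's implementation): a remove followed by an add of the same name is a replacement, while adding a field and then removing it in the same change set is treated as a mistake and rejected.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of order-sensitive partition spec changes.
class SpecChangesSketch {
    private final Set<String> added = new HashSet<>();
    private final Set<String> removed = new HashSet<>();

    void removeField(String name) {
        if (added.contains(name)) {
            // add-then-remove in one change set looks like a mistake
            throw new IllegalArgumentException("Cannot remove newly added field: " + name);
        }
        removed.add(name);
    }

    void addField(String name) {
        // remove-then-add is fine: it replaces the old field definition
        added.add(name);
    }
}
```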
I added quite a few more test cases for situations like this and improved error messages. Now, if you attempt to rename a field that was added you get an error that clearly says you can't rename an added field, instead of an error that says the field is unknown. Same thing for combinations of adds and deletes.
I also added a check to not allow adding duplicate time columns, like days("ts") and hours("ts"). That's allowed if the partition already existed, but not when adding partition fields.
danielcweeks left a comment:
API looks fine to me. The only open question I have is what about type promotion of partition fields. We allow for add/remove/rename, but what if you want to promote int to String (etc.)?
Partition fields store the type output by the partition transform. In a lot of cases, that won't change. For example, bucket always produces ints no matter what the input type. Some transforms, like identity, produce values of the input type, and the only way to change the output type for those transforms is to change the input type through column promotion. When an identity partition column is promoted from int to long or float to double, the partition column is promoted automatically as well and the existing values are promoted on read.

In case you're wondering, we were also careful to make sure type promotion doesn't change the result of any partition function. If a partition transform wouldn't produce the same value when promotion happens after transformation, then promotion must be done before the transformation.
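As an illustration of that last point (the hash below is a toy stand-in, not Iceberg's Murmur3-based bucket function): if the transform widens int values to long before hashing, then promoting the source column from int to long cannot change any bucket assignment.

```java
// Stand-in bucket transform: NOT Iceberg's implementation, but it shows
// the rule: promote (widen int -> long) BEFORE transforming, so results
// are identical before and after column promotion.
static int bucketLong(long value, int numBuckets) {
    int hash = Long.hashCode(value * 0x9E3779B97F4A7C15L); // toy mix, not Murmur3
    return (hash & Integer.MAX_VALUE) % numBuckets;
}

static int bucketInt(int value, int numBuckets) {
    // widen first, then hash: bucketInt(v) == bucketLong((long) v)
    return bucketLong((long) value, numBuckets);
}
```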
Thanks for reviewing, @danielcweeks and @johnclara! I'll merge this.
This adds a new API for partition spec evolution and an implementation to produce updated specs. This doesn't include adding the new API to table and transaction, adding a last assigned partition ID to metadata, or documentation for the new API. This is primarily the implementation and tests, to be followed with more integration and docs.