
Conversation


@rdblue rdblue commented Dec 12, 2020

This adds a new API for partition spec evolution and an implementation that produces updated specs. It does not yet include adding the new API to Table and Transaction, adding a last assigned partition ID to metadata, or documentation for the new API. This PR is primarily the implementation and tests, to be followed by further integration and docs.

return add(sourceId, fieldId, name, Transforms.fromString(column.type(), transform));
}

Builder add(int sourceId, int fieldId, String name, Transform<?, ?> transform) {
rdblue (author) commented:
Needed to add fields using a Transform rather than a string.
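
For illustration, a hedged sketch of what the new overload enables; the field IDs and the Transforms.bucket arguments are made up for the example, not taken from this PR:

// build a partition field directly from a Transform instance instead of re-parsing its string form
builder.add(sourceId, fieldId, "id_bucket", Transforms.bucket(Types.IntegerType.get(), 16));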

public Builder alwaysNull(String sourceName, String targetName) {
checkAndAddPartitionName(targetName);
Types.NestedField sourceColumn = findSourceColumn(sourceName);
checkAndAddPartitionName(targetName, sourceColumn.fieldId()); // can duplicate a source column name
rdblue (author) commented:
Needed to allow identity partitions that have been deleted.
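
For context, a hedged sketch of the case this enables: a v1 spec where an identity partition field has been deleted and is carried forward as an always-null (void) field that reuses the source column's name. The builder calls follow the snippet above, but the overall spec layout here is an assumption:

// the deleted identity partition on "category" keeps its source-duplicating name,
// but its transform becomes void, so it always produces null
PartitionSpec evolved = PartitionSpec.builderFor(schema)
    .alwaysNull("category", "category")
    .bucket("id", 16, "shard")
    .build();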

* will not be resolved and will result in a {@link CommitFailedException}.
*/
public interface UpdatePartitionSpec extends PendingUpdate<PartitionSpec> {
UpdatePartitionSpec addField(String sourceName);
rdblue (author) commented:
Will add Javadoc for these methods after there is agreement on this API.
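
As a hedged sketch of how the API might read once it is wired into Table (which this PR intentionally leaves out), assuming an updateSpec() entry point and the Expressions transform helpers:

table.updateSpec()
    .addField("category")                             // identity partition on an existing column
    .addField("shard", Expressions.bucket("id", 16))  // named bucket partition field
    .removeField("legacy_field")
    .renameField("ts_day", "day")
    .commit();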

return results;
}

static <R> R visit(Schema schema, PartitionField field, PartitionSpecVisitor<R> visitor) {
rdblue (author) commented:
Refactored this out of the method above so that individual fields can be visited when setting default partition names based on the transform.

@rdblue rdblue force-pushed the add-partition-spec-evolution branch from 9d39ee1 to b141533 on December 12, 2020 02:01

rdblue commented Dec 12, 2020

@jun-he, could you look at this? It implements what we discussed for partition spec evolution.

@rdblue rdblue force-pushed the add-partition-spec-evolution branch from b141533 to d1bd036 on December 12, 2020 21:11
import static org.apache.iceberg.expressions.Expressions.truncate;
import static org.apache.iceberg.expressions.Expressions.year;

public class TestUpdatePartitionSpecV1 {

A reviewer commented:

Maybe add a test for .removeField("myField").addField("myField")? What's the expected outcome for that?

rdblue (author) replied:

The expected result of any combination should be the same as if the two were done as separate changes, except for cases that signal some problem. For example, renaming a field and deleting the original field shows that there are two conflicting changes, even though we could apply one and then fail the second.

I agree that this should have a test for that case, as well as possibly some additional checks for consistency with rename.
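
For instance, a hedged sketch of the rename-plus-delete conflict mentioned above; the updateSpec() entry point, the exception type, and the error message are assumptions for illustration:

AssertHelpers.assertThrows("Should reject renaming a field that is also being deleted",
    IllegalArgumentException.class, "Cannot rename and delete partition field",
    () -> table.updateSpec()
        .renameField("shard", "partition_shard")
        .removeField("shard")
        .apply());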


@johnclara johnclara Dec 14, 2020


Cool, so would order matter?
Is .removeField("myField").addField("myField") different from .addField("myField").removeField("myField")?


@rdblue rdblue Dec 14, 2020


Yes. Imagine the table has the partition shard=bucket(id, 16) and I run .removeField("shard").addField("shard", bucket(id, 32)). That replaces the 16-bin bucket scheme with a 32-bin bucket scheme. The opposite order would add a bucket partition and then immediately remove it, which looks like a mistake. So the second ordering would be rejected, while the first would be allowed.
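
Roughly, and assuming the updateSpec() entry point and the Expressions.bucket helper, the two orderings look like this:

// allowed: replaces the 16-bucket scheme with a 32-bucket scheme
table.updateSpec()
    .removeField("shard")
    .addField("shard", Expressions.bucket("id", 32))
    .commit();

// rejected: adds a bucket field and then removes it in the same change, which looks like a mistake
table.updateSpec()
    .addField("shard", Expressions.bucket("id", 32))
    .removeField("shard")
    .commit();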

rdblue (author) commented:

I added quite a few more test cases for situations like this and improved the error messages. Now, if you attempt to rename a field that was added, you get an error that clearly says you can't rename an added field, instead of an error that says the field is unknown. The same applies to combinations of adds and deletes.

I also added a check that disallows adding redundant time partitions, like days("ts") and hours("ts"). That's allowed if the partition already existed, but not when adding new partition fields.
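
A hedged example of that check; the Expressions.day/hour helpers and the failure mode are assumptions for illustration:

// rejected: two time-based partitions over the same source column in a single update
table.updateSpec()
    .addField(Expressions.day("ts"))
    .addField(Expressions.hour("ts"))
    .commit();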


@danielcweeks danielcweeks left a comment


API looks fine to me. The only open question I have is what about type promotion of partition fields. We allow for add/remove/rename, but what if you want to promote int to String (etc.)?


rdblue commented Dec 16, 2020

> The only open question I have is what about type promotion of partition fields.

Partition fields store the type output by the partition transform. In a lot of cases, that won't change. For example, bucket always produces ints no matter what the input type is. Some transforms, like identity, produce values of the input type. The only way to change the output type for those transforms is to change the input type through column promotion. When an identity partition column is promoted from int to long or from float to double, the partition column is promoted automatically as well and the existing values are promoted on read.

In case you're wondering, we were also careful to make sure type promotion doesn't change the result of any partition function. If promoting a value after the transformation would not give the same result as transforming the promoted value, then the promotion must happen before the transformation: bucket(int_col, width) is actually computed as bucket((long) int_col, width). (This is also one reason why promotion to String is dangerous and not supported.)
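
A minimal, self-contained sketch of that invariant; the hash below is a stand-in, not Iceberg's Murmur3-based bucket hash, and the point is only that the int path promotes to long before hashing:

final class BucketPromotionSketch {
  // stand-in for the real bucket hash function
  static int hashOfLong(long value) {
    return Long.hashCode(value);
  }

  // bucket(int_col, width) promotes to long before hashing...
  static int bucketOfInt(int value, int numBuckets) {
    return Math.floorMod(hashOfLong((long) value), numBuckets);
  }

  // ...so it always matches bucket(long_col, width) for the same value, and files
  // written before an int -> long promotion stay in the correct buckets afterward
  static int bucketOfLong(long value, int numBuckets) {
    return Math.floorMod(hashOfLong(value), numBuckets);
  }
}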


rdblue commented Dec 16, 2020

Thanks for reviewing, @danielcweeks and @johnclara! I'll merge this.
