
Conversation

@jun-he (Collaborator) commented Apr 15, 2020

For issues #281, #836, and #1091.

This PR implements change-based APIs to support partition spec evolution.

@rdblue (Contributor) commented Apr 15, 2020

Thanks for working on this, @jun-he!

Besides my comment about replacing the entire spec, I have two other main concerns. First, I think we might want to consider a change-based API, which is related to my comment about not replacing the whole spec. For updating schemas, the API expresses things like addColumn, renameColumn, etc. that correspond to SQL DDL statements. I think it would make sense to have a similar API here:

  • addField(String source, ...)
  • renameField(String name, String newName)
  • replaceField(String name, ...)
  • removeField(String name)

I'm not sure how we should pass the transform to addField and replaceField just yet. We have expressions for the various transforms, so we could use an expression like addField("ts_hour", hour("ts")). We could also add a lot more variations of the method names, like addHourField, addDayField, etc.
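
A minimal sketch of what that change-based surface might look like (hypothetical: the interface did not exist at this point, and the Term parameter type is an assumption since how to pass the transform was still an open question):

// Hypothetical sketch of the change-based API surface described above.
// "Term" stands in for however the transform ends up being passed.
public interface UpdatePartitionSpec extends PendingUpdate<PartitionSpec> {
  // add a new partition field, e.g. addField("ts_hour", hour("ts"))
  UpdatePartitionSpec addField(String name, Term transform);
  // rename an existing partition field, keeping its transform and ID
  UpdatePartitionSpec renameField(String name, String newName);
  // replace the transform behind an existing partition field name
  UpdatePartitionSpec replaceField(String name, Term newTransform);
  // drop a partition field (v1 may replace it with a void transform instead)
  UpdatePartitionSpec removeField(String name);
}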

Second, I think we are going to need a v1 version and a v2 version. In v1, we have to maintain the order and number of fields, so we are more limited for partition spec evolution: we can rename partition fields, replace existing partition fields with the void transform to "remove" them, and add new partition fields at the end of the spec. But in v2, we have reliable IDs, so we can actually remove fields, replace a field with a different transform (drop days and add hours in the same position), and also rename and add partition fields. I think the easiest thing is to build separate implementations that validate and apply those changes.

@rdblue (Contributor) commented Apr 15, 2020

#924 just made it in, so this can implement dropField for v1 by replacing the field with a void transform. That will avoid ID problems.
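
For v1, "dropping" a field then means swapping its transform for the void transform while keeping the field's position, name, and ID. Roughly (illustrative only: PartitionField construction is internal to the library, and Transforms.alwaysNull() is assumed to be the accessor for the void transform from #924):

// Illustrative only: "remove" a v1 partition field via the void transform.
// The field keeps its position and field ID, so older metadata stays valid,
// but every new row partitions to null for this field.
PartitionField dropped = new PartitionField(
    field.sourceId(),          // same source column
    field.fieldId(),           // same partition field ID
    field.name(),              // same name, for v1 compatibility
    Transforms.alwaysNull());  // void transform from #924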

@jun-he (Collaborator, Author) commented Apr 24, 2020

@rdblue Thanks for the comments.
I am thinking we will always create a new spec with a new spec ID. The change-based API will first clone an existing spec (either the latest one or another one specified by its spec ID) and then mutate the clone, as sketched after the list below. This maintains the history of spec evolution and keeps existing specs immutable, so the change-based API can then treat the result the same way as any new spec.
But this means existing data will still use the original spec, because we never modify it.

If we keep the existing spec immutable,

  • addField, renameField, and removeField will be straightforward for V1 and V2.
  • replaceField will be equivalent to removeField followed by addField. One issue is reusing the same field ID. This should be fine for V1, since V1 always resets the field IDs for a new spec. But for V2, we cannot reuse the same field ID because it has already been used by a partition field in one or more old partition specs.
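
A rough sketch of that copy-on-write flow, with hypothetical helper names (cloneOf, nextSpecId, and applyChanges are not existing Iceberg methods):

// Rough sketch of the clone-then-mutate flow described above.
PartitionSpec base = table.spec();               // existing spec stays immutable
PartitionSpec.Builder builder = cloneOf(base)    // start from a copy of the base spec
    .withSpecId(nextSpecId(table));              // every evolution gets a fresh spec ID
applyChanges(builder);                           // change-based mutations touch only the clone
PartitionSpec evolved = builder.build();         // old data files keep the original spec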

@jun-he force-pushed the jun/partitionspec-evolution branch from 3e6111d to befa310 on April 25, 2020
@jun-he force-pushed the jun/partitionspec-evolution branch from befa310 to 57fe84c on May 25, 2020
@jun-he changed the title from "[WIP] Add partition spec evolution" to "Add partition spec evolution" on May 25, 2020
@jun-he (Collaborator, Author) commented May 25, 2020

@rdblue Wondering if it would be better to have a separate PR to implement change-based partition spec evolution, as this PR has already addressed #836 and added the basic evolution features to the Table API. Thanks.

@jun-he changed the title from "Add partition spec evolution" to "[Part 1] Add partition spec evolution" on May 25, 2020
@rdblue (Contributor) commented May 25, 2020

@jun-he, thanks for working on this.

I don't think that the API here is the right one. I think that the change-based API is the only one we need, so it doesn't make sense to have a replacement API in addition. My comment above has more context.

@jun-he changed the title from "[Part 1] Add partition spec evolution" to "Add partition spec evolution" on Jun 4, 2020
@jun-he (Collaborator, Author) commented Jun 4, 2020

@rdblue I updated the PR with all change-based APIs. Can you please take a look? Thanks.

@rdblue (Contributor) commented Jun 4, 2020

Looks like the test failure is due to importing Guava classes instead of the relocated versions. We now rely on a bundled and relocated version of Guava, introduced in #1068.
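
For reference, the fix is an import swap (using Lists as an example):

// wrong: importing Guava directly
import com.google.common.collect.Lists;
// right: the bundled, relocated Guava shipped with Iceberg
import org.apache.iceberg.relocated.com.google.common.collect.Lists;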

@jun-he force-pushed the jun/partitionspec-evolution branch from 8487eb4 to 8413d1a on June 28, 2020
@jun-he (Collaborator, Author) commented Jun 28, 2020

@rdblue thanks for the comments. I addressed them and updated the PR accordingly.
In this change, UpdatePartitionSpec now ignores all soft-deleted fields in V1 and applies all changes to create a V2 partition spec. After that, it fills each partition field ID gap with a VoidTransform field.

In this way, we can also support partition spec evolution cases such as:

  • reusing a field ID previously removed in V1
  • keeping the spec concise, without useless trailing void fields

Additionally, the V1 case does not need to be considered before committing, which makes the code cleaner and easier to read.
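
A sketch of that gap-filling step (voidField is an illustrative helper, not the PR's actual identifier):

// Sketch: after building the spec as if the table were V2, back-fill any
// missing V1 partition field IDs with void-transform placeholders so the
// V1 field list stays dense and ordered. voidField() is a hypothetical
// helper that builds a placeholder field using the void transform.
List<PartitionField> v1Fields = Lists.newArrayList();
int nextId = 1000;  // Iceberg partition field IDs start at 1000
for (PartitionField field : evolvedFields) {
  while (nextId < field.fieldId()) {
    v1Fields.add(voidField(nextId));  // fill the ID gap with a void placeholder
    nextId += 1;
  }
  v1Fields.add(field);
  nextId = field.fieldId() + 1;
}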

Can you please take another look and let me know your comments? Thanks!

@rdblue (Contributor) commented Jun 30, 2020

Thanks for the update, @jun-he! I'll take a look.

public void testMultipleTimestampPartitions() {
  AssertHelpers.assertThrows("Should not allow year(ts) and year(ts)",
      IllegalArgumentException.class, "Cannot use partition name more than once",
      () -> PartitionSpec.builderFor(SCHEMA).year("ts", "year").year("another_ts", "year").build());
rdblue (Contributor):

The context for these checks is incorrect. The problem is not a duplicate transform for a field; it is that the name is reused.

Also, these should not be in this test case for multiple timestamp partitions. They should be in a test case for duplicate names.

jun-he (Collaborator, Author):

I think this change is a side effect of the duplicate partition detection taking precedence over the duplicate name detection, as discussed in #922 (comment).

rdblue (Contributor):

The context, "Should not allow year(ts) and year(ts)", isn't correct because the call uses two different source columns, ts and another_ts. It should be "Should not allow partition fields with the same name".

And since this is passing in the partition name, you can change it to avoid hitting the wrong error case. You could move this to a new test for name collisions, and update this one to avoid the name collision by removing the explicit partition field name.
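
Concretely, the suggested separate test might look like this (a sketch based on the snippet above, with a hypothetical test name):

// Sketch of a dedicated test for duplicate partition names.
@Test
public void testDuplicatePartitionNames() {
  AssertHelpers.assertThrows("Should not allow partition fields with the same name",
      IllegalArgumentException.class, "Cannot use partition name more than once",
      () -> PartitionSpec.builderFor(SCHEMA)
          .year("ts", "year")
          .year("another_ts", "year")
          .build());
}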

jun-he (Collaborator, Author):

Thanks for the explanation! Updated accordingly.

() -> PartitionSpec.builderFor(SCHEMA)
    .bucket("id", 8, "id_bucket1")
    .bucket("id", 16, "id_bucket2").build());

rdblue (Contributor):

Nit: unnecessary empty line.

jun-he (Collaborator, Author):

Done

private final Schema schema;
private final Map<String, PartitionField> curSpecFields;
private final List<Consumer<PartitionSpec.Builder>> newSpecFields = Lists.newArrayList();
private final Map<String, PartitionField> newRemovedFields = Maps.newHashMap();
rdblue (Contributor):

I think the way that this operation keeps track of state makes it very confusing to read and understand. I can see the appeal of just delegating to the spec builder, but the current approach requires complicated checks after the builder is configured (like checkIfRemoved on line 97) and methods that rewrite the entire spec anyway (like freshSpecFieldIds and fillGapsByNonNullFields).

I think this would be easier to understand if it did validation and assignment incrementally. Instead of adding a method to rewrite the spec with different field IDs, this should check for deleted fields when the configuration method is called and use the correct ID from the start. This should help avoid the need for work-arounds like getLastPartitionField in the spec builder.
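
A sketch of the incremental style being suggested (nameToField, deletedFields, key, buildField, adds, and the Term type are all illustrative names, not the PR's actual code):

// Sketch: validate and assign the field ID at the moment addField is called,
// instead of rewriting all IDs at commit time.
public UpdatePartitionSpec addField(String name, Term transform) {
  Preconditions.checkArgument(!nameToField.containsKey(name),
      "Cannot use partition name more than once: %s", name);
  PartitionField deleted = deletedFields.get(key(name, transform));
  int fieldId = deleted != null
      ? deleted.fieldId()   // re-adding a deleted field reuses its original ID
      : nextFieldId();      // otherwise, assign the next ID immediately
  adds.add(buildField(fieldId, name, transform));
  return this;
}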

jun-he (Collaborator, Author):

@rdblue Thanks for the comments. I think we can actually remove newRemovedFields, checkIfRemoved, and getLastPartitionField, as there is no need to check whether a newly added field was just removed.
As we discussed in #922 (comment), the main concern with removing a field and then adding it back is polluting the metadata.
In the current implementation, because the partition field ID is reused, this won't add duplicates; it is equivalent to doing nothing or renaming the field.
I added a few unit tests for those cases.

jun-he (Collaborator, Author):

I also updated the code to use the correct ID from the start to avoid freshSpecFieldIds at the end.

  } else {
    builder.add(spec);
  }
}
rdblue (Contributor):

I'm not sure that this is the right change. I think it's a good idea to allow calling this method with the same spec, but that should be detected and should return the exact same TableMetadata. That way, the commit will appear to succeed, but won't actually change the table (no-op commits are used elsewhere, too).
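
A sketch of that detection in TableMetadata (the method shape and buildReplacement helper are assumptions):

// Sketch: if the "updated" spec is identical to the current one, return
// this TableMetadata unchanged so the commit becomes a no-op.
public TableMetadata updatePartitionSpec(PartitionSpec newSpec) {
  if (spec().equals(newSpec)) {
    return this;  // no-op commit: same spec, so no new metadata version
  }
  return buildReplacement(newSpec);  // hypothetical: append newSpec and update the default
}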

jun-he (Collaborator, Author):

@rdblue Thanks for the comment. It is a good idea to support a no-op commit.
The main reason to remove the precondition check here was to support renaming a partition field: in the renaming case, the partition spec is compatible but different, and it needs to be committed.
I updated the code to support no-op commits and added unit tests.


  @Test
- public void testToJsonForV1Table() {
+ public void testToJson() {
rdblue (Contributor):

Instead of changing existing test cases, let's add new test cases for the case you want to exercise.

jun-he (Collaborator, Author):

Thanks for the comment. This test was updated because TableTestBase was removed; no new test cases were added to it.
I updated the test name because, with the current implementation, this test is not specific to a V1 table and there is no special toJson handling needed for a V2 table.

PartitionField field = new PartitionField(
    sourceColumn.fieldId(), nextFieldId(), targetName, Transforms.day(sourceColumn.type()));
checkForRedundantPartitions(field);
checkAndAddPartitionName(targetName);
rdblue (Contributor):

I think this was needed because of my comment that the duplicate field should throw an exception before the duplicate name.

} else if (transform.startsWith("bucket[")) {
  type = "bucket";
} else {
  return null;
rdblue (Contributor):

What about using the transform name in this situation? Then this would always catch duplicate transforms.

Also, should we add a case for truncate with different lengths?
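
A sketch of keying the duplicate check on the transform name (dedupKey is a hypothetical helper; stripping the bracket from toString() stands in for the getName() method discussed below):

// Sketch: derive a duplicate-check key from the transform name so any two
// transforms of the same kind on the same column collide, e.g. bucket[8]
// and bucket[16] both reduce to "bucket".
private String dedupKey(PartitionField field) {
  String transform = field.transform().toString();  // e.g. "bucket[16]"
  int bracket = transform.indexOf('[');
  String name = bracket < 0 ? transform : transform.substring(0, bracket);
  return field.sourceId() + ":" + name;             // one transform kind per source column
}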

jun-he (Collaborator, Author):

Yep, that is better. I added a getName() method to the Transform interface.

Similar to bucket, we may want to allow only one truncate transform for a given field.
Do you know if there is a valid use case for multiple truncate transforms on a field?

@jun-he force-pushed the jun/partitionspec-evolution branch from 8413d1a to 99373a7 on July 14, 2020
@jun-he (Collaborator, Author) commented Jul 23, 2020

@rdblue can you take another look? Thanks.

@jun-he requested a review from rdblue on July 23, 2020
@jun-he (Collaborator, Author) commented Jul 29, 2020

@rdblue can you let me know your comments? Thanks.

@rdblue (Contributor) commented Dec 17, 2020

Most of the changes here are included in #1942, so I'll close this. It would be nice to get some of the test updates in another PR, though. Thanks @jun-he!
