Add persistent IDs to partition fields #845
Conversation
@rdblue can you please take a look? Thanks.

Hi @rdblue, can you please take a look? Thanks.
chenjunjiedada left a comment
Thanks @jun-he! This addresses my open issue for adding partition evolution.
Just left some comments for you.
  private transient volatile ListMultimap<Integer, PartitionField> fieldsBySourceId = null;
  private transient volatile Class<?>[] lazyJavaClasses = null;
  private transient volatile List<PartitionField> fieldList = null;
  private final int lastAssignedFieldId;
Do we want to use TypeUtil::NextID?
This is tracking something different. Here, this is the highest ID assigned to any partition field, so that the next ID assigned will be unique.
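As a rough illustration of that tracking (a sketch, not the PR's actual code; it assumes the fieldId() accessor this PR adds to PartitionField and the conventional starting ID of 1000):

import java.util.List;

// Sketch: track the highest ID assigned to any partition field so the
// next assigned ID is guaranteed to be unique.
static final int PARTITION_DATA_ID_START = 1000;

static int lastAssignedFieldId(List<PartitionField> fields) {
  int lastAssigned = PARTITION_DATA_ID_START - 1; // nothing assigned yet
  for (PartitionField field : fields) {
    lastAssigned = Math.max(lastAssigned, field.fieldId());
  }
  return lastAssigned; // the next unique ID is lastAssigned + 1
}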
  PartitionField(int sourceId, String name, Transform<?, ?> transform) {
  PartitionField(int sourceId, int fieldId, String name, Transform<?, ?> transform) {
    this.sourceId = sourceId;
    this.fieldId = fieldId;
Do we need to check that fieldId is larger than 1000?
I don't think so. That is a convention that we use, but not strictly required by the spec.
  builder.add(sourceId, name, transform);
  // partition field ids are missing in old PartitionSpec; they always auto-increment from PARTITION_DATA_ID_START
  if (!hasFieldId) {
What about forward compatibility? Is it possible for an old reader to read the new spec? Would it still parse the new spec with field IDs starting from 1000?
For forward-compatibility, I think that this should detect breaking changes to specs and throw an exception.
If IDs are removed by an older writer, then the IDs will be reassigned. That means that IDs must be assigned starting at 1000 and should have no gaps. If there are IDs, this should validate that assumption by checking that the field actually has the ID that is expected.
We should make a similar change on the write path: for each field, check that its field ID is what would be assigned if it were removed by an older writer. That will prevent newer writers from creating specs that will break.
Instead of making these changes here, I think that this should be verified in TableMetadata. That would accomplish the same thing, but make it easier to check the table version.
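A hypothetical sketch of that validation, assuming v1 field IDs must be exactly sequential from 1000 with no gaps; Preconditions is Guava's, which Iceberg already depends on:

import java.util.List;
import com.google.common.base.Preconditions;

// Sketch: every field ID must equal the ID an older writer would re-derive
// (sequential from 1000). Otherwise, dropping and reassigning the IDs
// would silently change the spec.
static void checkV1Compatible(List<PartitionField> fields) {
  int expectedId = PARTITION_DATA_ID_START; // 1000 by convention
  for (PartitionField field : fields) {
    Preconditions.checkArgument(field.fieldId() == expectedId,
        "Field %s has ID %s, but reassignment would produce %s",
        field.name(), field.fieldId(), expectedId);
    expectedId += 1;
  }
}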
  if (elements.hasNext() && elements.next().has(FIELD_ID)) {
    hasFieldId = true;
  }
  elements = json.elements();
I think it would be easier to follow and would result in a better error message if we put this logic inside the elements loop.
How about adding a counter for field IDs that are present and after the loop throwing an exception if the counter is not equal to the number of fields? Then each field would be handled independently (using has(FIELD_ID)).
I like the idea of a check here that states there were missing field IDs.
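A sketch of that loop shape (the SOURCE_ID/FIELD_ID/NAME/TRANSFORM constants and the four-argument builder.add that accepts a field ID are assumed from this PR's parser; Jackson's JsonNode API is used as in the surrounding code):

import java.util.Iterator;
import com.fasterxml.jackson.databind.JsonNode;
import com.google.common.base.Preconditions;

// Handle each field independently with has(FIELD_ID), count explicit IDs,
// and fail afterwards if only some of the fields carried one.
int fieldCount = 0;
int fieldsWithId = 0;
Iterator<JsonNode> elements = json.elements();
while (elements.hasNext()) {
  JsonNode element = elements.next();
  fieldCount += 1;
  if (element.has(FIELD_ID)) {
    fieldsWithId += 1;
    builder.add(element.get(SOURCE_ID).asInt(), element.get(FIELD_ID).asInt(),
        element.get(NAME).asText(), element.get(TRANSFORM).asText());
  } else {
    // old metadata: the builder auto-assigns from PARTITION_DATA_ID_START
    builder.add(element.get(SOURCE_ID).asInt(),
        element.get(NAME).asText(), element.get(TRANSFORM).asText());
  }
}
Preconditions.checkArgument(fieldsWithId == 0 || fieldsWithId == fieldCount,
    "Cannot parse spec: %s of %s fields are missing field IDs",
    fieldCount - fieldsWithId, fieldCount);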
👌
I'm only about half-way done reviewing this, but I wanted to capture some thoughts about forward-compatibility that were raised by @chenjunjiedada.

If there are already multiple partition specs, then the IDs may be reused and can even conflict. This isn't something we can change because manifest files embed the field IDs in their schemas. That means assignment when there are no IDs must be from 1000 and should be independent across different partition specs.

If an older version writes to the table, then it may remove any assigned partition IDs. That means that for any format v1 table, we must remain compatible with the current assignment strategy. That way, IDs can be removed by an old writer and will be the same when they are reassigned.

This also means that evolution is limited in v1 tables. To ensure that IDs can be reassigned correctly if they are removed, partition fields cannot be dropped or reordered in any way. Otherwise, reassignment would be incorrect. That means no removing partition fields, no reordering partition fields, and no adding partition fields unless they are added at the end of the spec.

We will be able to make more evolution changes when we can guarantee that all partition fields have IDs that won't be removed. We'll make the IDs a requirement in v2 tables.
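To make the constraint concrete, here is a small illustration; reassignIds is a hypothetical stand-in for the positional reassignment an old writer effectively performs when it drops the IDs:

import java.util.ArrayList;
import java.util.List;

// Old writers re-derive IDs purely by position, sequentially from 1000,
// so removing or reordering fields shifts the IDs of surviving fields.
static List<Integer> reassignIds(int numFields) {
  List<Integer> ids = new ArrayList<>();
  for (int i = 0; i < numFields; i += 1) {
    ids.add(PARTITION_DATA_ID_START + i); // 1000, 1001, 1002, ...
  }
  return ids;
}

// Example: a spec [a, b, c] is assigned [1000, 1001, 1002]. If b is removed
// and an old writer drops the IDs, [a, c] is reassigned [1000, 1001], so c
// silently moves from 1002 to 1001 and breaks manifests that embedded 1002.
// Appending at the end is the only change that leaves reassignment stable.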
 * When committing, these changes will be applied to the current table metadata. Commit conflicts
 * will not be resolved and will result in a {@link CommitFailedException}.
 */
public interface UpdatePartitionSpec extends PendingUpdate<PartitionSpec> {
Can we remove the new interface and methods from this PR?
I don't think this is needed for this PR, and I'd like to minimize the number of changes. In addition, I don't think we want to move to a model where users create a new spec and apply it to a table. I think we instead want to evolve the partition spec of a table. So this API will probably be different when we release that feature.
I agree that it is better to move the update to a new PR, which addresses #281.
The main reason I put it here is to have additional unit tests to make sure it works as expected.
Additionally, I think we may support table partition spec evolution in two ways:
table.updatePartitionSpec()
    .update(spec)
    .commit();

table.updatePartitionSpec().newSpec(schema)
    .identity(...)
    .bucket(...)
    ...
    .commit();
The first approach may be used if clients want to define a spec and manage it in their own code, e.g. to use the defined spec object in multiple places.
For tests, you can use TableMetadata and commit directly:

TableOperations ops = table.operations();
TableMetadata base = ops.current();
TableMetadata updated = base.updatePartitionSpec(newSpec);
ops.commit(base, updated);
👌
Thanks @jun-he, this is great progress on partition field IDs! I think I understand how compatibility will work with field IDs, and I tried to add that context to my comments in this PR. If it isn't clear, please let me know what doesn't make sense.
Force-pushed from 6d804af to cd75cbd
I will move the partition spec evolution changes to a separate PR.
@chenjunjiedada FYI, I refactored the code, and the partition spec evolution changes and related tests are now in #922. Thanks.
        .appendManifest(manifestWithDeletedFiles)
        .commit());
  }
Nit: unnecessary newline.
Thanks. Will remove it in #922.
Awesome work, @jun-he! I'm merging this.
For issue #280.
I took a different approach from the previous WIP PR (#499).
Instead of adding additional state and logic into TableMetadata, I created PartitionSpecUpdate to handle that. Also, it is unnecessary to keep lastAssignedPartitionFieldId in TableMetadata, as it can be lazily derived from all the specs (previous specs are still there) stored in TableMetadata.
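A sketch of that lazy derivation (a specs() accessor on TableMetadata returning all historical specs is assumed here for illustration):

import java.util.List;

// Derive the last assigned partition field ID from all stored specs
// instead of persisting it as separate state: take the maximum field ID
// seen in any spec, past or current.
static int lastAssignedPartitionFieldId(List<PartitionSpec> specs) {
  int lastAssigned = PARTITION_DATA_ID_START - 1;
  for (PartitionSpec spec : specs) {
    for (PartitionField field : spec.fields()) {
      lastAssigned = Math.max(lastAssigned, field.fieldId());
    }
  }
  return lastAssigned;
}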