-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Core: update schema constructor callers to include fresh identifier #2556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -22,10 +22,12 @@ | |
| import java.util.List; | ||
| import java.util.Locale; | ||
| import java.util.Map; | ||
| import java.util.Objects; | ||
| import java.util.Set; | ||
| import java.util.concurrent.atomic.AtomicInteger; | ||
| import java.util.function.Predicate; | ||
| import java.util.function.Supplier; | ||
| import java.util.stream.Collectors; | ||
| import org.apache.iceberg.Schema; | ||
| import org.apache.iceberg.relocated.com.google.common.base.Preconditions; | ||
| import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; | ||
|
|
@@ -181,10 +183,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId) | |
| * @return a structurally identical schema with new ids assigned by the nextId function | ||
| */ | ||
| public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) { | ||
| return new Schema(TypeUtil | ||
| .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId)) | ||
| .asNestedType() | ||
| .fields()); | ||
| Types.StructType freshSchemaStruct = TypeUtil | ||
| .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId)) | ||
| .asStructType(); | ||
| return new Schema(freshSchemaStruct.fields(), refreshIdentifierFields(freshSchemaStruct, schema)); | ||
| } | ||
|
|
||
| /** | ||
| * Get the identifier fields in the fresh schema based on the identifier fields in the base schema. | ||
| * @param freshSchema fresh schema | ||
| * @param baseSchema base schema | ||
| * @return idnetifier fields in the fresh schema | ||
jackye1995 marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| */ | ||
| public static Set<Integer> refreshIdentifierFields(Types.StructType freshSchema, Schema baseSchema) { | ||
| Map<String, Integer> nameToId = TypeUtil.indexByName(freshSchema); | ||
| return baseSchema.identifierFieldNames().stream() | ||
| .map(nameToId::get) | ||
| .filter(Objects::nonNull) | ||
openinx marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| .collect(Collectors.toSet()); | ||
| } | ||
|
|
||
| /** | ||
|
|
@@ -213,7 +229,7 @@ public static Schema assignIncreasingFreshIds(Schema schema) { | |
| */ | ||
| public static Schema reassignIds(Schema schema, Schema idSourceSchema) { | ||
| Types.StructType struct = visit(schema, new ReassignIds(idSourceSchema)).asStructType(); | ||
| return new Schema(struct.fields()); | ||
| return new Schema(struct.fields(), refreshIdentifierFields(struct, schema)); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. since I wonder if we want to use
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would prefer to call |
||
| } | ||
|
|
||
| public static Type find(Schema schema, Predicate<Type> predicate) { | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,6 +21,8 @@ | |
| package org.apache.iceberg.types; | ||
|
|
||
| import org.apache.iceberg.Schema; | ||
| import org.apache.iceberg.relocated.com.google.common.collect.Lists; | ||
| import org.apache.iceberg.relocated.com.google.common.collect.Sets; | ||
| import org.junit.Assert; | ||
| import org.junit.Test; | ||
|
|
||
|
|
@@ -42,6 +44,25 @@ public void testReassignIdsDuplicateColumns() { | |
| Assert.assertEquals(sourceSchema.asStruct(), actualSchema.asStruct()); | ||
| } | ||
|
|
||
| @Test | ||
| public void testReassignIdsWithIdentifier() { | ||
| Schema schema = new Schema( | ||
| Lists.newArrayList( | ||
| required(0, "a", Types.IntegerType.get()), | ||
| required(1, "A", Types.IntegerType.get())), | ||
| Sets.newHashSet(0) | ||
| ); | ||
| Schema sourceSchema = new Schema( | ||
| Lists.newArrayList( | ||
| required(1, "a", Types.IntegerType.get()), | ||
| required(2, "A", Types.IntegerType.get())), | ||
| Sets.newHashSet(1) | ||
| ); | ||
| final Schema actualSchema = TypeUtil.reassignIds(schema, sourceSchema); | ||
| Assert.assertEquals(sourceSchema.asStruct(), actualSchema.asStruct()); | ||
| Assert.assertEquals(sourceSchema.identifierFieldIds(), actualSchema.identifierFieldIds()); | ||
| } | ||
|
|
||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we want to add a case to call |
||
| @Test(expected = IllegalArgumentException.class) | ||
| public void testReassignIdsIllegalArgumentException() { | ||
| Schema schema = new Schema( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except this
assignFreshIds, other methods that have the same mehold name should also refresh its identified field id list , right ? I also think we will need more unit tests to address this changes.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing (unrelated to this PR but I think it's important): when we reconstruct the
SchemainassignFreshIds, looks like we've ignored theMap<String, Integer> aliases, that not seems the correct behavior, right ? I mean we should use the existingaliasesfrom the old schema to construct the refreshed new schema .There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry forgot those methods, updated. So far I don't see a place those methods are called and need to rely on the alias, I will continue to look, if there is an exception I will put it in another PR.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, a
schemais composed byschemaId,fields,aliases,identifierFieldIds. We will need to maintain all those members when refresh or reassign field ids based on the old schema, by default we should pass the old schema's info to the fresh schema if people don't provide a necessary info to fill.Getting this work into a new PR looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think alias is a bit different here, I think alias is used mostly for integration purpose from converting a file schema to iceberg schema for easier lookup (i.e. as a form of helper method), and isn't part of the the iceberg schema itself and isn't written to table metadata; so I think they may not be strictly required when constructing new schemas from this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes practically speaking the alias is not used in any code path related, that was why I did not care about that when making the change. But from correctness perspective I agree with @openinx that if the alias exists, we should do the conversion just in case it is somehow used somewhere for that purpose in the future.
What I am trying out is to change the
AssignFreshIdsvisitor so that it can update the id along the way foraliasandidentifier. Will update the PR after completing the implementation.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@openinx so after reading the code a bit more, I think it does not make sense to convert alias in the methods. The reason is that, as the documentation suggests:
So this happens in methods such as
AvroSchemaUtil.toIceberg,ParquetSchemaUtil.convert, etc. However, these alias are never persisted in the actual table metadata. As a proof, theTypeUtil.assignFreshIdsis called for every table metadata replacement, but the alias is never passed in. So changing the method to pass in the alias is a change of behavior and we should not do that. So I think the current implementation should be good enough, I will add a few tests based on what you and Yan suggested.