-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Support non-optional union types for Avro #4242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
9ae16e6
3378015
8d7d50e
5d08255
1f90a4d
b072885
0b1700e
e23ae25
3e29473
287fc92
ad99bf8
52c3bd1
605735d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -52,8 +52,21 @@ public static <T> T visit(Schema schema, AvroSchemaVisitor<T> visitor) { | |
| case UNION: | ||
| List<Schema> types = schema.getTypes(); | ||
| List<T> options = Lists.newArrayListWithExpectedSize(types.size()); | ||
| for (Schema type : types) { | ||
| options.add(visit(type, visitor)); | ||
| if (AvroSchemaUtil.isOptionSchema(schema)) { | ||
| for (Schema type : types) { | ||
| options.add(visit(type, visitor)); | ||
| } | ||
| } else { | ||
| // complex union case | ||
| int nonNullIdx = 0; | ||
| for (Schema type : types) { | ||
| if (type.getType() != Schema.Type.NULL) { | ||
| options.add(visitWithName("field" + nonNullIdx, type, visitor)); | ||
| nonNullIdx += 1; | ||
| } else { | ||
| options.add(visit(type, visitor)); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not visit with the field name? |
||
| } | ||
| } | ||
| } | ||
| return visitor.union(schema, options); | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -79,11 +79,30 @@ private static <T> T visitRecord(Types.StructType struct, Schema record, AvroSch | |
| private static <T> T visitUnion(Type type, Schema union, AvroSchemaWithTypeVisitor<T> visitor) { | ||
| List<Schema> types = union.getTypes(); | ||
| List<T> options = Lists.newArrayListWithExpectedSize(types.size()); | ||
| for (Schema branch : types) { | ||
| if (branch.getType() == Schema.Type.NULL) { | ||
| options.add(visit((Type) null, branch, visitor)); | ||
| } else { | ||
| options.add(visit(type, branch, visitor)); | ||
|
|
||
| // simple union case | ||
| if (AvroSchemaUtil.isOptionSchema(union)) { | ||
| for (Schema branch : types) { | ||
| if (branch.getType() == Schema.Type.NULL) { | ||
| options.add(visit((Type) null, branch, visitor)); | ||
| } else { | ||
| options.add(visit(type, branch, visitor)); | ||
| } | ||
| } | ||
| } else { // complex union case | ||
| Preconditions.checkArgument(type instanceof Types.StructType, | ||
| "Cannot visit invalid Iceberg type: %s for Avro complex union type: %s", type, union); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this should also check whether the schema with type visitor has the Along the same lines, what happens if the struct is projected or out of order? I'd prefer to look up the struct field for each option in the union by field ID, just like we do with struct fields. For a struct field, we get the field ID from the Avro schema and use that to find the corresponding field in the Iceberg struct. If you end up using field IDs, I think the challenge is getting those field IDs in the Avro schema. I'm assuming that you're using If not, I think you'd want to align fields by using the field name from the Iceberg struct. For example,
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How about aligning by the type?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This PR is relevant too (still internal but will be brought upstream soon): linkedin/iceberg#108
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think that's what we will need to do at some point, but this visitor assumes that both schemas have field IDs. I think for this, the right way to handle it is to get the field ID from the union type. It would mean rewriting the Avro schema ahead of time to look like this: [
"null",
{"type": "int", "field-id": 34},
{"type": "string", "field-id": 35}
]That's why I'm wondering about how to attach the field IDs in the name mapping. In the name mapping, we could allow a nested level to represent the union. Names in that level could be types rather than names, so the mapping to produce the union above would be @wmoustafa, what do you think?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about the second option, where we expect them to be in the same order? (see this PR to support missing fields in the case of projection pruning)? This approach will also make sure that the deep types also match. That said, does it need to be retrofit into name mapping? I feel we could implement it without extending the name mapping as long as we have functions to deeply compare types. I think as a starting point, we could assume they must align in terms of order (but do not necessarily be the same, as we could skip some in the struct), and later we can implement deep type comparisons, which will relax the ordering expectation (which is also forward compatible with the type-based check).
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
We can fail gracefully. For example, if there are two records, we throw an exception in name mapping that multiple records aren't supported. If you think this is a common case, we can go further to find a solution for arbitrary nested types. That isn't too hard, actually. We could do it based on the child field set, which would have to match or at least have some overlap. I don't think that doing this by order is a good idea. That could easily lead to worse cases where we're returning the wrong data.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Is the concern because of an out-of-order schema (e.g., the reader schema or the expected schema)? @yiqiangin has tried both cases and an out of order reader schema throws an exception and an out of order expected schema still returns correct results (both after applying this patch to address missing fields/projection pruning, so we may need to take that into account).
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The problem is that there can be identical branches in a union and order-based resolution could do this incorrectly. The name mapping approach allows this to be entirely by ID here, and the name mapping could do deeper validation because it has child field IDs for all nested types. |
||
|
|
||
| List<Types.NestedField> fields = type.asStructType().fields(); | ||
| // start index from 1 because 0 is the tag field which doesn't exist in the original Avro schema | ||
| int index = 1; | ||
funcheetah marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| for (Schema branch : types) { | ||
| if (branch.getType() == Schema.Type.NULL) { | ||
| options.add(visit((Type) null, branch, visitor)); | ||
| } else { | ||
| options.add(visit(fields.get(index).type(), branch, visitor)); | ||
| index += 1; | ||
| } | ||
| } | ||
| } | ||
| return visitor.union(type, union, options); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,6 +19,7 @@ | |
|
|
||
| package org.apache.iceberg.avro; | ||
|
|
||
| import java.util.Iterator; | ||
| import java.util.List; | ||
| import java.util.Map; | ||
| import java.util.Objects; | ||
|
|
@@ -154,16 +155,48 @@ public Schema.Field field(Schema.Field field, Supplier<Schema> fieldResult) { | |
|
|
||
| @Override | ||
| public Schema union(Schema union, Iterable<Schema> options) { | ||
| Preconditions.checkState(AvroSchemaUtil.isOptionSchema(union), | ||
| "Invalid schema: non-option unions are not supported: %s", union); | ||
| Schema nonNullOriginal = AvroSchemaUtil.fromOption(union); | ||
| Schema nonNullResult = AvroSchemaUtil.fromOptions(Lists.newArrayList(options)); | ||
| if (AvroSchemaUtil.isOptionSchema(union)) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this be using the
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are you suggesting refactoring |
||
| Schema nonNullOriginal = AvroSchemaUtil.fromOption(union); | ||
| Schema nonNullResult = AvroSchemaUtil.fromOptions(Lists.newArrayList(options)); | ||
|
|
||
| if (!Objects.equals(nonNullOriginal, nonNullResult)) { | ||
| return AvroSchemaUtil.toOption(nonNullResult); | ||
| } | ||
| if (!Objects.equals(nonNullOriginal, nonNullResult)) { | ||
| return AvroSchemaUtil.toOption(nonNullResult); | ||
| } | ||
|
|
||
| return union; | ||
| } else { // Complex union | ||
| Preconditions.checkArgument(current instanceof Types.StructType, | ||
| "Incompatible projected type: %s for Avro complex union type: %s", current, union); | ||
|
|
||
| Types.StructType asStructType = current.asStructType(); | ||
|
|
||
| long nonNullBranchesCount = union.getTypes().stream() | ||
| .filter(branch -> branch.getType() != Schema.Type.NULL).count(); | ||
| Preconditions.checkState(asStructType.fields().size() > nonNullBranchesCount, | ||
| "Column projection on struct converted from Avro complex union type: %s is not supported", union); | ||
|
|
||
| Iterator<Schema> resultBranchIterator = options.iterator(); | ||
|
|
||
| // we start index from 1 because 0 is the tag field which doesn't exist in the original Avro | ||
| int index = 1; | ||
| List<Schema> resultBranches = Lists.newArrayListWithExpectedSize(union.getTypes().size()); | ||
|
|
||
| return union; | ||
| try { | ||
| for (Schema originalBranch : union.getTypes()) { | ||
| if (originalBranch.getType() == Schema.Type.NULL) { | ||
| resultBranches.add(resultBranchIterator.next()); | ||
| } else { | ||
| this.current = asStructType.fields().get(index).type(); | ||
| resultBranches.add(resultBranchIterator.next()); | ||
| index += 1; | ||
| } | ||
| } | ||
|
|
||
| return Schema.createUnion(resultBranches); | ||
| } finally { | ||
| this.current = asStructType; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| @Override | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.