Avro: Change union read schema from hive to trino #84
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,7 +25,7 @@ | |
| import org.junit.Test; | ||
|
|
||
|
|
||
| public class TestAvroComplexUnion { | ||
| public class TestUnionSchemaConversions { | ||
|
|
||
| @Test | ||
| public void testRequiredComplexUnion() { | ||
|
|
@@ -43,7 +43,8 @@ public void testRequiredComplexUnion() { | |
|
|
||
| org.apache.iceberg.Schema icebergSchema = AvroSchemaUtil.toIceberg(avroSchema); | ||
| String expectedIcebergSchema = "table {\n" + | ||
| " 0: unionCol: required struct<1: tag_0: optional int, 2: tag_1: optional string>\n" + "}"; | ||
| " 0: unionCol: required struct<1: tag: required int, 2: field0: optional int, 3: field1: optional string>\n" + | ||
| "}"; | ||
|
|
||
| Assert.assertEquals(expectedIcebergSchema, icebergSchema.toString()); | ||
| } | ||
|
|
@@ -65,32 +66,15 @@ public void testOptionalComplexUnion() { | |
| .endRecord(); | ||
|
|
||
| org.apache.iceberg.Schema icebergSchema = AvroSchemaUtil.toIceberg(avroSchema); | ||
| String expectedIcebergSchema = | ||
| "table {\n" + " 0: unionCol: optional struct<1: tag_0: optional int, 2: tag_1: optional string>\n" + "}"; | ||
|
|
||
| Assert.assertEquals(expectedIcebergSchema, icebergSchema.toString()); | ||
| } | ||
|
|
||
| @Test | ||
| public void testSingleComponentUnion() { | ||
| Schema avroSchema = SchemaBuilder.record("root") | ||
| .fields() | ||
| .name("unionCol") | ||
| .type() | ||
| .unionOf() | ||
| .intType() | ||
| .endUnion() | ||
| .noDefault() | ||
| .endRecord(); | ||
|
|
||
| org.apache.iceberg.Schema icebergSchema = AvroSchemaUtil.toIceberg(avroSchema); | ||
| String expectedIcebergSchema = "table {\n" + " 0: unionCol: required struct<1: tag_0: optional int>\n" + "}"; | ||
| String expectedIcebergSchema = "table {\n" + | ||
| " 0: unionCol: optional struct<1: tag: required int, 2: field0: optional int, 3: field1: optional string>\n" + | ||
| "}"; | ||
|
|
||
| Assert.assertEquals(expectedIcebergSchema, icebergSchema.toString()); | ||
| } | ||
|
|
||
| @Test | ||
| public void testOptionSchema() { | ||
| public void testSimpleUnionSchema() { | ||
| Schema avroSchema = SchemaBuilder.record("root") | ||
| .fields() | ||
| .name("optionCol") | ||
|
|
@@ -108,22 +92,4 @@ public void testOptionSchema() { | |
|
|
||
| Assert.assertEquals(expectedIcebergSchema, icebergSchema.toString()); | ||
| } | ||
|
|
||
|
Is this considered invalid at all? Or what's the reason to get rid of this case?
Member
Author
Yes, this schema shouldn't appear in the first place; users shouldn't define this kind of schema. |
||
| @Test | ||
| public void testNullUnionSchema() { | ||
| Schema avroSchema = SchemaBuilder.record("root") | ||
| .fields() | ||
| .name("nullUnionCol") | ||
| .type() | ||
| .unionOf() | ||
| .nullType() | ||
| .endUnion() | ||
| .noDefault() | ||
| .endRecord(); | ||
|
|
||
| org.apache.iceberg.Schema icebergSchema = AvroSchemaUtil.toIceberg(avroSchema); | ||
| String expectedIcebergSchema = "table {\n" + " 0: nullUnionCol: optional struct<>\n" + "}"; | ||
|
|
||
| Assert.assertEquals(expectedIcebergSchema, icebergSchema.toString()); | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -292,7 +292,7 @@ protected void set(InternalRow struct, int pos, Object value) { | |
| } | ||
| } | ||
|
|
||
| static class UnionReader implements ValueReader<InternalRow> { | ||
| private static class UnionReader implements ValueReader<InternalRow> { | ||
| private final Schema schema; | ||
| private final ValueReader[] readers; | ||
|
|
||
|
|
@@ -316,20 +316,30 @@ public InternalRow read(Decoder decoder, Object reuse) throws IOException { | |
| break; | ||
| } | ||
| } | ||
| InternalRow struct = new GenericInternalRow(nullIndex >= 0 ? alts.size() - 1 : alts.size()); | ||
|
|
||
| int index = decoder.readIndex(); | ||
| if (index == nullIndex) { | ||
| // if it is a null data, directly return null as the whole union result | ||
| return null; | ||
| } | ||
|
|
||
| // otherwise, we need to return an InternalRow as a struct data | ||
| InternalRow struct = new GenericInternalRow(nullIndex >= 0 ? alts.size() : alts.size() + 1); | ||
| for (int i = 0; i < struct.numFields(); i += 1) { | ||
| struct.setNullAt(i); | ||
| } | ||
|
|
||
| int index = decoder.readIndex(); | ||
| Object value = this.readers[index].read(decoder, reuse); | ||
| Object value = readers[index].read(decoder, reuse); | ||
|
|
||
| if (nullIndex < 0) { | ||
| struct.update(index, value); | ||
| struct.update(index + 1, value); | ||
| struct.setInt(0, index); | ||
| } else if (index < nullIndex) { | ||
| struct.update(index + 1, value); | ||
| struct.setInt(0, index); | ||
| } else { | ||
|
Not sure I follow this. What does the relative position of the value index and the null index have to do with how we assign the value in the struct?
Member
Author
Because in Avro the nullability of a union type is represented by the existence of a
I guess what I didn't know earlier was that if the NULL type appears in the union, it seems it has to be in the first position in
Will this lead to a silent failure if you don't check this first?
Member
Author
I think that's just a recommendation from the Avro spec, not a mandate. In fact, in our ecosystem there are many Avro schemas which failed to put
And that's why I'm specifically computing the null index here and branching on it. |
||
| struct.update(index, value); | ||
| } else if (index > nullIndex) { | ||
| struct.update(index - 1, value); | ||
| struct.setInt(0, index - 1); | ||
| } | ||
|
|
||
| return struct; | ||
|
|
||
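The null-index branching discussed in the thread above can be sketched in isolation. This is a minimal illustration, not the PR's code: `UnionLayout`, `structSize`, `tagFor`, and `toStruct` are hypothetical names, and a plain `Object[]` stands in for Spark's `InternalRow`. It assumes the layout the updated tests expect, `struct<tag, field0, field1, ...>`, where the tag counts only non-null branches.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the PR: it mirrors only the index
// arithmetic of the updated UnionReader branches.
class UnionLayout {
  /** Struct width: one tag field plus one field per non-null branch. */
  static int structSize(int branchCount, int nullIndex) {
    return nullIndex >= 0 ? branchCount : branchCount + 1;
  }

  /** Tag stored at struct position 0, or -1 when the null branch was read. */
  static int tagFor(int branchIndex, int nullIndex) {
    if (nullIndex < 0 || branchIndex < nullIndex) {
      return branchIndex; // no null branch before this one, nothing to skip
    }
    if (branchIndex == nullIndex) {
      return -1; // the reader returns null for the whole union here
    }
    return branchIndex - 1; // skip over the null branch
  }

  /** Builds [tag, field0, field1, ...] as an Object[], or null for a null value. */
  static Object[] toStruct(int branchCount, int nullIndex, int branchIndex, Object value) {
    int tag = tagFor(branchIndex, nullIndex);
    if (tag < 0) {
      return null;
    }
    Object[] struct = new Object[structSize(branchCount, nullIndex)];
    struct[0] = tag;         // position 0 always holds the tag
    struct[tag + 1] = value; // value fields start at position 1
    return struct;
  }

  public static void main(String[] args) {
    // Union ["int", "null", "string"]: the null branch sits at index 1,
    // so the string branch (index 2) gets tag 1 and lands at position 2.
    System.out.println(Arrays.toString(toStruct(3, 1, 2, "abc"))); // prints [1, null, abc]
  }
}
```

Because the null branch may sit at any position in the union, the tag is derived from the branch index relative to `nullIndex` rather than assuming null comes first.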
Is there an official definition of what is complex and what is simple? If we aim to contribute this back, it may be better to clarify that through Javadoc. If this is merely for internal use, I am fine with what we have now.
A simple union is `[sometype, null]`, while a complex union is `[sometype1, sometype2, ...]`, where there are at least 2 non-null types in the union. I think we can probably add a Javadoc in `AvroSchemaUtil.isOptionSchema`, which is an existing method in upstream Iceberg.
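The distinction drawn in this reply can be written down as a small classifier. This is a sketch under stated assumptions, not the behavior of `AvroSchemaUtil.isOptionSchema`: `UnionKinds`, `Kind`, and `classify` are hypothetical names, branch types are modeled as plain strings rather than `org.apache.avro.Schema` objects, and the null branch is accepted in either position, following the earlier thread.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical classifier for illustration only; branch types are plain
// strings here, not org.apache.avro.Schema instances.
class UnionKinds {
  enum Kind { SIMPLE, COMPLEX, OTHER }

  static Kind classify(List<String> branches) {
    long nonNull = branches.stream().filter(t -> !"null".equals(t)).count();
    boolean hasNull = nonNull < branches.size();

    if (nonNull >= 2) {
      return Kind.COMPLEX; // at least two non-null types
    }
    if (nonNull == 1 && hasNull && branches.size() == 2) {
      return Kind.SIMPLE; // exactly one type plus null, in either order
    }
    return Kind.OTHER; // e.g. ["int"] or ["null"], which the PR stops testing
  }

  public static void main(String[] args) {
    System.out.println(classify(Arrays.asList("int", "null")));           // SIMPLE
    System.out.println(classify(Arrays.asList("null", "int", "string"))); // COMPLEX
    System.out.println(classify(Arrays.asList("int")));                   // OTHER
  }
}
```

The `OTHER` bucket covers the single-branch unions whose test cases this PR removes; how the real conversion handles them is not shown in the diff.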