apache#189 Add test using union record. Reference: use a conversion similar to spark-avro, which converts a union type of size > 2 to a struct type.
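For context, here is a rough, standalone sketch of the naming scheme implied by the spark-avro-style conversion. Plain strings stand in for Avro Schema objects; skipping the null branch and the memberN field names are assumptions based on the review discussion below, not code from this PR:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a union with more than two branches becomes a struct
// with one field per non-null branch, named positionally (member1, member2, ...).
public class UnionToStructSketch {
  static Map<String, String> unionToStructFields(List<String> unionBranches) {
    Map<String, String> fields = new LinkedHashMap<>();
    int position = 1;
    for (String branch : unionBranches) {
      if ("null".equals(branch)) {
        // the null branch carries no data; it only marks the struct's fields as optional
        continue;
      }
      fields.put("member" + position, branch);
      position++;
    }
    return fields;
  }

  public static void main(String[] args) {
    // prints {member1=int, member2=string, member3=long}
    System.out.println(unionToStructFields(List.of("null", "int", "string", "long")));
  }
}
```

Note that the generated names depend only on branch position, which is what the schema-evolution discussion later in this thread turns on.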
core/src/test/java/org/apache/iceberg/avro/AvroDataUnionRecordTest.java
/**
 * @return true if struct represents union schema
 */
public boolean isUnionSchema() {
Why do we need this in the main API? The API should not expose unions or information about unions, IMO.
I will check if we can make it package private; the Avro visitor needs it so that it can add a property to the schema to reflect the union type.
I agree. Iceberg's type system should have no knowledge of union types because unions aren't supported.
I think this is a naming issue. This flag indicates that the field stores a union type by converting it to a record/struct.
Thanks @sudssf, there's some similar work done on this feature by @shardulm94. @shardulm94 please review and see if there's a possibility of collaboration here.
  }
}

public static class UnionSchemaWriter<V extends Object> implements ValueWriter<V> {
Do we need to support writing out union types? Iceberg's type system does not support unions, and almost all compute engines do not support them either. I'm not sure there's a use case here to support it.
On the read side it makes sense to support reading unions, because that data could have been written by non-Iceberg writers.
What are your thoughts?
Correct. Since Iceberg does not support the union type but Avro does, this approach is similar to spark-avro, where union types are converted to records.
I don't think it makes sense to support union types natively, as that would mean adding support for them in the schema, which is a larger change with a bigger blast radius.
Spark does not support union types either, so it makes sense to align with Spark and keep the scope limited, IMO.
Files written by Iceberg must always conform to the Iceberg spec. Iceberg tables can allow reading data that is imported, but should never write data that cannot be read by a generic implementation of the Iceberg spec.
@shardulm94 can you please share the PR?
public static Schema convert(
    org.apache.iceberg.Schema schema,
    String tableName) {
Minor: please revert non-functional changes like this one. That makes it easier to review and less likely to conflict with other commits.
Ack, I did not do it intentionally; the formatting style from .baseline applied these changes.
List<Types.NestedField> fields = Lists.newArrayListWithExpectedSize(options.size());
for (int i = 0; i < options.size(); i += 1) {
  Type fieldType = options.get(i);
  if (fieldType == null) {
Why would fieldType be null?
For the Avro type null, fieldType is null. This basically ignores the null branch of the union.
I think I see. The reason is that all branches of the union are optional and there is no way to encode whether one branch (let alone only one branch) will be non-null?
} else if (
    logical instanceof LogicalTypes.TimeMillis ||
    logical instanceof LogicalTypes.TimeMicros) {
Minor: this file also has lots of formatting changes that should be reverted before committing.
Ack, I did not make these changes intentionally; the formatting style from the repo did. I suggest the same in our internal repo :) I will revert these changes.
recordSchema = Schema.createRecord(recordName, null, null, false, fields);

if (struct.isUnionSchema()) {
Is there some reason why this needs to be a round-trip conversion? I think it should be fine to convert back to a record instead of a union schema. That's what we should be writing into a table with a schema converted from an Avro union schema anyway.
This is needed when the Avro writer is a Java client writing directly by consuming Avro records.
For Spark, spark-avro converts it into a record, so no changes are needed there. I will add a test for Spark.
The only way for GenericAvroWriter to know that a record represents a union schema type is the property set here. Otherwise the ValueWriters will fail to write a union-schema record, because they will expect the new record schema instead of the union record schema.
switch (key) {
  case "boolean":
    return rec
        .set("member1", map.get(key))
Is it possible to use "option_1" instead of "member1"? Avro refers to these as union options.
    });
}

private GenericData.Record getGenericRecordForUnionType(
Do you know what types of union evolution are allowed by Avro but not supported here?
If we convert a schema with a union and create a table from it, then for the union's struct we will generate a name mapping to read existing data files, like {"id" 6, "member1"}, {"id" 7, "member2"}, and so on. If the union drops member2, this assignment becomes invalid because member3 will now convert to use the name member2.
If we are going to support reading unions, it would be good to note that you can't remove members, they must have a stable order, and members can only be added to the end. Anything else that I'm missing?
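To make the hazard concrete, here is a rough sketch of what such a name mapping could look like, written in the shape of Iceberg's name-mapping JSON. The field IDs and the `union_field` name are illustrative, not taken from this PR:

```json
[
  {"field-id": 1, "names": ["id"]},
  {"field-id": 5, "names": ["union_field"], "fields": [
    {"field-id": 6, "names": ["member1"]},
    {"field-id": 7, "names": ["member2"]}
  ]}
]
```

If the second union branch is removed, the positional counter would rename the third branch to member2, and name-based mapping would silently resolve it to field ID 7.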
Since the schema gets stored with the Avro file, I don't think ordering matters; the reader will always reassignIds from the writer schema. The member names are hardcoded for the test; for the union type it should work. I will check if I can add a test for schema evolution.
The order does matter. If the second option in the union is always named "member2" then what happens if the actual type of member2 is different between two file schemas?
The file schema should match the original schema version; all rules for struct datatype evolution still apply. GenericAvroReader will always resolve the file schema against the reader schema:

fileSchema <- schema used by the Avro writer when the Avro file was created
readSchema <- schema of the Iceberg table or projection, which may reflect new fields after schema evolution

return DecoderFactory.get().resolvingDecoder(fileSchema, readSchema, null);

My understanding is that reassignIds should take care of schema evolution. Maybe I am wrong?
Data files with unions must be written by non-Iceberg producers because Iceberg will not write invalid data files. A union schema cannot contain Iceberg field IDs because there is no defined way to encode them -- union has no extra properties and while types could be replaced with a JSON object with properties, there isn't anything that pulls IDs out.
For external data files, what would happen is the user would convert a schema and create a table from it, at which point the IDs get assigned. Next, the user would create a name mapping with that schema's names to Iceberg field IDs. That mapping is how we recover IDs from field names. That works because Avro is name-based so we can map its schema evolution into Iceberg.
But name-based doesn't work with unions that are converted to structs because we are automatically assigning the names using a counter. If the definition of member2 changes, we will still map it to the same Iceberg ID by name.
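The renaming problem described above can be illustrated with a small standalone sketch. Plain strings stand in for Avro branch schemas; the memberN naming is the convention discussed in this PR, and everything else is hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: shows why counter-based member names break name mapping
// when a union branch is dropped.
public class UnionEvolutionHazard {
  static Map<String, String> nameByPosition(List<String> branches) {
    Map<String, String> names = new LinkedHashMap<>();
    for (int i = 0; i < branches.size(); i++) {
      names.put("member" + (i + 1), branches.get(i));
    }
    return names;
  }

  public static void main(String[] args) {
    // Original union [int, string, long] -> member1..member3
    Map<String, String> v1 = nameByPosition(List.of("int", "string", "long"));
    // After the "string" branch is dropped, "long" shifts into the name member2
    Map<String, String> v2 = nameByPosition(List.of("int", "long"));
    System.out.println(v1.get("member2")); // string
    System.out.println(v2.get("member2")); // long
  }
}
```

Here v1 maps member2 to the string branch, but after dropping that branch, v2 maps member2 to long, so a name mapping keyed on "member2" would resolve the same Iceberg field ID to a different type.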
"Data files with unions must be written by non-Iceberg producers": the use case here is that we are writing to an Iceberg table using an Iceberg client (Avro writer) and the input stream has a union schema. The current alternative is to transform each incoming generic record on the fly to a record schema and then write it; with this approach no such transformation is needed.
Since this code path is totally new, none of the existing functionality will be changed.
I think I understand the issue with counter-based naming. A simple solution is to support evolution only by adding fields to the end of the supported types. Let me add a test case to validate the behaviour.
What do you mean the input stream has a union schema? The records that you're trying to write?
Incoming records are required to match the write schema. Because the write schema cannot contain a union, the records also cannot contain a union. The union must be represented in memory as a struct of the union types.
"The union must be represented in memory as a struct of the union types."
Since the generic records generated by the producer use the union schema, we currently transform each record to the struct-of-union schema in the stream processing client that writes to the Iceberg table.
That transformation is expensive at high volume, as it needs to generate a new generic record with the Iceberg table schema and copy all fields from the incoming generic record except the union schema field.
With this PR (a proof of concept), GenericAvroWriter adds support so that no transformation is needed.
Representing the union in memory as a struct of the union types is not possible unless Avro deserialization supports it (we are using Kafka for the stream, and the Kafka consumer gets a generic record which gets ingested into the Iceberg table).
If you want to deserialize directly to that representation, you can use the same readers to read a single encoded message. Here's the Iceberg generics decoder, which is similar but uses Iceberg generics: https://github.com/apache/incubator-iceberg/blob/master/data/src/main/java/org/apache/iceberg/data/avro/IcebergDecoder.java
I don't think it will work without some of the above changes, since the call to AvroSchemaUtil.convert(writeSchema, "table") will fail on a union schema, but I will take a look; thanks for the link. I will try to document the problem statement and the motivation behind proposing this change.
null should always be first in a union schema (per the Avro spec, a union field's default value must match the union's first branch, so nullable unions conventionally list null first)
…uct conversion. this should not impact existing code path.
public static StructType of(List<NestedField> fields, boolean convertedFromUnionSchema) {
  return new StructType(fields, convertedFromUnionSchema);
I still don't think that Iceberg needs to model that this was converted. The conversion back to a union schema is not something that Iceberg should support. Iceberg must only write Iceberg data files and those are not allowed to contain Avro unions. The use case that we aim to support is reading data files with existing unions, if I understand correctly.
There is no other way to pass this information to GenericAvroWriter; it needs to know that the input generic record has a union schema so that the ValueWriters can be created accordingly, otherwise we get a cast exception. This avoids creating a new generic object with the record schema (which is really expensive, IMO). What alternative is possible if we don't want to add convertedFromUnionSchema?
There was a problem hiding this comment.
there is no other way to pass this information for GenericAvroWriter
I understand that there is no other way to pass the information. What I'm saying is that GenericAvroWriter cannot write non-option unions anyway. Iceberg must not write anything that is not allowed by the spec. This feature can be used to read existing Avro data files, but cannot be used to write Avro data files with unions.
+1 We shouldn't support writing unions.
"but cannot be used to write Avro data files with unions": this feature converts the union schema to a record, similar to spark-avro, and allows Iceberg to ingest the data even if there is a union schema. There are open questions about schema evolution that need to be addressed.
Why we need this: Avro supports union schemas, and writing such a schema to Iceberg currently fails because it is not supported. With this feature we would support writing, and the transformation would be taken care of by the Iceberg client instead of being performed externally.
Reading data written with a union schema is okay, but Iceberg cannot write data files with a union schema (other than options).
Hello, can anyone point me to the discussion/reasoning for not supporting the union type?
Thanks @rdsr for the tag here; however, I feel the motivation of this PR is a bit different from mine, which was to allow reading data files written by non-Iceberg writers that may contain union types. I have been thinking about this on and off for a while now and have tried to jot down my thoughts in https://docs.google.com/document/d/1Go2NrOoeCKfrDJw8MAZnsMYbE2KZbso1zkSY8cpLzaQ/edit#heading=h.gd2qof50gbzs. Since the motivation of this PR seems different, though, we may want to continue this conversation in #189.
@teabot, I think the main problem with unions is the support in processing engines. It is unlikely that Spark, Presto, Impala, etc. will add support for unions. Since Iceberg is a format that we want to be suitable for a common at-rest store, it doesn't make sense to have a type that requires work-arounds in those engines but has only a small benefit (ensuring only one option is non-null).

I think it is unlikely that processing engines will support unions because it isn't clear how users would interact with them in SQL. For example, how do I filter to just records with a particular option of the union? That might seem easy, but it exposes underlying problems with unions and schema evolution, like identifying union fields. If we generate names based on position, what happens when that position changes? If we do it based on ID, then we're exposing internal IDs to users. Also, what if a file is written with a version of the schema that has a new union option that isn't in the table schema? Do we choose another incorrect branch (null or default) or do we throw an exception?

I think it is the right choice to continue using the more standard and well-defined types rather than adding union, since it would make it much harder to integrate Iceberg into processing engines.
Thanks for the comprehensive reply @rdblue. The SQL experience argument is a compelling one, and I agree that if we look at formats that support union, we find that support for this type is often (and perhaps understandably) overlooked or avoided within integrations, and that this would probably be the case with Iceberg also. In our platforms we've tended to 'prohibit' producers from using unions because they effectively end up creating unreadable batch datasets downstream. The choice that Iceberg has made seems a sensible one.

That said, we're now seeing use cases in the stream domain that, in my opinion, are better modelled with union (multi-type streams). In practice these are currently modelled using a loose set of independent schemas that have no complete explicit contract. I don't like this because it pushes elements of the producer/consumer schema contract outside of Avro and into an implicit convention that cannot enjoy compatibility checking, for example.

I digress; what I hope to distill from this kind of discourse are good patterns for bridging data in streams and at rest for these use cases. But I admit that this is not a problem that Iceberg should solve. Apologies for the distraction and thanks for your time.
Thanks for the context, @teabot. For our streaming systems, we aren't supporting the case you're talking about; each stream (Kafka topic) gets its own Iceberg schema so that schema evolution works mostly the same way for a topic as it does for a table. That aligns well with streaming SQL use cases, which we hope will make it easier to use streaming systems.

It doesn't help the problem of event ordering across event types, though, which are required to be in separate streams unless they use the union-as-struct convention. I think the trade-off is worth it. In most cases you don't need strict order guarantees across event types. Where you do, you can either join two streams with a common ordinal field or make the schema slightly more complex using the union-as-struct approach if you need to avoid the join.
Since there hasn't been any further progress on this, I'm going to close it. Feel free to reopen it if you'd like to continue working on a read path for existing data files with union types. I don't think it is likely that we would make changes that allow writing data with unions, though.