Column projection of union type by yiqiangin · Pull Request #108 · linkedin/iceberg

yiqiangin · 2022-05-16T18:25:18Z

Problem
Column projection currently does not work for union types. The root cause is as follows:

When the Avro union reader is created, the current code assumes that the types inside a union in the Avro schema match the fields of the corresponding union struct in the Iceberg schema.
With column projection on a union, the current code prunes only the union struct in the Iceberg schema down to the projected fields, while the union in the Avro schema still contains all the types. This causes a mismatch between the Avro schema and the Iceberg schema for the union.
However, Avro's decoding procedure requires the content of every type in a union to be read correctly by an Avro reader, whether or not that type is projected. All the types in the Avro union are therefore needed to create the corresponding type readers in AvroUnionReader, even with column projection, so the union in the Avro schema cannot be pruned the way the union struct in the Iceberg schema is.
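
To illustrate why every branch still needs a reader, here is a minimal, hand-rolled sketch of decoding a union [int, string] in Avro's binary encoding, where each value is written as a zigzag varint branch index followed by the value itself. It does not use the Avro library or any code from this PR; the class and method names (UnionDecodingSketch, readProjectedInt) are hypothetical. Even when only the int branch is projected, the string bytes must be consumed so that the next value starts at the right offset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Avro or Iceberg code.
class UnionDecodingSketch {
  // Avro encodes int and long values as zigzag variable-length integers.
  static void writeLong(ByteArrayOutputStream out, long n) {
    long z = (n << 1) ^ (n >> 63);
    while ((z & ~0x7FL) != 0) {
      out.write((int) ((z & 0x7F) | 0x80));
      z >>>= 7;
    }
    out.write((int) z);
  }

  static long readLong(ByteArrayInputStream in) {
    long z = 0;
    int shift = 0;
    int b;
    do {
      b = in.read();
      z |= (long) (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return (z >>> 1) ^ -(z & 1);
  }

  // Strings are a length followed by that many UTF-8 bytes.
  static void writeString(ByteArrayOutputStream out, String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    writeLong(out, bytes.length);
    out.write(bytes, 0, bytes.length);
  }

  static String readString(ByteArrayInputStream in) {
    int len = (int) readLong(in);
    byte[] bytes = new byte[len];
    in.read(bytes, 0, len);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Reads one value of the union [int, string] when only branch 0 is projected.
  // Returns the int, or null when the value belongs to the unprojected branch.
  static Integer readProjectedInt(ByteArrayInputStream in) {
    long branch = readLong(in);
    if (branch == 0) {
      return (int) readLong(in);
    }
    readString(in); // still has to be decoded, otherwise the stream is misaligned
    return null;
  }

  public static void main(String[] args) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeLong(out, 1);          // branch 1: string
    writeString(out, "hello");
    writeLong(out, 0);          // branch 0: int
    writeLong(out, 42);
    ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
    System.out.println(readProjectedInt(in));
    System.out.println(readProjectedInt(in));
  }
}
```

If readString were skipped for the unprojected branch, the second call would start reading in the middle of the string bytes, which is the reason the Avro-side union cannot be pruned.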

Solution
Assume there are N types in a union. The struct corresponding to the union in the Iceberg schema then has N+1 fields, including the "tag" field. A query can project any K of these fields (1 <= K <= N+1, where the tag field counts as one of the K if it is projected). Reading without column projection is equivalent to projecting all fields, i.e. K = N+1, so the solution does not treat the cases with and without column projection differently.
In addition, the position of each type in the union can be identified from its field name in the Iceberg schema, such as "field0", "field1", and so on. The numeric suffix is the index used to match the field to the type at the same position in the union of the Avro schema.
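
A minimal sketch of recovering the branch index from a field name, assuming the names follow the "tag" / "fieldN" convention described above. The class and method names (UnionFieldIndexSketch, branchIndex) are hypothetical and do not come from the PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not code from the PR.
class UnionFieldIndexSketch {
  // Assumes union struct fields are named "field" followed by a decimal index.
  private static final Pattern FIELD_NAME = Pattern.compile("field(\\d+)");

  // Returns the index of the Avro union branch that the field refers to,
  // or -1 for the "tag" field and any other name that does not match.
  static int branchIndex(String fieldName) {
    Matcher m = FIELD_NAME.matcher(fieldName);
    return m.matches() ? Integer.parseInt(m.group(1)) : -1;
  }

  public static void main(String[] args) {
    for (String name : new String[] {"tag", "field0", "field2"}) {
      System.out.println(name + " -> " + branchIndex(name));
    }
  }
}
```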

In the code that creates the readers for all types in the union of the Avro schema (namely AvroSchemaWithTypeVisitor.visitUnion), the fields of the struct corresponding to the union in the Iceberg schema are checked to build a map from order index to field type in the Iceberg schema. While iterating through the types in the Avro schema, the order index is used to check whether the corresponding type exists in the map. If it does, the field is projected and the reader for that option is created with the type from the Iceberg schema; otherwise, the option is created with a null type.
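
The pairing step can be sketched as follows. This is a simplified illustration, not the actual AvroSchemaWithTypeVisitor.visitUnion code: types are represented as plain strings, and the names VisitUnionSketch and typesForBranches are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not code from the PR.
class VisitUnionSketch {
  // avroBranches: every type in the Avro union, in file order (never pruned).
  // icebergFields: the projected struct fields, field name -> type name.
  // Returns one entry per Avro branch: the Iceberg type if the branch is
  // projected, or null if it is not.
  static List<String> typesForBranches(List<String> avroBranches,
                                       Map<String, String> icebergFields) {
    // Build the map from order index to Iceberg type using the "fieldN" names.
    Map<Integer, String> typeByIndex = new HashMap<>();
    for (Map.Entry<String, String> field : icebergFields.entrySet()) {
      String name = field.getKey();
      if (name.matches("field\\d+")) {
        typeByIndex.put(Integer.parseInt(name.substring("field".length())), field.getValue());
      }
    }
    // Every Avro branch gets an entry so that a reader can still be created for it.
    List<String> result = new ArrayList<>();
    for (int i = 0; i < avroBranches.size(); i++) {
      result.add(typeByIndex.get(i));
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> projected = new LinkedHashMap<>();
    projected.put("tag", "int");
    projected.put("field1", "string");
    List<String> branches = new ArrayList<>();
    branches.add("int");
    branches.add("string");
    branches.add("double");
    System.out.println(typesForBranches(branches, projected));
  }
}
```

The result always has as many entries as the Avro union has types, which keeps the reader list aligned with the branch index stored in the file.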

The Iceberg schema needs to be passed into AvroUnionReader. The fields of the returned row should be constructed from the fields in the Iceberg schema, not from the types in the Avro schema. If the tag field is projected, one more field is added at the beginning of the row and set to the index of the field in the Avro file.
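
A rough sketch of how the returned row could be laid out, assuming the row has one slot per projected Iceberg field and unprojected or non-matching branches are left null. This is an illustration under those assumptions, not the AvroUnionReader implementation; UnionRowSketch and buildRow are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not code from the PR.
class UnionRowSketch {
  // projectedFields: names of the projected struct fields in Iceberg schema order.
  // branchIndex: index of the union branch that was actually read from the file.
  // value: the decoded value of that branch.
  static Object[] buildRow(List<String> projectedFields, int branchIndex, Object value) {
    Object[] row = new Object[projectedFields.size()];
    for (int pos = 0; pos < projectedFields.size(); pos++) {
      String name = projectedFields.get(pos);
      if (name.equals("tag")) {
        row[pos] = branchIndex;            // tag holds the index of the branch in the Avro file
      } else if (name.equals("field" + branchIndex)) {
        row[pos] = value;                  // the branch that was read is projected
      }                                    // every other slot stays null
    }
    return row;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(buildRow(Arrays.asList("tag", "field1"), 1, "x")));
    System.out.println(Arrays.toString(buildRow(Arrays.asList("field0"), 1, "x")));
  }
}
```

Sizing the row from the Iceberg fields rather than the Avro types is what lets a projection such as c1.field0 return a single-column struct even though the file's union has more branches.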

Test
All the test cases in TestSparkAvroUnions.java, plus a new test case, writeAndValidateRequiredComplexUnionWithProjection.

Manual test with Spark 3 using all of the following queries on a table with a union column:

val df = spark.sql("select c1.field0 from u_yiqding.avro_union_table_test")

val df = spark.sql("select c1.field0,c1.field1 from u_yiqding.avro_union_table_test")

val df = spark.sql("select c1.tag,c1.field0 from u_yiqding.avro_union_table_test")

val df = spark.sql("select c1.tag,c1.field0,c1.field1 from u_yiqding.avro_union_table_test")

val df = spark.sql("select c1.field1,c1.field0,c1.tag from u_yiqding.avro_union_table_test")

val df = spark.sql("select c1.field1,c1.field0 from u_yiqding.avro_union_table_test")

wmoustafa · 2022-05-17T23:43:39Z

Thank you so much for the PR and comprehensive description!