PoC of schema pruning by the mapping from Avro field name to Iceberg field id #1
Closed
yiqiangin wants to merge 5 commits into funcheetah:master from yiqiangin:master
added 4 commits
August 30, 2022 11:06
This PR implements support for schema pruning within a complex union, one of the TODO items listed in PR apache#4242.
In Iceberg, a complex union is represented by a struct with multiple fields. Without schema pruning, the number of fields equals the number of types in the union plus one (for the tag field). When a query projects columns, the Iceberg union schema is pruned and the struct contains only the fields selected by the projection.
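The struct layout described above can be illustrated with a minimal sketch. It uses plain Java collections rather than Iceberg types, and the class name, the method names, and the `field0`, `field1`, ... naming of the branch fields are assumptions made for illustration only; only the existence of a tag field comes from the description.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; this is not Iceberg code.
class UnionStructSketch {

  // Builds a field-name -> type-name view of the struct that stands in for a union:
  // one "tag" field followed by one field per union type (field names are assumed).
  static Map<String, String> toStruct(List<String> unionTypes) {
    Map<String, String> struct = new LinkedHashMap<>();
    struct.put("tag", "int");
    for (int i = 0; i < unionTypes.size(); i++) {
      struct.put("field" + i, unionTypes.get(i));
    }
    return struct;
  }

  // Keeps only the fields selected by the projection, preserving their order.
  static Map<String, String> prune(Map<String, String> struct, Set<String> projected) {
    Map<String, String> pruned = new LinkedHashMap<>();
    for (Map.Entry<String, String> field : struct.entrySet()) {
      if (projected.contains(field.getKey())) {
        pruned.put(field.getKey(), field.getValue());
      }
    }
    return pruned;
  }

  public static void main(String[] args) {
    Map<String, String> full = toStruct(List.of("int", "string", "double"));
    System.out.println("full:   " + full);
    System.out.println("pruned: " + prune(full, Set.of("tag", "field1")));
  }
}
```

A union of three types yields four struct fields here, and projecting `tag` and `field1` leaves two of them, while the Avro side would still carry all three branches.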
In contrast, the union schema in the Avro schema is not pruned under column projection, because the full union schema is needed to read the data from the Avro file successfully.
In addition, the readers for the union data are created from the type schemas of both the Avro schema and the Iceberg schema. The main problem to solve is therefore how to correlate a type in the Avro schema with a type in the Iceberg schema, especially when column projection leaves only some of the types in the Iceberg schema.
The main idea of the solution:
Build a mapping from the type name in the Avro schema to the id of the corresponding field in the Iceberg schema.
When value readers are created, look up the name of the Avro type in the mapping to get the field id, and use that id to find the corresponding field in the Iceberg schema.
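The two steps above can be sketched as follows. This is a simplified illustration with plain Java maps, not the PR's code: `NameToIdSketch`, `deriveMapping`, and `findIcebergType` are hypothetical names, type names stand in for real schema nodes, and assigning consecutive ids is an assumption made to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; names and id assignment are assumptions, not the PR's code.
class NameToIdSketch {

  // Step 1: map each Avro type name to an Iceberg field id (consecutive ids assumed).
  static Map<String, Integer> deriveMapping(List<String> avroTypeNames, int firstFieldId) {
    Map<String, Integer> nameToId = new LinkedHashMap<>();
    int nextId = firstFieldId;
    for (String name : avroTypeNames) {
      nameToId.put(name, nextId++);
    }
    return nameToId;
  }

  // Step 2: resolve an Avro type name to the Iceberg field type through the id.
  // Returns empty when the field was pruned from the projected Iceberg schema.
  static Optional<String> findIcebergType(
      String avroTypeName,
      Map<String, Integer> nameToId,
      Map<Integer, String> projectedTypesById) {
    Integer id = nameToId.get(avroTypeName);
    if (id == null) {
      return Optional.empty();
    }
    return Optional.ofNullable(projectedTypesById.get(id));
  }

  public static void main(String[] args) {
    Map<String, Integer> nameToId = deriveMapping(List.of("int", "string", "double"), 2);
    Map<Integer, String> projected = Map.of(3, "string"); // only the string field is projected
    System.out.println(findIcebergType("string", nameToId, projected).orElse("<pruned>"));
    System.out.println(findIcebergType("int", nameToId, projected).orElse("<pruned>"));
  }
}
```

Keying the lookup by field id rather than by position means the result does not depend on how many fields the projection removed.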
The mapping from the field name in the Avro schema to the field id in the Iceberg schema is derived during the conversion from Avro schema to Iceberg schema, in the function AvroSchemaUtil.convertToDeriveNameMapping and the class SchemaToType.
The mapping for the direct child fields of an Avro schema field is stored in that field as a property named AvroFieldNameToIcebergId, so it can be accessed easily when the Avro schema is traversed to generate the corresponding readers for the Avro data file.
For a union, the key of the mapping is the name of the branch in the union.
For a complex union, AvroSchemaWithTypeVisitor.visitUnion() first gets the mapping from the property of the Avro schema, then looks up the field id in the Iceberg schema by the type name in the Avro schema, and finally uses the field id to get the field type in the Iceberg schema:
If the corresponding field exists in the Iceberg schema, it is used together with the Avro schema node to create the reader.
If no field with the given id exists in the Iceberg schema (meaning the field is not projected), a pseudo branch type is created from the corresponding Avro schema node to facilitate the creation of the reader.
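The branch resolution described above might look roughly like the sketch below. It does not reproduce AvroSchemaWithTypeVisitor.visitUnion(); `VisitUnionSketch`, `BranchPlan`, and `planBranches` are hypothetical names, strings stand in for schema and type nodes, and reusing the Avro type name as the pseudo branch type is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-branch resolution; not the PR's visitor code.
class VisitUnionSketch {

  // What a reader for one union branch would be built from.
  static final class BranchPlan {
    final String avroName;   // name of the branch in the Avro union
    final String readType;   // Iceberg type if projected, otherwise a pseudo type
    final boolean projected; // whether the branch exists in the Iceberg schema

    BranchPlan(String avroName, String readType, boolean projected) {
      this.avroName = avroName;
      this.readType = readType;
      this.projected = projected;
    }
  }

  // Every Avro branch gets a plan, because the Avro union is never pruned.
  static List<BranchPlan> planBranches(
      List<String> avroBranches,
      Map<String, Integer> nameToId,
      Map<Integer, String> projectedTypesById) {
    List<BranchPlan> plans = new ArrayList<>();
    for (String branch : avroBranches) {
      Integer id = nameToId.get(branch);
      String icebergType = (id == null) ? null : projectedTypesById.get(id);
      if (icebergType != null) {
        // The field is projected: pair the Avro node with the Iceberg type.
        plans.add(new BranchPlan(branch, icebergType, true));
      } else {
        // Not projected: derive a pseudo type from the Avro branch (assumed form).
        plans.add(new BranchPlan(branch, branch, false));
      }
    }
    return plans;
  }

  public static void main(String[] args) {
    List<BranchPlan> plans = planBranches(
        List.of("int", "string"),
        Map.of("int", 2, "string", 3),
        Map.of(3, "string"));
    for (BranchPlan plan : plans) {
      System.out.println(plan.avroName + " projected=" + plan.projected);
    }
  }
}
```

The point of the pseudo branch is that a reader still exists for every Avro branch, so the bytes of an unprojected branch can be consumed even though its value is not returned.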
In the UnionReader class, the rows read from the Avro data file are filtered according to the fields that exist in the Iceberg schema.
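One way to picture this filtering is the sketch below. It is not the UnionReader implementation: the row layout (tag at position 0, branch i at position i + 1) and the names `UnionRowFilterSketch`, `fullRow`, and `project` are assumptions chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch of filtering a union row down to the projected fields.
class UnionRowFilterSketch {

  // Builds the unpruned row: position 0 holds the tag, position i + 1 holds the
  // value of branch i, and all branches other than the one written stay null.
  static Object[] fullRow(int branchCount, int tag, Object value) {
    Object[] row = new Object[branchCount + 1];
    row[0] = tag;
    row[tag + 1] = value;
    return row;
  }

  // Keeps only the positions that exist in the projected Iceberg struct, in order.
  static Object[] project(Object[] fullRow, int[] projectedPositions) {
    Object[] out = new Object[projectedPositions.length];
    for (int i = 0; i < projectedPositions.length; i++) {
      out[i] = fullRow[projectedPositions[i]];
    }
    return out;
  }

  public static void main(String[] args) {
    int[] projected = {0, 2}; // the tag and the second branch
    System.out.println(Arrays.toString(project(fullRow(3, 1, "abc"), projected)));
    System.out.println(Arrays.toString(project(fullRow(3, 0, 42), projected)));
  }
}
```

In this sketch a record written with an unprojected branch still produces a row, with the tag set and the projected branch left null.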