PoC of schema pruning by the mapping from Avro field name to Iceberg field id by yiqiangin · Pull Request #1 · funcheetah/iceberg-1

yiqiangin · 2022-08-30T02:16:27Z

This PR implements schema pruning within a complex union, one of the items on the TODO list of PR apache#4242.
In Iceberg, a complex union is represented by a struct with multiple fields. Without schema pruning (that is, when the query applies no column projection), the number of fields equals the number of types in the union plus one (for the tag field). When the query projects columns, the Iceberg union schema is pruned and the struct contains only the subset of fields selected by the projection.
In contrast, the union in the Avro schema is not pruned under column projection, because the full union schema is needed to read the data from the Avro file successfully.
The readers for union data in an Avro file are created from the types in both the Avro schema and the Iceberg schema. The main problem to solve is therefore how to correlate a type in the Avro schema with a type in the Iceberg schema, especially when column projection leaves only some of the types in the Iceberg schema.
The main idea of the solution:

Build a mapping from the type name in the Avro schema to the id of the corresponding field in the Iceberg schema.
When value readers are created, find the Iceberg field that corresponds to an Avro type by looking up its id in the mapping, whose key is the name of the Avro type.
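The two steps above can be sketched in plain Java. This is a minimal illustration, not the code in this PR: the class `BranchIdMapping`, its method names, the sequential id assignment, and the struct field name `field1` are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Iceberg or Avro.
class BranchIdMapping {
  // Step 1: map each Avro union branch name to an Iceberg field id.
  // Assigning ids sequentially from firstId is an assumption of this sketch.
  static Map<String, Integer> build(List<String> avroBranchNames, int firstId) {
    Map<String, Integer> nameToId = new LinkedHashMap<>();
    int nextId = firstId;
    for (String name : avroBranchNames) {
      nameToId.put(name, nextId++);
    }
    return nameToId;
  }

  // Step 2: find the Iceberg field for an Avro branch through its id.
  // Returns null when the branch is unknown or was pruned by the projection.
  static String findIcebergField(
      Map<String, Integer> nameToId,
      Map<Integer, String> projectedFieldsById,
      String avroBranchName) {
    Integer id = nameToId.get(avroBranchName);
    return id == null ? null : projectedFieldsById.get(id);
  }

  public static void main(String[] args) {
    Map<String, Integer> nameToId = build(Arrays.asList("int", "string", "double"), 2);

    // Assume the projection kept only the "string" branch (id 3).
    Map<Integer, String> projected = new LinkedHashMap<>();
    projected.put(3, "field1");

    System.out.println(findIcebergField(nameToId, projected, "string")); // field1
    System.out.println(findIcebergField(nameToId, projected, "int"));    // null
  }
}
```

Keying the lookup by id rather than by position is what lets the full Avro union and the pruned Iceberg struct disagree on the number of branches.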
The mapping from the field name in the Avro schema to the field id in the Iceberg schema is derived while the Avro schema is converted to an Iceberg schema, in the function AvroSchemaUtil.convertToDeriveNameMapping and the class SchemaToType.
The mapping for the direct child fields of an Avro schema field is stored on that field as a property named AvroFieldNameToIcebergId, so it can be accessed easily when the Avro schema is traversed to generate the corresponding readers for the Avro data file.
For a union, the key of the mapping is the name of the branch in the union.
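As a rough illustration of keeping the mapping on the schema node itself, the sketch below uses a hypothetical `SchemaNode` class in place of a real Avro schema object. Only the property name AvroFieldNameToIcebergId comes from this PR; how the PR encodes the property value is not shown here, so storing a `Map` directly is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an Avro schema node with custom properties.
class SchemaNode {
  static final String MAPPING_PROP = "AvroFieldNameToIcebergId";

  final String name;
  private final Map<String, Object> props = new LinkedHashMap<>();

  SchemaNode(String name) {
    this.name = name;
  }

  // Store the name-to-id mapping for this node's direct children (or union branches).
  void setChildMapping(Map<String, Integer> nameToId) {
    props.put(MAPPING_PROP, new LinkedHashMap<>(nameToId)); // defensive copy
  }

  // Read the mapping back during traversal; empty when none was attached.
  @SuppressWarnings("unchecked")
  Map<String, Integer> childMapping() {
    Object value = props.get(MAPPING_PROP);
    return value == null
        ? Collections.<String, Integer>emptyMap()
        : (Map<String, Integer>) value;
  }

  public static void main(String[] args) {
    SchemaNode union = new SchemaNode("unionColumn");
    Map<String, Integer> nameToId = new LinkedHashMap<>();
    for (String branch : Arrays.asList("int", "string")) {
      nameToId.put(branch, nameToId.size() + 2); // ids 2 and 3, chosen arbitrarily
    }
    union.setChildMapping(nameToId);
    System.out.println(union.childMapping()); // {int=2, string=3}
  }
}
```

Because the mapping travels with the node, a visitor needs no side table: it reads the property from whichever node it is currently visiting.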
For a complex union, AvroSchemaWithTypeVisitor.visitUnion() first gets the mapping from the property of the Avro schema, then uses the type name in the Avro schema to get the field id in the Iceberg schema, and finally uses the field id to get the field type in the Iceberg schema:
if the corresponding field exists in the Iceberg schema, that field is used together with the Avro schema node to create the reader;
if no field with the given field id exists in the Iceberg schema (which means the field is not projected), a pseudo branch type is created from the corresponding Avro schema node to facilitate the creation of the reader.
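The branch resolution described above can be sketched as follows. This is a simplified model under stated assumptions, not the actual body of visitUnion(): types are represented as plain strings, and `UnionBranchResolver`, `Resolved`, and the `pseudo:` prefix are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of resolving Avro union branches against a pruned Iceberg struct.
class UnionBranchResolver {
  static final class Resolved {
    final String avroBranch;
    final String type;       // type used to build the branch reader
    final boolean projected; // false when a pseudo type stands in for a pruned field

    Resolved(String avroBranch, String type, boolean projected) {
      this.avroBranch = avroBranch;
      this.type = type;
      this.projected = projected;
    }
  }

  static List<Resolved> resolve(
      List<String> avroBranches,
      Map<String, Integer> nameToId,
      Map<Integer, String> projectedTypesById) {
    List<Resolved> result = new ArrayList<>();
    for (String branch : avroBranches) {
      Integer id = nameToId.get(branch);
      String icebergType = id == null ? null : projectedTypesById.get(id);
      if (icebergType != null) {
        // The field survived the projection: pair it with the Avro branch.
        result.add(new Resolved(branch, icebergType, true));
      } else {
        // The field was pruned: derive a placeholder type from the Avro branch,
        // so a reader can still consume the branch's bytes.
        result.add(new Resolved(branch, "pseudo:" + branch, false));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Integer> nameToId = new LinkedHashMap<>();
    nameToId.put("int", 2);
    nameToId.put("string", 3);
    Map<Integer, String> projected = new LinkedHashMap<>();
    projected.put(3, "string");

    for (Resolved r : resolve(Arrays.asList("int", "string"), nameToId, projected)) {
      System.out.println(r.avroBranch + " -> " + r.type + " (projected=" + r.projected + ")");
    }
  }
}
```

Every Avro branch still gets a reader in this model, which matches the requirement that the full union be readable even when only some branches are projected.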
In the UnionReader class, the rows read from the Avro data file are filtered according to the fields that exist in the Iceberg schema.
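One way to picture this filtering is shown below. It is a sketch under assumptions rather than the UnionReader code from this PR: the class `ProjectedUnionRow`, the position array, and the output row layout (tag first, then projected branches) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical model of emitting only the projected fields of a union value.
class ProjectedUnionRow {
  /**
   * @param branchIndex        index of the Avro union branch that was read
   * @param value              value decoded from that branch
   * @param projectedPositions output position per Avro branch, or -1 when pruned
   * @param rowSize            number of fields in the pruned Iceberg struct
   * @param tagPosition        output position of the tag field, or -1 when pruned
   */
  static Object[] toRow(
      int branchIndex, Object value, int[] projectedPositions, int rowSize, int tagPosition) {
    Object[] row = new Object[rowSize];
    if (tagPosition >= 0) {
      row[tagPosition] = branchIndex;
    }
    int position = projectedPositions[branchIndex];
    if (position >= 0) {
      row[position] = value; // branch is projected: keep its value
    }
    // Otherwise the value is dropped and the row keeps nulls for the branch fields.
    return row;
  }

  public static void main(String[] args) {
    // Avro union [int, string, double]; projection keeps the tag and the string branch.
    int[] positions = {-1, 1, -1};
    System.out.println(Arrays.toString(toRow(1, "abc", positions, 2, 0))); // [1, abc]
    System.out.println(Arrays.toString(toRow(0, 42, positions, 2, 0)));    // [0, null]
  }
}
```

The value of a pruned branch is still decoded from the file, since Avro data must be read sequentially, but it never reaches the output row.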

PoC of the mapping from Avro field name to Iceberg field id

b0e2ab7

yiqiangin changed the title ~~PoC of the mapping from Avro field name to Iceberg field id~~ PoC of schema pruning by the mapping from Avro field name to Iceberg field id Aug 30, 2022

Yiqiang Ding added 4 commits August 30, 2022 11:06

PoC of the mapping from Avro field name to Iceberg field id

29edc11

Merge branch 'master' of https://github.com/yiqiangin/iceberg-union

ab6baf8

Merge branch 'master' of https://github.com/yiqiangin/iceberg-union

5ca3328

Merge branch 'master' of https://github.com/yiqiangin/iceberg-union

fec89b3

yiqiangin marked this pull request as draft August 30, 2022 20:04

yiqiangin marked this pull request as ready for review August 30, 2022 20:04

yiqiangin closed this by deleting the head repository Sep 2, 2022

yiqiangin wants to merge 5 commits into funcheetah:master from yiqiangin:master
