[LI] Bug fix: Remove ids from fileSchema before feeding it into applyNameMapping #136
In #134 we added a config to let the ORC read path ignore file schema ids and use only the ids from the name mapping. However, in the code where the runtime `fileSchemaWithIds` is generated:

```java
fileSchemaWithIds = ORCSchemaUtil.applyNameMapping(fileSchema, nameMapping);
```

we had assumed that `fileSchema` and `nameMapping` contain exactly the same columns, so that `nameMapping` is guaranteed to override all ids (if any) in the `fileSchema` and can therefore serve as the single id provider for `fileSchemaWithIds`.

When they do not contain exactly the same columns, for example when the underlying files contain more columns than the Hive schema, `nameMapping` will not override all of the `fileSchema`'s ids, and some stale ids can be left behind. The fix is to strip the ids from the file schema first:

```java
fileSchemaWithIds = ORCSchemaUtil.applyNameMapping(ORCSchemaUtil.removeIds(fileSchema), nameMapping);
```

This fix lets li-iceberg read Hive tables whose table schema is a subset of the underlying file schema.
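The failure mode can be illustrated with a small toy model. This is not the real `ORCSchemaUtil` API: here a schema is just a column-name-to-id map, `applyNameMapping` stands in for "override ids only for columns the mapping knows", and `removeIds` stands in for "drop every id". The column names and id values are made up for the demo.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the bug: a name mapping only assigns ids to columns it knows
// about, so any extra file column keeps its stale file id unless ids are
// removed from the file schema first.
public class StaleIdDemo {

    // Stand-in for applyNameMapping: overrides ids only for columns
    // present in the mapping; other columns keep whatever id they had.
    static Map<String, Integer> applyNameMapping(Map<String, Integer> fileSchema,
                                                 Map<String, Integer> nameMapping) {
        Map<String, Integer> result = new LinkedHashMap<>(fileSchema);
        result.putAll(nameMapping);
        return result;
    }

    // Stand-in for removeIds: strips all ids (null marks "no id assigned").
    static Map<String, Integer> removeIds(Map<String, Integer> fileSchema) {
        Map<String, Integer> result = new LinkedHashMap<>();
        fileSchema.keySet().forEach(col -> result.put(col, null));
        return result;
    }

    public static void main(String[] args) {
        // The file has an extra column "c" that the table schema
        // (and hence the name mapping) does not know about.
        Map<String, Integer> fileSchema = new LinkedHashMap<>();
        fileSchema.put("a", 10);
        fileSchema.put("b", 11);
        fileSchema.put("c", 12); // stale id written by the file

        Map<String, Integer> nameMapping = new LinkedHashMap<>();
        nameMapping.put("a", 1);
        nameMapping.put("b", 2);

        // Without cleanup: "c" keeps stale file id 12, which the name
        // mapping never overrode and which may collide with table ids.
        Map<String, Integer> buggy = applyNameMapping(fileSchema, nameMapping);
        System.out.println("without removeIds: " + buggy); // {a=1, b=2, c=12}

        // With cleanup first: "c" ends up with no id at all, so the
        // name mapping is the single id provider as intended.
        Map<String, Integer> fixed = applyNameMapping(removeIds(fileSchema), nameMapping);
        System.out.println("with removeIds:    " + fixed); // {a=1, b=2, c=null}
    }
}
```

The design point is the ordering: `removeIds` must run before `applyNameMapping`, because the mapping never touches columns it does not contain, so a pre-existing id on such a column survives unless it is cleared up front.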