Skip to content

Conversation

@rzhang10
Copy link
Member

@rzhang10 rzhang10 commented Jan 26, 2023

…NameMapping

In #134 , we added a config to let ORC read path ignore file schema ids and only use ids from namemapping, but in the code where the runtime fileSchemaWithIds is generated:

fileSchemaWithIds = ORCSchemaUtil.applyNameMapping(fileSchema, nameMapping);

Previously, we have assumed the fileSchema and nameMapping contains exact same columns, so that the nameMapping can guarantee to override all ids (if exist) in the fileSchema, thus serving as the single id provider for the fileSchemaWithIds.

but for the case when they don't contain exact same columns, for example, when the underlying files contains more columns than the hive schema columns, then nameMapping will not completely override the fileSchema's all ids, there could be some uncleaned ids left. The way to fix it is we first do a cleanup by removing the ids first:

fileSchemaWithIds = ORCSchemaUtil.applyNameMapping(ORCSchemaUtil.removeIds(fileSchema), nameMapping);

This fix will make li-iceberg read hive tables whose table schema is a subset of its underlying file schema.

@github-actions github-actions bot added the ORC label Jan 26, 2023
@rzhang10
Copy link
Member Author

rzhang10 commented Feb 1, 2023

Integration tested on cluster with a table that contains less column than the underlying files, the read is successful.

}

@Override
public TypeDescription union(TypeDescription union, List<TypeDescription> options) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify a little more on why this function is needed for this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are leveraging the RemoveIds visitor to remove ids from a ORC schema, the issue is in LI-Iceberg we added union type support whereas the vanilla RemoveIds doesn't have the union case (because vanilla iceberg doesn't have union type), so we just need to add this override implementation to reconstruct the union schema case.

@rzhang10 rzhang10 merged commit df650aa into linkedin:li-0.11.x Feb 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants