[LI] Bug fix: Remove ids from fileSchema before feeding it into applyNameMapping #136
In #134 we added a config to let the ORC read path ignore file schema ids and use only the ids from the name mapping. However, in the code where the runtime `fileSchemaWithIds` is generated:

```java
fileSchemaWithIds = ORCSchemaUtil.applyNameMapping(fileSchema, nameMapping);
```

we had assumed that `fileSchema` and `nameMapping` contain exactly the same columns, so that `nameMapping` is guaranteed to override all ids (if any) in the `fileSchema` and can therefore serve as the single id provider for `fileSchemaWithIds`.

When they do not contain exactly the same columns, for example when the underlying files contain more columns than the Hive schema, `nameMapping` will not override all of the `fileSchema`'s ids, and some stale ids can be left behind. The fix is to strip the ids from the file schema first:

```java
fileSchemaWithIds = ORCSchemaUtil.applyNameMapping(ORCSchemaUtil.removeIds(fileSchema), nameMapping);
```

This fix lets li-iceberg read Hive tables whose table schema is a subset of the underlying file schema.
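The failure mode can be illustrated with a small toy model. This is not the real `ORCSchemaUtil` API: here a schema is just a column-name-to-id map, `applyNameMapping` stands in for "override ids only for columns the mapping knows", and `removeIds` stands in for "drop every id". The column names and id values are made up for the demo.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the bug: a name mapping only assigns ids to columns it knows
// about, so any extra file column keeps its stale file id unless ids are
// removed from the file schema first.
public class StaleIdDemo {

    // Stand-in for applyNameMapping: overrides ids only for columns
    // present in the mapping; other columns keep whatever id they had.
    static Map<String, Integer> applyNameMapping(Map<String, Integer> fileSchema,
                                                 Map<String, Integer> nameMapping) {
        Map<String, Integer> result = new LinkedHashMap<>(fileSchema);
        result.putAll(nameMapping);
        return result;
    }

    // Stand-in for removeIds: strips all ids (null marks "no id assigned").
    static Map<String, Integer> removeIds(Map<String, Integer> fileSchema) {
        Map<String, Integer> result = new LinkedHashMap<>();
        fileSchema.keySet().forEach(col -> result.put(col, null));
        return result;
    }

    public static void main(String[] args) {
        // The file has an extra column "c" that the table schema
        // (and hence the name mapping) does not know about.
        Map<String, Integer> fileSchema = new LinkedHashMap<>();
        fileSchema.put("a", 10);
        fileSchema.put("b", 11);
        fileSchema.put("c", 12); // stale id written by the file

        Map<String, Integer> nameMapping = new LinkedHashMap<>();
        nameMapping.put("a", 1);
        nameMapping.put("b", 2);

        // Without cleanup: "c" keeps stale file id 12, which the name
        // mapping never overrode and which may collide with table ids.
        Map<String, Integer> buggy = applyNameMapping(fileSchema, nameMapping);
        System.out.println("without removeIds: " + buggy); // {a=1, b=2, c=12}

        // With cleanup first: "c" ends up with no id at all, so the
        // name mapping is the single id provider as intended.
        Map<String, Integer> fixed = applyNameMapping(removeIds(fileSchema), nameMapping);
        System.out.println("with removeIds:    " + fixed); // {a=1, b=2, c=null}
    }
}
```

The design point is the ordering: `removeIds` must run before `applyNameMapping`, because the mapping never touches columns it does not contain, so a pre-existing id on such a column survives unless it is cleared up front.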