-
Notifications
You must be signed in to change notification settings - Fork 3k
Spark: Follow name mapping when importing ORC tables #1399
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@rdsr @shardulm94 @aokolnychyi @rdblue PTAL if you have a chance. Thanks! |
|
Thanks for pinging me, @edgarRd. I'm back from a long weekend off and I'll look as soon as I have time. |
| String nameMappingString = targetTable.properties().get(TableProperties.DEFAULT_NAME_MAPPING); | ||
| NameMapping nameMapping = nameMappingString != null ? NameMappingParser.fromJson(nameMappingString) : null; | ||
| NameMapping nameMapping = nameMappingString != null ? | ||
| NameMappingParser.fromJson(nameMappingString) : MappingUtil.create(targetTable.schema()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should change to use a name mapping by default in a separate PR. For now, let's just support a name mapping when importing ORC data.
My rationale is that I think the two are independently useful. Someone might want to change the default, but not cherry-pick ORC changes. Similarly, someone might want to use a name mapping with ORC, but not want to default to name mapping.
Also, I think that we would want to default to name mapping slightly differently. It doesn't make sense to me to create a temporary mapping that is used for metrics here, unless that mapping is also used to read the data. So I would prefer to update tables when importing and add a mapping to table metadata by default, if it is not already there.
| import static org.apache.iceberg.TableProperties.PARQUET_VECTORIZATION_ENABLED; | ||
| import static org.apache.iceberg.types.Types.NestedField.optional; | ||
|
|
||
| @RunWith(Enclosed.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While there are a lot of changes in this test, can you also move this to iceberg-spark instead of iceberg-spark2? I think it is in spark2 by accident, and moving it in IntelliJ produces no warnings. That way this also runs in the Java 11 test profile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I could work on another PR with that change, I was wondering if that'd minimize the diff in this PR and make it more clear. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's fine with me.
|
Great work, @edgarRd! Everything looks good, except for changing the default behavior. Let's separate that out and discuss how we want to do it. Thank you! |
3343339 to
b9a9c4e
Compare
|
@rdblue I've updated the comment on the default. I was wondering if the relocation of the tests could be done in a follow up PR just to make the diff clear. Thanks! |
| String nameMappingString = targetTable.properties().get(TableProperties.DEFAULT_NAME_MAPPING); | ||
| NameMapping nameMapping = nameMappingString != null ? NameMappingParser.fromJson(nameMappingString) : null; | ||
| NameMapping nameMapping = nameMappingString != null ? | ||
| NameMappingParser.fromJson(nameMappingString) : MappingUtil.create(targetTable.schema()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change needs to be reverted. The import process must create and store a mapping on the table. We should not use an ephemeral mapping that is immediately discarded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, somehow I missed this one but did it for the method of unpartitioned tables. Thanks for pointing this out. I've reverted the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That explains it. I thought that you were going to, so I was surprised.
|
@edgarRd, I had another look. The default is still changed so that a table with no mapping with create one. That needs to be reverted before we can commit this. |
b9a9c4e to
2b6e369
Compare
|
@rdblue Thanks for taking a look! |
2b6e369 to
0304058
Compare
|
Thanks for updating! I'll merge this when tests are passing. |
This PR fixes #1345 and #1206 - the main changes included are:
TestSparkTableUtilto run tests in multiple file formats.Note that I changed to use a name mapping by default of the target table, as mentioned in #1345 (comment).
PTAL @rdsr @shardulm94 @aokolnychyi - Thanks!