@edgarRd
Contributor

@edgarRd edgarRd commented Aug 29, 2020

This PR fixes #1345 and #1206 - the main changes included are:

  1. Refactor TestSparkTableUtil to run tests in multiple file formats.
  2. Use name mapping when importing ORC tables.

Note that I changed the import to use a name mapping by default, created from the target table's schema when the table has none, as mentioned in #1345 (comment).

PTAL @rdsr @shardulm94 @aokolnychyi - Thanks!

@edgarRd
Contributor Author

edgarRd commented Sep 3, 2020

@rdsr @shardulm94 @aokolnychyi @rdblue PTAL if you have a chance. Thanks!

@rdblue
Contributor

rdblue commented Sep 8, 2020

Thanks for pinging me, @edgarRd. I'm back from a long weekend off and I'll look as soon as I have time.

  String nameMappingString = targetTable.properties().get(TableProperties.DEFAULT_NAME_MAPPING);
- NameMapping nameMapping = nameMappingString != null ? NameMappingParser.fromJson(nameMappingString) : null;
+ NameMapping nameMapping = nameMappingString != null ?
+     NameMappingParser.fromJson(nameMappingString) : MappingUtil.create(targetTable.schema());
Contributor

I think we should change to use a name mapping by default in a separate PR. For now, let's just support a name mapping when importing ORC data.

My rationale is that I think the two are independently useful. Someone might want to change the default, but not cherry-pick ORC changes. Similarly, someone might want to use a name mapping with ORC, but not want to default to name mapping.

Also, I think that we would want to default to name mapping slightly differently. It doesn't make sense to me to create a temporary mapping that is used for metrics here, unless that mapping is also used to read the data. So I would prefer to update tables when importing and add a mapping to table metadata by default, if it is not already there.
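The "store a mapping if it is not already there" behavior described above can be sketched as follows. This is a minimal illustration, not the change that was merged: a plain `Map` stands in for the table's properties, and `ImportNameMappingSketch`, `ensureNameMapping`, and the placeholder JSON are hypothetical names introduced here. The property key is assumed to match `TableProperties.DEFAULT_NAME_MAPPING`. In Iceberg itself, the stored value would presumably come from `NameMappingParser.toJson(MappingUtil.create(table.schema()))` and be committed through `table.updateProperties()`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class ImportNameMappingSketch {
  // Assumed to be the key behind TableProperties.DEFAULT_NAME_MAPPING.
  static final String DEFAULT_NAME_MAPPING = "schema.name-mapping.default";

  /**
   * Returns the mapping JSON stored in the properties. If none is stored,
   * creates one from the supplier and stores it first, so that the same
   * mapping is available to readers later instead of being discarded.
   */
  static String ensureNameMapping(Map<String, String> properties,
                                  Supplier<String> createFromSchema) {
    String existing = properties.get(DEFAULT_NAME_MAPPING);
    if (existing != null) {
      return existing; // never overwrite a mapping the table already has
    }
    String created = createFromSchema.get();
    properties.put(DEFAULT_NAME_MAPPING, created); // persist, not ephemeral
    return created;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>(); // stand-in for table properties
    // Placeholder JSON; a real mapping would be derived from the table schema.
    String mapping = ensureNameMapping(props, () -> "[{\"field-id\":1,\"names\":[\"id\"]}]");
    System.out.println(mapping.equals(props.get(DEFAULT_NAME_MAPPING)));
  }
}
```

The point of the design is that the mapping used while importing (for example, to collect metrics) is the same one later used to read the data, because it lives in table metadata rather than in a local variable.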

import static org.apache.iceberg.TableProperties.PARQUET_VECTORIZATION_ENABLED;
import static org.apache.iceberg.types.Types.NestedField.optional;

@RunWith(Enclosed.class)
Contributor

While there are a lot of changes in this test, can you also move this to iceberg-spark instead of iceberg-spark2? I think it is in spark2 by accident, and moving it in IntelliJ produces no warnings. That way this also runs in the Java 11 test profile.

Contributor Author

I agree. I could work on another PR with that change; I was wondering if that would minimize the diff in this PR and make it clearer. Thanks!

Contributor

Yeah, that's fine with me.

@rdblue
Contributor

rdblue commented Sep 14, 2020

Great work, @edgarRd! Everything looks good, except for changing the default behavior. Let's separate that out and discuss how we want to do it. Thank you!

@edgarRd edgarRd force-pushed the orc-import-name-mapping branch from 3343339 to b9a9c4e Compare September 24, 2020 20:44
@edgarRd
Contributor Author

edgarRd commented Sep 24, 2020

@rdblue I've updated the comment on the default. I was wondering if the relocation of the tests could be done in a follow-up PR, just to keep the diff clear. Thanks!

  String nameMappingString = targetTable.properties().get(TableProperties.DEFAULT_NAME_MAPPING);
- NameMapping nameMapping = nameMappingString != null ? NameMappingParser.fromJson(nameMappingString) : null;
+ NameMapping nameMapping = nameMappingString != null ?
+     NameMappingParser.fromJson(nameMappingString) : MappingUtil.create(targetTable.schema());
Contributor

This change needs to be reverted. The import process must create and store a mapping on the table. We should not use an ephemeral mapping that is immediately discarded.

Contributor Author

Yeah, somehow I missed this one but did it in the method for unpartitioned tables. Thanks for pointing this out. I've reverted the change.

Contributor

That explains it. I thought that you were going to, so I was surprised.

@rdblue
Contributor

rdblue commented Sep 25, 2020

@edgarRd, I had another look. The default is still changed so that a table with no mapping will create one. That needs to be reverted before we can commit this.

@edgarRd edgarRd force-pushed the orc-import-name-mapping branch from b9a9c4e to 2b6e369 Compare September 25, 2020 21:36
@edgarRd
Contributor Author

edgarRd commented Sep 25, 2020

@rdblue Thanks for taking a look!

@edgarRd edgarRd force-pushed the orc-import-name-mapping branch from 2b6e369 to 0304058 Compare September 25, 2020 21:43
@rdblue
Contributor

rdblue commented Sep 25, 2020

Thanks for updating! I'll merge this when tests are passing.

Development

Successfully merging this pull request may close these issues.

Spark: Follow name mapping while importing ORC tables