Spark: Follow name mapping when importing ORC tables #1399

edgarRd · 2020-08-29T00:24:47Z

This PR fixes #1345 and #1206 - the main changes included are:

Refactor TestSparkTableUtil to run tests in multiple file formats.
Use name mapping when importing ORC tables.

Note that I changed the import to default to a name mapping created from the target table's schema, as mentioned in #1345 (comment).
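
For context, a name mapping lets a reader assign Iceberg field IDs to columns in files that were written without them, by looking the IDs up by column name. A minimal sketch of that lookup in plain Java follows. The class `NameMappingSketch` and its methods are hypothetical and use a plain `Map` as a stand-in; this is not Iceberg's `NameMapping` type or its API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class NameMappingSketch {
    // Hypothetical stand-in for a name mapping: column name -> field ID,
    // derived here by copying the (name, id) pairs of a table schema.
    static Map<String, Integer> mappingFromSchema(Map<String, Integer> tableSchema) {
        return new LinkedHashMap<>(tableSchema);
    }

    // Resolve the field ID for a column read from a file that carries no IDs.
    // Columns that are not in the mapping resolve to Optional.empty().
    static Optional<Integer> resolveId(Map<String, Integer> mapping, String fileColumn) {
        return Optional.ofNullable(mapping.get(fileColumn));
    }

    public static void main(String[] args) {
        Map<String, Integer> schema = new LinkedHashMap<>();
        schema.put("id", 1);
        schema.put("data", 2);

        Map<String, Integer> mapping = mappingFromSchema(schema);
        System.out.println(resolveId(mapping, "data"));  // prints Optional[2]
        System.out.println(resolveId(mapping, "extra")); // prints Optional.empty
    }
}
```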

PTAL @rdsr @shardulm94 @aokolnychyi - Thanks!

edgarRd · 2020-09-03T00:21:28Z

@rdsr @shardulm94 @aokolnychyi @rdblue PTAL if you have a chance. Thanks!

rdblue · 2020-09-08T20:46:40Z

Thanks for pinging me, @edgarRd. I'm back from a long weekend off and I'll look as soon as I have time.

rdblue · 2020-09-14T00:10:24Z

spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java

      String nameMappingString = targetTable.properties().get(TableProperties.DEFAULT_NAME_MAPPING);
-      NameMapping nameMapping = nameMappingString != null ? NameMappingParser.fromJson(nameMappingString) : null;
+      NameMapping nameMapping = nameMappingString != null ?
+          NameMappingParser.fromJson(nameMappingString) : MappingUtil.create(targetTable.schema());


I think we should change to use a name mapping by default in a separate PR. For now, let's just support a name mapping when importing ORC data.

My rationale is that I think the two are independently useful. Someone might want to change the default, but not cherry-pick ORC changes. Similarly, someone might want to use a name mapping with ORC, but not want to default to name mapping.

Also, I think that we would want to default to name mapping slightly differently. It doesn't make sense to me to create a temporary mapping that is used for metrics here, unless that mapping is also used to read the data. So I would prefer to update tables when importing and add a mapping to table metadata by default, if it is not already there.
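
The difference between the two approaches can be sketched in plain Java. This is only an illustration under stated assumptions: the table properties are modeled as a `Map<String, String>`, the key `example.name-mapping` is a placeholder rather than the value of `TableProperties.DEFAULT_NAME_MAPPING`, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class ImportMappingSketch {
    // Placeholder key, assumed for illustration only.
    static final String KEY = "example.name-mapping";

    // Variant A (the diff above): fall back to a schema-derived mapping that
    // exists only for this call and is never written back to the table.
    static String ephemeralMapping(Map<String, String> props, String schemaDerived) {
        String stored = props.get(KEY);
        return stored != null ? stored : schemaDerived;
    }

    // Variant B (the direction suggested in this comment): store the mapping on
    // the table if none is present, then use the stored one for import and reads.
    static String persistedMapping(Map<String, String> props, String schemaDerived) {
        props.putIfAbsent(KEY, schemaDerived);
        return props.get(KEY);
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        String derived = "[{\"field-id\":1,\"names\":[\"id\"]}]"; // illustrative JSON

        ephemeralMapping(props, derived);
        System.out.println(props.containsKey(KEY)); // prints false

        persistedMapping(props, derived);
        System.out.println(props.containsKey(KEY)); // prints true
    }
}
```

In variant B an existing mapping is left untouched, which matches the "if it is not already there" condition.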

rdblue · 2020-09-14T00:46:00Z

spark2/src/test/java/org/apache/iceberg/spark/source/TestSparkTableUtil.java

 import static org.apache.iceberg.TableProperties.PARQUET_VECTORIZATION_ENABLED;
 import static org.apache.iceberg.types.Types.NestedField.optional;

+@RunWith(Enclosed.class)


While there are a lot of changes in this test, can you also move this to iceberg-spark instead of iceberg-spark2? I think it is in spark2 by accident, and moving it in IntelliJ produces no warnings. That way this also runs in the Java 11 test profile.

I agree. I could work on another PR with that change; I was wondering if that'd minimize the diff in this PR and make it clearer. Thanks!

Yeah, that's fine with me.

rdblue · 2020-09-14T00:46:36Z

Great work, @edgarRd! Everything looks good, except for changing the default behavior. Let's separate that out and discuss how we want to do it. Thank you!

edgarRd · 2020-09-24T20:47:34Z

@rdblue I've updated the comment on the default. I was wondering if the relocation of the tests could be done in a follow-up PR, just to keep the diff clear. Thanks!

rdblue · 2020-09-25T20:18:45Z

spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java

    String nameMappingString = targetTable.properties().get(TableProperties.DEFAULT_NAME_MAPPING);
-    NameMapping nameMapping = nameMappingString != null ? NameMappingParser.fromJson(nameMappingString) : null;
+    NameMapping nameMapping = nameMappingString != null ?
+        NameMappingParser.fromJson(nameMappingString) : MappingUtil.create(targetTable.schema());


This change needs to be reverted. The import process must create and store a mapping on the table. We should not use an ephemeral mapping that is immediately discarded.

Yeah, somehow I missed this one but did it for the method of unpartitioned tables. Thanks for pointing this out. I've reverted the change.

That explains it. I thought that you were going to, so I was surprised.

rdblue · 2020-09-25T20:20:05Z

@edgarRd, I had another look. The default is still changed so that a table with no mapping will create one. That needs to be reverted before we can commit this.

edgarRd · 2020-09-25T21:38:24Z

@rdblue Thanks for taking a look!

rdblue · 2020-09-25T22:05:31Z

Thanks for updating! I'll merge this when tests are passing.

probot-autolabeler bot added ORC spark labels Aug 29, 2020

rdblue reviewed Sep 14, 2020

edgarRd added 2 commits September 24, 2020 15:27

Parameterize TestSparkTableUtil tests

573db27

ORC: Add name mapping on import

09e7ee8

edgarRd force-pushed the orc-import-name-mapping branch from 3343339 to b9a9c4e Compare September 24, 2020 20:44

rdblue reviewed Sep 25, 2020

edgarRd force-pushed the orc-import-name-mapping branch from b9a9c4e to 2b6e369 Compare September 25, 2020 21:36

Do not apply default name mapping on table import

0304058

edgarRd force-pushed the orc-import-name-mapping branch from 2b6e369 to 0304058 Compare September 25, 2020 21:43

rdblue merged commit c07b23b into apache:master Sep 25, 2020

waterlx mentioned this pull request Oct 13, 2020

Importing Hive table (using ORC) is blocked by "ORC schema does not contain Iceberg IDs" #1604

Closed

rdblue added this to the Java 0.10.0 Release milestone Nov 16, 2020

parthchandra pushed a commit to parthchandra/iceberg that referenced this pull request Oct 22, 2025

Fix copy delete files: returned source path instead of staging path f…

5939836

…or copy (apache#1399)
