Add IcebergStorageHandler #1107
Conversation
@massdosage @cmathiesen. In what order should we review the work here? #933, #1103, and then this PR?
Hello @rdsr! So, we discussed this with @rdblue and he asked us to try to split the code coming out of Hiveberg into separate, discrete PRs where possible, even if some of the PRs didn't add anything that was directly usable. So that's what we've tried to do, but only once all 3 are committed does this become really useful for end users. I would actually suggest doing the reviews in the reverse order and starting with this one, which is the simplest. Once we have this and #1103 merged in, we can add a whole lot more HiveRunner tests to #933 which test the InputFormat, the SerDe, and the StorageHandler all working together, which would be really nice.
Ah, I just realised this actually depends on the InputFormat and SerDe classes being present, so we'd actually need #1103 merged first. Then we should either add this code to #933, or only look at this after that has been merged and then add the HiveRunner tests for the StorageHandler to this PR. So I'd say please look at #1103 first.
```java
@Override
public void configureJobConf(TableDesc tableDesc, JobConf jobConf) {
```
I think this is the place where we can pass in many of the options that Iceberg supports, e.g. reading a specific snapshot, case-insensitive matching, etc.
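For example, something along these lines (just a sketch; the `iceberg.read.snapshot-id` and `iceberg.read.case-sensitive` property names are made-up placeholders, not existing configuration keys):

```java
@Override
public void configureJobConf(TableDesc tableDesc, JobConf jobConf) {
  // Sketch only: copy hypothetical read options from the table properties
  // into the JobConf so the InputFormat can pick them up at split planning time.
  Properties props = tableDesc.getProperties();
  String snapshotId = props.getProperty("iceberg.read.snapshot-id");
  if (snapshotId != null) {
    jobConf.set("iceberg.read.snapshot-id", snapshotId);
  }
  String caseSensitive = props.getProperty("iceberg.read.case-sensitive");
  if (caseSensitive != null) {
    jobConf.set("iceberg.read.case-sensitive", caseSensitive);
  }
}
```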
```java
@Override
public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer deserializer, ExprNodeDesc exprNodeDesc) {
  DecomposedPredicate predicate = new DecomposedPredicate();
  predicate.residualPredicate = (ExprNodeGenericFuncDesc) exprNodeDesc;
```
As a first implementation it is correct.
Maybe I'm stating the obvious, but if we know that the Iceberg predicate will cover the expression fully, then we can return an empty residualPredicate, so we don't add an unnecessary filtering operation in Hive. And if Iceberg covers the expression only partially, then Hive should filter only on the residual part.
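Roughly like this (just a sketch, not code from this PR; `icebergCoversFully` is a hypothetical check for whether the whole expression converts losslessly to an Iceberg expression):

```java
@Override
public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer deserializer, ExprNodeDesc exprNodeDesc) {
  DecomposedPredicate predicate = new DecomposedPredicate();
  ExprNodeGenericFuncDesc expr = (ExprNodeGenericFuncDesc) exprNodeDesc;
  if (icebergCoversFully(expr)) {
    // Iceberg evaluates the whole expression, so there is nothing left for Hive to re-check.
    predicate.pushedPredicate = expr;
    predicate.residualPredicate = null;
  } else {
    // Fall back: Hive re-applies the full expression, which is always safe.
    predicate.residualPredicate = expr;
  }
  return predicate;
}
```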
The challenge here is that Iceberg produces per-split residuals, so we would need to return them for each split, or find the common filter that needs to be applied to all splits.
There's also a question of whether Hive or Iceberg is better at doing the filtering. For vectorized reads, Iceberg may be better. But for row-based reads, engines are typically better. That's why we delegate final filtering to Spark, which will benefit from that engine's codegen.
Does Iceberg push these filters down to the specific file readers (ORC/Parquet)? Or is this predicate pushdown only for filtering out specific files, not the contents of the files?
In Hive we found it very beneficial to push every filter down to the readers. This is especially true for column-based formats like Parquet or ORC, where we do not have to deserialize all of the data if filtering already removes the unnecessary rows.
Thanks,
Peter
PS: Maybe this is not the best place to start this conversation. If you feel so, feel free to redirect it to the correct channel. Just started to familiarize myself with the workings of the Iceberg community.
We have some prior work on how we intend to push filters down over here. The intention was to first get this "first implementation" merged in and then raise a subsequent PR to add improvements. I need to update this PR since the InputFormat was merged in (working on it as we speak). We'd definitely appreciate input on the subsequent PRs if there are better ways of interacting with Hive.
Sure thing! I will follow the stuff around the HiveSerde and friends :)
As a first implementation it is definitely OK to return the whole expression as the residualPredicate. Hive just has to apply it again, which is absolutely fine from a correctness standpoint.
@pvary, thanks for taking a look at these, it's great to have more people with knowledge of Hive participating!
```java
@Override
public Class<? extends OutputFormat> getOutputFormatClass() {
  return HiveIgnoreKeyTextOutputFormat.class;
}
```
In table metadata, we use FileOutputFormat because it can't be instantiated, so any write attempted through a Hive table would fail instead of writing data that doesn't appear in the table. Does that need to be done here?
Good question, we haven't tried the write path at all but I agree that it would be better if it failed rather than silently doing nothing.
Another option would be to create a HiveIcebergOutputFormat class that throws UnsupportedOperationException; this is what they do in Delta's Hive connector.
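A minimal sketch of what that could look like (class name and messages are just assumptions):

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.util.Progressable;

/** Rejects all writes until a real write path exists, like Delta's Hive connector does. */
public class HiveIcebergOutputFormat<K, V> implements OutputFormat<K, V> {

  @Override
  public RecordWriter<K, V> getRecordWriter(FileSystem fs, JobConf job, String name, Progressable progress) {
    throw new UnsupportedOperationException("Writing to an Iceberg table with Hive is not supported");
  }

  @Override
  public void checkOutputSpecs(FileSystem fs, JobConf job) {
    throw new UnsupportedOperationException("Writing to an Iceberg table with Hive is not supported");
  }
}
```

That way a misconfigured INSERT fails loudly instead of producing files that never show up in the table.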
That seems like a good option. If we get to the phase where we have a working writer implementation, and the correct writer is already specified, then we do not have to recreate all of the tables. We just change the jars and everything should work like a charm :)
```java
@Override
public String toString() {
  return this.getClass().getName();
}
```
What do other storage handlers do? Could this be a short name, like "iceberg"?
I had a look at a few and most of them don't override this method. Hive's own JDBC storage handler returns the full class name as a string via a [constant](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/common/src/java/org/apache/hadoop/hive/conf/Constants.java#L62). So it could be OK like this, or we could just remove the method.
This looks good to me, just a couple of minor questions. I'll merge this to unblock next steps. Where does this fit in the overall plan for Hive support? Are tests going to be added next? What works right now and what doesn't?
Well, this is actually a good milestone: we now have everything needed to read Iceberg tables from Hive merged into master. Up next we were going to add back features like pushdowns, system tables, time travel reads, etc., but those are all improvements; what is in Iceberg now should work end to end for the read path. One thing I'd like to change to make this easier to use is for the module to build an uber jar, so you only have to add one jar to Hive's classpath instead of the 6+ needed at the moment. Once that's done we also need to add documentation describing all of this.
@guilload ended up adding the tests in #1192; the main test class is TestHiveIcebergInputFormat. We were planning to flesh that out with more tests as we add the above features.
Maybe something like "STORED AS ICEBERG" would help save users some time when they are creating Hive tables manually on top of existing Iceberg tables. See the same for JSON: https://issues.apache.org/jira/browse/HIVE-19899 - this would need a Hive change after the Iceberg release, but an IcebergStorageFormatDescriptor.java might be part of the Iceberg project.
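If I read the JSON example right, the descriptor itself could be quite small. A sketch (all class names referenced here are assumptions based on this PR series; AbstractStorageFormatDescriptor is Hive's base class, whose getSerde() returns null by default):

```java
import java.util.Set;

import com.google.common.collect.ImmutableSet;
import org.apache.hadoop.hive.ql.io.AbstractStorageFormatDescriptor;

/** Sketch of a descriptor that would let users write CREATE TABLE ... STORED AS ICEBERG. */
public class IcebergStorageFormatDescriptor extends AbstractStorageFormatDescriptor {

  @Override
  public Set<String> getNames() {
    return ImmutableSet.of("ICEBERG");
  }

  @Override
  public String getInputFormat() {
    return IcebergInputFormat.class.getName();
  }

  @Override
  public String getOutputFormat() {
    return HiveIcebergOutputFormat.class.getName();
  }

  @Override
  public String getSerde() {
    return HiveIcebergSerDe.class.getName();
  }
}
```

If I remember correctly, Hive discovers these via ServiceLoader, so a META-INF/services entry for org.apache.hadoop.hive.ql.io.StorageFormatDescriptor would be needed as well.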
That's a great idea; I'll look into it when I get a chance.
Thanks for the update, @massdosage! If we have a working read path, then it would be awesome to start working on some docs commits to the site as well, although some of the configuration may change with updates to table resolution, #1155. |
This PR adds an IcebergStorageHandler that bundles together all the interfaces needed to read Iceberg tables from Hive. This would first require #933 and #1103 to be merged in, as it requires the InputFormat and the SerDe classes. We are planning on adding tests using HiveRunner for this class once everything is merged in :D
@rdblue @rdsr @massdosage @teabot