[WIP] Mapred input format #933
Conversation
# Conflicts:
#   mr/src/main/java/org/apache/iceberg/mr/IcebergInputFormat.java
#   mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormat.java
Prepare for upstream WIP PR
Looks like the current problem is that missing
OK, I don't have that issue (probably because I have the jar cached locally) but I know what you're talking about as I've seen it in many other projects. I'll try removing it locally and see if I can reproduce it, and then add an exclusion like you suggest.
Hi @massdosage, following @rdblue's suggestion, I'd like to open a PR against this one to bring the improvements I've made on #1104, but before I do that, do you think you could rebase this branch onto master and clean up / squash some commits to make things easier for me? This PR currently has a lot of commits and is hard to navigate.
import org.apache.iceberg.orc.ORC;
import org.apache.iceberg.parquet.Parquet;

public class IcebergRecordReader<T> {
Is it possible to reuse the IcebergRecordReader already implemented in
iceberg/mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
Line 280 in cad1249
private static final class IcebergRecordReader<T> extends RecordReader<Void, T> {
The intention here is for common record reader code across both the mapred and mapreduce sub-packages to live here, and then the specific implementations just deal with any particulars related to the different APIs. Right now the InputFormat in the mapred package is only used by Hive. I'd be happy to move more common code out as and when we find it; this is just a start at how it could be done. I'd prefer to do that in later PRs and get a working, albeit basic, end-to-end read path for Hive merged first.
}
switch (leaf.getOperator()) {
  case EQUALS:
    return equal(column, leaf.getLiteral());
We will need to convert literal values from the Hive data types to Iceberg data types.
I'm not entirely sure what this means - is this for things like Dates and Timestamps to be of the right type?
Yup! Here's what Hive will give you: https://github.com/apache/hive/blob/cb213d88304034393d68cc31a95be24f5aac62b6/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L49-L56 but Iceberg expects https://github.com/linkedin/iceberg/blob/6e5601bbea2c032f11d90114e2c3e64dc6c0e5c3/api/src/main/java/org/apache/iceberg/types/Type.java#L29-L45. So you will need to add conversions for parity. I would also suggest adding one more test case which covers all the data types that Hive can give us.
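To make that concrete, here is a minimal sketch of what such a mapping could look like; the class and method names and the exact case handling are illustrative assumptions, not this PR's implementation:

import java.sql.Timestamp;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;

public class HiveLiteralConversions {
  // Hypothetical helper: convert a Hive PredicateLeaf literal into a value
  // Iceberg expressions can accept. Only a few representative cases are shown.
  static Object leafToIcebergLiteral(PredicateLeaf leaf) {
    Object literal = leaf.getLiteral();
    switch (leaf.getType()) {
      case LONG:
      case FLOAT:
      case STRING:
      case BOOLEAN:
        return literal; // already usable as-is
      case DECIMAL:
        // HiveDecimalWritable -> BigDecimal without going through double
        return ((HiveDecimalWritable) literal).getHiveDecimal().bigDecimalValue();
      case DATE:
        // Hive hands back a Timestamp for DATE literals (see the discussion below);
        // Iceberg expects days since epoch for dates
        return (int) TimeUnit.MILLISECONDS.toDays(((Timestamp) literal).getTime());
      case TIMESTAMP: {
        // Iceberg expects microseconds since epoch for timestamps
        Timestamp ts = (Timestamp) literal;
        return TimeUnit.SECONDS.toMicros(Math.floorDiv(ts.getTime(), 1000L)) + ts.getNanos() / 1000L;
      }
      default:
        throw new UnsupportedOperationException("Unsupported literal type: " + leaf.getType());
    }
  }
}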
You can also take some insights from
iceberg/orc/src/main/java/org/apache/iceberg/orc/ExpressionToSearchArgument.java
Line 241 in cad1249
private <T> Object literal(Type icebergType, T icebergLiteral) {
case AND:
  ExpressionTree andLeft = childNodes.get(0);
  ExpressionTree andRight = childNodes.get(1);
  if (childNodes.size() > 2) {
Can this also be true for OR?
Also, maybe simplify this to:
Expression result = Expressions.alwaysTrue();
for (ExpressionTree child : childNodes) {
  result = and(result, translate(child, leaves));
}
* @param leaves All instances of the leaf nodes.
* @return Array of leftover evaluated nodes.
*/
private static Expression[] getLeftoverLeaves(List<ExpressionTree> allChildNodes, List<PredicateLeaf> leaves) {
I guess this should be unnecessary if the logic for OR/AND operators is simplified to use a for loop as in my suggestion above.
Yeah a great suggestion, I'll make those changes :D
// We are unsure of how the CONSTANT case works, so using the approach of:
// https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/
// ParquetFilterPredicateConverter.java#L116
return null;
We return null here, but Expressions.and(), Expressions.not() and Expressions.or() have null checks in them, so it seems like this will fail regardless. Would it be better to throw a more readable exception here instead?
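For example, a minimal sketch of the alternative being suggested (the exception type and message are just illustrative):

// fail fast with a clear message instead of returning null,
// which Expressions.and()/or()/not() would reject anyway
throw new UnsupportedOperationException("CONSTANT operator is not supported in Hive filter pushdown");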
*/
private static Expression translateLeaf(PredicateLeaf leaf) {
  String column = leaf.getColumnName();
  if (column.equals("snapshot__id")) {
Reference the snapshot__id constant defined in SystemTableUtil
SearchArgument sarg = ConvertAstToSearchArg.create(conf, exprNodeDesc);
Expression filter = IcebergFilterFactory.generateFilterExpression(sarg);

long snapshotIdToScan = extractSnapshotID(conf, exprNodeDesc);
I guess this may cause issues if the table itself contains a column called snapshot__id. @rdblue Do you think we should reserve a few column names (or a prefix) in the spec for these virtual columns? I guess such virtual columns are generally useful for supporting time travel/versioning/incremental scans in purely SQL engines.
Yeah, this was an edge case that cropped up when we were testing. We got around it by making it configurable but with a default of snapshot__id. So by default we're adding this extra column to a table schema, but if a user knows they already have a column with this same name they can set the virtual column to a different name:
TBLPROPERTIES ('iceberg.hive.snapshot.virtual.column.name' = 'new_column_name')
But reserving column names would be a nice addition so this check doesn't need to happen.
/**
 * Creates an Iterable of Records with all snapshot metadata that can be used with the RecordReader.
 */
public class SnapshotIterable implements CloseableIterable {
Is this really needed? Can we reuse
public TableScan newScan() {
The reason we ended up adding the SnapshotIterable is so we could wrap a row from the SnapshotsTable in a Record which could be passed back to the RecordReader without needing to change any code for the RR. It would be super helpful if there were a method in SnapshotsTable that returned Records for the whole table, as there isn't a Reader, so to speak, for the METADATA file format (https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/FileFormat.java#L31) that we could add to the ReaderFactory to let us 'read' from the SnapshotsTable.
My concern is that we will need N such classes for the N metadata tables we offer. I would suggest looking at how this is handled for Spark.
if (task.isDataTask()) {
Also we can probably consider adding support for metadata tables in a separate PR? This PR has a lot of new functionality to review.
A great call out, that looks like exactly the sort of thing we should be doing here! I'll remove most of the system tables stuff from our PRs and we'll do a separate PR that should be easier to review :))
I was assuming it will get squashed when merged into master, so the (too many) commits would get removed at that stage. Is there a specific problem on your side? If we rebase it, that will break everyone's existing checkouts.
@shardulm94 thank you for the review! I've made some optimisations to the
private static Expression translateLeaf(PredicateLeaf leaf) {
  String column = leaf.getColumnName();
  if (column.equals("snapshot__id")) {
  if (column.equals(SystemTableUtil.DEFAULT_SNAPSHOT_ID_COLUMN_NAME)) {
Actually, this may not be enough since the column name is configurable?
A good point, I'll address this in the follow-up PR.
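Something along these lines could work as a starting point; this is a rough sketch assuming the Hadoop Configuration (and the table property shown earlier in this thread) is reachable from where the filter is built:

// read the configured virtual column name, falling back to the default constant
String snapshotIdColumn = conf.get(
    "iceberg.hive.snapshot.virtual.column.name",
    SystemTableUtil.DEFAULT_SNAPSHOT_ID_COLUMN_NAME);

if (column.equals(snapshotIdColumn)) {
  // handle the virtual snapshot id column as before
}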
* @return Expression that is translated from the Hive SearchArgument.
*/
private static Expression translate(ExpressionTree tree, List<PredicateLeaf> leaves) {
private static Expression translate(ExpressionTree tree, List<PredicateLeaf> leaves,
Nit: childNodes doesn't need to be a parameter, it can be a local variable instead.
translate(tree.getChildren().get(1), leaves));
Expression orResult = Expressions.alwaysFalse();
for (ExpressionTree child : childNodes) {
  orResult = or(orResult, translate(child, leaves, childNodes));
I think passing childNodes here is incorrect. It should be child.getChildren() else we just keep on passing the root child nodes over and over. However, I think we should just make childNodes a local variable instead so that we don't make this mistake. It would also be good to add some tests for nested expressions.
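A rough sketch of how that case could look with childNodes kept local and the recursion going through each child's own subtree (assuming the two-argument translate(tree, leaves) overload):

case OR: {
  // childNodes as a local variable, scoped to this case only
  List<ExpressionTree> childNodes = tree.getChildren();
  Expression orResult = Expressions.alwaysFalse();
  for (ExpressionTree child : childNodes) {
    // recurse into each child subtree rather than re-passing the root's children
    orResult = or(orResult, translate(child, leaves));
  }
  return orResult;
}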
case STRING:
  return leaf.getLiteral();
case DATE:
  return ((Timestamp) leaf.getLiteral()).getTime();
IIRC Hive's Date type predicate literal is of type java.sql.Date. So this looks incorrect? (Unsure)
Yeah, you're correct about the Date type, and I was using that initially, but then when running the tests I would get a ClassCastException ("Can't cast Timestamp as Date"): it looks like Hive does some internal conversion into a Timestamp when calling leaf.getLiteral() for a Date type:
https://github.com/apache/hive/blob/branch-2.3/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgumentImpl.java#L108
Ugh! Nice find. Timestamp.getTime() is probably still incorrect though, as Iceberg expects the number of days since epoch as a literal for the Date type. Timestamp.getTime() will give you the number of millis since epoch.
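For instance, something like this (a sketch assuming the Timestamp literal that Hive returns for DATE, per the link above):

// Iceberg expects days since epoch for DATE literals, not milliseconds
Timestamp dateLiteral = (Timestamp) leaf.getLiteral();
return (int) TimeUnit.MILLISECONDS.toDays(dateLiteral.getTime());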
Also, might be good to mention this Hive behaviour as a comment as it looks very non-intuitive.
Ah yeah, good shout on the comment and using the correct granularity, I'll make those changes!
case DECIMAL:
  return BigDecimal.valueOf(((HiveDecimalWritable) leaf.getLiteral()).doubleValue());
case TIMESTAMP:
  return ((Timestamp) leaf.getLiteral()).getTime();
Timestamp.getTime() gives back milliseconds since epoch, but Iceberg expects microsecond granularity. You might want to factor in Timestamp.getNanos() to get microsecond granularity.
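A possible conversion, sketched under the assumption that the literal is a java.sql.Timestamp as above:

// combine the whole seconds from getTime() with the sub-second nanos
// to get microseconds since epoch, which is what Iceberg expects
Timestamp ts = (Timestamp) leaf.getLiteral();
long epochSeconds = Math.floorDiv(ts.getTime(), 1000L);
return TimeUnit.SECONDS.toMicros(epochSeconds) + ts.getNanos() / 1000L;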
An integration test using HiveRunner will probably be very helpful in finding these conversion issues. I would suggest looking at
public class TestLocalScan {
public void testFilterWithDateAndTimestamp() throws IOException {
ah nice, thank you for the pointer!
It is not released yet, but later we will have to consider time zones as well. See: https://issues.apache.org/jira/browse/HIVE-20007
Iceberg leaves time zone handling to the engine. Engines should pass concrete values to Iceberg in expressions.
For example, a query might have a clause WHERE ts > TIMESTAMP '...'. It is the query engine's responsibility to determine what concrete value that timestamp actually represents. If ts is a timestamp without time zone, for example, the engine is responsible for converting the timestamp with time zone for comparison.
Iceberg should receive an unambiguous value to use in the comparison. Preferably, the value is in microseconds from epoch, but we can convert from alternative representations like millis + nanos.
}
}

private static List<Object> hiveLiteralListToIcebergType(List<Object> hiveLiteralTypes) {
We should just reuse leafToIcebergType here; it seems like there's equivalent logic in two places.
I would agree about trying to reuse leafToIcebergType. Some refactoring needs to happen for the IN operator case where we need to convert a literal List instead of just a single literal and my current logic doesn't support that very well - but I am on it!
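One possible shape for that refactor, sketched with a hypothetical single-literal helper signature (the real leafToIcebergType may look different):

// convert every literal in an IN list with the same single-literal conversion,
// assuming a helper of the form leafToIcebergType(PredicateLeaf.Type, Object)
private static List<Object> hiveLiteralListToIcebergType(PredicateLeaf leaf) {
  List<Object> converted = new ArrayList<>();
  for (Object hiveLiteral : leaf.getLiteralList()) {
    converted.add(leafToIcebergType(leaf.getType(), hiveLiteral));
  }
  return converted;
}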
This draft PR shows the current status of an InputFormat for Hive. We've developed this mostly independently but took some inspiration from the mapreduce InputFormat that was recently merged in. We have a separate branch where we're working on Hive read support via a SerDe, StorageHandler etc. which is still in its early stages (see https://github.com/ExpediaGroup/incubator-iceberg/tree/add-hive-read-support/hive/src/). The most interesting thing there currently is a HiveRunner test which is pretty much an integration test for the mapred InputFormat that runs it against an in-memory Hive metastore and MR cluster, see https://github.com/ExpediaGroup/incubator-iceberg/blob/add-hive-read-support/hive/src/test/java/org/apache/iceberg/hive/serde/TestIcebergInputFormat.java#L87

For this PR we'd appreciate feedback on:
the mapreduce InputFormat (do we want to tackle it in this PR or do that in a subsequent follow-on PR?).