Add IcebergSerDe #1103
Conversation
This is required for #933 so that we can write proper integration tests for the InputFormat using this SerDe.
Resolved (outdated) review threads on:
mr/src/main/java/org/apache/iceberg/mr/mapred/IcebergSchemaToTypeInfo.java
mr/src/main/java/org/apache/iceberg/mr/mapred/IcebergWritable.java
mr/src/main/java/org/apache/iceberg/mr/mapred/SystemTableUtil.java
    protected static String getVirtualColumnName(Properties properties) {
When are properties used and when is configuration used? I'm surprised that we need both.
Yeah, we agree. We discovered this when adding the SerDe: the InputFormat uses Configuration but the SerDe only uses Properties, and we wanted to reuse the methods across both classes. It seemed simpler to overload a method rather than create a new Properties from the Configuration in the InputFormat.
Although that is exactly what we're doing in the TableResolver class... :')
The configuration contains all the configs we set in HiveConf, and possibly the Hadoop conf as well. The properties are the merged result of the Hive table and partition properties. We can see how Hive uses these in the initialize method of AbstractSerDe:

```java
public void initialize(Configuration configuration, Properties tableProperties,
                       Properties partitionProperties) throws SerDeException {
  initialize(configuration,
      SerDeUtils.createOverlayedProperties(tableProperties, partitionProperties));
}
```
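As a minimal sketch of the overlay semantics (partition properties taking precedence over table properties), using plain java.util.Properties. The `overlay` method here is a hypothetical stand-in for what SerDeUtils.createOverlayedProperties produces, not Hive's actual implementation:

```java
import java.util.Properties;

public class OverlaySketch {

    // Hypothetical stand-in: table properties become defaults,
    // partition properties override them.
    static Properties overlay(Properties table, Properties partition) {
        Properties merged = new Properties(table); // table props are the fallback
        partition.stringPropertyNames()
                 .forEach(k -> merged.setProperty(k, partition.getProperty(k)));
        return merged;
    }

    public static void main(String[] args) {
        Properties table = new Properties();
        table.setProperty("format", "parquet");
        table.setProperty("owner", "hive");

        Properties partition = new Properties();
        partition.setProperty("format", "orc"); // overrides the table-level value

        Properties merged = overlay(table, partition);
        System.out.println(merged.getProperty("format")); // orc
        System.out.println(merged.getProperty("owner"));  // hive
    }
}
```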
    private static TypeInfo generateTypeInfo(Type type) throws Exception {
It looks like this would be easier to implement using the type visitors, which already have the logic to traverse a schema. A good example is converting a Type to Spark's DataType.
That looks way simpler, I'll get started on that
I already have a type visitor somewhere from Schema to ObjectInspector. I can also submit that one on my PR so you can focus on the remaining things to do.
Yeah, +1 for a visitor. We have a TypeInfo to Iceberg Type visitor for inspiration: https://github.com/linkedin/iceberg/blob/master/hive/src/main/java/org/apache/iceberg/hive/legacy/HiveTypeToIcebergType.java
@guilload that would be great, thank you!
Still work in progress, things missing are mostly unit tests, but this is what it'll look like:
guilload@3bffa7c
That looks promising, happy to move that in here when you're done if the others agree. Thanks!
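For anyone following along, the visitor approach being discussed can be sketched with a toy type hierarchy like the one below. The Type and Visitor classes are simplified stand-ins invented for illustration; Iceberg's real type visitors differ in detail:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Toy sketch of converting a nested schema with a visitor, in the spirit of
// the Type-to-DataType converters mentioned above. Not Iceberg's actual API.
public class TypeToTypeInfoSketch {

  interface Type {
    <T> T accept(Visitor<T> visitor);
  }

  interface Visitor<T> {
    T primitive(String name);
    T list(T elementResult);
    T struct(List<String> names, List<T> fieldResults);
  }

  static Type primitive(String name) {
    return new Type() {
      public <T> T accept(Visitor<T> v) { return v.primitive(name); }
    };
  }

  static Type list(Type element) {
    return new Type() {
      public <T> T accept(Visitor<T> v) { return v.list(element.accept(v)); }
    };
  }

  static Type struct(List<String> names, List<Type> fields) {
    return new Type() {
      public <T> T accept(Visitor<T> v) {
        return v.struct(names,
            fields.stream().map(f -> f.accept(v)).collect(Collectors.toList()));
      }
    };
  }

  // One concrete visitor: renders Hive-style type strings.
  static final Visitor<String> TO_HIVE_STRING = new Visitor<String>() {
    public String primitive(String name) { return name; }
    public String list(String element) { return "array<" + element + ">"; }
    public String struct(List<String> names, List<String> fields) {
      StringBuilder sb = new StringBuilder("struct<");
      for (int i = 0; i < names.size(); i++) {
        if (i > 0) sb.append(',');
        sb.append(names.get(i)).append(':').append(fields.get(i));
      }
      return sb.append('>').toString();
    }
  };

  public static void main(String[] args) {
    Type schema = struct(
        Arrays.asList("id", "tags"),
        Arrays.asList(primitive("bigint"), list(primitive("string"))));
    System.out.println(schema.accept(TO_HIVE_STRING));
    // struct<id:bigint,tags:array<string>>
  }
}
```

The payoff is that the traversal is written once, and each target representation (TypeInfo, ObjectInspector, DataType) is just another Visitor implementation.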
Resolved (outdated) review threads on:
mr/src/main/java/org/apache/iceberg/mr/mapred/IcebergSchemaToTypeInfo.java
mr/src/main/java/org/apache/iceberg/mr/mapred/IcebergSerDe.java
mr/src/test/java/org/apache/iceberg/mr/mapred/TestIcebergSchemaToTypeInfo.java
mr/src/test/java/org/apache/iceberg/mr/mapred/TestIcebergSerDe.java
    IcebergSerDe serDe = new IcebergSerDe();
    List<Object> deserialized = (List<Object>) serDe.deserialize(writable);
    Map result = (Map) deserialized.get(0);
Should this use the object inspectors?
Hello @guilload, if we give you write access as an external collaborator to our fork of Iceberg at https://github.com/ExpediaGroup/iceberg, would that make it easier to get your changes into this PR?

Yes, please! GitHub won't let me open a PR from my repo to yours, or fork your repo either.
    try {
      table = TableResolver.resolveTableFromConfiguration(configuration, serDeProperties);
    } catch (IOException e) {
      throw new UncheckedIOException("Unable to resolve table from configuration: ", e);
I didn't realize Java added an UncheckedIOException in 8. We have one that is RuntimeIOException. We should probably convert Iceberg over to using the standard Java one.
    @Override
    public byte[] getPrimitiveJavaObject(Object o) {
      return o == null ? null : ((ByteBuffer) o).array();
This isn't correct because it doesn't follow the contract of ByteBuffer. Avro will reuse byte buffers, so there is no guarantee that this array is the correct length. In addition, we want to generally follow the ByteBuffer contract so that we don't need to worry about whether an optimization later (buffer reuse) will break certain sections of code.
An easy fix is to use ByteBuffers.toByteArray here.
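To make the contract issue concrete, the sketch below shows a sliced buffer whose array() exposes the whole backing array, and a copy that respects position, limit, and arrayOffset, similar in spirit to the ByteBuffers.toByteArray helper mentioned above:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ByteBufferCopySketch {

  // Copies exactly the bytes between position and limit,
  // without disturbing the caller's position.
  static byte[] toByteArray(ByteBuffer buffer) {
    byte[] result = new byte[buffer.remaining()];
    buffer.duplicate().get(result);
    return result;
  }

  public static void main(String[] args) {
    byte[] backing = {0, 1, 2, 3, 4, 5, 6, 7};
    // A reused buffer may hand out a window into a larger array:
    ByteBuffer window = ByteBuffer.wrap(backing, 2, 3).slice();

    System.out.println(Arrays.toString(window.array()));      // the whole backing array!
    System.out.println(Arrays.toString(toByteArray(window))); // [2, 3, 4]
  }
}
```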
    case TIME:
    case UUID:
    default:
      throw new IllegalArgumentException(primitiveType.typeId() + " type is not supported");
Couldn't fixed be read as binary? And UUID as a string? And doesn't Hive support time?
First two done. As for TIME: Hive supports DATE and TIMESTAMP; I don't know enough about Iceberg's types to comment on how these differ from TIME, but I'm guessing Hive doesn't support it?
    @Override
    public int getFieldID() {
@omalley, is the Iceberg field ID suitable to return as a Hive field ID here?
    @Override
    public int hashCode() {
      return 31 * field.hashCode() + oi.hashCode();
We typically prefer Objects.hash to this older pattern.
    implements TimestampObjectInspector {

    private static final IcebergTimestampObjectInspector INSTANCE_WITH_ZONE =
        new IcebergTimestampObjectInspector(o -> ((OffsetDateTime) o).toLocalDateTime());
Minor: It seems like this would be a bit cleaner if the outer class was abstract and these were anonymous classes with an implementation for LocalDateTime convert(Object o) or something similar. Using Function is okay, but seems like it uses functions to avoid normal inheritance.
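A rough sketch of what that refactor could look like. The class and constant names below are illustrative stand-ins, not the PR's actual API:

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

// Abstract outer class with a convert() hook; the two variants become
// anonymous subclasses instead of Functions passed to a constructor.
abstract class TimestampInspectorSketch {

  abstract LocalDateTime convert(Object o);

  static final TimestampInspectorSketch WITHOUT_ZONE = new TimestampInspectorSketch() {
    LocalDateTime convert(Object o) {
      return (LocalDateTime) o;
    }
  };

  static final TimestampInspectorSketch WITH_ZONE = new TimestampInspectorSketch() {
    LocalDateTime convert(Object o) {
      return ((OffsetDateTime) o).toLocalDateTime();
    }
  };

  public static void main(String[] args) {
    OffsetDateTime odt = OffsetDateTime.of(2020, 6, 1, 12, 0, 0, 0, ZoneOffset.UTC);
    System.out.println(WITH_ZONE.convert(odt)); // 2020-06-01T12:00
  }
}
```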
    public class TestIcebergBinaryObjectInspector {

      @Test
      public void testIcebergBinaryObjectInspector() {
It would be nice to have more cases in this test suite:
- When the buffer's limit is less than array().length
- When the buffer's arrayOffset is non-zero
- When the buffer's position is non-zero
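For reference, the three tricky buffers suggested above can be constructed like this (an illustrative snippet, not the actual test code):

```java
import java.nio.ByteBuffer;

public class BufferCases {
  public static void main(String[] args) {
    byte[] data = {0, 1, 2, 3, 4, 5};

    // limit smaller than array().length
    ByteBuffer limited = ByteBuffer.wrap(data, 0, 3);

    // non-zero arrayOffset (a slice of a larger buffer)
    ByteBuffer offset = ByteBuffer.wrap(data, 2, 3).slice();

    // non-zero position
    ByteBuffer positioned = ByteBuffer.wrap(data);
    positioned.position(4);

    System.out.println(limited.remaining());    // 3
    System.out.println(offset.arrayOffset());   // 2
    System.out.println(positioned.remaining()); // 2
  }
}
```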
    @Override
    public DateWritable getPrimitiveWritableObject(Object o) {
      Date date = getPrimitiveJavaObject(o);
      return date == null ? null : new DateWritable(date);
Instead of converting to Date and then wrapping with DateWritable, could we use the DateWritable constructor that accepts an integer? That would be more direct and we could convert using DateTimeUtil.daysFromDate(localDate) that we use elsewhere.
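A small sketch of the direct conversion being suggested. The `daysFromDate` helper below mirrors what an epoch-day conversion like Iceberg's DateTimeUtil.daysFromDate computes; the resulting int is what the integer-accepting DateWritable constructor would take:

```java
import java.time.LocalDate;

public class EpochDaysSketch {

  // Days since the Unix epoch (1970-01-01); what Hive's
  // DateWritable(int) constructor expects.
  static int daysFromDate(LocalDate date) {
    return (int) date.toEpochDay();
  }

  public static void main(String[] args) {
    LocalDate date = LocalDate.of(1970, 1, 11);
    System.out.println(daysFromDate(date)); // 10
  }
}
```

This avoids materializing an intermediate java.sql.Date object entirely.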
I had a few questions, but overall this looks good. The only blocker is how
* Refactor TestIcebergObjectInspector
* Inherit from AbstractPrimitiveJavaObjectInspector rather than IcebergPrimitiveObjectInspector
* Avoid creating an intermediate Date object
* Fix IcebergRecordStructField.equals
* Use inheritance to implement static Timestamp object inspectors
* Handle UUID type as String
* Handle fixed type as binary (byte array)
@rdblue, thanks for the comments in the review; we've implemented or replied to all of them. Could you please take another look?
    public class TestIcebergObjectInspector {

      private int id = 0;
Why was this introduced? It seems like relying on the same execution order between the schema creation and the test methods is brittle.
I'd prefer to move back to fixed IDs since that's easier to test and more clear in assertions.
That made my life easier when adding new fields, but I get your point; I'll fix it in a follow-up PR.
@rdblue, @guilload, @massdosage: since there are many open items, does it make sense to create a milestone with all the open tickets? That way they can be worked on in parallel and we don't duplicate effort or step on each other's toes.
Sounds good to me. You should be able to create and edit milestones.
    TableResolver.resolveTableFromConfiguration(conf);
    }

    @Test(expected = NullPointerException.class)
We usually prefer using AssertHelpers.assertThrows here, but this is minor since you don't need to check that other state has not been modified after the failure.
Since the above is almost certainly going to be refactored when we merge more logic between the two InputFormats, we can tackle this then.
I agree. I didn't want to block progress on Hive just for this.
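For readers unfamiliar with the pattern, here is a generic sketch of an assertThrows-style helper. Iceberg's AssertHelpers has a similar utility, though its actual signature differs; this version is invented for illustration:

```java
public class AssertThrowsSketch {

  // Runs the action and returns the thrown exception so the caller can make
  // further assertions on it (message, cause, state not modified, etc.).
  static <T extends Throwable> T assertThrows(Class<T> expected, Runnable action) {
    try {
      action.run();
    } catch (Throwable t) {
      if (expected.isInstance(t)) {
        return expected.cast(t);
      }
      throw new AssertionError("Unexpected exception type: " + t.getClass(), t);
    }
    throw new AssertionError(
        "Expected " + expected.getSimpleName() + " but nothing was thrown");
  }

  public static void main(String[] args) {
    NullPointerException npe = assertThrows(NullPointerException.class,
        () -> { throw new NullPointerException("boom"); });
    System.out.println(npe.getMessage()); // boom
  }
}
```

Compared to `@Test(expected = ...)`, this pins the failure to one statement and lets the test inspect the exception afterwards.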
I had a couple of minor comments, but we can fix those later. I'm going to merge this to unblock the next steps. Thanks @cmathiesen, @massdosage, @guilload, and everyone that reviewed!
    package org.apache.iceberg.mr.mapred;
Can we put the Hive classes in org.apache.iceberg.hive? This is already committed, but are you OK with this refactor?
I thought the convention was that the package name needed to match the subproject name, and this isn't in the hive subproject, but maybe that's not the case? Alternatively, they could go in org.apache.iceberg.mr.hive?
I prefer org.apache.iceberg.mr.hive
mr.hive sounds good to me.
I don't think we need to worry too much about this kind of refactor right now. We expect it to change rapidly as we build. We'll include a note in any release about how it is experimental and subject to change.
OK, my preference would be to leave this as it is for now and then do a review of all the packaging once we have the StorageHandler and InputFormat merged.
Hello! This is part 1 of our series of PRs to add the mapred InputFormat to support reading tables from Hive. This was initially meant to only include the IcebergSerDe, but we had to add a few more classes to get it working properly. @rdblue @massdosage @teabot