ORC: Supported nested identity partition data #989

rdsr · 2020-04-30T07:02:13Z

Fixes #897

The following changes are made:

Added an OrcTypeWithSchemaVisitor
Added support for nested identity partitions, similar to "Avro: Support partition values using a constants map" (#896)
Refactored many of the data reader functions so they can now be shared across Spark and Iceberg generics for ORC
SparkOrcReader is simplified in the process. I've moved away from Spark's UnsafeRow and used GenericInternalRow instead, similar to the Spark Avro and Parquet readers.

orc/src/main/java/org/apache/iceberg/orc/OrcValReader.java

orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java

spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java

rdsr · 2020-04-30T13:53:22Z

cc @edgarRd

orc/src/main/java/org/apache/iceberg/orc/OrcSchemaWithTypeVisitor.java

rdsr · 2020-05-01T04:46:09Z

I reran the Spark JMH benchmarks that @shardulm94 wrote, once on a fresh checkout of Iceberg and once with my patch applied. Here's the output:

Fresh checkout of Iceberg

Benchmark                                                                          Mode  Cnt   Score   Error  Units
IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  11.004 ± 0.330   s/op
IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.920 ± 1.424   s/op
IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.028 ± 0.103   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.695 ± 0.203   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.293 ± 0.236   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.850 ± 0.211   s/op

With the ORC nested partition patch applied

Benchmark                                                                          Mode  Cnt   Score   Error  Units
IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  13.984 ± 0.440   s/op
IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.454 ± 0.438   s/op
IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.264 ± 0.091   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.037 ± 0.241   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.615 ± 0.336   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.980 ± 0.167   s/op

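Seems like there isn't much difference on the Iceberg read paths (readIceberg 2.028 vs. 2.264 s/op, readWithProjectionIceberg 1.850 vs. 1.980 s/op). cc @shardulm94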

rdblue · 2020-05-06T18:18:00Z

orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java

        "Error in ORC file, children fields and names do not match.");

    List<Types.NestedField> icebergFields = Lists.newArrayListWithExpectedSize(children.size());
+    // TODO how we get field ids from ORC schema


I just noticed the logic here and it's a correctness bug. ORC should not assign column IDs when one is missing. Instead, it should ignore the field.

Should we use another PR to fix this?
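A minimal sketch of the "ignore the field" behavior described above, written against hypothetical stand-in types rather than the real ORC `TypeDescription` or `ORCSchemaUtil` code. `OrcField` and `IdAwareProjection` are illustrative names only. The idea is that a child without an Iceberg ID is dropped from the converted schema instead of being handed a freshly assigned ID, presumably because an invented ID could collide with a real field ID in the table schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for one child of an ORC struct:
// a column name plus an optional Iceberg field ID attribute.
final class OrcField {
  final String name;
  final Optional<Integer> icebergId;

  OrcField(String name, Integer icebergId) {
    this.name = name;
    this.icebergId = Optional.ofNullable(icebergId);
  }
}

final class IdAwareProjection {
  private IdAwareProjection() {
  }

  // Keeps only the children that carry an ID. Children without one are
  // skipped; no new ID is ever assigned here.
  static List<OrcField> fieldsWithIds(List<OrcField> children) {
    List<OrcField> result = new ArrayList<>(children.size());
    for (OrcField child : children) {
      if (child.icebergId.isPresent()) {
        result.add(child);
      }
    }
    return result;
  }
}
```

With this shape, a file written without IDs simply yields no readable columns for those fields, which the reader can then treat as missing.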

orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java

orc/src/main/java/org/apache/iceberg/orc/OrcSchemaWithTypeVisitor.java

orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java

rdblue · 2020-05-07T17:04:50Z

orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java

+
+    protected abstract T create();
+
+    protected abstract T reuseOrCreate();


Why not pass the possibly reused object in here?
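One way to read this suggestion is a single hook that receives the candidate container, instead of two parameterless hooks. The sketch below assumes that reading. `ReusableStructReader` and `ListStructReader` are hypothetical names, not the classes in this PR, and the reuse check is deliberately simplistic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: the caller passes in the previous container
// (or null) and the subclass decides whether it can be reused.
abstract class ReusableStructReader<T> {
  protected abstract T reuseOrCreate(Object reuse);

  protected abstract void set(T struct, int pos, Object value);

  T read(Object reuse, Object[] values) {
    T struct = reuseOrCreate(reuse);
    for (int pos = 0; pos < values.length; pos++) {
      set(struct, pos, values[pos]);
    }
    return struct;
  }
}

// Illustrative subclass that materializes a struct as a fixed-width list.
final class ListStructReader extends ReusableStructReader<List<Object>> {
  private final int width;

  ListStructReader(int width) {
    this.width = width;
  }

  @Override
  @SuppressWarnings("unchecked")
  protected List<Object> reuseOrCreate(Object reuse) {
    // Reuse only a container of the expected kind and width.
    if (reuse instanceof ArrayList && ((ArrayList<?>) reuse).size() == width) {
      return (List<Object>) reuse;
    }
    List<Object> fresh = new ArrayList<>(width);
    for (int i = 0; i < width; i++) {
      fresh.add(null);
    }
    return fresh;
  }

  @Override
  protected void set(List<Object> struct, int pos, Object value) {
    struct.set(pos, value);
  }
}
```

Reuse like this is only safe when the caller does not hold on to a previously returned row, since the next read overwrites it in place.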

orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java

rdblue · 2020-05-07T17:17:32Z

orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java

+
+    private T readInternal(T struct, ColumnVector[] columnVectors, int row) {
+      for (int c = 0; c < readers.length; ++c) {
+        set(struct, c, reader(c).read(columnVectors[c], row));


You might consider a different approach. This currently mirrors what happens in Avro, where the constants are set after reading a record. That is done because Avro can't skip fields easily -- it needs to read through a value even if the value won't be used.

But columnar formats can easily skip. That's why in Parquet, we replace the column reader with a constant reader. So the struct reader behaves exactly like normal and reads a value from every child reader. But some of those children might ignore what's in the data file and return a constant. That should be more efficient because you're not materializing columns you don't need to.

Is it OK if I tackle this in a followup?

Yeah, that sounds good.
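The constant-reader approach discussed in this thread could be sketched roughly as follows. This is an illustration under assumptions, not the Parquet or ORC reader code: `ColumnReader`, `DataReader`, `ConstantReader`, and this `StructReader` are hypothetical, and a column is modeled as a plain `Object[]` instead of an ORC `ColumnVector`.

```java
import java.util.Map;

// Hypothetical reader interface; a column is modeled as Object[].
interface ColumnReader {
  Object read(Object[] column, int row);
}

// Reads the value stored in the data file.
final class DataReader implements ColumnReader {
  @Override
  public Object read(Object[] column, int row) {
    return column[row];
  }
}

// Ignores the file contents and always returns the partition constant.
final class ConstantReader implements ColumnReader {
  private final Object constant;

  ConstantReader(Object constant) {
    this.constant = constant;
  }

  @Override
  public Object read(Object[] column, int row) {
    return constant;
  }
}

final class StructReader {
  private final ColumnReader[] readers;

  // Child readers are chosen once, up front: positions present in the
  // constants map get a ConstantReader, all others read from the file.
  StructReader(int numFields, Map<Integer, Object> constantsByPosition) {
    readers = new ColumnReader[numFields];
    for (int pos = 0; pos < numFields; pos++) {
      readers[pos] = constantsByPosition.containsKey(pos)
          ? new ConstantReader(constantsByPosition.get(pos))
          : new DataReader();
    }
  }

  // The struct reader behaves the same for every child; it does not
  // need a second pass to patch constants into the result.
  Object[] read(Object[][] columns, int row) {
    Object[] struct = new Object[readers.length];
    for (int c = 0; c < readers.length; c++) {
      struct[c] = readers[c].read(columns[c], row);
    }
    return struct;
  }
}
```

Because a `ConstantReader` never touches its column, that column can be left unmaterialized (null in this sketch), which is the efficiency argument made above.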

spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcReader.java

spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java

spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java

rdblue · 2020-05-07T17:36:04Z

spark/src/test/java/org/apache/iceberg/spark/data/TestSparkOrcReader.java


    try (CloseableIterable<InternalRow> reader = ORC.read(Files.localInput(testFile))
        .project(schema)
-        .createReaderFunc(SparkOrcReader::new)


I think this should test with and without container reuse if that is implemented in this PR. Probably just make this test parameterized.

For now I've removed the reuse code. We can tackle that in a followup.

rdblue

Nice work, @rdsr! The main change needed is to fix or remove container reuse because that's a correctness problem.

rdsr · 2020-05-20T19:48:22Z

Thanks, @rdblue. Picking this up again. I should address your comments soon.

shardulm94 · 2020-05-07T09:56:52Z

orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java

        }
    }
-
+    orcType.setAttribute(ICEBERG_ID_ATTRIBUTE, fieldId.toString());


For completeness' sake, also set ICEBERG_REQUIRED_ATTRIBUTE?

Adding this is actually causing failures

org.apache.iceberg.data.orc.TestGenericReadProjection > testRenamedAddedField FAILED
    java.lang.IllegalArgumentException: No conversion of type LONG to self needed
        at org.apache.orc.impl.ConvertTreeReaderFactory.createAnyIntegerConvertTreeReader(ConvertTreeReaderFactory.java:1671)
        at org.apache.orc.impl.ConvertTreeReaderFactory.createConvertTreeReader(ConvertTreeReaderFactory.java:2124)
        at org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2331)
        at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.<init>(TreeReaderFactory.java:1961)
        at org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2371)
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:227)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:752)
        at org.apache.iceberg.orc.OrcIterable.newOrcIterator(OrcIterable.java:80)
        at org.apache.iceberg.orc.OrcIterable.iterator(OrcIterable.java:65)
        at com.google.common.collect.Iterables.getOnlyElement(Iterables.java:254)
        at org.apache.iceberg.data.orc.TestGenericReadProjection.writeAndRead(TestGenericReadProjection.java:53)

And I vaguely remember that we fixed a similar bug in ORC before.

It would be great to know what's going on here. Since this is just a projection schema and the reader is built with the Iceberg schema (that has required/optional), I don't think it is really a blocker. But setting a property here shouldn't cause ORC to fail, right?

I'll file the necessary followups.

orc/src/main/java/org/apache/iceberg/orc/OrcSchemaWithTypeVisitor.java

rdsr · 2020-05-22T07:38:32Z

@rdblue This is ready for another round of review

rdblue · 2020-05-22T15:56:32Z

This looks good to me, except for the conflicts. Can you rebase and we can commit it?

rdsr · 2020-05-22T16:22:03Z

Weird. I thought I did resolve the conflicts.

rdsr · 2020-05-22T16:33:41Z

@rdblue, I fixed the conflicts and filed all the necessary followups.

rdblue · 2020-05-22T18:42:06Z

Looks good to me! I'll merge it. Thanks, @rdsr!

rdsr marked this pull request as ready for review April 30, 2020 07:06

abti reviewed Apr 30, 2020

orc/src/main/java/org/apache/iceberg/orc/OrcSchemaWithTypeVisitor.java

rdsr changed the title ~~[WIP] Orc nested partition support~~ [WIP] Orc nested Identity partition support Apr 30, 2020

rdsr force-pushed the orc_nested_partition branch 2 times, most recently from 96881f5 to 8b635a0 Compare May 1, 2020 15:09

rdblue reviewed May 6, 2020

orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java