@rdsr
Contributor

@rdsr rdsr commented Apr 30, 2020

Fixes #897

This PR makes the following changes:

  1. Added an OrcTypeWithSchemaVisitor
  2. Added support for nested identity partitions, similar to "Avro: Support partition values using a constants map" (#896)
  3. Refactored many data reader functions so that they can be shared across the Spark and Iceberg Generics readers for ORC
  4. Simplified SparkOrcReader in the process. I've moved away from Spark's UnsafeRow and use GenericInternalRow instead, as the Spark Avro and Parquet readers do.

@rdsr rdsr marked this pull request as ready for review April 30, 2020 07:06
@rdsr
Contributor Author

rdsr commented Apr 30, 2020

cc @edgarRd

@rdsr rdsr changed the title from "[WIP] Orc nested partition support" to "[WIP] Orc nested Identity partition support" on Apr 30, 2020
@rdsr
Contributor Author

rdsr commented May 1, 2020

I reran the Spark JMH benchmarks that @shardulm94 wrote, once on a fresh checkout of Iceberg and once with my patch applied. Here's the output:

Fresh checkout of Iceberg

Benchmark                                                                          Mode  Cnt   Score   Error  Units
IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  11.004 ± 0.330   s/op
IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.920 ± 1.424   s/op
IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.028 ± 0.103   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.695 ± 0.203   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.293 ± 0.236   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.850 ± 0.211   s/op

With the ORC nested partition patch applied

Benchmark                                                                          Mode  Cnt   Score   Error  Units
IcebergSourceNestedORCDataReadBenchmark.readFileSourceNonVectorized                  ss    5  13.984 ± 0.440   s/op
IcebergSourceNestedORCDataReadBenchmark.readFileSourceVectorized                     ss    5  10.454 ± 0.438   s/op
IcebergSourceNestedORCDataReadBenchmark.readIceberg                                  ss    5   2.264 ± 0.091   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  10.037 ± 0.241   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  10.615 ± 0.336   s/op
IcebergSourceNestedORCDataReadBenchmark.readWithProjectionIceberg                    ss    5   1.980 ± 0.167   s/op

Seems like there isn't much difference. cc @shardulm94

@rdsr rdsr force-pushed the orc_nested_partition branch 2 times, most recently from 96881f5 to 8b635a0 on May 1, 2020 15:09

"Error in ORC file, children fields and names do not match.");

List<Types.NestedField> icebergFields = Lists.newArrayListWithExpectedSize(children.size());
// TODO how we get field ids from ORC schema
Contributor

I just noticed the logic here and it's a correctness bug. ORC should not assign column IDs when one is missing. Instead, it should ignore the field.
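
The behavior described here could be sketched as follows. This is only an illustration under stated assumptions: `OrcFieldSketch` and `FieldIdSketch` are hypothetical stand-ins, not the real ORC `TypeDescription` or Iceberg schema classes, and a `null` ID models a field whose ID is missing from the file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a child of an ORC struct; not the real ORC API.
final class OrcFieldSketch {
  final String name;
  final Integer icebergId; // null when the file carries no ID for this field

  OrcFieldSketch(String name, Integer icebergId) {
    this.name = name;
    this.icebergId = icebergId;
  }
}

final class FieldIdSketch {
  private FieldIdSketch() {}

  // Keeps only the fields that carry an ID. A field without an ID is
  // ignored instead of being assigned a freshly generated ID.
  static List<OrcFieldSketch> fieldsWithIds(List<OrcFieldSketch> children) {
    List<OrcFieldSketch> result = new ArrayList<>(children.size());
    for (OrcFieldSketch child : children) {
      if (child.icebergId != null) {
        result.add(child);
      }
    }
    return result;
  }
}
```

Ignoring the field means it is simply absent from the converted schema, so a generated ID can never collide with an ID that already belongs to another column.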

Contributor Author

Should we use another PR to fix this?


protected abstract T create();

protected abstract T reuseOrCreate();
Contributor

Why not pass the possibly reused object in here?
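
One possible shape for that suggestion, as a rough sketch: the caller passes in the previous record (or `null`) and the subclass decides whether it can be reused. The class names and the `List<Object>` record type are assumptions made for illustration, not the PR's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical struct reader; names are illustrative only.
abstract class ReusingStructReaderSketch<T> {
  // Receives the possibly reusable object from the caller (may be null).
  protected abstract T reuseOrCreate(Object reuse);

  T read(Object reuse) {
    return reuseOrCreate(reuse);
  }
}

final class ListStructReaderSketch extends ReusingStructReaderSketch<List<Object>> {
  @Override
  @SuppressWarnings("unchecked")
  protected List<Object> reuseOrCreate(Object reuse) {
    if (reuse instanceof List) {
      List<Object> list = (List<Object>) reuse;
      list.clear(); // drop values left over from the previous row
      return list;
    }
    return new ArrayList<>();
  }
}
```

Note that any caller still holding a reference to the previous record will see it mutated, which is why reuse of this kind needs its own tests.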


private T readInternal(T struct, ColumnVector[] columnVectors, int row) {
  for (int c = 0; c < readers.length; ++c) {
    set(struct, c, reader(c).read(columnVectors[c], row));
Contributor

You might consider a different approach. This currently mirrors what happens in Avro, where the constants are set after reading a record. That is done because Avro can't skip fields easily -- it needs to read through a value even if the value won't be used.

But columnar formats can easily skip. That's why in Parquet, we replace the column reader with a constant reader. So the struct reader behaves exactly like normal and reads a value from every child reader. But some of those children might ignore what's in the data file and return a constant. That should be more efficient because you're not materializing columns you don't need to.
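
A minimal sketch of that constant-reader idea, under stated assumptions: all types below are hypothetical stand-ins (plain `Object[]` columns instead of ORC `ColumnVector`s, and a field-ID-to-constant map), not the actual Parquet or ORC reader classes.

```java
import java.util.Map;

// Hypothetical reader interface; the real readers take a ColumnVector and a row index.
interface ValueReaderSketch {
  Object read(Object[] column, int row);
}

// Returns the value stored in the data file.
final class ColumnReaderSketch implements ValueReaderSketch {
  @Override
  public Object read(Object[] column, int row) {
    return column[row];
  }
}

// Ignores the data file and always returns a constant, e.g. an identity partition value.
final class ConstantReaderSketch implements ValueReaderSketch {
  private final Object constant;

  ConstantReaderSketch(Object constant) {
    this.constant = constant;
  }

  @Override
  public Object read(Object[] column, int row) {
    return constant; // the column is never touched, so it need not be materialized
  }
}

final class StructReaderSketch {
  private final ValueReaderSketch[] readers;

  // idToConstant maps a field ID to its constant value (an assumed representation).
  StructReaderSketch(int[] fieldIds, Map<Integer, Object> idToConstant) {
    readers = new ValueReaderSketch[fieldIds.length];
    for (int c = 0; c < fieldIds.length; c++) {
      readers[c] = idToConstant.containsKey(fieldIds[c])
          ? new ConstantReaderSketch(idToConstant.get(fieldIds[c]))
          : new ColumnReaderSketch();
    }
  }

  // The struct reader has no special case: it reads one value from every child.
  Object[] read(Object[][] columns, int row) {
    Object[] struct = new Object[readers.length];
    for (int c = 0; c < readers.length; c++) {
      struct[c] = readers[c].read(columns[c], row);
    }
    return struct;
  }
}
```

The choice between a file value and a constant is made once, when the readers are built, so the per-row loop stays identical to the unpartitioned case.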

Contributor Author

Is it OK if I tackle this in a follow-up?

Contributor

Yeah, that sounds good.


try (CloseableIterable<InternalRow> reader = ORC.read(Files.localInput(testFile))
    .project(schema)
    .createReaderFunc(SparkOrcReader::new)
Contributor

I think this should test with and without container reuse if that is implemented in this PR. Probably just make this test parameterized.

Contributor Author

For now I've removed the reuse code. We can tackle that in a follow-up.

Contributor

@rdblue rdblue left a comment

Nice work, @rdsr! The main change needed is to fix or remove container reuse because that's a correctness problem.

@rdsr rdsr changed the title from "[WIP] Orc nested Identity partition support" to "[WIP] ORC nested Identity partition support" on May 7, 2020
@rdsr
Contributor Author

rdsr commented May 20, 2020

Thanks, @rdblue. Picking this up again. I should address your comments soon.

}
}

orcType.setAttribute(ICEBERG_ID_ATTRIBUTE, fieldId.toString());
Contributor

For completeness' sake, should this also set ICEBERG_REQUIRED_ATTRIBUTE?

Contributor Author

Adding this actually causes failures:

org.apache.iceberg.data.orc.TestGenericReadProjection > testRenamedAddedField FAILED
    java.lang.IllegalArgumentException: No conversion of type LONG to self needed
        at org.apache.orc.impl.ConvertTreeReaderFactory.createAnyIntegerConvertTreeReader(ConvertTreeReaderFactory.java:1671)
        at org.apache.orc.impl.ConvertTreeReaderFactory.createConvertTreeReader(ConvertTreeReaderFactory.java:2124)
        at org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2331)
        at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.<init>(TreeReaderFactory.java:1961)
        at org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2371)
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:227)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:752)
        at org.apache.iceberg.orc.OrcIterable.newOrcIterator(OrcIterable.java:80)
        at org.apache.iceberg.orc.OrcIterable.iterator(OrcIterable.java:65)
        at com.google.common.collect.Iterables.getOnlyElement(Iterables.java:254)
        at org.apache.iceberg.data.orc.TestGenericReadProjection.writeAndRead(TestGenericReadProjection.java:53)

I vaguely remember that we fixed a similar bug in ORC before.

Contributor

It would be great to know what's going on here. Since this is just a projection schema and the reader is built with the Iceberg schema (that has required/optional), I don't think it is really a blocker. But setting a property here shouldn't cause ORC to fail, right?

Contributor Author

I'll file the necessary followups.

@rdsr
Contributor Author

rdsr commented May 22, 2020

@rdblue This is ready for another round of review

@rdblue
Contributor

rdblue commented May 22, 2020

This looks good to me, except for the conflicts. Can you rebase and we can commit it?

@rdsr
Contributor Author

rdsr commented May 22, 2020

Weird. I thought I did resolve the conflicts.

@rdsr
Contributor Author

rdsr commented May 22, 2020

@rdblue, I fixed the conflicts and filed all the necessary follow-ups.

@rdblue rdblue changed the title from "[WIP] ORC nested Identity partition support" to "ORC: Supported nested identity partition data" on May 22, 2020
@rdblue rdblue merged commit 17caf95 into apache:master May 22, 2020
@rdblue
Contributor

rdblue commented May 22, 2020

Looks good to me! I'll merge it. Thanks, @rdsr!

Development

Successfully merging this pull request may close these issues.

ORC: Support partition values from a constant map

4 participants