ORC: Supported nested identity partition data #989
Conversation
cc @edgarRd
I reran the Spark jmh tests which @shardulm94 had written, on a fresh checkout of Iceberg and against my patch. Here's the output:
[benchmark output: fresh checkout of Iceberg]
[benchmark output: with the ORC nested partition patch applied]
Seems like there isn't much difference. cc @shardulm94
Force-pushed 96881f5 to 8b635a0
| "Error in ORC file, children fields and names do not match."); | ||
|
|
||
| List<Types.NestedField> icebergFields = Lists.newArrayListWithExpectedSize(children.size()); | ||
| // TODO how we get field ids from ORC schema |
I just noticed the logic here and it's a correctness bug. ORC should not assign column IDs when one is missing. Instead, it should ignore the field.
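A minimal sketch of the suggested behavior, assuming the visitor reads IDs back from the ICEBERG_ID_ATTRIBUTE set at write time; the names, the visit helper, and the use of optional fields are illustrative, not this PR's code:

// Sketch only: ignore ORC fields that carry no Iceberg ID attribute
// instead of assigning a fresh ID to them.
List<Types.NestedField> icebergFields = Lists.newArrayListWithExpectedSize(children.size());
List<String> names = record.getFieldNames();
for (int i = 0; i < children.size(); i += 1) {
  TypeDescription child = children.get(i);
  String idString = child.getAttributeValue(ICEBERG_ID_ATTRIBUTE);
  if (idString == null) {
    continue; // no ID: skip the field rather than inventing one
  }
  icebergFields.add(Types.NestedField.optional(
      Integer.parseInt(idString), names.get(i), visit(child)));
}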
Should we use another PR to fix this?
protected abstract T create();

protected abstract T reuseOrCreate();
Why not pass the possibly reused object in here?
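For reference, a sketch of the suggested signature change (the parameter name is illustrative):

// Sketch: let the caller pass the candidate container in, so the reader
// itself stays stateless about reuse.
protected abstract T reuseOrCreate(Object reuse);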
private T readInternal(T struct, ColumnVector[] columnVectors, int row) {
  for (int c = 0; c < readers.length; ++c) {
    set(struct, c, reader(c).read(columnVectors[c], row));
You might consider a different approach. This currently mirrors what happens in Avro, where the constants are set after reading a record. That is done because Avro can't skip fields easily -- it needs to read through a value even if the value won't be used.
But columnar formats can easily skip. That's why in Parquet, we replace the column reader with a constant reader. So the struct reader behaves exactly like normal and reads a value from every child reader, but some of those children might ignore what's in the data file and return a constant. That should be more efficient because you're not materializing columns you don't need.
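A minimal sketch of that approach against a reader contract like the read(ColumnVector, int) call above; the interface and class names are illustrative, assuming this PR's value-reader shape:

// Sketch: a child reader that never touches the column vector and simply
// returns a constant, e.g. an identity-partition value from the file's
// partition tuple. The struct reader treats it like any other child.
static class ConstantReader<C> implements OrcValueReader<C> {
  private final C constant;

  ConstantReader(C constant) {
    this.constant = constant;
  }

  @Override
  public C read(ColumnVector ignored, int row) {
    return constant; // the column is never materialized
  }
}

With that in place, the struct reader just installs a ConstantReader at the position of each identity-partition column when it is built.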
Is it ok if I tackle this in a followup?
Yeah, that sounds good.
try (CloseableIterable<InternalRow> reader = ORC.read(Files.localInput(testFile))
    .project(schema)
    .createReaderFunc(SparkOrcReader::new)
I think this should test with and without container reuse if that is implemented in this PR. Probably just make this test parameterized.
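A sketch of that parameterization, assuming JUnit 4 (which the project used at the time); the class name and reuseContainers flag are illustrative:

// Sketch: run every test case once with and once without container reuse.
@RunWith(Parameterized.class)
public class TestSparkOrcReader {
  @Parameterized.Parameters(name = "reuseContainers = {0}")
  public static Object[] parameters() {
    return new Object[] { false, true };
  }

  private final boolean reuseContainers;

  public TestSparkOrcReader(boolean reuseContainers) {
    this.reuseContainers = reuseContainers;
  }

  // test methods would branch on reuseContainers when building the ORC reader
}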
For now I've removed the reuse code. We can tackle that in a followup.
rdblue left a comment:
Nice work, @rdsr! The main change needed is to fix or remove container reuse because that's a correctness problem.
Thanks @rdblue. Picking this up again. Should address your comments soon.
  }
}

orcType.setAttribute(ICEBERG_ID_ATTRIBUTE, fieldId.toString());
For completeness' sake, also set ICEBERG_REQUIRED_ATTRIBUTE?
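Something like the following, assuming the attribute should mirror the Iceberg field's required flag (a sketch, not the PR's code):

// Sketch: record requiredness next to the field ID on the ORC type.
orcType.setAttribute(ICEBERG_ID_ATTRIBUTE, fieldId.toString());
orcType.setAttribute(ICEBERG_REQUIRED_ATTRIBUTE, String.valueOf(field.isRequired()));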
Adding this is actually causing failures:
org.apache.iceberg.data.orc.TestGenericReadProjection > testRenamedAddedField FAILED
java.lang.IllegalArgumentException: No conversion of type LONG to self needed
at org.apache.orc.impl.ConvertTreeReaderFactory.createAnyIntegerConvertTreeReader(ConvertTreeReaderFactory.java:1671)
at org.apache.orc.impl.ConvertTreeReaderFactory.createConvertTreeReader(ConvertTreeReaderFactory.java:2124)
at org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2331)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.<init>(TreeReaderFactory.java:1961)
at org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2371)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:227)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:752)
at org.apache.iceberg.orc.OrcIterable.newOrcIterator(OrcIterable.java:80)
at org.apache.iceberg.orc.OrcIterable.iterator(OrcIterable.java:65)
at com.google.common.collect.Iterables.getOnlyElement(Iterables.java:254)
at org.apache.iceberg.data.orc.TestGenericReadProjection.writeAndRead(TestGenericReadProjection.java:53)
And I vaguely remember we fixed a similar bug in ORC before.
It would be great to know what's going on here. Since this is just a projection schema and the reader is built with the Iceberg schema (that has required/optional), I don't think it is really a blocker. But setting a property here shouldn't cause ORC to fail, right?
I'll file the necessary followups.
@rdblue This is ready for another round of review.
This looks good to me, except for the conflicts. Can you rebase and we can commit it?

Weird. I thought I did resolve the conflicts.
Force-pushed 1d943e5 to b973d4b
@rdblue Fixed conflicts and filed all the necessary followups.
Looks good to me! I'll merge it. Thanks, @rdsr!
Fixes #897
The following changes are made:
- Removed UnsafeRow and used GenericInternalRow instead, similar to the Spark Avro and Parquet readers.
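For context, a minimal sketch of what that buys, using Spark's GenericInternalRow API (the column values are illustrative):

// GenericInternalRow is a boxed, mutable row that can be filled field by
// field, which suits per-column readers; UnsafeRow needs an UnsafeRowWriter
// and careful offset management instead.
GenericInternalRow row = new GenericInternalRow(2);
row.update(0, 34L);                          // long column
row.update(1, UTF8String.fromString("orc")); // string column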