Spark: Support ORC vectorized reads #1189
Conversation
Thanks, @shardulm94! It's great to see a PR for this. I'll try to get some time to review it.

Do we know why the nested data benchmarks show such a big improvement? Are we running correctness tests for the same cases to make sure we aren't dropping data by accident? It just seems a bit too good.

Which cases are you comparing for nested data?

That's the answer I was looking for. It was really surprising that Spark vectorized was so slow. Thanks!
Review threads (resolved) on:
spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkOrcReaders.java (outdated)
spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java (outdated)
@Override
public int getInt(int rowId) {
  Integer value = (Integer) primitiveValueReader.read(vector, rowId);
  return value != null ? value : 0;
}
Why this extra check just to add a default value? If it is okay to return a default value, then getInt should never be called for a rowId where the value is null. And if this is never called when the value is null, then I'd rather directly cast. That way, a NullPointerException is thrown if the method contract is violated instead of silently returning the wrong value.
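A minimal sketch of the change being suggested here, reusing the names from the snippet above; this is an illustration of the reviewer's point, not the PR's final code:

@Override
public int getInt(int rowId) {
  // Callers are expected to check isNullAt(rowId) before calling getInt, so a null
  // here is a contract violation. Casting and unboxing directly surfaces that as a
  // NullPointerException instead of silently returning 0.
  return (Integer) primitiveValueReader.read(vector, rowId);
}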
Yep! That makes sense. If the reader is not going to check for null, then returning a default value would be erroneous. One issue I see is that Spark's own ColumnVector code seems to violate this at https://github.com/apache/spark/blob/d5b903e38556ee3e8e1eb8f71a08e232afa4e36a/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java#L92 and similar methods within the same class.
In Spark 2.4 these methods are not actually exercised: they are only called from https://github.com/apache/spark/blob/d5b903e38556ee3e8e1eb8f71a08e232afa4e36a/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java#L53, and those call sites are themselves unused.
In Spark 3.0, they are used at https://github.com/apache/spark/blob/3b0aee3f9557ae1666b63665b08d899fe7682852/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java#L55, but this seems incorrect for the reasons you mentioned above. I am unsure when the copy method would actually be triggered though.
Looks like other implementations in Spark just return from the underlying vector, without checking. So these methods would always return something. Arrow is an exception: if null checking is enabled (the default) then it will check and throw IllegalStateException.
I think it still makes sense to throw the NullPointerException, even if copy in Spark 3 would use it. I don't see any uses of ArrayData.copy in Spark 3, and any problem would only affect arrays, so the impact is limited. Plus, other sources (Arrow) break in this case and it appears to be overlooked. I think it makes sense to throw an exception to prevent a copy that corrupts the data by dropping null values.
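To make the data-corruption scenario concrete, here is a hypothetical copy loop (made-up names, not code from Spark or this PR) showing how a default-returning getter interacts badly with copying code that skips the null check:

// Hypothetical: copying an int column without consulting the null mask.
// If getInt(i) returns 0 for null slots instead of throwing, every null
// silently becomes a real 0 in the copy and the null information is lost.
int[] copied = new int[numRows];
for (int i = 0; i < numRows; i++) {
  copied[i] = columnVector.getInt(i);  // should be guarded by columnVector.isNullAt(i)
}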
The only blocker I see is that the batch iterator doesn't pass the close call down to the underlying ORC reader. Otherwise, everything is minor. Awesome work, @shardulm94!
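A rough sketch of the kind of fix being asked for, assuming the batch iterator wraps an ORC reader that implements Closeable; the class and field names below are illustrative, not the PR's actual code:

// Illustrative only: a batch iterator that forwards close() to the wrapped
// ORC reader so the underlying file handles are released when the scan ends.
class ClosingBatchIterator implements java.io.Closeable {
  private final java.io.Closeable orcRowReader;

  ClosingBatchIterator(java.io.Closeable orcRowReader) {
    this.orcRowReader = orcRowReader;
  }

  @Override
  public void close() throws java.io.IOException {
    orcRowReader.close();  // propagate close to the underlying ORC reader
  }
}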
…le source falls back to reading row by row
Force-pushed from 6a5e8c5 to 7d0cf1c
Nice work, @shardulm94! I'm merging this because the blockers have been resolved and this is large. We can follow up with a change for the way nulls are handled in getters. I don't think that is a blocker because there are examples in Spark of both ignoring the issue and throwing exceptions.
PR summary:
- Adds VectorizedSparkOrcReaders for vectorized ORC reads
- Replaces NullValuesColumnVector with ConstantColumnVector, which can support any constant (including nulls); a rough illustration appears after the benchmark note below
- Moves shared logic from RowDataReader into BaseDataReader so that it can be reused in BatchDataReader
- Changes TestFilteredScan in Spark3 to use IcebergGenerics instead of Avro, just like in Spark2. This enables us to test ORC reads/writes, which do not have an Avro GenericRecord writer.

Benchmark results:
Tests reads of 10 files with 5M records each
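As a rough, framework-independent illustration of the ConstantColumnVector idea from the summary above (this does not extend Spark's ColumnVector API, and every name here is made up):

// Illustrative sketch: a column that returns the same constant for every row.
// Passing null makes it behave like a column that is null for all rows.
class ConstantIntColumn {
  private final Integer constant;  // null means every row is null

  ConstantIntColumn(Integer constant) {
    this.constant = constant;
  }

  boolean isNullAt(int rowId) {
    return constant == null;
  }

  int getInt(int rowId) {
    // Callers must check isNullAt first; unboxing a null constant throws NPE,
    // consistent with the getter contract discussed earlier in this thread.
    return constant;
  }
}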