-
Notifications
You must be signed in to change notification settings - Fork 3k
ORC: Fix importing ORC files with float and double columns and test #3332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC: Fix importing ORC files with float and double columns and test #3332
Conversation
| if (type.typeId() == Type.TypeID.FLOAT) { | ||
| max = ((Double) max).floatValue(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested without this fix and encountered the error, so I'm sure that the test does in fact hit the problem.
...ark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java
Outdated
Show resolved
Hide resolved
...ark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java
Show resolved
Hide resolved
|
This looks good to me. To unblock the 0.12.1 release, we should get this in and fix the minor issue with the tests later. |
bc41476 to
8e04370
Compare
| // To avoid storing NaN in the Iceberg metrics, NaN is normalized to +/- Infinity for max / min respectively. | ||
| private static Object normalizeFloatingPointColumnsIfNeeded(Bound bound, Type type, double value) { | ||
| if (type.typeId() == Type.TypeID.DOUBLE) { | ||
| return Double.isNaN(value) ? (bound == Bound.UPPER ? Double.POSITIVE_INFINITY : Double.NEGATIVE_INFINITY) : value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a simpler method, replaceNaN(double value, double replacement)? Then you just need the ternary check here and you can customize with Double.NEGATIVE_INFINITY in the call rather than passing a Bound.
Also, you only need one since the bound is always a double to begin with. So you can replace the logic with:
max = replaceNaN(((DoubleColumnStatistics) columnStats).getMaximum(), Double.POSITIVE_INFINITY);
if (type.typeId() == Type.TypeID.FLOAT) {
max = ((Double) value).floatValue;
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that's a good idea. Wasn't a fan of the double ternary.
|
Ah that's good idea. Wasn't a big fan of the multiple ternaries.
…On Fri, Oct 29, 2021 at 10:20 AM Ryan Blue ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java
<#3332 (comment)>:
> @@ -262,6 +269,16 @@ private static Metrics buildOrcMetrics(final long numOfRows, final TypeDescripti
return Optional.ofNullable(Conversions.toByteBuffer(type, truncateIfNeeded(Bound.UPPER, type, max, metricsMode)));
}
+ // ORC uses NaN in its metrics for floating point numbers (float and double).
+ // To avoid storing NaN in the Iceberg metrics, NaN is normalized to +/- Infinity for max / min respectively.
+ private static Object normalizeFloatingPointColumnsIfNeeded(Bound bound, Type type, double value) {
+ if (type.typeId() == Type.TypeID.DOUBLE) {
+ return Double.isNaN(value) ? (bound == Bound.UPPER ? Double.POSITIVE_INFINITY : Double.NEGATIVE_INFINITY) : value;
What about a simpler method, replaceNaN(double value, double replacement)?
Then you just need the ternary check here and you can customize with
Double.NEGATIVE_INFINITY in the call rather than passing a Bound.
Also, you only need one since the bound is always a double to begin with.
So you can replace the logic with:
max = replaceNaN(((DoubleColumnStatistics) columnStats).getMaximum(), Double.POSITIVE_INFINITY);
if (type.typeId() == Type.TypeID.FLOAT) {
max = ((Double) value).floatValue;
}
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#3332 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACLAXEQNIVK7XX2UOEYFGFLUJLJUDANCNFSM5GMSUBDA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
|
Thanks, @kbendick! |
The OrcMetrics code assumed that Iceberg metrics would be available, but that isn't the case when importing existing ORC files.
This adds tests for the changes proposed in #3320 to support importing ORC files with floats and doubles in them due to this metrics behavior.
This also adds the same fix for the metrics situation for
max.I had to update the Spark3 Extensions gradle build file. But given that iceberg-data and iceberg-orc are already added as
implementationin the top level spark project, I'm wondering if this is ok? Otherwise I can find a different route to test it.