ORC: Fix importing ORC files with float and double columns and test #3332

kbendick · 2021-10-20T21:08:24Z

The OrcMetrics code assumed that Iceberg metrics would be available, but that isn't the case when importing existing ORC files.

This adds tests for the changes proposed in #3320, which make it possible to import ORC files containing float and double columns despite this metrics behavior.

This also applies the same fix to the max metric.

I had to update the Spark3 Extensions Gradle build file. Given that iceberg-data and iceberg-orc are already added as implementation dependencies in the top-level Spark project, is this OK? Otherwise I can find a different way to test it.

spark/v3.0/build.gradle

kbendick · 2021-10-20T21:11:30Z

orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java

+        if (type.typeId() == Type.TypeID.FLOAT) {
+          max = ((Double) max).floatValue();
+        }
+      }


I tested without this fix and encountered the error, so I'm sure that the test does in fact hit the problem.
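A minimal sketch of why the narrowing in the diff above could matter. It assumes, without confirmation from this thread, that the failure comes from code downstream expecting a `Float` bound for float columns while ORC statistics supply a `Double`. The class and method names (`FloatBoundSketch`, `asFloat`, `narrow`) are hypothetical and not part of Iceberg.

```java
public class FloatBoundSketch {

  // Hypothetical stand-in for a serializer that expects a Float for float columns.
  // Casting a boxed Double to Float throws ClassCastException.
  static float asFloat(Object bound) {
    return (Float) bound;
  }

  // Narrow a boxed Double statistic to a boxed Float, as the diff does for FLOAT columns.
  static Object narrow(Object max) {
    return ((Double) max).floatValue();
  }

  public static void main(String[] args) {
    Object max = 3.5d; // statistic held as a Double
    try {
      asFloat(max);
    } catch (ClassCastException e) {
      System.out.println("Double bound rejected");
    }
    System.out.println(asFloat(narrow(max)));
  }
}
```

In this sketch the unconverted `Double` is rejected, while the narrowed value passes through as a `Float`.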

...ark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java

orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java

rdblue · 2021-10-26T21:47:10Z

This looks good to me. To unblock the 0.12.1 release, we should get this in and fix the minor issue with the tests later.

kbendick · 2021-10-27T00:17:39Z

@yyanyy @rdblue I updated this to handle non-finite metrics from ORC files. Can you please take a look? Thanks!

orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java

rdblue · 2021-10-29T17:19:50Z

orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java

+  // To avoid storing NaN in the Iceberg metrics, NaN is normalized to +/- Infinity for max / min respectively.
+  private static Object normalizeFloatingPointColumnsIfNeeded(Bound bound, Type type, double value) {
+    if (type.typeId() == Type.TypeID.DOUBLE) {
+      return Double.isNaN(value) ? (bound == Bound.UPPER ? Double.POSITIVE_INFINITY : Double.NEGATIVE_INFINITY) : value;


What about a simpler method, replaceNaN(double value, double replacement)? Then you just need the ternary check here and you can customize with Double.NEGATIVE_INFINITY in the call rather than passing a Bound.

Also, you only need one since the bound is always a double to begin with. So you can replace the logic with:

max = replaceNaN(((DoubleColumnStatistics) columnStats).getMaximum(), Double.POSITIVE_INFINITY);
if (type.typeId() == Type.TypeID.FLOAT) {
  max = ((Double) max).floatValue();
}
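The suggestion above can be sketched as a self-contained example. This is an illustration under stated assumptions, not the merged implementation: the class `ReplaceNaNSketch`, the `upperBound` and `lowerBound` methods, and the `isFloatColumn` flag (standing in for the `type.typeId() == Type.TypeID.FLOAT` check) are hypothetical. Only the `replaceNaN(double value, double replacement)` signature comes from the review comment.

```java
public class ReplaceNaNSketch {

  // Helper with the signature proposed in the review: swap NaN for a caller-chosen value.
  static double replaceNaN(double value, double replacement) {
    return Double.isNaN(value) ? replacement : value;
  }

  // Hypothetical upper-bound logic: NaN becomes +Infinity, then narrow for float columns.
  static Object upperBound(double orcMaximum, boolean isFloatColumn) {
    Object max = replaceNaN(orcMaximum, Double.POSITIVE_INFINITY);
    if (isFloatColumn) {
      max = ((Double) max).floatValue();
    }
    return max;
  }

  // Hypothetical lower-bound logic: NaN becomes -Infinity, then narrow for float columns.
  static Object lowerBound(double orcMinimum, boolean isFloatColumn) {
    Object min = replaceNaN(orcMinimum, Double.NEGATIVE_INFINITY);
    if (isFloatColumn) {
      min = ((Double) min).floatValue();
    }
    return min;
  }

  public static void main(String[] args) {
    System.out.println(upperBound(Double.NaN, false)); // prints "Infinity"
    System.out.println(lowerBound(Double.NaN, true));  // prints "-Infinity"
    System.out.println(upperBound(1.5, true));         // prints "1.5"
  }
}
```

Passing the replacement value at the call site removes the need for a `Bound` parameter and the nested ternary, and a single `double`-based helper serves both float and double columns because the ORC statistic is a double to begin with.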

Ah that's a good idea. Wasn't a fan of the double ternary.

kbendick · 2021-10-29T17:21:27Z

Ah that's a good idea. Wasn't a big fan of the multiple ternaries.

rdblue · 2021-10-29T19:02:33Z

Thanks, @kbendick!

kbendick requested a review from rdblue October 20, 2021 21:08

github-actions bot added build ORC spark labels Oct 20, 2021

kbendick changed the title ~~Fix importing orc files double float metrics~~ Test for fix importing orc files with doubles and floats Oct 20, 2021

kbendick commented Oct 20, 2021

spark/v3.0/build.gradle

kbendick commented Oct 20, 2021

kbendick mentioned this pull request Oct 20, 2021

ORC: Fix importing ORC files with float and double columns #3320

Closed

kbendick changed the title ~~Test for fix importing orc files with doubles and floats~~ ORC: Fix importing ORC files with float and double columns and test Oct 20, 2021

rdblue reviewed Oct 20, 2021

...ark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java

rdblue reviewed Oct 20, 2021

...ark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java

yyanyy reviewed Oct 21, 2021

orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java

rdblue added this to the Java 0.12.1 Release milestone Oct 26, 2021

rdblue approved these changes Oct 26, 2021

kbendick added 7 commits October 26, 2021 14:46

WIP with included fixes so can hop to another branch

bfba1d9

work on manually writing an ORC file to import

6908f85

Test added in TestAddFileProcedure which causes the bug

7d05e0f

Fix max and have fully working tests after updating build file

7b3c539

Remove unused preconditions import

fce4888

Fix checkstyle missing leading space after commas

cd195b0

Remove unnecessary space

8e04370

kbendick force-pushed the fix-importing-orc-files-double-float-metrics branch from bc41476 to 8e04370 Compare October 26, 2021 21:47

kbendick added 2 commits October 26, 2021 16:33

Check if ORC is providing NaN for stats and reassign Iceberg stats to infinity

ddeeda7

Add note to update to move test file off of Iceberg schema

1b9b181

kbendick commented Oct 27, 2021

orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java

rdblue reviewed Oct 27, 2021

orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java

rdblue reviewed Oct 27, 2021

orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java

Handle doubles that might be NaN and only check for NaN not strictly finite values

4ec3448

kbendick added 3 commits October 27, 2021 13:31

Fix cyclomatic complexity style check errors

b41aeb5

Workaround cyclomatic complexity check via helper method instead of Optional

2737c6b

Avoid unnecessary boxing and unboxing of primitive types

4898f93

rdblue reviewed Oct 29, 2021

Avoid double ternary and simplify

6a3c966

rdblue approved these changes Oct 29, 2021

rdblue merged commit ddc5aff into apache:master Oct 29, 2021

kbendick added a commit to kbendick/iceberg that referenced this pull request Nov 1, 2021

ORC: Fix importing ORC files with float and double columns and test (apache#3332)

b9c7be7

rdblue pushed a commit that referenced this pull request Nov 1, 2021

ORC: Fix importing ORC files with float and double columns and test (#3332)

eec6ff6

izchen pushed a commit to izchen/iceberg that referenced this pull request Dec 7, 2021

ORC: Fix importing ORC files with float and double columns and test (apache#3332)

b20599a
