Conversation

@kbendick (Contributor) commented Oct 20, 2021

The OrcMetrics code assumed that Iceberg metrics would be available, but that isn't the case when importing existing ORC files.

This adds tests for the changes proposed in #3320, which allow importing ORC files containing float and double columns despite this metrics behavior.

This also applies the same metrics fix to the max bound.
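
For illustration, a minimal sketch of what that fix looks like on the import path (a hedged sketch, not the exact patch: columnStats and type stand in for the locals used in OrcMetrics, and the real change is in the diff discussed below):

  // Sketch only: ORC reports FLOAT column statistics through DoubleColumnStatistics,
  // so the double bound is narrowed to a Float before it is stored as an Iceberg bound.
  Object min = ((DoubleColumnStatistics) columnStats).getMinimum();
  if (type.typeId() == Type.TypeID.FLOAT) {
    min = ((Double) min).floatValue();
  }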

I had to update the Spark3 Extensions Gradle build file. But given that iceberg-data and iceberg-orc are already declared as implementation dependencies in the top-level Spark project, I'm wondering if this is OK? Otherwise I can find a different route to test it.

kbendick requested a review from rdblue on October 20, 2021 at 21:08
kbendick changed the title from "Fix importing orc files double float metrics" to "Test for fix importing orc files with doubles and floats" on Oct 20, 2021
      if (type.typeId() == Type.TypeID.FLOAT) {
        max = ((Double) max).floatValue();
      }
    }
@kbendick (Contributor, Author) commented on this diff:

I tested without this fix and encountered the error, so I'm sure that the test does in fact hit the problem.

kbendick changed the title from "Test for fix importing orc files with doubles and floats" to "ORC: Fix importing ORC files with float and double columns and test" on Oct 20, 2021
rdblue added this to the Java 0.12.1 Release milestone on Oct 26, 2021
@rdblue (Contributor) commented Oct 26, 2021

This looks good to me. To unblock the 0.12.1 release, we should get this in and fix the minor issue with the tests later.

kbendick force-pushed the fix-importing-orc-files-double-float-metrics branch from bc41476 to 8e04370 on October 26, 2021 at 21:47
@kbendick (Contributor, Author) commented:

@yyanyy @rdblue I updated to handle non-finite metrics from Orc files. Can you please take a look? Thanks!

  // To avoid storing NaN in the Iceberg metrics, NaN is normalized to +/- Infinity for max / min respectively.
  private static Object normalizeFloatingPointColumnsIfNeeded(Bound bound, Type type, double value) {
    if (type.typeId() == Type.TypeID.DOUBLE) {
      return Double.isNaN(value) ? (bound == Bound.UPPER ? Double.POSITIVE_INFINITY : Double.NEGATIVE_INFINITY) : value;
A reviewer (Contributor) commented on this diff:

What about a simpler method, replaceNaN(double value, double replacement)? Then you just need the ternary check here and you can customize with Double.NEGATIVE_INFINITY in the call rather than passing a Bound.

Also, you only need one variant since the bound is always a double to begin with. So you can replace the logic with:

  max = replaceNaN(((DoubleColumnStatistics) columnStats).getMaximum(), Double.POSITIVE_INFINITY);
  if (type.typeId() == Type.TypeID.FLOAT) {
    max = ((Double) max).floatValue();
  }
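
A minimal sketch of such a helper (hypothetical name and placement, not necessarily the merged code):

  // Hedged sketch of the suggested helper: returns the replacement only when the value is NaN.
  private static double replaceNaN(double value, double replacement) {
    return Double.isNaN(value) ? replacement : value;
  }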

@kbendick (Contributor, Author) replied:

Ah that's a good idea. Wasn't a fan of the double ternary.

@kbendick (Contributor, Author) commented Oct 29, 2021 via email

rdblue merged commit ddc5aff into apache:master on Oct 29, 2021
@rdblue (Contributor) commented Oct 29, 2021

Thanks, @kbendick!
