
@sunchao
Member

@sunchao sunchao commented Sep 17, 2020

What changes were proposed in this pull request?

In SPARK-24994 we implemented unwrapping cast for integral types. This extends it to support numeric types such as float/double/decimal, so that filters involving these types can be better pushed down to data sources.

Unlike the integral-type cases, conversions between numeric types can round the value up or down. Consider the following case:

cast(e as double) < 1.9

Assume the type of e is short. Since 1.9 is not representable in that type, the cast will either truncate or round it. Now suppose the literal is truncated; we cannot convert the expression to:

e < cast(1.9 as short)

as in the previous implementation: if e is 1, the original expression evaluates to true, but the converted expression evaluates to false.

To resolve the above, this PR first determines whether casting the literal from the wider type to the narrower type truncates or rounds it, by comparing a roundtrip value, derived by converting the literal first to the narrower type and then back to the wider type, against the original literal value. For instance, in the example above we first obtain a roundtrip value via the conversion (double) 1.9 -> (short) 1 -> (double) 1.0, and then compare it against 1.9.

<img width="1153" alt="Screen Shot 2020-09-28 at 3 30 27 PM" src="https://user-images.githubusercontent.com/506679/94492719-bd29e780-019f-11eb-9111-71d6e3d157f7.png">

Now in the case of truncate, we'd convert the original expression to:

e <= cast(1.9 as short)

instead, so that the conversion is also valid when e is 1.
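The roundtrip check can be sketched in plain Scala (the helper name and the short/double pairing here are my own illustration, not the PR's actual code):

```scala
// Compare a roundtrip value (wider -> narrower -> wider) against the
// original literal to learn which way the narrowing cast moved it:
// negative result = truncated/rounded down, positive = rounded up,
// zero = exactly representable (no rewrite of the comparison needed).
def roundtripCompare(lit: Double): Int = {
  val narrowed: Short = lit.toShort          // (double) 1.9 -> (short) 1
  val roundtrip: Double = narrowed.toDouble  // (short) 1 -> (double) 1.0
  java.lang.Double.compare(roundtrip, lit)
}

// 1.9 was truncated down, so `cast(e as double) < 1.9` must become
// `e <= cast(1.9 as short)` rather than `e < cast(1.9 as short)`.
val truncated = roundtripCompare(1.9) < 0
```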

For more details, please check [this blog post](https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html) by Presto, which offers a very good explanation of how it works.

Why are the changes needed?

For queries such as:

SELECT * FROM tbl WHERE short_col < 100.5

The predicate short_col < 100.5 can't be pushed down to data sources because it involves casts. This eliminates the cast so these queries can run more efficiently.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

@SparkQA

SparkQA commented Sep 18, 2020

Test build #128838 has finished for PR 29792 at commit 65833e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao changed the title [WIP][SPARK-32858][SQL] UnwrapCastInBinaryComparison: support other numeric types [SPARK-32858][SQL] UnwrapCastInBinaryComparison: support other numeric types Sep 22, 2020
@dbtsai
Member

dbtsai commented Sep 23, 2020

assertEquivalent(castInt(f) < v.toInt, falseIfNotNull(f))

val d = Float.NegativeInfinity
assertEquivalent(castDouble(f2) > d.toDouble, f2 =!= d)
Contributor

is casting double to float rounding up or rounding down?

Member Author

it is rounding down, see below for a test on this.

assertEquivalent(castDouble(f) <= doubleValue, f <= doubleValue.toShort)
assertEquivalent(castDouble(f) < doubleValue, f <= doubleValue.toShort)

// Cases for rounding up: 3.14 will be rounded to 3.14000010... after casting to float
Contributor

so casting double to float can round either up or down, depending on the value?

Member Author
@sunchao sunchao Sep 23, 2020

@cloud-fan Sorry, I was wrong in the above comment (somehow I was thinking of casting from double to short there).

Yes, it appears that casting from double to float can round either up or down, depending on the value:

scala> val x = 0.39999999
x: Double = 0.39999999

scala> val y = x.toFloat
y: Float = 0.39999998

scala> val x = 0.49999999
x: Double = 0.49999999

scala> val y = x.toFloat
y: Float = 0.5

To clarify: the rounding up shows after the casting roundtrip double 3.14 -> float 3.14 -> double 3.14000010...

Contributor

This is an important point. Can we explain in the PR description how we know whether it's rounding up or down?

Member Author

Yup, will do. This is a good point.

Member Author

@cloud-fan updated the description. Please take another look. Thanks!

case ShortType => Some((Short.MinValue, Short.MaxValue))
case IntegerType => Some((Int.MinValue, Int.MaxValue))
case LongType => Some((Long.MinValue, Long.MaxValue))
case FloatType => Some((Float.NegativeInfinity, Float.NaN))
Contributor

why is the upper bound not PositiveInfinity?

Member Author

This is because PositiveInfinity is considered to be < NaN in Spark. If we treated it as the upper bound, the rules handling upper bounds would not be valid. For instance, the following expression:

cast(e as double) > double('+inf')

would be converted to

e === double('+inf')

which won't be correct if e evaluates to double('NaN').
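For reference, Spark's total ordering on doubles matches java.lang.Double.compare, which places NaN above positive infinity. A quick plain-Scala check (my own illustration, not code from the PR):

```scala
// In this total order NaN sorts above +Infinity, so +Infinity cannot be
// the type's upper bound: a value above it (NaN) still exists.
val nanAboveInf =
  java.lang.Double.compare(Double.NaN, Double.PositiveInfinity) > 0

// Note the primitive `>` behaves differently: under IEEE 754 semantics,
// every ordered comparison involving NaN is false.
val primitiveNaNGreater = Double.NaN > Double.PositiveInfinity  // false
```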

}

// When we reach to this point, it means either there is no min/max for the `fromType` (e.g.,
// decimal type), or that the literal `value` is within range `(min, max)`. For these, we
Contributor

why is it safe to skip the range check for the decimal type?

Member Author

It is safe since knowing the min/max for a type just gives us more opportunities for optimization. I skipped the decimal type here because (it seems) there is no min/max defined for DecimalType, unlike the other numeric types.

Contributor
@cloud-fan cloud-fan Oct 6, 2020

makes sense.

// narrower type. In this case we simply return the original expression.
return exp
}
val valueRoundTrip = Cast(Literal(newValue, fromType), toType).eval()
Contributor

The case I'm worried about is cast(float_col as double) cmp double_lit. It's not straightforward to me that a double -> float -> double roundtrip can tell whether the value was rounded up or down. Is it because float -> double can only round up?

Member Author

So casting double to float can round either up or down. For instance, by casting 3.14 in double to float, even though the printed value is still 3.14, the binary representation is rounded up:

3.14 in double:

0 10000000000 1001 0001 1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1111

3.14 in float

0 10000000 1001 0001 1110 1011 1000 011

Here the sign bit and exponent bits (11 and 8 bits respectively for double and float) agree between the two representations. However, in the fraction part, the last kept bit is rounded up to 1.

After casting back to double, there won't be any rounding up or down - the remaining digits are simply padded with 0:

0 10000000000 1001 0001 1110 1011 1000 0110 0000000000000000000000000000
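Since the float -> double widening is exact (it only pads zero bits), the direction of the double -> float narrowing shows up directly in a roundtrip. A quick check of both examples from this thread:

```scala
// 3.14: the last kept fraction bit rounds up, so the roundtrip value
// (3.14000010...) is strictly greater than the original double.
val roundedUp = 3.14.toFloat.toDouble > 3.14

// 0.39999999: the nearest float (0.39999998f) lies below the original
// double, so this literal rounds down instead.
val roundedDown = 0.39999999.toFloat.toDouble < 0.39999999
```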

Member

Is it defined as part of the IEEE Standard for Floating-Point Arithmetic (IEEE 754)?

Member Author

Yes, I think both the binary format and the rounding rules are specified in IEEE 754. There are a few rounding rules, and I think the default one is "round half to even" (round to nearest, ties to even).

@SparkQA

SparkQA commented Oct 1, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33926/

@SparkQA

SparkQA commented Oct 1, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33926/

@SparkQA

SparkQA commented Oct 1, 2020

Test build #129311 has finished for PR 29792 at commit 3340de4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Member Author

sunchao commented Oct 5, 2020

ping @cloud-fan - addressed your comments, could you take another look at this? thanks!

val newValue = Cast(Literal(value), fromType).eval()
if (newValue == null) {
// This means the cast failed, for instance because the value is not representable in the
// narrower type. In this case we simply return the original expression.
Contributor

can you give a real example here?

Contributor

I see, it's for decimal only. It's better to make the comment more explicit.

Member Author

yup will do - there is also a test case covering this.
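A concrete instance of such a failed cast can be sketched with plain Scala BigDecimal (my own illustration of the check a non-ANSI decimal cast must make, not the PR's code): a literal whose rounded form needs more digits than the target precision allows cannot be represented, so the cast yields null.

```scala
// Round the value to the target scale, then check the total digit count
// against the target precision; when this check fails, the (non-ANSI)
// cast produces null instead of a Decimal.
def fitsDecimal(v: BigDecimal, precision: Int, scale: Int): Boolean = {
  val rounded = v.setScale(scale, BigDecimal.RoundingMode.HALF_UP)
  rounded.precision <= precision
}

val fits = fitsDecimal(BigDecimal("12.5"), 3, 1)         // 12.5 fits decimal(3,1)
val overflows = !fitsDecimal(BigDecimal("123.45"), 3, 1) // 123.5 needs 4 digits
```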

@cloud-fan
Contributor

The patch LGTM. Can we have an end-to-end test suite for it? The current tests prove that the optimized expression tree is what we expect; it's better to also have high-level tests proving that, after optimization, the query still returns correct results.

@sunchao
Member Author

sunchao commented Oct 6, 2020

Can we have an end-to-end test suite for it?

Thanks - this is a good suggestion. Will add that.

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34172/

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34172/

@SparkQA

SparkQA commented Oct 9, 2020

Test build #129566 has finished for PR 29792 at commit 94942bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class UnwrapCastInComparisonEndToEndSuite extends QueryTest with SharedSparkSession

@sunchao
Member Author

sunchao commented Oct 9, 2020

@cloud-fan added an e2e test suite. Please take another look, thanks.

import org.apache.spark.sql.test.SharedSparkSession
import org.apache.spark.sql.types.Decimal

class UnwrapCastInComparisonEndToEndSuite extends QueryTest with SharedSparkSession {
Member

Could you add these end-to-end tests in SQLQueryTestSuite instead of making a new suite?

Member Author

Yea I can add the test there instead - I was just following the existing ReplaceNullWithFalseInPredicateEndToEndSuite though.

Contributor

I think it's fine to follow ReplaceNullWithFalseInPredicateEndToEndSuite here.

* - `cast(fromExp, toType) <= value` ==> `fromExp < cast(value, fromType)`
* - `cast(fromExp, toType) < value` ==> `fromExp < cast(value, fromType)`
*
* Similarly for the case when casting `value` to `fromType` causes rounding down.
Member

nit: wrong indent.

case IntegerType => Some((Int.MinValue, Int.MaxValue))
case LongType => Some((Long.MinValue, Long.MaxValue))
case FloatType => Some((Float.NegativeInfinity, Float.NaN))
case DoubleType => Some((Double.NegativeInfinity, Double.NaN))
Member

Looks like there are no tests for this code path, so could you add some? (NOTE: I think byte, int, and long are not tested in UnwrapCastInBinaryComparisonSuite either.)

Member Author

Will add a test case (although I think it will be pretty trivial). I only added tests for short in the previous PR because the handling for other integral types is exactly the same.

- Added test case for getRange()
- Added test for Float.PositiveInfinity and Float.MinValue/Float/MaxValue
- Move `select` after `where`
- Separate test cases
- Fix indentation
@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34310/

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34310/

@SparkQA

SparkQA commented Oct 13, 2020

Test build #129704 has finished for PR 29792 at commit 76b9f73.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Member Author

sunchao commented Oct 13, 2020

Addressed comments. Please take another look @cloud-fan @maropu . Thanks!

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in feee8da Oct 13, 2020
@sunchao sunchao deleted the SPARK-32858 branch October 13, 2020 16:29
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…c types

Closes apache#29792 from sunchao/SPARK-32858.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>