
@sunchao
Member

@sunchao sunchao commented Sep 17, 2020

What changes were proposed in this pull request?

In SPARK-24994 we implemented unwrapping cast for integral types. This extends it to support numeric types such as float/double/decimal, so that filters involving these types can be better pushed down to data sources.

Unlike the integral-type cases, conversions between numeric types can round the value up or down. Consider the following case:

cast(e as double) < 1.9

Assume the type of e is short. Since 1.9 is not representable in that type, the cast will either truncate or round it. Now suppose the literal is truncated; we cannot convert the expression to:

e < cast(1.9 as short)

as in the previous implementation: if e is 1, the original expression evaluates to true, but the converted expression evaluates to false.

To resolve the above, this PR first determines whether casting the literal from the wider type to the narrower type truncates or rounds it, by comparing a roundtrip value, derived by converting the literal first to the narrower type and then back to the wider type, against the original literal value. For instance, in the example above we first obtain a roundtrip value via the conversion (double) 1.9 -> (short) 1 -> (double) 1.0, and then compare it against 1.9.

<img width="1153" alt="Screen Shot 2020-09-28 at 3 30 27 PM" src="https://user-images.githubusercontent.com/506679/94492719-bd29e780-019f-11eb-9111-71d6e3d157f7.png">

Now in the case of truncate, we'd convert the original expression to:

e <= cast(1.9 as short)

instead, so that the conversion is also valid when e is 1.
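The roundtrip check can be sketched in plain Scala (the helper name and the short/double pairing here are my own illustration, not the PR's actual code):

```scala
// Compare a roundtrip value (wider -> narrower -> wider) against the
// original literal to learn which way the narrowing cast moved it:
// negative result = truncated/rounded down, positive = rounded up,
// zero = exactly representable (no rewrite of the comparison needed).
def roundtripCompare(lit: Double): Int = {
  val narrowed: Short = lit.toShort          // (double) 1.9 -> (short) 1
  val roundtrip: Double = narrowed.toDouble  // (short) 1 -> (double) 1.0
  java.lang.Double.compare(roundtrip, lit)
}

// 1.9 was truncated down, so `cast(e as double) < 1.9` must become
// `e <= cast(1.9 as short)` rather than `e < cast(1.9 as short)`.
val truncated = roundtripCompare(1.9) < 0
```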

For more details, please check [this blog post](https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html) by Presto, which offers a very good explanation of how it works.

Why are the changes needed?

For queries such as:

SELECT * FROM tbl WHERE short_col < 100.5

The predicate short_col < 100.5 can't be pushed down to data sources because it involves casts. This eliminates the cast so these queries can run more efficiently.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

@SparkQA

SparkQA commented Sep 18, 2020

Test build #128838 has finished for PR 29792 at commit 65833e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao changed the title [WIP][SPARK-32858][SQL] UnwrapCastInBinaryComparison: support other numeric types [SPARK-32858][SQL] UnwrapCastInBinaryComparison: support other numeric types Sep 22, 2020
@dbtsai
Member

dbtsai commented Sep 23, 2020

assertEquivalent(castInt(f) < v.toInt, falseIfNotNull(f))

val d = Float.NegativeInfinity
assertEquivalent(castDouble(f2) > d.toDouble, f2 =!= d)
Contributor

is casting double to float rounding up or rounding down?

Member Author

it is rounding down, see below for a test on this.

assertEquivalent(castDouble(f) <= doubleValue, f <= doubleValue.toShort)
assertEquivalent(castDouble(f) < doubleValue, f <= doubleValue.toShort)

// Cases for rounding up: 3.14 will be rounded to 3.14000010... after casting to float
Contributor

so casting double to float can round either up or down, depending on the value?

Member Author
@sunchao sunchao Sep 23, 2020

@cloud-fan Sorry, I was wrong in the above comment (somehow I was thinking of casting from double to short there).

Yes, it appears that casting from double to float can round either up or down, depending on the value:

scala> val x = 0.39999999
x: Double = 0.39999999

scala> val y = x.toFloat
y: Float = 0.39999998

scala> val x = 0.49999999
x: Double = 0.49999999

scala> val y = x.toFloat
y: Float = 0.5

To clarify: the rounding up shows after the casting roundtrip double 3.14 -> float 3.14 -> double 3.14000010...

Contributor

This is an important point. Can we explain in the PR description how we know whether it's rounding up or down?

Member Author

Yup, will do. This is a good point.

Member Author

@cloud-fan updated the description. Please take another look. Thanks!

case ShortType => Some((Short.MinValue, Short.MaxValue))
case IntegerType => Some((Int.MinValue, Int.MaxValue))
case LongType => Some((Long.MinValue, Long.MaxValue))
case FloatType => Some((Float.NegativeInfinity, Float.NaN))
Contributor

why is the upper bound not PositiveInfinity?

Member Author

This is because PositiveInfinity is considered to be < NaN in Spark. If we treated it as the upper bound, the rules handling upper bounds would not be valid. For instance, the following expression:

cast(e as double) > double('+inf')

would be converted to

e === double('+inf')

which won't be correct if e evaluates to double('NaN').
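For reference, Spark's total ordering on doubles matches java.lang.Double.compare, which places NaN above positive infinity. A quick plain-Scala check (my own illustration, not code from the PR):

```scala
// In this total order NaN sorts above +Infinity, so +Infinity cannot be
// the type's upper bound: a value above it (NaN) still exists.
val nanAboveInf =
  java.lang.Double.compare(Double.NaN, Double.PositiveInfinity) > 0

// Note the primitive `>` behaves differently: under IEEE 754 semantics,
// every ordered comparison involving NaN is false.
val primitiveNaNGreater = Double.NaN > Double.PositiveInfinity  // false
```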

}

// When we reach to this point, it means either there is no min/max for the `fromType` (e.g.,
// decimal type), or that the literal `value` is within range `(min, max)`. For these, we
Contributor

why is it safe to skip the range check for the decimal type?

Member Author

It is safe since knowing the min/max for a type just gives us more opportunities for optimization. I skipped the decimal type here because (it seems) there is no min/max defined for DecimalType, unlike the other numeric types.

Contributor
@cloud-fan cloud-fan Oct 6, 2020

makes sense.

// narrower type. In this case we simply return the original expression.
return exp
}
val valueRoundTrip = Cast(Literal(newValue, fromType), toType).eval()
Contributor

The case I'm worried about is cast(float_col as double) cmp double_lit. It's not straightforward to me that a double -> float -> double roundtrip can tell whether the value was rounded up or down. Is it because float -> double can only round up?

Member Author

So casting double to float can round either up or down. For instance, by casting 3.14 in double to float, even though the printed value is still 3.14, the binary representation is rounded up:

3.14 in double:

0 10000000000 1001 0001 1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1111

3.14 in float

0 10000000 1001 0001 1110 1011 1000 011

Here the sign bit and exponent bits (11 and 8 bits respectively for double and float) agree between the two representations. However, in the fraction part, the last kept bit is rounded up to 1.

After casting back to double, there won't be any rounding up or down - the remaining digits are simply padded with 0:

0 10000000000 1001 0001 1110 1011 1000 0110 0000000000000000000000000000
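Since the float -> double widening is exact (it only pads zero bits), the direction of the double -> float narrowing shows up directly in a roundtrip. A quick check of both examples from this thread:

```scala
// 3.14: the last kept fraction bit rounds up, so the roundtrip value
// (3.14000010...) is strictly greater than the original double.
val roundedUp = 3.14.toFloat.toDouble > 3.14

// 0.39999999: the nearest float (0.39999998f) lies below the original
// double, so this literal rounds down instead.
val roundedDown = 0.39999999.toFloat.toDouble < 0.39999999
```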

Member

Is it defined as part of the IEEE Standard for Floating-Point Arithmetic (IEEE 754)?

Member Author

Yes, I think both the binary format and the rounding rules are specified in IEEE 754. There are a few rounding rules, and I think the default one is "round half to even" (round to nearest, ties to even).

@SparkQA

SparkQA commented Oct 1, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33926/

@SparkQA

SparkQA commented Oct 1, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33926/

@SparkQA

SparkQA commented Oct 1, 2020

Test build #129311 has finished for PR 29792 at commit 3340de4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Member Author

sunchao commented Oct 5, 2020

ping @cloud-fan - addressed your comments, could you take another look at this? thanks!

val newValue = Cast(Literal(value), fromType).eval()
if (newValue == null) {
// This means the cast failed, for instance because the value is not representable in the
// narrower type. In this case we simply return the original expression.
Contributor

can you give a real example here?

Contributor

I see, it's for decimal only. It's better to make the comment more explicit.

Member Author

yup will do - there is also a test case covering this.
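A concrete instance of such a failed cast can be sketched with plain Scala BigDecimal (my own illustration of the check a non-ANSI decimal cast must make, not the PR's code): a literal whose rounded form needs more digits than the target precision allows cannot be represented, so the cast yields null.

```scala
// Round the value to the target scale, then check the total digit count
// against the target precision; when this check fails, the (non-ANSI)
// cast produces null instead of a Decimal.
def fitsDecimal(v: BigDecimal, precision: Int, scale: Int): Boolean = {
  val rounded = v.setScale(scale, BigDecimal.RoundingMode.HALF_UP)
  rounded.precision <= precision
}

val fits = fitsDecimal(BigDecimal("12.5"), 3, 1)         // 12.5 fits decimal(3,1)
val overflows = !fitsDecimal(BigDecimal("123.45"), 3, 1) // 123.5 needs 4 digits
```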

@cloud-fan
Contributor

The patch LGTM. Can we have an end-to-end test suite for it? The current tests prove that the optimized expression tree is what we expect; it's better to also have high-level tests proving that, after optimization, the query still returns correct results.

@sunchao
Member Author

sunchao commented Oct 6, 2020

Can we have an end-to-end test suite for it?

Thanks - this is a good suggestion. Will add that.

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34172/

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34172/

@SparkQA

SparkQA commented Oct 9, 2020

Test build #129566 has finished for PR 29792 at commit 94942bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class UnwrapCastInComparisonEndToEndSuite extends QueryTest with SharedSparkSession

@sunchao
Member Author

sunchao commented Oct 9, 2020

@cloud-fan added an e2e test suite. Please take another look, thanks.

import org.apache.spark.sql.test.SharedSparkSession
import org.apache.spark.sql.types.Decimal

class UnwrapCastInComparisonEndToEndSuite extends QueryTest with SharedSparkSession {
Member

Could you add these end-to-end tests in SQLQueryTestSuite instead of making a new suite?

Member Author

Yea I can add the test there instead - I was just following the existing ReplaceNullWithFalseInPredicateEndToEndSuite though.

Contributor

I think it's fine to follow ReplaceNullWithFalseInPredicateEndToEndSuite here.

* - `cast(fromExp, toType) <= value` ==> `fromExp < cast(value, fromType)`
* - `cast(fromExp, toType) < value` ==> `fromExp < cast(value, fromType)`
*
* Similarly for the case when casting `value` to `fromType` causes rounding down.
Member

nit: wrong indent.

case IntegerType => Some((Int.MinValue, Int.MaxValue))
case LongType => Some((Long.MinValue, Long.MaxValue))
case FloatType => Some((Float.NegativeInfinity, Float.NaN))
case DoubleType => Some((Double.NegativeInfinity, Double.NaN))
Member

Looks like there are no tests for this code path, so could you add some? (NOTE: I think byte, int, and long are not tested in UnwrapCastInBinaryComparisonSuite either.)

Member Author

Will add a test case (although I think it will be pretty trivial). I only added tests for short in the previous PR because the handling for other integral types is exactly the same.

- Added test case for getRange()
- Added test for Float.PositiveInfinity and Float.MinValue/Float/MaxValue
- Move `select` after `where`
- Separate test cases
- Fix indentation
@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34310/

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34310/

@SparkQA

SparkQA commented Oct 13, 2020

Test build #129704 has finished for PR 29792 at commit 76b9f73.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Member Author

sunchao commented Oct 13, 2020

Addressed comments. Please take another look @cloud-fan @maropu . Thanks!

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in feee8da Oct 13, 2020
@sunchao sunchao deleted the SPARK-32858 branch October 13, 2020 16:29
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…c types

Closes apache#29792 from sunchao/SPARK-32858.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>