[SPARK-32858][SQL] UnwrapCastInBinaryComparison: support other numeric types #29792
Conversation
Test build #128838 has finished for PR 29792 at commit
```scala
assertEquivalent(castInt(f) < v.toInt, falseIfNotNull(f))

val d = Float.NegativeInfinity
assertEquivalent(castDouble(f2) > d.toDouble, f2 =!= d)
```
Is casting double to float rounding up or rounding down?
It is rounding down; see below for a test on this.
```scala
assertEquivalent(castDouble(f) <= doubleValue, f <= doubleValue.toShort)
assertEquivalent(castDouble(f) < doubleValue, f <= doubleValue.toShort)

// Cases for rounding up: 3.14 will be rounded to 3.14000010... after casting to float
```
So casting double to float can be either rounding up or down, depending on the value?
@cloud-fan Sorry, I was wrong in the above comment (somehow I was thinking of casting from double to short there).
Yes, it appears that casting from double to float can round either up or down, depending on the value:

```scala
scala> val x = 0.39999999
x: Double = 0.39999999

scala> val y = x.toFloat
y: Float = 0.39999998

scala> val x = 0.49999999
x: Double = 0.49999999

scala> val y = x.toFloat
y: Float = 0.5
```

To clarify, the rounding up is observed over the casting roundtrip: double 3.14 -> float 3.14 -> double 3.14000010...
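To make the roundtrip concrete, here is a quick plain-Scala sketch of the same check (my own illustration, not code from the PR):

```scala
// double -> float -> double roundtrip for 3.14: the result exceeds the
// original literal, which is how we can tell the cast to float rounded up.
val original: Double = 3.14
val roundTrip: Double = original.toFloat.toDouble // 3.140000104904175
assert(roundTrip > original)
```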
This is an important point. Can we explain how to know it's rounding up or down in the PR description?
Yup, will do. This is a good point.
@cloud-fan updated the description. Please take another look. Thanks!
```scala
case ShortType => Some((Short.MinValue, Short.MaxValue))
case IntegerType => Some((Int.MinValue, Int.MaxValue))
case LongType => Some((Long.MinValue, Long.MaxValue))
case FloatType => Some((Float.NegativeInfinity, Float.NaN))
```
Why is the upper bound not PositiveInfinity?
This is because PositiveInfinity is considered to be < NaN in Spark. If we treat it as the upper bound, the rules handling the upper bound will no longer be valid. For instance, the expression `cast(e as double) > double('+inf')` would be converted to `e === double('+inf')`, which won't be correct if `e` evaluates to `double('NaN')`.
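For context, a quick sketch of the ordering assumption here (Spark's double ordering matches java.lang.Double's total order, in which NaN sorts above positive infinity):

```scala
// In the total order, NaN compares greater than +Infinity ...
assert(java.lang.Double.compare(Double.NaN, Double.PositiveInfinity) > 0)
// ... even though the raw IEEE `>` comparison against NaN is always false.
assert(!(Double.NaN > Double.PositiveInfinity))
```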
```scala
}

// When we reach to this point, it means either there is no min/max for the `fromType` (e.g.,
// decimal type), or that the literal `value` is within range `(min, max)`. For these, we
```
Why is it safe to skip the range check for decimal type?
It is safe since knowing min/max for a type just gives us more opportunity for optimizations. I skipped decimal type here because (it seems) there is no min/max defined in the DecimalType, unlike other numeric types.
Makes sense.
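Summarizing the thread, a hedged sketch of the range lookup being discussed (the cases mirror the diff quoted above; the exhaustive shape here is my own reconstruction, not the PR's exact code):

```scala
import org.apache.spark.sql.types._

// Bounds are an optional optimization input: None (e.g. for DecimalType)
// just means fewer rewrite opportunities, never an incorrect result.
def getRange(dt: DataType): Option[(Any, Any)] = dt match {
  case ByteType    => Some((Byte.MinValue, Byte.MaxValue))
  case ShortType   => Some((Short.MinValue, Short.MaxValue))
  case IntegerType => Some((Int.MinValue, Int.MaxValue))
  case LongType    => Some((Long.MinValue, Long.MaxValue))
  case FloatType   => Some((Float.NegativeInfinity, Float.NaN)) // NaN is the max in Spark's ordering
  case DoubleType  => Some((Double.NegativeInfinity, Double.NaN))
  case _           => None // DecimalType has no fixed min/max
}
```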
```scala
  // narrower type. In this case we simply return the original expression.
  return exp
}
val valueRoundTrip = Cast(Literal(newValue, fromType), toType).eval()
```
The case I'm worried about is `cast(float_col as double) cmp double_lit`. It's not straightforward to me that a double -> float -> double roundtrip can tell rounding up or down. Is it because float -> double can only be rounding up?
So double to float can round either up or down. For instance, when casting 3.14 from double to float, even though the printed value is still 3.14, the binary representation is rounded up.

3.14 in double:

```
0 10000000000 1001 0001 1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1111
```

3.14 in float:

```
0 10000000 1001 0001 1110 1011 1000 011
```

Here the sign bit and the exponent bits (11 and 8 bits respectively for double and float) agree between the two; however, in the fraction part, the last float bit is rounded up to 1.

After casting back to double, there won't be any rounding up or down: the remaining digits are simply padded with 0:

```
0 10000000000 1001 0001 1110 1011 1000 0110 0000000000000000000000000000
```
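For anyone wanting to reproduce these bit patterns, a small sketch using the standard JDK conversions (my own illustration):

```scala
// Raw IEEE 754 bits; note toBinaryString drops the leading zero sign bit.
val doubleBits = java.lang.Long.toBinaryString(java.lang.Double.doubleToLongBits(3.14))
val floatBits  = java.lang.Integer.toBinaryString(java.lang.Float.floatToIntBits(3.14f))
val roundTrip  = java.lang.Long.toBinaryString(java.lang.Double.doubleToLongBits(3.14f.toDouble))
// roundTrip's mantissa is floatBits' mantissa padded with zeros:
// the float -> double direction is exact and never rounds.
println(doubleBits)
println(floatBits)
println(roundTrip)
```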
Is this defined as part of the IEEE Standard for Floating-Point Arithmetic (IEEE 754)?
Yes, I think both the binary format and the rounding rules are specified in IEEE 754. There are a few rounding rules, and I think the default one is "round half to even".
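As a side note, the same tie-breaking behavior can be observed with Math.rint, which follows the IEEE 754 default of rounding ties to the even neighbor:

```scala
// Halfway values round to the nearest even integer, not always up or down.
assert(math.rint(2.5) == 2.0)
assert(math.rint(3.5) == 4.0)
```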
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129311 has finished for PR 29792 at commit
Ping @cloud-fan: addressed your comments, could you take another look at this? Thanks!
```scala
val newValue = Cast(Literal(value), fromType).eval()
if (newValue == null) {
  // This means the cast failed, for instance, due to the value is not representable in the
  // narrower type. In this case we simply return the original expression.
```
Can you give a real example here?
I see, it's for decimal only. It's better to make the comment more explicit.
Yup, will do. There is also a test case covering this.
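For reference, a hedged sketch of what such an example could look like, assuming the default (non-ANSI) cast semantics where an unrepresentable value evaluates to null (the specific literal is my own choice):

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.{Decimal, DecimalType}

// 123.45 needs precision 5 at scale 2, so it cannot fit DecimalType(4, 2);
// the cast therefore evaluates to null rather than a truncated value.
val value = Decimal("123.45")
val narrowed = Cast(Literal(value), DecimalType(4, 2)).eval()
assert(narrowed == null)
```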
The patch LGTM. Can we have an end-to-end test suite for it? The current tests prove that the optimized expression tree is what we expect; it's better to have high-level tests to prove that, after optimization, the query still returns the correct result.

Thanks, this is a good suggestion. Will add that.
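A rough sketch of the kind of end-to-end test being suggested, in the style of a QueryTest suite with SharedSparkSession (test name, table, and data are illustrative only, not the PR's actual tests):

```scala
test("SPARK-32858: unwrapped cast comparison still returns correct results") {
  import testImplicits._
  withTempView("tbl") {
    Seq(1.toShort, 2.toShort).toDF("short_col").createOrReplaceTempView("tbl")
    // Whether or not the optimizer rewrites the predicate into a cast-free
    // form, the query answer must stay the same.
    checkAnswer(sql("SELECT short_col FROM tbl WHERE short_col < 1.9"), Row(1.toShort))
  }
}
```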
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129566 has finished for PR 29792 at commit
@cloud-fan added an e2e test suite. Please take another look, thanks.
```scala
import org.apache.spark.sql.test.SharedSparkSession
import org.apache.spark.sql.types.Decimal

class UnwrapCastInComparisonEndToEndSuite extends QueryTest with SharedSparkSession {
```
Could you add these end-to-end tests in SQLQueryTestSuite instead of making a new suite?
Yeah, I can add the test there instead; I was just following the existing ReplaceNullWithFalseInPredicateEndToEndSuite though.
I think it's fine to follow ReplaceNullWithFalseInPredicateEndToEndSuite here.
```scala
 * - `cast(fromExp, toType) <= value` ==> `fromExp < cast(value, fromType)`
 * - `cast(fromExp, toType) < value` ==> `fromExp < cast(value, fromType)`
 *
 * Similarly for the case when casting `value` to `fromType` causes rounding down.
```
nit: wrong indent.
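As a sanity check on the first rule quoted above, a small plain-Scala sketch using the 3.14 example from earlier in the thread (my own illustration, not the optimizer code):

```scala
// cast(3.14 as float) rounds up: the roundtrip value exceeds the literal,
val lit = 3.14
val narrowed = lit.toFloat
assert(narrowed.toDouble > lit) // 3.140000104904175 > 3.14

// so `cast(f as double) <= 3.14` holds exactly for floats f < narrowed:
val below = 3.1399999f // a float just below `narrowed`
assert(below.toDouble <= lit && below < narrowed)
assert(!(narrowed.toDouble <= lit)) // f == narrowed fails the original predicate
```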
```scala
case IntegerType => Some((Int.MinValue, Int.MaxValue))
case LongType => Some((Long.MinValue, Long.MaxValue))
case FloatType => Some((Float.NegativeInfinity, Float.NaN))
case DoubleType => Some((Double.NegativeInfinity, Double.NaN))
```
Looks like there are no tests for this code path, so could you add some? (Note: I think byte, int, and long are not tested in UnwrapCastInBinaryComparisonSuite either.)
Will add a test case (although I think it will be pretty trivial). I only added tests for short in the previous PR because the handling for other integral types is exactly the same.
- Added test case for getRange()
- Added test for Float.PositiveInfinity and Float.MinValue/Float.MaxValue
- Move `select` after `where`
- Separate test cases
- Fix indentation
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129704 has finished for PR 29792 at commit
Addressed comments. Please take another look @cloud-fan @maropu. Thanks!
Thanks, merging to master!
Closes apache#29792 from sunchao/SPARK-32858.
Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

What changes were proposed in this pull request?
In SPARK-24994 we implemented unwrapping cast for integral types. This PR extends it to support other numeric types such as float/double/decimal, so that filters involving these types can be better pushed down to data sources.
Unlike the integral-type cases, conversions between numeric types can round either up or down. Consider the following case:

```sql
cast(e as double) < 1.9
```

Assume the type of `e` is short. Since 1.9 is not representable in that type, the cast will either truncate or round. Now suppose the literal is truncated; we cannot convert the expression to

```sql
e < cast(1.9 as short)
```

as in the previous implementation, since if `e` is 1, the original expression evaluates to true but the converted expression evaluates to false.
To resolve this, the PR first finds out whether casting from the wider type to the narrower type truncates or rounds, by comparing a roundtrip value, derived from converting the literal first to the narrower type and then back to the wider type, against the original literal value. For instance, in the above we first obtain the roundtrip value via the conversion (double) 1.9 -> (short) 1 -> (double) 1.0, and then compare it against 1.9.

(Screenshot: https://user-images.githubusercontent.com/506679/94492719-bd29e780-019f-11eb-9111-71d6e3d157f7.png)

In the truncation case, we instead convert the original expression to

```sql
e <= cast(1.9 as short)
```

so that the conversion is also valid when `e` is 1.
For more details, please check this blog post by Presto, which offers a very good explanation of how it works: https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html

Why are the changes needed?
For queries such as:

```sql
SELECT * FROM tbl WHERE short_col < 100.5
```

the predicate `short_col < 100.5` can't be pushed down to data sources because it involves casts. This change eliminates the cast so these queries can run more efficiently.

Does this PR introduce any user-facing change?
No

How was this patch tested?
Unit tests
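To close the loop on the description above, a minimal plain-Scala sketch of the roundtrip decision for the double -> short example (an illustration of the idea, not the optimizer's actual code):

```scala
// Decide how to rewrite `cast(e as double) < lit` when `e` is a short column.
def rewriteHint(lit: Double): String = {
  val narrowed = lit.toShort        // cast the literal to the narrower type
  val roundTrip = narrowed.toDouble // ... and back to the wider type
  if (roundTrip < lit) s"e <= $narrowed"     // literal was truncated (rounded down)
  else if (roundTrip > lit) s"e < $narrowed" // literal was rounded up
  else s"e < $narrowed"                      // exact: literal fits in a short
}

// The 1.9 example from the description: roundtrip 1.0 < 1.9, so use e <= 1.
assert(rewriteHint(1.9) == "e <= 1")
```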