[SPARK-25417][SQL] ArrayContains function may return incorrect result when right expression is implicitly down casted #22408
Changes from 7 commits
@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter of the `array_contains` function is implicitly promoted to the element type of the first, array-type parameter. This type promotion can be lossy and may cause `array_contains` to return a wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. The resulting changes in behavior are illustrated in the table below.
+    <table class="table">
+      <tr>
+        <th>Query</th>
+        <th>Result in Spark 2.3 or prior</th>
+        <th>Result in Spark 2.4</th>
+        <th>Remarks</th>
+      </tr>
+      <tr>
+        <td>SELECT <br> array_contains(array(1), 1.34D);</td>
+        <td>true</td>
+        <td>false</td>
+        <td>In Spark 2.4, the left and right parameters are promoted to array(double) and double, respectively.</td>
+      </tr>
+      <tr>
+        <td>SELECT <br> array_contains(array(1), '1');</td>
+        <td>true</td>
+        <td>AnalysisException is thrown, since integer type cannot be promoted to string type in a lossless manner.</td>

Contributor: We can promote

Contributor: If Presto doesn't do it, we should follow it.

Contributor (Author): @cloud-fan Yeah, Presto gives an error. Please refer to my earlier comment showing the Presto output. Did you want anything in the description changed?

Contributor: Ah, then it's fine; we don't need to change anything here.

+        <td>Users can use explicit cast</td>
+      </tr>
+      <tr>
+        <td>SELECT <br> array_contains(array(1), 'anystring');</td>
+        <td>null</td>
+        <td>AnalysisException is thrown, since integer type cannot be promoted to string type in a lossless manner.</td>
+        <td>Users can use explicit cast</td>
+      </tr>
+    </table>
   - Since Spark 2.4, when there is a struct field in front of the IN operator before a subquery, the inner query must contain a struct field as well. In previous versions, instead, the fields of the struct were compared to the output of the inner query. E.g., if `a` is a `struct(a string, b int)`, in Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a in (select 1, 'a' from range(1))` is not. In previous versions it was the opposite.
   - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became case-sensitive and would resolve to columns (unless typed in lower case). In Spark 2.4 this has been fixed and the functions are no longer case-sensitive.
   - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right, with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations is preserved under the newly added configuration `spark.sql.legacy.setopsPrecedence.enabled`, with a default value of `false`. When this property is set to `true`, Spark evaluates the set operators from left to right as they appear in the query, given no explicit ordering is enforced by the use of parentheses.
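
To make the `array_contains` note above concrete, here is a minimal spark-shell sketch (it assumes a Spark 2.4 session with the usual `spark: SparkSession` in scope; the queries and outcomes mirror the migration table and the tests added later in this PR):

```scala
// In 2.3 the double was silently down-cast to the array's element type
// (1.34D -> 1), so this returned true; in 2.4 both sides are promoted to
// double and compared exactly, so it returns false.
spark.sql("SELECT array_contains(array(1), 1.34D)").show()  // false

// int vs. string has no lossless common type, so 2.4 raises an
// AnalysisException instead of guessing. As the table's remarks suggest,
// an explicit cast states the intent and keeps the query working:
spark.sql("SELECT array_contains(array(1), CAST('1' AS INT))").show()  // true
```
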
|
|
@@ -1912,7 +1972,6 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - The `percentile_approx` function previously accepted numeric type input and output double type results. Now it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
   - Since Spark 2.3, the Join/Filter's deterministic predicates that are after the first non-deterministic predicates are also pushed down/through the child operators, if possible. In prior Spark versions, these filters were not eligible for predicate pushdown.
   - Partition column inference previously found an incorrect common type for different inferred types; for example, it previously ended up with double type as the common type for double type and date type. Now it finds the correct common type for such conflicts. The conflict resolution follows the table below:

    <table class="table">
      <tr>
        <th>

@@ -1331,23 +1331,27 @@ case class ArrayContains(left: Expression, right: Expression)
   @transient private lazy val ordering: Ordering[Any] =
     TypeUtils.getInterpretedOrdering(right.dataType)

-  override def inputTypes: Seq[AbstractDataType] = right.dataType match {
-    case NullType => Seq.empty
-    case _ => left.dataType match {
-      case n @ ArrayType(element, _) => Seq(n, element)
+  override def inputTypes: Seq[AbstractDataType] = {
+    (left.dataType, right.dataType) match {
+      case (_, NullType) => Seq.empty
+      case (ArrayType(e1, hasNull), e2) =>
+        TypeCoercion.findTightestCommonType(e1, e2) match {
+          case Some(dt) => Seq(ArrayType(dt, hasNull), dt)
+          case _ => Seq.empty
+        }
       case _ => Seq.empty
     }
   }

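The key change is delegating the promotion decision to `TypeCoercion.findTightestCommonType`, which widens types only when nothing can be lost. A rough sketch of its semantics (it is a Catalyst-internal helper, so this snippet assumes access to Spark's internals and is illustrative rather than a supported API):

```scala
import org.apache.spark.sql.catalyst.analysis.TypeCoercion
import org.apache.spark.sql.types._

// int widens to double losslessly, so inputTypes becomes
// Seq(ArrayType(DoubleType, hasNull), DoubleType) and both arguments
// are implicitly cast up before the comparison.
TypeCoercion.findTightestCommonType(IntegerType, DoubleType)  // Some(DoubleType)

// There is no lossless common type for int vs. string (likewise for int
// vs. a decimal too narrow to hold every int), so inputTypes returns
// Seq.empty, no implicit cast is added, and checkInputDataTypes below
// reports the mismatch instead of silently down-casting as 2.3 did.
TypeCoercion.findTightestCommonType(IntegerType, StringType)  // None
```
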
   override def checkInputDataTypes(): TypeCheckResult = {
-    if (right.dataType == NullType) {
-      TypeCheckResult.TypeCheckFailure("Null typed values cannot be used as arguments")
-    } else if (!left.dataType.isInstanceOf[ArrayType]
-      || !left.dataType.asInstanceOf[ArrayType].elementType.sameType(right.dataType)) {
-      TypeCheckResult.TypeCheckFailure(
-        "Arguments must be an array followed by a value of same type as the array members")
-    } else {
-      TypeUtils.checkForOrderingExpr(right.dataType, s"function $prettyName")
+    (left.dataType, right.dataType) match {
+      case (_, NullType) =>
+        TypeCheckResult.TypeCheckFailure("Null typed values cannot be used as arguments")
+      case (ArrayType(e1, _), e2) if e1.sameType(e2) =>
+        TypeUtils.checkForOrderingExpr(e2, s"function $prettyName")
+      case _ => TypeCheckResult.TypeCheckFailure(s"Input to function $prettyName should have " +
+        s"been ${ArrayType.simpleString} followed by a value with same element type, but it's " +
+        s"[${left.dataType.catalogString}, ${right.dataType.catalogString}].")
+    }
   }
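
Taken together, the two methods turn a silent wrong answer into an analysis-time error. Roughly what a user sees in a Spark 2.4 session (the mismatch text comes from `checkInputDataTypes` above, wrapped in Spark's usual data-type-mismatch message; exact formatting may differ slightly):

```scala
// inputTypes yields Seq.empty for [array<int>, string], so no implicit cast
// is inserted and the analyzer surfaces the TypeCheckFailure message:
spark.sql("SELECT array_contains(array(1), 'foo')")
// org.apache.spark.sql.AnalysisException: cannot resolve
// 'array_contains(array(1), 'foo')' due to data type mismatch: Input to function
// array_contains should have been array followed by a value with same element
// type, but it's [array<int>, string].
```
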
@@ -735,6 +735,60 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23))"),
+      Seq(Row(false), Row(false))
+    )
+    val e1 = intercept[AnalysisException] {
+      df.selectExpr("array_contains(array(1), .01234567890123456790123456780)")
+    }
+    val errorMsg1 =
+      s"""
+         |Input to function array_contains should have been array followed by a
+         |value with same element type, but it's [array<int>, decimal(29,29)].
+       """.stripMargin.replace("\n", " ").trim()
+    assert(e1.message.contains(errorMsg1))
+
+    val e2 = intercept[AnalysisException] {
+      df.selectExpr("array_contains(array(1), 'foo')")
+    }
+    val errorMsg2 =
+      s"""
+         |Input to function array_contains should have been array followed by a
+         |value with same element type, but it's [array<int>, string].
+       """.stripMargin.replace("\n", " ").trim()
+    assert(e2.message.contains(errorMsg2))
   }

   test("arrays_overlap function") {

Contributor: remove both.