[SPARK-29600][SQL] ArrayContains function may return incorrect result for DecimalType #26811
Conversation
```diff
  case (_, NullType) => Seq.empty
  case (ArrayType(e1, hasNull), e2) =>
-   TypeCoercion.findTightestCommonType(e1, e2) match {
+   TypeCoercion.findWiderTypeForTwo(e1, e2) match {
```
This kind of thing needs a justification and a migration guide update with tests. Which JIRA caused this issue?
PR #22408, which was meant to prevent implicit downcasting of the right expression. It doesn't handle the case of decimal type upcasting.
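A toy Python sketch (hypothetical, not Spark's actual Scala code) of why a tightest-common-type rule can't help here: it never invents a new type, it can only return one of the two inputs unchanged, so int vs a wide decimal yields nothing and no upcast happens.

```python
from typing import Optional, Tuple

# Model a decimal type as (precision, scale); int behaves like decimal(10, 0).
DecimalT = Tuple[int, int]

def tightest_common_type(t1: DecimalT, t2: DecimalT) -> Optional[DecimalT]:
    # A "tightest" rule only succeeds when one side already equals the
    # other, so no decimal widening is ever performed.
    return t1 if t1 == t2 else None

int_type = (10, 0)        # int
literal_type = (29, 29)   # a literal with 29 fractional digits

print(tightest_common_type(int_type, literal_type))  # None -> no upcast
print(tightest_common_type(int_type, int_type))      # (10, 0)
```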
I will update the migration guide.
```scala
assert(e2.message.contains(errorMsg2))
checkAnswer(
  OneRowRelation().selectExpr("array_contains(array(1), 'foo')"),
  Seq(Row(false))
```
After this PR, for queries like `SELECT array_contains(array(1), 'xyz');`, the left expression will be upcast to `array<string>`. So instead of throwing an exception, it will return `false`.
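A rough Python analogy of what string promotion does in this case (hypothetical sketch, not Spark code): both sides are widened to string, so the membership check quietly compares `'1'` with `'foo'` and returns false instead of failing analysis.

```python
def array_contains_with_string_promotion(arr, value):
    # Widening int vs string promotes both sides to string,
    # so the check silently becomes a string comparison.
    promoted = [str(x) for x in arr]
    return str(value) in promoted

print(array_contains_with_string_promotion([1], "foo"))  # False, no error
print(array_contains_with_string_promotion([1], "1"))    # True: "1" == "1"
```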
|
Hm, I can't tell if the new behavior is intentional or an unintended side effect. WDYT @dilipbiswal @cloud-fan @maropu ? |
|
I think this kind of implicit type cast can easily cause unexpected output in complicated queries... Even in pgsql, an array contains operator does not accept this kind of query. If users want to process such queries, using explicit casts looks better.
|
It sounds like we want an error in this case, and it currently generates an error? Then I don't think we want to partially undo the previous change like this.
|
@srowen Yea, I think so. cc: @gatorsmile |
|
ok to test |
|
@amanomer do you know which PR causes the compatibility issue? We need to see if it's intentional or not. If it's intentional, there should be a migration guide. |
|
Test build #115143 has finished for PR 26811 at commit
|
I think it is this PR: #22408, which uses `findTightestCommonType`.
|
There was a similar issue for IN subquery expression which was addressed by #26485. |
|
cc @cloud-fan |
|
#26485 was accepted because it just makes the type coercion consistent between In and InSubquery. I think it's a wrong design that we have many type coercion rules for different operators, and also a generic implicit-cast rule driven by `inputTypes`. In the future, we should really refactor this part. We should either separate type coercion rules by operators, or implement type coercion inside operators.
|
(I'd follow whatever @cloud-fan recommends) |
|
me, too. |
|
@cloud-fan could you please review this PR and start tests |
|
let me make my review comment clear: I don't think we should simply call `findWiderTypeForTwo`, which also does string promotion. After taking a look at the other widening helpers, now we can have a simple fix: replace it with `findWiderTypeWithoutStringPromotionForTwo`.
docs/sql-migration-guide.md (Outdated)

```diff
  - Since Spark 3.0, the unary arithmetic operator plus(`+`) only accepts string, numeric and interval type values as inputs. Besides, `+` with a integral string representation will be coerced to double value, e.g. `+'1'` results `1.0`. In Spark version 2.4 and earlier, this operator is ignored. There is no type checking for it, thus, all type values with a `+` prefix are valid, e.g. `+ array(1, 2)` is valid and results `[1, 2]`. Besides, there is no type coercion for it at all, e.g. in Spark 2.4, the result of `+'1'` is string `1`.
+ - Since Spark 3.0, the parameter (first or second) to the array_contains function is implicitly promoted to the wider of the two types.
```
Ah, I see. The latest fix looks reasonable. This is not a behaviour change but a bug fix.
Thanks @cloud-fan @maropu. I will revert these changes from the migration guide.
|
Test build #115234 has finished for PR 26811 at commit
|
|
Test build #115239 has finished for PR 26811 at commit
|
```scala
  """.stripMargin.replace("\n", " ").trim()
assert(e2.message.contains(errorMsg2))

checkAnswer(
```
Since this is a bug fix, can you split these three tests into a separate test unit and add a test title with the JIRA ID (SPARK-29600)?
Also, can you update the title, too?
Sure. I'll update
|
Retest this please |
```diff
  s"""
    |Input to function array_contains should have been array followed by a
-   |value with same element type, but it's [array<int>, decimal(29,29)].
+   |value with same element type, but it's [array<int>, decimal(38,29)].
```
Why does the precision become 38 in this case?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (lines 864 to 869 at 1fc353d):

```scala
case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
  val children: Seq[Expression] = e.children.zip(e.inputTypes).map { case (in, expected) =>
    // If we cannot do the implicit cast, just use the original input.
    implicitCast(in, expected).getOrElse(in)
  }
  e.withNewChildren(children)
```
For the query `array_contains(array(1), .01234567890123456790123456780)`, `e.inputTypes` will return `Seq(Array(Decimal(38,29)), Decimal(38,29))`, and the above code will cast `.01234567890123456790123456780` to `Decimal(38,29)`.

Previously, when we were using findWiderTypeForTwo, decimal types were not getting upcast, but `findWiderTypeWithoutStringPromotionForTwo` will successfully upcast DecimalType.
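Concretely, the literal's 29 fractional digits make it decimal(29,29), and widening that against int (which needs 10 integral digits) would require 10 + 29 = 39 digits, which Spark caps at its maximum precision of 38. A Python sketch under those assumptions:

```python
literal = ".01234567890123456790123456780"
frac_digits = len(literal.split(".")[1])   # 29 -> literal is decimal(29, 29)

INT_PRECISION, INT_SCALE = 10, 0           # int treated as decimal(10, 0)
MAX_PRECISION = 38                         # Spark's DecimalType precision cap

scale = max(frac_digits, INT_SCALE)                   # 29
int_digits = max(INT_PRECISION - INT_SCALE,
                 0)                                   # 10 integral digits
precision = min(int_digits + scale, MAX_PRECISION)    # min(39, 38) = 38

print(f"decimal({precision},{scale})")  # decimal(38,29)
```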
> Previously, when we were using findWiderTypeForTwo

Before this PR, we were using `findTightestCommonType`. Why do we add a cast but still can't resolve `ArrayContains`?
Do you mean why, in the above test case query, `ArrayContains` throws an `AnalysisException` instead of casting the integer to `Decimal`? An integer cannot be cast to a decimal with scale > 28.

decimalWith28Zeroes = `1.0000000000000000000000000000`
`SELECT array_contains(array(1), decimalWith28Zeroes);` => `true`

decimalWith29Zeroes = `1.00000000000000000000000000000`
`SELECT array_contains(array(1), decimalWith29Zeroes);` => `AnalysisException`
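The cutoff at 28 zeroes follows from the precision cap (a hypothetical sketch, not Spark's real check): an int needs 10 integral digits, so a target decimal(38, s) can hold it only while 38 - s >= 10, i.e. s <= 28.

```python
MAX_PRECISION = 38
INT_DIGITS = 10  # integral digits needed to hold any 32-bit int

def int_fits_in_decimal(precision: int, scale: int) -> bool:
    # An int fits only if there is room for all 10 integral digits
    # to the left of the decimal point.
    return precision - scale >= INT_DIGITS

print(int_fits_in_decimal(38, 28))  # True: 28 zeroes after "1." still fits
print(int_fits_in_decimal(38, 29))  # False: 29 zeroes -> no valid cast
```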
Yea, I get that we can't do the cast here. My question is: since we can't do the cast, we should leave the expression untouched. But now we add a cast to one side and leave the expression unresolved. Where do we add that useless cast?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (lines 864 to 869 at 1fc353d):

```scala
case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
  val children: Seq[Expression] = e.children.zip(e.inputTypes).map { case (in, expected) =>
    // If we cannot do the implicit cast, just use the original input.
    implicitCast(in, expected).getOrElse(in)
  }
  e.withNewChildren(children)
```
This code casts the left and right expressions one by one. Here, `e.children` is `Seq(array<int>, decimal(29,29))`, and `e.inputTypes` will return `Seq(array<decimal(38,29)>, decimal(38,29))`.

`implicitCast(array<int>, array<decimal(38,29)>)` will return `None`, since int can't be cast to decimal(38,29).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above code then creates a new expression by updating only the right child.
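The per-child fallback can be sketched in Python (toy type names, not Spark's real implementation): each child is cast independently, and a failed cast silently keeps the original child, producing a half-cast expression that stays unresolved.

```python
from typing import Optional

def implicit_cast(child: str, expected: str) -> Optional[str]:
    # Toy cast table: the decimal literal can widen, the int array cannot.
    allowed = {
        ("decimal(29,29)", "decimal(38,29)"):
            "cast(decimal(29,29) as decimal(38,29))",
    }
    return allowed.get((child, expected))

children = ["array<int>", "decimal(29,29)"]
input_types = ["array<decimal(38,29)>", "decimal(38,29)"]

# Mirrors `implicitCast(in, expected).getOrElse(in)`:
# when the cast is impossible, the original child is kept as-is.
new_children = [
    implicit_cast(child, expected) or child
    for child, expected in zip(children, input_types)
]
print(new_children)
# Only the right child changed; the left is still array<int>,
# so the expression's types still disagree and analysis fails later.
```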
ah thanks for finding this out!
|
Test build #115302 has finished for PR 26811 at commit
|
|
Tests have passed. Kindly review this PR. |
|
thanks, merging to master! |
|
@amanomer can you leave a comment in the JIRA ticket, so that I can assign it to you? |

What changes were proposed in this pull request?

Use `TypeCoercion.findWiderTypeForTwo()` instead of `TypeCoercion.findTightestCommonType()` while preprocessing `inputTypes` in `ArrayContains`.

Why are the changes needed?

`TypeCoercion.findWiderTypeForTwo()` also handles cases for DecimalType.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Test cases to be added.