Conversation

@amanomer
Contributor

@amanomer amanomer commented Dec 9, 2019

What changes were proposed in this pull request?

Use TypeCoercion.findWiderTypeForTwo() instead of TypeCoercion.findTightestCommonType() while preprocessing inputTypes in ArrayContains.

Why are the changes needed?

TypeCoercion.findWiderTypeForTwo() also handles DecimalType cases, which TypeCoercion.findTightestCommonType() does not.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Test cases to be added.
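To illustrate the difference the PR relies on, here is a simplified, self-contained model of the two coercion strategies. The type names mirror Spark's, but the widening rule is a sketch, not the actual TypeCoercion implementation:

```scala
// Simplified model of the two coercion strategies (illustrative only,
// not Spark's actual TypeCoercion code).
sealed trait DataType
case object IntType extends DataType
case class DecimalType(precision: Int, scale: Int) extends DataType

// findTightestCommonType: succeeds only when no widening is needed, so
// two decimals with different precision/scale yield None.
def findTightestCommonType(a: DataType, b: DataType): Option[DataType] =
  if (a == b) Some(a) else None

// findWiderTypeForTwo: additionally widens two decimals to the smallest
// decimal that can hold both (larger scale, larger integer part).
def findWiderTypeForTwo(a: DataType, b: DataType): Option[DataType] =
  (a, b) match {
    case (DecimalType(p1, s1), DecimalType(p2, s2)) =>
      val scale = math.max(s1, s2)
      val intDigits = math.max(p1 - s1, p2 - s2)
      Some(DecimalType(intDigits + scale, scale))
    case _ if a == b => Some(a)
    case _ => None
  }
```

Under this model, decimal(3,1) vs decimal(5,2) has no tightest common type, but widens to decimal(5,2) — which is the gap the PR addresses for ArrayContains.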

@amanomer
Contributor Author

amanomer commented Dec 9, 2019

cc @cloud-fan @srowen

  case (_, NullType) => Seq.empty
  case (ArrayType(e1, hasNull), e2) =>
-   TypeCoercion.findTightestCommonType(e1, e2) match {
+   TypeCoercion.findWiderTypeForTwo(e1, e2) match {
Member

This kind of thing needs a justification and a migration guide update, with tests. Which JIRA caused this issue?

Contributor Author

PR #22408, which prevents implicit downcasting of the right expression. It doesn't handle the case of decimal type upcasting.

Contributor Author

I will update the migration guide.

assert(e2.message.contains(errorMsg2))
checkAnswer(
OneRowRelation().selectExpr("array_contains(array(1), 'foo')"),
Seq(Row(false))
Contributor Author

After this PR, for queries like SELECT array_contains(array(1), 'xyz'); the left expression will be upcast to array<string>. So instead of throwing an exception, the query will return false.

@srowen
Member

srowen commented Dec 9, 2019

Hm, I can't tell if the new behavior is intentional or an unintended side effect. WDYT @dilipbiswal @cloud-fan @maropu ?

@maropu
Member

maropu commented Dec 9, 2019

I think this kind of implicit type cast can easily cause unexpected output in complicated queries... Even in pgsql, the array containment operator @> throws an exception for these queries:

postgres=# select ARRAY[1,2,3] @> '3';
ERROR:  malformed array literal: "3"
LINE 1: select ARRAY[1,2,3] @> '3';
                               ^
DETAIL:  Array value must start with "{" or dimension information.

postgres=# select ARRAY[1,2,3] @> ARRAY['3'];
ERROR:  operator does not exist: integer[] @> text[]
LINE 1: select ARRAY[1,2,3] @> ARRAY['3'];
                            ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

If users want to run this kind of query, using explicit casts looks better.

@srowen
Member

srowen commented Dec 9, 2019

It sounds like we want an error in this case, and it currently generates an error? Then I don't think we want to partially undo the previous change like this.

@maropu
Member

maropu commented Dec 10, 2019

@srowen Yea, I think so. cc: @gatorsmile

@dongjoon-hyun
Member

ok to test

@cloud-fan
Contributor

@amanomer do you know which PR causes the compatibility issue? We need to see if it's intentional or not. If it's intentional, there should be a migration guide.

@SparkQA

SparkQA commented Dec 11, 2019

Test build #115143 has finished for PR 26811 at commit d2ce3ae.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@amanomer
Contributor Author

do you know which PR causes the compatibility issue?

I think it is PR #22408, which uses findTightestCommonType() for casting.
findTightestCommonType() never widens a datatype, but in the case of decimals the precision should be upcast.

@amanomer
Contributor Author

If it's intentional, there should be a migration guide

I don't see this case (decimal types with different precision) in the migration guide. A snapshot from the migration guide is attached:
Screenshot from 2019-12-11 14-38-53

@amanomer
Contributor Author

There was a similar issue for IN subquery expression which was addressed by #26485.

@amanomer
Contributor Author

cc @cloud-fan

@cloud-fan
Contributor

#26485 is accepted because it just makes the type coercion consistent between In and InSubquery.

I think it's a design flaw that we have many type coercion rules for different operators, plus a separate DecimalPrecision rule to handle the decimal cases for all operators. This means an operator needs to be dealt with twice: once in its normal type coercion rule, and once in DecimalPrecision.

For array_contains, it calls findTightestCommonType, which skips decimal types. We should also deal with array_contains in DecimalPrecision. This is not ideal, but should be the right fix for now.

In the future, we should really refactor this part. We should either separate type coercion rules by operators, or implement type coercion inside operators.

@srowen
Member

srowen commented Dec 11, 2019

(I'd follow whatever @cloud-fan recommends)

@maropu
Member

maropu commented Dec 12, 2019

me, too.

@amanomer
Contributor Author

@cloud-fan could you please review this PR and start the tests?

@cloud-fan
Contributor

cloud-fan commented Dec 12, 2019

Let me make my review comment clear: I don't think we should simply call findWiderTypeForTwo, which does much more than decimal type coercion.

After taking a look at DecimalPrecision, I found we have the same decimal type coercion implemented in TypeCoercion as well: findWiderTypeForDecimal.

Now we can have a simple fix: replace findTightestCommonType with findWiderTypeWithoutStringPromotionForTwo, which simply calls findTightestCommonType and then findWiderTypeForDecimal, and also deals with complex types.
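The shape of that suggested fix can be sketched with simplified standalone types — the names mirror TypeCoercion's, but the bodies are illustrative stand-ins, not the real implementation:

```scala
// Hedged sketch of the suggested fix: try the tightest common type
// first, then fall back to decimal widening. Simplified model, not
// Spark's actual TypeCoercion code.
sealed trait DataType
case object IntType extends DataType
case class DecimalType(precision: Int, scale: Int) extends DataType

def findTightestCommonType(a: DataType, b: DataType): Option[DataType] =
  if (a == b) Some(a) else None

// Decimal-only widening: larger scale, larger integer part.
def findWiderTypeForDecimal(a: DataType, b: DataType): Option[DataType] =
  (a, b) match {
    case (DecimalType(p1, s1), DecimalType(p2, s2)) =>
      val scale = math.max(s1, s2)
      Some(DecimalType(math.max(p1 - s1, p2 - s2) + scale, scale))
    case _ => None
  }

// The combined lookup: tightest first, decimal widening as fallback.
def findWiderTypeWithoutStringPromotionForTwo(
    a: DataType, b: DataType): Option[DataType] =
  findTightestCommonType(a, b).orElse(findWiderTypeForDecimal(a, b))
```

The point of the composition is that non-decimal behaviour stays exactly as before, while mismatched decimals now get a widened common type instead of None.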

@amanomer
Contributor Author

cc @cloud-fan @HyukjinKwon


- Since Spark 3.0, the unary arithmetic operator plus(`+`) only accepts string, numeric and interval type values as inputs. Besides, `+` with a integral string representation will be coerced to double value, e.g. `+'1'` results `1.0`. In Spark version 2.4 and earlier, this operator is ignored. There is no type checking for it, thus, all type values with a `+` prefix are valid, e.g. `+ array(1, 2)` is valid and results `[1, 2]`. Besides, there is no type coercion for it at all, e.g. in Spark 2.4, the result of `+'1'` is string `1`.

- Since Spark 3.0, the parameters (first or second) of the `array_contains` function are implicitly promoted to their wider common type.
Contributor

@maropu @srowen do we still need a migration guide? It looks like an obvious bug to me that we forgot to do type coercion for decimal types. And I don't think a user would expect Spark to fail array_contains with compatible decimal types.

Member

Ah, I see. The current latest fix looks reasonable. This fix is not a behaviour change but a bug fix.

Contributor Author

Thanks @cloud-fan @maropu
I will revert these changes from the migration guide.

@SparkQA

SparkQA commented Dec 12, 2019

Test build #115234 has finished for PR 26811 at commit 7412368.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@amanomer amanomer requested review from cloud-fan and maropu December 12, 2019 14:09
@SparkQA

SparkQA commented Dec 12, 2019

Test build #115239 has finished for PR 26811 at commit 6c3545d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@amanomer
Contributor Author

cc @cloud-fan @maropu

""".stripMargin.replace("\n", " ").trim()
assert(e2.message.contains(errorMsg2))

checkAnswer(
Member

Since this is a bug, can you split these three tests into a separate test unit and add a test title with the JIRA ID (SPARK-29600)?

Member

Also, can you update the title, too?

Contributor Author

Sure, I'll update.

@amanomer amanomer changed the title [SPARK-29600][SQL] array_contains built in function is not backward compatible in 3.0 [SPARK-29600][SQL] ArrayContains function may return incorrect result for DecimalType Dec 13, 2019
@amanomer amanomer requested a review from maropu December 13, 2019 12:33
@amanomer
Contributor Author

Retest this please

  s"""
    |Input to function array_contains should have been array followed by a
-   |value with same element type, but it's [array<int>, decimal(29,29)].
+   |value with same element type, but it's [array<int>, decimal(38,29)].
Contributor

Why does the precision become 38 in this case?

Contributor Author

case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
val children: Seq[Expression] = e.children.zip(e.inputTypes).map { case (in, expected) =>
// If we cannot do the implicit cast, just use the original input.
implicitCast(in, expected).getOrElse(in)
}
e.withNewChildren(children)

For the query array_contains(array(1), .01234567890123456790123456780),
e.inputTypes returns Seq(Array(Decimal(38,29)), Decimal(38,29)), and the code above casts .01234567890123456790123456780 to Decimal(38,29).
Previously, when we were using findWiderTypeForTwo, decimal types were not getting upcast, but findWiderTypeWithoutStringPromotionForTwo upcasts DecimalType successfully.
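To illustrate where 38 comes from: Spark caps decimal precision at 38 digits (DecimalType.MAX_PRECISION). The widening formula below is a simplified sketch of that rule, not the actual code:

```scala
// Sketch: widening int (decimal(10,0)) with the literal's decimal(29,29).
// Spark caps decimal precision at 38 digits (DecimalType.MAX_PRECISION);
// the formula is a simplified model of the widening rule.
val MaxPrecision = 38

def widen(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val scale = math.max(s1, s2)                 // keep the larger scale
  val intDigits = math.max(p1 - s1, p2 - s2)   // keep the larger integer part
  (math.min(intDigits + scale, MaxPrecision), scale)
}

// int is decimal(10, 0); the 29-fraction-digit literal is decimal(29, 29):
// 10 integer digits + 29 fraction digits = 39, which is capped to 38.
println(widen(10, 0, 29, 29))  // prints (38,29)
```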

Contributor

Previously, when we were using findWiderTypeForTwo

Before this PR, we were using findTightestCommonType. Why do we add a cast but still fail to resolve ArrayContains?

Contributor Author

Do you mean why, in the above test case query, ArrayContains throws an AnalysisException instead of casting the integer to decimal?

An integer cannot be cast to a decimal with scale > 28.

decimalWith28Zeroes = 1.0000000000000000000000000000
SELECT array_contains(array(1), decimalWith28Zeroes);
Result => true

decimalWith29Zeroes = 1.00000000000000000000000000000
SELECT array_contains(array(1), decimalWith29Zeroes);
Result => AnalysisException
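The 28-vs-29 cutoff follows from the 38-digit precision cap: an int needs up to 10 integer digits, and decimal(38, s) only offers 38 - s of them. A minimal sketch of this check (not Spark's actual implicitCast logic):

```scala
// Sketch of why int can't be implicitly cast to decimal(38,29): an int
// needs up to 10 integer digits (Int.MaxValue = 2147483647), but a
// decimal(38, s) only offers 38 - s of them. Not Spark's actual code.
def intFitsInDecimal(precision: Int, scale: Int): Boolean =
  precision - scale >= 10

// scale 28 leaves 10 integer digits -> cast succeeds, result is true
assert(intFitsInDecimal(38, 28))
// scale 29 leaves only 9 -> cast fails, hence the AnalysisException
assert(!intFitsInDecimal(38, 29))
```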

Contributor

@cloud-fan cloud-fan Dec 17, 2019

Yea, I get that we can't do the cast here. My question is: since we can't do the cast, we should leave the expression untouched. But now we add a cast to one side and leave the expression unresolved. Where do we add that useless cast?

Contributor Author

case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
val children: Seq[Expression] = e.children.zip(e.inputTypes).map { case (in, expected) =>
// If we cannot do the implicit cast, just use the original input.
implicitCast(in, expected).getOrElse(in)
}
e.withNewChildren(children)

This code casts the left and right expressions one by one. Here,

  • e.children is Seq(array<int>, decimal(29,29)), and
  • e.inputTypes returns Seq(array<decimal(38,29)>, decimal(38,29))

implicitCast(array<int>, array<decimal(38,29)>) returns None, since int can't be cast to decimal(38,29).

Contributor Author

So the above code creates a new expression by updating only the right child.

Contributor

ah thanks for finding this out!

@SparkQA

SparkQA commented Dec 13, 2019

Test build #115302 has finished for PR 26811 at commit 7b12d6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@amanomer
Contributor Author

amanomer commented Dec 17, 2019

Tests have passed. Kindly review this PR.
cc @gengliangwang @HyukjinKwon @srowen

@amanomer amanomer requested a review from cloud-fan December 17, 2019 14:20
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 297f406 Dec 17, 2019
@cloud-fan
Contributor

@amanomer can you leave a comment in the JIRA ticket, so that I can assign it to you?
