[SPARK-25056][SQL] Unify the InConversion and BinaryComparison behavior #22038

wangyum · 2018-08-08T15:39:32Z

What changes were proposed in this pull request?

Before this PR:

scala> val df = spark.range(4).toDF().selectExpr("cast(id as decimal(9, 2)) as id")
df: org.apache.spark.sql.DataFrame = [id: decimal(9,2)]

scala> df.filter("id in('1', '3')").show
+---+
| id|
+---+
+---+

scala> df.filter("id = '1' or id ='3'").show
+----+
|  id|
+----+
|1.00|
|3.00|
+----+

After this PR:

scala> val df = spark.range(4).toDF().selectExpr("cast(id as decimal(9, 2)) as id")
df: org.apache.spark.sql.DataFrame = [id: decimal(9,2)]

scala> df.filter("id in('1', '3')").show
+----+
|  id|
+----+
|1.00|
|3.00|
+----+

scala> df.filter("id = '1' or id ='3'").show
+----+
|  id|
+----+
|1.00|
|3.00|
+----+

This change is the same as HIVE-20204.
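The difference between the two outputs above can be reproduced outside Spark with a small, hedged sketch. It assumes the empty `IN` result comes from comparing the decimal column and the literals as strings, while the `=` / `OR` form compares them numerically; the thread does not state which type Spark picks for decimals, so `InVersusEqualsSketch` and its helper names are purely illustrative and use only the Scala standard library.

```scala
// Illustrative model only; this is not Spark code.
object InVersusEqualsSketch {
  // Column values as produced by cast(id as decimal(9, 2)) for ids 0..3.
  val ids: Seq[BigDecimal] = (0 to 3).map(i => BigDecimal(i).setScale(2))

  // The string literals used in the filter.
  val literals: Seq[String] = Seq("1", "3")

  // String-based matching: the decimal is rendered as text ("1.00")
  // and compared with "1", so nothing matches.
  def matchAsStrings(values: Seq[BigDecimal], list: Seq[String]): Seq[BigDecimal] =
    values.filter(v => list.contains(v.toString))

  // Numeric matching: each literal is parsed as a number before comparing,
  // so "1" matches 1.00 regardless of scale.
  def matchAsNumbers(values: Seq[BigDecimal], list: Seq[String]): Seq[BigDecimal] =
    values.filter(v => list.exists(s => BigDecimal(s).compare(v) == 0))
}
```

Under these assumptions, `matchAsStrings` mirrors the "before" result for `IN` and `matchAsNumbers` mirrors the result both forms produce after the change.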

Other database behavior:
Teradata:

Oracle:

MySQL:

PostgreSQL:

Hive-2.3.2

Hive current master

spark-sql:

How was this patch tested?

unit tests

SparkQA · 2018-08-08T19:38:21Z

Test build #94432 has finished for PR 22038 at commit 9459e6e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2018-08-09T06:28:51Z

@mgaido91 what do you think about it?

SparkQA · 2018-08-09T07:05:02Z

Test build #94473 has finished for PR 22038 at commit c4775c4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2018-08-09T07:34:13Z

retest this please

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

SparkQA · 2018-08-09T11:39:13Z

Test build #94479 has finished for PR 22038 at commit c4775c4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-09T17:40:03Z

Test build #94504 has finished for PR 22038 at commit 935ed36.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-10T07:05:01Z

Test build #94544 has finished for PR 22038 at commit cb25b78.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala

SparkQA · 2018-08-12T07:05:02Z

Test build #94638 has finished for PR 22038 at commit 4fd2143.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2018-08-12T11:00:44Z

retest this please

SparkQA · 2018-08-12T15:05:07Z

Test build #94642 has finished for PR 22038 at commit 4fd2143.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-08-13T09:17:14Z

@wangyum since here you are enforcing that IN should behave as = in type comparisons, can we add a UT to enforce that? I see no UT enforcing what you are stating in the PR description...

mgaido91 · 2018-08-13T09:17:26Z

cc @cloud-fan @gatorsmile

cloud-fan · 2018-08-13T11:47:01Z

For behavior changes like this, we should at least list which mainstream databases and big data systems have the same behavior, or state that it follows the SQL standard.

wangyum · 2018-08-14T02:57:03Z

Teradata:

Oracle:

MySQL:

PostgreSQL:

Hive-2.3.2

Hive current master

spark-sql:

This change is the same as HIVE-20204.

mgaido91 · 2018-08-14T08:06:28Z

@wangyum what about Postgres and Hive?

wangyum · 2018-09-13T14:02:29Z

@mgaido91 I added the Postgres and Hive results to #22038 (comment)
@gatorsmile Does this change make sense?

mgaido91 · 2018-09-13T14:14:55Z

@wangyum yes, it seems that the new behavior is the correct one. Maybe we can change Spark's behavior in 3.0. WDYT @cloud-fan @gatorsmile ?

maropu · 2018-09-14T05:21:44Z

@wangyum Can you put a summary of the other databases' behaviours in the PR description?

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

HyukjinKwon · 2018-09-27T16:22:48Z

sql/core/src/test/resources/sql-tests/results/typeCoercion/native/inConversion.sql.out

 SELECT cast(1 as tinyint) in (cast(1 as string)) FROM t
 -- !query 8 schema
-struct<(CAST(CAST(1 AS TINYINT) AS STRING) IN (CAST(CAST(1 AS STRING) AS STRING))):boolean>
+struct<(CAST(CAST(1 AS TINYINT) AS TINYINT) IN (CAST(CAST(1 AS STRING) AS TINYINT))):boolean>


We should also update the migration guide.

This is the BinaryComparison behavior:

scala> spark.sql("explain SELECT cast(1 as tinyint) > (cast(1 as string))").show(false)
+---------------------------------------------------------------------------------------------------------------------------------------+
|plan |
+---------------------------------------------------------------------------------------------------------------------------------------+
|== Physical Plan ==
*(1) Project [false AS (CAST(1 AS TINYINT) > CAST(CAST(1 AS STRING) AS TINYINT))#5]
+- *(1) Scan OneRowRelation[]
|
+---------------------------------------------------------------------------------------------------------------------------------------+

But since this is a behaviour change in the existing IN, I think it's worth updating the guide.

SparkQA · 2018-10-22T07:05:03Z

Test build #97732 has finished for PR 22038 at commit 4fd2143.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-22T07:05:05Z

Test build #97711 has finished for PR 22038 at commit 4fd2143.

This patch fails due to an unknown error code, -9.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2019-12-02T05:02:23Z

Test build #114695 has finished for PR 22038 at commit 80adb74.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-12-04T02:26:14Z

cc @liancheng as well

maropu · 2020-01-09T01:45:20Z

retest this please

maropu · 2020-01-09T01:46:00Z

Brought this up again.

maropu · 2020-01-09T01:51:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

+      case i @ In(value, list) if list.exists(_.dataType != value.dataType) =>
+        findWiderCommonType(list.map(_.dataType)) match {
+          case Some(listType) =>
+            val finalDataType = findCommonTypeForBinaryComparison(value.dataType, listType, conf)
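The two-step shape of the diff above (first a common type for the list, then a comparison type between the value and that list type) can be sketched with a toy model. Everything here is an assumption made for illustration: `ToyType`, `widerCommonType`, `commonTypeForComparison`, and their rules are simplified stand-ins, not Spark's `findWiderCommonType` or `findCommonTypeForBinaryComparison`. The only rule taken from this thread is that a string compared with a tinyint is cast to tinyint.

```scala
// Toy model only: simplified stand-ins, not Spark's TypeCoercion rules.
sealed trait ToyType
case object ToyString extends ToyType
case object ToyTinyInt extends ToyType
case object ToyInt extends ToyType

object InCoercionSketch {
  // Step 1: widest type among the list elements.
  // Assumed toy rules: a single distinct type is kept, TinyInt widens to Int,
  // and a list mixing strings with numbers has no common type.
  def widerCommonType(types: Seq[ToyType]): Option[ToyType] =
    types.distinct match {
      case Seq() => None
      case Seq(single) => Some(single)
      case ts if ts.forall(t => t == ToyTinyInt || t == ToyInt) => Some(ToyInt)
      case _ => None
    }

  // Step 2: type used to compare the value with the list type.
  // Assumed toy rule: a string is cast to the other side's type.
  def commonTypeForComparison(left: ToyType, right: ToyType): Option[ToyType] =
    (left, right) match {
      case (l, r) if l == r => Some(l)
      case (ToyString, other) => Some(other)
      case (other, ToyString) => Some(other)
      case _ => widerCommonType(Seq(left, right))
    }

  // IN coercion as the composition of both steps.
  def inPredicateType(value: ToyType, list: Seq[ToyType]): Option[ToyType] =
    widerCommonType(list).flatMap(listType => commonTypeForComparison(value, listType))
}
```

In this model, `InCoercionSketch.inPredicateType(ToyTinyInt, Seq(ToyString))` yields `Some(ToyTinyInt)`, which matches the updated `inConversion.sql.out` schema quoted earlier in the thread.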


Can you leave some comments about the discussion above?

SparkQA · 2020-01-09T05:51:19Z

Test build #116333 has finished for PR 22038 at commit 80adb74.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-02T15:47:49Z

Test build #117741 has finished for PR 22038 at commit 232e42f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

SparkQA · 2020-02-22T15:49:44Z

Test build #118816 has finished for PR 22038 at commit b1958dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-02-24T09:55:29Z

The changes look good to me.

maropu · 2020-03-15T00:13:05Z

retest this please

SparkQA · 2020-03-15T04:18:57Z

Test build #119806 has finished for PR 22038 at commit b1958dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-03-16T01:11:14Z

cc: @cloud-fan

HyukjinKwon · 2020-03-16T06:22:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

+        if (conf.getConf(SQLConf.LEGACY_IN_PREDICATE_FOLLOW_BINARY_COMPARISON_TYPE_COERCION)) {
+          findWiderCommonType(list.map(_.dataType)) match {
+            case Some(listType) =>
+              val finalDataType = findCommonTypeForBinaryComparison(value.dataType, listType, conf)


@wangyum, the behaviours between decimals and strings look good. But what about other types affected here?

If we think about interpreting IN as = with OR, we should think about other rules applied to equality comparison, for example:

    // For equality between string and timestamp we cast the string to a timestamp
    // so that things like rounding of subsecond precision does not affect the comparison.
    case p @ Equality(left @ StringType(), right @ TimestampType()) =>
      p.makeCopy(Array(Cast(left, TimestampType), right))
    case p @ Equality(left @ TimestampType(), right @ StringType()) =>
      p.makeCopy(Array(left, Cast(right, TimestampType)))

What do you think about fixing this issue completely rather than fixing cases one by one? I didn't check ANSI or other DBMSs yet but I know IN is able to be rewritten to = with OR. Considering that, I suspect the type coercion will be similar too.
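The string/timestamp point can be illustrated with a hedged, Spark-free sketch. It uses `java.sql.Timestamp` as a stand-in: its `toString` always renders a fractional-seconds part, which is not how Spark formats timestamps, so this only shows the general hazard of comparing as text. `TimestampEqualitySketch` and its helpers are hypothetical names.

```scala
import java.sql.Timestamp

// Illustrative only: java.sql.Timestamp stands in for a timestamp column.
object TimestampEqualitySketch {
  val column: Timestamp = Timestamp.valueOf("2020-03-16 06:22:22")
  val literal: String = "2020-03-16 06:22:22"

  // Compare as strings: the rendered timestamp carries a fractional part,
  // so the two texts differ even though they denote the same instant.
  def equalAsStrings(ts: Timestamp, s: String): Boolean =
    ts.toString == s

  // Compare as timestamps: the string is parsed first, in the spirit of
  // Cast(string, TimestampType) in the rule quoted above.
  def equalAsTimestamps(ts: Timestamp, s: String): Boolean =
    ts == Timestamp.valueOf(s)

  // IN written as an OR-chain of equalities, reusing the same comparison.
  def inList(ts: Timestamp, list: Seq[String]): Boolean =
    list.exists(s => equalAsTimestamps(ts, s))
}
```

Because `inList` is defined through the same equality, it cannot disagree with the equivalent `=` / `OR` expression, which is the property the comment above asks the coercion rules to preserve.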

We can remove TypeCoercion.scala#L418-L423 because we have added the same logic to findCommonTypeForBinaryComparison.

SparkQA · 2020-03-21T19:55:56Z

Test build #120132 has finished for PR 22038 at commit e60ff29.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2020-08-26T09:30:01Z

retest this please

SparkQA · 2020-08-26T14:12:18Z

Test build #127922 has finished for PR 22038 at commit e60ff29.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

github-actions · 2020-12-05T00:45:46Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Unify the InConversion and BinaryComparison behaviour when InConversion's list only contains one datatype

9459e6e

findWiderTypeForTwo -> findWiderTypeWithoutStringPromotionForTwo

c4775c4

mgaido91 reviewed Aug 9, 2018

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

Add findInCommonType

935ed36

Fix test error.

cb25b78

mgaido91 reviewed Aug 10, 2018

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala

Fix

4fd2143

wangyum changed the title ~~[SPARK-25056][SQL] Unify the InConversion and BinaryComparison behaviour when InConversion's list only contains one datatype~~ [SPARK-25056][SQL] Unify the InConversion and BinaryComparison behavior Aug 12, 2018

maropu reviewed Sep 14, 2018

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

HyukjinKwon reviewed Sep 27, 2018

maropu reviewed Jan 9, 2020

wangyum added 2 commits February 2, 2020 18:41

Merge remote-tracking branch 'upstream/master' into SPARK-25056

7c56d38

# Conflicts: # sql/core/src/test/resources/sql-tests/results/typeCoercion/native/inConversion.sql.out

Merge upstream

232e42f

HyukjinKwon reviewed Feb 18, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

wangyum added 2 commits February 22, 2020 13:17

Merge remote-tracking branch 'upstream/master' into SPARK-25056

8536af4

Add LEGACY_IN_PREDICATE_FOLLOW_BINARY_COMPARISON_TYPE_COERCION

b1958dd

HyukjinKwon reviewed Mar 16, 2020

wangyum added 2 commits March 21, 2020 20:17

Merge remote-tracking branch 'upstream/master' into SPARK-25056

f060efa

Fix

e60ff29

wangyum closed this Jun 12, 2020

wangyum reopened this Aug 26, 2020

tanelk mentioned this pull request Oct 9, 2020

[WIP][SPARK-33098][SQL] Fix In expression casts #29988

Closed

bersprockets mentioned this pull request Nov 3, 2020

[SPARK-33098][SQL] Avoid MetaException by not pushing down partition filters with incompatible types #30207

Closed

github-actions bot added the Stale label Dec 5, 2020

github-actions bot closed this Dec 6, 2020
