@wangyum
Member

@wangyum wangyum commented Aug 8, 2018

What changes were proposed in this pull request?

Before this PR:

scala> val df = spark.range(4).toDF().selectExpr("cast(id as decimal(9, 2)) as id")
df: org.apache.spark.sql.DataFrame = [id: decimal(9,2)]

scala> df.filter("id in('1', '3')").show
+---+
| id|
+---+
+---+

scala> df.filter("id = '1' or id ='3'").show
+----+
|  id|
+----+
|1.00|
|3.00|
+----+

After this PR:

scala> val df = spark.range(4).toDF().selectExpr("cast(id as decimal(9, 2)) as id")
df: org.apache.spark.sql.DataFrame = [id: decimal(9,2)]

scala> df.filter("id in('1', '3')").show
+----+
|  id|
+----+
|1.00|
|3.00|
+----+

scala> df.filter("id = '1' or id ='3'").show
+----+
|  id|
+----+
|1.00|
|3.00|
+----+

This change is the same as HIVE-20204.
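
A minimal sketch in plain Scala (no Spark involved) of why the two filters above can disagree. It assumes, purely for illustration, that the old `IN` path compares both sides as strings while `=` compares them as numbers. `compareAsStrings`, `compareAsNumbers` and `filterIn` are hypothetical helpers, not Spark APIs, and the common type Spark actually picks may differ.

```scala
// Toy model, not Spark code: the same decimal column values compared
// against string literals under two different coercion strategies.
object InCoercionSketch {
  // Column values from spark.range(4) cast to decimal(9, 2): 0.00, 1.00, 2.00, 3.00.
  val ids: Seq[BigDecimal] = (0 to 3).map(i => BigDecimal(i).setScale(2))

  // Assumed old IN behavior: cast the decimal to a string and compare text.
  def compareAsStrings(id: BigDecimal, literal: String): Boolean =
    id.toString == literal // "1.00" is not the same text as "1"

  // Assumed = behavior: cast the string to a number and compare values.
  def compareAsNumbers(id: BigDecimal, literal: String): Boolean =
    id == BigDecimal(literal) // 1.00 and 1 are numerically equal

  // Keep the ids that match at least one literal under the given comparison.
  def filterIn(cmp: (BigDecimal, String) => Boolean, literals: Seq[String]): Seq[BigDecimal] =
    ids.filter(id => literals.exists(cmp(id, _)))
}
```

Under these assumptions, `filterIn(compareAsStrings, Seq("1", "3"))` keeps nothing, while `filterIn(compareAsNumbers, Seq("1", "3"))` keeps the rows for 1.00 and 3.00, which mirrors the empty and non-empty results shown above.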

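Other database behavior (one screenshot per system in the original comment):

Teradata: [screenshot]
Oracle: [screenshot]
MySQL: [screenshot]
Postgres: [screenshot]
Hive 2.3.2: [screenshot]
Hive current master: [screenshot]
spark-sql: [screenshot]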

How was this patch tested?

unit tests

@SparkQA

SparkQA commented Aug 8, 2018

Test build #94432 has finished for PR 22038 at commit 9459e6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Aug 9, 2018

@mgaido91 what do you think about it?

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94473 has finished for PR 22038 at commit c4775c4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Aug 9, 2018

retest this please

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94479 has finished for PR 22038 at commit c4775c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94504 has finished for PR 22038 at commit 935ed36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 10, 2018

Test build #94544 has finished for PR 22038 at commit cb25b78.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2018

Test build #94638 has finished for PR 22038 at commit 4fd2143.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title [SPARK-25056][SQL] Unify the InConversion and BinaryComparison behaviour when InConversion's list only contains one datatype [SPARK-25056][SQL] Unify the InConversion and BinaryComparison behavior Aug 12, 2018
@wangyum
Member Author

wangyum commented Aug 12, 2018

retest this please

@SparkQA

SparkQA commented Aug 12, 2018

Test build #94642 has finished for PR 22038 at commit 4fd2143.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor

@wangyum since here you are enforcing that IN should behave as = in type comparisons, can we add a UT to enforce that? I see no UT enforcing what you are stating in the PR description...

@mgaido91
Contributor

cc @cloud-fan @gatorsmile

@cloud-fan
Contributor

cloud-fan commented Aug 13, 2018

for behavior changes like this, we should at least list which mainstream databases/bigdata systems have the same behavior, or state that it's a SQL standard.

@wangyum
Member Author

wangyum commented Aug 14, 2018

Teradata: [screenshot]
Oracle: [screenshot]
MySQL: [screenshot]
Postgres: [screenshot]
Hive 2.3.2: [screenshot]
Hive current master: [screenshot]
spark-sql: [screenshot]

This change is the same as HIVE-20204.

@mgaido91
Contributor

@wangyum what about Postgres and Hive?

@wangyum
Member Author

wangyum commented Sep 13, 2018

@mgaido91 I added the Postgres and Hive results to #22038 (comment)
@gatorsmile Does this change make sense?

@mgaido91
Contributor

@wangyum yes, it seems that the new behavior is the correct one. Maybe we can change Spark's behavior in 3.0. WDYT @cloud-fan @gatorsmile ?

@maropu
Member

maropu commented Sep 14, 2018

@wangyum Can you put a summary of the other databases' behaviors in the PR description?

 SELECT cast(1 as tinyint) in (cast(1 as string)) FROM t
 -- !query 8 schema
-struct<(CAST(CAST(1 AS TINYINT) AS STRING) IN (CAST(CAST(1 AS STRING) AS STRING))):boolean>
+struct<(CAST(CAST(1 AS TINYINT) AS TINYINT) IN (CAST(CAST(1 AS STRING) AS TINYINT))):boolean>
Member

Should also update migration guide.

Member Author

This is the BinaryComparison behavior:

scala> spark.sql("explain SELECT cast(1 as tinyint) > (cast(1 as string))").show(false)
+---------------------------------------------------------------------------------------------------------------------------------------+
|plan                                                                                                                                   |
+---------------------------------------------------------------------------------------------------------------------------------------+
|== Physical Plan ==
*(1) Project [false AS (CAST(1 AS TINYINT) > CAST(CAST(1 AS STRING) AS TINYINT))#5]
+- *(1) Scan OneRowRelation[]

|
+---------------------------------------------------------------------------------------------------------------------------------------+
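
A small plain-Scala sketch of how the two schemas in the diff above can give different answers for the same literal. It is a toy model: `asStrings` stands in for casting the tinyint to a string, `asTinyints` for casting the string to a tinyint, and `None` stands in for a SQL NULL when the string is not a valid byte. Both helper names are hypothetical, and Spark's real cast rules (whitespace, overflow, ANSI mode) are not modeled.

```scala
import scala.util.Try

// Toy model, not Spark code.
object TinyintVsString {
  // CAST(tinyint AS STRING) IN (string): compare the textual forms.
  def asStrings(value: Byte, literal: String): Boolean =
    value.toString == literal

  // tinyint IN (CAST(string AS TINYINT)): parse the string first.
  // An unparsable string yields None, standing in for NULL.
  def asTinyints(value: Byte, literal: String): Option[Boolean] =
    Try(literal.toByte).toOption.map(_ == value)
}
```

With this model the literal `'1'` matches under both strategies, but `'01'` matches only when the string is cast to a tinyint, which is the kind of difference a migration-guide entry would need to call out.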

Member

But since this is a behavior change in the existing IN, I think it's worth updating the guide.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97732 has finished for PR 22038 at commit 4fd2143.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97711 has finished for PR 22038 at commit 4fd2143.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 2, 2019

Test build #114695 has finished for PR 22038 at commit 80adb74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

cc @liancheng as well

@maropu
Member

maropu commented Jan 9, 2020

retest this please

@maropu
Member

maropu commented Jan 9, 2020

Brought this up again.

    case i @ In(value, list) if list.exists(_.dataType != value.dataType) =>
      findWiderCommonType(list.map(_.dataType)) match {
        case Some(listType) =>
          val finalDataType = findCommonTypeForBinaryComparison(value.dataType, listType, conf)
Member

Can you leave some comments about the discussion above?
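
For readers following the hunk above, here is a self-contained toy version of the two-step resolution it performs: first find one type for the list, then reconcile that type with the value the way a binary comparison would. Everything in it is an assumption made for illustration: the four-type lattice, the rule that a numeric type wins over a string, and the names `widerCommonType`, `commonTypeForComparison` and `coerceIn`, which only echo the real functions and do not reproduce their rules.

```scala
// Toy model, not Spark code: a tiny type lattice with the two-step IN coercion.
object InCoercionOrder {
  sealed trait DType
  case object IntT extends DType
  case object LongT extends DType
  case object DoubleT extends DType
  case object StringT extends DType

  // Widening order among the toy numeric types.
  private val numericRank: Map[DType, Int] = Map(IntT -> 0, LongT -> 1, DoubleT -> 2)

  // Step 1: one type for the whole list, or None if the elements are unrelated.
  def widerCommonType(types: Seq[DType]): Option[DType] =
    types.distinct match {
      case Seq(single) => Some(single)
      case ts if ts.nonEmpty && ts.forall(numericRank.contains) => Some(ts.maxBy(numericRank))
      case _ => None
    }

  // Step 2: reconcile value and list type as a comparison would.
  // Assumed rule: numeric versus string resolves to the numeric side.
  def commonTypeForComparison(left: DType, right: DType): Option[DType] =
    (left, right) match {
      case (l, r) if l == r => Some(l)
      case (StringT, r) if numericRank.contains(r) => Some(r)
      case (l, StringT) if numericRank.contains(l) => Some(l)
      case (l, r) => widerCommonType(Seq(l, r))
    }

  // The final type that the value and every list element would be cast to.
  def coerceIn(value: DType, list: Seq[DType]): Option[DType] =
    widerCommonType(list).flatMap(commonTypeForComparison(value, _))
}
```

In this model `coerceIn(IntT, Seq(StringT))` resolves to `IntT`, matching the direction of the tinyint example earlier in the thread, while a list mixing `IntT` and `StringT` has no single list type and is left uncoerced.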

@SparkQA

SparkQA commented Jan 9, 2020

Test build #116333 has finished for PR 22038 at commit 80adb74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# Conflicts:
#	sql/core/src/test/resources/sql-tests/results/typeCoercion/native/inConversion.sql.out
@SparkQA

SparkQA commented Feb 2, 2020

Test build #117741 has finished for PR 22038 at commit 232e42f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2020

Test build #118816 has finished for PR 22038 at commit b1958dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

The changes look good to me.

@maropu
Member

maropu commented Mar 15, 2020

retest this please

@SparkQA

SparkQA commented Mar 15, 2020

Test build #119806 has finished for PR 22038 at commit b1958dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Mar 16, 2020

cc: @cloud-fan

    if (conf.getConf(SQLConf.LEGACY_IN_PREDICATE_FOLLOW_BINARY_COMPARISON_TYPE_COERCION)) {
      findWiderCommonType(list.map(_.dataType)) match {
        case Some(listType) =>
          val finalDataType = findCommonTypeForBinaryComparison(value.dataType, listType, conf)
Member

@wangyum, the behaviours between decimals and strings look good. But what about other types affected here?

If we think about interpreting IN as = with OR, we should think about other rules applied to equality comparison, for example:

      // For equality between string and timestamp we cast the string to a timestamp
      // so that things like rounding of subsecond precision does not affect the comparison.
      case p @ Equality(left @ StringType(), right @ TimestampType()) =>
        p.makeCopy(Array(Cast(left, TimestampType), right))
      case p @ Equality(left @ TimestampType(), right @ StringType()) =>
        p.makeCopy(Array(left, Cast(right, TimestampType)))

What do you think about fixing this issue completely rather than fixing cases one by one? I didn't check ANSI or other DBMSs yet but I know IN is able to be rewritten to = with OR. Considering that, I suspect the type coercion will be similar too.
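
To make the point about rewriting IN as = with OR concrete, here is a minimal plain-Scala sketch using `java.sql.Timestamp` from the JDK. `inAsOr`, `eqAsTimestamps` and `eqAsStrings` are hypothetical helpers written for this illustration. They are not Spark code, and SQL's three-valued NULL logic is not modeled.

```scala
import java.sql.Timestamp

// Toy model, not Spark code.
object InAsOrSketch {
  // IN rewritten as a chain of equality checks joined by OR.
  def inAsOr[A, B](value: A, list: Seq[B])(eq: (A, B) => Boolean): Boolean =
    list.map(eq(value, _)).foldLeft(false)(_ || _)

  // Equality after casting the string to a timestamp, as in the rule quoted above.
  def eqAsTimestamps(ts: Timestamp, s: String): Boolean =
    ts == Timestamp.valueOf(s)

  // Equality after casting the timestamp to a string instead.
  def eqAsStrings(ts: Timestamp, s: String): Boolean =
    ts.toString == s
}
```

With a timestamp of `2020-03-16 10:00:00.123` and the literal `'2020-03-16 10:00:00.123000'`, the timestamp comparison treats them as equal while the string comparison does not, because the trailing zeros differ in text only. That is the subsecond-precision concern the quoted comment mentions, and it would apply to IN in the same way if IN followed the equality rules.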

Member Author

We can remove TypeCoercion.scala#L418-L423 because we have added the same logic to findCommonTypeForBinaryComparison.

@SparkQA

SparkQA commented Mar 21, 2020

Test build #120132 has finished for PR 22038 at commit e60ff29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum closed this Jun 12, 2020
@wangyum wangyum reopened this Aug 26, 2020
@wangyum
Member Author

wangyum commented Aug 26, 2020

retest this please

@SparkQA

SparkQA commented Aug 26, 2020

Test build #127922 has finished for PR 22038 at commit e60ff29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

github-actions bot commented Dec 5, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 5, 2020
@github-actions github-actions bot closed this Dec 6, 2020