-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25056][SQL] Unify the InConversion and BinaryComparison behavior #22038
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…on's list only contains one datatype
|
Test build #94432 has finished for PR 22038 at commit
|
|
@mgaido91 what do you think about it? |
|
Test build #94473 has finished for PR 22038 at commit
|
|
retest this please |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Outdated
Show resolved
Hide resolved
|
Test build #94479 has finished for PR 22038 at commit
|
|
Test build #94504 has finished for PR 22038 at commit
|
|
Test build #94544 has finished for PR 22038 at commit
|
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
Outdated
Show resolved
Hide resolved
|
Test build #94638 has finished for PR 22038 at commit
|
|
retest this please |
|
Test build #94642 has finished for PR 22038 at commit
|
|
@wangyum since here you are enforcing that IN should behave as = in type comparisons, can we add a UT to enforce that? I see no UT enforcing what you are stating in the PR description... |
|
for behavior changes like this, we should at least list which mainstream databases/bigdata systems have the same behavior, or state that it's a SQL standard. |
|
Hive current master This change is the same as HIVE-20204. |
|
@wangyum what about Postgres and Hive? |
|
@mgaido91 I updated Postgres and Hive to #22038 (comment) |
|
@wangyum yes, it seems that the new behavior is the correct one. Maybe we can change Spark's behavior in 3.0. WDYT @cloud-fan @gatorsmile ? |
|
@wangyum Can you put the sammary of the other databases behaivours in the PR description? |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Outdated
Show resolved
Hide resolved
| SELECT cast(1 as tinyint) in (cast(1 as string)) FROM t | ||
| -- !query 8 schema | ||
| struct<(CAST(CAST(1 AS TINYINT) AS STRING) IN (CAST(CAST(1 AS STRING) AS STRING))):boolean> | ||
| struct<(CAST(CAST(1 AS TINYINT) AS TINYINT) IN (CAST(CAST(1 AS STRING) AS TINYINT))):boolean> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also update migration guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the BinaryComparison behavior:
scala> spark.sql("explain SELECT cast(1 as tinyint) > (cast(1 as string))").show(false)
+---------------------------------------------------------------------------------------------------------------------------------------+
|plan |
+---------------------------------------------------------------------------------------------------------------------------------------+
|== Physical Plan ==
*(1) Project [false AS (CAST(1 AS TINYINT) > CAST(CAST(1 AS STRING) AS TINYINT))#5]
+- *(1) Scan OneRowRelation[]
|
+---------------------------------------------------------------------------------------------------------------------------------------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, since this is a behaviour change in the existing in, I think its worth updating the guide.
|
Test build #97732 has finished for PR 22038 at commit
|
|
Test build #97711 has finished for PR 22038 at commit
|
|
Test build #114695 has finished for PR 22038 at commit
|
|
cc @liancheng as well |
|
retest this please |
|
Brought this up again. |
| case i @ In(value, list) if list.exists(_.dataType != value.dataType) => | ||
| findWiderCommonType(list.map(_.dataType)) match { | ||
| case Some(listType) => | ||
| val finalDataType = findCommonTypeForBinaryComparison(value.dataType, listType, conf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you leave some comments about the discussion above?
|
Test build #116333 has finished for PR 22038 at commit
|
# Conflicts: # sql/core/src/test/resources/sql-tests/results/typeCoercion/native/inConversion.sql.out
|
Test build #117741 has finished for PR 22038 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Show resolved
Hide resolved
|
Test build #118816 has finished for PR 22038 at commit
|
|
The changes look good to me. |
|
retest this please |
|
Test build #119806 has finished for PR 22038 at commit
|
|
cc: @cloud-fan |
| if (conf.getConf(SQLConf.LEGACY_IN_PREDICATE_FOLLOW_BINARY_COMPARISON_TYPE_COERCION)) { | ||
| findWiderCommonType(list.map(_.dataType)) match { | ||
| case Some(listType) => | ||
| val finalDataType = findCommonTypeForBinaryComparison(value.dataType, listType, conf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wangyum, the behaviours between decimals and strings look good. But what about other types affected here?
If we think about interpreting IN as = with OR, we should think about other rules applied to equality comparison, for example:
// For equality between string and timestamp we cast the string to a timestamp
// so that things like rounding of subsecond precision does not affect the comparison.
case p @ Equality(left @ StringType(), right @ TimestampType()) =>
p.makeCopy(Array(Cast(left, TimestampType), right))
case p @ Equality(left @ TimestampType(), right @ StringType()) =>
p.makeCopy(Array(left, Cast(right, TimestampType)))What do you think about fixing this issue completely rather than fixing cases one by one? I didn't check ANSI or other DBMSs yet but I know IN is able to be rewritten to = with OR. Considering that, I suspect the type coercion will be similar too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove TypeCoercion.scala#L418-L423 because we have added the same logic to findCommonTypeForBinaryComparison.
|
Test build #120132 has finished for PR 22038 at commit
|
|
retest this please |
|
Test build #127922 has finished for PR 22038 at commit
|
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |







What changes were proposed in this pull request?
before this PR:
after this PR:
This change is the same as HIVE-20204.
Other database behavior:

Teradata:
Oracle:

MySQL:

postgres

Hive-2.3.2

Hive current master

spark-sql:

How was this patch tested?
unit tests