[SPARK-29860][SQL] Fix dataType mismatch issue for InSubquery. #26485
Conversation
ok to test

Test build #113659 has finished for PR 26485 at commit
Test build #113665 has finished for PR 26485 at commit
Force-pushed from cea1b78 to 312621a

Test build #113678 has finished for PR 26485 at commit
cc @wangyum

Force-pushed from 48ec228 to bb68aac

Test build #113689 has finished for PR 26485 at commit

Test build #113691 has finished for PR 26485 at commit

Test build #113699 has finished for PR 26485 at commit
```
  Some(widerType)
} else {
  None
}
```
This code looks suspicious... I personally think this issue should be fixed only in InConversion instead of findTightestCommonType, because a change to findTightestCommonType can affect type coercion in other operations. cc: @mgaido91 @cloud-fan
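To make that direction concrete, here is a minimal standalone sketch (hypothetical names, not Spark's actual InConversion rule) of coercing each left-hand column against the matching subquery output column, so that the decimal handling stays local to the IN rules instead of changing findTightestCommonType:

```
// Hypothetical sketch only -- it does not reproduce Spark's InConversion.
// Idea: zip the IN expression's left-hand columns with the subquery's output
// columns and compute a common type per pair, widening decimals locally.
sealed trait Typ
case class DecimalT(precision: Int, scale: Int) extends Typ

def commonTypeFor(l: Typ, r: Typ): Option[Typ] = (l, r) match {
  case (a, b) if a == b => Some(a)
  case (DecimalT(p1, s1), DecimalT(p2, s2)) =>
    val scale = math.max(s1, s2)
    Some(DecimalT(math.max(p1 - s1, p2 - s2) + scale, scale))
  case _ => None // a real rule would handle many more combinations
}

def coerceInSubquery(lhs: Seq[Typ], subOutput: Seq[Typ]): Option[Seq[Typ]] =
  if (lhs.length != subOutput.length) None
  else {
    val paired = lhs.zip(subOutput).map { case (l, r) => commonTypeFor(l, r) }
    if (paired.forall(_.isDefined)) Some(paired.flatten) else None
  }

// Decimal(18,0) IN (subquery returning Decimal(19,0)) => common type Decimal(19,0)
println(coerceInSubquery(Seq(DecimalT(18, 0)), Seq(DecimalT(19, 0))))
```

This is REPL-pasteable; a real fix would of course reuse Spark's existing wider-type helpers rather than this toy lookup.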
I think that for two DecimalTypes, such as Decimal(3,0) and Decimal(3,2), their tightest common type should be Decimal(5,2), which is consistent with the method name findTightestCommonType.
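As a quick check of that arithmetic (standalone Scala, not Spark's DecimalType API), keep the larger scale and enough integer digits for both sides:

```
// Decimal(3,0) vs Decimal(3,2)
val scale = math.max(0, 2)             // fractional digits needed: 2
val intDigits = math.max(3 - 0, 3 - 2) // integer digits needed: 3
println(s"Decimal(${intDigits + scale}, $scale)") // Decimal(5,2)
```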
Is this related to the specific bug? If not, let's open another PR to do it.
It's more reasonable to fix InConversion. I think it's wrong that In and InSubquery have different type coercion logic.
Thanks @cloud-fan, I will take a deeper look.
Actually InConversion and BinaryComparison also have different type coercion logic: #22038
I found a similar implementation in spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala (lines 316 to 324 at e46e487):
```
case (t1: DecimalType, t2: DecimalType) =>
  val scale = math.max(t1.scale, t2.scale)
  val range = math.max(t1.precision - t1.scale, t2.precision - t2.scale)
  if (range + scale > 38) {
    // DecimalType can't support precision > 38
    DoubleType
  } else {
    DecimalType(range + scale, scale)
  }
```
> I think it's wrong that In and InSubquery have different type coercion logic.

I agree on this. Please see #19635, where I tried to fix this....
I think the PR description above should be self-descriptive, so could you please make it clearer? What's the root cause of this issue, how is it fixed, and so on...

Thanks, updated.
Test build #113745 has finished for PR 26485 at commit

Force-pushed from 0c3109e to c5857d1

Test build #113782 has finished for PR 26485 at commit

Test build #113788 has finished for PR 26485 at commit

Test build #113780 has finished for PR 26485 at commit

Test build #113791 has finished for PR 26485 at commit
Test build #113794 has finished for PR 26485 at commit

Force-pushed from 4ac52bd to e7d7b61

Test build #113879 has finished for PR 26485 at commit

retest this please.

retest this please

Test build #113899 has finished for PR 26485 at commit

retest this please

cc @liancheng as well

Test build #114820 has finished for PR 26485 at commit
```
-- !query 9 output
org.apache.spark.sql.AnalysisException
cannot resolve '(named_struct('t4a', t4.`t4a`, 't4b', t4.`t4b`, 't4c', t4.`t4c`) IN (listquery()))' due to data type mismatch:
cannot resolve '(named_struct('t4a', t4.`t4a`, 't4b', t4.`t4b`, 't4c', t4.`t4c`) IN (listquery()))' due to data type mismatch:
```
In fact, I do not know why the extra space is involved; I tried to remove it but failed. It should not matter.
Test build #114835 has finished for PR 26485 at commit
LGTM. Can we check the behavior in other databases like pgsql? It's better to know whether Spark follows the SQL standard or not.
Will check the behavior later.
Test build #114847 has finished for PR 26485 at commit

Force-pushed from 05ffba1 to c8f93d8
How about the other way around (string in decimal)? Anyway this is already the behavior of
The result is similar:
Test build #114862 has finished for PR 26485 at commit

thanks, merging to master!
### What changes were proposed in this pull request?
There is an issue with the `InSubquery` expression.
For example, consider two tables `ta` and `tb` created by the statements below.
```
sql("create table ta(id Decimal(18,0)) using parquet")
sql("create table tb(id Decimal(19,0)) using parquet")
```
The statement below throws a dataType mismatch exception.
```
sql("select * from ta where id in (select id from tb)").show()
```
However, this similar statement executes successfully.
```
sql("select * from ta where id in ((select id from tb))").show()
```
The root cause is that the `InSubquery` expression does not find a common type for two DecimalTypes, unlike the `In` expression.
Besides that, the `InSubquery` expression also does not find a common type for DecimalType and double/float/bigint.
In this PR, I fix this issue by finding a wider type for the `InSubquery` expression when DecimalType is involved.
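For intuition, the wider type of Decimal(18,0) and Decimal(19,0) is Decimal(19,0), so after the fix the failing query behaves roughly like the explicitly cast form below (an illustrative rewrite, not the literal plan Spark produces):
```
sql("select * from ta where cast(id as decimal(19,0)) in (select cast(id as decimal(19,0)) from tb)").show()
```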
### Why are the changes needed?
Some `InSubquery` expressions would throw a dataType mismatch exception.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes apache#26485 from turboFei/SPARK-29860-in-subquery.
Authored-by: turbofei <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
