[SPARK-29860][SQL] Fix dataType mismatch issue for InSubquery. #26485
Conversation
ok to test

Test build #113659 has finished for PR 26485 at commit
Test build #113665 has finished for PR 26485 at commit
Force-pushed from cea1b78 to 312621a

Test build #113678 has finished for PR 26485 at commit
cc @wangyum

Force-pushed from 48ec228 to bb68aac

Test build #113689 has finished for PR 26485 at commit

Test build #113691 has finished for PR 26485 at commit

Test build #113699 has finished for PR 26485 at commit
```
  Some(widerType)
} else {
  None
}
```
This code looks suspicious... I personally think this issue should be fixed only in InConversion instead of findTightestCommonType, because a change to findTightestCommonType can affect type coercion in other operations. cc: @mgaido91 @cloud-fan
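To make that direction concrete, here is a minimal standalone sketch (hypothetical names, not Spark's actual InConversion rule) of coercing each left-hand column against the matching subquery output column, so that the decimal handling stays local to the IN rules instead of changing findTightestCommonType:

```
// Hypothetical sketch only -- it does not reproduce Spark's InConversion.
// Idea: zip the IN expression's left-hand columns with the subquery's output
// columns and compute a common type per pair, widening decimals locally.
sealed trait Typ
case class DecimalT(precision: Int, scale: Int) extends Typ

def commonTypeFor(l: Typ, r: Typ): Option[Typ] = (l, r) match {
  case (a, b) if a == b => Some(a)
  case (DecimalT(p1, s1), DecimalT(p2, s2)) =>
    val scale = math.max(s1, s2)
    Some(DecimalT(math.max(p1 - s1, p2 - s2) + scale, scale))
  case _ => None // a real rule would handle many more combinations
}

def coerceInSubquery(lhs: Seq[Typ], subOutput: Seq[Typ]): Option[Seq[Typ]] =
  if (lhs.length != subOutput.length) None
  else {
    val paired = lhs.zip(subOutput).map { case (l, r) => commonTypeFor(l, r) }
    if (paired.forall(_.isDefined)) Some(paired.flatten) else None
  }

// Decimal(18,0) IN (subquery returning Decimal(19,0)) => common type Decimal(19,0)
println(coerceInSubquery(Seq(DecimalT(18, 0)), Seq(DecimalT(19, 0))))
```

This is REPL-pasteable; a real fix would of course reuse Spark's existing wider-type helpers rather than this toy lookup.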
I think that for two DecimalTypes, such as Decimal(3,0) and Decimal(3,2), their tightest common type should be Decimal(5,2), which is consistent with the method name findTightestCommonType.
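As a quick check of that arithmetic (standalone Scala, not Spark's DecimalType API), keep the larger scale and enough integer digits for both sides:

```
// Decimal(3,0) vs Decimal(3,2)
val scale = math.max(0, 2)             // fractional digits needed: 2
val intDigits = math.max(3 - 0, 3 - 2) // integer digits needed: 3
println(s"Decimal(${intDigits + scale}, $scale)") // Decimal(5,2)
```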
Is this related to the specific bug? If not, let's open another PR to do it.
It's more reasonable to fix InConversion. I think it's wrong that In and InSubquery have different type coercion logic.
Thanks @cloud-fan, I will take a deeper look.
Actually InConversion and BinaryComparison also have different type coercion logic: #22038
I found a similar implementation in spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala (lines 316 to 324 at e46e487):
```
case (t1: DecimalType, t2: DecimalType) =>
  val scale = math.max(t1.scale, t2.scale)
  val range = math.max(t1.precision - t1.scale, t2.precision - t2.scale)
  if (range + scale > 38) {
    // DecimalType can't support precision > 38
    DoubleType
  } else {
    DecimalType(range + scale, scale)
  }
```
> I think it's wrong that In and InSubquery have different type coercion logic.

I agree on this. Please see #19635, where I tried to fix this....
I think the PR description above should be self-descriptive, so could you please make it clearer? What's the root cause of this issue, how is it fixed, and so on...

Thanks, updated.
Test build #113745 has finished for PR 26485 at commit

Force-pushed from 0c3109e to c5857d1

Test build #113782 has finished for PR 26485 at commit

Test build #113788 has finished for PR 26485 at commit

Test build #113780 has finished for PR 26485 at commit

Test build #113791 has finished for PR 26485 at commit
Test build #113794 has finished for PR 26485 at commit

Force-pushed from 4ac52bd to e7d7b61

Test build #113879 has finished for PR 26485 at commit

retest this please.

retest this please

Test build #113899 has finished for PR 26485 at commit

retest this please

cc @liancheng as well

Test build #114820 has finished for PR 26485 at commit
```
-- !query 9 output
org.apache.spark.sql.AnalysisException
cannot resolve '(named_struct('t4a', t4.`t4a`, 't4b', t4.`t4b`, 't4c', t4.`t4c`) IN (listquery()))' due to data type mismatch:
cannot resolve '(named_struct('t4a', t4.`t4a`, 't4b', t4.`t4b`, 't4c', t4.`t4c`) IN (listquery()))' due to data type mismatch:
```
In fact, I do not know why the extra space is involved; I tried to remove it but failed. It should not matter.
Test build #114835 has finished for PR 26485 at commit
LGTM. Can we check the behavior in other databases like pgsql? It's better to know whether Spark follows the SQL standard or not.
Will check the behavior later.
Test build #114847 has finished for PR 26485 at commit

Force-pushed from 05ffba1 to c8f93d8
How about the other way around (string in decimal)? Anyway this is already the behavior of
The result is similar:
Test build #114862 has finished for PR 26485 at commit

thanks, merging to master!
### What changes were proposed in this pull request?
There is an issue with the `InSubquery` expression.
For example, consider two tables `ta` and `tb` created by the statements below.
```
sql("create table ta(id Decimal(18,0)) using parquet")
sql("create table tb(id Decimal(19,0)) using parquet")
```
The statement below throws a dataType mismatch exception.
```
sql("select * from ta where id in (select id from tb)").show()
```
However, this similar statement executes successfully.
```
sql("select * from ta where id in ((select id from tb))").show()
```
The root cause is that the `InSubquery` expression does not find a common type for two DecimalTypes, unlike the `In` expression.
Besides that, the `InSubquery` expression also does not find a common type for DecimalType and double/float/bigint.
In this PR, I fix this issue by finding a wider type for the `InSubquery` expression when DecimalType is involved.
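For intuition, the wider type of Decimal(18,0) and Decimal(19,0) is Decimal(19,0), so after the fix the failing query behaves roughly like the explicitly cast form below (an illustrative rewrite, not the literal plan Spark produces):
```
sql("select * from ta where cast(id as decimal(19,0)) in (select cast(id as decimal(19,0)) from tb)").show()
```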
### Why are the changes needed?
Some `InSubquery` expressions would throw a dataType mismatch exception.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes apache#26485 from turboFei/SPARK-29860-in-subquery.
Authored-by: turbofei <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
