[SPARK-33098][SQL] Avoid MetaException by not pushing down partition filters with incompatible types #30207
Test build #130467 has finished for PR 30207 at commit
Test build #130502 has finished for PR 30207 at commit
if dt1.isInstanceOf[IntegralType] && dt2.isInstanceOf[StringType] =>
  fixValue(rawValue, dt1).map { value =>
    s"$name ${op.symbol} $value"
  }
I don't have an equivalent "attempt to correct" for In and InSet, just for binary comparisons. In the case of In and InSet, if the datatypes are not compatible, I just drop the filter (which is what would have happened before SPARK-22384).
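The drop-on-mismatch behavior described for In and InSet could look roughly like the sketch below. It is not the PR's code: `InFilterSketch`, `Lit`, `IntLit`, `StrLit`, and `convertInForIntegralColumn` are hypothetical names, the column is assumed to be integral, and rendering the list as a parenthesized chain of `or`-ed equalities is an assumption about the output form.

```scala
object InFilterSketch {
  // Hypothetical, simplified literal model (not Spark's Literal class).
  sealed trait Lit
  final case class IntLit(v: Long) extends Lit
  final case class StrLit(v: String) extends Lit

  // For an integral partition column: keep the filter only when every
  // value in the list is an integer literal. Any mismatch returns None,
  // meaning the whole filter is dropped rather than corrected.
  def convertInForIntegralColumn(name: String, values: Seq[Lit]): Option[String] = {
    val rendered = values.collect { case IntLit(v) => s"$name = $v" }
    if (values.nonEmpty && rendered.length == values.length)
      Some(rendered.mkString("(", " or ", ")"))
    else
      None
  }
}
```

With these definitions, `convertInForIntegralColumn("b", Seq(IntLit(1), IntLit(2)))` yields `Some("(b = 1 or b = 2)")`, while a list containing `StrLit("2")` yields `None`.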
Thank you, @bersprockets. cc @sunchao
Thanks @dongjoon-hyun for pinging. At a high level, if we are going to optimize and remove the cast in a scenario like:
cast(b as string) <op> string_literal
where b is an integral column, perhaps we should do it in UnwrapCastInBinaryComparison, so that it can be used not only by Hive but also by other data sources.
Also, @bersprockets, can you improve the PR description? Let's not put "why are the changes needed" in "What changes were proposed in this pull request?".
ExtractableLiteral(value), ExtractAttribute(SupportedAttribute(name))) =>
ExtractAttribute(SupportedAttribute(name), dt1), ExtractableLiteral(rawValue, dt2))
if dt1.isInstanceOf[IntegralType] && dt2.isInstanceOf[StringType] =>
fixValue(rawValue, dt1).map { value =>
Hmm, will this change semantics? Suppose we have cast(b as string) < '012' where b is 11. Before the conversion this evaluates to false, because the string '11' sorts after '012' ('1' > '0'), but after the conversion it evaluates to true, because 11 < 12.
Yes, it should probably ignore any literal strings with leading zeros.
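To make the concern concrete, here is a minimal sketch that reproduces the reviewer's case in plain Scala and adds one possible leading-zero guard. The names `LeadingZeroDemo`, `stringLess`, `intLess`, and `isCanonicalInt` are hypothetical and are not part of the PR; the guard simply checks that the literal round-trips through integer parsing unchanged.

```scala
object LeadingZeroDemo {
  // Mirrors `cast(b as string) < literal`: a lexicographic string comparison.
  def stringLess(b: Int, literal: String): Boolean = b.toString < literal

  // Mirrors the rewritten filter `b < <int>`: a numeric comparison.
  // Assumes the literal parses as an Int.
  def intLess(b: Int, literal: String): Boolean = b < literal.toInt

  // Hypothetical guard: accept a literal only if parsing it and printing
  // it back gives the same text. This rejects leading zeros ("012"),
  // explicit plus signs ("+5"), and non-numeric strings.
  def isCanonicalInt(literal: String): Boolean =
    scala.util.Try(literal.toLong).toOption.exists(_.toString == literal)
}
```

For b = 11 and the literal "012", `stringLess` returns false and `intLess` returns true, which is the divergence pointed out above; `isCanonicalInt("012")` returns false, so a conversion gated on it would skip this literal. The sketch only covers the case raised in this thread and makes no claim about other literals.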
perhaps we should do it in UnwrapCastInBinaryComparison so that it can not only be used by Hive but also other data sources.
Whatever makes sense. There is some long-running work on TypeCoercion (#22038) that fixes a few of these cases. If that goes through and we can close the gap with the others, that would be fine. I am probably not in a position to provide much help in the optimizer code at this point.
What changes were proposed in this pull request?
This PR checks that the type of the extracted column is compatible with the type of the literal. If they're not compatible, it attempts to make them compatible. If that fails, the binary comparison is not used in the final filter expression.
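The check-then-fix-then-drop flow described above can be sketched as follows. This is a simplified illustration, not the PR's implementation: the `ColType` and `Lit` models are hypothetical stand-ins for Spark's data types and literals, and this `fixValue` only borrows its name and `Option`-returning shape from the diff in this conversation.

```scala
object PartitionFilterSketch {
  // Hypothetical, simplified type and literal models.
  sealed trait ColType
  case object IntegralCol extends ColType
  case object StringCol extends ColType

  sealed trait Lit
  final case class IntLit(v: Long) extends Lit
  final case class StrLit(v: String) extends Lit

  // Returns the literal rendered for the filter string when it is (or can
  // be made) compatible with the column type, otherwise None.
  def fixValue(lit: Lit, colType: ColType): Option[String] = (colType, lit) match {
    case (IntegralCol, IntLit(v)) => Some(v.toString)       // already compatible
    case (StringCol, StrLit(s))   => Some("\"" + s + "\"")  // already compatible (no escaping here)
    case (IntegralCol, StrLit(s)) =>
      // Attempt to make the types compatible: accept only strings that
      // round-trip through integer parsing unchanged.
      scala.util.Try(s.toLong).toOption.filter(_.toString == s).map(_.toString)
    case (StringCol, IntLit(_))   => None                   // no correction attempted in this sketch
  }

  // Builds the pushed-down comparison, or None when it must be left out
  // of the final filter expression.
  def convert(name: String, colType: ColType, op: String, lit: Lit): Option[String] =
    fixValue(lit, colType).map(value => s"$name $op $value")
}
```

Here `convert("b", IntegralCol, "=", StrLit("2"))` gives `Some("b = 2")`, while `StrLit("abc")` gives `None`, so that comparison would not be pushed down.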
Why are the changes needed?
To avoid unnecessary MetaExceptions.
SPARK-22384 expanded the types of filters that Shim_v0_13#convertFilters can handle to include filters that contain CAST expressions. This opened up the door for Spark to push down partition filters with mismatched datatypes.
Take this example: Spark passes the filter 'cast(b as string) = "2"' to convertFilters, where b is an integral column. The integral column b is extracted from the CAST expression, but the literal is left as-is, resulting in the following filter getting pushed down to the metastore:
b = "2"
Hive throws a MetaException complaining that an integer column is being compared to a string literal (with the very misleading message "Filtering is supported only on partition keys of type string").
Here are some examples that throw a MetaException:
In fact, all the failures I could find boil down to the following partition filter getting pushed down to the metastore:
<col-name-of-integral-column> <binary-comparison> "<string-literal>"
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added tests.