
Conversation

@bersprockets (Contributor) commented on Oct 30, 2020

What changes were proposed in this pull request?

This PR checks that the type of the extracted partition column is compatible with the type of the literal it is compared against. If the types differ, it attempts to convert the literal to the column's type. If that conversion fails, the binary comparison is left out of the filter expression pushed down to the metastore.
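As a rough sketch of the approach (illustrative only; toPushdownFilter is a hypothetical name, not the PR's actual code):

```scala
import scala.util.Try
import org.apache.spark.sql.types.{DataType, IntegralType}

// When the extracted column is integral but the literal is a string,
// try converting the literal to a number. If the conversion fails,
// return None so the predicate is never pushed to the metastore;
// Spark still evaluates it after listing the partitions.
def toPushdownFilter(col: String, colType: DataType, op: String,
    rawValue: String): Option[String] =
  colType match {
    case _: IntegralType =>
      Try(rawValue.toLong).toOption.map(v => s"$col $op $v")
    case _ => None
  }
```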

Why are the changes needed?

To avoid unnecessary MetaExceptions.

SPARK-22384 expanded the types of filters that Shim_v0_13#convertFilters can handle to include filters that contain CAST expressions. This opened up the door for Spark to push down partition filters with mismatched datatypes.

Take this example: Spark passes the filter 'cast(b as string) = "2"' to convertFilters, where b is an integral column. The integral column b is extracted from the CAST expression, but the literal is left as-is, resulting in the following filter getting pushed down to the metastore:

b = "2"

Hive throws a MetaException complaining that an integer column is being compared to a string literal (with the very misleading message "Filtering is supported only on partition keys of type string").

Here are some examples that throw a MetaException:

sql("create table test (a int) partitioned by (b int) stored as parquet")
sql("insert into test values (1, 1), (1, 2), (2, 2)")

// These throw MetaExceptions
sql("select * from test where b in ('2')").show(false)
sql("select * from test where cast(b as string) = '2'").show(false)
sql("select * from test where cast(b as string) in ('2')").show(false)
sql("select * from test where cast(b as string) in (2)").show(false)
sql("select cast(b as string) as b from test where b in ('2')").show(false)
sql("select cast(b as string) as b from test").filter("b = '2'").show(false) // [1]
sql("select cast(b as string) as b from test").filter("b in (2)").show(false) // [2]
sql("select cast(b as string) as b from test").filter("b in ('2')").show(false)
sql("select * from test where cast(b as string) > '1'").show(false)
sql("select cast(b as string) b from test").filter("b > '1'").show(false) // [3]

// [1] but not sql("select cast(b as string) as b from test where b = '2'").show(false)
// [2] but not sql("select cast(b as string) as b from test where b in (2)").show(false)
// [3] but not sql("select cast(b as string) b from test where b > '1'").show(false)

In fact, all the failures I could find boil down to the following partition filter getting pushed down to the metastore:

<col-name-of-integral-column> <binary-comparison> "<string-literal>"
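For example, taking the earlier cast(b as string) = '2' query: today the metastore receives the ill-typed filter, while with the literal converted (assuming the conversion is lossless) it would receive a well-typed one:

b = "2"   (rejected by Hive: integral column vs. string literal)
b = 2     (well-typed, accepted)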

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests.

@SparkQA commented on Oct 30, 2020

Test build #130467 has finished for PR 30207 at commit b89a6aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets changed the title from "[SPARK-33098][SQL] Don't push down partition filter with mismatched datatypes to metastore" to "[SPARK-33098][SQL][WIP] Don't push down partition filter with mismatched datatypes to metastore" on Nov 1, 2020
@SparkQA commented on Nov 1, 2020

Test build #130502 has finished for PR 30207 at commit a186127.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets changed the title from "[SPARK-33098][SQL][WIP] Don't push down partition filter with mismatched datatypes to metastore" to "[SPARK-33098][SQL] Don't push down partition filter with mismatched datatypes to metastore" on Nov 2, 2020
@bersprockets changed the title from "[SPARK-33098][SQL] Don't push down partition filter with mismatched datatypes to metastore" to "[SPARK-33098][SQL] Avoid MetaException by not pushing down partition filters with incompatible types" on Nov 2, 2020
(diff excerpt this review comment is anchored to)

      if dt1.isInstanceOf[IntegralType] && dt2.isInstanceOf[StringType] =>
    fixValue(rawValue, dt1).map { value =>
      s"$name ${op.symbol} $value"
    }
@bersprockets (Contributor, Author):

I don't have an equivalent "attempt to correct" for In and InSet, just for binary comparisons. For In and InSet, if the data types are not compatible, I just drop the filter (which is what would have happened before SPARK-22384).
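A minimal sketch of that asymmetry (hypothetical names; not the PR's actual code):

```scala
import org.apache.spark.sql.types.DataType

// For In/InSet there is no attempt to convert the literals: the
// predicate is pushed down only when the value type already matches
// the partition column type; otherwise it is dropped (None) and Spark
// evaluates it after listing partitions, as it did before SPARK-22384.
def inFilterIfCompatible(colType: DataType, valueType: DataType,
    rendered: String): Option[String] =
  if (colType == valueType) Some(rendered) else None
```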

@dongjoon-hyun (Member) commented:

Thank you, @bersprockets.

cc @sunchao

@sunchao (Member) left a comment:

Thanks @dongjoon-hyun for pinging. On a high level, if we are going to optimize away the cast in a scenario like:

cast(b as string) <op> string_literal

where b is an integral column, perhaps we should do it in UnwrapCastInBinaryComparison, so that it can be used not only by Hive but also by other data sources.
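For illustration, the kind of rewrite being suggested (the queries mirror the reproduction above; the optimized form assumes the literal converts losslessly to the column's type):

```scala
// With b an integral partition column, the optimizer-level rule would
// rewrite the cast-wrapped comparison...
sql("select * from test where cast(b as string) = '2'")
// ...so that it behaves as if the user had written:
sql("select * from test where b = 2")
```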

Also @bersprockets, can you improve the PR description? Let's not put "Why are the changes needed?" content under "What changes were proposed in this pull request?".

(diff excerpt this review thread is anchored to)

  ExtractableLiteral(value), ExtractAttribute(SupportedAttribute(name))) =>
  ExtractAttribute(SupportedAttribute(name), dt1), ExtractableLiteral(rawValue, dt2))
      if dt1.isInstanceOf[IntegralType] && dt2.isInstanceOf[StringType] =>
    fixValue(rawValue, dt1).map { value =>
@sunchao (Member):
Hmm, will this change semantics? Suppose we have cast(b as string) < '012' where b is 11. Before the conversion this will evaluate to false, but after it will evaluate to true.
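Spelled out with plain Scala comparisons:

```scala
// Lexicographic string comparison: '1' > '0', so "11" < "012" is false.
"11" < "012"   // false
// After unwrapping the cast the comparison is numeric, and it flips:
11 < 12        // true
```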

@bersprockets (Contributor, Author):

Yes, it should probably ignore any literal strings with leading zeros.
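One plausible shape for such a guard in fixValue (a sketch, not the PR's actual implementation): accept only literals that round-trip through the integral type unchanged.

```scala
import scala.util.Try

// Rejects leading zeros ("012"), an explicit plus sign ("+2"), and
// non-numeric strings, since unwrapping those would change semantics.
def fixValue(raw: String): Option[Long] =
  Try(raw.toLong).toOption.filter(_.toString == raw)

fixValue("12")   // Some(12)
fixValue("012")  // None: would flip comparisons like < '012'
fixValue("2x")   // None
```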

@bersprockets (Contributor, Author):

> perhaps we should do it in UnwrapCastInBinaryComparison, so that it can be used not only by Hive but also by other data sources.

Whatever makes sense. There is some long-running work on TypeCoercion (#22038) that fixes a few of these cases. But if that goes through and we can close the gap on the others, that would be fine. I am probably not in a position to provide much help in the optimizer code at this point.
