
Conversation

@bersprockets (Contributor) commented on Oct 30, 2020

What changes were proposed in this pull request?

This PR checks that the type of the extracted partition column is compatible with the type of the literal it is compared against. If the types differ, it attempts to convert the literal to the column's type. If that conversion fails, the binary comparison is left out of the filter expression pushed down to the metastore.
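As a rough sketch of the approach (illustrative only; toPushdownFilter is a hypothetical name, not the PR's actual code):

```scala
import scala.util.Try
import org.apache.spark.sql.types.{DataType, IntegralType}

// When the extracted column is integral but the literal is a string,
// try converting the literal to a number. If the conversion fails,
// return None so the predicate is never pushed to the metastore;
// Spark still evaluates it after listing the partitions.
def toPushdownFilter(col: String, colType: DataType, op: String,
    rawValue: String): Option[String] =
  colType match {
    case _: IntegralType =>
      Try(rawValue.toLong).toOption.map(v => s"$col $op $v")
    case _ => None
  }
```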

Why are the changes needed?

To avoid unnecessary MetaExceptions.

SPARK-22384 expanded the types of filters that Shim_v0_13#convertFilters can handle to include filters that contain CAST expressions. This opened up the door for Spark to push down partition filters with mismatched datatypes.

Take this example: Spark passes the filter 'cast(b as string) = "2"' to convertFilters, where b is an integral column. The integral column b is extracted from the CAST expression, but the literal is left as-is, resulting in the following filter getting pushed down to the metastore:

b = "2"

Hive throws a MetaException complaining that an integer column is being compared to a string literal (with the very misleading message "Filtering is supported only on partition keys of type string").

Here are some examples that throw a MetaException:

sql("create table test (a int) partitioned by (b int) stored as parquet")
sql("insert into test values (1, 1), (1, 2), (2, 2)")

// These throw MetaExceptions
sql("select * from test where b in ('2')").show(false)
sql("select * from test where cast(b as string) = '2'").show(false)
sql("select * from test where cast(b as string) in ('2')").show(false)
sql("select * from test where cast(b as string) in (2)").show(false)
sql("select cast(b as string) as b from test where b in ('2')").show(false)
sql("select cast(b as string) as b from test").filter("b = '2'").show(false) // [1]
sql("select cast(b as string) as b from test").filter("b in (2)").show(false) // [2]
sql("select cast(b as string) as b from test").filter("b in ('2')").show(false)
sql("select * from test where cast(b as string) > '1'").show(false)
sql("select cast(b as string) b from test").filter("b > '1'").show(false) // [3]

// [1] but not sql("select cast(b as string) as b from test where b = '2'").show(false)
// [2] but not sql("select cast(b as string) as b from test where b in (2)").show(false)
// [3] but not sql("select cast(b as string) b from test where b > '1'").show(false)

In fact, all the failures I could find boil down to the following partition filter getting pushed down to the metastore:

<col-name-of-integral-column> <binary-comparison> "<string-literal>"
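For example, taking the earlier cast(b as string) = '2' query: today the metastore receives the ill-typed filter, while with the literal converted (assuming the conversion is lossless) it would receive a well-typed one:

b = "2"   (rejected by Hive: integral column vs. string literal)
b = 2     (well-typed, accepted)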

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests.

@SparkQA commented on Oct 30, 2020

Test build #130467 has finished for PR 30207 at commit b89a6aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets changed the title from "[SPARK-33098][SQL] Don't push down partition filter with mismatched datatypes to metastore" to "[SPARK-33098][SQL][WIP] Don't push down partition filter with mismatched datatypes to metastore" on Nov 1, 2020
@SparkQA commented on Nov 1, 2020

Test build #130502 has finished for PR 30207 at commit a186127.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets changed the title from "[SPARK-33098][SQL][WIP] Don't push down partition filter with mismatched datatypes to metastore" to "[SPARK-33098][SQL] Don't push down partition filter with mismatched datatypes to metastore" on Nov 2, 2020
@bersprockets changed the title from "[SPARK-33098][SQL] Don't push down partition filter with mismatched datatypes to metastore" to "[SPARK-33098][SQL] Avoid MetaException by not pushing down partition filters with incompatible types" on Nov 2, 2020
(diff excerpt this review comment is anchored to)

      if dt1.isInstanceOf[IntegralType] && dt2.isInstanceOf[StringType] =>
    fixValue(rawValue, dt1).map { value =>
      s"$name ${op.symbol} $value"
    }
@bersprockets (Contributor, Author):

I don't have an equivalent "attempt to correct" for In and InSet, just for binary comparisons. For In and InSet, if the data types are not compatible, I just drop the filter (which is what would have happened before SPARK-22384).
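A minimal sketch of that asymmetry (hypothetical names; not the PR's actual code):

```scala
import org.apache.spark.sql.types.DataType

// For In/InSet there is no attempt to convert the literals: the
// predicate is pushed down only when the value type already matches
// the partition column type; otherwise it is dropped (None) and Spark
// evaluates it after listing partitions, as it did before SPARK-22384.
def inFilterIfCompatible(colType: DataType, valueType: DataType,
    rendered: String): Option[String] =
  if (colType == valueType) Some(rendered) else None
```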

@dongjoon-hyun (Member) commented:

Thank you, @bersprockets.

cc @sunchao

@sunchao (Member) left a comment:

Thanks @dongjoon-hyun for pinging. On a high level, if we are going to optimize away the cast in a scenario like:

cast(b as string) <op> string_literal

where b is an integral column, perhaps we should do it in UnwrapCastInBinaryComparison, so that it can be used not only by Hive but also by other data sources.
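For illustration, the kind of rewrite being suggested (the queries mirror the reproduction above; the optimized form assumes the literal converts losslessly to the column's type):

```scala
// With b an integral partition column, the optimizer-level rule would
// rewrite the cast-wrapped comparison...
sql("select * from test where cast(b as string) = '2'")
// ...so that it behaves as if the user had written:
sql("select * from test where b = 2")
```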

Also @bersprockets, can you improve the PR description? Let's not put "Why are the changes needed?" content under "What changes were proposed in this pull request?".

(diff excerpt this review thread is anchored to)

  ExtractableLiteral(value), ExtractAttribute(SupportedAttribute(name))) =>
  ExtractAttribute(SupportedAttribute(name), dt1), ExtractableLiteral(rawValue, dt2))
      if dt1.isInstanceOf[IntegralType] && dt2.isInstanceOf[StringType] =>
    fixValue(rawValue, dt1).map { value =>
@sunchao (Member):
Hmm, will this change semantics? Suppose we have cast(b as string) < '012' where b is 11. Before the conversion this will evaluate to false, but after it will evaluate to true.
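Spelled out with plain Scala comparisons:

```scala
// Lexicographic string comparison: '1' > '0', so "11" < "012" is false.
"11" < "012"   // false
// After unwrapping the cast the comparison is numeric, and it flips:
11 < 12        // true
```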

@bersprockets (Contributor, Author):

Yes, it should probably ignore any literal strings with leading zeros.
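One plausible shape for such a guard in fixValue (a sketch, not the PR's actual implementation): accept only literals that round-trip through the integral type unchanged.

```scala
import scala.util.Try

// Rejects leading zeros ("012"), an explicit plus sign ("+2"), and
// non-numeric strings, since unwrapping those would change semantics.
def fixValue(raw: String): Option[Long] =
  Try(raw.toLong).toOption.filter(_.toString == raw)

fixValue("12")   // Some(12)
fixValue("012")  // None: would flip comparisons like < '012'
fixValue("2x")   // None
```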

@bersprockets (Contributor, Author):

> perhaps we should do it in UnwrapCastInBinaryComparison, so that it can be used not only by Hive but also by other data sources.

Whatever makes sense. There is some long-running work on TypeCoercion (#22038) that fixes a few of these cases. But if that goes through and we can close the gap on the others, that would be fine. I am probably not in a position to provide much help in the optimizer code at this point.
