[SPARK-27421][SQL] Fix filter for int column and value class java.lang.String when pruning partition column #30380

wangyum · 2020-11-15T15:03:12Z

What changes were proposed in this pull request?

This PR fixes the partition-pruning filter for the case where an int partition column is compared with a java.lang.String value.

How to reproduce this issue:

spark.sql("CREATE table test (name STRING) partitioned by (id int) STORED AS PARQUET")
spark.sql("CREATE VIEW test_view as select cast(id as string) as id, name from test")
spark.sql("SELECT * FROM test_view WHERE id = '0'").explain

20/11/15 06:19:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions_by_filter : db=default tbl=test
20/11/15 06:19:01 INFO MetaStoreDirectSql: Unable to push down SQL filter: Cannot push down filter for int column and value class java.lang.String
20/11/15 06:19:01 ERROR SparkSQLDriver: Failed in [SELECT * FROM test_view WHERE id = '0']
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
 at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
 at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:745)
 at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
 at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
 at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
 at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
 at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:743)
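
Until a fixed build is available, the error message above itself suggests a workaround. A minimal sketch of applying it, assuming a Hive-enabled session built in application code (the object name and `appName` value are made up for illustration, and as the message warns, this setting is expected to degrade performance):

```scala
import org.apache.spark.sql.SparkSession

object Spark27421Workaround {
  def main(args: Array[String]): Unit = {
    // Assumption: the session is created here rather than by spark-shell.
    val spark = SparkSession.builder()
      .appName("spark-27421-workaround") // hypothetical name
      .enableHiveSupport()
      // Setting quoted verbatim from the error message above.
      .config("spark.sql.hive.manageFilesourcePartitions", "false")
      .getOrCreate()

    // The failing query from the reproduction steps.
    spark.sql("SELECT * FROM test_view WHERE id = '0'").explain()
    spark.stop()
  }
}
```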

Why are the changes needed?

Fix bug.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

SparkQA · 2020-11-15T15:45:49Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35712/

SparkQA · 2020-11-15T16:07:11Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35712/

SparkQA · 2020-11-15T16:57:16Z

Test build #131109 has finished for PR 30380 at commit 585250f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-11-15T19:15:52Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala

+  test("getPartitionsByFilter: chunk in ('ab', 'ba') and ((cast(ds as string)>'20170102')") {
+    val day = (20170101 to 20170103, 0 to 4, Seq("ab", "ba"))
+    testMetastorePartitionFiltering(
+      attr("chunk").in("ab", "ba") && (attr("ds").cast(StringType) > "20170102"),


What happens for 20170102.1234?

The result is the same, because we do not prune on that predicate:

diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala
index 7e10d49..6b976d9 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala
@@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.catalog._
 import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.types.{BooleanType, IntegerType, LongType, StructType}
+import org.apache.spark.sql.types.{BooleanType, IntegerType, LongType, StringType, StructType}
 import org.apache.spark.util.Utils
 
 class HivePartitionFilteringSuite(version: String)
@@ -272,6 +272,15 @@ class HivePartitionFilteringSuite(version: String)
       day1 :: day2 :: Nil)
   }
+
+  test("getPartitionsByFilter: chunk in ('ab', 'ba') and " +
+    "((cast(ds as string)='20170101') or (cast(ds as string)='20170102'))") {
+    val day = (20170101 to 20170103, 0 to 4, Seq("ab", "ba"))
+    testMetastorePartitionFiltering(attr("chunk").in("ab", "ba") &&
+      (attr("ds").cast(StringType) > "20170102.1234"),
+      day :: Nil)
+  }
+
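
To make the point about `20170102.1234` concrete, here is a small self-contained sketch in plain Scala (no Spark involved; the object and method names are hypothetical). It assumes that `cast(ds as string) > '...'` compares the values as strings, and shows that such a comparison orders values lexicographically. That is one reason the predicate cannot simply be rewritten as a filter on the int column and handed to the metastore:

```scala
object StringVsIntComparison {
  // The ds partition values used in the test above.
  val ds: Seq[Int] = 20170101 to 20170103

  // Emulates cast(ds as string) > threshold using lexicographic comparison.
  def matchesAsString(threshold: String): Seq[Int] =
    ds.filter(d => d.toString.compareTo(threshold) > 0)

  def main(args: Array[String]): Unit = {
    // "20170102" is a strict prefix of "20170102.1234", so it sorts before it;
    // only 20170103 is greater under string ordering.
    println(matchesAsString("20170102.1234"))

    // String and numeric ordering can disagree outright.
    println("9".compareTo("10") > 0) // true: '9' sorts after '1'
    println(9 > 10)                  // false
  }
}
```

Since the threshold is not even a valid int, the sketch is consistent with the test expecting every `ds` partition back from the metastore and leaving this predicate to be evaluated afterwards.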

dongjoon-hyun · 2020-11-15T19:23:54Z

cc @cloud-fan

dongjoon-hyun · 2020-11-15T19:51:49Z

BTW, is this a subset of the existing PR from @bersprockets ?

[SPARK-33098][SQL] Avoid MetaException by not pushing down partition filters with incompatible types #30207

cc @sunchao

cloud-fan · 2020-11-16T07:57:23Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala

      def unapply(expr: Expression): Option[Attribute] = {
        expr match {
          case attr: Attribute => Some(attr)
+          case Cast(IntegralType(), StringType, _) => None


Good catch! I wonder whether we should be more conservative here. How about:

case Cast(child @ IntegralType(), dt: IntegralType, _) => if Cast.canUpCast...
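
For readers following along, the idea behind the suggestion can be sketched with a toy model. Everything below is an assumption made for illustration: the types are simplified stand-ins and not Spark's Catalyst classes, and the bit-width check is a hypothetical substitute for the `Cast.canUpCast` test that the truncated snippet refers to. The extractor unwraps a cast only when it widens one integral type to another, and returns `None` otherwise, so a predicate on `cast(id as string)` is not pushed down:

```scala
// Toy model only: these are NOT Spark's Catalyst types.
sealed trait DataType
sealed abstract class IntegralType(val bits: Int) extends DataType
case object ByteType extends IntegralType(8)
case object ShortType extends IntegralType(16)
case object IntegerType extends IntegralType(32)
case object LongType extends IntegralType(64)
case object StringType extends DataType

sealed trait Expression { def dataType: DataType }
final case class Attribute(name: String, dataType: DataType) extends Expression
final case class Cast(child: Expression, dataType: DataType) extends Expression

object ExtractAttribute {
  // Returns the underlying attribute only when it is safe to ignore the cast.
  def unapply(expr: Expression): Option[Attribute] = expr match {
    case attr: Attribute => Some(attr)
    case Cast(child, to: IntegralType) =>
      child.dataType match {
        // Widening integral cast: comparison semantics are preserved.
        case from: IntegralType if from.bits <= to.bits => unapply(child)
        case _ => None
      }
    // Anything else, including cast(id as string): do not push down.
    case _ => None
  }
}

object ExtractAttributeDemo {
  def main(args: Array[String]): Unit = {
    val id = Attribute("id", IntegerType)
    println(ExtractAttribute.unapply(Cast(id, LongType)))   // Some(Attribute(id,IntegerType))
    println(ExtractAttribute.unapply(Cast(id, StringType))) // None
    println(ExtractAttribute.unapply(Cast(id, ShortType)))  // None (narrowing)
  }
}
```

Compared with special-casing only the integral-to-string cast, this allow-list style rejects every cast that has not been explicitly judged safe, which is the more conservative behaviour the comment asks for.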

SparkQA · 2020-11-16T15:08:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35767/

SparkQA · 2020-11-16T15:38:51Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35767/

SparkQA · 2020-11-16T16:40:14Z

Test build #131164 has finished for PR 30380 at commit 9e796ab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan

LGTM. How far do we need to backport?

wangyum · 2020-11-17T14:36:57Z

I think we need to backport to branch-2.4.

…g.String when pruning partition column
Closes #30380 from wangyum/SPARK-27421. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit 014e1fb) Signed-off-by: Yuming Wang <[email protected]>

HyukjinKwon · 2020-11-19T06:13:05Z

@wangyum, can you resolve the JIRA and comment here on which branches you merged into?

wangyum · 2020-11-19T06:14:25Z

Merged to master and branch-3.0.

…a.lang.String when pruning partition column
This PR backports #30380 to branch-2.4.
Closes #30422 from wangyum/SPARK-27421-2.4. Authored-by: Yuming Wang <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

RuntimeException when querying a view on a partitioned parquet table

585250f

github-actions bot added the SQL label Nov 15, 2020

dongjoon-hyun reviewed Nov 15, 2020

cloud-fan reviewed Nov 16, 2020

wangyum added 2 commits November 16, 2020 22:11

Merge remote-tracking branch 'upstream/master' into SPARK-27421

eb95783

# Conflicts: # sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala

Address comments

9e796ab

cloud-fan approved these changes Nov 17, 2020

HyukjinKwon approved these changes Nov 18, 2020

wangyum closed this in 014e1fb Nov 19, 2020

wangyum mentioned this pull request Nov 19, 2020

[SPARK-27421][SQL][2.4] Fix filter for int column and value class java.lang.String when pruning partition column #30422

Closed

wangyum deleted the SPARK-27421 branch February 1, 2021 05:14

leejaywei added a commit to Kyligence/spark that referenced this pull request Oct 20, 2021

KE-31090 Fix filter for int column and value class java.lang.String w…

159b551

…hen pruning partition column refer to apache#30380
