[SPARK-26293][SQL][2.4] Cast exception when having python udf in subquery #27960

HyukjinKwon · 2020-03-19T15:01:09Z

What changes were proposed in this pull request?

This PR backports #23248, which appears to have been mistakenly left out of the backports.

This is a regression introduced by #22104 in Spark 2.4.0.

When a Python UDF is used in a subquery, we hit the following exception:

Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.AttributeReference cannot be cast to org.apache.spark.sql.catalyst.expressions.PythonUDF
	at scala.collection.immutable.Stream.map(Stream.scala:414)
	at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:98)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:815)
...

#22104 turned ExtractPythonUDFs from a physical rule into an optimizer rule. However, the two kinds of rule differ: a physical rule always runs exactly once, whereas an optimizer rule may be applied twice to the same query tree, even if the rule is located in a batch that runs only once.

For a subquery, the OptimizeSubqueries rule executes the entire optimizer on the query plan inside the subquery. Later, the subquery is turned into a join, and the optimizer rules are applied to that plan again.

Unfortunately, the ExtractPythonUDFs rule is not idempotent. When it is applied twice to a query plan inside a subquery, it produces a malformed plan, because the second application extracts Python UDFs from the Python exec plans themselves.
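To make the double application concrete, here is a toy model in Python. It is not Spark code: the dictionary-based plan, `run_once_rule`, and `optimize` are hypothetical stand-ins that only mimic the order of events described above.

```python
from collections import Counter

# Counts how many times the run-once rule visits each plan node.
applications = Counter()


def run_once_rule(plan):
    """Stand-in for a rule in a run-once batch: it visits every node in the tree."""
    applications[plan["name"]] += 1
    for child in plan.get("children", []):
        run_once_rule(child)
    return plan


def optimize(plan):
    """Toy optimizer: optimize the subquery, rewrite it to a join, then run the rule."""
    if "subquery" in plan:
        # Like OptimizeSubqueries: the whole optimizer runs on the inner plan.
        inner = optimize(plan["subquery"])
        outer = {k: v for k, v in plan.items() if k != "subquery"}
        # Later the subquery becomes a join, so the inner plan joins the main tree.
        plan = {"name": "join", "children": [outer, inner]}
    # The run-once rule now sees the whole tree, including the inner plan again.
    return run_once_rule(plan)


query = {"name": "outer", "subquery": {"name": "inner"}}
optimized = optimize(query)
```

In this sketch the node named `inner` is visited twice and `outer` once, even though `run_once_rule` is invoked only once per `optimize` call.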

This PR proposes two changes, to be doubly safe:

1. ExtractPythonUDFs should skip Python exec plans, to make the rule idempotent.
2. ExtractPythonUDFs should skip subqueries.
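The effect of the first change can be sketched with a toy plan tree in Python. This is an illustration under stated assumptions, not the Catalyst implementation: `Udf`, `Attr`, `Node`, `extract_udfs`, and the `skip_python_exec` flag are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Udf:
    name: str


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass
class Node:
    kind: str                      # "Project", "EvalPython" or "Scan" in this toy
    exprs: List[object]            # expressions held by the node
    child: Optional["Node"] = None


def extract_udfs(plan, skip_python_exec):
    """Toy extraction rule: move UDFs into an EvalPython node below their owner."""
    if plan is None:
        return None
    child = extract_udfs(plan.child, skip_python_exec)
    if skip_python_exec and plan.kind == "EvalPython":
        # The guard: never extract from a node that exists to evaluate UDFs.
        return Node(plan.kind, plan.exprs, child)
    udfs = [e for e in plan.exprs if isinstance(e, Udf)]
    if not udfs:
        return Node(plan.kind, plan.exprs, child)
    # Replace each UDF with a reference to the output of the new EvalPython node.
    rewritten = [Attr(e.name + "_out") if isinstance(e, Udf) else e for e in plan.exprs]
    return Node(plan.kind, rewritten, Node("EvalPython", udfs, child))


plan = Node("Project", [Udf("f")], Node("Scan", []))

once = extract_udfs(plan, skip_python_exec=False)
twice_unguarded = extract_udfs(once, skip_python_exec=False)

once_guarded = extract_udfs(plan, skip_python_exec=True)
twice_guarded = extract_udfs(once_guarded, skip_python_exec=True)
```

Without the guard, the second pass leaves an `EvalPython` node holding an `Attr` where a `Udf` is expected, which is the toy analogue of the `AttributeReference cannot be cast to PythonUDF` failure. With the guard, the second pass returns a plan equal to the first.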

How was this patch tested?

A new test was added.

Closes apache#23248 from cloud-fan/python.

Authored-by: Wenchen Fan
Signed-off-by: Wenchen Fan

HyukjinKwon · 2020-03-19T15:04:46Z

cc @cloud-fan and @tgravescs

cloud-fan · 2020-03-19T15:08:05Z

I do remember I backported it as I manually fixed some conflicts, but ...

Maybe some network problems happened but I didn't notice. Anyway thanks for doing it!

tgravescs · 2020-03-19T15:51:28Z

Thanks @HyukjinKwon, looks like a clean merge other than the test change. LGTM pending Jenkins.

SparkQA · 2020-03-19T17:58:58Z

Test build #120060 has finished for PR 27960 at commit 7a916ac.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArrowEvalPython(
case class BatchEvalPython(

dongjoon-hyun · 2020-03-19T18:15:51Z

The failure is a relevant one: test_udf_in_subquery. Could you take a look, @HyukjinKwon?

======================================================================
ERROR: test_udf_in_subquery (pyspark.sql.tests.SQLTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests.py", line 3581, in test_udf_in_subquery
    with self.tempView("v"):
AttributeError: 'SQLTests' object has no attribute 'tempView'
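The traceback shows that the `tempView` helper used by the original test is not available on `SQLTests` in this branch, so the view has to be cleaned up by hand. A minimal sketch of one way to do that follows. The helper name `temp_view` is hypothetical and this is not necessarily what the follow-up commit does; the only PySpark call it relies on is `spark.catalog.dropTempView`.

```python
from contextlib import contextmanager


@contextmanager
def temp_view(spark, *names):
    """Drop the named temporary views when the block exits, even if it raises."""
    try:
        yield
    finally:
        for name in names:
            # Catalog.dropTempView is part of the public PySpark API.
            spark.catalog.dropTempView(name)
```

Inside a test this would be used as `with temp_view(self.spark, "v"):`, or the same `try`/`finally` can be written inline around the test body.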

HyukjinKwon · 2020-03-19T23:58:12Z

Sure, will do.

SparkQA · 2020-03-20T03:14:48Z

Test build #120075 has finished for PR 27960 at commit 423644e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-03-20T03:16:22Z

Merged to branch-2.4.

Closes #27960 from HyukjinKwon/backport-SPARK-26293.

Lead-authored-by: Wenchen Fan
Co-authored-by: HyukjinKwon
Signed-off-by: HyukjinKwon

dongjoon-hyun · 2020-03-20T03:18:17Z

Thank you, @HyukjinKwon and @cloud-fan .

HyukjinKwon force-pushed the backport-SPARK-26293 branch from b1d41b7 to 7a916ac Compare March 19, 2020 15:01

HyukjinKwon mentioned this pull request Mar 19, 2020

[SPARK-26293][SQL] Cast exception when having python udf in subquery #23248

Closed

cloud-fan approved these changes Mar 19, 2020

dongjoon-hyun added SQL PYSPARK labels Mar 19, 2020

Manually drop tempview

423644e

HyukjinKwon closed this Mar 20, 2020

HyukjinKwon deleted the backport-SPARK-26293 branch July 27, 2020 07:45
