[SPARK-26293][SQL] Cast exception when having python udf in subquery #23248

cloud-fan · 2018-12-06T12:09:52Z

What changes were proposed in this pull request?

This is a regression introduced by #22104 in Spark 2.4.0.

When we have a Python UDF in a subquery, we hit an exception:

Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.AttributeReference cannot be cast to org.apache.spark.sql.catalyst.expressions.PythonUDF
	at scala.collection.immutable.Stream.map(Stream.scala:414)
	at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:98)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:815)
...

#22104 turned ExtractPythonUDFs from a physical rule into an optimizer rule. However, the two kinds of rule differ: a physical rule always runs exactly once, whereas an optimizer rule may be applied twice to a query tree even if the rule is located in a batch that only runs once.

For a subquery, the OptimizeSubqueries rule executes the entire optimizer on the query plan inside the subquery. Later on, the subquery is turned into a join, and the optimizer rules are applied to it again.

Unfortunately, the ExtractPythonUDFs rule is not idempotent. When it is applied twice to a query plan inside a subquery, it produces a malformed plan, because on the second pass it extracts Python UDFs from the Python exec plans themselves.

This PR proposes two changes, to be doubly safe:

1. ExtractPythonUDFs should skip Python exec plans, to make the rule idempotent
2. ExtractPythonUDFs should skip subqueries
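
To illustrate why skipping the eval nodes makes the rule idempotent, here is a minimal toy sketch. It does not use Spark's Catalyst classes: `Expr`, `Plan`, `EvalUdf`, `extract`, and `wellFormed` are hypothetical names invented for this example, and the real rule is considerably more involved. The flag `skipEvalNodes` is only an assumption used to contrast the behavior before and after the first change.

```scala
// Toy model only: none of these types are Spark's Catalyst classes.
object ExtractDemo {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Udf(name: String, arg: Expr) extends Expr

  sealed trait Plan
  final case class Scan(table: String) extends Plan
  final case class Project(exprs: List[Expr], child: Plan) extends Plan
  // Stand-in for a Python eval node: `udfs` is expected to hold only Udf values.
  final case class EvalUdf(udfs: List[Expr], child: Plan) extends Plan

  private def outputOf(u: Udf): Attr = Attr("out_" + u.name)

  // Move top-level UDFs in `exprs` into a new EvalUdf node on top of `child`,
  // replacing each UDF with a reference to its output attribute.
  private def pull(exprs: List[Expr], child: Plan): (List[Expr], Plan) = {
    val udfs = exprs.collect { case u: Udf => u }
    if (udfs.isEmpty) (exprs, child)
    else {
      val rewritten = exprs.map {
        case u: Udf => outputOf(u)
        case other  => other
      }
      (rewritten, EvalUdf(udfs, child))
    }
  }

  def extract(plan: Plan, skipEvalNodes: Boolean): Plan = plan match {
    case s: Scan => s
    case Project(exprs, child) =>
      val (rewritten, newChild) = pull(exprs, extract(child, skipEvalNodes))
      Project(rewritten, newChild)
    // With the fix: leave the eval node's own UDF list alone.
    case EvalUdf(udfs, child) if skipEvalNodes =>
      EvalUdf(udfs, extract(child, skipEvalNodes))
    // Without the fix: the eval node's UDF list is treated like ordinary
    // expressions, so a second pass replaces its UDFs with attributes.
    case EvalUdf(udfs, child) =>
      val (rewritten, newChild) = pull(udfs, extract(child, skipEvalNodes))
      EvalUdf(rewritten, newChild)
  }

  // A plan is well formed if every eval node still holds only UDFs.
  def wellFormed(plan: Plan): Boolean = plan match {
    case Scan(_)              => true
    case Project(_, child)    => wellFormed(child)
    case EvalUdf(udfs, child) => udfs.forall(_.isInstanceOf[Udf]) && wellFormed(child)
  }
}
```

In this toy model, applying `extract(_, skipEvalNodes = false)` twice to `Project(List(Udf("f", Attr("a"))), Scan("t"))` leaves an `EvalUdf` whose list contains an `Attr` instead of a `Udf`, which mirrors the attribute-cannot-be-cast-to-UDF failure above. With `skipEvalNodes = true`, the second application returns the plan unchanged.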

How was this patch tested?

A new test.

cloud-fan · 2018-12-06T12:11:43Z

python/pyspark/sql/tests/test_udf.py

add the import here, as a lot of tests use it

Ah, yea. It's okay and I think it's good timing to clean up while we are here, and while it's broken down into multiple test files now.

cloud-fan · 2018-12-06T12:12:33Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala

a different but related fix, so that missingAttributes is calculated correctly.

cloud-fan · 2018-12-06T12:14:59Z

cc @icexelloss @HyukjinKwon @ueshin @viirya @gatorsmile

SparkQA · 2018-12-06T12:17:19Z

Test build #99765 has finished for PR 23248 at commit 9477fb0.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArrowEvalPython(
case class BatchEvalPython(

HyukjinKwon · 2018-12-06T12:54:10Z

Thanks, @cloud-fan. I will take a look by tomorrow, but don't block on me.

SparkQA · 2018-12-06T14:30:04Z

Test build #99767 has finished for PR 23248 at commit d28089f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArrowEvalPython(
case class BatchEvalPython(

cloud-fan · 2018-12-06T15:28:13Z

retest this please

icexelloss · 2018-12-06T18:35:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

+  def apply(plan: LogicalPlan): LogicalPlan = plan match {
+    // SPARK-26293: A subquery will be rewritten into join later, and will go through this rule
+    // eventually. Here we skip subquery, as Python UDF only needs to be extracted once.
+    case _: Subquery => plan


Personally I found it a bit confusing when two seemingly unrelated things are put together (Subquery and ExtractPythonUDFs).

I wonder if it's sufficient to make ExtractPythonUDFs idempotent?

I agree it's a bit confusing, but that's how Subquery is designed to work. See how RemoveRedundantAliases catches Subquery.

It's sufficient to make ExtractPythonUDFs idempotent. Skipping Subquery is just to be doubly safe, and may give a small perf improvement, since this rule will run less often.

In general, I think we should skip Subquery here. This is why we create Subquery: we expect rules that don't want to be executed on subquery to skip it. I'll check more rules and see if they need to skip Subquery later.

I see. If it's common to skip Subquery in other rules, I guess it's ok to put it in here as well. But it would definitely be helpful to establish some kind of guidance, maybe sth like "All optimizer rules should skip Subquery because OptimizeSubqueries will execute them anyway"?

I think you have a point here. If the subquery will be converted to a join, why do we need to optimize the subquery ahead of time?

Anyway, that's something we need to discuss later. cc @dilipbiswal for the subquery question.

I'm not sure if it is totally ok to skip Subquery for all optimizer rules.

For ExtractPythonUDFs I think it is ok because ExtractPythonUDFs runs after the rules in RewriteSubquery. So we can skip ExtractPythonUDFs here and extract Python UDFs after the subqueries are rewritten into joins.

But for the rules which run before RewriteSubquery, if we skip them on Subquery, we have no chance to apply them after the subqueries are rewritten into joins.

Basically, we want to ensure this rule is running once and only once. In the future, if we have another rule/function that calls Optimizer.this.execute(plan), this rule needs to be fixed again... We have a very strong hidden assumption in the implementation. This looks risky in the long term.

The current fix is fine for backporting to 2.4.
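
The "once and only once" concern in this thread can be made concrete with a small toy sketch. It is not Spark code: `Plan`, `Relation`, `Subquery`, `CountingRule`, and `optimizeTwice` are hypothetical stand-ins, and the assumption (taken from the discussion above) is that a subquery body is first optimized under a `Subquery` root and then optimized again after being rewritten into a join. The join rewrite itself is not modelled.

```scala
// Toy model only: these are not Spark's Catalyst classes.
object SubquerySkipDemo {
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  // Marker root placed on top of a subquery body while it is optimized.
  final case class Subquery(child: Plan) extends Plan

  // A rule that counts how often it actually does work.
  final class CountingRule(skipSubquery: Boolean) {
    var applied: Int = 0

    def apply(plan: Plan): Plan = plan match {
      // Same shape as the guard in the diff: bail out on a Subquery root.
      case _: Subquery if skipSubquery => plan
      case other =>
        applied += 1
        other
    }
  }

  // First pass: the body is optimized under a Subquery root.
  // Second pass: the same body is optimized again as part of the outer plan.
  def optimizeTwice(rule: CountingRule, body: Plan): Plan = {
    val afterFirstPass = rule(Subquery(body)) match {
      case Subquery(child) => child
      case other           => other
    }
    rule(afterFirstPass)
  }
}
```

With `new CountingRule(skipSubquery = true)`, `optimizeTwice` leaves `applied` at 1; with `skipSubquery = false` it ends at 2. The guard only works because the rule is known to run again after the rewrite, which is the hidden assumption raised above.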

SparkQA · 2018-12-06T19:02:24Z

Test build #99773 has finished for PR 23248 at commit d28089f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArrowEvalPython(
case class BatchEvalPython(

gatorsmile · 2018-12-10T04:02:22Z

LGTM to the surgical fix for backporting.

We need to fix this rule, along with the other rules, to avoid making such a strong and hidden assumption.

cloud-fan · 2018-12-10T06:00:12Z

If it's fine for 2.4, I think it's also fine for master as a temporary fix? We can create another ticket to clean up the subquery optimization hack. IIUC #23211 may help with it.

AdolphKK · 2018-12-10T15:35:15Z

looks good to me, +1 👍

cloud-fan · 2018-12-11T06:22:35Z

thanks, merging to master/2.4!

HyukjinKwon

late LGTM as well

HyukjinKwon · 2018-12-11T08:32:02Z

BTW, @cloud-fan, I think there's going to be a considerable conflict against branch-2.4 ... If the conflict is considerable, it might be better to open a PR.

cloud-fan · 2018-12-11T08:36:32Z

@HyukjinKwon the conflict is only the test. I just moved the test (without those cleanups) to the giant tests.py in 2.4.

HyukjinKwon · 2018-12-11T08:40:40Z

Ah, sounds good!

## What changes were proposed in this pull request?

This is a regression introduced by apache#22104 at Spark 2.4.0.

When we have Python UDF in subquery, we will hit an exception

```
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.AttributeReference cannot be cast to org.apache.spark.sql.catalyst.expressions.PythonUDF
	at scala.collection.immutable.Stream.map(Stream.scala:414)
	at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:98)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:815)
...
```

apache#22104 turned `ExtractPythonUDFs` from a physical rule to optimizer rule. However, there is a difference between a physical rule and optimizer rule. A physical rule always runs once, an optimizer rule may be applied twice on a query tree even the rule is located in a batch that only runs once.

For a subquery, the `OptimizeSubqueries` rule will execute the entire optimizer on the query plan inside subquery. Later on subquery will be turned to joins, and the optimizer rules will be applied to it again.

Unfortunately, the `ExtractPythonUDFs` rule is not idempotent. When it's applied twice on a query plan inside subquery, it will produce a malformed plan. It extracts Python UDF from Python exec plans.

This PR proposes 2 changes to be double safe:

1. `ExtractPythonUDFs` should skip python exec plans, to make the rule idempotent
2. `ExtractPythonUDFs` should skip subquery

## How was this patch tested?

a new test.

Closes apache#23248 from cloud-fan/python.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

tgravescs · 2020-03-19T13:32:00Z

@cloud-fan @HyukjinKwon did this go into Spark 2.4? I'm seeing this error in 2.4.5. Jira claims it went into 2.4.1 but I don't see a commit for it?

HyukjinKwon · 2020-03-19T14:27:31Z

Indeed, it seems it was not ported back. Let me open a PR to backport.

HyukjinKwon · 2020-03-19T15:04:56Z

Here #27960

…uery

## What changes were proposed in this pull request?

This PR backports #23248 which seems mistakenly not backported.

This is a regression introduced by #22104 at Spark 2.4.0.

When we have Python UDF in subquery, we will hit an exception

```
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.AttributeReference cannot be cast to org.apache.spark.sql.catalyst.expressions.PythonUDF
	at scala.collection.immutable.Stream.map(Stream.scala:414)
	at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:98)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:815)
...
```

#22104 turned `ExtractPythonUDFs` from a physical rule to optimizer rule. However, there is a difference between a physical rule and optimizer rule. A physical rule always runs once, an optimizer rule may be applied twice on a query tree even the rule is located in a batch that only runs once.

For a subquery, the `OptimizeSubqueries` rule will execute the entire optimizer on the query plan inside subquery. Later on subquery will be turned to joins, and the optimizer rules will be applied to it again.

Unfortunately, the `ExtractPythonUDFs` rule is not idempotent. When it's applied twice on a query plan inside subquery, it will produce a malformed plan. It extracts Python UDF from Python exec plans.

This PR proposes 2 changes to be double safe:

1. `ExtractPythonUDFs` should skip python exec plans, to make the rule idempotent
2. `ExtractPythonUDFs` should skip subquery

## How was this patch tested?

a new test.

Closes #27960 from HyukjinKwon/backport-SPARK-26293.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

python udf in subquery

d28089f

cloud-fan force-pushed the python branch from 9477fb0 to d28089f Compare December 6, 2018 12:47

asfgit closed this in 7d5f6e8 Dec 11, 2018

HyukjinKwon mentioned this pull request Mar 19, 2020

[SPARK-26293][SQL][2.4] Cast exception when having python udf in subquery #27960

Closed
