[SPARK-27803][SQL][PYTHON] Fix column pruning for Python UDF #24675

cloud-fan · 2019-05-22T07:35:57Z

What changes were proposed in this pull request?

In #22104, we create the python-eval nodes at the end of the optimization phase, which causes a problem.

After the main optimization batch, Filter and Project nodes are usually pushed to the bottom, near the scan node. However, if we extract Python UDFs from a Filter/Project and create a python-eval node under that Filter/Project, the new node sits between it and the scan node, which breaks column pruning and filter pushdown for the scan node.

There are some hacks in the ExtractPythonUDFs rule that duplicate the column pruning and filter pushdown logic. However, they have some bugs, as demonstrated in the new test case (only column pruning is broken). This PR removes the hacks and re-applies the column pruning and filter pushdown rules explicitly.

Before:

...
== Analyzed Logical Plan ==
a: bigint
Project [a#168L]
+- Filter dummyUDF(a#168L)
   +- Relation[a#168L,b#169L] parquet

== Optimized Logical Plan ==
Project [a#168L]
+- Project [a#168L, b#169L]
   +- Filter pythonUDF0#174: boolean
      +- BatchEvalPython [dummyUDF(a#168L)], [a#168L, b#169L, pythonUDF0#174]
         +- Relation[a#168L,b#169L] parquet

== Physical Plan ==
*(2) Project [a#168L]
+- *(2) Project [a#168L, b#169L]
   +- *(2) Filter pythonUDF0#174: boolean
      +- BatchEvalPython [dummyUDF(a#168L)], [a#168L, b#169L, pythonUDF0#174]
         +- *(1) FileScan parquet [a#168L,b#169L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/spark-798bae3c-a2..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint>

After:

...
== Analyzed Logical Plan ==
a: bigint
Project [a#168L]
+- Filter dummyUDF(a#168L)
   +- Relation[a#168L,b#169L] parquet

== Optimized Logical Plan ==
Project [a#168L]
+- Filter pythonUDF0#174: boolean
   +- BatchEvalPython [dummyUDF(a#168L)], [pythonUDF0#174]
      +- Project [a#168L]
         +- Relation[a#168L,b#169L] parquet

== Physical Plan ==
*(2) Project [a#168L]
+- *(2) Filter pythonUDF0#174: boolean
   +- BatchEvalPython [dummyUDF(a#168L)], [pythonUDF0#174]
      +- *(1) FileScan parquet [a#168L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/spark-9500cafb-78..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint>

How was this patch tested?

new test

cloud-fan · 2019-05-22T07:37:44Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala

To work with the ColumnPruning and PushDownPredicate rules, we must implement the references method correctly. resultAttrs are definitely not references.

If references only covers the attributes referenced in udfs, will output attributes from the child that the udfs don't refer to be pruned from BaseEvalPython?

Yes, and this is "column pruning".
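
To make the exchange above concrete, here is a minimal toy model in plain Python. It is not Spark code: `Udf`, `EvalNode` and `columns_to_scan` are hypothetical names invented for this sketch, and the pruning logic is a simplification of what a column-pruning rule does. It assumes the node's references are exactly the inputs of its UDFs, as the comment describes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Udf:
    """Toy stand-in for a Python UDF call: a name plus its input columns."""
    name: str
    inputs: Tuple[str, ...]


@dataclass(frozen=True)
class EvalNode:
    """Toy stand-in for a python-eval node sitting on top of a scan."""
    udfs: Tuple[Udf, ...]
    result_attrs: Tuple[str, ...]   # columns the node produces
    child_output: Tuple[str, ...]   # columns the scan below can provide

    @property
    def references(self) -> FrozenSet[str]:
        # Only the UDF inputs are references. result_attrs are produced here,
        # not read from the child, so they must not appear in this set.
        return frozenset(col for udf in self.udfs for col in udf.inputs)


def columns_to_scan(node: EvalNode, required_above: FrozenSet[str]) -> Tuple[str, ...]:
    """Child columns to keep: what the parents need plus what the node reads."""
    needed = (required_above | node.references) - frozenset(node.result_attrs)
    return tuple(col for col in node.child_output if col in needed)


node = EvalNode(
    udfs=(Udf("dummyUDF", ("a",)),),
    result_attrs=("pythonUDF0",),
    child_output=("a", "b"),
)
# The Filter above needs pythonUDF0 and the Project above needs a.
print(columns_to_scan(node, frozenset({"a", "pythonUDF0"})))
```

In this sketch column `b` drops out of the scan because nothing above the node and no UDF reads it, which matches the `ReadSchema: struct<a:bigint>` in the "After" plan.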

cloud-fan · 2019-05-22T07:39:14Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala

To work with the ColumnPruning rule, the python-eval node must be able to update its output dynamically when the child's output changes.
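
A rough illustration of why a stored output goes stale, again as a plain-Python toy rather than Spark code. `Scan`, `StaticEval` and `DynamicEval` are hypothetical classes; the "static" shape is only an assumption based on the "Before" plan, where the node lists the child's columns in its own argument list.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    output: Tuple[str, ...]


@dataclass(frozen=True)
class StaticEval:
    """Output is fixed at construction time (assumed pre-fix shape)."""
    child: Scan
    output: Tuple[str, ...]


@dataclass(frozen=True)
class DynamicEval:
    """Output is recomputed from the current child on every access."""
    child: Scan
    result_attrs: Tuple[str, ...]

    @property
    def output(self) -> Tuple[str, ...]:
        return self.child.output + self.result_attrs


wide, narrow = Scan(("a", "b")), Scan(("a",))

static = StaticEval(wide, wide.output + ("pythonUDF0",))
dynamic = DynamicEval(wide, ("pythonUDF0",))

# Simulate column pruning swapping in a narrower child.
static_pruned = replace(static, child=narrow)
dynamic_pruned = replace(dynamic, child=narrow)

print(static_pruned.output)   # still lists b, which the child no longer provides
print(dynamic_pruned.output)  # follows the pruned child
```

Deriving the output from the child means a pruning rule only has to replace the child; nothing else on the node needs to be rewritten.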

cloud-fan · 2019-05-22T07:40:50Z

cc @icexelloss @ueshin @HyukjinKwon @gatorsmile

HyukjinKwon · 2019-05-22T07:47:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala

+1 for moving out

I have to move them out because I need to access them in PushDownPredicate, which is in the catalyst module.

viirya · 2019-05-22T08:34:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala

Is ExtractPythonUDFs newly added to nonExcludableRules? Is that also part of the fix, or should it simply have been there already?

It should be there. We could do it in another PR, but since I'm touching this file, I just fixed it here.

Looks good. Just out of curiosity.

SparkQA · 2019-05-22T08:55:18Z

Test build #105671 has finished for PR 24675 at commit bbc085d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait BaseEvalPython extends UnaryNode
case class BatchEvalPython(
case class ArrowEvalPython(
case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)
case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)
abstract class EvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)

cloud-fan · 2019-05-22T09:23:47Z

retest this please

dongjoon-hyun · 2019-05-22T10:13:36Z

Retest this please.

SparkQA · 2019-05-22T10:43:53Z

Test build #105678 has finished for PR 24675 at commit bbc085d.

This patch fails Spark unit tests.
This patch merges cleanly.

cloud-fan · 2019-05-22T12:44:26Z

retest this please

SparkQA · 2019-05-22T13:15:27Z

Test build #105684 has finished for PR 24675 at commit bbc085d.

This patch passes all tests.
This patch merges cleanly.

SparkQA · 2019-05-22T15:49:57Z

Test build #105692 has finished for PR 24675 at commit 636b603.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait BaseEvalPython extends UnaryNode
case class BatchEvalPython(
case class ArrowEvalPython(
case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)
case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)
abstract class EvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)

icexelloss · 2019-05-22T18:20:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

    case _: Repartition => true
    case _: ScriptTransformation => true
    case _: Sort => true
+    case _: BatchEvalPython => true


For my benefit, would you mind explaining what canPushThrough defines? Are these the nodes that a projection and/or filter can be pushed through?

This defines the nodes that we can push filters through.
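
As a rough sketch of what "pushing a filter through" a node involves, the toy below splits a list of predicates into those that can move below a node and those that must stay above it. It is plain Python, not the Catalyst rule: `Predicate` and `split_pushable` are hypothetical names, and the condition used (deterministic, and reading only columns the child provides) is a simplified assumption about how such a rule decides.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Predicate:
    """Toy predicate: a label, the columns it reads, and a determinism flag."""
    text: str
    references: FrozenSet[str]
    deterministic: bool = True


def split_pushable(
    predicates: List[Predicate], child_output: FrozenSet[str]
) -> Tuple[List[Predicate], List[Predicate]]:
    """Split predicates into (pushed below the node, kept above the node)."""
    pushed, kept = [], []
    for pred in predicates:
        if pred.deterministic and pred.references <= child_output:
            pushed.append(pred)
        else:
            kept.append(pred)
    return pushed, kept


child_output = frozenset({"a", "b"})
filters = [
    Predicate("a > 1", frozenset({"a"})),
    Predicate("pythonUDF0", frozenset({"pythonUDF0"})),
]
pushed, kept = split_pushable(filters, child_output)
print([p.text for p in pushed])  # predicates that can move below the node
print([p.text for p in kept])    # predicates that must stay above it
```

A predicate on the UDF result has to stay above the python-eval node because the child cannot supply that column, while a predicate on plain columns can travel down towards the scan.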

icexelloss · 2019-05-22T18:26:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

-
-  // Split the original FilterExec to two FilterExecs. Only push down the first few predicates
-  // that are all deterministic.
-  private def trySplitFilter(plan: LogicalPlan): LogicalPlan = {


Can you explain a little why this is no longer needed?

Quoting the PR description:

There are some hacks in the ExtractPythonUDFs rule that duplicate the column pruning and filter pushdown logic. However, they have some bugs, as demonstrated in the new test case (only column pruning is broken). This PR removes the hacks and re-applies the column pruning and filter pushdown rules explicitly.

HyukjinKwon · 2019-05-23T05:33:56Z

makes sense to me.

viirya · 2019-05-23T05:58:49Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala

  override val producedAttributes = AttributeSet(output)
 }
+
+trait BaseEvalPython extends UnaryNode {


Is producedAttributes missing from this? Previously, BatchEvalPython and ArrowEvalPython had it defined.

This is a problem I want to address later. I think producedAttributes makes no sense. It's only used to define missingInput, but we can override references to do the same thing.

More specifically, if references is wrongly implemented, column pruning will be broken. If producedAttributes is not implemented, nothing serious will happen.
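
The relationship described above can be sketched with plain sets. This is a toy, not Spark code: `missing_input` is a hypothetical helper, and the formula (references minus the child's output minus the produced attributes) is my understanding of how missingInput is derived, so treat it as an assumption.

```python
from typing import FrozenSet


def missing_input(
    references: FrozenSet[str],
    input_set: FrozenSet[str],
    produced: FrozenSet[str] = frozenset(),
) -> FrozenSet[str]:
    """Attributes a node reads that neither its child nor the node supplies."""
    return references - input_set - produced


child_output = frozenset({"a"})
udf_inputs = frozenset({"a"})
result_attrs = frozenset({"pythonUDF0"})

# references limited to the UDF inputs: nothing is missing, no produced set needed.
print(missing_input(udf_inputs, child_output))

# references that wrongly include the result attributes: pythonUDF0 looks missing...
print(missing_input(udf_inputs | result_attrs, child_output))

# ...unless a produced set is declared to cancel it out.
print(missing_input(udf_inputs | result_attrs, child_output, result_attrs))
```

Under these assumptions, a correct references set makes the produced set redundant for this check, which is the point the reply is making.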

viirya

I think this looks good. We should have column pruning in a single place, not duplicated separately in ExtractPythonUDFs as it was previously.

HyukjinKwon · 2019-05-24T11:52:17Z

BTW, just to stay in sync with you too, @BryanCutler, @viirya and @icexelloss: I am planning to add a bunch of tests specific to regular Python UDFs and Pandas Scalar UDFs, which could possibly be reused for Scala UDFs too - I am trying to find a way to deduplicate as much as possible. Hopefully this makes sense to you guys.

The special rules ExtractPythonUDF[s|FromAggregate] deal with unevaluable expressions that always have to be wrapped in special plans. It seems we are removing some hacks now, but I don't think we're sure about the coverage.

I think we started to observe these issues once we turned the Python nodes from physical plans into logical plans, which was (I think) the right fix, but our tests couldn't catch many cases like this one. My idea is basically to share (or partially duplicate) the *.sql files across Python / Pandas / Scala UDFs - I hope this prevents such issues in the future.

HyukjinKwon · 2019-05-24T12:00:11Z

Will get this in in a few days if there are no more comments.

HyukjinKwon · 2019-05-27T12:39:37Z

Merged to master.

fix column pruning for python UDF

636b603

cloud-fan force-pushed the python branch from bbc085d to 636b603 Compare May 22, 2019 12:45

HyukjinKwon changed the title from "[SPARK-27803][SQL] fix column pruning for python UDF" to "[SPARK-27803][SQL][PYTHON] Fix column pruning for Python UDF" on May 23, 2019

HyukjinKwon approved these changes May 24, 2019

HyukjinKwon closed this in 6506616 May 27, 2019

cloud-fan mentioned this pull request Jul 4, 2019

[SPARK-28250][SQL] QueryPlan#references should exclude producedAttributes #25052

Closed