
Conversation

@gatorsmile
Member

@gatorsmile gatorsmile commented Dec 7, 2016

What changes were proposed in this pull request?

Currently, when users use a Python UDF in a Filter, BatchEvalPython is always generated below FilterExec. However, not all of the predicates need to be evaluated after the Python UDF runs. This PR pushes the deterministic predicates down through BatchEvalPython.

>>> df = spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"])
>>> from pyspark.sql.functions import udf, col
>>> from pyspark.sql.types import BooleanType
>>> my_filter = udf(lambda a: a < 2, BooleanType())
>>> sel = df.select(col("key"), col("value")).filter((my_filter(col("key"))) & (df.value < "2"))
>>> sel.explain(True)

Before the fix, the plan looks like

== Optimized Logical Plan ==
Filter ((isnotnull(value#1) && <lambda>(key#0L)) && (value#1 < 2))
+- LogicalRDD [key#0L, value#1]

== Physical Plan ==
*Project [key#0L, value#1]
+- *Filter ((isnotnull(value#1) && pythonUDF0#9) && (value#1 < 2))
   +- BatchEvalPython [<lambda>(key#0L)], [key#0L, value#1, pythonUDF0#9]
      +- Scan ExistingRDD[key#0L,value#1]

After the fix, the plan looks like

== Optimized Logical Plan ==
Filter ((isnotnull(value#1) && <lambda>(key#0L)) && (value#1 < 2))
+- LogicalRDD [key#0L, value#1]

== Physical Plan ==
*Project [key#0L, value#1]
+- *Filter pythonUDF0#9: boolean
   +- BatchEvalPython [<lambda>(key#0L)], [key#0L, value#1, pythonUDF0#9]
      +- *Filter (isnotnull(value#1) && (value#1 < 2))
         +- Scan ExistingRDD[key#0L,value#1]

How was this patch tested?

Added unit test cases for BatchEvalPythonExec and an end-to-end test case in the Python test suite.

@gatorsmile gatorsmile changed the title [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPython [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPython (Python UDF) Dec 7, 2016
val qualifiedPlanNodes = df.queryExecution.executedPlan.collect {
case f @ FilterExec(And(_: AttributeReference, _: AttributeReference), _) => f
case b: BatchEvalPythonExec => b
case f @ FilterExec(_: In, _) => f
Member Author

The physical plan has a few hidden nodes that are not shown in Explain output. Thus, I did not compare the result with the expected tree structure.

assert(qualifiedPlanNodes.size == 2)
}

test("Python UDF refers to the attributes from more than one child") {
Member Author

This test case is not directly related to this PR. In the future, we should add more unit test cases on the Scala side to verify BatchEvalPythonExec and improve test coverage.

}

// This rule is to push deterministic predicates through BatchEvalPythonExec
object PushPredicateThroughBatchEvalPython extends Rule[SparkPlan] with PredicateHelper {
Member Author

Most of this code comes from the optimizer rule PushDownPredicate. Not sure whether we should combine them, since this rule operates on SparkPlan.

Contributor

Having a predicate-pushdown rule for SparkPlan sounds bad. Can we try to do this in extract()? For example:

val splittedFilter = trySplitFilter(plan)
val newChildren = splittedFilter.children.map { child =>
  // ... extract the Python UDFs from each child as before ...
}

Member Author

Good idea! The new commit does it.

@gatorsmile
Member Author

cc @cloud-fan @davies @liancheng

 def apply(plan: SparkPlan): SparkPlan = plan transformUp {
-  case plan: SparkPlan => extract(plan)
+  case plan: SparkPlan =>
+    val newPlan = extract(plan)
Member Author

@gatorsmile gatorsmile Dec 7, 2016

extract is a recursive function. That is why I did not move the following logic into extract, for performance reasons.

from pyspark.sql.types import BooleanType

my_filter = udf(lambda a: a < 2, BooleanType())
sel = df.select(col("key"), col("value")).filter((my_filter(col("key"))) & (df.value < "2"))
Contributor

does this test fail before this PR?

Member Author

Nope. This case works well.

@cloud-fan
Contributor

Would it be easier if we created a logical node for the Python evaluator? We had one in Spark 1.6, but it was removed in 2.0; not sure why.

@SparkQA

SparkQA commented Dec 7, 2016

Test build #69787 has finished for PR 16193 at commit eaf740a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


class BatchEvalPythonExecSuite extends SparkPlanTest with SharedSQLContext {
import testImplicits.newProductEncoder
import testImplicits.localSeqToDatasetHolder
Member

nit: indentation?

// Only push down the predicates that is deterministic and all the referenced attributes
// come from grandchild.
val (candidates, containingNonDeterministic) =
splitConjunctivePredicates(filter.condition).span(_.deterministic)
Member

nit. Indentation?

@gatorsmile
Member Author

@cloud-fan Let me do a history search and see why we dropped the logical plan node EvaluatePython

@gatorsmile
Member Author

#12127 dropped the node EvaluatePython. Based on the PR description, we removed the node for the following reasons:

Currently we extract Python UDFs into a special logical plan EvaluatePython in the analyzer, but EvaluatePython is not part of Catalyst; many rules have no knowledge of it, which will break many things (for example, filter pushdown or column pruning).
We should treat Python UDFs as normal expressions until we want to evaluate them in the physical plan; we could extract them at the end of the optimizer, or in the physical plan.

@gatorsmile
Member Author

I also checked the plan on our 1.6.3 branch. The filter is not appropriately pushed down, even though we have the logical node EvaluatePython.

== Parsed Logical Plan ==
'Filter (PythonUDF#<lambda>('key) && (value#1 < 2))
+- Project [key#0L,value#1]
   +- LogicalRDD [key#0L,value#1], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2

== Analyzed Logical Plan ==
key: bigint, value: string
Project [key#0L,value#1]
+- Filter (pythonUDF#2 && (value#1 < 2))
   +- EvaluatePython PythonUDF#<lambda>(key#0L), pythonUDF#2: boolean
      +- Project [key#0L,value#1]
         +- LogicalRDD [key#0L,value#1], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2

== Optimized Logical Plan ==
Project [key#0L,value#1]
+- Filter (pythonUDF#2 && (value#1 < 2))
   +- EvaluatePython PythonUDF#<lambda>(key#0L), pythonUDF#2: boolean
      +- LogicalRDD [key#0L,value#1], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2

== Physical Plan ==
Project [key#0L,value#1]
+- Filter (pythonUDF#2 && (value#1 < 2))
   +- !BatchPythonEvaluation PythonUDF#<lambda>(key#0L), [key#0L,value#1,pythonUDF#2]
      +- Scan ExistingRDD[key#0L,value#1]

@SparkQA

SparkQA commented Dec 7, 2016

Test build #69808 has finished for PR 16193 at commit 2c3b917.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69850 has started for PR 16193 at commit 2c3b917.

@viirya
Member

viirya commented Dec 8, 2016

Since we can push predicates down through a data source scan, those predicates should already be pushed down if they sit above a data source scan node in the query plan. This seems to only help an ExistingRDD scan, where the predicates cannot be pushed down.

So the question is, should we push predicates down to an ExistingRDD scan? I think there is not much benefit except for this case.

@viirya
Member

viirya commented Dec 8, 2016

If we really want to do this, I'd suggest pushing the predicates down to the RDD scan node during the query planning stage, so we don't need to push predicates through SparkPlan like this.

@gatorsmile
Member Author

@viirya I did not get your point. Why wouldn't pushing predicates down through the Python UDF have a significant benefit? Based on my understanding, it can greatly reduce the number of rows consumed/processed by the UDF. Normally, a UDF is much more expensive than built-in expressions.

@gatorsmile
Member Author

ExistingRDD might not always be the child of the Filter. For example,

>>> sel = df.select('key', 'value', rand()).filter((my_filter(col("key"))) & (df.value < "2"))
== Optimized Logical Plan ==
Filter ((isnotnull(value#1) && <lambda>(key#0L)) && (value#1 < 2))
+- Project [key#0L, value#1, rand(9089678730530723370) AS rand(9089678730530723370)#20]
   +- LogicalRDD [key#0L, value#1]

== Physical Plan ==
*Project [key#0L, value#1, rand(9089678730530723370)#20]
+- *Filter pythonUDF0#26: boolean
   +- BatchEvalPython [<lambda>(key#0L)], [key#0L, value#1, rand(9089678730530723370)#20, pythonUDF0#26]
      +- *Filter (isnotnull(value#1) && (value#1 < 2))
         +- *Project [key#0L, value#1, rand(9089678730530723370) AS rand(9089678730530723370)#20]
            +- Scan ExistingRDD[key#0L,value#1]


if (pushDown.nonEmpty) {
val newChild = FilterExec(pushDown.reduceLeft(And), filter.child)
if (stayUp.nonEmpty) {
Contributor

There should be some UDFs here, so this will not be empty.

Member Author

True. : )

// come from child.
val (candidates, containingNonDeterministic) =
splitConjunctivePredicates(filter.condition).span(_.deterministic)
val (pushDown, rest) = candidates.partition(!hasPythonUDF(_))
Contributor

nit: splitConjunctivePredicates(filter.condition).span(e => e.deterministic && !hasPythonUDF(e))

Member Author

This will change the semantics. span and partition have different semantics. Thus, we still have to keep the existing behavior.

Let me write a comment to explain that PythonUDF is always assumed to be deterministic.
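
To illustrate the difference with a toy example (not from this PR): span cuts the sequence at the first element that fails the predicate and keeps the relative order of everything after the cut, while partition reorders elements from the whole sequence.

```scala
// Toy illustration of span vs. partition with the same predicate.
val xs = Seq(1, 2, -1, 3)
xs.span(_ > 0)       // (Seq(1, 2), Seq(-1, 3)) -- stops at the first failure; 3 stays after -1
xs.partition(_ > 0)  // (Seq(1, 2, 3), Seq(-1)) -- 3 moves ahead of -1, changing the relative order
```

So collapsing the two steps into a single span, as suggested, would stop pushing at the first predicate that contains a Python UDF instead of partitioning the remaining deterministic conjuncts around it.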

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69878 has finished for PR 16193 at commit 3d9ba67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private def trySplitFilter(plan: SparkPlan): SparkPlan = {
plan match {
case filter: FilterExec =>
// Only push down the predicates that is deterministic and all the referenced attributes
Contributor

Only push down the first few predicates that are all deterministic

// come from child.
val (candidates, containingNonDeterministic) =
splitConjunctivePredicates(filter.condition).span(_.deterministic)
// Python UDF is always deterministic
Contributor

Is this one useful? We just won't push down expressions that have Python UDFs here.

@davies
Contributor

davies commented Dec 9, 2016

@cloud-fan There is no R UDF at this point.

@davies
Contributor

davies commented Dec 9, 2016

If there is no objection in the next two hours, I will merge this one into master.

@gatorsmile
Member Author

@davies Just updated the code comments, as you suggested. It does not affect the code logic. Sorry for the late update.

@gatorsmile
Member Author

gatorsmile commented Dec 9, 2016

@cloud-fan If the functions in dapply and gapply count as UDFs in SparkR, we have very limited support. In the plan output, they are represented as MapPartitionsInR and FlatMapGroupsInR.

More strictly, as @davies said, SparkR does not have actual SQL-level registered UDFs.

@gatorsmile
Member Author

@viirya I think your idea addresses a different issue. It does not apply to all the cases of PythonUDF pushdown.

@SparkQA

SparkQA commented Dec 9, 2016

Test build #69931 has finished for PR 16193 at commit 04b0e9c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Dec 10, 2016

Test build #69935 has finished for PR 16193 at commit 04b0e9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 // Rewrite the child that has the input required for the UDF
-val newChildren = plan.children.map { child =>
+val newChildren =
+  splittedFilter.children.map { child =>
Contributor

nit: no need to start a new line here?

case f: FilterExec => f
case b: BatchEvalPythonExec => b
}
assert(qualifiedPlanNodes.size == 3)
Contributor

It's really hard to tell correctness just by checking the number of plan nodes...

Member Author

Let me improve them.
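
One possible way to tighten the check, as a sketch only (it reuses the suite's df and the FilterExec/BatchEvalPythonExec imports already in the test file, and is not necessarily what the final commit does): assert on the order of the collected operators rather than just their count.

```scala
// collect traverses the plan top-down, so for the expected shape
// Filter(pythonUDF) -> BatchEvalPython -> Filter(built-in predicates)
// the middle collected node should be the BatchEvalPythonExec.
val nodes = df.queryExecution.executedPlan.collect {
  case f: FilterExec => f
  case b: BatchEvalPythonExec => b
}
assert(nodes.length == 3)
assert(nodes(1).isInstanceOf[BatchEvalPythonExec])
```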

@cloud-fan
Contributor

It's a little hacky to me that we do optimization in the planner. How hard would it be to introduce a logical node for the Python evaluator? We could define an interface in Catalyst, e.g. ExternalUDFEvaluator, so that R (or other languages in the future) UDFs could also benefit from it.

@davies
Contributor

davies commented Dec 10, 2016

@cloud-fan It's not trivial to do this in the optimizer. For example, we would need to split one Filter into two, which conflicts with another optimizer rule that combines two Filters into one.
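
A simplified sketch (not the actual Spark rule, which also checks determinism) of the combining behavior being referred to: any two adjacent Filters get merged back into one, undoing a split done earlier in the optimizer.

```scala
import org.apache.spark.sql.catalyst.expressions.And
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Merges two adjacent Filters into a single Filter with a conjunctive condition,
// so Filter(udf(b), Filter(a, child)) becomes Filter(a && udf(b), child) again.
object CombineFiltersSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(outerCond, Filter(innerCond, grandChild)) =>
      Filter(And(innerCond, outerCond), grandChild)
  }
}
```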

@viirya
Member

viirya commented Dec 10, 2016

If we add a logical node for the Python evaluator, we'd push the Filter down through it, so the optimizer rule wouldn't combine the two Filters into one again?

@davies
Contributor

davies commented Dec 10, 2016

The reason we moved the Python UDF evaluator from the logical plan into the physical plan is that this one-off node breaks many things; many rules would need to treat it specially.

@davies
Contributor

davies commented Dec 10, 2016

Pushing predicates down into a data source also happens during planning, so I think this is not the first place we do optimization outside the Optimizer.

@SparkQA

SparkQA commented Dec 10, 2016

Test build #69961 has started for PR 16193 at commit 2c8e593.

@cloud-fan
Contributor

LGTM

@viirya
Member

viirya commented Dec 10, 2016

retest this please.

@viirya
Member

viirya commented Dec 10, 2016

LGTM

@SparkQA

SparkQA commented Dec 10, 2016

Test build #69965 has finished for PR 16193 at commit 2c8e593.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

Thanks! Merging to master!

@asfgit asfgit closed this in 422a45c Dec 10, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017