[SPARK-12981][SQL] Fix Python UDF extraction for aggregation. #10935

xguo27 · 2016-01-27T00:40:16Z

When the ExtractPythonUDFs rule is applied to an Aggregate operator, the operator becomes a Project. This change fixes that and keeps the Aggregate operator as its original type.

AmplabJenkins · 2016-01-27T00:42:14Z

Can one of the admins verify this patch?

xguo27 · 2016-02-26T19:07:48Z

@rxin Does this fix look good to you?

rxin · 2016-02-27T02:52:54Z

cc @davies

rxin · 2016-02-27T02:54:07Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

+
+            if (plan.isInstanceOf[Aggregate]) {
+              transformed
+            }


a style nit: put else on the same line as the previous }

also can you add some comment explaining what's happening

xguo27 · 2016-03-01T05:44:56Z

Using these two functionally equivalent code snippets:

Scala

import org.apache.spark.sql.functions.col
import sqlContext.implicits._  // needed for toDF
val data = Seq((1, "1"), (2, "2"), (3, "2"), (1, "3")).toDF("a", "b")
val my_filter = sqlContext.udf.register("my_filter", (a: Int) => a == 1)
data.select(col("a")).distinct().filter(my_filter(col("a")))

Python

from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType
data = sqlContext.createDataFrame([(1, "1"), (2, "2"), (3, "2"), (1, "3")], ["a", "b"])
my_filter = udf(lambda a: a == 1, BooleanType())
data.select(col("a")).distinct().filter(my_filter(col("a")))

The logical plan returned by execute(aggregatedCondition) at this point in the analyzer is shown below:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Line 801 in 916fc34

val resolvedOperator = execute(aggregatedCondition)

Scala

Aggregate [a#8], [UDF(a#8) AS havingCondition#11]
+- Project [a#8]
   +- Project [_1#6 AS a#8,_2#7 AS b#9]
      +- LocalRelation [_1#6,_2#7], [[1,1],[2,2],[3,2],[1,3]]

Python

Project [havingCondition#2]
+- Aggregate [a#0L], [pythonUDF#3 AS havingCondition#2]
   +- EvaluatePython PythonUDF#<lambda>(a#0L), pythonUDF#3: boolean
      +- Project [a#0L]
         +- LogicalRDD [a#0L,b#1], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2

In the Python case, an extra Project is injected when execute(aggregatedCondition) runs the plan through ExtractPythonUDFs, but ResolveAggregateFunctions expects an Aggregate here:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Lines 801 to 805 in 916fc34

    
    val resolvedOperator = execute(aggregatedCondition)
    def resolvedAggregateFilter =
      resolvedOperator
        .asInstanceOf[Aggregate]
        .aggregateExpressions.head

With this fix, ExtractPythonUDFs no longer wraps the plan in a Project when the plan is an Aggregate. The plan generated for Python UDFs is then consistent with its Scala counterpart and gives ResolveAggregateFunctions the Aggregate it expects:

After fix, Python:

Aggregate [a#0L], [pythonUDF#3 AS havingCondition#2]
+- EvaluatePython PythonUDF#<lambda>(a#0L), pythonUDF#3: boolean
   +- Project [a#0L]
      +- LogicalRDD [a#0L,b#1], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2
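
To make the plan-shape problem above concrete, here is a minimal sketch in plain Python. It is not Spark code: `Plan`, `extract_before_fix`, `extract_after_fix`, and `resolved_aggregate_filter` are hypothetical names, and the rewrite is simplified to what this thread describes (always wrapping in a Project before the fix, skipping the wrapper for an Aggregate after it).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    """Toy stand-in for a logical plan node (hypothetical, not Spark's API)."""
    name: str                       # operator type, e.g. "Aggregate" or "Project"
    child: Optional["Plan"] = None  # single child is enough for this sketch


def extract_before_fix(plan: Plan) -> Plan:
    """Insert EvaluatePython under the operator, then always wrap it in a Project."""
    transformed = Plan(plan.name, Plan("EvaluatePython", plan.child))
    return Plan("Project", transformed)


def extract_after_fix(plan: Plan) -> Plan:
    """Same rewrite, but an Aggregate is returned without the extra Project."""
    transformed = Plan(plan.name, Plan("EvaluatePython", plan.child))
    if plan.name == "Aggregate":
        return transformed
    return Plan("Project", transformed)


def resolved_aggregate_filter(plan: Plan) -> Plan:
    """Mimic the asInstanceOf[Aggregate] cast: fail if the root is not an Aggregate."""
    if plan.name != "Aggregate":
        raise TypeError(f"{plan.name} cannot be cast to Aggregate")
    return plan


# Assumed input shape: Aggregate over Project over the source relation.
aggregate = Plan("Aggregate", Plan("Project", Plan("LogicalRDD")))
```

In this sketch, `resolved_aggregate_filter(extract_before_fix(aggregate))` raises because the root is a Project, while `resolved_aggregate_filter(extract_after_fix(aggregate))` succeeds because the root stays an Aggregate.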

davies · 2016-04-02T06:42:41Z

@xguo27 Thanks for working on this. I think the root cause here is that we extract Python UDFs too early (in the analyzer). EvaluatePython is a special logical plan that many rules have no knowledge of, which will break many things. We should extract Python UDFs later, at the end of the optimizer or in the physical plan. I will send a PR to fix that.

xguo27 · 2016-04-03T01:06:10Z

Sure @davies. I will close this PR.

## What changes were proposed in this pull request?

Currently we extract Python UDFs into a special logical plan, EvaluatePython, in the analyzer. But EvaluatePython is not part of Catalyst, so many rules have no knowledge of it, which breaks many things (for example, filter pushdown or column pruning).

We should treat Python UDFs as normal expressions until we want to evaluate them in the physical plan; we could extract them at the end of the optimizer or in the physical plan. This PR extracts Python UDFs in the physical plan.

Closes #10935

## How was this patch tested?

Added regression tests.

Author: Davies Liu

Closes #12127 from davies/py_udf.
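
As a rough illustration of why a plan node that rules do not know about can block an optimization such as filter pushdown, here is a toy sketch in plain Python. It is not Spark's optimizer: `Node`, `push_filter_down`, and `shape` are hypothetical names, and the rule is deliberately reduced to a single pattern (a Filter sitting directly on a Project).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Toy plan node (hypothetical, not Spark's API)."""
    name: str
    child: Optional["Node"] = None


def push_filter_down(plan: Node) -> Node:
    """Toy rule: move a Filter below a Project that is its direct child.

    Any other child type, including one the rule has never heard of,
    leaves the plan unchanged.
    """
    if plan.name == "Filter" and plan.child is not None and plan.child.name == "Project":
        project = plan.child
        return Node("Project", Node("Filter", project.child))
    return plan


def shape(plan: Optional[Node]) -> List[str]:
    """List operator names from root to leaf, for easy comparison."""
    names = []
    while plan is not None:
        names.append(plan.name)
        plan = plan.child
    return names


# Plan with only node types the rule knows: the Filter is pushed down.
known = Node("Filter", Node("Project", Node("Relation")))

# Plan where EvaluatePython was injected early: the rule no longer matches.
early = Node("Filter", Node("EvaluatePython", Node("Project", Node("Relation"))))
```

Under these assumptions, `shape(push_filter_down(known))` has the Filter below the Project, while `shape(push_filter_down(early))` is the same as `shape(early)`. Keeping Python UDFs as ordinary expressions until physical planning avoids placing such a node in front of the optimizer rules.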

We've attempted to backport the following patch for a pretty major bug in 1.6 DataFrames; hopefully it works... apache#10935

[SPARK-12981][SQL] Fix Python UDF extraction for aggregation.

1065bb2

rxin reviewed Feb 27, 2016
Update per Reynold's suggestion

e4d2629

davies mentioned this pull request Apr 2, 2016

[SPARK-12981] [SQL] extract Pyhton UDF in physical plan #12127

Closed

xguo27 closed this Apr 3, 2016

xguo27 deleted the SPARK-12981 branch April 3, 2016 01:08

tarnfeld added a commit to duedil-ltd/spark that referenced this pull request May 24, 2016

Backport fix for SPARK-12981 to 1.6

24e22ee
