-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-12981][SQL] Fix Python UDF extraction for aggregation. #10935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
@rxin Does this fix look good to you? |
|
cc @davies |
|
|
||
| if (plan.isInstanceOf[Aggregate]) { | ||
| transformed | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a style nit: put else on the same line as the previous }
also can you add some comment explaining what's happening
|
Using these two functionally equavalent code snippets: Scala Python The logical plan comes out spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala Line 801 in 916fc34
Scala Python We can see in Python's case, we inject an extra Project when spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala Lines 801 to 805 in 916fc34
With this fix, the logical plan generated for Python UDFs does not construct a Project if it is an Aggregate, making it consistent with its Scala counterpart, which gives correct results for ResolveAggregateFunctions to consume: After fix, Python: |
|
@xguo27 Thanks for working on this. I think the root cause here is that we extract Python UDFs too early (in analyzer), EvaluatePython is an special logical plan, many rules have no knowledge of it, which will break many things. We should extract Python UDFs later, in end of optimizer, or physical plan, I will send an PR to fix that. |
|
Sure @davies . I will close this PR. |
## What changes were proposed in this pull request? Currently we extract Python UDFs into a special logical plan EvaluatePython in analyzer, But EvaluatePython is not part of catalyst, many rules have no knowledge of it , which will break many things (for example, filter push down or column pruning). We should treat Python UDFs as normal expressions, until we want to evaluate in physical plan, we could extract them in end of optimizer, or physical plan. This PR extract Python UDFs in physical plan. Closes #10935 ## How was this patch tested? Added regression tests. Author: Davies Liu <[email protected]> Closes #12127 from davies/py_udf.
We'ved attempted to backport the following patch for a pretty major bug in 1.6 dataframes, hopefully it works... apache#10935
When Aggregate operator being applied ExtractPythonUDFs rule, it becomes a Project. This change fixes that and maintain Aggregate operator to the original type.