-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-27803][SQL][PYTHON] Fix column pruning for Python UDF #24675
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to work with the ColumnPruning and PushDownPredicate rule, we must correctly implement the references method. resultAttrs are definitely not references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If references only cover references in udfs, will some output attributes from child that aren't referred by udfs be pruned from BaseEvalPython?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, and this is "column pruning".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to work with the ColumnPruning rule, the python-eval node should be able to dynamically update its output if the child's output updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for moving out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've to move out because I need to access them in PushdownPredicate, which is in catalyst module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ExtractPythonUDFs newly added to nonExcludableRules? Is it also for the fix? Or just it should be there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be there. We can do it in another PR, but since I'm touching this file, I just fixed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just out of curiosity.
|
Test build #105671 has finished for PR 24675 at commit
|
|
retest this please |
|
Retest this please. |
|
Test build #105678 has finished for PR 24675 at commit
|
|
retest this please |
|
Test build #105684 has finished for PR 24675 at commit
|
|
Test build #105692 has finished for PR 24675 at commit
|
| case _: Repartition => true | ||
| case _: ScriptTransformation => true | ||
| case _: Sort => true | ||
| case _: BatchEvalPython => true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my benefit, would you mind explain what does canPushThrough define? Are these nodes that a projection and/or filter can be pushed through?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This defines the nodes that we can push filters through.
|
|
||
| // Split the original FilterExec to two FilterExecs. Only push down the first few predicates | ||
| // that are all deterministic. | ||
| private def trySplitFilter(plan: LogicalPlan): LogicalPlan = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain a little why this is no longer needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quote from the PR description
There are some hacks in the ExtractPythonUDFs rule, to duplicate the column pruning and filter pushdown logic. However, it has some bugs as demonstrated in the new test case(only column pruning is broken). This PR removes the hacks and re-apply the column pruning and filter pushdown rules explicitly.
|
makes sense to me. |
| override val producedAttributes = AttributeSet(output) | ||
| } | ||
|
|
||
| trait BaseEvalPython extends UnaryNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is producedAttributes missing from this? Previously, BatchEvalPython and ArrowEvalPython have it defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a problem I want to address later. I think producedAttributes makes no sense. It's only used to define missingInput, but we can overwrite reference to do the same thing.
More specifically, if reference is wrongly implemented, column pruning will be broken. If producedAttributes is not implemented, nothing serious will happen.
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good. We should have column pruning at single place, not like separately in ExtractPythonUDFs, previously.
|
BTW, just to be sync'ed with you too @BryanCutler, @viirya and @icexelloss, I am planning to add a bunch of tests specific to regular Python UDF and Pandas Scalar UDF, which are possibly able to reused to Scala UDF too - I am trying to find a way to deduplicate as much as possible. I hopefully it makes sense to you guys. This special rule I think we started to observe those issues since we turned those Python ones from physical plans to logical plans, which was (I think) right fix but couldn't catch many cases like this. My idea is basically to share (or partially duplicate) *.sql files for Python / Pandas / Scala UDFs - hope this idea prevents such issues in the future. |
|
Will get this in in few days if there are no more comments. |
|
Merged to master. |
What changes were proposed in this pull request?
In #22104 , we create the python-eval nodes at the end of the optimization phase, which causes a problem.
After the main optimization batch, Filter and Project nodes are usually pushed to the bottom, near the scan node. However, if we extract Python UDFs from Filter/Project, and create a python-eval node under Filter/Project, it will break column pruning/filter pushdown of the scan node.
There are some hacks in the
ExtractPythonUDFsrule, to duplicate the column pruning and filter pushdown logic. However, it has some bugs as demonstrated in the new test case(only column pruning is broken). This PR removes the hacks and re-apply the column pruning and filter pushdown rules explicitly.Before:
After:
How was this patch tested?
new test