-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29128][SQL] Split predicate code in OR expressions #25827
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| // of the whole stage codegen. But, we don't do so now because the performance changes that | ||
| // we don't expect might occur in many queries. Therefore, we currently apply | ||
| // this split function to specific performance-sensitive places only, | ||
| // e.g., common subexpression elimination for the whole stage codegen and OR expressions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya @cloud-fan Is this correct? I remember @viirya worked on this in #21140 long before, but the pr wasn't merged because of some reasons: performance or design issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall that is because string-based manipulation was thought too buggy? I didn't remember it is because of performance issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, I see.
|
Test build #110865 has finished for PR 25827 at commit
|
|
retest this please |
| * Defines an independent function for a given expression and returns a caller-side code | ||
| * for the function as `ExprCode`. | ||
| */ | ||
| def defineIndependentFunction( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does Independent stand for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ur, I made a mistake trying to write Individual.. I'll update soon.
|
Test build #110873 has finished for PR 25827 at commit
|
d116613 to
7d8a382
Compare
|
Test build #110967 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #110975 has finished for PR 25827 at commit
|
|
Anyone could check this? @cloud-fan @viirya @kiszk |
| } | ||
|
|
||
| /** | ||
| * Defines an independent function for a given expression and returns a caller-side code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is it better to update independent?
kiszk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except one nit
| private def reduceCodeSize(ctx: CodegenContext, eval: ExprCode): Unit = { | ||
| // TODO: support whole stage codegen too | ||
| // | ||
| // NOTE: We could use `CodeGenerator.defineSingleSplitFunction` here for the code path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid duplicate work. Currently, this func and defineSingleSplitFunction have too similar logic.
| ev: ExprCode, | ||
| funcNameOption: Option[String] = None, | ||
| inputVarsOption: Option[Seq[VariableValue]] = None): ExprCode = { | ||
| val inputVars = inputVarsOption.getOrElse(getLocalInputVariableValues(this, expr).toSeq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I am not sure if getLocalInputVariableValues is totally ok to extract all input variables from an expr. This is what I want to do when I was changing codegen. I remember not all expressions are changed to expose input variables information. I did the change but stoped in the middle because the community has different idea other than this string based framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ur, I see. we might need more discussion around this code. If that's true, it might be dangerous to use this in splitAggrExpr and common subexpr elimination, too.
|
Test build #111099 has finished for PR 25827 at commit
|
ca6522a to
ab15e24
Compare
|
Test build #112716 has finished for PR 25827 at commit
|
ab15e24 to
47e28c4
Compare
47e28c4 to
543c016
Compare
|
Test build #114185 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #114184 has finished for PR 25827 at commit
|
|
hmm is it a common problem for all expressions? e.g. the children are very big expression trees. |
|
yea, I think so. @viirya worked on that issue a few years ago (IIRC that work was not merged cuz we couldn't reach a consensus for general solution). This pr targets to solve a part of the issue caused in OR expressions only. |
|
Test build #114230 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #115280 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #115303 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #116761 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #116824 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #116847 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #118037 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #118514 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #119045 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #119824 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #119848 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #120355 has finished for PR 25827 at commit
|
|
retest this please |
|
Test build #120380 has finished for PR 25827 at commit
|
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
|
retest this please |
|
Test build #124959 has finished for PR 25827 at commit
|
What changes were proposed in this pull request?
This pr is to split predicate code in OR expressions. When I checked if method bytecode size in
BenchmarkQueryTestwent over the OpenJDK default limit (8000) or not in #25788, I found TPCDSQuerySuite.modified-q3 had too big functions. That's because too long OR chains in the query generate too long code in a single function;modified-q3generates the code below in the masterThis pr split the predicate code into small functions below;
Why are the changes needed?
For better generated code.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing unit tests.