[SPARK-21717][SQL] Decouple consume functions of physical operators in whole-stage codegen #18931
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -177,6 +177,8 @@ case class SortExec( | |
| """.stripMargin.trim | ||
| } | ||
|
|
||
| override protected def doConsumeInChainOfFunc: Boolean = false | ||
|
||
|
|
||
| protected override val shouldStopRequired = false | ||
|
|
||
| override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = { | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -149,13 +149,143 @@ trait CodegenSupport extends SparkPlan { | |
|
|
||
| ctx.freshNamePrefix = parent.variablePrefix | ||
| val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) | ||
|
|
||
| // Under certain conditions, we can put the logic to consume the rows of this operator into | ||
|
Member
Could you elaborate
Member
Author
Added more comments to elaborate on the idea.
||
| // another function, which keeps a generated function from growing too long to be optimized by JIT. | ||
| // The conditions: | ||
| // 1. The parent uses all variables in output. We can't defer variable evaluation when consuming | ||
|
||
| // in another function. | ||
| // 2. The output variables are not empty. If it's empty, we don't bother to do that. | ||
| // 3. We don't use row variable. The construction of row uses deferred variable evaluation. We | ||
|
||
| // can't do it. | ||
| val requireAllOutput = output.forall(parent.usedInputs.contains(_)) | ||
| val consumeFunc = | ||
| if (row == null && outputVars.nonEmpty && requireAllOutput) { | ||
| constructDoConsumeFunction(ctx, inputVars) | ||
|
||
| } else { | ||
| parent.doConsume(ctx, inputVars, rowVar) | ||
| } | ||
| s""" | ||
| |${ctx.registerComment(s"CONSUME: ${parent.simpleString}")} | ||
| |$evaluated | ||
| |${parent.doConsume(ctx, inputVars, rowVar)} | ||
| |$consumeFunc | ||
| """.stripMargin | ||
| } | ||
|
|
||
| /** | ||
| * To prevent a concatenated function from growing too long to be optimized by JIT, instead of | ||
| * inlining we may put the consume logic of the parent operator into a function and set this flag | ||
| * to `true`. The parent operator can then know whether its consume logic is inlined or in a | ||
| * separate function. | ||
| */ | ||
| private var doConsumeInFunc: Boolean = false | ||
|
|
||
| /** | ||
| * Returning true means we have at least one consume logic from child operator or this operator is | ||
| * separated in a function. If this is `true`, this operator shouldn't use `continue` statement to | ||
| * continue on next row, because its generated codes aren't enclosed in main while-loop. | ||
| * | ||
| * For example, we have generated codes for a query plan like: | ||
| * Op1Exec | ||
| * Op2Exec | ||
| * Op3Exec | ||
| * | ||
| * If we put the consume code of Op2Exec into a separated function, the generated codes are like: | ||
| * while (...) { | ||
| * ... // logic of Op3Exec. | ||
| * Op2Exec_doConsume(...); | ||
| * } | ||
| * private boolean Op2Exec_doConsume(...) { | ||
| * ... // logic of Op2Exec to consume rows. | ||
| * } | ||
| * For now, `doConsumeInChainOfFunc` of Op2Exec will be `true`. | ||
| * | ||
| * Notice for some operators like `HashAggregateExec`, it doesn't chain previous consume functions | ||
| * but begins with its produce framework. We should override `doConsumeInChainOfFunc` to return | ||
| * `false`. | ||
| */ | ||
| protected def doConsumeInChainOfFunc: Boolean = { | ||
| val codegenChildren = children.map(_.asInstanceOf[CodegenSupport]) | ||
| doConsumeInFunc || codegenChildren.exists(_.doConsumeInChainOfFunc) | ||
| } | ||
|
|
||
| /** | ||
| * The actual java statement this operator should use if there is a need to continue on next row | ||
| * in its `doConsume` codes. | ||
| * | ||
| * while (...) { | ||
| * ... // logic of Op3Exec. | ||
| * Op2Exec_doConsume(...); | ||
| * } | ||
| * private boolean Op2Exec_doConsume(...) { | ||
| * ... // logic of Op2Exec to consume rows. | ||
| * continue; // Wrong. We can't use continue with the while-loop. | ||
| * } | ||
| * In above code, we can't use `continue` in `Op2Exec_doConsume`. | ||
| * | ||
| * Instead, we do something like: | ||
| * while (...) { | ||
| * ... // logic of Op3Exec. | ||
| * boolean continueForLoop = Op2Exec_doConsume(...); | ||
| * if (continueForLoop) continue; | ||
| * } | ||
| * private boolean Op2Exec_doConsume(...) { | ||
| * ... // logic of Op2Exec to consume rows. | ||
| * return true; // When we need to do continue, we return true. | ||
| * } | ||
| */ | ||
| protected def continueStatementInDoConsume: String = if (doConsumeInChainOfFunc) { | ||
| "return true;" | ||
|
||
| } else { | ||
| "continue;" | ||
| } | ||
|
|
||
| /** | ||
| * To prevent a concatenated function from growing too long to be optimized by JIT, we can | ||
| * separate the parent's `doConsume` code of a `CodegenSupport` operator into a function to call. | ||
| */ | ||
| protected def constructDoConsumeFunction( | ||
|
||
| ctx: CodegenContext, | ||
| inputVars: Seq[ExprCode]): String = { | ||
| val (callingParams, arguList, inputVarsInFunc) = | ||
|
||
| constructConsumeParameters(ctx, output, inputVars) | ||
| parent.doConsumeInFunc = true | ||
| val rowVar = ExprCode("", "false", "unsafeRow") | ||
| val doConsume = ctx.freshName("doConsume") | ||
|
Contributor
Shall we put the operator name in this function name?
Member
Author
The |
||
| val doConsumeFuncName = ctx.addNewFunction(doConsume, | ||
| s""" | ||
| | private boolean $doConsume($arguList) throws java.io.IOException { | ||
| | ${parent.doConsume(ctx, inputVarsInFunc, rowVar)} | ||
| | return false; | ||
| | } | ||
| """.stripMargin) | ||
|
|
||
| s""" | ||
| | boolean continueForLoop = $doConsumeFuncName($callingParams); | ||
| | if (continueForLoop) $continueStatementInDoConsume | ||
| """.stripMargin | ||
| } | ||
|
|
||
| /** | ||
| * Returns source code for calling consume function and the argument list of the consume function | ||
| * and also the `ExprCode` for the argument list. | ||
| */ | ||
| protected def constructConsumeParameters( | ||
|
||
| ctx: CodegenContext, | ||
| attributes: Seq[Attribute], | ||
| variables: Seq[ExprCode]): (String, String, Seq[ExprCode]) = { | ||
| val params = variables.zipWithIndex.map { case (ev, i) => | ||
| val callingParam = ev.value + ", " + ev.isNull | ||
| val arguName = ctx.freshName(s"expr_$i") | ||
| val arguIsNull = ctx.freshName(s"exprIsNull_$i") | ||
| (callingParam, | ||
| ctx.javaType(attributes(i).dataType) + " " + arguName + ", boolean " + arguIsNull, | ||
|
||
| ExprCode("", arguIsNull, arguName)) | ||
| }.unzip3 | ||
| (params._1.mkString(", "), | ||
| params._2.mkString(", "), | ||
| params._3) | ||
|
||
| } | ||
|
|
||
| /** | ||
| * Returns source code to evaluate all the variables, and clear the code of them, to prevent | ||
| * them to be evaluated twice. | ||
|
|
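To make the shape of `constructConsumeParameters` concrete, here is a minimal Java sketch of the same string-building idea: each variable contributes a `value, isNull` pair at the call site and a `type name, boolean isNullName` pair in the parameter list. This is an illustration only, not Spark code. `ConsumeParamsSketch` and `VarRef` are hypothetical names standing in for the helper and for `ExprCode`, and the fixed `expr_i` / `exprIsNull_i` names are an assumption replacing the unique names that `ctx.freshName` would produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ConsumeParamsSketch {
    // Hypothetical stand-in for Spark's ExprCode: only the generated variable names are kept.
    static final class VarRef {
        final String value;   // name of the Java variable holding the value
        final String isNull;  // name (or literal) of the boolean null flag

        VarRef(String value, String isNull) {
            this.value = value;
            this.isNull = isNull;
        }
    }

    // Returns {arguments used at the call site, parameter list of the generated function}.
    static String[] build(List<String> javaTypes, List<VarRef> vars) {
        List<String> calling = new ArrayList<>();
        List<String> declared = new ArrayList<>();
        for (int i = 0; i < vars.size(); i++) {
            VarRef v = vars.get(i);
            // Call site: pass the value variable followed by its null flag.
            calling.add(v.value + ", " + v.isNull);
            // Declaration: typed value parameter followed by a boolean null-flag parameter.
            // Assumption: fixed names here; the real code asks ctx.freshName for unique ones.
            declared.add(javaTypes.get(i) + " expr_" + i + ", boolean exprIsNull_" + i);
        }
        return new String[] {String.join(", ", calling), String.join(", ", declared)};
    }

    public static void main(String[] args) {
        String[] r = build(
            Arrays.asList("int", "long"),
            Arrays.asList(new VarRef("value1", "isNull1"), new VarRef("value2", "false")));
        System.out.println("doConsume(" + r[0] + ");");
        System.out.println("private boolean doConsume(" + r[1] + ") { ... }");
    }
}
```

Passing the null flag alongside every value doubles the parameter count, which is one reason a split-out function like this is only attempted when the parent needs all output variables.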
@@ -252,6 +382,8 @@ case class InputAdapter(child: SparkPlan) extends UnaryExecNode with CodegenSupp | |
| child.execute() :: Nil | ||
| } | ||
|
|
||
| override protected def doConsumeInChainOfFunc: Boolean = false | ||
|
|
||
| override def doProduce(ctx: CodegenContext): String = { | ||
| val input = ctx.freshName("input") | ||
| // Right now, InputAdapter is only used when there is one input RDD. | ||
|
|
||
I'm not 100% sure about your intention, but I feel this is a little confusing because `ExpandExec` consume functions can be chained in gen'd code, right?

The `doConsume` produces something like:
So the consume logic of its parent node is actually wrapped in a local for-loop. It has the same effect as not chaining the next consume.

Yeah, we probably need to describe more about the exceptional cases where we can't use this optimization, like `HashAggregateExec` in https://github.com/apache/spark/pull/18931/files#diff-28cb12941b992ff680c277c651b59aa0R204

The good news is that the just-merged #19324 simplifies the usage of `continue` in codegen. I'm now testing whether I can use it to remove this tricky part of `continue`.
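
For readers following the thread, the `continue` handling under discussion can be reproduced in plain Java. The sketch below is an illustration under stated assumptions, not the code this PR generates: `ContinueViaReturnSketch`, `op2DoConsume`, and the even/odd filter are hypothetical. It shows that a consume function split out of the main loop cannot use `continue` itself, so it returns `true` and the caller performs the `continue`.

```java
import java.util.ArrayList;
import java.util.List;

final class ContinueViaReturnSketch {
    // Stand-in for a split-out consume function. A `continue;` here would not compile,
    // because the enclosing loop lives in the caller, so the function signals it by
    // returning true.
    private static boolean op2DoConsume(int row, List<String> log) {
        if (row % 2 != 0) {
            return true; // would have been `continue;` if this logic were inlined
        }
        log.add("emit " + (row * 10));
        return false;
    }

    // Stand-in for the main while-loop of the generated code.
    static List<String> run(int[] rows) {
        List<String> log = new ArrayList<>();
        for (int row : rows) {
            boolean continueForLoop = op2DoConsume(row, log);
            if (continueForLoop) continue;
            // Work after the consume call; skipped for rows the function rejected.
            log.add("done " + row);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(new int[] {1, 2, 3, 4}));
    }
}
```

The cost of this pattern is that every operator emitting `continue` has to know whether it sits inside a split-out function, which is the coupling the last comment hopes to remove.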