[SPARK-9192][SQL] add initialization phase for nondeterministic expression #7535
cloud-fan wants to merge 6 commits into apache:master from
Conversation
Test build #37828 has finished for PR 7535 at commit
Test build #37829 has finished for PR 7535 at commit
Test build #37841 has finished for PR 7535 at commit
I discussed with @marmbrus on this. Here's the suggestion:
`case operator: LogicalPlan` catches all cases, so why not just use a normal function here instead of a pattern match?
Test build #38087 has finished for PR 7535 at commit
Test build #1158 has finished for PR 7535 at commit
An unrelated but small fix: the check for multiple matches should use `length > 1` instead of `nonEmpty`.
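The `length > 1` vs `nonEmpty` point above can be illustrated with a tiny, hypothetical helper (not Catalyst code):

```scala
// Illustrative only, not Spark code: `nonEmpty` is already true for a
// single element, so it cannot express "more than one match";
// `length > 1` states the intended check directly.
object MultiplicityCheck {
  def hasMultiple[A](xs: Seq[A]): Boolean = xs.length > 1
}
```

With `nonEmpty`, a list of exactly one match would wrongly be treated as "multiple".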
Test build #38091 has finished for PR 7535 at commit
Test build #38176 has finished for PR 7535 at commit
Test build #38194 has finished for PR 7535 at commit
Force-pushed from 62c28a7 to 6c6f332.
Test build #38438 has finished for PR 7535 at commit
Test build #38440 has finished for PR 7535 at commit
Test build #38441 has finished for PR 7535 at commit
I'm going to merge this one first. It would be great to get a holistic review, before the 1.5 release, of all the random-related optimizer/analyzer changes.
…oin Conditions
## What changes were proposed in this pull request?
```
sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b having r > 0.5").show()
```
We will get the following error:
```
Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
```
Filters can be pushed down into join conditions by the optimizer rule `PushPredicateThroughJoin`. However, the Analyzer [blocks users from adding non-deterministic conditions](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L386-L395) (for details, see PR apache#7535).
We should not push down non-deterministic conditions; otherwise, we would need to explicitly initialize the non-deterministic expressions. This PR simply blocks such pushdown.
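As a hedged sketch of the idea behind the fix (illustrative names, not the real Catalyst classes): a predicate-pushdown rule should partition filter predicates by determinism and only move the deterministic ones into the join condition.

```scala
// Simplified, hypothetical model of Catalyst expressions; the real rule
// is PushPredicateThroughJoin in org.apache.spark.sql.catalyst.optimizer.
sealed trait Expr { def deterministic: Boolean }

case class ColGreaterThan(col: String, v: Double) extends Expr {
  val deterministic = true
}

case class RandGreaterThan(threshold: Double) extends Expr {
  // rand() yields a different value per evaluation, so pushing it into a
  // join condition would change query semantics (and, without an explicit
  // initialization phase, can even NPE as in the stack trace above).
  val deterministic = false
}

object Pushdown {
  // Returns (predicates safe to push into the join,
  //          predicates that must stay above the join).
  def splitForPushdown(predicates: Seq[Expr]): (Seq[Expr], Seq[Expr]) =
    predicates.partition(_.deterministic)
}
```

In this sketch, `rand(0) > 0.5` from the failing query would land in the second group and stay above the join instead of being pushed into its condition.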
### How was this patch tested?
Added a test case.
Author: Xiao Li <gatorsmile@gmail.com>
Closes apache#17585 from gatorsmile/joinRandCondition.
Currently, nondeterministic expressions are broken without an explicit initialization phase.

Let me take `MonotonicallyIncreasingID` as an example. This expression needs a mutable state to remember how many times it has been evaluated, so we use `@transient var count: Long` there. By being transient, `count` will be reset to 0, and only to 0, when we serialize and deserialize it, as deserializing a transient variable resets it to its default value. There is no way to use another initial value for `count` until we add an explicit initialization phase.

Another use case is local execution for `LocalRelation`: there is no serialize/deserialize phase there, and thus we can't reset mutable states for it.
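A minimal sketch of the initialization-phase idea (simplified, not the actual Catalyst `Nondeterministic` trait): mutable state is (re)set by an explicit `initialize` call that the execution layer makes once per partition, instead of relying on deserialization resetting a `@transient var` to its default value.

```scala
// Hypothetical, simplified sketch of an initialization phase for stateful
// expressions; names and signatures are illustrative, not Spark's API.
trait NondeterministicSketch {
  private var initialized = false

  // Subclasses (re)set their mutable state here.
  protected def initializeInternal(partitionIndex: Int): Unit

  // The execution layer calls this once per partition before any eval().
  final def initialize(partitionIndex: Int): Unit = {
    initializeInternal(partitionIndex)
    initialized = true
  }

  final def eval(): Long = {
    require(initialized, "nondeterministic expression was not initialized")
    evalInternal()
  }

  protected def evalInternal(): Long
}

// Simplified stand-in for MonotonicallyIncreasingID: the counter is set in
// initializeInternal, so it works for local execution too, where no
// serialize/deserialize round-trip would reset a @transient var.
class MonotonicallyIncreasingIDSketch extends NondeterministicSketch {
  private var count: Long = _
  private var partitionMask: Long = _

  override protected def initializeInternal(partitionIndex: Int): Unit = {
    count = 0L
    partitionMask = partitionIndex.toLong << 33
  }

  override protected def evalInternal(): Long = {
    val id = partitionMask + count
    count += 1
    id
  }
}
```

Calling `eval()` before `initialize()` fails fast here, which is the sketch's analogue of the NullPointerException seen when a non-deterministic expression reaches execution uninitialized.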