[SPARK-24020][SQL] Sort-merge join inner range optimization #21109

zecevicp · 2018-04-19T21:16:19Z

JIRA description

The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND <function calculation...>

However, during a sort-merge join with this range condition specified, Spark will first cross-join all the rows with the same X value and only then will it try to apply the range condition and any function calculations. This happens because, inside the generated sort-merge join (SMJ) code, these extra conditions are put in the same block with the function being calculated and there is no way to evaluate these conditions before reading all the rows to be checked into memory (into an ExternalAppendOnlyUnsafeRowArray). If the two tables have a large number of rows per X, this can result in a huge number of calculations and a huge number of rows in executor memory, which can be unfeasible.

We therefore propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window through the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join; in literature it is sometimes called an “epsilon join”. We call it a "sort-merge inner range join".

This design uses much less memory (not all rows with the same values of X need to be loaded into memory at once) and requires a much lower number of comparisons (the validity of this statement depends on the actual data and conditions used).

The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added.

To limit the impact of this change we also propose adding a new parameter (tentatively named spark.sql.join.smj.useInnerRangeOptimization) which could be used to switch off the optimization entirely.

What changes were proposed in this pull request?

The main changes are made to these classes:

ExtractEquiJoinKeys – a pattern that needs to be extended to be able to recognize the case where a simple range condition with lower and upper limits is used on a secondary column (a column not included in the equi-join condition). The pattern also needs to extract the information later required for code generation etc.

ExternalAppendOnlyUnsafeRowArray – changed so that it can function as a queue too. The rows need to be removed and added to/from the structure as the left key (X) changes, or the left secondary value (Y) changes, so the structure needs to be a queue. The class is no longer "append only", but I didn't want to change too many things.

JoinSelection – a strategy that uses ExtractEquiJoinKeys and needs to be aware of the extracted range conditions

SortMergeJoinExec – the main implementation of the optimization. Needs to support two code paths:
• when whole-stage code generation is turned off (method doExecute, which uses sortMergeJoinInnerRangeScanner)
• when whole-stage code generation is turned on (methods doProduce and genScanner)
SortMergeJoinInnerRangeScanner – implements the SMJ with inner-range optimization in the case when whole-stage codegen is turned off

How was this patch tested?

Unit tests (InnerRangeSuite.scala, JoinSuite.scala, ExternalAppendOnlyUnsafeRowArraySuite)

Performance tests in JoinBenchmark. The tests show 8x improvement over non-optimized code. Although, it should be noted that the results depend on the exact range conditions and the calculations performed on each matched row.
In our case, we were not able to cross-match two rather large datasets (1.2 billion rows x 800 million rows) without this optimization. With the optimization, the cross-match finishes in less than 2 minutes.

gatorsmile · 2018-04-19T22:02:35Z

ok to test

SparkQA · 2018-04-20T03:08:05Z

Test build #89592 has finished for PR 21109 at commit 4c6a726.

This patch fails from timeout after a configured wait of `300m`.
This patch merges cleanly.
This patch adds no public classes.

zecevicp · 2018-04-20T15:39:51Z

Can you please restart the build? It took 5 hours for some reason, but the "inner range join" tests completed successfully.

tedyu · 2018-04-20T16:08:28Z

retest this please

SparkQA · 2018-04-20T20:03:34Z

Test build #89641 has finished for PR 21109 at commit 4c6a726.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-04-25T05:09:36Z

Can you put performance numbers w/ this pr? Also, you'd be better to add benchmark code in JoinBenchmark.

zecevicp · 2018-04-27T17:13:57Z

I added benchmark code in JoinBenchmark. The tests show 8x improvement over non-optimized code. Although, it should be noted that the results depend on the exact range conditions and the calculations performed on each matched row.
In our case, we were not able to cross-match two rather large datasets (1.2 billion rows x 800 million rows) without this optimization. With the optimization, the cross-match finishes in less than 2 minutes.

SparkQA · 2018-04-27T20:41:47Z

Test build #89934 has finished for PR 21109 at commit fbec452.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-04-28T03:06:48Z

Thanks for working on this.

Based on the description on JIRA, I think the main cause of the bad performance is re-calculation an expensive function on matches rows. With the added benchmark, I adjust the order of conditions so the expensive UDF is put at the end of predicate. Below is the results. The first one is original benchmark. The second is the one with UDF at the end of predicate.

Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.9.87-linuxkit-aufs
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
sort merge inner range join:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
sort merge inner range join wholestage off      6913 / 6964          0.0     1080112.4       1.0X
sort merge inner range join wholestage on      2094 / 2224          0.0      327217.4       3.3X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Linux 4.9.87-linuxkit-aufs
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
sort merge inner range join:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
sort merge inner range join wholestage off       675 /  704          0.0      105493.9       1.0X
sort merge inner range join wholestage on       374 /  398          0.0       58359.6       1.8X

It can be easily improved because short-circuit evaluation of predicate. This can be applied to also other conditions other than just range comparison. So I'm thinking if we just need a way to give a hint to Spark to adjust the order of expression for an expensive one like UDF.

viirya · 2018-04-28T02:56:09Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala

+     *AMD EPYC 7401 24-Core Processor
+     *sort merge join:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+     *---------------------------------------------------------------------------------------------
+     *sort merge join wholestage off            12723 / 12800          0.0     1988008.4       1.0X


Why wholestage-off case doesn't get much improvement as wholestage-on case?

zecevicp · 2018-04-28T19:36:23Z

Hey Liang-Chi, thanks for looking into this.
Yes, the problem can be circumvented by changing the join condition as you describe, but only in the benchmark case, because my "expensive function" was a bit misleading.
The problem is not in the function itself, but in the number of rows that are checked for each pair of matching equi-join keys.
I changed the benchmark test case now so to better demonstrate this. I completely removed the expensive function and I'm only doing a count on the matched rows. The results are the following.
Without the optimization:

AMD EPYC 7401 24-Core Processor
sort merge join:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------
sort merge join wholestage off            30956 / 31374          0.0       75575.5       1.0X
sort merge join wholestage on             10864 / 11043          0.0       26523.6       2.8X

With the optimization:

AMD EPYC 7401 24-Core Processor
sort merge join:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------
sort merge join wholestage off            30734 / 31135          0.0       75035.2       1.0X
sort merge join wholestage on                959 / 1040          0.4        2341.3      32.0X

This shows a 10x improvement over the non-optimized case (as I already said, this depends on the range condition, number of matched rows, the calculated function, etc.).

Regarding your second question as to why is the "wholestage off" case in the optimized version so slow, that is because the optimization is turned off when the wholestage code generation is turned off.
And that is simply because it was too hard to debug it and I figured the wholestage generation is on by default, so I'm guessing (and hoping) that it would not be too hard of a requirement to have to turn wholestage codegen on if you want to use this optimization.

mgaido91 · 2018-04-28T22:56:07Z

@zecevicp wholestage codegen now is turned on by default only if we have few columns (less than 100). This can be false in many real use-cases. Is there any specific reason why this optimization cannot be applied to the non-wholestage codegen case? If not, I think it is worth to consider also this case.

zecevicp · 2018-04-28T23:04:02Z

Hi, Gaido, thanks for the comment. As I said, it was difficult to debug it and I didn't have time. We might open a different ticket for the non-wholestage codegen case, once this is merged?

SparkQA · 2018-04-28T23:19:27Z

Test build #89961 has finished for PR 21109 at commit e6e6628.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-04-29T06:41:20Z

Why do you say that it is difficult to debug? What was difficult?

zecevicp · 2018-04-30T22:45:52Z

The code path with the optimization but without wholegen code generation gives wrong results. And I haven't been able to figure out where is the bug. I spent several hours at this again today. I've been using this code as an example:

spark.conf.set("spark.sql.codegen.wholeStage", value = false)
spark.conf.set("spark.sql.shuffle.partitions", "1")
val df1 = sc.parallelize(1 to 2).cartesian(sc.parallelize(1 to 10)).toDF("col1a", "col1b")
val df2 = sc.parallelize(1 to 2).cartesian(sc.parallelize(1 to 10)).toDF("col2a", "col2b")
val res = df1.join(df2, 'col1a === 'col2a and ('col1b < 'col2b + 3) and ('col1b > 'col2b - 3))
res.count

With wholeStage off this gives 65 as a result and 88 if wholeStage is on (with the optimization turned on). Without the optimization it is always 88, which is correct.
So, if someone can figure out where the bug is, great. If not, maybe it's best to just turn the optimization off for the non-wholegen case (for now).

maropu · 2018-05-02T08:57:14Z

Since I feel this is a limited case, I'm not certainly sure this optimization needs to be handled in smj. For spatial or temporal use cases, is it not enough to add dummy join keys to split tasks into pieces for workaround?

'col1a === 'col2a and 'col1dummyKey === 'col2dummyKey and ('col1b < 'col2b + 3) and ('col1b > 'col2b - 3)

Btw, can you fix this issue by more simpler code change? (I'm not sure this big change pays the performance gain...)

maropu · 2018-05-02T09:06:45Z

@viirya I think predicate reordering based on cost estimation and others is an interesting topic in optimizer (other databases, e.g., postgresql, apply the optimization). But, IMO the topic is not directly related to this pr, it'd be better to file another jira to keep the discussion (and the benchmark result you reported)? I couldn't find a jira entry for the topic.

zecevicp · 2018-05-02T16:21:19Z

@maropu Regarding the first point, whether this belongs to SMJ or not, take a look at this paper, page 11. They describe several special cases of SMJ, one of them being "epsilon-join", which is exactly what is implemented here.

Regarding adding dummy join keys (some kind of binning) to do spatial or temporal joining, that wouldn't help in our case and I believe in many other cases because you would need to bin the data by the second column in a fixed manner. And there you would have the problems of how to join data beyond the borders of those bins (which would probably require additional data duplication). These bins would probably be calculated on the fly, so additional computing is required, and most probably an additional sorting step would be needed (additional pass through the data).

Regarding making it a simpler change, although the number of changed lines is somewhat large, the change is well-confined to a specific code path and the minimum of existing code is changed. Most of the changes are new additions (new code for code generation; the InMemoryUnsafeRowQueue class, which is used only from inside of the new code; SortMergeJoinInnerRangeScanner class, which should be used when whole-stage codegen is turned off, but has a difficult-to-debug bug now and is turned off).

The only thing that could be removed currently is the SortMergeJoinInnerRangeScanner class, which is 230 lines long. It could be moved to a separate JIRA PR (SMJ inner-range optimization with whole-stage codegen turned off). I tried to find the bug there but I failed, so separating this makes sense. I'll give it a few more tries, and if I fail again, I will move this part to a different JIRA PR.

SparkQA · 2018-05-10T20:05:07Z

Test build #90475 has finished for PR 21109 at commit 535d0d6.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

zecevicp · 2018-05-10T20:09:11Z

I managed to fix the code path that is executed when the wholestage codegen is turned off. Now both code paths give the same results and have the optimization implemented. I also changed the tests in the InnerJoinSuite class so that they are run with both wholestage turned on off and on (which wasn't the case so far).
I updated the benchmark results in JoinBenchmark. The results are now the following.
Without inner range optimization:

sort merge join:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------
sort merge join wholestage off            25226 / 25244          0.0       61585.9       1.0X
sort merge join wholestage on              8581 / 8983          0.0       20948.6       2.9X

With inner range optimization:

sort merge join:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------
sort merge join wholestage off              1194 / 1212          0.3        2915.2       1.0X
sort merge join wholestage on                814 /  867          0.5        1988.4       1.5X

So, there is 10x improvement for wholestage ON case and 21x improvement for wholestage OFF case.

I believe this is now ready to be merged, which would greatly help us in our projects.

SparkQA · 2018-05-11T00:03:13Z

Test build #90477 has finished for PR 21109 at commit 6dc9000.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91

I just left some comments after a quick pass, but I am a bit concerned about the amount of changes we need for this patch. May you please check if you can reduce the amount of code changes?

mgaido91 · 2018-05-15T09:59:00Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala


  def unapply(plan: LogicalPlan): Option[ReturnType] = plan match {
    case join @ Join(left, right, joinType, condition) =>
-      logDebug(s"Considering join on: $condition")


why removing this debug?

I added it in the first place. Tried to remove unnecessary code

I don't think so, this is the diff with the master, it was added in 2014....

OK, you could be right. I'll put it back.

mgaido91 · 2018-05-15T10:35:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

        val (leftKeys, rightKeys) = joinKeys.unzip
-        logDebug(s"leftKeys:$leftKeys | rightKeys:$rightKeys")
-        Some((joinType, leftKeys, rightKeys, otherPredicates.reduceOption(And), left, right))
+


nit: unnecessary empty line

mgaido91 · 2018-05-15T10:36:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

+        // and which can be used for secondary sort optimizations.
+        val rangePreds: mutable.Set[Expression] = mutable.Set.empty
+        var rangeConditions: Seq[BinaryComparison] =
+          if (SQLConf.get.useSmjInnerRangeOptimization) { // && SQLConf.get.wholeStageEnabled) {


remove the comment please

mgaido91 · 2018-05-15T10:41:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

+        // of the two tables being joined,
+        // which are not used in the equijoin expressions,
+        // and which can be used for secondary sort optimizations.
+        val rangePreds: mutable.Set[Expression] = mutable.Set.empty


why do we need this? Can't we just use rangeConditions?

We need to separate range conditions that are relevant for the optimizations from other join conditions. rangePreds is used later to remove range predicates from "other predicates" (if any).

I see that you are using it later but can't you use rangeConditions instead? they seem duplicates...they contain the same data IIUC

rangeConditions contain "vice-versa" conditions in case left and right plans need to be switched.

I see... maybe we can make this more clear in the comments then, what do you think?

Yes, will do

mgaido91 · 2018-05-15T10:41:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

  }
+
+  private def isValidRangeCondition(l : Expression, r : Expression,
+                                    left : LogicalPlan, right : LogicalPlan,


nit: bad indent

mgaido91 · 2018-05-15T10:45:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

+      val equiset = joinKeys.filter{ case (ljk : Expression, rjk : Expression) =>
+        ljk.references.toSeq.contains(lattrs(0)) && rjk.references.toSeq.contains(rattrs(0)) }
+      if (equiset.isEmpty) {
+        "asis"


can we return something more meaningful than this string? Maybe a Option(Bool) in enough which tells whether to reverse or not.

OK, makes sense.

mgaido91 · 2018-05-15T10:48:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

+      "none"
+    }
+    else if (canEvaluate(l, left) && canEvaluate(r, right)) {
+      val equiset = joinKeys.filter{ case (ljk : Expression, rjk : Expression) =>


what about using exists?

mgaido91 · 2018-05-15T10:59:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

+      val rightRngTmpKeyVars = createJoinKey(ctx, rightTmpRow,
+        rightUpperKeys.slice(0, 1), right.output)
+      val rightRngTmpKeyVarsDecl = rightRngTmpKeyVars.map(_.code).mkString("\n")
+      rightRngTmpKeyVars.foreach(_.code = "")


why are we doing this?

If you mean why are we clearing the code variable, I found the same thing in WholestageCodegenExec:263 where it's claimed that that prevents the variable to be evaluated twice. The code works, so I didn't investigate further.

mgaido91 · 2018-05-15T11:01:23Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

+        rightLowerKeys.size == 0 || rightUpperKeys.size == 0) {
+      ctx.addNewFunction("dequeueUntilUpperConditionHolds",
+        "private void dequeueUntilUpperConditionHolds() { }",
+        inlineToOuterClass = true)


do we really need this? can't we just use the returned name?

Can you elaborate please? I'm not sure what you mean.

can we avoid to use inlineToOuterClass = true? I think we can avoid doing that.

Ah, OK. I'll investigate.

mgaido91 · 2018-05-15T11:01:44Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

+ * @param spillThreshold Threshold for number of rows to be spilled by internal buffer
+ */
+private[joins] class SortMergeJoinInnerRangeScanner(
+                                           streamedKeyGenerator: Projection,


nit: indent

kiszk · 2018-05-15T12:18:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

+                }
+              case _ => None
+            }
+          }


nit: } else {?

zecevicp · 2018-05-15T13:04:27Z

@mgaido91 Regarding the amount of code, maybe you can suggest how to reduce it? Because I don't see a way...
I think the code is well contained (mostly in separate new classes) and is not contaminating the existing codebase. There is a simple switch (parameter) that turns off the whole thing.
Some code also went into the use cases which contribute to the amount of changes.
I believe that the potentially significant speedup justifies the line count.

mgaido91 · 2018-05-15T13:28:53Z

@zecevicp for instance do we really need InMemoryUnsafeRowQueue? why ExternalAppendOnlyUnsafeRowArray is not ok?

zecevicp · 2018-05-15T13:54:17Z

Well, that is the essence of the contribution: to have a moving window over the data, instead of a fixed block (per equi-join match). To implement a moving window you need something like a queue.

zecevicp · 2018-05-15T14:20:10Z

Btw, thank you @mgaido91 and @kiszk for the comments.

SparkQA · 2018-05-15T16:25:57Z

Test build #90642 has finished for PR 21109 at commit 48b1c0e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Just some tiny code comments. I don't feel qualified to merge this, but would instead get a nod from at least one of people like @marmbrus @cloud-fan @gatorsmile or others who know the SQL planning code.

srowen · 2018-06-07T12:48:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

+
+        // Only using secondary join optimization when both lower and upper conditions
+        // are specified (e.g. t1.a < t2.b + x and t1.a > t2.b - x)
+        if(rangeConditions.size != 2 ||


Nit: space after "if" here and elsewhere

srowen · 2018-06-07T12:53:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

+        // are specified (e.g. t1.a < t2.b + x and t1.a > t2.b - x)
+        if(rangeConditions.size != 2 ||
+            // Looking for one < and one > comparison:
+            rangeConditions.filter(x => x.isInstanceOf[LessThan] ||


Instead of checking .size == 0, something like rangeConditions.forall(... not instance of either ...)?

srowen · 2018-06-07T12:54:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

+        // which are not used in the equijoin expressions,
+        // and which can be used for secondary sort optimizations.
+        // rangePreds will contain the original expressions to be filtered out later.
+        val rangePreds: mutable.Set[Expression] = mutable.Set.empty


I tend to prefer val rangePreds = mutable.Set.empty[Expression] as it's shorter, but that's just taste

I think I'll adopt your version

srowen · 2018-06-07T12:56:47Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

+    if(lattrs.size != 1 || rattrs.size != 1) {
+      None
+    }
+    else if (canEvaluate(l, left) && canEvaluate(r, right)) {


Nit: pull else onto previous line

srowen · 2018-06-07T12:57:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

+   */
+  private def checkRangeConditions(l : Expression, r : Expression,
+      left : LogicalPlan, right : LogicalPlan,
+      joinKeys : Seq[(Expression, Expression)]) = {


For clarity add a return type to this method. Option[Boolean]?

srowen · 2018-06-07T12:59:15Z

sql/core/src/test/scala/org/apache/spark/sql/execution/joins/InnerJoinSuite.scala

-        leftPlan: SparkPlan,
-        rightPlan: SparkPlan,
-        side: BuildSide) = {
+                               leftKeys: Seq[Expression],


(Undo this whitespace change and the next one)

srowen · 2018-06-07T13:00:56Z

sql/core/src/test/scala/org/apache/spark/sql/execution/joins/InnerJoinSuite.scala

+
+    // Disabling these because the code would never follow this path in case of a inner range join
+    if (!expectRangeJoin) {
+      var counter = 1


If you want to avoid a var, just configOptions.zipWithIndex.foreach { case ((config, confValue), counter) =>. Just a tiny bit more idiomatic.

OK, will do that.

srowen · 2018-06-07T13:01:43Z

sql/core/src/main/scala/org/apache/spark/sql/execution/InMemoryUnsafeRowQueue.scala

+ *   excessive disk writes. This may lead to a performance regression compared to the normal case
+ *   of using an [[ArrayBuffer]] or [[Array]].
+ */
+private[sql] class InMemoryUnsafeRowQueue(


No way to avoid making a custom queue implementation here? is it messier without such a structure?

A queue is needed here because it's a moving window instead of a fixed block of rows. Maybe I missed an existing class that could do this easily so I'll take another look. But, I believe any alternative would indeed be messier.

srowen · 2018-06-07T13:21:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

      .booleanConf
      .createWithDefault(true)

+  val USE_SMJ_INNER_RANGE_OPTIMIZATION =


Yes, at best make this internal. Are there conditions where you would not want to apply this? is it just a safety valve?

It's just a safety valve. In case there are some queries that I don't foresee now where this could get in the way.

…enchmark.

…e + tests

…InMemoryUnsafeRowQueue class.

SparkQA · 2018-09-07T22:31:34Z

Test build #95807 has finished for PR 21109 at commit 0a5c8de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-12-10T09:39:00Z

Test build #99907 has finished for PR 21109 at commit 07ff4d3.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2019-09-19T15:20:35Z

Can one of the admins verify this patch?

github-actions · 2020-01-12T00:07:37Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

viirya reviewed Apr 28, 2018

View reviewed changes

mgaido91 reviewed May 15, 2018

View reviewed changes

kiszk reviewed May 15, 2018

View reviewed changes

srowen reviewed Jun 7, 2018

View reviewed changes

Petar Zecevic and others added 21 commits September 7, 2018 11:28

SMJ inner range optimization benchmarks

f5b9ca8

Removing "expensive function" from the SMJ inner range optimization b…

7457ab3

…enchmark.

SMJ inner range optimization with wholeStage codegen turned off - cod…

c47c8cd

…e + tests

Unit test fix. Benchmark results update.

3fbedfc

Scalastyle for comments

b8e1ee4

Code changes based on review comments.

2710957

Code review changes

52f2b70

Removing exception when numRowsInMemoryBufferThreshold is reached in …

169bd70

…InMemoryUnsafeRowQueue class.

Scala style

89169de

Unneeded import

eeaf048

A dot

75ce55d

A dot

77dd2a8

A dot

eac81b4

A dot

1abde55

A dot

dfb4c0f

Fixes for some rebase issues.

6d4c031

Merge with upstream

6d9cd12

Merge with upstream

7742c10

SMJ inner range spill over implementation and tests

3a717ee

External unsafe row dequeue test extension

64437e5

A dot

0a5c8de

zecevicp force-pushed the branch-pz-smj branch from af59b8a to 0a5c8de Compare September 7, 2018 18:30

Merge branch 'master' into branch-pz-smj

07ff4d3

dongjoon-hyun added the SQL label Jun 14, 2019

github-actions bot added the Stale label Jan 12, 2020

github-actions bot closed this Jan 13, 2020

[SPARK-24020][SQL] Sort-merge join inner range optimization #21109

[SPARK-24020][SQL] Sort-merge join inner range optimization #21109

Uh oh!

Conversation

zecevicp commented Apr 19, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

JIRA description

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

gatorsmile commented Apr 19, 2018

Uh oh!

SparkQA commented Apr 20, 2018

Uh oh!

zecevicp commented Apr 20, 2018

Uh oh!

tedyu commented Apr 20, 2018

Uh oh!

SparkQA commented Apr 20, 2018

Uh oh!

maropu commented Apr 25, 2018

Uh oh!

zecevicp commented Apr 27, 2018

Uh oh!

SparkQA commented Apr 27, 2018

Uh oh!

viirya commented Apr 28, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zecevicp commented Apr 28, 2018

Uh oh!

mgaido91 commented Apr 28, 2018

Uh oh!

zecevicp commented Apr 28, 2018

Uh oh!

SparkQA commented Apr 28, 2018

Uh oh!

mgaido91 commented Apr 29, 2018

Uh oh!

zecevicp commented Apr 30, 2018

Uh oh!

maropu commented May 2, 2018

Uh oh!

maropu commented May 2, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zecevicp commented May 2, 2018

Uh oh!

SparkQA commented May 10, 2018

Uh oh!

zecevicp commented May 10, 2018

Uh oh!

SparkQA commented May 11, 2018

Uh oh!

mgaido91 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

zecevicp commented Apr 19, 2018 •

edited

Loading

viirya commented Apr 28, 2018 •

edited

Loading

maropu commented May 2, 2018 •

edited

Loading