[SPARK-3084] [SQL] Collect broadcasted tables in parallel in joins

mateiz · marmbrus · commit 55e9dd637bde · 2014-08-18T10:06:07.000-07:00
BroadcastHashJoin has a broadcastFuture variable that tries to collect the broadcasted table in a separate thread, but this doesn't help because it's a lazy val that only gets initialized when you attempt to build the RDD. Thus queries that broadcast multiple tables would collect and broadcast them sequentially. I changed this to a val to let it start collecting right when the operator is created.

Author: Matei Zaharia <matei@databricks.com>

Closes apache#1990 from mateiz/spark-3084 and squashes the following commits:

f468766 [Matei Zaharia] [SPARK-3084] Collect broadcasted tables in parallel in joins

(cherry picked from commit 6a13dca)
Signed-off-by: Michael Armbrust <michael@databricks.com>

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala
@@ -424,7 +424,7 @@ case class BroadcastHashJoin(
     UnspecifiedDistribution :: UnspecifiedDistribution :: Nil
 
   @transient
-  lazy val broadcastFuture = future {
+  val broadcastFuture = future {
     sparkContext.broadcast(buildPlan.executeCollect())
   }
 

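
The effect described in the commit message can be reproduced outside Spark. The following is a minimal sketch, not Spark code: `slowCollect`, `LazyJoin`, and `EagerJoin` are hypothetical stand-ins for `sparkContext.broadcast(buildPlan.executeCollect())` and the join operator, and it uses `Future { ... }` rather than the lowercase `future { ... }` helper that appears in the diff. It assumes each operator's future is first touched only when its result is needed, which is what makes the `lazy val` version run the two collections one after the other.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object EagerVsLazyFuture {
  // Hypothetical stand-in for the slow collect-and-broadcast step.
  def slowCollect(name: String): String = blocking {
    Thread.sleep(500)
    name
  }

  // Before the change: the future is only created on first access.
  class LazyJoin(name: String) {
    lazy val broadcastFuture: Future[String] = Future(slowCollect(name))
  }

  // After the change: the future starts as soon as the object is constructed.
  class EagerJoin(name: String) {
    val broadcastFuture: Future[String] = Future(slowCollect(name))
  }

  // Runs the body and returns the wall-clock time it took, in milliseconds.
  def elapsedMillis[A](body: => A): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  // Two lazy operators: the second collection starts only after the first finishes.
  def runLazy(): Long = elapsedMillis {
    val a = new LazyJoin("a")
    val b = new LazyJoin("b")
    Await.result(a.broadcastFuture, 10.seconds)
    Await.result(b.broadcastFuture, 10.seconds)
  }

  // Two eager operators: both collections are already running when we wait.
  def runEager(): Long = elapsedMillis {
    val a = new EagerJoin("a")
    val b = new EagerJoin("b")
    Await.result(a.broadcastFuture, 10.seconds)
    Await.result(b.broadcastFuture, 10.seconds)
  }

  def main(args: Array[String]): Unit = {
    println(s"lazy val: ${runLazy()} ms")
    println(s"val:      ${runEager()} ms")
  }
}
```

On a typical machine the `lazy val` run should take roughly twice as long as the `val` run, since the two sleeps no longer overlap. Exact timings depend on the thread pool and the scheduler, so treat the numbers as indicative only.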