@mateiz
Contributor

mateiz commented Aug 17, 2014

BroadcastHashJoin has a broadcastFuture variable that tries to collect
the broadcasted table in a separate thread, but this doesn't help
because it's a lazy val that only gets initialized when you attempt to
build the RDD. Thus queries that broadcast multiple tables would collect
and broadcast them sequentially. I changed this to a val to let it start
collecting right when the operator is created.
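
To illustrate the difference described above, here is a minimal standalone sketch. It is not the actual BroadcastHashJoin code: `LazyJoin`, `EagerJoin`, `BroadcastTimingDemo`, and the sleep that stands in for collecting a table are all hypothetical names and stand-ins chosen for this example. It only shows that a Future held in a `lazy val` does not start until first access, while one held in a `val` starts at construction.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for a join operator; NOT Spark's BroadcastHashJoin.
class LazyJoin(name: String, collectMillis: Long) {
  // lazy val: the Future is only created (and work only starts) on first access.
  lazy val broadcastFuture: Future[String] = Future {
    blocking { Thread.sleep(collectMillis) } // simulates collecting a table
    name
  }
  def execute(): String = Await.result(broadcastFuture, 10.seconds)
}

// Same hypothetical operator, but with a strict val.
class EagerJoin(name: String, collectMillis: Long) {
  // val: the Future starts running as soon as the operator is constructed.
  val broadcastFuture: Future[String] = Future {
    blocking { Thread.sleep(collectMillis) }
    name
  }
  def execute(): String = Await.result(broadcastFuture, 10.seconds)
}

object BroadcastTimingDemo {
  // Runs the body once and returns the elapsed wall-clock time in milliseconds.
  def timeMillis(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  def main(args: Array[String]): Unit = {
    val lazyMs = timeMillis {
      val a = new LazyJoin("t1", 300)
      val b = new LazyJoin("t2", 300)
      a.execute() // t1's collection starts here and blocks
      b.execute() // t2's collection starts only after t1 is done
    }
    val eagerMs = timeMillis {
      val a = new EagerJoin("t1", 300) // t1's collection starts here
      val b = new EagerJoin("t2", 300) // t2's collection starts here, concurrently
      a.execute()
      b.execute()
    }
    println(s"lazy val: $lazyMs ms, val: $eagerMs ms")
  }
}
```

With the `lazy val` version the two simulated collections run back to back, so the elapsed time is roughly the sum of both; with the `val` version they overlap, so it should be closer to the longer of the two.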

@SparkQA

SparkQA commented Aug 17, 2014

QA tests have started for PR 1990 at commit f468766.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 17, 2014

QA tests have finished for PR 1990 at commit f468766.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CompressedSerializer(FramedSerializer):

@mateiz
Contributor Author

mateiz commented Aug 17, 2014

test this please

@mateiz
Contributor Author

mateiz commented Aug 17, 2014

Jenkins, retest this please

@marmbrus
Contributor

All of the unit test failures look like refused connections to the Thrift server (a known flaky test suite). I'm going to go ahead and merge this into master and 1.1. Thanks Matei!

asfgit pushed a commit that referenced this pull request Aug 18, 2014
BroadcastHashJoin has a broadcastFuture variable that tries to collect
the broadcasted table in a separate thread, but this doesn't help
because it's a lazy val that only gets initialized when you attempt to
build the RDD. Thus queries that broadcast multiple tables would collect
and broadcast them sequentially. I changed this to a val to let it start
collecting right when the operator is created.

Author: Matei Zaharia

Closes #1990 from mateiz/spark-3084 and squashes the following commits:

f468766 [Matei Zaharia] [SPARK-3084] Collect broadcasted tables in parallel in joins

(cherry picked from commit 6a13dca)
Signed-off-by: Michael Armbrust
@asfgit asfgit closed this in 6a13dca Aug 18, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014