
Conversation

@rishabhbhardwaj
Contributor

What changes were proposed in this pull request?

Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter so that the merging of the per-partition bloom filters is parallelized across executors instead of being done entirely on the driver.
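
For readers who have not compared the two APIs, here is a minimal sketch of what the change amounts to. It is illustrative only, not the exact Spark source (the real code operates on the DataFrame's internal rows); `buildFilter` is a hypothetical helper.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.util.sketch.BloomFilter

// Illustrative only: build one bloom filter per partition, then merge the partial filters.
def buildFilter(longs: RDD[Long], expectedNumItems: Long, fpp: Double): BloomFilter = {
  val zero = BloomFilter.create(expectedNumItems, fpp)

  // Before (RDD.aggregate): every per-partition filter is shipped to the driver,
  // which merges them one by one.
  //   longs.aggregate(zero)(
  //     (filter, value) => { filter.putLong(value); filter },
  //     (f1, f2) => f1.mergeInPlace(f2))

  // After (RDD.treeAggregate): partial filters are first merged on the executors
  // in `depth` levels, so the driver only receives a handful of results.
  longs.treeAggregate(zero)(
    (filter, value) => { filter.putLong(value); filter },
    (f1, f2) => f1.mergeInPlace(f2),
    depth = 2)
}
```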

How was this patch tested?

Existing unit tests pass.

@HyukjinKwon
Member

I think we need some figures here. Would you test before/after and share some figures?

@lovasoa

lovasoa commented Jun 10, 2017

I ran some tests on a cluster with 5 nodes, 5 executors, and 5 threads per executor.
I created a bloom filter with the parameters numElements=150M and fpp=10%, which gives a filter of around 90 MB.
I built the bloom filter on the column o_orderkey of the TPC-H ORDERS table at scale factor 100. The data is stored as a Parquet file on HDFS with 48 partitions.
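
For reference, the filter in this benchmark can be built with a call along these lines (a sketch; the HDFS path and app name below are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bloom-filter-benchmark").getOrCreate()

// Hypothetical HDFS path to the TPC-H ORDERS table (scale factor 100, 48 Parquet partitions).
val orders = spark.read.parquet("hdfs:///tpch/sf100/orders")

// numElements = 150M, fpp = 10% -> a filter of roughly 90 MB.
val filter = orders.stat.bloomFilter("o_orderkey", 150000000L, 0.1)
```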

The cluster

[screenshot]

Results

Using RDD.aggregate

stages

[screenshot]

tasks

[screenshot]

Notice the huge scheduler delay. The executors spend all their time sending their results back to the driver.
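
A rough back-of-the-envelope figure (assuming each of the 48 task results carries the full ~90 MB bit array): 48 × 90 MB ≈ 4.2 GB of serialized filters must reach the driver and be merged there when plain aggregate is used, whereas treeAggregate merges most of those partial filters on the executors before anything is sent back.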

total time: 52 seconds

Using RDD.treeAggregate

stages

[screenshot]

tasks

first stage

[screenshot]

second stage

[screenshot]

total time: 17 seconds

@lovasoa

lovasoa commented Jun 11, 2017

Smaller data

You might be wondering: what about smaller data?
Adding a stage to the computation of course adds some overhead.

I repeated the same experiment as above, but with a scale factor of 1 (and 1.5M elements in the bloom filter).

With RDD.aggregate

[screenshot]

With RDD.treeAggregate

[screenshot]

Conclusion

There is indeed an overhead, but it is quite small. For the creation of a small bloom filter, the old method (with aggregate) takes 0.3 seconds, and the new one (with treeAggregate) 0.6 seconds.

@srowen
Member

srowen left a comment

I'm inclined to use treeAggregate where possible. I think the win for larger data sets is worthwhile.

@HyukjinKwon
Member

HyukjinKwon left a comment

LGTM too, given that both are semantically identical, and for the same reason as above.

@SparkQA

SparkQA commented Jun 13, 2017

Test build #3796 has finished for PR 18263 at commit 61bb509.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 13, 2017

Merged to master

asfgit closed this in 9b2c877 on Jun 13, 2017
dataknocker pushed a commit to dataknocker/spark that referenced this pull request Jun 16, 2017
…ataFrame.stat.bloomFilter

## What changes were proposed in this pull request?
Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter so that the merging of the per-partition bloom filters is parallelized across executors instead of being done entirely on the driver.

## How was this patch tested?
Existing unit tests pass.

Author: Rishabh Bhardwaj <[email protected]>

Closes apache#18263 from rishabhbhardwaj/SPARK-21039.
