
Conversation

@rishabhbhardwaj
Contributor

What changes were proposed in this pull request?

Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter so that the merging of the per-partition bloom filters is parallelized across executors instead of being done entirely on the driver.
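
For readers who have not compared the two APIs, here is a minimal sketch of what the change amounts to. It is illustrative only, not the exact Spark source (the real code operates on the DataFrame's internal rows); `buildFilter` is a hypothetical helper.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.util.sketch.BloomFilter

// Illustrative only: build one bloom filter per partition, then merge the partial filters.
def buildFilter(longs: RDD[Long], expectedNumItems: Long, fpp: Double): BloomFilter = {
  val zero = BloomFilter.create(expectedNumItems, fpp)

  // Before (RDD.aggregate): every per-partition filter is shipped to the driver,
  // which merges them one by one.
  //   longs.aggregate(zero)(
  //     (filter, value) => { filter.putLong(value); filter },
  //     (f1, f2) => f1.mergeInPlace(f2))

  // After (RDD.treeAggregate): partial filters are first merged on the executors
  // in `depth` levels, so the driver only receives a handful of results.
  longs.treeAggregate(zero)(
    (filter, value) => { filter.putLong(value); filter },
    (f1, f2) => f1.mergeInPlace(f2),
    depth = 2)
}
```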

How was this patch tested?

Existing unit tests pass.

@HyukjinKwon
Member

I think we need some figures here. Would you test before/after and share some figures?

@lovasoa

lovasoa commented Jun 10, 2017

I ran some tests on a cluster with 5 nodes, 5 executors, and 5 threads per executor.
I created a bloom filter with the parameters numElements=150M and fpp=10%, which gives a filter of around 90 MB.
I built the bloom filter on the column o_orderkey of the TPC-H ORDERS table at scale factor 100. The data is stored as a Parquet file on HDFS with 48 partitions.
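
For reference, the filter in this benchmark can be built with a call along these lines (a sketch; the HDFS path and app name below are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bloom-filter-benchmark").getOrCreate()

// Hypothetical HDFS path to the TPC-H ORDERS table (scale factor 100, 48 Parquet partitions).
val orders = spark.read.parquet("hdfs:///tpch/sf100/orders")

// numElements = 150M, fpp = 10% -> a filter of roughly 90 MB.
val filter = orders.stat.bloomFilter("o_orderkey", 150000000L, 0.1)
```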

The cluster

[screenshot]

Results

Using RDD.aggregate

stages

[screenshot]

tasks

[screenshot]

Notice the huge scheduler delay. The executors spend all their time sending their results back to the driver.
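
A rough back-of-the-envelope figure (assuming each of the 48 task results carries the full ~90 MB bit array): 48 × 90 MB ≈ 4.2 GB of serialized filters must reach the driver and be merged there when plain aggregate is used, whereas treeAggregate merges most of those partial filters on the executors before anything is sent back.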

total time: 52 seconds

Using RDD.treeAggregate

stages

[screenshot]

tasks

first stage

[screenshot]

second stage

[screenshot]

total time: 17 seconds

@lovasoa

lovasoa commented Jun 11, 2017

Smaller data

You might be wondering: what about smaller data?
Adding a stage to the computation of course adds some overhead.

I repeated the same experiment as above, but with a scale factor of 1 (and 1.5M elements in the bloom filter).

With RDD.aggregate

[screenshot]

With RDD.treeAggregate

[screenshot]

Conclusion

There is indeed an overhead, but it is quite small. For the creation of a small bloom filter, the old method (with aggregate) takes 0.3 seconds, and the new one (with treeAggregate) 0.6 seconds.

@srowen
Member

srowen left a comment

I'm inclined to use treeAggregate where possible. I think the win for larger data sets is worthwhile.

@HyukjinKwon
Member

HyukjinKwon left a comment

LGTM too, given that both are semantically identical, and for the same reason as above.

@SparkQA

SparkQA commented Jun 13, 2017

Test build #3796 has finished for PR 18263 at commit 61bb509.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 13, 2017

Merged to master

asfgit closed this in 9b2c877 on Jun 13, 2017
dataknocker pushed a commit to dataknocker/spark that referenced this pull request Jun 16, 2017
…ataFrame.stat.bloomFilter

## What changes were proposed in this pull request?
Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter so that the merging of the per-partition bloom filters is parallelized across executors instead of being done entirely on the driver.

## How was this patch tested?
Existing unit tests pass.

Author: Rishabh Bhardwaj <[email protected]>

Closes apache#18263 from rishabhbhardwaj/SPARK-21039.
