[WIP][SPARK-1485][MLLIB] Implement Butterfly AllReduce #506

mengxr · 2014-04-23T10:02:13Z

The current implementations of machine learning algorithms rely on the driver for some computation and data broadcasting. This may create a bottleneck at the driver for both computation and communication, especially in multi-model training. An efficient implementation of AllReduce can help free up the driver. This PR contains a simple butterfly AllReduce implementation. I compared it with reduce + broadcast (http) on a 16-node EC2 cluster (with a slow connection) and saw a 2x speed-up on vectors of size 1k to 10m.

Possible improvements:

- Each executor only needs one copy.
- Better handling when the number of partitions is not a power of two?
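
For readers unfamiliar with the butterfly pattern mentioned above, here is a minimal sketch in plain Scala collections. It is not the code in this PR: it simulates partitions as elements of a `Vector`, assumes a power-of-two partition count, and assumes `op` is associative and commutative. The names `ButterflyAllReduceSketch` and `allReduce` are hypothetical.

```scala
object ButterflyAllReduceSketch {
  /** Simulates a butterfly all-reduce; `values(i)` stands for the value held by partition i.
    * Assumption: the partition count is a power of two and `op` is associative and commutative.
    */
  def allReduce[T](values: Vector[T])(op: (T, T) => T): Vector[T] = {
    val n = values.length
    require(n > 0 && (n & (n - 1)) == 0, "partition count must be a power of two")
    var current = values
    var step = 1
    while (step < n) {
      val prev = current
      // In each round, partition i combines its value with that of partner (i XOR step).
      current = Vector.tabulate(n)(i => op(prev(i), prev(i ^ step)))
      step *= 2
    }
    current // after log2(n) rounds every partition holds the full reduction
  }

  def main(args: Array[String]): Unit = {
    // Eight simulated partitions holding 1..8; every slot ends up with the total, 36.
    println(allReduce(Vector(1, 2, 3, 4, 5, 6, 7, 8))(_ + _))
  }
}
```

The sketch takes log2(n) rounds and never routes data through a central node, which is the property that relieves the driver. It also shows why a non-power-of-two partition count needs special handling: the XOR partner may not exist.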

AmplabJenkins · 2014-04-23T10:02:55Z

Merged build triggered.

mridulm · 2014-04-23T10:06:02Z

Might be a good idea to move this out of mllib and push this into core itself.
The utility of this PR seems more fundamental than just for ML (assuming it does something analogous to all reduce in mpi - note, I am yet to go through this in detail :-) ).

AmplabJenkins · 2014-04-23T10:08:38Z

Merged build started.

mengxr · 2014-04-23T10:10:14Z

I'm a little worried about misuse. Calling AllReduce with many small partitions can be very slow, at least with this implementation. Even in MLlib, this is package private.

AmplabJenkins · 2014-04-23T10:45:48Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-04-23T10:45:49Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14371/

rxin · 2014-04-23T18:14:27Z

@mridulm this is not a very general solution yet, and can have bad consequences (e.g. when data are not cached in memory). If we want a more reliable allReduce, we should probably look into some sort of shuffle dependency that is not all to all (the main problem modeling this using shuffle I see is having to send a bunch of 0s back to the driver for shuffle block size estimation; we might be able to just use run-length-encoding to make that transmission cheap).

mridulm · 2014-04-23T19:55:22Z

Agreed, I was not suggesting that this specific change per se makes it into core.
Just that there are a lot of applications for all-reduce support in Spark, and if it were available out of the box, it would make porting a lot of algos quite trivial!

And in that context, all reduce support should go into core.
If we feel that this specific PR is not what we want due to limitations/design choices, that is fine by me.

mengxr · 2014-04-23T20:13:48Z

@mridulm This implementation is experimental, and I'm looking for comments and suggestions to make it better. @etrain @shivaram ?

Since it already outperforms reduce + broadcast, it is interesting to see how far we can go.

shivaram · 2014-04-23T20:16:53Z

mllib/src/main/scala/org/apache/spark/mllib/rdd/PartitionSlicingRDD.scala

Isn't this the same as PartitionPruningRDD ?

Yes, I didn't know there is one.

shivaram · 2014-04-23T20:22:34Z

@mengxr This is really cool and the performance wins look awesome. Apart from the inline comments, I just have one more idea: instead of using cache + RDD re-partitioning in each step, how expensive is it to do a reduceByKey at each iteration and adjust the keys appropriately? I think some serialization + de-serialization overheads might add up, but it'll simplify the clean up / caching etc.
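
One way to read the reduceByKey suggestion above is sketched below with plain Scala collections (Scala 2.13+ for `groupMapReduce`), which stands in for Spark's `reduceByKey`. This is an illustration under assumptions, not code from the PR: each round, partition `i` emits its value under its own key and under its partner's key, and the per-key reduction produces the next round's state. The names `KeyedButterflySketch`, `round`, and `allReduce` are hypothetical.

```scala
object KeyedButterflySketch {
  /** One butterfly round written as "emit to keys, then reduce by key". */
  def round[T](state: Map[Int, T], step: Int)(op: (T, T) => T): Map[Int, T] =
    state.toSeq
      // Keep one copy under the partition's own key and send one to the partner's key.
      .flatMap { case (i, v) => Seq((i, v), (i ^ step, v)) }
      // Local stand-in for reduceByKey: every key receives exactly two values.
      .groupMapReduce(_._1)(_._2)(op)

  /** Assumption: power-of-two partition count, `op` associative and commutative. */
  def allReduce[T](values: Seq[T])(op: (T, T) => T): Seq[T] = {
    val n = values.length
    require(n > 0 && (n & (n - 1)) == 0, "partition count must be a power of two")
    var state = values.zipWithIndex.map { case (v, i) => (i, v) }.toMap
    var step = 1
    while (step < n) {
      state = round(state, step)(op)
      step *= 2
    }
    (0 until n).map(state) // read the values back in partition order
  }
}
```

In this formulation the only thing that changes between rounds is the key function `i ^ step`, so no explicit caching or re-partitioning logic is needed; the cost moves into the per-round shuffle and serialization.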

…#506. SPARK-1062 Add rdd.intersection(otherRdd) method Author: Andrew Ash <[email protected]> == Merge branch commits == commit 5d9982b171b9572649e9828f37ef0b43f0242912 Author: Andrew Ash <[email protected]> Date: Thu Feb 6 18:11:45 2014 -0800 Minor fixes - style: (v,null) => (v, null) - mention the shuffle in Javadoc commit b86d02f14e810902719cef893cf6bfa18ff9acb0 Author: Andrew Ash <[email protected]> Date: Sun Feb 2 13:17:40 2014 -0800 Overload .intersection() for numPartitions and custom Partitioner commit bcaa34911fcc6bb5bc5e4f9fe46d1df73cb71c09 Author: Andrew Ash <[email protected]> Date: Sun Feb 2 13:05:40 2014 -0800 Better naming of parameters in intersection's filter commit b10a6af2d793ec6e9a06c798007fac3f6b860d89 Author: Andrew Ash <[email protected]> Date: Sat Jan 25 23:06:26 2014 -0800 Follow spark code format conventions of tab => 2 spaces commit 965256e4304cca514bb36a1a36087711dec535ec Author: Andrew Ash <[email protected]> Date: Fri Jan 24 00:28:01 2014 -0800 Add rdd.intersection(otherRdd) method

AmplabJenkins · 2014-05-13T18:02:58Z

Merged build triggered.

AmplabJenkins · 2014-05-13T18:03:07Z

Merged build started.

AmplabJenkins · 2014-05-13T18:04:02Z

Merged build finished.

AmplabJenkins · 2014-05-13T18:04:02Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14943/

mengxr · 2014-06-17T23:55:14Z

Thanks all for reviewing this PR! I found that the butterfly pattern introduces complex dependencies that slow down the computation. In my tests, a good approach for Spark is tree reduce + BT broadcast. So I'm closing this one now in favor of #1110.

@pwendell

In `reduce` and `aggregate`, the driver node spends time linear in the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big.

SPARK-1485 (#506) tracks the progress of implementing AllReduce on Spark. I did several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be the right way to go for Spark. Using a binary tree may introduce some overhead in communication, because the driver still needs to coordinate on data shuffling. In my experiments, n -> sqrt(n) -> 1 gives the best performance in general, which is why I set "depth = 2" in MLlib algorithms. But it certainly needs more testing.

I left `treeReduce` and `treeAggregate` public for easy testing. Some numbers from a test on a 32-node m3.2xlarge cluster.

code:

~~~
import breeze.linalg._
import org.apache.log4j._

Logger.getRootLogger.setLevel(Level.OFF)

for (n <- Seq(1, 10, 100, 1000, 10000, 100000, 1000000)) {
  val vv = sc.parallelize(0 until 1024, 1024).map(i => DenseVector.zeros[Double](n))
  var start = System.nanoTime(); vv.treeReduce(_ + _, 2); println((System.nanoTime() - start) / 1e9)
  start = System.nanoTime(); vv.reduce(_ + _); println((System.nanoTime() - start) / 1e9)
}
~~~

out (seconds):

| n | treeReduce(,2) | reduce |
|---|---------------------|-----------|
| 10 | 0.215538731 | 0.204206899 |
| 100 | 0.278405907 | 0.205732582 |
| 1000 | 0.208972182 | 0.214298272 |
| 10000 | 0.194792071 | 0.349353687 |
| 100000 | 0.347683285 | 6.086671892 |
| 1000000 | 2.589350682 | 66.572906702 |

CC: @pwendell

This is clearly more scalable than the default implementation. My question is whether we should use this implementation in `reduce` and `aggregate` or put them as separate methods. The concern is that users may use `reduce` and `aggregate` as collect, where having multiple stages doesn't reduce the data size. However, in this case, `collect` is more appropriate.

Author: Xiangrui Meng <[email protected]>

Closes #1110 from mengxr/tree and squashes the following commits:

c6cd267 [Xiangrui Meng] make depth default to 2
b04b96a [Xiangrui Meng] address comments
9bcc5d3 [Xiangrui Meng] add depth for readability
7495681 [Xiangrui Meng] fix compile error
142a857 [Xiangrui Meng] merge master
d58a087 [Xiangrui Meng] move treeReduce and treeAggregate to mllib
8a2a59c [Xiangrui Meng] Merge branch 'master' into tree
be6a88a [Xiangrui Meng] use treeAggregate in mllib
0f94490 [Xiangrui Meng] add docs
eb71c33 [Xiangrui Meng] add treeReduce
fe42a5e [Xiangrui Meng] add treeAggregate
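
To make the "n -> sqrt(n) -> 1" shape described above concrete, here is a small simulation in plain Scala collections (Scala 2.13+). It is a sketch under assumptions rather than the code merged in #1110: partitions are simulated as elements of a `Vector`, the fan-in rule is one plausible choice, and the names `TreeReduceSketch` and `treeReduce` (returning a pair) are hypothetical.

```scala
object TreeReduceSketch {
  /** Simulates a multi-level reduce over per-partition partial results.
    * Returns the reduced value and how many partials are merged in the final (driver-side) step.
    */
  def treeReduce[T](partials: Vector[T], depth: Int = 2)(op: (T, T) => T): (T, Int) = {
    require(partials.nonEmpty && depth >= 1, "need at least one partial and depth >= 1")
    // Fan-in per level: about n^(1/depth), e.g. 32 for 1024 partitions and depth 2.
    val scale = math.max(math.ceil(math.pow(partials.length.toDouble, 1.0 / depth)).toInt, 2)
    var current = partials
    // Add an intermediate level only while it shrinks what is left for the final merge.
    while (current.length > scale + current.length / scale) {
      val numGroups = current.length / scale
      current = current.zipWithIndex
        .groupMapReduce(_._2 % numGroups)(_._1)(op) // partial i goes to group i % numGroups
        .toVector
        .sortBy(_._1)
        .map(_._2)
    }
    (current.reduce(op), current.length)
  }

  def main(args: Array[String]): Unit = {
    val ones = Vector.fill(1024)(1)
    println(treeReduce(ones, depth = 2)(_ + _)) // final merge sees 32 partials
    println(treeReduce(ones, depth = 1)(_ + _)) // no intermediate level: 1024 partials
  }
}
```

With 1024 simulated partitions, depth 2 leaves 32 partial results for the final merge instead of 1024, which matches the intuition for why the driver stops being the bottleneck as vectors grow.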

sidps · 2018-03-17T14:02:03Z

I've been curious about the underlying implementations of such operations. Has the ring all-reduce technique been considered? http://research.baidu.com/bringing-hpc-techniques-deep-learning/


mengxr added 3 commits April 17, 2014 21:43

init impl of allReduce

76f4bb7

move allReduce to mllib

d143005

allow arbitrary number of partitions

98c329d

mengxr closed this Apr 23, 2014

mengxr reopened this Apr 23, 2014

shivaram reviewed Apr 23, 2014

mengxr added 2 commits April 25, 2014 01:07

use PartitionPruningRDD

49b42cb

add binaryTreeReduce

97b5588

mengxr changed the title ~~[SPARK-1485][MLLIB] Implement Butterfly AllReduce~~ [WIP][SPARK-1485][MLLIB] Implement Butterfly AllReduce May 13, 2014

mengxr mentioned this pull request Jun 17, 2014

[SPARK-2174][MLLIB] treeReduce and treeAggregate #1110

Closed

mengxr closed this Jun 17, 2014

mengxr mentioned this pull request Aug 1, 2014

Add normalizeByCol method to mllib.util.MLUtils. #1698

Closed

yifeih added a commit to yifeih/spark that referenced this pull request Mar 5, 2019

SPARK-25299: Add shuffle reader benchmarks (apache#506)

9d46fae

yifeih added a commit to yifeih/spark that referenced this pull request Mar 5, 2019

Revert "SPARK-25299: Add shuffle reader benchmarks (apache#506)"

9f51758

This reverts commit 9d46fae.

arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020

MapR [COLD-150][K8S] Fix metrics copy (apache#506)

96bb885

RolatZhang pushed a commit to RolatZhang/spark that referenced this pull request Aug 15, 2022

KE-38100 Fix vulnerability (apache#506)

e7ff088

RolatZhang pushed a commit to RolatZhang/spark that referenced this pull request Aug 15, 2022

KE-38100 Fix vulnerability (apache#506)

6390d3c

turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025

[MINOR] Add debug info to check the task num in executor (apache#506)

e9a2f5a
