[SPARK-12869] Implemented an improved version of the toIndexedRowMatrix #10839

Fokko · 2016-01-19T21:55:03Z

Hi guys,

I've implemented an improved version of the toIndexedRowMatrix method of BlockMatrix. I needed this for a project and would like to share it with the rest of the community. For dense matrices, it can improve performance by up to 19 times:
https://github.com/Fokko/BlockMatrixToIndexedRowMatrix

If there are any questions or suggestions, please let me know. Keep up the good work! Cheers.

hvanhovell · 2016-02-14T17:55:58Z

cc @mengxr

MLnick · 2016-02-23T13:20:54Z

ok to test

MLnick · 2016-02-26T21:08:00Z

ok to test

SparkQA · 2016-02-26T21:14:30Z

Test build #52080 has finished for PR 10839 at commit 4d7c297.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-13T19:12:34Z

Test build #53036 has finished for PR 10839 at commit 67fd902.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-03-14T19:42:30Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala

This assumes that a partition can hold an entire block row, which is not always valid. I would suggest the following:

1. For each block, break the matrix block into rows and then emit (rowIdx, (colStartIndex, row)). You can map the matrix block to a breeze matrix, and then call rows.

2. Call groupByKey and then concat the breeze vectors. Note that there could be missing vectors.

This could be a more scalable implementation.
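
For illustration, the two steps suggested above could look roughly like this with plain Scala collections, where `groupBy` stands in for `groupByKey` on an RDD. `Block`, `explode`, and `toIndexedRows` are hypothetical names, and the blocks are assumed to be dense; this is a sketch of the idea, not the code in the PR.

```scala
// Hypothetical stand-in for one ((blockRowIdx, blockColIdx), Matrix) entry
// of a BlockMatrix, holding the block's rows as dense vectors.
final case class Block(blockRow: Int, blockCol: Int, rows: Vector[Vector[Double]])

object BlockRowsSketch {
  // Step 1: split one block into (globalRowIdx, (colStartIndex, rowSlice)) pairs.
  def explode(b: Block, rowsPerBlock: Int, colsPerBlock: Int): Seq[(Long, (Int, Vector[Double]))] =
    b.rows.zipWithIndex.map { case (row, i) =>
      (b.blockRow.toLong * rowsPerBlock + i, (b.blockCol * colsPerBlock, row))
    }

  // Step 2: group the slices by row index and copy each one into a full-width row.
  // Columns that no block covers stay zero (the "missing vectors" case).
  def toIndexedRows(
      blocks: Seq[Block],
      rowsPerBlock: Int,
      colsPerBlock: Int,
      numCols: Int): Map[Long, Array[Double]] =
    blocks
      .flatMap(explode(_, rowsPerBlock, colsPerBlock))
      .groupBy(_._1) // plays the role of groupByKey
      .map { case (rowIdx, slices) =>
        val full = new Array[Double](numCols)
        slices.foreach { case (_, (start, slice)) => slice.copyToArray(full, start) }
        rowIdx -> full
      }
}
```

Because each record carries only a single row slice, no partition has to hold an entire block row before the shuffle.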

Fokko · 2016-03-15

Thanks @mengxr, that's a very good idea. I'll update the code and push it within 24 hours. Cheers!

Fokko · 2016-03-15T22:47:14Z

I've improved the PR based on the feedback. Besides that, I've also updated the benchmark:
https://github.com/Fokko/BlockMatrixToIndexedRowMatrix

If there are any questions, please let me know.

SparkQA · 2016-03-15T22:57:07Z

Test build #53239 has finished for PR 10839 at commit a9bc894.

This patch fails to build.
This patch does not merge cleanly.
This patch adds no public classes.

Fokko · 2016-03-15T23:10:50Z

ok to test

SparkQA · 2016-03-16T01:00:06Z

Test build #53238 has finished for PR 10839 at commit ba7791f.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-03-16T01:35:46Z

Test build #53240 has finished for PR 10839 at commit fe1842e.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

mengxr · 2016-03-16T07:05:35Z

@Fokko The implementation doesn't take care of sparsity yet. I created #11757 to add row/column iterators to local matrices. After that one gets merged, you can simplify the implementation here.

Fokko · 2016-03-16T09:17:49Z

Nice work! As soon as that PR is merged, I will update the code accordingly.

mengxr · 2016-03-16T21:32:20Z

@Fokko That PR was merged. Could you merge the current master and update your implementation? Note that when you concat the vectors, it is useful to check the sparsity and then decide whether to create a dense vector or a sparse vector. Allocating a dense vector directly could be expensive.

SparkQA · 2016-03-17T07:48:31Z

Test build #53401 has finished for PR 10839 at commit c043e77.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-17T11:09:25Z

Test build #53419 has finished for PR 10839 at commit d3c780d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Fokko · 2016-03-18T21:16:15Z

@mengxr I've updated the code according to your PR :)

mengxr · 2016-03-21T19:24:59Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala

+      }
+    }.groupByKey().map { case (rowIdx, vectors) =>
+
+      val wholeVector = vectors.head match {


The output vector type should depend on the total number of active elements (or nonzeros) instead of the first one. Could you try vectors.map(_.activeSize).sum and compare it with numCols to decide which vector type to use?
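
As a rough sketch of that check in plain Scala: count the active entries across all slices of a row, divide by numCols, and only then pick the output type. `Row`, `DenseRow`, `SparseRow`, `assemble`, and the `sparseThreshold` default are hypothetical stand-ins chosen for illustration; the cut-off in particular is an assumption, not a value from this thread.

```scala
// Hypothetical minimal row types, standing in for dense and sparse vectors.
sealed trait Row { def size: Int }
final case class DenseRow(values: Array[Double]) extends Row { def size: Int = values.length }
final case class SparseRow(size: Int, indices: Array[Int], values: Array[Double]) extends Row

object RowTypeSketch {
  // slices: (colStartIndex, active (localCol, value) pairs of that slice).
  // sparseThreshold is an assumed cut-off for illustration only.
  def assemble(
      slices: Seq[(Int, Seq[(Int, Double)])],
      numCols: Int,
      sparseThreshold: Double = 0.1): Row = {
    // Shift local column indices to global ones and order them by column.
    val active = slices.flatMap { case (start, entries) =>
      entries.map { case (j, v) => (start + j, v) }
    }.sortBy(_._1)
    // Decide from the total active count over all slices, not from the first slice.
    val density = active.size.toDouble / numCols
    if (density <= sparseThreshold) {
      SparseRow(numCols, active.map(_._1).toArray, active.map(_._2).toArray)
    } else {
      val full = new Array[Double](numCols)
      active.foreach { case (j, v) => full(j) = v }
      DenseRow(full)
    }
  }
}
```

Deciding before allocating means a mostly empty row never triggers a full numCols-sized array.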

mengxr · 2016-03-21T21:58:23Z

Btw, you also need to merge with the current master to resolve conflicts.

SparkQA · 2016-03-21T23:36:18Z

Test build #53722 has finished for PR 10839 at commit 25c5f66.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Fokko · 2016-03-26T13:23:10Z

@mengxr did you have a chance to look at the updated version? I also extended the test to check the conversion to dense/sparse vectors.

mengxr · 2016-04-15T00:32:32Z

Sorry for the late response! This LGTM. Merged into master. Thanks!

Implemented an improved version of the toIndexedRowMatrix method of the BlockMatrix

4d7c297

mengxr reviewed Mar 14, 2016
Fokko force-pushed the master branch 2 times, most recently from ba7791f to a9bc894 (March 15, 2016 22:45)

Updated the method based on the suggestion of @mengxr

fe1842e

Fokko force-pushed the master branch from a9bc894 to fe1842e (March 15, 2016 23:10)

Resolved conflicts

c043e77

Updated the code based to make use of the Matrix iterator

d3c780d

mengxr reviewed Mar 21, 2016
Fokko added 2 commits March 21, 2016 23:34

Processed the feedback of @mengxr and added tests for the mapping of Dense and Sparse vectors

6c0b58d

Merged master

25c5f66

asfgit closed this in c80586d Apr 15, 2016
