Skip to content

Conversation

@Fokko
Copy link
Contributor

@Fokko Fokko commented Jan 19, 2016

Hi guys,

I've implemented an improved version of the toIndexedRowMatrix function on the BlockMatrix. I needed this for a project, but would like to share it with the rest of the community. In the case of dense matrices, it can increase performance up to 19 times:
https://github.com/Fokko/BlockMatrixToIndexedRowMatrix

If there are any questions or suggestions, please let me know. Keep up the good work! Cheers.

@hvanhovell
Copy link
Contributor

cc @mengxr

@MLnick
Copy link
Contributor

MLnick commented Feb 23, 2016

ok to test

1 similar comment
@MLnick
Copy link
Contributor

MLnick commented Feb 26, 2016

ok to test

@SparkQA
Copy link

SparkQA commented Feb 26, 2016

Test build #52080 has finished for PR 10839 at commit 4d7c297.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 13, 2016

Test build #53036 has finished for PR 10839 at commit 67fd902.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that a partition can hold an entire block row, which is not always valid. I would suggest the following:

  • for each block, break the matrix block into rows and then emit (rowIdx, (colStartIndex, row)). You can map the matrix block to a breeze matrix, and then call rows.
  • call groupByKey and then concat breeze vectors. Note that there could be missing vectors.

This could be a more scalable implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mengxr, that's a very good idea. I'll update the code and push it within 24 hours. Cheers!

@Fokko Fokko force-pushed the master branch 2 times, most recently from ba7791f to a9bc894 Compare March 15, 2016 22:45
@Fokko
Copy link
Contributor Author

Fokko commented Mar 15, 2016

I've improved the PR based on the feedback. Beside that I've also updated the benchmark:
https://github.com/Fokko/BlockMatrixToIndexedRowMatrix

If there are any questions, please let me know.

@SparkQA
Copy link

SparkQA commented Mar 15, 2016

Test build #53239 has finished for PR 10839 at commit a9bc894.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@Fokko
Copy link
Contributor Author

Fokko commented Mar 15, 2016

ok to test

@SparkQA
Copy link

SparkQA commented Mar 16, 2016

Test build #53238 has finished for PR 10839 at commit ba7791f.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 16, 2016

Test build #53240 has finished for PR 10839 at commit fe1842e.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Mar 16, 2016

@Fokko The implementation doesn't take care of sparsity yet. I created #11757 to add row/column iterators to local matrices. After that one gets merged, you can simply the implementation here.

@Fokko
Copy link
Contributor Author

Fokko commented Mar 16, 2016

Nice work, as soon as the PR will be merged I will update the code accordingly.

@mengxr
Copy link
Contributor

mengxr commented Mar 16, 2016

@Fokko That PR was merged. Could you merge the current master and update your implementation? Note that when you concat the vectors, it is useful to check the sparsity and then decide whether to create a dense vector or a sparse vector. Allocating a dense vector directly could be expensive.

@SparkQA
Copy link

SparkQA commented Mar 17, 2016

Test build #53401 has finished for PR 10839 at commit c043e77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 17, 2016

Test build #53419 has finished for PR 10839 at commit d3c780d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko
Copy link
Contributor Author

Fokko commented Mar 18, 2016

@mengxr I've updated the code according to your PR :)

}
}.groupByKey().map { case (rowIdx, vectors) =>

val wholeVector = vectors.head match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output vector type should depend on the total number of active elements (or nonzeros) instead of the first one. Could you try vectors.map(_.activeSize).sum and compare it with numCols to decide which vector type to use?

@mengxr
Copy link
Contributor

mengxr commented Mar 21, 2016

Btw, you also need to merge with the current master to resolve conflicts.

@SparkQA
Copy link

SparkQA commented Mar 21, 2016

Test build #53722 has finished for PR 10839 at commit 25c5f66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko
Copy link
Contributor Author

Fokko commented Mar 26, 2016

@mengxr did you have a chance to look at the updated version? I also extended the test to check the conversion to dense/sparse vectors.

@mengxr
Copy link
Contributor

mengxr commented Apr 15, 2016

Sorry for late response! This LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in c80586d Apr 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants