-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-12869] Implemented an improved version of the toIndexedRowMatrix #10839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @mengxr |
|
ok to test |
1 similar comment
|
ok to test |
|
Test build #52080 has finished for PR 10839 at commit
|
|
Test build #53036 has finished for PR 10839 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes that a partition can hold an entire block row, which is not always valid. I would suggest the following:
- for each block, break the matrix block into rows and then emit (rowIdx, (colStartIndex, row)). You can map the matrix block to a breeze matrix, and then call rows.
- call groupByKey and then concat breeze vectors. Note that there could be missing vectors.
This could be a more scalable implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mengxr, that's a very good idea. I'll update the code and push it within 24 hours. Cheers!
ba7791f to
a9bc894
Compare
|
I've improved the PR based on the feedback. Beside that I've also updated the benchmark: If there are any questions, please let me know. |
|
Test build #53239 has finished for PR 10839 at commit
|
|
ok to test |
|
Test build #53238 has finished for PR 10839 at commit
|
|
Test build #53240 has finished for PR 10839 at commit
|
|
Nice work, as soon as the PR will be merged I will update the code accordingly. |
|
@Fokko That PR was merged. Could you merge the current master and update your implementation? Note that when you concat the vectors, it is useful to check the sparsity and then decide whether to create a dense vector or a sparse vector. Allocating a dense vector directly could be expensive. |
|
Test build #53401 has finished for PR 10839 at commit
|
|
Test build #53419 has finished for PR 10839 at commit
|
|
@mengxr I've updated the code according to your PR :) |
| } | ||
| }.groupByKey().map { case (rowIdx, vectors) => | ||
|
|
||
| val wholeVector = vectors.head match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output vector type should depend on the total number of active elements (or nonzeros) instead of the first one. Could you try vectors.map(_.activeSize).sum and compare it with numCols to decide which vector type to use?
|
Btw, you also need to merge with the current master to resolve conflicts. |
…Dense and Sparse vectors
|
Test build #53722 has finished for PR 10839 at commit
|
|
@mengxr did you have a chance to look at the updated version? I also extended the test to check the conversion to dense/sparse vectors. |
|
Sorry for late response! This LGTM. Merged into master. Thanks! |
Hi guys,
I've implemented an improved version of the
toIndexedRowMatrixfunction on theBlockMatrix. I needed this for a project, but would like to share it with the rest of the community. In the case of dense matrices, it can increase performance up to 19 times:https://github.com/Fokko/BlockMatrixToIndexedRowMatrix
If there are any questions or suggestions, please let me know. Keep up the good work! Cheers.