Could sparse matrix-vector multiplication be optimized? #732
Since you seem only interested in CPU performance (and not memory performance), you should benchmark without allocating 160 MByte of memory:

r = copy(v)
@benchmark mul!(r, M, v)

You can test yourself whether CSR is faster: calculate transpose(M) * v, which traverses the same CSC data in CSR fashion.

I looked at the generated assembler code on my system (Intel(R) Core(TM) i7-8850H). The generated code looks good; it's certainly efficient and vectorized. Since the matrix is 160 MByte large, it doesn't fit into the CPU cache and has to be read from memory. What is your memory bandwidth? Usually, CPU floating-point performance is many times higher than memory bandwidth, and calculations such as these are limited by memory accesses, not CPU calculations. It might be possible to improve performance by tiling the loops to ensure that portions of either the matrix or the vector stay in the cache.
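Put together, a self-contained version of that experiment might look like the sketch below. Everything here is a placeholder (a much smaller random matrix than the one in the issue), and the second benchmark uses transpose(M) only as a stand-in for CSR-style traversal of the same data:

```julia
using SparseArrays, LinearAlgebra, BenchmarkTools

# Placeholder problem: a random square sparse matrix, far smaller than the
# 160 MByte matrix discussed above.
n = 10^6
M = sprand(n, n, 10 / n)        # ~10 stored entries per column on average
v = rand(n)
r = copy(v)                     # preallocated output, so no allocation is timed

@benchmark mul!($r, $M, $v)              # CSC matrix-vector product, in place
@benchmark mul!($r, transpose($M), $v)   # same data, traversed row-wise (CSR-like)
```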
But seriously, when it takes ~100 ns to get data from RAM and you can do SIMD operations on 256+ bits per instruction, counting FLOPs becomes a very rough approximation. Sparse arrays in general do a lot of indirect reads, which can be inefficient since it is hard to predict them and fetch the data before it is used. JuliaLang/julia#29525 is a "trivial" speedup for the transpose case, though, by just multithreading it.
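The transpose case is easy to parallelize because each output entry only reads one column of the CSC data. A hedged sketch of that idea (this is not the code from JuliaLang/julia#29525, and the function name is made up for illustration):

```julia
using SparseArrays

# Computes y = transpose(A) * x for a CSC matrix A. Each y[j] is a dot
# product over column j of A, so the outer loop has no cross-iteration
# dependencies and can be split across threads without synchronization.
function threaded_transpose_mul!(y::AbstractVector, A::SparseMatrixCSC, x::AbstractVector)
    rv = rowvals(A)
    nz = nonzeros(A)
    Threads.@threads for j in 1:size(A, 2)
        acc = zero(eltype(y))
        @inbounds for k in nzrange(A, j)
            acc += nz[k] * x[rv[k]]
        end
        y[j] = acc
    end
    return y
end
```

Start Julia with several threads (e.g. `julia -t 4`) for the `Threads.@threads` loop to actually fan out. The plain M * v direction is harder to thread this way, because different columns scatter into the same output entries.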
It does indeed seem that M'*v is faster (on my AMD 8350):
I also tried it for n = 10^8 on a bigger and faster computer (with an Intel CPU).
I then tried … which didn't seem to give much speedup.
That looks interesting. I couldn't tell from reading the issue why it hasn't been merged. Is there still a problem with it?
That sounds very promising!
I don't think there's anything specific to do here. We already have the multi-threading PR.
Is there no value in supporting CSR? It looks like you get an easy speedup that way.
There are old issues discussing that, but it is basically unlikely to happen here. Someone would have to do it in a package.
Am I wrong that CSR is just Transpose{CSC}? Can't one just specialize on that type if one wants the performance benefits of CSR?
I may not be understanding something, but if you create the matrix initially in CSC (which is currently the only Julia option), then you need to first physically transpose it and then multiply the transpose to get the speed advantage. Amongst other issues, if the matrix is very large then transposing it may use too much RAM, as there is no in-place transpose of sparse matrices. Or did you mean you can avoid all that by a clever use of specialization?
I guess what I mean is that you construct the transposed matrix as CSC and then call transpose on it to make it behave as the matrix you wanted.
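A hedged sketch of that workaround; the index vectors and sizes below are made-up placeholders. The data is stored as the transpose in CSC form and then wrapped in `transpose`, so it behaves like the matrix you wanted while products dispatch to the transposed-CSC (CSR-style) kernel:

```julia
using SparseArrays, LinearAlgebra

# Illustrative coordinate data for an m×n matrix.
m, n = 2_000, 1_000
ri = rand(1:m, 10_000)          # row indices
ci = rand(1:n, 10_000)          # column indices
nzval = rand(10_000)

Mt = sparse(ci, ri, nzval, n, m)   # build the transpose directly in CSC form
M  = transpose(Mt)                 # lazy wrapper: behaves like the m×n matrix

v = rand(n)
y = M * v                          # hits the transpose(CSC)-times-vector method
```

No physical transpose is ever materialized: M is a Transpose wrapper around the CSC storage, so any method specialized on that wrapper sees the row-major-style layout without an extra copy.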
Right. That is an option but it is potentially a pain for the coder to do.
If you could construct …
If you construct a matrix via … The same issue, by the way, exists for dense matrices. Some operations are faster if they are stored transposed, others are slower.
Just for the sake of some history - we did attempt having CSR in JuliaLang/julia#7029
It's not clear (to me) from reading that PR why it died.
Maybe for the same reason we don't have …
Yeah - it wasn't clear the effort was worthwhile, since you can always work with the transpose.
Is that the effort of implementing, supporting and documenting CSR, or an extra mental effort for the coder?
All of that. Someone who really needs it can make a package with it.
Take this simple code:
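(An assumed stand-in for the kind of code being described; the size and density below are made up rather than the actual values, which produce the 37,999,566 stored entries mentioned next.)

```julia
using SparseArrays, BenchmarkTools

n = 10^7
M = sprand(n, n, 4.0e-7)   # large random sparse matrix, ~40 million nonzeros
v = rand(n)

@show nnz(M)               # one multiplication per stored entry in M*v
@benchmark $M * $v
```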
At this point M has the following property: the computation M*v should take exactly 37,999,566 multiplications (one per stored nonzero) and no more than that number of additions.
This seems slower than I expected given the number of multiplications and additions.
Is the problem the CSC format of the matrix? Would it be faster if M were in CSR format?