
Could sparse matrix-vector multiplication be optimized #732

Closed
lesshaste opened this issue May 10, 2020 · 20 comments
Labels
sparse Sparse arrays

Comments

@lesshaste

lesshaste commented May 10, 2020

Take this simple code:

using BenchmarkTools, SparseArrays
n = 10^7
M = convert(SparseMatrixCSC{UInt32, UInt32}, round.(10 * sprand(n, n, 4/n)))
v = ones(UInt32, n)
@benchmark M*v

At this point M has the following structure:

10000000×10000000 SparseMatrixCSC{UInt32,UInt32} with 37999566 stored entries

The computation M*v should take exactly 37,999,566 multiplications and no more than that number of additions.

BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  2
  --------------
  minimum time:     760.211 ms (0.00% GC)
  median time:      760.683 ms (0.00% GC)
  mean time:        762.270 ms (0.00% GC)
  maximum time:     766.944 ms (0.00% GC)
  --------------
  samples:          7
  evals/sample:     1

This seems slower than I expected given the number of multiplications and additions.

Is the problem the CSC format of the matrix? Would it be faster if M were in CSR format?

@eschnett
Contributor

Since you seem only interested in CPU performance (and not memory performance), you should benchmark without allocating the ~38 MiB output vector on every run:

using LinearAlgebra
r = copy(v)
@benchmark mul!(r, M, v)

You can test yourself whether CSR is faster: calculate M'*v. The operation M' does not actually transpose the matrix, it only remembers that it needs to be transposed when it is used, which effectively creates a CSR matrix.
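For example (a sketch; Mt and r are illustrative names, and the $-interpolation avoids benchmarking global-variable access):

using BenchmarkTools, LinearAlgebra, SparseArrays

Mt = M'                          # lazy Adjoint wrapper: no data is copied
r = zeros(eltype(M), size(M, 1))
@benchmark mul!($r, $M, $v)      # CSC traversal: scattered writes into r
@benchmark mul!($r, $Mt, $v)     # CSR-style traversal: scattered reads from v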

I looked at the generated assembly code on my system (Intel(R) Core(TM) i7-8850H). The generated code looks good; it's certainly efficient and vectorized.

Since the matrix itself is roughly 340 MByte (values, row indices, and column pointers), it doesn't fit into the CPU cache and has to be read from main memory. What is your memory bandwidth? Usually, CPU floating-point performance is many times higher than memory bandwidth, and calculations such as these are limited by memory accesses, not CPU arithmetic.

It might be possible to improve performance by tiling the loops to ensure that either portions of v, M, or r can be held in the cache.
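For reference, here is a minimal sketch of what the CSC kernel does (illustrative names, not the actual SparseArrays internals); a tiled variant would restrict the inner loop to a row band of r small enough to stay in cache:

using SparseArrays

function csc_matvec!(r, M::SparseMatrixCSC, v)
    fill!(r, zero(eltype(r)))
    rows = rowvals(M)       # row index of each stored entry
    vals = nonzeros(M)      # stored values, column by column
    for j in 1:size(M, 2)   # column data is contiguous in CSC
        vj = v[j]
        for k in nzrange(M, j)
            r[rows[k]] += vals[k] * vj   # scattered writes into r
        end
    end
    return r
end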

@KristofferC
Member

[image]

But seriously, when it takes ~100 ns to get data from RAM and a single SIMD instruction can process 256+ bits, counting FLOPs becomes a very rough approximation.

Sparse arrays in general do a lot of indirect reads, which can be inefficient because they are hard to predict, so the data cannot be fetched before it is used.
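As a rough back-of-envelope (the 20 GB/s bandwidth figure below is an assumption for illustration, not a measurement):

nnz_M    = 37_999_566
bytes_M  = nnz_M * (4 + 4) + (10^7 + 1) * 4  # UInt32 nzval + rowval + colptr ≈ 344 MB
bw       = 20e9                              # assumed sustained bandwidth, bytes/s
t_stream = bytes_M / bw                      # ≈ 17 ms if purely bandwidth-bound
t_serial = nnz_M * 100e-9                    # ≈ 3.8 s if every access paid the full 100 ns

The measured ~760 ms falls between these two bounds, as expected when indirect accesses dominate.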

JuliaLang/julia#29525 is a "trivial" speedup though for the transpose case by just multithreading it.

@lesshaste
Author

lesshaste commented May 16, 2020

It does indeed seem that M'*v is faster (on my AMD 8350):

@benchmark M'*v
BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  3
  --------------
  minimum time:     608.389 ms (0.00% GC)
  median time:      622.434 ms (0.00% GC)
  mean time:        621.909 ms (0.19% GC)
  maximum time:     648.444 ms (0.00% GC)
  --------------
  samples:          9
  evals/sample:     1

@benchmark M*v
BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  2
  --------------
  minimum time:     779.605 ms (0.00% GC)
  median time:      782.413 ms (0.00% GC)
  mean time:        782.924 ms (0.15% GC)
  maximum time:     790.527 ms (1.05% GC)
  --------------
  samples:          7
  evals/sample:     1

I also tried it for n = 10^8 on a bigger and faster computer (with an Intel CPU).

@benchmark M*v
BenchmarkTools.Trial: 
  memory estimate:  381.47 MiB
  allocs estimate:  2
  --------------
  minimum time:     4.782 s (0.00% GC)
  median time:      4.868 s (0.00% GC)
  mean time:        4.868 s (0.00% GC)
  maximum time:     4.955 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

@benchmark M'*v
BenchmarkTools.Trial: 
  memory estimate:  381.47 MiB
  allocs estimate:  3
  --------------
  minimum time:     4.666 s (0.00% GC)
  median time:      4.688 s (0.00% GC)
  mean time:        4.688 s (0.00% GC)
  maximum time:     4.709 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

I then tried

@benchmark mul!(r, M, v)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.735 s (0.00% GC)
  median time:      4.735 s (0.00% GC)
  mean time:        4.735 s (0.00% GC)
  maximum time:     4.735 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

which didn't seem to give much speedup.

@lesshaste
Author

lesshaste commented May 16, 2020

JuliaLang/julia#29525 is a "trivial" speedup though for the transpose case by just multithreading it.

That looks interesting. I couldn't tell from reading the issue why it hasn't been merged. Is there still a problem with it?

@lesshaste
Author

lesshaste commented May 16, 2020

It might be possible to improve performance by tiling the loops to ensure that either portions of v, M, or r can be held in the cache.

That sounds very promising!

@dkarrasch dkarrasch added the sparse Sparse arrays label May 16, 2020
@ViralBShah
Member

I don't think there's anything specific to do here. We already have the multi-threading PR.

@lesshaste
Author

Is there no value in supporting CSR? It looks like you get an easy speedup that way.

@ViralBShah
Member

There are old issues discussing that. But basically that is unlikely to happen here. Someone would have to do it in a package.

@StefanKarpinski
Member

Am I wrong that CSR is just Transpose{CSC}? Can't one just specialize on that type if one wants the performance benefits of CSR?

@lesshaste
Author

lesshaste commented Jun 3, 2020

I may not be understanding something, but if you create the matrix initially in CSC (which is currently the only option in Julia), then you need to physically transpose it first and then multiply by the transpose to get the speed advantage. Among other issues, if the matrix is very large, transposing it may use too much RAM, since there is no in-place transpose for sparse matrices.

Or did you mean you can avoid all that by a clever use of specialization?

@StefanKarpinski
Member

I guess what I mean is that you construct the transposed matrix as CSC and then call transpose on it to make it behave as the matrix you wanted.

@lesshaste
Author

Right. That is an option, but it is potentially a pain for the coder.

@StefanKarpinski
Member

If you could construct Transpose{CSC} matrices directly, that would effectively be CSR.

@eschnett
Contributor

eschnett commented Jun 3, 2020

If you construct a matrix via sparse, it suffices to exchange the I and J arguments in the call and to transpose the result; this fits on a single line. What is less clear is how many programmers are aware of this, or whether they understand the performance characteristics of CSC vs. CSR. It is not that CSR is faster in general: it is faster for some operations and slower for others.

The same issue, by the way, exists for dense matrices. Some operations are faster if they are stored transposed, others are slower.
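Concretely, a minimal sketch of that one-liner (the coordinates here are purely illustrative):

using SparseArrays

I, J, V = [1, 2, 2], [1, 1, 3], [10.0, 20.0, 30.0]  # illustrative COO input
Mcsr = sparse(J, I, V, 3, 3)'    # build the transpose in CSC, then wrap it lazily
Mcsr == sparse(I, J, V, 3, 3)    # true: same matrix, now with CSR-like traversal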

@ViralBShah
Member

Just for the sake of some history: we did attempt CSR in JuliaLang/julia#7029

@lesshaste
Author

It's not clear (to me) from reading that PR why it died.

@KristofferC
Member

Maybe for the same reason we don't have ColumnMajorMatrix and RowMajorMatrix?

@ViralBShah
Member

ViralBShah commented Jun 3, 2020

Yeah, it wasn't clear the effort was worthwhile, since you can always work with the transpose.

@lesshaste
Author

Is that the effort of implementing, supporting and documenting CSR or an extra mental effort for the coder?

@KristofferC
Member

All of that. Someone who really needs it can make a package with it.

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024