
Could sparse matrix-vector multiplication be optimized #732

Closed
lesshaste opened this issue May 10, 2020 · 20 comments
Labels
sparse Sparse arrays

Comments

@lesshaste

lesshaste commented May 10, 2020

Take this simple code:

using BenchmarkTools, SparseArrays
n = 10^7
M = convert(SparseMatrixCSC{UInt32, UInt32}, round.(10 * sprand(n, n, 4/n)))
v = ones(UInt32, n)
@benchmark M*v

At this point M has the following structure:

10000000×10000000 SparseMatrixCSC{UInt32,UInt32} with 37999566 stored entries

The computation M*v should take exactly 37,999,566 multiplications and no more than that number of additions.

BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  2
  --------------
  minimum time:     760.211 ms (0.00% GC)
  median time:      760.683 ms (0.00% GC)
  mean time:        762.270 ms (0.00% GC)
  maximum time:     766.944 ms (0.00% GC)
  --------------
  samples:          7
  evals/sample:     1

This seems slower than I expected given the number of multiplications and additions.

Is the problem the CSC format of the matrix? Would it be faster if M were in CSR format?

@eschnett
Contributor

Since you seem only interested in CPU performance (and not memory performance), you should benchmark without allocating the ~38 MiB output vector on every run:

using LinearAlgebra
r = copy(v)
@benchmark mul!(r, M, v)

You can test yourself whether CSR is faster: calculate M'*v. The operation M' does not actually transpose the matrix, it only remembers that it needs to be transposed when it is used, which effectively creates a CSR matrix.
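For example (a sketch; Mt and r are illustrative names, and the $-interpolation avoids benchmarking global-variable access):

using BenchmarkTools, LinearAlgebra, SparseArrays

Mt = M'                          # lazy Adjoint wrapper: no data is copied
r = zeros(eltype(M), size(M, 1))
@benchmark mul!($r, $M, $v)      # CSC traversal: scattered writes into r
@benchmark mul!($r, $Mt, $v)     # CSR-style traversal: scattered reads from v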

I looked at the generated assembly code on my system (Intel(R) Core(TM) i7-8850H). The generated code looks good; it's certainly efficient and vectorized.

Since the matrix itself is roughly 340 MByte (values, row indices, and column pointers), it doesn't fit into the CPU cache and has to be read from main memory. What is your memory bandwidth? Usually, CPU floating-point performance is many times higher than memory bandwidth, and calculations such as these are limited by memory accesses, not CPU arithmetic.

It might be possible to improve performance by tiling the loops to ensure that either portions of v, M, or r can be held in the cache.
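For reference, here is a minimal sketch of what the CSC kernel does (illustrative names, not the actual SparseArrays internals); a tiled variant would restrict the inner loop to a row band of r small enough to stay in cache:

using SparseArrays

function csc_matvec!(r, M::SparseMatrixCSC, v)
    fill!(r, zero(eltype(r)))
    rows = rowvals(M)       # row index of each stored entry
    vals = nonzeros(M)      # stored values, column by column
    for j in 1:size(M, 2)   # column data is contiguous in CSC
        vj = v[j]
        for k in nzrange(M, j)
            r[rows[k]] += vals[k] * vj   # scattered writes into r
        end
    end
    return r
end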

@KristofferC
Member

[image]

But seriously, when it takes ~100 ns to get data from RAM and a single SIMD instruction can process 256+ bits, counting FLOPs becomes a very rough approximation.

Sparse arrays in general do a lot of indirect reads, which can be inefficient because they are hard to predict, so the data cannot be fetched before it is used.
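As a rough back-of-envelope (the 20 GB/s bandwidth figure below is an assumption for illustration, not a measurement):

nnz_M    = 37_999_566
bytes_M  = nnz_M * (4 + 4) + (10^7 + 1) * 4  # UInt32 nzval + rowval + colptr ≈ 344 MB
bw       = 20e9                              # assumed sustained bandwidth, bytes/s
t_stream = bytes_M / bw                      # ≈ 17 ms if purely bandwidth-bound
t_serial = nnz_M * 100e-9                    # ≈ 3.8 s if every access paid the full 100 ns

The measured ~760 ms falls between these two bounds, as expected when indirect accesses dominate.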

JuliaLang/julia#29525 is a "trivial" speedup though for the transpose case by just multithreading it.

@lesshaste
Author

lesshaste commented May 16, 2020

It does indeed seem that M'*v is faster (on my AMD 8350):

@benchmark M'*v
BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  3
  --------------
  minimum time:     608.389 ms (0.00% GC)
  median time:      622.434 ms (0.00% GC)
  mean time:        621.909 ms (0.19% GC)
  maximum time:     648.444 ms (0.00% GC)
  --------------
  samples:          9
  evals/sample:     1

@benchmark M*v
BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  2
  --------------
  minimum time:     779.605 ms (0.00% GC)
  median time:      782.413 ms (0.00% GC)
  mean time:        782.924 ms (0.15% GC)
  maximum time:     790.527 ms (1.05% GC)
  --------------
  samples:          7
  evals/sample:     1

I also tried it for n = 10^8 on a bigger and faster computer (with an Intel CPU).

@benchmark M*v
BenchmarkTools.Trial: 
  memory estimate:  381.47 MiB
  allocs estimate:  2
  --------------
  minimum time:     4.782 s (0.00% GC)
  median time:      4.868 s (0.00% GC)
  mean time:        4.868 s (0.00% GC)
  maximum time:     4.955 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

@benchmark M'*v
BenchmarkTools.Trial: 
  memory estimate:  381.47 MiB
  allocs estimate:  3
  --------------
  minimum time:     4.666 s (0.00% GC)
  median time:      4.688 s (0.00% GC)
  mean time:        4.688 s (0.00% GC)
  maximum time:     4.709 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

I then tried

@benchmark mul!(r, M, v)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.735 s (0.00% GC)
  median time:      4.735 s (0.00% GC)
  mean time:        4.735 s (0.00% GC)
  maximum time:     4.735 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

which didn't seem to give much speedup.

@lesshaste
Author

lesshaste commented May 16, 2020

JuliaLang/julia#29525 is a "trivial" speedup though for the transpose case by just multithreading it.

That looks interesting. I couldn't tell from reading the issue why it hasn't been merged. Is there still a problem with it?

@lesshaste
Author

lesshaste commented May 16, 2020

It might be possible to improve performance by tiling the loops to ensure that either portions of v, M, or r can be held in the cache.

That sounds very promising!

@dkarrasch dkarrasch added the sparse Sparse arrays label May 16, 2020
@ViralBShah
Member

I don't think there's anything specific to do here. We already have the multi-threading PR.

@lesshaste
Author

Is there no value in supporting CSR? It looks like you get an easy speedup that way.

@ViralBShah
Member

There are old issues discussing that. But basically that is unlikely to happen here. Someone would have to do it in a package.

@StefanKarpinski
Member

Am I wrong that CSR is just Transpose{CSC}? Can't one just specialize on that type if one wants the performance benefits of CSR?

@lesshaste
Author

lesshaste commented Jun 3, 2020

I may not be understanding something, but if you create the matrix initially in CSC (which is currently the only option in Julia), then you need to physically transpose it first and then multiply by the transpose to get the speed advantage. Among other issues, if the matrix is very large, transposing it may use too much RAM, since there is no in-place transpose for sparse matrices.

Or did you mean you can avoid all that by a clever use of specialization?

@StefanKarpinski
Member

I guess what I mean is that you construct the transposed matrix as CSC and then call transpose on it to make it behave as the matrix you wanted.

@lesshaste
Author

Right. That is an option, but it is potentially a pain for the coder.

@StefanKarpinski
Member

If you could construct Transpose{CSC} matrices directly, that would effectively be CSR.

@eschnett
Contributor

eschnett commented Jun 3, 2020

If you construct a matrix via sparse, it suffices to exchange the I and J arguments in the call and to transpose the result; this fits on a single line. What is less clear is how many programmers are aware of this, or whether they understand the performance characteristics of CSC vs. CSR. It is not that CSR is faster in general: it is faster for some operations and slower for others.

The same issue, by the way, exists for dense matrices. Some operations are faster if they are stored transposed, others are slower.
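Concretely, a minimal sketch of that one-liner (the coordinates here are purely illustrative):

using SparseArrays

I, J, V = [1, 2, 2], [1, 1, 3], [10.0, 20.0, 30.0]  # illustrative COO input
Mcsr = sparse(J, I, V, 3, 3)'    # build the transpose in CSC, then wrap it lazily
Mcsr == sparse(I, J, V, 3, 3)    # true: same matrix, now with CSR-like traversal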

@ViralBShah
Member

Just for the sake of some history: we did attempt CSR in JuliaLang/julia#7029

@lesshaste
Author

It's not clear (to me) from reading that PR why it died.

@KristofferC
Member

Maybe for the same reason we don't have ColumnMajorMatrix and RowMajorMatrix?

@ViralBShah
Member

ViralBShah commented Jun 3, 2020

Yeah, it wasn't clear the effort was worthwhile, since you can always work with the transpose.

@lesshaste
Author

Is that the effort of implementing, supporting and documenting CSR or an extra mental effort for the coder?

@KristofferC
Member

All of that. Someone who really needs it can make a package with it.

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024