
faster sprand, new randsubseq (fixes #6714 and #6708) #6726

Merged 2 commits into JuliaLang:master on May 6, 2014

Conversation

@stevengj (Member) commented May 3, 2014

This patch implements a much faster (at least 5× on my laptop, but on an older machine I tried it was only 2×) sprand (fixing #6708), by filling in the matrix one (nonempty) column at a time. It also changes the meaning of the specified density to something that is (I think) easier to explain: the probability of any element of the matrix being nonzero is (independently) given by density.

A subroutine used by both this sprand and the previous sprand is a function to compute a random subsequence of a given AbstractArray (e.g. 1:n). Since this is generally useful functionality, it seemed sensible to export it as the functions randsubseq(A, p) and randsubseq!(S, A, p). These functions create a random (ordered) subsequence of A by selecting each element of A with independent probability p. (This is different from the problem of finding a random subset of a fixed size.)

randsubseq is implemented using a new algorithm different from the previous one (which had the same mean (modulo the bug in #6714) but a more complicated, and I think less useful, distribution). The new algorithm is somewhat faster, requires no temporary arrays, has Θ(p * length(A)) complexity, and avoids the statistical bug of #6714; it is inspired by an algorithm by Vitter for a related problem.
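The skip-based selection idea can be illustrated with a minimal sketch (this is an illustration of the technique, not the exact code in this PR): rather than testing every element of A, jump directly to the next selected index, where the gap between selected indices is geometric with success probability p, sampled by inversion. The expected work is then Θ(p * length(A)).

```julia
# Illustrative sketch of Bernoulli(p) subsequence selection via geometric
# skips; `randsubseq_sketch` is a hypothetical name, not the Base function.
function randsubseq_sketch(A::AbstractArray, p::Real)
    0 <= p <= 1 || throw(ArgumentError("p must be in [0, 1]"))
    S = eltype(A)[]
    p == 0 && return S
    p == 1 && return collect(vec(A))
    L = 1 / log1p(-p)   # 1 / log(1 - p), a negative constant
    n = length(A)
    i = 0
    while true
        # geometric(p) jump of size >= 1, by CDF inversion:
        i += ceil(Int, log(rand()) * L)
        i > n && break
        push!(S, A[i])
    end
    return S
end
```

Each kept index costs O(1) work, so low densities never touch most of A, which is what makes the sparse-matrix use case fast.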

(As discussed in #6714, it might be nice to also have a randsubseq(S, A) that finds a random subsequence of fixed length == length(S). The algorithm in StatsBase for this problem is currently quite suboptimal according to my benchmarks as well as theoretically. However, that may go into StatsBase rather than Base and in any case should be a separate PR.)

cc: @timholy, @StefanKarpinski, @lindahua

@stevengj (Member, Author) commented May 3, 2014

Whoops, the statistics of sprand still aren't quite right for small matrices: once I've selected a column to fill, I need to ensure that the rows array is not empty. This could be accomplished by while isempty(rows); randsubseq!(rows, 1:m, density); end, but that is quite inefficient for low density. I'm pretty sure I can do better with a simple fix, however, just by computing the expected last nonzero index of a nonempty column.... will revise shortly.

Okay, should be fixed now:

```julia
julia> ns = 10 .^ [2:7]; er = [mean([nfilled(sprand(10,10,0.1)) for i = 1:n]) for n in ns] .- 10
6-element Array{Any,1}:
 -0.23
 -0.141
  0.0075
  0.00411
  0.003047
 -0.001179
```
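One plausible way to realize the fix described above (a hypothetical sketch, not necessarily the exact code that was merged): conditioned on a column having at least one nonzero, its first nonzero row index follows a geometric(p) distribution truncated to 1:m, which can be sampled directly by inverting its CDF; the remaining rows can then be filled with the unconditional Bernoulli selection.

```julia
# Hypothetical helper: sample the first nonzero row of a column that is
# known to be nonempty, from a geometric(p) truncated to 1:m.
function first_nonzero_row(m::Integer, p::Real)
    q = -expm1(m * log1p(-p))   # P(column nonempty) = 1 - (1-p)^m
    u = rand() * q              # uniform on [0, q)
    return clamp(ceil(Int, log1p(-u) / log1p(-p)), 1, m)
end
```

This avoids the rejection loop (`while isempty(rows) ... end`), whose expected number of iterations blows up as 1/q for low densities.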

@stevengj (Member, Author) commented May 3, 2014

cc: @alanedelman, our resident random-matrix expert

@andreasnoack (Member)

I have always been confused by MATLAB's sprandn. The documentation doesn't say how the random sparsity pattern is chosen, but only how many non-zeros you should expect (and that number happens to be way off). The number of non-zeros doesn't characterise the distribution at all.

I think it is a big improvement over MATLAB's implementation that you, in addition to having the right expected number of non-zeros, are explicit about how the random sparsity structure is chosen. As I understand your explanation, your new sprandn would return something like m*n independent Bernoulli(density) draws where each non-zero is then replaced by a normal variate. If so, I think the independence property should be made explicit in the documentation.
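The distribution described here can be written down as a dense reference model (for checking statistics only; the actual sprandn builds the sparse structure directly, and `dense_sprandn_model` is a hypothetical name):

```julia
# Every entry is nonzero independently with probability p, and each
# nonzero is an independent standard normal draw.
dense_sprandn_model(m, n, p) = randn(m, n) .* (rand(m, n) .< p)
```

Under this model the number of nonzeros is Binomial(m*n, p), so its expectation is exactly p*m*n.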

A minor comment is that I think the label density is misleading and I would prefer p or probability or something like that.

@stevengj (Member, Author) commented May 3, 2014

@andreasnoackjensen, that's right. I will update the docs.

@timholy (Member) commented May 4, 2014

LGTM

@stevengj (Member, Author) commented May 4, 2014

Hmm, Travis error seems unrelated:

```
exception on 9: Worker 2 terminated.
ERROR: ProcessExitedException()
 in remotecall_fetch at multi.jl:696
 in remotecall_fetch at multi.jl:701
 in runtests at /tmp/julia/share/julia/test/testdefs.jl:5
 in anonymous at multi.jl:847
 in run_work_thunk at multi.jl:613
 in anonymous at task.jl:847
while loading parallel.jl, in expression starting on line 1276
ERROR: ProcessExitedException()
 in anonymous at task.jl:1352
while loading /tmp/julia/share/julia/test/runtests.jl, in expression starting on line 46
```

Another one of the usual random Travis failures.

@StefanKarpinski (Member)

Seems like a very positive change. It would be good to have @alanedelman's input on the new behavior. Is this behavior something that's useful? Are random matrices generated like this handy in practice? Do they have well understood statistical structure? Is there some other kind of random sparse matrix that's particularly useful? Should this be in base at all?

@stevengj (Member, Author) commented May 5, 2014

There seem to be a large number of papers on sparse random matrices (google "random sparse matrix").

Many (though not all) seem to use a distribution similar to my sprand (e.g. this paper and the ~100 papers that cite it, or more recent papers such as this one), albeit often restricted to symmetric sparsity patterns, motivated by various types of random-network models (e.g. spin glass models). (It might be useful to have an analogous sprandsym for random symmetric sparse matrices, similar to Matlab except with a documented and useful distribution.)
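The sprandsym idea floated above could look something like the following sketch (a hypothetical helper, not part of this PR): draw a random sparse upper triangle and mirror it to get a symmetric pattern. Note the off-diagonal density of the result differs slightly from p, since each mirrored pair shares one Bernoulli draw.

```julia
using SparseArrays, LinearAlgebra

# Hypothetical sprandsym sketch: symmetrize a random sparse upper triangle,
# counting the diagonal only once.
function sprandsym_sketch(n::Integer, p::Real)
    U = triu(sprand(n, n, p))
    return U + U' - spdiagm(0 => Vector(diag(U)))
end
```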

Aside from models of physical systems, random networks etc., and compressive-sensing applications, random sparse matrices are also pretty useful for tests and benchmarks of sparse-matrix algorithms (for which application the precise distribution probably doesn't matter much, but it is useful to have a well-defined distribution in any case).

I'll try to bug Alan about it tomorrow if I run into him.

@stevengj (Member, Author) commented May 5, 2014

Seems to compare well to Matlab in terms of performance:

```julia
julia> @time sprand(10^7,10^7,1e-7);
elapsed time: 2.889734603 seconds (770679184 bytes allocated)
```

vs.

```
>> tic; sprand(10^7,10^7,1e-7); toc
Elapsed time is 12.458704 seconds.
```

on the same machine with a single thread.

@alanedelman (Contributor)

The random-matrix literature on sparse matrices may not yet be ready to be relevant for computation -- though it's an interesting question.

One could ask for random matrices given a specific sparsity pattern, or an approximate nnz count. A very cute algorithm generates a random sparse matrix with given singular values or condition number by picking Jacobi rotations carefully -- though you only have approximate control over the result, so the answers are not great in small cases.

@StefanKarpinski (Member)

Any comments on whether this particular sparse random matrix generation is a decent default choice?

@stevengj (Member, Author) commented May 5, 2014

I just spoke to Alan. Matlab's sprand has three variants: sprand(S), which keeps the sparsity pattern of S but randomizes the entries; sprand(m,n,density), which is kinda like ours but has a poorly specified distribution; and sprand(m,n,density,rc), in which you specify the desired reciprocal condition number rc (or the desired singular values). Alan only seems to have strong feelings that the third variant is useful (for testing) and would be good to have, and isn't much interested one way or the other in sprand(m,n,density).
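The first MATLAB variant, sprand(S), is easy to express with the sparse CSC representation: keep the stored pattern and redraw only the stored values. A sketch, with a hypothetical helper name (`resample_pattern` is not part of this PR):

```julia
using SparseArrays

# Keep the sparsity pattern of S; overwrite its stored values in place
# with fresh uniform draws.
function resample_pattern(S::SparseMatrixCSC)
    A = copy(S)
    nonzeros(A) .= rand(nnz(A))
    return A
end
```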

@stevengj (Member, Author) commented May 5, 2014

If we are going to have an sprand(m,n,density) function, it seems like this is a reasonable default choice (possibly the most reasonable), and is a clear improvement over the previous version. Okay to merge?

@JeffBezanson (Member)

Ok by me.

@StefanKarpinski (Member)

Go for it.


stevengj added a commit that referenced this pull request May 6, 2014
faster sprand, new randsubseq (fixes #6714 and #6708)
@stevengj stevengj merged commit 593a9b8 into JuliaLang:master May 6, 2014
@stevengj stevengj deleted the sprand branch May 6, 2014 02:26