
Speed of sprand() #6708

Closed
ViralBShah opened this issue May 1, 2014 · 30 comments

Labels: domain:arrays:sparse (Sparse arrays), performance (Must go faster)

Comments

@ViralBShah (Member)

Based on the discussion here, we need to investigate if sparse() can be sped up.

https://groups.google.com/forum/#!topic/julia-users/X8Jca4SZMxo

Potential opportunities include experimenting with different ways of sorting the inputs, avoiding the generic combine function, adding @inbounds, etc.

The following example takes 1.7 seconds for me:

s = sprand(70000,70000,0.001)
@ViralBShah (Member Author)

On quick checks, sticking @inbounds in front of the for loops and removing the combine function have no effect on the time. Checking for pre-sorted inputs should help.

@ViralBShah (Member Author)

Cc: @rwgardner

@stevengj (Member) commented May 1, 2014

On my machine, almost all of the time in your sprand test is spent in two lines:

  • ind = sort(rand(1:N, ik)) (here): takes about 0.6 seconds. (Probably this should be sort! to save memory, but that doesn't help performance much.)
  • I, J = ind2sub((m,n), uind) (here): takes about 1.3 seconds.

I'm not sure we can do much about the first, short of speeding up rand and sort! (although we might be able to avoid the sort entirely by a more clever statistical algorithm that generates the indices in order, via computing the probability that each term in 1:N appears P times ... this requires some careful statistics though). But maybe we can make ind2sub faster or avoid it entirely somehow?

Basically, to make this faster it seems like we should compute the entries row by row to start with.
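
(As context for the "generates the indices in order" idea above: if each element of 1:N is kept independently with probability p, the gap to the next kept index is geometrically distributed, so a sorted, duplicate-free subset can be produced directly, with no sort at all. A minimal sketch assuming 0 < p < 1; the helper name ordered_subset is hypothetical, though the fix eventually merged below introduced randsubseq along these lines.)

using Random  # for randexp (a sketch in recent Julia, not 0.3-era code)

function ordered_subset(N::Int, p::Float64)
    out = Int[]
    L = 1 / log1p(-p)                          # 1/log(1-p), negative for 0 < p < 1
    i = 0
    while true
        i += floor(Int, -randexp() * L) + 1    # gap to next kept index ~ Geometric(p)
        i > N && break
        push!(out, i)
    end
    return out                                  # already sorted, no duplicates
end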

@StefanKarpinski (Member)

Is that code trying to get ik unique values from 1:N? If so, this seems like it could be done much more efficiently. Note that choosing a uniformly random subset of 1:N is equivalent to choosing a random N-bit unsigned integer, or equivalently, N/64 random Uint64 values. That avoids any need for sorting or determining uniqueness.
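
(A minimal sketch of the bitmask idea, assuming the uniform case where each element is included with probability 1/2; the helper name random_subset_bits is hypothetical. Note that decoding still costs Θ(N), which is the objection raised below.)

function random_subset_bits(N::Int)
    words = rand(UInt64, cld(N, 64))   # the random N-bit integer, 64 bits at a time
    r = N % 64                         # zero the unused high bits of the last word
    r != 0 && (words[end] &= (UInt64(1) << r) - one(UInt64))
    # decode the set bits into sorted element indices
    return [i for i = 1:N if (words[((i-1) >> 6) + 1] >> ((i-1) & 63)) & 1 == 1]
end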

@stevengj (Member) commented May 1, 2014

@StefanKarpinski, I think we should probably avoid computing random elements in 1:N entirely. Instead, at least for low-density filling of large arrays, we should randomly generate the indices to set in each column (since we are using CSC).

@StefanKarpinski (Member)

Yes, I guess when something's very sparse that's likely to be much better. Of course, when the number of non-zeros is substantial, it's still a lot of work no matter how sparse the matrix is.

@stevengj (Member) commented May 1, 2014

I don't think that large matrices with a large fraction of nonzeros are a practical concern. You shouldn't be using sparse formats at all in that case.

@StefanKarpinski (Member)

I meant huge matrices with a small fraction of non-zeros where the number of non-zeros is still large.

@stevengj (Member) commented May 1, 2014

Of course the work will always be Ω(# nonzeros) (right now it is Θ(# log #)). This is mostly about improving the constant factor (although my proposal also reduces the complexity by a factor of log(#/n)/log(#), since sorting within each column costs # log(#/n) in total rather than # log #).

@stevengj (Member) commented May 1, 2014

@ViralBShah, this is all about speeding up sprand; the benchmark doesn't seem to have anything to do with sparse(). Should the issue be renamed, or is there a different benchmark showing performance problems in sparse?

@stevengj (Member) commented May 1, 2014

@StefanKarpinski, your solution of generating a random N-bit integer is not practical here, because it requires Θ(N) work and storage, which could be many orders of magnitude greater than the number of nonzeros. In any case, working by columns should make the log factor in the sort! negligible.

@StefanKarpinski (Member)

No, I see that. I'm trying to figure out where this is slow. Maybe ind2sub can be sped up, which is a good thing whether we manage to avoid it here or not.

@StefanKarpinski (Member)

I suspect it's because our div and rem operations are so horribly pessimized, but I haven't narrowed it down to that yet. Inlining is making precise profiling a bit difficult.

ViralBShah changed the title from "Speed of sparse()" to "Speed of sprand()" on May 1, 2014
@ViralBShah (Member Author)

Originally, I meant this issue to track the performance problems @rwgardner raised in sparse, but I conflated the sprand test and the sparse test. I have renamed this issue to track sprand, and will file a separate one for sparse if required.

@timholy (Member) commented May 1, 2014

The recent history behind sprand is described in #4726.

Yes, ind2sub is a big bottleneck and surely comes down to our div and rem being slow.
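
(For context, the per-index arithmetic in question is just one div/rem pair; a sketch of the computation for a column-major m-by-n matrix, not Base's exact code, so the divide-heavy inner loop is easy to see.)

# One hardware divide per nonzero: this is the operation ind2sub performs
# for each linear index when converting to (row, col).
function ind2sub_sketch(m::Int, ind::Int)
    col, row = divrem(ind - 1, m)
    return row + 1, col + 1
end

ind2sub_sketch(70000, 70001)  # == (1, 2): first entry of the second column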

@StefanKarpinski (Member)

I still think it's crazy for our integer division to be so damned slow. We do the fast, efficient but slightly unsafe thing with integers everywhere else in the language. I don't think div should be an exception. We could provide a checked_div operation like checked_add, but that seems silly since, unlike overflow, division by zero is easy to explicitly check for. There's also the trick where you can implement div by a fixed divisor as a multiply and a shift, with the factors computed in advance; that would be applicable to a vectorized ind2sub, for example.
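
(A minimal sketch of that multiply-and-shift trick, under the assumption of a fixed divisor 2 <= d < 2^32 and 32-bit dividends; the names magic and fastdiv are hypothetical, not an existing API.)

# Precompute ceil(2^64/d) once; then floor(x/d) is one widening multiply
# plus a shift, with no hardware divide in the loop.
magic(d::Integer) = div(typemax(UInt64), UInt64(d)) + one(UInt64)

fastdiv(x::UInt32, m::UInt64) = ((UInt128(m) * x) >> 64) % UInt32

const M70000 = magic(70000)            # pay for the division once, up front
fastdiv(UInt32(123456789), M70000)     # == UInt32(1763) == div(123456789, 70000)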

@JeffBezanson (Member)

There's a big difference between overflow and undefined behavior. Results that overflow are still well-defined, and can sometimes even be exploited to get the answer you want. The undefined answers, however, are random numbers, or possibly a trap if you're lucky. Getting different results for no clear reason is just not ok.

@StefanKarpinski (Member)

It seems like generating each sparse column separately would be a big win here. The expected density of each column is the same as the density of the matrix as a whole, so that's pretty simple. The hard part is patching things up after the fact if the realized fraction isn't close. It may make more sense to allocate non-zeros to the columns up front and then just choose that many unique indices per column and sort those.

@StefanKarpinski (Member)

So we're just not going to have a way to do integer division efficiently?

@JeffBezanson (Member)

We can if LLVM adds intrinsics for x86 division, or something similar. We can also add our own unsafe division intrinsics, and a compiler flag to turn off the checks globally. We could also decide that the typemin/-1 case is not important, and at least remove the check for that.

@timholy (Member) commented May 1, 2014

I really like the idea of generating the columns separately; I should have thought of that myself. The big win is not needing ind2sub, but as a nice bonus it will slightly decrease the sort time too (due to the log N factor).

@timholy (Member) commented May 1, 2014

Certainly for linear indexing of array views, fast integer division would be huge, and that's presumably a good example of a case where the unsafe algorithm should be fine?

@tkelman (Contributor) commented May 2, 2014

Possibly relevant? Proposed addition of LLVM intrinsics for safe division: http://article.gmane.org/gmane.comp.compilers.llvm.devel/72466

@StefanKarpinski (Member)

This implementation is much faster:

using Distributions

function spr(m,n,p)
    # split the expected number of nonzeros among the n columns at random
    M = Multinomial(iround(m*n*p), n)
    c = cumsum([1; rand(M)])            # colptr
    r = rand(1:m, c[end]-1)             # row indices, drawn with replacement
    for i = 1:length(c)-1
        lo, hi = c[i], c[i+1]-1
        sort!(r, lo, hi, QuickSort, Base.Order.Forward)  # sort rows within each column
    end
    v = rand(length(r))
    S = SparseMatrixCSC(m,n,c,r,v)
end

julia> @time spr(70000, 70000, 0.001)
elapsed time: 0.254208626 seconds (89577536 bytes allocated)
70000x70000 sparse matrix with 4900000 Float64 entries:
    [312  ,     1]  =  0.759103
    [1498 ,     1]  =  0.197246
    [5473 ,     1]  =  0.192934
    [7402 ,     1]  =  0.742496
    [9401 ,     1]  =  0.538901
    [9555 ,     1]  =  0.150737
    [9859 ,     1]  =  0.880329
    ⋮
    [60202, 70000]  =  0.290513
    [61760, 70000]  =  0.828304
    [62626, 70000]  =  0.123417
    [64031, 70000]  =  0.272032
    [64734, 70000]  =  0.451746
    [65528, 70000]  =  0.205945
    [66508, 70000]  =  0.0673513
    [66696, 70000]  =  0.938025

As compared to this:

julia> @time sprand(70000, 70000, 0.001);
elapsed time: 1.817448442 seconds (548796520 bytes allocated)

Of course, it's not quite correct since it allows duplicate row indices in the same column. When I insert some code to try to handle that, it slows back down to the same as what we have now:

function spr(m,n,p)
    M = Multinomial(iround(m*n*p), n)
    c = cumsum([1; rand(M)])
    r = rand(1:m, c[end]-1)
    for i = 1:length(c)-1
        lo, hi = c[i], c[i+1]-1
        sort!(r, lo, hi, QuickSort, Base.Order.Forward)
        # rejection loop: resample any duplicated row index in this column,
        # then re-sort and re-scan until the column is duplicate-free
        while true
            dups = false
            for j = lo:hi-1
                r[j] == r[j+1] || continue
                r[j] = rand(1:m)        # replace one of the duplicates
                dups = true
            end
            dups || break
            sort!(r, lo, hi, InsertionSort, Base.Order.Forward)  # nearly sorted
        end
    end
    v = rand(length(r))
    S = SparseMatrixCSC(m,n,c,r,v)
end

julia> @time spr(70000, 70000, 0.001)
elapsed time: 1.94048937 seconds (652454080 bytes allocated)
70000x70000 sparse matrix with 4900000 Float64 entries:
    [200  ,     1]  =  0.870909
    [316  ,     1]  =  0.675384
    [2708 ,     1]  =  0.561914
    [3678 ,     1]  =  0.919305
    [3717 ,     1]  =  0.777615
    [4848 ,     1]  =  0.628789
    [6701 ,     1]  =  0.00253897
    ⋮

If we can figure out a faster way to handle duplicates, this is a better algorithm by far. Of course, we'd need to copy the multinomial sampling code out of Distributions. Another nice feature of this implementation is that it tends to give exactly the requested sparsity.
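
(One possible way to get a duplicate-free column without the rejection loop, offered as a sketch rather than something from this thread, is Robert Floyd's sampling algorithm: it draws exactly k distinct values from 1:m in O(k) expected work, after which only the per-column sort remains.)

# Floyd's algorithm: exactly k distinct values from 1:m, no retry loop.
function sample_distinct(m::Int, k::Int)
    seen = Set{Int}()
    out = Int[]
    for j = m-k+1:m
        t = rand(1:j)
        if t in seen
            push!(seen, j); push!(out, j)   # t taken already; j is always fresh
        else
            push!(seen, t); push!(out, t)
        end
    end
    sort!(out)   # CSC needs sorted row indices within each column
    return out
end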

@timholy (Member) commented May 2, 2014

Won't this implementation slow down exponentially as the density increases (dups will start being true most of the time)? But I really like the fact that it hits the target density exactly (up to integer rounding).

@stevengj (Member) commented May 2, 2014

This implementation, which fills in the matrix by columns as I suggested, is about four times faster than the original on my machine:

# Return a random sorted subset of 1:N, where each element is included with probability p.
function randomsubset!(p,N,ind,uind)
    if p == 1
        resize!(uind, N)
        uind[:] = 1:N
        return uind
    end
    resize!(uind, 0)
    p == 0 && return uind
    if N < 100 # O(N) algorithm for small N
        for i = 1:N
            rand() < p && push!(uind, i)
        end
    else
        K = p <= 0.5 ? N*p : N*(1-p)
        # Use Newton's method to invert the birthday problem
        L = log1p(-1/N)
        KN = K/N
        k = K
        k = k + (expm1(-k*L) - exp(-k*L)*KN)/L
        k = k + (expm1(-k*L) - exp(-k*L)*KN)/L # for K<N/2, 2 iterations suffice
        ik = ifloor(k)
        if rand() < k - ik
            ik += 1
        end
        if ik == 0
            if p <= 0.5
                return uind
            else
                resize!(uind, N)
                uind[:] = 1:N
                return uind
            end
        end
        if (ik > length(ind))
            resize!(ind, ik)
        end
        for i = 1:ik
            ind[i] = rand(1:N)
        end
        sort!(ind,1,ik,QuickSort, Base.Order.Forward)
        if p <= 0.5
            j = ind[1]
            push!(uind, j)
            uj = j
            for i = 2:ik
                j = ind[i]
                if j != uj
                    push!(uind, j)
                    uj = j
                end
            end
        else
            push!(ind, N+1) # sentinel
            ii = 1
            for i = 1:N
                if i != ind[ii]
                    push!(uind, i)
                else
                    while (i == ind[ii])
                        ii += 1
                    end
                end
            end
        end
    end
    uind
end
function randomsubset(p, N)
    uind = Array(Int, int(N*p))
    sizehint(uind, int(N*p))
    randomsubset!(p,N,Int[],uind)
end
import Base.SparseMatrix.sparse_IJ_sorted!
function sprand2{T}(m::Integer, n::Integer, density::FloatingPoint, rng::Function,::Type{T}=eltype(rng(1)))
    0 <= density <= 1 || throw(ArgumentError("density must be between 0 and 1"))
    N = n*m
    N == 0 && return spzeros(T,m,n)
    N == 1 && return rand() <= density ? sparse(rng(1)) : spzeros(T,1,1)

    I, J = Array(Int, 0), Array(Int, 0) # indices of nonzero elements
    sizehint(I, int(N*density))
    sizehint(J, int(N*density))

    # coldensity = density of nonzero columns
    if density*m > 1
        coldensity = 1 - (1-density)^m 
    else # use Taylor series (= binomial expansion) to avoid cancellation errors
        x = -(coldensity = density * m)
        for k = 2:m
            x *= (k-m-1)*density/k
            oldcoldensity = coldensity
            coldensity -= x
            if oldcoldensity == coldensity
                break
            end
        end
    end

    rows = Array(Int, 0)
    sizehint(rows, int(m*density))
    ind = Int[]
    for j in randomsubset(coldensity, n)
        randomsubset!(density, m, ind, rows)
        append!(I, rows)
        nrows = length(rows)
        Jlen = length(J)
        resize!(J, Jlen+nrows)
        for i = Jlen+1:length(J)
            J[i] = j
        end
    end
    return sparse_IJ_sorted!(I, J, rng(length(I)), m, n, +)  # it will never need to combine
end
sprand2(m::Integer, n::Integer, density::FloatingPoint) = sprand2(m,n,density,rand,Float64)

@StefanKarpinski (Member)

@timholy – yes, and the duplicate elimination is way too slow anyway, so it's not really a serious proposal, just a starting point. The multinomial sampling approach for choosing column counts is quite good though. I had some thoughts about how to deduplicate more efficiently, although perhaps @stevengj's algorithm is better than that.

@timholy (Member) commented May 2, 2014

A 4x speedup is great. And I like the idea of splitting that function out on its own, as it's surely useful in other contexts.

@stevengj (Member) commented May 2, 2014

Note, however, that my solution in its current form is probably not what we want: its distribution of nonzero entries is not equivalent to the original uniform sampling.

This is easily fixed by randomizing the number of entries per column properly.
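
(A sketch of what that fix would mean, as an assumption rather than the merged code: under uniform i.i.d. sampling of entries, each column's nonzero count is an independent Binomial(m, p) draw, whereas the multinomial version above fixes the total.)

using Distributions

m, n, p = 70000, 70000, 0.001
colcounts = rand(Binomial(m, p), n)   # independent per-column counts, matching
                                      # i.i.d. inclusion of each entry with prob. p
colptr = cumsum([1; colcounts])       # colptr for the CSC construction above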

stevengj added commits to stevengj/julia referencing this issue on May 3 and May 4, 2014
stevengj added a commit that referenced this issue on May 6, 2014: "faster sprand, new randsubseq (fixes #6714 and #6708)"
stevengj closed this as completed on May 6, 2014
@stevengj (Member) commented May 6, 2014

(Fix merged.)
