Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: SparseVectors, Take 2 #13440

Merged
merged 5 commits into from
Oct 13, 2015
Merged

RFC: SparseVectors, Take 2 #13440

merged 5 commits into from
Oct 13, 2015

Conversation

mbauman
Copy link
Member

@mbauman mbauman commented Oct 3, 2015

This introduces the SparseVector to the standard library. It is the descendent in spirit of #11424, but the entirety of the code is imported from JuliaSparse/SparseVectors.jl, with credit to @lindahua as the primary author. I've simply been working on getting it to mesh with Base idioms and names.

TODO:

  • Get tests passing
  • Implement transpose(::SparseVector). Probably has to just return a dense array for now.
  • Rename the module; module SparseMatrix no longer makes sense.
  • Documentation review: look through the manual for instances where matrix needs to be changed to vector or matrix.
  • Determine how or if we should deprecate sparsevec. It's a conglomeration of two orthogonal behaviors that already have independent names: sparse and vec. Unfortunately, the sparse function has so many signatures defined for it there's no unambiguous way to tack on the sparsevec functionality for a few of the methods. Feedback about desired APIs here would be useful. See this comment for specifics. Decision: punt. The current API works fine for now. We can revisit this later.
  • Missing sparse functionality for SparseVector that makes sense:
    • Define rowvals(::SparseVector)? Effectively the same functionality is defined as the currently-non-exported nonzeroinds. Decision: wait to see if this is needed for writing generic code.
    • spones(::SparseVector)
    • sprandbool(::Integer, ::AbstractFloat)
  • Double-check coverage for large missing segments. SparseVectors.jl is nicely covered, so I don't anticipate big missing areas here. Report as of 0d9f1a1: https://codecov.io/github/mbauman/julia?ref=a389967eb3fec6742d3037b386bedbd6df0f7d27 (sparse vectors is over 90%). Only large missing block was indexing by a UnitRange; this is now tested.
  • Squash all my intermediate commits

@mbauman mbauman added sparse Sparse arrays breaking This change will break code labels Oct 3, 2015
@lindahua
Copy link
Contributor

lindahua commented Oct 3, 2015

Thanks a lot for working on this!

end
half_screen_rows = limit ? div(rows - 8, 2) : typemax(Int)
pad = ndigits(n)
k = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this, and the k += 1 below are redundant since k is the loop variable

@ViralBShah ViralBShah added this to the 0.5.0 milestone Oct 4, 2015
@ViralBShah
Copy link
Member

How about deprecating sparsevec? The SparseVector constructor should be good enough to replace it, right?

@lindahua
Copy link
Contributor

lindahua commented Oct 4, 2015

Generally, I think SparseVector may be just a field setter, while functions like sparsevec or sparse can do complicated things (such as sorting/combining elements).

I think we probably don't need sparsevec, but let sparse returns an instance of SparseVector when there is only one dimension.

@ViralBShah
Copy link
Member

I deleted comments that were completely off-topic here.

doc"""
ldltfact(::Union{SparseMatrixCSC,Symmetric{Float64,SparseMatrixCSC{Flaot64,SuiteSparse_long}},Hermitian{Complex{Float64},SparseMatrixCSC{Complex{Float64},SuiteSparse_long}}}; shift=0, perm=Int[]) -> CHOLMOD.Factor

Compute the `LDLt` factorization of a sparse symmetric or Hermitian matrix. A fill-reducing permutation is used. `F = ldltfact(A)` is most frequently used to solve systems of equations `A*x = b` with `F\b`, but also the methods `diag`, `det`, `logdet` are defined for `F`. You can also extract individual factors from `F`, using `F[:L]`. However, since pivoting is on by default, the factorization is internally represented as `A == P'*L*D*L'*P` with a permutation matrix `P`; using just `L` without accounting for `P` will give incorrect answers. To include the effects of permutation, it's typically preferable to extact "combined" factors like `PtL = F[:PtL]` (the equivalent of `P'*L`) and `LtP = F[:UP]` (the equivalent of `L'*P`). The complete list of supported factors is `:L, :PtL, :D, :UP, :U, :LD, :DU, :PtLD, :DUP`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wording here is a little awkward - comma splice. Maybe F = ldltfact(A) is most frequently used to solve systems of equations A*x = b with F\b. The factorization also the methods diag, det, logdet defined.`?

Would be nice to have an explanation of what all these factors are (what is DUP?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is DUP

I agree, but I'm not the one to write that documentation. I just moved the docs here from helpdb.jl.

@mbauman
Copy link
Member Author

mbauman commented Oct 5, 2015

How about deprecating sparsevec? The SparseVector constructor should be good enough to replace it, right?

I think for now it can be split into a combination of the orthogonal names sparse and vec. I'm not entirely sure what to do with the SparseVector name; it'd be nice to have the constructor mirror Vector, but that's awkward until the deprecation for the SparseMatrix module goes through.

r += 1
i2 = I[r]
if i2 == i # accumulate r-th to the l-th entry
V[l] = call(combine, V[l], V[r])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combine(V[l], V[r]) to be similar with the SparseMatrix construct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yup, thanks, this comes from SparseVectors.jl's 0.3 support.

@mbauman
Copy link
Member Author

mbauman commented Oct 5, 2015

I think for now [sparsevec] can be split into a combination of the orthogonal names sparse and vec

So close. There'd be a clash of meanings with sparse(::AbstractVector, ::AbstractVector, ::Integer).

Is sparse([1,2],[1,2],3) a 2x2 matrix with two threes ([3 0; 0 3]) or is it a 3-element sparse vector ([1,2,0])? Any thoughts here?

@@ -0,0 +1,1441 @@
### common.jl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this need the standard license header? Also the file is not common.jl.

@ScottPJones
Copy link
Contributor

I'd asked before about transposing a sparse vector (but Viral deleted question), why couldn't that be returned as a sparse matrix? It wouldn't be quite as compact, but I'd think it would still be smaller than a dense array, or wouldn't it?

@mbauman
Copy link
Member Author

mbauman commented Oct 5, 2015

Our sparse matrices are CSC format. It'd be good for you to thoroughly understand how CSC works if you're interesting in contributing to our sparse library.

julia> S = sparse([1,10^7],[1,1],[1,1])
10000000x1 sparse matrix with 2 Int64 entries:
    [1       ,        1]  =  1
    [10000000,        1]  =  1

julia> Base.summarysize(S)
88

julia> Base.summarysize(S')
80000080

julia> Base.summarysize(zeros(10^7))
80000000

@ScottPJones
Copy link
Contributor

I do understand perfectly well how the CSC format works.
Try the following:
S = sparse(Int32[1,10^7],Int32[1,1],[10,20])
Base.summarysize(S)
You will then see that in that case, you can save almost 50% of the memory.
However, this is really dependent on the # of zero entries, and the type of the value array.
S = sparse(Int32[1,10^7],Int32[1,1],Int128[10,20])
Base.summarysize(S)
Compare that to Base.summarysize(zeros(Int128,10^7)), and you'll see the difference.

@@ -0,0 +1,649 @@
## sparsevec.jl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also has incorrect name and is missing Julia license info.

@lindahua
Copy link
Contributor

lindahua commented Oct 6, 2015

Looks really great!

@jakebolewski
Copy link
Member

I see that you replaced the comparison Ty = eltype(SparseVector(...)); v != zero(Ty) with v != 0. Does this assumption (not assuming a zero method for a given element type) actually hold throughout the sparsematrix code? I feel that comparison to numeric zero has been gradually removed as it can cause perf problems due to type instability.

@mbauman
Copy link
Member Author

mbauman commented Oct 6, 2015

The SparseMatrixCSC code is a little mixed as to whether it compares to zero(T) or 0. These changes were required to get tests to pass. At least some of those tests fail because they store Char, and probably date back to the days when Char <: Integer.

@jakebolewski
Copy link
Member

I think we should be more explicit as to what the interface is assumed rather than having partially implemented / ill defined behavior.

ex:

julia> A = sparse([1,5,10],[1,5,10],['a','b','c'])
10x10 sparse matrix with 3 Char entries:
    [1 ,  1]  =  'a'
    [5 ,  5]  =  'b'
    [10, 10]  =  'c'

julia> full(A)
ERROR: MethodError: `zero` has no method matching zero(::Type{Char})
 in full at sparse/sparsematrix.jl:227


julia> A[1]
'a'

julia> A[2]
ERROR: MethodError: `zero` has no method matching zero(::Type{Char})
 in getindex at sparse/sparsematrix.jl:1398
 in getindex at abstractarray.jl:483

but that might be another issue / PR.

@mbauman mbauman changed the title WIP: SparseVectors, Take 2 RFC: SparseVectors, Take 2 Oct 7, 2015
@mbauman
Copy link
Member Author

mbauman commented Oct 7, 2015

Thank you all for the review comments so far. Unless anyone has anything else, this is now ready as far as I'm concerned. I'm sure there will be a long tail of missing methods, performance improvements, and perhaps some bugs, but I think it's really quite good. It's well-covered by tests, and behaves as I'd expect.

lindahua and others added 5 commits October 13, 2015 11:12
Renamings:
* Rename sparse module to SparseArrays, and improve its separation from base. This makes it very simple to dynamically reload the sparse module. Move docstrings to their proper place
* _copy_convert → collect
* Rename sparsevector to the existing spzeros and sparsevec.
* Use call overloading instead of call

Remove functionality from SparseVectors.jl:
* Simplify and remove some functors
* Remove SparseVectorView
* Remove no-longer-needed ambiguity preventers

Add functionality for SparseVectors:
* Add similar for SparseVector
* Allow sparsevec(::AbstractArray), not just vectors
* Add spzeros(n); adapt some tests to SparseVector
* Allow CHOLMOD linalg with vectors
* Implement (c)transpose(::SparseVector). Returns a dense vector since a one-row CSC structure is effectively dense but with worse performance.
* Add vector sprandbool and allow passing RNG to all vector sprand* functions. Harden tests against random failures.
* Implement, test and doc spones(::SparseVector)

Improve performance for SparseVector indexing:
* Punt to SparseMatrix for some indexing behaviors. Since the datastructures are compatible and SparseMatrix's routines are more optimized, it is easiest to just construct a temporary SparseMatrix and index into that. This is faster in all but the smallest of cases (N<10)
* Use searchsorted for indexing SparseVector by UnitRange. This is about 20% slower on very small vectors, but is faster in general.

Change SparseMatrix behaviors to take advantage of vectors
* Add indexing behaviors for SparseMatrix->SparseVector
* `vec` and `sparsevec` for CSC return SparseVectors
* Update documentation to incorporate vectors

Minor bugfixes and changes to SparseVectors:
* Compare to literal 0 in vector construction and setindex. This matches SparseMatrix semantics, and makes indexing semantics consistent with regard to stored zeros
* Use checkbounds more thoroughly
* Use error types that are consistent with SparseMatrixCSC
* Minor sparse vector display tweaks. Turn on output limiting by default, remove unused variable `k`, implement and use Base.summary
* Fix missing return, add test

Add some tests:
* Add a test and comment to ensure nobody else tries to share data between vectors and matrices
* Ensure indexing is consistent between one-column sparse matrices and sparse vectors, with special attention to stored zeros.
@mbauman
Copy link
Member Author

mbauman commented Oct 13, 2015

Both 64 bit linux workers got hit by the OOM killer.

mbauman added a commit that referenced this pull request Oct 13, 2015
@mbauman mbauman merged commit 6e4c9f1 into master Oct 13, 2015
@mbauman mbauman deleted the mb/sparsevec branch October 13, 2015 21:34
@jakebolewski
Copy link
Member

🍰

@KristofferC
Copy link
Member

Great!

@malmaud
Copy link
Contributor

malmaud commented Oct 14, 2015

🔢

Is there a constructor to create an all-zero sparse matrix of a given size? My brief foray didn't reveal one.

@StefanKarpinski
Copy link
Member

spzeros?

@malmaud
Copy link
Contributor

malmaud commented Oct 14, 2015

yep, thanks.

On Wed, Oct 14, 2015 at 4:54 AM, Stefan Karpinski [email protected]
wrote:

spzeros?


Reply to this email directly or view it on GitHub
#13440 (comment).

@malmaud malmaud mentioned this pull request Oct 14, 2015
@lindahua
Copy link
Contributor

Great, thanks a lot!

@ViralBShah
Copy link
Member

This is amazing! Thanks everyone. Kicking the tires.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code sparse Sparse arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.