
Extending inner products #16573

Closed · cortner opened this issue May 25, 2016 · 28 comments · Fixed by #32739

@cortner commented May 25, 2016

Two related questions:

  1. I have the need for a generalised Euclidean inner product
     dot(x, M, y)  =  dot(x, M * y)

which I don't want to hard-code; I'd prefer the ability to overload it. Is this functionality sufficiently general (I think it is) that it should be in Base by default for the standard arrays?

EDIT: second question is deleted; it needs more thinking.

@andreasnoack (Member)

I think this would be useful to have in Base, and I've thought about adding it several times. One of the benefits is that you can completely avoid the temporary M*y.
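For concreteness, here is a minimal pure-Julia sketch of the requested operation (the name dot3 is hypothetical, chosen to avoid clashing with Base): it returns the same value as dot(x, M * y), but as one fused column-major loop, so the temporary M * y is never allocated.

# Hypothetical sketch: same value as dot(x, M * y), but fused so that
# the temporary M * y is never allocated. Assumes 1-based indexing.
function dot3(x::AbstractVector, M::AbstractMatrix, y::AbstractVector)
    m, n = size(M)
    (length(x) == m && length(y) == n) || throw(DimensionMismatch())
    s = zero(promote_type(eltype(x), eltype(M), eltype(y)))
    @inbounds for j = 1:n            # traverse M column by column
        t = zero(s)
        @simd for i = 1:m
            t += conj(x[i]) * M[i, j]
        end
        s += t * y[j]
    end
    return s
end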

@cortner (Author) commented May 25, 2016

Yes, that was exactly the origin of this; I should have said so, sorry.

@stevengj added the domain:linear algebra label May 25, 2016
@stevengj (Member) commented May 25, 2016

Would be nice to get some benefit from BLAS here, though it's only a BLAS2 operation. @xianyi, do you have a suggestion on the most efficient way to use BLAS for x'*M*y?

One option would be to break x into chunks of, say, 16 elements (big enough to exceed a cache line but small enough that the memory allocation of a length-16 temporary vector is not an issue), compute x'*M on that chunk via BLAS2 using a subset of the columns of M (since Julia is column-major, this doesn't require any copying), then accumulate that chunk of the dot product with y. However, this is suboptimal because it doesn't fully exploit multi-threading: each of the chunk dot products could ideally be computed in a parallel thread.
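A sketch of one reading of this chunking strategy (modern Julia syntax; the function name xMy_chunked is made up): walk the columns of M in blocks of 16, form x'*M[:, block] with a single BLAS2 gemv! into a small reusable buffer, and fold each block into the running dot product with the matching slice of y.

using LinearAlgebra

# Hypothetical sketch of the chunked BLAS2 strategy described above.
function xMy_chunked(x::Vector{Float64}, M::Matrix{Float64}, y::Vector{Float64})
    m, n = size(M)
    (length(x) == m && length(y) == n) || throw(DimensionMismatch())
    chunk = 16
    tmp = Vector{Float64}(undef, chunk)          # small reusable buffer
    z = 0.0
    j = 1
    while j <= n
        c = min(chunk, n - j + 1)
        cols = view(M, :, j:j+c-1)               # contiguous in column-major layout
        t = view(tmp, 1:c)
        BLAS.gemv!('T', 1.0, cols, x, 0.0, t)    # t = cols' * x  (BLAS2)
        z += dot(t, view(y, j:j+c-1))            # fold this chunk into the result
        j += c
    end
    return z
end

Each block's gemv! could in principle run on its own thread, which is the multi-threading opportunity mentioned above.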

@stevengj (Member) commented May 25, 2016

Ideally you'd also want a specialized method for the common case where M is a diagonal matrix, especially a diagonal real matrix.

And tridiagonal "mass" matrices are also common, as well as general sparse matrices M, in problems arising from discretized PDEs.
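For instance, a sketch of the diagonal case (hypothetical helper name): with M = Diagonal(d), x'*M*y collapses to a single O(n) loop with no temporary.

using LinearAlgebra

# Hypothetical sketch of the Diagonal special case:
# x' * Diagonal(d) * y == sum of conj(x[i]) * d[i] * y[i].
# Assumes matching 1-based axes for x, d, and y.
function dot_diag(x::AbstractVector, D::Diagonal, y::AbstractVector)
    d = D.diag
    length(x) == length(d) == length(y) || throw(DimensionMismatch())
    s = zero(promote_type(eltype(x), eltype(d), eltype(y)))
    @inbounds for i in eachindex(d)
        s += conj(x[i]) * d[i] * y[i]
    end
    return s
end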

@cortner (Author) commented May 25, 2016

I was mostly thinking of sparse cases for my own use. These should be relatively easy to implement in Julia?

@andreasnoack (Member)

For the matvec, it seems that multithreading is almost the only speedup you'll get from BLAS. For a 500x500 problem and a Julia double loop with @inbounds and @simd I get

julia> blas_set_num_threads(1);

julia> @time for i = 1:10000; Ac_mul_B!(y, A, x); end # OpenBLAS 1 thread
  0.484342 seconds

julia> blas_set_num_threads(4);

julia> @time for i = 1:10000; Ac_mul_B!(y, A, x); end # OpenBLAS 4 threads
  0.184681 seconds

julia> @time for i = 1:10000; mymul(y, A, x); end # Julia
  0.509110 seconds

so if/when the @threads overhead is gone, a pure-Julia version should probably be okay.
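The thread doesn't show the definition of mymul; a plausible reconstruction of a double loop with @inbounds and @simd computing y = A'x (the pure-Julia counterpart of Ac_mul_B!(y, A, x)) would be:

# Hypothetical reconstruction of the `mymul` benchmarked above: y = A' * x
# as a plain double loop, relying on the compiler for SIMD.
function mymul(y::Vector{Float64}, A::Matrix{Float64}, x::Vector{Float64})
    m, n = size(A)
    (length(y) == n && length(x) == m) || throw(DimensionMismatch())
    @inbounds for j = 1:n
        s = 0.0
        @simd for i = 1:m
            s += A[i, j] * x[i]      # column j of A dotted with x
        end
        y[j] = s
    end
    return y
end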

@stevengj (Member)

I'm surprised there isn't some SIMD speedup.

@mschauer (Contributor)

Probably also worthwhile to special-case x === y and issym(A).
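A sketch of what the x === y, real-symmetric special case could look like (hypothetical name; only the upper triangle of A is read, so roughly half the memory traffic of the general bilinear form):

# Hypothetical sketch: x' * A * x for real symmetric A, reading only the
# upper triangle and counting each off-diagonal term twice.
function quadform_sym(A::AbstractMatrix{T}, x::AbstractVector{T}) where {T<:Real}
    n = length(x)
    size(A) == (n, n) || throw(DimensionMismatch())
    z = zero(T)
    @inbounds for j = 1:n
        s = zero(T)
        for i = 1:j-1
            s += A[i, j] * x[i]              # strict upper triangle
        end
        z += (2s + A[j, j] * x[j]) * x[j]    # off-diagonals appear twice
    end
    return z
end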

@andreasnoack (Member)

@stevengj I don't understand. Both OpenBLAS and Julia use SIMD, so where would you have expected to see the speedup?

@stevengj (Member)

In OpenBLAS the SIMD is hard-coded (I thought), whereas in Julia we rely on the compiler to do its magic, which doesn't always work as well. Maybe this is one of the cases where the compiler does a good job of exploiting SIMD.

@JaredCrean2 (Contributor)

In my experiments with mat-vec and mat-mat, Julia has done a good job exploiting SIMD. For mat-vec my implementation was asymptotically as fast as OpenBLAS, although they use a better algorithm for mat-mat, so theirs was faster in that case (about 3x IIRC).

@simonbyrne (Contributor)

To follow up on this: when everything is sparse we can also do much better by interleaving the computations, e.g. in a recent project I did:

function mul_ut_X_v(u::SparseVector, X::SparseMatrixCSC, v::SparseVector)
    u.n == X.m && X.n == v.n || throw(DimensionMismatch())

    z = zero(promote_type(eltype(u), eltype(X), eltype(v)))
    for (vi, vv) in zip(v.nzind, v.nzval)      # nonzeros of v select columns of X
        X_ptr_lo = X.colptr[vi]
        X_ptr_hi = X.colptr[vi+1] - 1
        if X_ptr_lo <= X_ptr_hi                # column vi has stored entries
            # sparse dot of u with column vi of X, scaled by v[vi]
            z += Base.SparseArrays._spdot(dot, 1, length(u.nzind), u.nzind, u.nzval,
                                          X_ptr_lo, X_ptr_hi, X.rowval, X.nzval) * vv
        end
    end
    z
end

Once we have lazy transposes, we could overload the ternary multiplication operator u' * X * v for this. Until then, I would be fine with overloading dot, or adding a bilinear function.

@simonbyrne added the domain:arrays:sparse label Oct 13, 2017
@simonbyrne (Contributor)

As pointed out by @andyferris, we can now overload ternary * for this, e.g.

function Base.:*(u::RowVector{T}, X::Matrix{T}, v::Vector{T}) where {T}
    m, n = size(X)
    length(u) == m && length(v) == n || throw(DimensionMismatch())
    z = zero(T)
    for j = 1:n
        zz = zero(T)                 # accumulate u * X[:, j]
        @simd for i = 1:m
            @inbounds zz += u[i] * X[i, j]
        end
        z += zz * v[j]
    end
    z
end

This gives a 2x speedup on my machine.

@andyferris (Member)

Credit to @ViralBShah also, I think? (Note we need to make the mechanics of Ac_mul_B a bit different, i.e. nuke it from orbit, to make this work.)

@stevengj (Member)

I still think it would be nice to have dot(x, A, y), even if ternary * calls this.

@simonbyrne (Contributor)

@andyferris Why would we need to change Ac_mul_B?

@andyferris (Member)

I just mean the lowering. You want it to come out as *(adjoint(x), A, y) rather than Ac_mul_B(x, A) * y.

@simonbyrne (Contributor)

Overloading the ternary * seems to do that already

@mbauman (Member) commented Oct 17, 2017

Yeah, as long as you use an explicit *. It's fiddly, though:

julia> expand(Main, :(x'*A*y))
:(adjoint(x) * A * y)

julia> expand(Main, :(x'A*y))
:(Ac_mul_B(x, A) * y)

julia> expand(Main, :(x'*A))
:(Ac_mul_B(x, A))

julia> expand(Main, :(x'A))
:(Ac_mul_B(x, A))

@andreasnoack (Member)

This would be fixed if we settle on a definition of (and implement) a matrix transpose because then all the Ax_mul_Bxs can go away.

@mschauer (Contributor)

Have there been any regrets about implementing a vector transpose?

@andreasnoack (Member)

I think that people are generally happy with the vector transpose. There have been a few comments where people have been surprised/confused/annoyed but I think that is to be expected given the length of #4774.

@StefanKarpinski (Member)

Yes, I have to say the experience with RowVectors, behaviorally embedded as row matrices except for linear-algebraic operations, has been lovely. They generally just do what you want.

@ViralBShah (Member)

The generalized inner product would be nice to have without creating a temporary.

@chriscoey commented May 9, 2019

What is the current thinking on bringing this to Julia? This type of quadratic form x'Qy (often with x = y and Q symmetric) is ubiquitous in mathematical optimization (I'm surprised there is no BLAS function for it). I see there is some implementation in PDMats.jl for the case where Q is PSD (which is restrictive). Brief related discussion: https://discourse.julialang.org/t/matrix-multiplication-a-b-a

@andyferris (Member)

@chriscoey I suspect this might be an issue waiting for someone to contribute a pull request to LinearAlgebra? The discussion above seems pretty receptive to the idea.

@dkarrasch (Member)

So, we do have a ternary * for Diagonals, and I quickly wrote up one for SparseMatrixCSC and AbstractVectors:

function *(x::Adjoint{<:Any,<:AbstractVector}, A::SparseMatrixCSC, y::AbstractVector)
    length(x) == A.m || throw(DimensionMismatch())
    A.n == length(y) || throw(DimensionMismatch())
    T = promote_type(eltype(x), eltype(A), eltype(y))
    out = zero(T)
    rvals = rowvals(A)
    nzvals = nonzeros(A)

    @inbounds for col = 1:A.n
        temp = zero(T)
        for k in nzrange(A, col)        # stored entries of column `col`
            temp += x[rvals[k]] * nzvals[k]
        end
        out += temp * y[col]            # accumulate (x * A)[col] * y[col]
    end
    return out
end

Would we insert this into SparseArrays as is (perhaps together with @simonbyrne's all-sparse version above), or should we first introduce a generic dot(x, A, y) that throws in the general case, add the more specific implementations as methods of it, and make the ternary *s call those?

@stevengj (Member)

I would have the dot(x, A, y) first.
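A hedged sketch of that layering (the helper is named xAy here purely to avoid type piracy in the example): treat the three-argument dot as the primitive, which specialized methods can override per matrix type, and have the ternary * forward to it.

using LinearAlgebra

# Hypothetical forwarding layer: the ternary product dispatches to the
# three-argument dot rather than the other way around.
xAy(x::Adjoint{<:Any,<:AbstractVector}, A::AbstractMatrix, y::AbstractVector) =
    dot(parent(x), A, y)   # dot conjugates its first argument, matching x' * A * y

This is essentially the shape that eventually landed: #32739, the PR that closed this issue, added dot(x, A, y) to LinearAlgebra (available since Julia 1.4).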
