
Performance on small matrix algebra #2489

Closed
lindahua opened this issue Mar 7, 2013 · 23 comments
Labels
performance Must go faster

Comments

@lindahua
Contributor

lindahua commented Mar 7, 2013

The following code compares the current Julia implementation with hand-crafted implementations of matrix multiplication and inversion for 2-by-2 matrices.

a = rand(2, 2)
b = rand(2, 2)

r = a * b  # warming up
@time for i = 1 : 10^6
    r = a * b
end

function mul2(a::Matrix, b::Matrix)
    c = Array(Float64, 2, 2)

    # column-major linear indexing: a[1] = a11, a[2] = a21, a[3] = a12, a[4] = a22
    c[1] = a[1] * b[1] + a[3] * b[2]
    c[2] = a[2] * b[1] + a[4] * b[2]
    c[3] = a[1] * b[3] + a[3] * b[4]
    c[4] = a[2] * b[3] + a[4] * b[4]
    c
end

mul2(a, b) # warming up
@time for i = 1 : 10^6
    r = mul2(a, b)
end

mul2(a, b) is about 4x faster than a * b.

An even more significant difference is observed for inversion.

a = rand(2, 2)

r = inv(a)
@time for i = 1 : 10^6
    r = inv(a)
end

function inv2(a::Matrix)
    c = Array(Float64, 2, 2)

    # inverse of [a11 a12; a21 a22] is [a22 -a12; -a21 a11] / det
    detv = a[1] * a[4] - a[2] * a[3]
    inv_d = 1 / detv

    c[1] = a[4] * inv_d
    c[2] = -a[2] * inv_d
    c[3] = -a[3] * inv_d
    c[4] = a[1] * inv_d
    c
end

r = inv2(a)
@time for i = 1 : 10^6
    r = inv2(a)
end

inv2 is about 25x faster than inv.

A carefully crafted SIMD implementation can yield a 5x to 10x gain over the scalar versions inv2 and mul2 -- which would make it at least two orders of magnitude faster than the current Julia version. There are reference SIMD implementations on Intel's website. Eigen also implements some of these.

Fast implementations for small matrices (e.g. 2x2, 3x3, and 4x4) are important for graphics, geometric computation, and some image processing applications.

@lindahua
Contributor Author

lindahua commented Mar 7, 2013

It is known that the overhead of calling BLAS is significant for small matrices. BLAS routines beat a naive implementation only when the matrix is relatively large (e.g. larger than 8x8 or 16x16).

@JeffBezanson
Member

We actually have special-case 2x2 and 3x3 matmul, but unfortunately over time they've been shoved behind more and more layers of dispatch. We should add at least your inv2, plus a cutoff for calling BLAS.
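
For illustration, a rough sketch of what such a size cutoff could look like (fastmul is a hypothetical name, not part of Base; mul2 is the hand-written kernel from the original post):

# Hypothetical sketch: route tiny matrices around the BLAS-backed path.
function fastmul(a::Matrix{Float64}, b::Matrix{Float64})
    if size(a) == (2, 2) && size(b) == (2, 2)
        return mul2(a, b)   # hand-written 2x2 kernel, no BLAS call overhead
    end
    return a * b            # fall back to the generic (BLAS-backed) path
end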

@ViralBShah
Member

It would be nice to revisit these layers and have faster code paths for smaller matrices. I am not sure it is worthwhile for the time being to have a SIMD small-matrix linear algebra library. I would much rather push for SIMD capabilities in julia than do something special-purpose.

@ViralBShah
Member

SIMD capabilities are discussed in #2299.

@timholy
Member

timholy commented Mar 7, 2013

Yes, but check this out:

julia> @time for i = 1:10^6; r = mul2(a,b); end
elapsed time: 0.15086007118225098 seconds

julia> @time for i = 1:10^6; Base.matmul2x2(r, 'N', 'N', a, b); end
elapsed time: 0.05599689483642578 seconds

About 3x faster than mul2, because it avoids allocating the output. (It may also help to read each array element only once, since that avoids unnecessary bounds checks. See #1392.)

I agree with you on inv2, that's definitely important and currently missing. It would also be good to have 4x4 covered.
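
For reference, an allocation-free variant of the mul2 kernel from the original post might look like the sketch below (mul2! is a hypothetical name, in the spirit of matmul2x2 writing into a preallocated output):

# Sketch: store the 2x2 product into a caller-supplied output, avoiding the allocation.
function mul2!(c::Matrix, a::Matrix, b::Matrix)
    c[1] = a[1] * b[1] + a[3] * b[2]
    c[2] = a[2] * b[1] + a[4] * b[2]
    c[3] = a[1] * b[3] + a[3] * b[4]
    c[4] = a[2] * b[3] + a[4] * b[4]
    c
end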

@lindahua
Contributor Author

lindahua commented Mar 7, 2013

I agree with Viral that we should first make SIMD capability available. Small-matrix computations (e.g. multiplication, equation solving, inversion) can be implemented incredibly fast if SIMD instructions are exposed.

For example, an entire 2x2 matrix can be loaded into a single AVX register, and it takes no more than 5-6 CPU cycles to compute the inversion if done carefully (much faster than the scalar version inv2 above). Intel demonstrated how one can compute the inversion of a 4x4 matrix within 200+ CPU cycles using SSE. There is code for many such operations in C++ libraries, e.g. Eigen.

To achieve this performance, the intrinsics for shuffling an SIMD vector have to be exposed.

This is not an urgent issue, but we should keep it here and revisit this once SIMD is ready.

@ViralBShah
Member

I am thinking it may be interesting to have a SmallMatrix package for experimentation with small linear algebra. It would be really interesting if some of these small objects could be allocated on the stack.

Just for fun, here is something I tried with 2x2, but such a package could have many other things in it, support matrices up to a size that can comfortably fit in L1 cache, and operate seamlessly with larger matrices. We can also experiment with SIMD, when we get it. Eventually, if this is compelling enough, we can move it to Base.

importall Base

immutable smat2x2{T}         
   a::T
   b::T
   c::T
   d::T
end

*(x::smat2x2, y::smat2x2) = smat2x2(x.a * y.a + x.b * y.c,
                                    x.a * y.b + x.b * y.d,
                                    x.c * y.a + x.d * y.c,
                                    x.c * y.b + x.d * y.d)

Now, trying it out, it is marginally slower than matmul2x2:

julia> x = smat2x2(1.0,1.0,1.0,1.0)
julia> y = smat2x2(1.0,1.0,1.0,1.0)
julia> @time for i=1:10^6; z = x*y; end
elapsed time: 0.058766577 seconds
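
Following the same pattern, a hypothetical inversion for this immutable type (mirroring inv2 above, with the fields read as [a b; c d]) might look like:

# Sketch: 2x2 inverse via the adjugate, returning a new immutable value.
function inv(x::smat2x2)
    inv_d = 1 / (x.a * x.d - x.b * x.c)
    smat2x2(x.d * inv_d, -x.b * inv_d, -x.c * inv_d, x.a * inv_d)
end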

@pao
Member

pao commented Mar 10, 2013

While I'm making unreasonable requests (i.e., modtau), if we're playing with small matrices, a nice way to deal with 3x3 and 3x1 cases while getting the performance of 4x4 and 4x1 with SIMD intrinsics would be a really sweet thing to have.

@JeffBezanson
Member

Nope, it is way faster:

julia> @time for i=1:10^6; z = x*y; end
elapsed time: 0.050591359 seconds

julia> @time for i=1:10^6; z = Base.matmul2x2('N','N',a,b); end
elapsed time: 0.132124836 seconds

matmul2x2 is only faster if you add manual memory management.

@ViralBShah
Member

My comparisons were with the manual memory management version. Also, it seems that using immutable here does not make much of a difference.

@lindahua
Contributor Author

It would be great to have a high-performance small matrix package. Implementing a high-performance small matrix library generally requires SIMD support.

More than a year ago, I developed a C++ template library, light-simd (http://code.google.com/p/light-simd/), that uses SSE to implement small-matrix algebra for the sizes 2x2, 2x3, 2x4, 3x2, 3x3, 3x4, 4x2, 4x3, 4x4 and short vectors of length 2, 3, and 4.

That took quite a lot of effort. Basically, for each specific matrix size you have to write a specialized routine. For example, a 2x2 float matrix fits in a single SSE register, whereas a 2x2 double matrix needs two SSE registers; a single AVX register, however, can accommodate a 2x2 double matrix. The code for these different settings is quite different.

Here is the code for computing matrix product of 2x2 matrices:

// suppose SSE vector a holds a 2x2 float matrix A in column-major order, and b holds B
__m128 t1 = _mm_movelh_ps(a, a);   // t1 = [a11, a21, a11, a21]
__m128 t2 = _mm_moveldup_ps(b);   // t2 = [b11, b11, b12, b12]
__m128 t3 = _mm_movehl_ps(a, a);   // t3 = [a12, a22, a12, a22]
__m128 t4 = _mm_movehdup_ps(b);  // t4 = [b21, b21, b22, b22]
t1 = _mm_mul_ps(t1, t2);   // t1 = [a11 * b11, a21 * b11, a11 * b12, a21 * b12]
t3 = _mm_mul_ps(t3, t4);   // t3 = [a12 * b21, a22 * b21, a12 * b22, a22 * b22]
__m128 r = _mm_add_ps(t1, t3);  // the result

I implemented such products for each combination of (m x k) * (k x n) for m, n, k from 1 to 4, which took thousands of lines of C++ code to provide all the specialized versions. However, the performance was amazing, as the CPU's capabilities were pushed to the limit.

If SIMD becomes available in Julia, I can easily port this code (provided that the intrinsics for swizzling/shuffling are exposed).

However, SIMD capability is the prerequisite to make all these happen.

@ViralBShah
Member

How difficult would it be to use the light-simd library with ccall? I also wonder whether the overhead of ccall would dwarf the gains here, but it may be worth the experiment.

@lindahua
Contributor Author

light-simd is a C++ template library ... I think it is beyond the capability of ccall.

@lindahua
Contributor Author

What I was thinking is that we could re-implement these things in Julia, taking light-simd as a reference.

@ViralBShah
Member

I just realized that myself. It would need C wrappers, and it is better to instead focus on adding SIMD instructions to julia.

@lindahua
Contributor Author

I agreed.

@JeffBezanson
Member

That looks like a nice library, very compact and clever. We could potentially call it using swig-generated C wrappers for a couple argument types.

@JeffBezanson
Member

Also, yes, immutable won't affect this simple benchmark, since in every case a matrix is just created to be stored into an untyped context, one at a time, and then thrown away. If you had more elaborate nests of loops and function calls going on you'd probably start to see a difference.
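
For example, here is a sketch of the kind of typed loop where keeping smat2x2 values unboxed could start to pay off (accumulate_products is a hypothetical name):

# Sketch: intermediates live in a typed local variable rather than an untyped container.
function accumulate_products(x::smat2x2, y::smat2x2, n::Int)
    z = x
    for i = 1:n
        z = z * y
    end
    z
end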

@ViralBShah
Member

If the operations are just a few AVX instructions, would SWIG wrappers not destroy the gains? What would be the effort of adding the SIMD types and instructions?

@JeffBezanson
Member

It's true, we'd want to be able to inline them. We could potentially compile them to llvm bitcode and include that.

@felixrehren
Contributor

Current status:

  • a * b takes 25% longer than mul2(a,b)
  • inv(c) takes 10x longer than inv2(c)

Big improvements since the OP. Tests:

julia> a = rand(2,2);
julia> b = rand(2,2);
julia> r = a * b;
julia> @time for i = 1 : 10^6
       r = a * b
       end
  0.112353 seconds (1000.00 k allocations: 106.812 MB, 8.97% gc time)
julia> mul2(a,b);
julia> @time for i = 1 : 10^6
       r = mul2(a,b)
       end
  0.087943 seconds (1000.00 k allocations: 106.812 MB, 6.36% gc time)

julia> c = rand(2,2);
julia> r = inv(c);
julia> @time for i in 1:10^6
       r = inv(c)
       end
  1.072095 seconds (10.00 M allocations: 1.654 GB, 8.07% gc time)
julia> r = inv2(c);
julia> @time for i in 1:10^6
       r = inv2(c)
       end
  0.085211 seconds (1000.00 k allocations: 106.812 MB, 8.60% gc time)

Is this already benchmarked?

@andreasnoack
Member

Notice that we now have https://github.com/JuliaArrays/StaticArrays.jl. Also, although the Base methods possibly could do better for small matrices, the fast versions used in the benchmarks here are not directly comparable to the Base versions because the fast versions use less precise algorithms.
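
For reference, a minimal example of the StaticArrays.jl usage being referred to (assumes the package is installed; a sketch, not taken from the thread):

using StaticArrays

a = @SMatrix rand(2, 2)
b = @SMatrix rand(2, 2)

c = a * b    # unrolled, stack-friendly 2x2 multiply
d = inv(a)   # specialized small-matrix inverse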

@JeffBezanson modified the milestones: 1.x, 1.0 on May 10, 2017
@KristofferC
Member

KristofferC commented May 26, 2017

I feel this is sufficiently covered by StaticArrays (which also automatically uses SIMD in matmul with -O3). Please reopen if anyone disagrees.
