
Performance on small matrix algebra #2489

Closed
lindahua opened this issue Mar 7, 2013 · 23 comments
Labels
performance Must go faster

Comments

@lindahua
Contributor

lindahua commented Mar 7, 2013

The following code compares the current Julia implementation with hand-crafted implementations of matrix multiplication and inversion for 2-by-2 matrices.

a = rand(2, 2)
b = rand(2, 2)

r = a * b  # warming up
@time for i = 1 : 10^6
    r = a * b
end

function mul2(a::Matrix, b::Matrix)
    c = Array(Float64, 2, 2)

    # column-major linear indexing: a[1] = a11, a[2] = a21, a[3] = a12, a[4] = a22
    c[1] = a[1] * b[1] + a[3] * b[2]
    c[2] = a[2] * b[1] + a[4] * b[2]
    c[3] = a[1] * b[3] + a[3] * b[4]
    c[4] = a[2] * b[3] + a[4] * b[4]
    c
end

mul2(a, b) # warming up
@time for i = 1 : 10^6
    r = mul2(a, b)
end

mul2(a, b) is about 4x faster than a * b.

An even more significant difference is observed for inversion.

a = rand(2, 2)

r = inv(a)
@time for i = 1 : 10^6
    r = inv(a)
end

function inv2(a::Matrix)
    c = Array(Float64, 2, 2)

    # inverse of [a11 a12; a21 a22] is [a22 -a12; -a21 a11] / det
    detv = a[1] * a[4] - a[2] * a[3]
    inv_d = 1 / detv

    c[1] = a[4] * inv_d
    c[2] = -a[2] * inv_d
    c[3] = -a[3] * inv_d
    c[4] = a[1] * inv_d
    c
end

r = inv2(a)
@time for i = 1 : 10^6
    r = inv2(a)
end

inv2 is about 25x faster than inv.

A carefully crafted SIMD implementation can yield a 5x to 10x gain over the scalar versions inv2 and mul2 -- which would make it at least two orders of magnitude faster than the current Julia version. There are reference SIMD implementations on Intel's website. Eigen also implements some of these.

Fast implementations for small matrices (e.g. 2x2, 3x3, and 4x4) are important for graphics, geometric computation, and some image processing applications.

@lindahua
Contributor Author

lindahua commented Mar 7, 2013

It is known that the overhead of calling BLAS is significant for small matrices. BLAS routines beat a naive implementation only when the matrix is relatively large (e.g. larger than 8x8 or 16x16).

@JeffBezanson
Member

We actually have special-case 2x2 and 3x3 matmul, but unfortunately over time they've been shoved behind more and more layers of dispatch. We should add at least your inv2, plus a cutoff for calling BLAS.
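
For illustration, a rough sketch of what such a size cutoff could look like (fastmul is a hypothetical name, not part of Base; mul2 is the hand-written kernel from the original post):

# Hypothetical sketch: route tiny matrices around the BLAS-backed path.
function fastmul(a::Matrix{Float64}, b::Matrix{Float64})
    if size(a) == (2, 2) && size(b) == (2, 2)
        return mul2(a, b)   # hand-written 2x2 kernel, no BLAS call overhead
    end
    return a * b            # fall back to the generic (BLAS-backed) path
end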

@ViralBShah
Member

It would be nice to revisit these layers and have faster code paths for smaller matrices. I am not sure it is worthwhile for the time being to have a SIMD small-matrix linear algebra library. I would much rather push for SIMD capabilities in julia than do something special-purpose.

@ViralBShah
Member

SIMD capabilities are discussed in #2299.

@timholy
Member

timholy commented Mar 7, 2013

Yes, but check this out:

julia> @time for i = 1:10^6; r = mul2(a,b); end
elapsed time: 0.15086007118225098 seconds

julia> @time for i = 1:10^6; Base.matmul2x2(r, 'N', 'N', a, b); end
elapsed time: 0.05599689483642578 seconds

About 3x faster than mul2, because it avoids allocating the output. (It may also help to read each array element only once, since that avoids unnecessary bounds checks. See #1392.)

I agree with you on inv2, that's definitely important and currently missing. It would also be good to have 4x4 covered.
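
For reference, an allocation-free variant of the mul2 kernel from the original post might look like the sketch below (mul2! is a hypothetical name, in the spirit of matmul2x2 writing into a preallocated output):

# Sketch: store the 2x2 product into a caller-supplied output, avoiding the allocation.
function mul2!(c::Matrix, a::Matrix, b::Matrix)
    c[1] = a[1] * b[1] + a[3] * b[2]
    c[2] = a[2] * b[1] + a[4] * b[2]
    c[3] = a[1] * b[3] + a[3] * b[4]
    c[4] = a[2] * b[3] + a[4] * b[4]
    c
end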

@lindahua
Contributor Author

lindahua commented Mar 7, 2013

I agree with Viral that we should first make SIMD capability available. Small-matrix computations (e.g. multiplication, equation solving, inversion) can be implemented incredibly fast if SIMD instructions are exposed.

For example, an entire 2x2 matrix can be loaded into a single AVX register, and it takes no more than 5-6 CPU cycles to compute the inversion if done carefully (much faster than the scalar version inv2 above). Intel demonstrated how one can compute the inversion of a 4x4 matrix within 200+ CPU cycles using SSE. There is code for many such operations in C++ libraries, e.g. Eigen.

To achieve this performance, the intrinsics for shuffling an SIMD vector have to be exposed.

This is not an urgent issue, but we should keep it here and revisit this once SIMD is ready.

@ViralBShah
Member

I am thinking it may be interesting to have a SmallMatrix package for experimentation with small linear algebra. It would be really interesting if some of these small objects could be allocated on the stack.

Just for fun, here is something I tried with 2x2, but such a package could have many other things in it, support matrices up to a size that can comfortably fit in L1 cache, and operate seamlessly with larger matrices. We can also experiment with SIMD, when we get it. Eventually, if this is compelling enough, we can move it to Base.

importall Base

immutable smat2x2{T}         
   a::T
   b::T
   c::T
   d::T
end

*(x::smat2x2, y::smat2x2) = smat2x2(x.a * y.a + x.b * y.c,
                                    x.a * y.b + x.b * y.d,
                                    x.c * y.a + x.d * y.c,
                                    x.c * y.b + x.d * y.d)

Now, trying it out, it is marginally slower than matmul2x2:

julia> x = smat2x2(1.0,1.0,1.0,1.0)
julia> y = smat2x2(1.0,1.0,1.0,1.0)
julia> @time for i=1:10^6; z = x*y; end
elapsed time: 0.058766577 seconds
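
Following the same pattern, a hypothetical inversion for this immutable type (mirroring inv2 above, with the fields read as [a b; c d]) might look like:

# Sketch: 2x2 inverse via the adjugate, returning a new immutable value.
function inv(x::smat2x2)
    inv_d = 1 / (x.a * x.d - x.b * x.c)
    smat2x2(x.d * inv_d, -x.b * inv_d, -x.c * inv_d, x.a * inv_d)
end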

@pao
Member

pao commented Mar 10, 2013

While I'm making unreasonable requests (i.e., modtau), if we're playing with small matrices, a nice way to deal with 3x3 and 3x1 cases while getting the performance of 4x4 and 4x1 with SIMD intrinsics would be a really sweet thing to have.

@JeffBezanson
Member

Nope, it is way faster:

julia> @time for i=1:10^6; z = x*y; end
elapsed time: 0.050591359 seconds

julia> @time for i=1:10^6; z = Base.matmul2x2('N','N',a,b); end
elapsed time: 0.132124836 seconds

matmul2x2 is only faster if you add manual memory management.

@ViralBShah
Member

My comparisons were with the manual memory management version. Also, it seems that using immutable here does not make much of a difference.

@lindahua
Contributor Author

It would be great to have a high-performance small matrix package. Implementing a high-performance small matrix library generally requires SIMD support.

More than a year ago, I developed a C++ template library, light-simd (http://code.google.com/p/light-simd/), that uses SSE to implement small-matrix algebra for the sizes 2x2, 2x3, 2x4, 3x2, 3x3, 3x4, 4x2, 4x3, 4x4 and short vectors of length 2, 3, and 4.

That took quite a lot of effort. Basically, for each specific matrix size you have to write a specialized routine. For example, a 2x2 float matrix fits in a single SSE register, whereas a 2x2 double matrix needs two SSE registers; a single AVX register, however, can accommodate a 2x2 double matrix. The code for these different settings is quite different.

Here is the code for computing matrix product of 2x2 matrices:

// suppose SSE vector a holds a 2x2 float matrix A in column-major order, and b holds B
__m128 t1 = _mm_movelh_ps(a, a);   // t1 = [a11, a21, a11, a21]
__m128 t2 = _mm_moveldup_ps(b);   // t2 = [b11, b11, b12, b12]
__m128 t3 = _mm_movehl_ps(a, a);   // t3 = [a12, a22, a12, a22]
__m128 t4 = _mm_movehdup_ps(b);  // t4 = [b21, b21, b22, b22]
t1 = _mm_mul_ps(t1, t2);   // t1 = [a11 * b11, a21 * b11, a11 * b12, a21 * b12]
t3 = _mm_mul_ps(t3, t4);   // t3 = [a12 * b21, a22 * b21, a12 * b22, a22 * b22]
__m128 r = _mm_add_ps(t1, t3);  // the result

I implemented such products for each combination of (m x k) * (k x n) for m, n, k from 1 to 4, which took thousands of lines of C++ code to provide all the specialized versions. However, the performance was amazing, as the CPU's capabilities were pushed to the limit.

If SIMD becomes available in Julia, I can easily port this code (provided that the intrinsics for swizzling/shuffling are exposed).

However, SIMD capability is the prerequisite to make all these happen.

@ViralBShah
Member

How difficult would it be to use the light-simd library with ccall? I also wonder whether the overhead of ccall would dwarf the gains here, but it may be worth the experiment.

@lindahua
Contributor Author

light-simd is a C++ template library ... I think it is beyond the capability of ccall.

@lindahua
Contributor Author

What I was thinking is that we could re-implement these things in Julia, taking light-simd as a reference.

@ViralBShah
Member

I just realized that myself. It would need C wrappers, and it is better to instead focus on adding SIMD instructions to julia.

@lindahua
Contributor Author

I agreed.

@JeffBezanson
Member

That looks like a nice library, very compact and clever. We could potentially call it using swig-generated C wrappers for a couple argument types.

@JeffBezanson
Member

Also, yes, immutable won't affect this simple benchmark, since in every case a matrix is just created to be stored into an untyped context, one at a time, and then thrown away. If you had more elaborate nests of loops and function calls going on you'd probably start to see a difference.
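
For example, here is a sketch of the kind of typed loop where keeping smat2x2 values unboxed could start to pay off (accumulate_products is a hypothetical name):

# Sketch: intermediates live in a typed local variable rather than an untyped container.
function accumulate_products(x::smat2x2, y::smat2x2, n::Int)
    z = x
    for i = 1:n
        z = z * y
    end
    z
end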

@ViralBShah
Member

If the operations are just a few AVX instructions, would SWIG wrappers not destroy the gains? What would be the effort of adding the SIMD types and instructions?

@JeffBezanson
Member

It's true, we'd want to be able to inline them. We could potentially compile them to llvm bitcode and include that.

@felixrehren
Contributor

Current status:

  • a * b takes 25% longer than mul2(a,b)
  • inv(c) takes 10x longer than inv2(c)

Big improvements since the OP. Tests:

julia> a = rand(2,2);
julia> b = rand(2,2);
julia> r = a * b;
julia> @time for i = 1 : 10^6
       r = a * b
       end
  0.112353 seconds (1000.00 k allocations: 106.812 MB, 8.97% gc time)
julia> mul2(a,b);
julia> @time for i = 1 : 10^6
       r = mul2(a,b)
       end
  0.087943 seconds (1000.00 k allocations: 106.812 MB, 6.36% gc time)

julia> c = rand(2,2);
julia> r = inv(c);
julia> @time for i in 1:10^6
       r = inv(c)
       end
  1.072095 seconds (10.00 M allocations: 1.654 GB, 8.07% gc time)
julia> r = inv2(c);
julia> @time for i in 1:10^6
       r = inv2(c)
       end
  0.085211 seconds (1000.00 k allocations: 106.812 MB, 8.60% gc time)

Is this already benchmarked?

@andreasnoack
Member

Notice that we now have https://github.com/JuliaArrays/StaticArrays.jl. Also, although the Base methods possibly could do better for small matrices, the fast versions used in the benchmarks here are not directly comparable to the Base versions because the fast versions use less precise algorithms.
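
For reference, a minimal example of the StaticArrays.jl usage being referred to (assumes the package is installed; a sketch, not taken from the thread):

using StaticArrays

a = @SMatrix rand(2, 2)
b = @SMatrix rand(2, 2)

c = a * b    # unrolled, stack-friendly 2x2 multiply
d = inv(a)   # specialized small-matrix inverse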

@JeffBezanson modified the milestones: 1.x, 1.0 on May 10, 2017
@KristofferC
Member

KristofferC commented May 26, 2017

I feel this is sufficiently covered by StaticArrays (which also automatically uses SIMD in matmul with -O3). Please reopen if anyone disagrees.
