Performance on small matrix algebra #2489
Comments
It is known that the overhead of using BLAS is significant for small matrices. BLAS routines are better than a naive implementation only when the matrix is relatively large (e.g. larger than …).
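For anyone who wants to see the crossover point on their own machine, here is a minimal sketch of that comparison; the triple-loop kernel, sizes, and repetition counts are illustrative, and the crossover depends heavily on the BLAS build and the hardware:

```julia
# Naive triple-loop matmul, for comparison against the BLAS-backed `*`.
function naive_mul!(C, A, B)
    n = size(A, 1)
    fill!(C, zero(eltype(C)))
    for j = 1:n, k = 1:n, i = 1:n
        C[i, j] += A[i, k] * B[k, j]
    end
    return C
end

for n in (2, 4, 8, 16, 32, 64)
    A = rand(n, n); B = rand(n, n); C = zeros(n, n)
    naive_mul!(C, A, B); A * B    # warm up / compile both paths
    t_naive = @elapsed for _ in 1:10_000; naive_mul!(C, A, B); end
    t_blas  = @elapsed for _ in 1:10_000; A * B; end
    println("n = $n: naive/BLAS time ratio = ", t_naive / t_blas)
end
```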
We actually have special-case 2x2 and 3x3 matmul, but unfortunately over time they've been shoved behind more and more layers of dispatch. We should add at least your `inv2`.
It would be nice to revisit these layers and have faster code paths for smaller matrices. I am not sure if it is worthwhile for the time being to have a SIMD small matrix linear algebra library. I would still much rather push for having SIMD capabilities in Julia than do something special purpose.
SIMD capabilities are discussed in #2299.
Yes, but check this out:
About 3x faster than …. I agree with you on ….
I agree with Viral that we should first make SIMD capability available. Small matrix computation (e.g. multiplication, equation solving, inversion, etc.) can be implemented in an incredibly fast way if SIMD instructions are exposed. For example, an entire 2x2 matrix can be loaded into a single AVX register, and it takes no more than 5-6 CPU cycles to compute the inversion if done carefully (much faster than the scalar version `inv2`).

To achieve this performance, the intrinsics for shuffling a SIMD vector have to be exposed. This is not an urgent issue, but we should keep it here and revisit this once SIMD is ready.
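For reference, the closed-form 2x2 inverse that such a SIMD kernel would evaluate is just the adjugate formula, which costs one reciprocal plus a handful of shuffles and sign flips:

```latex
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix},
\qquad
A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}
```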
I am thinking it may be interesting to have a … package. Just for fun, here is something I tried with 2x2, but such a package could have many other things in it, support matrices up to a size that can comfortably fit in L1 cache, and operate seamlessly with larger matrices. We can also experiment with SIMD, when we get it. Eventually, if this is compelling enough, we can move it to Base.
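(The snippet itself is not preserved above; below is a hypothetical sketch of what such a 2x2 type might look like, with illustrative names and the era's `immutable` keyword, spelled `struct` in current Julia:)

```julia
import Base: *

# Hypothetical sketch of an immutable 2x2 matrix type; names are illustrative.
immutable Mat2{T}
    a11::T; a21::T; a12::T; a22::T   # column-major, matching Array
end

# Fully unrolled 2x2 product: no loops, no heap allocation.
*(A::Mat2, B::Mat2) = Mat2(A.a11*B.a11 + A.a12*B.a21,
                           A.a21*B.a11 + A.a22*B.a21,
                           A.a11*B.a12 + A.a12*B.a22,
                           A.a21*B.a12 + A.a22*B.a22)
```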
Now, trying it out, it is marginally slower than matmul2x2:
While I'm making unreasonable requests (i.e., …).
Nope, it is way faster:
My comparisons were with the manual memory management version. Also, it seems that using ….
It would be great to have a high performance small matrix package. Implementing a high performance small matrix library generally needs the support of SIMD. More than a year ago, I developed a C++ template library, light-simd (http://code.google.com/p/light-simd/), that implements small matrix algebra for the sizes 2x2, 2x3, 2x4, 3x2, 3x3, 3x4, 4x2, 4x3, 4x4 and short vectors of length 2, 3, 4, using SSE. That took quite a lot of effort.

Basically, for each specific matrix size, you have to write a specialized routine. For example, a 2x2 float matrix can fit in a single SSE register, while you need two SSE registers to store a 2x2 double matrix; one AVX register, however, is able to accommodate a 2x2 double matrix. The code for these different settings is quite different. Here is the code for computing the matrix product of 2x2 float matrices:

```cpp
// Suppose SSE vector a holds a 2x2 float matrix A (column-major), and b holds B.
__m128 t1 = _mm_movelh_ps(a, a);   // t1 = [a11, a21, a11, a21]
__m128 t2 = _mm_moveldup_ps(b);    // t2 = [b11, b11, b12, b12]
__m128 t3 = _mm_movehl_ps(a, a);   // t3 = [a12, a22, a12, a22]
__m128 t4 = _mm_movehdup_ps(b);    // t4 = [b21, b21, b22, b22]
t1 = _mm_mul_ps(t1, t2);           // t1 = [a11*b11, a21*b11, a11*b12, a21*b12]
t3 = _mm_mul_ps(t3, t4);           // t3 = [a12*b21, a22*b21, a12*b22, a22*b22]
__m128 r = _mm_add_ps(t1, t3);     // r = A * B, column-major
```

I implemented such products for each combination of …. If SIMD is available in Julia, I can easily port this code (provided that the intrinsics for swizzling/shuffling are available). However, SIMD capability is the prerequisite to make all of this happen.
How difficult would it be to use the light-simd library from Julia?
light-simd was a C++ template library ... I think it was beyond the capability of `ccall`.
What I was thinking is that we can re-implement these things in Julia, but take light-simd as a reference.
I just realized that myself. It would need C wrappers, and it is better to instead focus on adding SIMD instructions to Julia.
Agreed.
That looks like a nice library, very compact and clever. We could potentially call it using SWIG-generated C wrappers for a couple of argument types.
Also, yes, `immutable` won't affect this simple benchmark, since in every case a matrix is just created to be stored into an untyped context, one at a time, and then thrown away. If you had more elaborate nests of loops and function calls going on, you'd probably start to see a difference.
If the operations are a few AVX instructions, would SWIG wrappers not destroy the gains? What would be the effort for adding the SIMD types and instructions?
It's true, we'd want to be able to inline them. We could potentially compile them to LLVM bitcode and include that.
Current status:
Big improvements from OP. Tests:
Is this already benchmarked?
Notice that we now have https://github.com/JuliaArrays/StaticArrays.jl. Also, although the Base methods could possibly do better for small matrices, the fast versions used in the benchmarks here are not directly comparable to the Base versions, because the fast versions use less precise algorithms.
I feel this is sufficiently covered by StaticArrays (which also automatically uses SIMD in matmul with -O3). Please reopen if anyone disagrees. |
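For anyone landing on this issue now, the StaticArrays equivalent of the original comparison is one line per operation; the matrix size is part of the type, so `*` and `inv` compile to fully unrolled code:

```julia
using StaticArrays

a = @SMatrix rand(2, 2)
b = @SMatrix rand(2, 2)

a * b     # unrolled 2x2 multiply, no heap allocation
inv(a)    # closed-form 2x2 inverse
```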
Original issue description:

The following code compares the current Julia implementation and a hand-crafted implementation of matrix multiplication and inversion on 2-by-2 matrices. `mul2(a, b)` is about 4x faster than `a * b`. An even more significant difference is observed for inversion: `inv2` is about 25x faster than `inv`.

A carefully crafted SIMD implementation can lead to a 5x to 10x gain compared to the scalar versions `inv2` and `mul2` -- which is at least two orders of magnitude faster than the current Julia version. There are some reference SIMD implementations on Intel's website; Eigen also implements some of these.

Fast implementations for small matrices (e.g. `2 x 2`, `3 x 3`, and `4 x 4`) are important for graphics, geometric computation, and some image processing applications.
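The code block itself is not preserved above; the following is a plausible reconstruction of scalar `mul2` and `inv2` consistent with the description (fully unrolled closed-form 2x2 formulas), not the author's exact code:

```julia
# Plausible reconstruction, not the original snippet.
function mul2(a::Matrix, b::Matrix)
    c = similar(a)
    c[1,1] = a[1,1]*b[1,1] + a[1,2]*b[2,1]
    c[2,1] = a[2,1]*b[1,1] + a[2,2]*b[2,1]
    c[1,2] = a[1,1]*b[1,2] + a[1,2]*b[2,2]
    c[2,2] = a[2,1]*b[1,2] + a[2,2]*b[2,2]
    return c
end

function inv2(a::Matrix)
    d = a[1,1]*a[2,2] - a[1,2]*a[2,1]       # determinant
    [a[2,2] -a[1,2]; -a[2,1] a[1,1]] / d    # adjugate over determinant
end
```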