
Can you see the complete assembly-code for a function? #446

Closed
maxhuebner opened this issue Nov 28, 2022 · 14 comments


@maxhuebner

maxhuebner commented Nov 28, 2022

Is there a way for me to see the generated LLVM or assembly-code while using @turbo?

Background:
I'm currently writing my thesis using Julia and related packages. The goal is basically to compare their performance, apply optimizations, and see how well they do.
Using this package I get great results in terms of performance, but am currently unable to figure out how the generated assembly code looks.

If I run @code_native (or @code_llvm) on a function that uses @turbo, I'm not getting the full assembly code. For example, if I run it on the function below, I cannot find a single instruction that would be performing multiplication, addition, or fma.

Other related side questions:

  1. How would I cite this project appropriately?
  2. Julia already uses SIMD instructions if I'm not mistaken. Is this package basically just able to do this better?

function matmult_turbo!(A,B,C)
    n,m = size(A)
    @turbo for j in 1:n 
        for k in 1:n
            for i in 1:n
                C[i,j] += A[i,k]*B[k,j]
            end
        end
    end
end
@chriselrod
Member

I recommend @descend from Cthulhu.jl.
Descend into the _turbo_! call.
You'll be able to toggle debug info on/off (d), as well as LLVM IR (L) or native code (N).

My startup.jl contains

using Cthulhu
Cthulhu.CONFIG.asm_syntax = :intel
Cthulhu.CONFIG.enable_highlighter = true

because I prefer ASM syntax.
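For reference, a minimal session might look like this (a sketch assuming the matmult_turbo! definition from the question; @descend opens an interactive browser, so there is no fixed output):

```julia
using Cthulhu, LoopVectorization

A = rand(200, 200); B = rand(200, 200); C = zeros(200, 200)

# Opens Cthulhu's interactive browser on the call; select the _turbo_! call,
# then press N for native code, L for LLVM IR, and d to toggle debug info.
@descend matmult_turbo!(A, B, C)
```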

For example, I get this from your function for the primary inner-most loop:

L496:
        vbroadcastsd    zmm28, qword ptr [r8]
        vmovupd zmm29, zmmword ptr [rdx]
        vmovupd zmm30, zmmword ptr [rdx + 64]
        vmovupd zmm27, zmmword ptr [rdx + 128]
        prefetcht0      byte ptr [rdx + rax]
        prefetcht0      byte ptr [rdx + rax + 64]
        prefetcht0      byte ptr [rdx + rax + 128]
        vfmadd231pd     zmm18, zmm29, zmm28     # zmm18 = (zmm29 * zmm28) + zmm18
        vfmadd231pd     zmm9, zmm30, zmm28      # zmm9 = (zmm30 * zmm28) + zmm9
        vbroadcastsd    zmm31, qword ptr [r8 + r15]
        vfmadd231pd     zmm0, zmm27, zmm28      # zmm0 = (zmm27 * zmm28) + zmm0
        vfmadd231pd     zmm19, zmm29, zmm31     # zmm19 = (zmm29 * zmm31) + zmm19
        vfmadd231pd     zmm10, zmm30, zmm31     # zmm10 = (zmm30 * zmm31) + zmm10
        vfmadd231pd     zmm1, zmm27, zmm31      # zmm1 = (zmm27 * zmm31) + zmm1
        vbroadcastsd    zmm28, qword ptr [r8 + 2*r15]
        vfmadd231pd     zmm20, zmm29, zmm28     # zmm20 = (zmm29 * zmm28) + zmm20
        vfmadd231pd     zmm11, zmm30, zmm28     # zmm11 = (zmm30 * zmm28) + zmm11
        vfmadd231pd     zmm2, zmm27, zmm28      # zmm2 = (zmm27 * zmm28) + zmm2
        vbroadcastsd    zmm28, qword ptr [r8 + rbx]
        vfmadd231pd     zmm21, zmm29, zmm28     # zmm21 = (zmm29 * zmm28) + zmm21
        vfmadd231pd     zmm12, zmm30, zmm28     # zmm12 = (zmm30 * zmm28) + zmm12
        vfmadd231pd     zmm3, zmm27, zmm28      # zmm3 = (zmm27 * zmm28) + zmm3
        vbroadcastsd    zmm28, qword ptr [r8 + 4*r15]
        vfmadd231pd     zmm22, zmm29, zmm28     # zmm22 = (zmm29 * zmm28) + zmm22
        vfmadd231pd     zmm13, zmm30, zmm28     # zmm13 = (zmm30 * zmm28) + zmm13
        vfmadd231pd     zmm4, zmm27, zmm28      # zmm4 = (zmm27 * zmm28) + zmm4
        vbroadcastsd    zmm28, qword ptr [r8 + r13]
        vfmadd231pd     zmm23, zmm29, zmm28     # zmm23 = (zmm29 * zmm28) + zmm23
        vfmadd231pd     zmm14, zmm30, zmm28     # zmm14 = (zmm30 * zmm28) + zmm14
        vfmadd231pd     zmm5, zmm27, zmm28      # zmm5 = (zmm27 * zmm28) + zmm5
        vbroadcastsd    zmm28, qword ptr [r8 + 2*rbx]
        vfmadd231pd     zmm24, zmm29, zmm28     # zmm24 = (zmm29 * zmm28) + zmm24
        vfmadd231pd     zmm15, zmm30, zmm28     # zmm15 = (zmm30 * zmm28) + zmm15
        vbroadcastsd    zmm31, qword ptr [r8 + r12]
        vfmadd231pd     zmm6, zmm27, zmm28      # zmm6 = (zmm27 * zmm28) + zmm6
        vfmadd231pd     zmm25, zmm29, zmm31     # zmm25 = (zmm29 * zmm31) + zmm25
        vfmadd231pd     zmm16, zmm30, zmm31     # zmm16 = (zmm30 * zmm31) + zmm16
        vfmadd231pd     zmm7, zmm27, zmm31      # zmm7 = (zmm27 * zmm31) + zmm7
        vbroadcastsd    zmm28, qword ptr [r8 + 8*r15]
        vfmadd231pd     zmm26, zmm28, zmm29     # zmm26 = (zmm28 * zmm29) + zmm26
        vfmadd231pd     zmm17, zmm28, zmm30     # zmm17 = (zmm28 * zmm30) + zmm17
        vfmadd231pd     zmm8, zmm27, zmm28      # zmm8 = (zmm27 * zmm28) + zmm8
        add     rdx, r11
        add     r8, 8
        cmp     rdx, rcx
        jbe     L496

Note that this has 12 loads total (3 vmovupd and 9 vbroadcastsd), 0 stores, and 27 fma instructions.

If you compare it to the assembly from @inbounds (with or without @simd or @fastmath), it's quite good; the latter may yield something like

.LBB0_8:                                # %vector.body
                                        #   Parent Loop BB0_2 Depth=1
                                        #     Parent Loop BB0_3 Depth=2
                                        # =>    This Inner Loop Header: Depth=3
        vmovupd zmm1, zmmword ptr [rbp + 8*r15 - 192]
        vmovupd zmm2, zmmword ptr [rbp + 8*r15 - 128]
        vmovupd zmm3, zmmword ptr [rbp + 8*r15 - 64]
        vmovupd zmm4, zmmword ptr [rbp + 8*r15]
        vbroadcastsd    zmm5, xmm0
        vfmadd213pd     zmm1, zmm5, zmmword ptr [rdx + 8*r15 - 192] # zmm1 = (zmm5 * zmm1) + mem
        vfmadd213pd     zmm2, zmm5, zmmword ptr [rdx + 8*r15 - 128] # zmm2 = (zmm5 * zmm2) + mem
        vfmadd213pd     zmm3, zmm5, zmmword ptr [rdx + 8*r15 - 64] # zmm3 = (zmm5 * zmm3) + mem
        vfmadd213pd     zmm4, zmm5, zmmword ptr [rdx + 8*r15] # zmm4 = (zmm5 * zmm4) + mem
        vmovupd zmmword ptr [rdx + 8*r15 - 192], zmm1
        vmovupd zmmword ptr [rdx + 8*r15 - 128], zmm2
        vmovupd zmmword ptr [rdx + 8*r15 - 64], zmm3
        vmovupd zmmword ptr [rdx + 8*r15], zmm4
        add     r15, 32
        cmp     r14, r15
        jne     .LBB0_8

We have 9 loads (4 vmovupd, 4 loads folded into the vfmadd213pds, and 1 vbroadcastsd), 4 stores (vmovupd), and 4 fmas.

Julia already uses SIMD instructions if I'm not mistaken. Is this package basically just able to do this better?

As alluded to above, the compute density (fmas vs load and store instructions) of the @turbo version is much better than the @inbounds @fastmath here, which is why it gets much better performance.
This (compute density relative to loads and stores) is what LoopVectorization.jl tries to optimize in its code generation.
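A quick back-of-the-envelope comparison using the instruction counts above (the counts are taken from the two loop bodies shown; the ratio is only a rough proxy for performance):

```julia
# fmas per memory instruction (loads + stores) in each inner loop
turbo_density    = 27 / (12 + 0)   # @turbo: 27 fmas, 12 loads, 0 stores
fastmath_density = 4 / (9 + 4)     # @inbounds @fastmath: 4 fmas, 9 loads, 4 stores

turbo_density / fastmath_density   # @turbo does ~7x more arithmetic per memory op
```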

I am working on rewriting LoopVectorization with the aim of removing many of its current limitations and hopefully improving code generation in even more cases (though many are already quite close to optimal).

  1. How would I cite this project appropriately?

I'm not sure.
https://citation-file-format.github.io/
I don't have an orcid.

@maxhuebner
Author

Thanks for that thorough explanation, you really helped me out a lot!

Following your example, I was able to extract almost identical assembly code as the one you showed.
However, without @turbo the code looks a bit worse. Julia will use FMA instructions with @fastmath but will not use zmm registers even though they are available. Any idea why, and whether this affects performance?

This is what it looks like with @inbounds and @fastmath (@simd had no visible effect):

L384:
        vbroadcastsd    ymm14, qword ptr [rsi]
        vmovupd ymm15, ymmword ptr [rcx]
        vmovupd ymm0, ymmword ptr [rcx + 32]
        prefetcht0      byte ptr [rcx + rbx]
        add     rcx, r11
        vfmadd231pd     ymm8, ymm15, ymm14      # ymm8 = (ymm15 * ymm14) + ymm8
        vfmadd231pd     ymm2, ymm0, ymm14       # ymm2 = (ymm0 * ymm14) + ymm2
        vbroadcastsd    ymm14, qword ptr [rsi + r13]
        vfmadd231pd     ymm9, ymm15, ymm14      # ymm9 = (ymm15 * ymm14) + ymm9
        vfmadd231pd     ymm3, ymm0, ymm14       # ymm3 = (ymm0 * ymm14) + ymm3
        vbroadcastsd    ymm14, qword ptr [rsi + 2*r13]
        vfmadd231pd     ymm10, ymm15, ymm14     # ymm10 = (ymm15 * ymm14) + ymm10
        vfmadd231pd     ymm4, ymm0, ymm14       # ymm4 = (ymm0 * ymm14) + ymm4
        vbroadcastsd    ymm14, qword ptr [rsi + r15]
        vfmadd231pd     ymm11, ymm15, ymm14     # ymm11 = (ymm15 * ymm14) + ymm11
        vfmadd231pd     ymm5, ymm0, ymm14       # ymm5 = (ymm0 * ymm14) + ymm5
        vbroadcastsd    ymm14, qword ptr [rsi + 4*r13]
        vfmadd231pd     ymm12, ymm15, ymm14     # ymm12 = (ymm15 * ymm14) + ymm12
        vfmadd231pd     ymm6, ymm0, ymm14       # ymm6 = (ymm0 * ymm14) + ymm6
        vbroadcastsd    ymm14, qword ptr [rsi + rax]
        add     rsi, 8
        vfmadd231pd     ymm13, ymm14, ymm15     # ymm13 = (ymm14 * ymm15) + ymm13
        vfmadd231pd     ymm7, ymm0, ymm14       # ymm7 = (ymm0 * ymm14) + ymm7
        cmp     rcx, rdx
        jbe     L384

@chriselrod
Member

chriselrod commented Nov 29, 2022

This is what it looks like with @inbounds and @fastmath (@simd had no visible effect):

You mean that is what @turbo produced, presumably on a CPU without AVX512 (which provides the zmm registers)?
I've never seen assembly like that from @inbounds + @fastmath, but it is exactly what I would expect from @turbo on a CPU with AVX2 but without AVX512.

Julia will use FMA instructions with @fastmath but will not use zmm-registers even tough they are available. Any idea why and whether this would affect the performance?

But you are correct here -- when zmm registers are available, @turbo will use them, but Julia otherwise will not by default.
To get Julia to use them, I start it with -C"native,-prefer-256-bit". This disables the "prefer-256-bit" option, which is on by default.
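For example, a session might look like this (a sketch; the @code_native step assumes some function f you want to inspect):

```shell
# Start Julia with full-width 512-bit (zmm) vectors enabled:
julia -C"native,-prefer-256-bit"

# Then, at the REPL, inspect the generated code for a function of interest:
#   julia> @code_native debuginfo=:none f(args...)
# and look for zmm rather than ymm registers.
```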

@chriselrod
Member

Any idea why and whether this would affect the performance?

Yes, zmm registers should generally be anywhere from a bit up to 2x faster than ymm, depending on the CPU.
Which CPU are you using; mind sharing your versioninfo()?

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake-avx512)
  Threads: 28 on 28 virtual cores

@maxhuebner
Author

mind sharing your versioninfo()?

The CPU I used for this example is a Xeon Gold 6248 (Cascade Lake):

julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)

You mean that is what @turbo produced, presumably on a CPU without AVX512 (which provides the zmm registers)?

No, @turbo produced the following code, which looks just like yours and performs really well:

L496:
        vbroadcastsd    zmm28, qword ptr [r8]
        vmovupd zmm29, zmmword ptr [rdx]
        vmovupd zmm30, zmmword ptr [rdx + 64]
        vmovupd zmm27, zmmword ptr [rdx + 128]
        prefetcht0      byte ptr [rdx + rax]
        prefetcht0      byte ptr [rdx + rax + 64]
        prefetcht0      byte ptr [rdx + rax + 128]
        vfmadd231pd     zmm18, zmm29, zmm28     # zmm18 = (zmm29 * zmm28) + zmm18
        vfmadd231pd     zmm9, zmm30, zmm28      # zmm9 = (zmm30 * zmm28) + zmm9
        vbroadcastsd    zmm31, qword ptr [r8 + r15]
        vfmadd231pd     zmm0, zmm27, zmm28      # zmm0 = (zmm27 * zmm28) + zmm0
        vfmadd231pd     zmm19, zmm29, zmm31     # zmm19 = (zmm29 * zmm31) + zmm19
        vfmadd231pd     zmm10, zmm30, zmm31     # zmm10 = (zmm30 * zmm31) + zmm10
        vfmadd231pd     zmm1, zmm27, zmm31      # zmm1 = (zmm27 * zmm31) + zmm1
        vbroadcastsd    zmm28, qword ptr [r8 + 2*r15]
        vfmadd231pd     zmm20, zmm29, zmm28     # zmm20 = (zmm29 * zmm28) + zmm20
        vfmadd231pd     zmm11, zmm30, zmm28     # zmm11 = (zmm30 * zmm28) + zmm11
        vfmadd231pd     zmm2, zmm27, zmm28      # zmm2 = (zmm27 * zmm28) + zmm2
        vbroadcastsd    zmm28, qword ptr [r8 + rbx]
        vfmadd231pd     zmm21, zmm29, zmm28     # zmm21 = (zmm29 * zmm28) + zmm21
        vfmadd231pd     zmm12, zmm30, zmm28     # zmm12 = (zmm30 * zmm28) + zmm12
        vfmadd231pd     zmm3, zmm27, zmm28      # zmm3 = (zmm27 * zmm28) + zmm3
        vbroadcastsd    zmm28, qword ptr [r8 + 4*r15]
        vfmadd231pd     zmm22, zmm29, zmm28     # zmm22 = (zmm29 * zmm28) + zmm22
        vfmadd231pd     zmm13, zmm30, zmm28     # zmm13 = (zmm30 * zmm28) + zmm13
        vfmadd231pd     zmm4, zmm27, zmm28      # zmm4 = (zmm27 * zmm28) + zmm4
        vbroadcastsd    zmm28, qword ptr [r8 + r13]
        vfmadd231pd     zmm23, zmm29, zmm28     # zmm23 = (zmm29 * zmm28) + zmm23
        vfmadd231pd     zmm14, zmm30, zmm28     # zmm14 = (zmm30 * zmm28) + zmm14
        vfmadd231pd     zmm5, zmm27, zmm28      # zmm5 = (zmm27 * zmm28) + zmm5
        vbroadcastsd    zmm28, qword ptr [r8 + 2*rbx]
        vfmadd231pd     zmm24, zmm29, zmm28     # zmm24 = (zmm29 * zmm28) + zmm24
        vfmadd231pd     zmm15, zmm30, zmm28     # zmm15 = (zmm30 * zmm28) + zmm15
        vbroadcastsd    zmm31, qword ptr [r8 + r12]
        vfmadd231pd     zmm6, zmm27, zmm28      # zmm6 = (zmm27 * zmm28) + zmm6
        vfmadd231pd     zmm25, zmm29, zmm31     # zmm25 = (zmm29 * zmm31) + zmm25
        vfmadd231pd     zmm16, zmm30, zmm31     # zmm16 = (zmm30 * zmm31) + zmm16
        vfmadd231pd     zmm7, zmm27, zmm31      # zmm7 = (zmm27 * zmm31) + zmm7
        vbroadcastsd    zmm28, qword ptr [r8 + 8*r15]
        vfmadd231pd     zmm26, zmm28, zmm29     # zmm26 = (zmm28 * zmm29) + zmm26
        vfmadd231pd     zmm17, zmm28, zmm30     # zmm17 = (zmm28 * zmm30) + zmm17
        vfmadd231pd     zmm8, zmm27, zmm28      # zmm8 = (zmm27 * zmm28) + zmm8
        add     rdx, r11
        add     r8, 8
        cmp     rdx, rcx
        jbe     L496

The code from my last reply was produced without using LoopVectorization.jl; I was just wondering whether you know why Julia does not use AVX-512, although the CPU supports it.

Using the suggestion from your previous reply and starting Julia with julia --check-bounds=no -C"native,-prefer-256-bit", AVX-512 is used, although the generated code with @turbo still looks better, and you already explained why.
Here is what it looks like (with @fastmath and the flag above; without @turbo):

L496:
        vmovupd zmm1, zmmword ptr [r10 + 8*rax - 192]
        vmovupd zmm2, zmmword ptr [r10 + 8*rax - 128]
        vmovupd zmm3, zmmword ptr [r10 + 8*rax - 64]
        vmovupd zmm4, zmmword ptr [r10 + 8*rax]
        vbroadcastsd    zmm5, xmm0
        vfmadd213pd     zmm1, zmm5, zmmword ptr [r9 + 8*rax - 192] # zmm1 = (zmm5 * zmm1) + mem
        vfmadd213pd     zmm2, zmm5, zmmword ptr [r9 + 8*rax - 128] # zmm2 = (zmm5 * zmm2) + mem
        vfmadd213pd     zmm3, zmm5, zmmword ptr [r9 + 8*rax - 64] # zmm3 = (zmm5 * zmm3) + mem
        vfmadd213pd     zmm4, zmm5, zmmword ptr [r9 + 8*rax] # zmm4 = (zmm5 * zmm4) + mem
        vmovupd zmmword ptr [r9 + 8*rax - 192], zmm1
        vmovupd zmmword ptr [r9 + 8*rax - 128], zmm2
        vmovupd zmmword ptr [r9 + 8*rax - 64], zmm3
        vmovupd zmmword ptr [r9 + 8*rax], zmm4

If I only use @inbounds (via --check-bounds=no) the assembly is even worse:

L560:
        vbroadcastsd    ymm1, xmm0
        vmulpd  ymm2, ymm1, ymmword ptr [r8 + 8*r15 - 96]
        vmulpd  ymm3, ymm1, ymmword ptr [r8 + 8*r15 - 64]
        vmulpd  ymm4, ymm1, ymmword ptr [r8 + 8*r15 - 32]
        vmulpd  ymm1, ymm1, ymmword ptr [r8 + 8*r15]
        vaddpd  ymm2, ymm2, ymmword ptr [rcx + 8*r15 - 96]
        vaddpd  ymm3, ymm3, ymmword ptr [rcx + 8*r15 - 64]
        vaddpd  ymm4, ymm4, ymmword ptr [rcx + 8*r15 - 32]
        vaddpd  ymm1, ymm1, ymmword ptr [rcx + 8*r15]
        vmovupd ymmword ptr [rcx + 8*r15 - 96], ymm2
        vmovupd ymmword ptr [rcx + 8*r15 - 64], ymm3
        vmovupd ymmword ptr [rcx + 8*r15 - 32], ymm4
        vmovupd ymmword ptr [rcx + 8*r15], ymm1
        add     r15, 16
        cmp     r9, r15
        jne     L560

Using @turbo compared to this will improve the performance by a factor of 5-6x (Example Measurement)

If you know the reason why Julia is not using AVX-512 by default, I would appreciate it if you'd let me know. Otherwise, thanks again for the in-depth answer.
I was also wondering why Julia can't just do what you are doing with this package. Is there a specific reason why optimizations like yours are not applied by default (if Julia can identify that a loop satisfies the necessary requirements)?

@chriselrod
Member

The code from my last reply was produced without using LoopVectorization.jl;

I find it hard to believe that Julia without LoopVectorization.jl produced this. That linked code is a ymm[0-15] version of the zmm[0-31] @turbo code, which is what LoopVectorization.jl produces on systems without AVX512:
It features 8 loads (2 vmovupd and 6 vbroadcastsd), 12 fmas, and 0 stores.
It also has 1 prefetch.
It also shows the same memory-addressing pattern inside the vbroadcastsd, where it uses the multiples 2*r13 and 4*r13 for the 3rd and 5th broadcasts, and hoisted products for the rest.

@chriselrod
Member

chriselrod commented Nov 30, 2022

Using the suggestion from your previous reply and starting Julia with julia --check-bounds=no -C"native,-prefer-256-bit" AVX-512 will be used, although the generated code with @turbo still looks better and you already explained why.
Here is how it look like (with @fastmath and the parameter above; without @turbo):

Those look like what I would expect from Julia. Note the huge number of loads and stores relative to arithmetic instructions: 1 broadcast + 8 loads, for 4 mul-adds (whether fma or separate muls and adds).

Using @turbo compared to this will improve the performance by a factor of 5-6x

The relative difference will probably be much larger with smaller matrices. Try sizes like 72x72, 144x144, and 216x216.
This is because LoopVectorization.jl does not do any cache tiling, and hence does not do a good job of reusing cache memory; it thus chokes on memory bandwidth.
Note that matrix multiplication is O(N^3) (no one uses Strassen, but I'd like to see more experiments with it). While it uses O(N^2) memory in total, it requires O(N^3) memory traffic -- matching the computational cost -- so memory bandwidth and reuse are just as important when optimizing matrix multiplication for matrices too large for cache.

By dividing matrices into blocks and multiplying blockwise, you can get much better performance at larger sizes.
Gaius.jl takes a naive/recursive approach to this: https://github.com/MasonProtter/Gaius.jl
Octavian.jl takes a more sophisticated approach (but is still maybe a little lacking) and does much better: https://github.com/JuliaLinearAlgebra/Octavian.jl
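To illustrate the blocking idea itself, here is a deliberately naive sketch (nowhere near what Octavian.jl does; bs is an arbitrary block size, and C is assumed to start zeroed, as in the examples above):

```julia
# Multiply block-by-block so each roughly bs-by-bs tile of A, B, and C
# is reused while it is still in cache.
function matmult_blocked!(A, B, C; bs = 64)
    n = size(A, 1)
    for jb in 1:bs:n, kb in 1:bs:n, ib in 1:bs:n
        for j in jb:min(jb + bs - 1, n), k in kb:min(kb + bs - 1, n)
            @inbounds @simd for i in ib:min(ib + bs - 1, n)
                C[i, j] += A[i, k] * B[k, j]
            end
        end
    end
    return C
end
```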

@chriselrod
Member

If you know the reason why Julia is not using AVX-512 by default I would appreciate if you'd let me know.

None of the major compilers use 512 bit vectors by default.
Officially, it's because frequency/thermal throttling can lead to worse overall performance.

Personally, I think it's because they suck at generating good vector code, and thus see little benefit.
For example:

using LoopVectorization, Random, BenchmarkTools


@inline function selfdotsimd(a)
  s = zero(eltype(a))
  @inbounds @simd for i in eachindex(a)
    s += a[i] * a[i]
  end
  s
end

@inline function selfdotturbo(a)
  s = zero(eltype(a))
  @turbo for i in eachindex(a)
    s += a[i] * a[i]
  end
  s
end

function forall!(f, y, x, Ns)
  @inbounds for i = eachindex(y,Ns)
    y[i] = f(@view(x[1:Ns[i]]))
  end
end

# we want to call selfdot for all sizes in a random order (to be harder on the branch predictor)
N = 512;
x = rand(N); Ns = randperm(length(x)); y = similar(x);

@benchmark forall!(selfdotsimd, $y, $x, $Ns)
@benchmark forall!(selfdotturbo, $y, $x, $Ns)

N = 256;
x = rand(N); Ns = randperm(length(x)); y = similar(x);

@benchmark forall!(selfdotsimd, $y, $x, $Ns)
@benchmark forall!(selfdotturbo, $y, $x, $Ns)

using -C"native,-prefer-256-bit":

julia> x = rand(N); Ns = randperm(length(x)); y = similar(x);

julia> @benchmark forall!(selfdotsimd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.153 μs … 25.980 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.227 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.248 μs ± 264.713 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄█▂
  ▂▅███▃▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂ ▂
  12.2 μs         Histogram: frequency by time         13.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark forall!(selfdotturbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.112 μs … 14.994 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.137 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.146 μs ± 106.127 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▅█▅
  ▃▄█████▄▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  6.11 μs         Histogram: frequency by time        6.43 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> N = 256;

julia> x = rand(N); Ns = randperm(length(x)); y = similar(x);

julia> @benchmark forall!(selfdotsimd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.284 μs … 7.578 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.306 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.312 μs ± 49.389 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▅█▅
  ▂▃▄█████▄▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  4.28 μs        Histogram: frequency by time        4.51 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark forall!(selfdotturbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.903 μs … 4.970 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.912 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.916 μs ± 50.064 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█▅                                                      ▂
  ██████▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▄▄▅▅▇█▇▆▆▄▆ █
  1.9 μs       Histogram: log(frequency) by time     2.06 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Default:

julia> # we want to call selfdot for all sizes in a random order (to be harder on the branch predictor)
       N = 512;

julia> x = rand(N); Ns = randperm(length(x)); y = similar(x);

julia> @benchmark forall!(selfdotsimd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  13.382 μs … 23.362 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.525 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.623 μs ± 628.988 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ██
  ▄██▆▃▂▂▂▂▂▁▁▁▂▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂ ▂
  13.4 μs         Histogram: frequency by time         17.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark forall!(selfdotturbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.109 μs … 15.190 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.123 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.133 μs ± 110.947 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▇█▇▆▃                                                      ▂
  ███████▄▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▄▅▆█████▇▆ █
  6.11 μs      Histogram: log(frequency) by time      6.41 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> N = 256;

julia> x = rand(N); Ns = randperm(length(x)); y = similar(x);

julia> @benchmark forall!(selfdotsimd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.554 μs … 6.840 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.581 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.586 μs ± 49.125 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁  █▂
  ▃▆▆█▅███▄▂▃▅▃▂▂▂▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  3.55 μs        Histogram: frequency by time         3.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark forall!(selfdotturbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.852 μs … 4.799 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.859 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.862 μs ± 35.680 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▆██▆▁                                                     ▂
  ██████▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▅▃▆▇▇▇ █
  1.85 μs      Histogram: log(frequency) by time     1.99 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Note that running dot products of all sizes 1:256 is actually faster using 256-bit vectors than with 512-bit vectors:
3.554 microseconds for 256-bit vectors vs 4.284 microseconds for 512-bit.

Of course, LoopVectorization.jl is using 512 bit vectors here and takes 1.9 microseconds, making it substantially faster than both. LoopVectorization.jl would be slower were it to switch to 256 bit vectors.

512 bit gets a slight edge at N=512, but is still way behind LoopVectorization.jl.

Make the arrays much longer, and it becomes dominated by memory bandwidth, so they'll all be equally fast.

@chriselrod
Member

If you want to know what's going on here, look at the plots here: https://juliasimd.github.io/LoopVectorization.jl/latest/examples/dot_product/#Dot-Products

Julia/LLVM gets slower the larger length(array) % (4*vector_width) gets. Shrink vector_width, and you shrink the average size of the scalar remainder, improving performance.
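A tiny sketch of that remainder effect (assuming the vectorized loop consumes 4*W elements per iteration, with W Float64s per vector):

```julia
# Scalar-remainder length left over after the unrolled SIMD loop:
remainder(len, W) = len % (4W)

# Worst-case remainder: 31 elements with 512-bit vectors (W = 8),
# but only 15 with 256-bit vectors (W = 4).
maximum(remainder(len, 8) for len in 1:512)  # 31
maximum(remainder(len, 4) for len in 1:512)  # 15
```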

LoopVectorization.jl doesn't have that problem, so it does see a big improvement from full size vectors and has little reason not to use them.

@chriselrod
Member

chriselrod commented Nov 30, 2022

I was also wondering why Julia can't just do what you are doing with this package. Is there a specific reason, why optimizations like yours are not applied per default (if Julia can identify, that a loop satisfies the necessary limitations)?

I'm working on it. =)
https://github.com/JuliaSIMD/LoopModels
is an ongoing rewrite that uses LLVM. It'll take a long time, but my goal is to eventually allow many of these optimizations to happen by default, without needing a package.

Note that some transforms, like using fma instructions, probably won't happen by default. You'd have to opt in somehow.
Before being in base, there'll be another library for using LoopModels from within Julia that also requires a macro.
It'll probably turn on contract by default, to allow using fma instructions at least (but let people opt out in case they don't want it for some reason).

Note that "if Julia can identify that a loop satisfies the necessary limitations" is a big ask, but LoopModels aims to do this, as well as having much looser restrictions (these go hand in hand: if you can analyze what is legal, you can restrict yourself to a legal subset).
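For reference, that kind of opt-in contraction is already expressible today via muladd (a sketch of the semantics only, not the LoopModels API):

```julia
# muladd permits (but does not require) fusing into a single fma instruction;
# writing out the multiply and add keeps them as separate operations.
f_contract(a, b, c) = muladd(a, b, c)
f_strict(a, b, c)   = a * b + c

# Both compute a*b + c; the fused form may differ in the last bit
# because it skips the intermediate rounding.
f_contract(2.0, 3.0, 1.0)
```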

@maxhuebner
Author

I find that hard to believe that Julia without LoopVectorization.jl produced this.

Well, it looks like you are right.
Although I was pretty sure this code was produced by Julia on its own, I am not able to reproduce it. I examined it on a few more machines, and the code looks similar on all of them (I included the results below, but I don't think they are surprising).
I'm not quite sure how I produced it then, as I was only working on a single machine, which produced the AVX-512 assembly code from above.

Recap: overall, I examined three functions.

Version Turbo

function matmult_turbo!(A,B,C)
    n,m = size(A)
    @turbo for j in 1:n 
        for k in 1:n
            for i in 1:n
                C[i,j] += A[i,k]*B[k,j]
            end
        end
    end
end

Version Standard

function matmult_std!(A,B,C)
    n,m = size(A)
    for j in 1:n 
        for k in 1:n
            for i in 1:n
                C[i,j] += A[i,k]*B[k,j]
            end
        end
    end
end

Version Fastmath

function matmult_fm!(A,B,C)
    n,m = size(A)
    for j in 1:n 
        for k in 1:n
            @fastmath for i in 1:n
                C[i,j] += A[i,k]*B[k,j]
            end
        end
    end
end

All of them are located in a script that I start with julia --check-bounds=no -i loopvec_test.jl, which contains the following code:

using Cthulhu
using LoopVectorization

# [...]
# function definitions from above

n = 200
a = rand(n,n)
b = rand(n,n)
c = zeros(n,n)

Cthulhu.CONFIG.asm_syntax = :intel
Cthulhu.CONFIG.enable_highlighter = true

The assembly code is extracted by running @descend fun!(a,b,c) on each of them.
Here are the results:

CPU: Intel Xeon Gold 6248 (the same one as yesterday)
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)

Turbo:

L496:
        vbroadcastsd    zmm28, qword ptr [r8]
        vmovupd zmm29, zmmword ptr [rdx]
        vmovupd zmm30, zmmword ptr [rdx + 64]
        vmovupd zmm27, zmmword ptr [rdx + 128]
        prefetcht0      byte ptr [rdx + rax]
        prefetcht0      byte ptr [rdx + rax + 64]
        prefetcht0      byte ptr [rdx + rax + 128]
        vfmadd231pd     zmm18, zmm29, zmm28     # zmm18 = (zmm29 * zmm28) + zmm18
        vfmadd231pd     zmm9, zmm30, zmm28      # zmm9 = (zmm30 * zmm28) + zmm9
        vbroadcastsd    zmm31, qword ptr [r8 + r15]
        vfmadd231pd     zmm0, zmm27, zmm28      # zmm0 = (zmm27 * zmm28) + zmm0
        vfmadd231pd     zmm19, zmm29, zmm31     # zmm19 = (zmm29 * zmm31) + zmm19
        vfmadd231pd     zmm10, zmm30, zmm31     # zmm10 = (zmm30 * zmm31) + zmm10
        vfmadd231pd     zmm1, zmm27, zmm31      # zmm1 = (zmm27 * zmm31) + zmm1
        vbroadcastsd    zmm28, qword ptr [r8 + 2*r15]
        vfmadd231pd     zmm20, zmm29, zmm28     # zmm20 = (zmm29 * zmm28) + zmm20
        vfmadd231pd     zmm11, zmm30, zmm28     # zmm11 = (zmm30 * zmm28) + zmm11
        vfmadd231pd     zmm2, zmm27, zmm28      # zmm2 = (zmm27 * zmm28) + zmm2
        vbroadcastsd    zmm28, qword ptr [r8 + rbx]
        vfmadd231pd     zmm21, zmm29, zmm28     # zmm21 = (zmm29 * zmm28) + zmm21
        vfmadd231pd     zmm12, zmm30, zmm28     # zmm12 = (zmm30 * zmm28) + zmm12
        vfmadd231pd     zmm3, zmm27, zmm28      # zmm3 = (zmm27 * zmm28) + zmm3
        vbroadcastsd    zmm28, qword ptr [r8 + 4*r15]
        vfmadd231pd     zmm22, zmm29, zmm28     # zmm22 = (zmm29 * zmm28) + zmm22
        vfmadd231pd     zmm13, zmm30, zmm28     # zmm13 = (zmm30 * zmm28) + zmm13
        vfmadd231pd     zmm4, zmm27, zmm28      # zmm4 = (zmm27 * zmm28) + zmm4
        vbroadcastsd    zmm28, qword ptr [r8 + r13]
        vfmadd231pd     zmm23, zmm29, zmm28     # zmm23 = (zmm29 * zmm28) + zmm23
        vfmadd231pd     zmm14, zmm30, zmm28     # zmm14 = (zmm30 * zmm28) + zmm14
        vfmadd231pd     zmm5, zmm27, zmm28      # zmm5 = (zmm27 * zmm28) + zmm5
        vbroadcastsd    zmm28, qword ptr [r8 + 2*rbx]
        vfmadd231pd     zmm24, zmm29, zmm28     # zmm24 = (zmm29 * zmm28) + zmm24
        vfmadd231pd     zmm15, zmm30, zmm28     # zmm15 = (zmm30 * zmm28) + zmm15
        vbroadcastsd    zmm31, qword ptr [r8 + r12]
        vfmadd231pd     zmm6, zmm27, zmm28      # zmm6 = (zmm27 * zmm28) + zmm6
        vfmadd231pd     zmm25, zmm29, zmm31     # zmm25 = (zmm29 * zmm31) + zmm25
        vfmadd231pd     zmm16, zmm30, zmm31     # zmm16 = (zmm30 * zmm31) + zmm16
        vfmadd231pd     zmm7, zmm27, zmm31      # zmm7 = (zmm27 * zmm31) + zmm7
        vbroadcastsd    zmm28, qword ptr [r8 + 8*r15]
        vfmadd231pd     zmm26, zmm28, zmm29     # zmm26 = (zmm28 * zmm29) + zmm26
        vfmadd231pd     zmm17, zmm28, zmm30     # zmm17 = (zmm28 * zmm30) + zmm17
        vfmadd231pd     zmm8, zmm27, zmm28      # zmm8 = (zmm27 * zmm28) + zmm8
        add     rdx, r11
        add     r8, 8
        cmp     rdx, rcx
        jbe     L496

Standard:

L560:
        vbroadcastsd    ymm1, xmm0
        vmulpd  ymm2, ymm1, ymmword ptr [r8 + 8*r15 - 96]
        vmulpd  ymm3, ymm1, ymmword ptr [r8 + 8*r15 - 64]
        vmulpd  ymm4, ymm1, ymmword ptr [r8 + 8*r15 - 32]
        vmulpd  ymm1, ymm1, ymmword ptr [r8 + 8*r15]
        vaddpd  ymm2, ymm2, ymmword ptr [rcx + 8*r15 - 96]
        vaddpd  ymm3, ymm3, ymmword ptr [rcx + 8*r15 - 64]
        vaddpd  ymm4, ymm4, ymmword ptr [rcx + 8*r15 - 32]
        vaddpd  ymm1, ymm1, ymmword ptr [rcx + 8*r15]
        vmovupd ymmword ptr [rcx + 8*r15 - 96], ymm2
        vmovupd ymmword ptr [rcx + 8*r15 - 64], ymm3
        vmovupd ymmword ptr [rcx + 8*r15 - 32], ymm4
        vmovupd ymmword ptr [rcx + 8*r15], ymm1
        add     r15, 16
        cmp     r9, r15
        jne     L560

Fastmath:

L560:
        vmovupd ymm1, ymmword ptr [r8 + 8*r15 - 96]
        vmovupd ymm2, ymmword ptr [r8 + 8*r15 - 64]
        vmovupd ymm3, ymmword ptr [r8 + 8*r15 - 32]
        vmovupd ymm4, ymmword ptr [r8 + 8*r15]
        vbroadcastsd    ymm5, xmm0
        vfmadd213pd     ymm1, ymm5, ymmword ptr [rcx + 8*r15 - 96] # ymm1 = (ymm5 * ymm1) + mem
        vfmadd213pd     ymm2, ymm5, ymmword ptr [rcx + 8*r15 - 64] # ymm2 = (ymm5 * ymm2) + mem
        vfmadd213pd     ymm3, ymm5, ymmword ptr [rcx + 8*r15 - 32] # ymm3 = (ymm5 * ymm3) + mem
        vfmadd213pd     ymm4, ymm5, ymmword ptr [rcx + 8*r15] # ymm4 = (ymm5 * ymm4) + mem
        vmovupd ymmword ptr [rcx + 8*r15 - 96], ymm1
        vmovupd ymmword ptr [rcx + 8*r15 - 64], ymm2
        vmovupd ymmword ptr [rcx + 8*r15 - 32], ymm3
        vmovupd ymmword ptr [rcx + 8*r15], ymm4
        add     r15, 16
        cmp     r9, r15
        jne     L560
CPU: Intel Core i9-12900K
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 12th Gen Intel(R) Core(TM) i9-12900K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, goldmont)

Turbo:

L416:
        vbroadcastsd    ymm14, qword ptr [rsi]
        vmovupd ymm15, ymmword ptr [rcx]
        vmovupd ymm0, ymmword ptr [rcx + 32]
        prefetcht0      byte ptr [rcx + rbx]
        vmovupd ymm5, ymmword ptr [rsp - 96]
        vmovupd ymm4, ymmword ptr [rsp - 64]
        add     rcx, r11
        vbroadcastsd    ymm1, qword ptr [rsi + r13]
        vbroadcastsd    ymm2, qword ptr [rsi + 2*r13]
        vbroadcastsd    ymm3, qword ptr [rsi + r15]
        vfmadd231pd     ymm4, ymm0, ymm14       # ymm4 = (ymm0 * ymm14) + ymm4
        vfmadd231pd     ymm8, ymm15, ymm14      # ymm8 = (ymm15 * ymm14) + ymm8
        vbroadcastsd    ymm14, qword ptr [rsi + 4*r13]
        vfmadd231pd     ymm5, ymm0, ymm1        # ymm5 = (ymm0 * ymm1) + ymm5
        vmovupd ymmword ptr [rsp - 64], ymm4
        vbroadcastsd    ymm4, qword ptr [rsi + rax]
        vfmadd231pd     ymm10, ymm15, ymm2      # ymm10 = (ymm15 * ymm2) + ymm10
        add     rsi, 8
        vfmadd231pd     ymm9, ymm15, ymm1       # ymm9 = (ymm15 * ymm1) + ymm9
        vfmadd231pd     ymm11, ymm15, ymm3      # ymm11 = (ymm15 * ymm3) + ymm11
        cmp     rcx, rdx
        vmovupd ymmword ptr [rsp - 96], ymm5
        vmovupd ymm5, ymmword ptr [rsp + 32]
        vfmadd231pd     ymm12, ymm15, ymm14     # ymm12 = (ymm15 * ymm14) + ymm12
        vfmadd231pd     ymm6, ymm0, ymm14       # ymm6 = (ymm0 * ymm14) + ymm6
        vfmadd231pd     ymm5, ymm0, ymm2        # ymm5 = (ymm0 * ymm2) + ymm5
        vmovupd ymm2, ymmword ptr [rsp + 96]
        vfmadd231pd     ymm13, ymm4, ymm15      # ymm13 = (ymm4 * ymm15) + ymm13
        vfmadd231pd     ymm7, ymm0, ymm4        # ymm7 = (ymm0 * ymm4) + ymm7
        vmovupd ymmword ptr [rsp + 32], ymm5
        vfmadd231pd     ymm2, ymm0, ymm3        # ymm2 = (ymm0 * ymm3) + ymm2
        vmovupd ymmword ptr [rsp + 96], ymm2
        jbe     L416

Standard:

L544:
        vbroadcastsd    ymm1, xmm0
        vmulpd  ymm2, ymm1, ymmword ptr [r8 + 8*r15 - 96]
        vmulpd  ymm3, ymm1, ymmword ptr [r8 + 8*r15 - 64]
        vmulpd  ymm4, ymm1, ymmword ptr [r8 + 8*r15 - 32]
        vmulpd  ymm1, ymm1, ymmword ptr [r8 + 8*r15]
        vaddpd  ymm2, ymm2, ymmword ptr [rcx + 8*r15 - 96]
        vaddpd  ymm3, ymm3, ymmword ptr [rcx + 8*r15 - 64]
        vaddpd  ymm4, ymm4, ymmword ptr [rcx + 8*r15 - 32]
        vaddpd  ymm1, ymm1, ymmword ptr [rcx + 8*r15]
        vmovupd ymmword ptr [rcx + 8*r15 - 96], ymm2
        vmovupd ymmword ptr [rcx + 8*r15 - 64], ymm3
        vmovupd ymmword ptr [rcx + 8*r15 - 32], ymm4
        vmovupd ymmword ptr [rcx + 8*r15], ymm1
        add     r15, 16
        cmp     r9, r15
        jne     L544

Fastmath:

L544:
        vmovupd ymm1, ymmword ptr [r8 + 8*r15 - 96]
        vmovupd ymm2, ymmword ptr [r8 + 8*r15 - 64]
        vmovupd ymm3, ymmword ptr [r8 + 8*r15 - 32]
        vmovupd ymm4, ymmword ptr [r8 + 8*r15]
        vbroadcastsd    ymm5, xmm0
        vfmadd213pd     ymm1, ymm5, ymmword ptr [rcx + 8*r15 - 96] # ymm1 = (ymm5 * ymm1) + mem
        vfmadd213pd     ymm2, ymm5, ymmword ptr [rcx + 8*r15 - 64] # ymm2 = (ymm5 * ymm2) + mem
        vfmadd213pd     ymm3, ymm5, ymmword ptr [rcx + 8*r15 - 32] # ymm3 = (ymm5 * ymm3) + mem
        vfmadd213pd     ymm4, ymm5, ymmword ptr [rcx + 8*r15] # ymm4 = (ymm5 * ymm4) + mem
        vmovupd ymmword ptr [rcx + 8*r15 - 96], ymm1
        vmovupd ymmword ptr [rcx + 8*r15 - 64], ymm2
        vmovupd ymmword ptr [rcx + 8*r15 - 32], ymm3
        vmovupd ymmword ptr [rcx + 8*r15], ymm4
        add     r15, 16
        cmp     r9, r15
        jne     L544
CPU: AMD Ryzen 9 5900X
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver3)

Turbo:

L384:
        vbroadcastsd    ymm14, qword ptr [rsi]
        vmovupd ymm15, ymmword ptr [rcx]
        vmovupd ymm0, ymmword ptr [rcx + 32]
        prefetcht0      byte ptr [rcx + rbx]
        add     rcx, r11
        vfmadd231pd     ymm8, ymm15, ymm14      # ymm8 = (ymm15 * ymm14) + ymm8
        vfmadd231pd     ymm2, ymm0, ymm14       # ymm2 = (ymm0 * ymm14) + ymm2
        vbroadcastsd    ymm14, qword ptr [rsi + r13]
        vfmadd231pd     ymm9, ymm15, ymm14      # ymm9 = (ymm15 * ymm14) + ymm9
        vfmadd231pd     ymm3, ymm0, ymm14       # ymm3 = (ymm0 * ymm14) + ymm3
        vbroadcastsd    ymm14, qword ptr [rsi + 2*r13]
        vfmadd231pd     ymm10, ymm15, ymm14     # ymm10 = (ymm15 * ymm14) + ymm10
        vfmadd231pd     ymm4, ymm0, ymm14       # ymm4 = (ymm0 * ymm14) + ymm4
        vbroadcastsd    ymm14, qword ptr [rsi + r15]
        vfmadd231pd     ymm11, ymm15, ymm14     # ymm11 = (ymm15 * ymm14) + ymm11
        vfmadd231pd     ymm5, ymm0, ymm14       # ymm5 = (ymm0 * ymm14) + ymm5
        vbroadcastsd    ymm14, qword ptr [rsi + 4*r13]
        vfmadd231pd     ymm12, ymm15, ymm14     # ymm12 = (ymm15 * ymm14) + ymm12
        vfmadd231pd     ymm6, ymm0, ymm14       # ymm6 = (ymm0 * ymm14) + ymm6
        vbroadcastsd    ymm14, qword ptr [rsi + rax]
        add     rsi, 8
        vfmadd231pd     ymm13, ymm14, ymm15     # ymm13 = (ymm14 * ymm15) + ymm13
        vfmadd231pd     ymm7, ymm0, ymm14       # ymm7 = (ymm0 * ymm14) + ymm7
        cmp     rcx, rdx
        jbe     L384

Standard:

L560:
        vbroadcastsd    ymm1, xmm0
        vmulpd  ymm2, ymm1, ymmword ptr [r8 + 8*r9 - 96]
        vmulpd  ymm3, ymm1, ymmword ptr [r8 + 8*r9 - 64]
        vmulpd  ymm4, ymm1, ymmword ptr [r8 + 8*r9 - 32]
        vmulpd  ymm1, ymm1, ymmword ptr [r8 + 8*r9]
        vaddpd  ymm2, ymm2, ymmword ptr [rcx + 8*r9 - 96]
        vaddpd  ymm3, ymm3, ymmword ptr [rcx + 8*r9 - 64]
        vaddpd  ymm4, ymm4, ymmword ptr [rcx + 8*r9 - 32]
        vaddpd  ymm1, ymm1, ymmword ptr [rcx + 8*r9]
        vmovupd ymmword ptr [rcx + 8*r9 - 96], ymm2
        vmovupd ymmword ptr [rcx + 8*r9 - 64], ymm3
        vmovupd ymmword ptr [rcx + 8*r9 - 32], ymm4
        vmovupd ymmword ptr [rcx + 8*r9], ymm1
        add     r9, 16
        cmp     r15, r9
        jne     L560

Fastmath:

L560:
        vbroadcastsd    ymm5, xmm0
        vmovupd ymm1, ymmword ptr [r8 + 8*r9 - 96]
        vmovupd ymm2, ymmword ptr [r8 + 8*r9 - 64]
        vmovupd ymm3, ymmword ptr [r8 + 8*r9 - 32]
        vmovupd ymm4, ymmword ptr [r8 + 8*r9]
        vfmadd213pd     ymm1, ymm5, ymmword ptr [rcx + 8*r9 - 96] # ymm1 = (ymm5 * ymm1) + mem
        vfmadd213pd     ymm2, ymm5, ymmword ptr [rcx + 8*r9 - 64] # ymm2 = (ymm5 * ymm2) + mem
        vfmadd213pd     ymm3, ymm5, ymmword ptr [rcx + 8*r9 - 32] # ymm3 = (ymm5 * ymm3) + mem
        vfmadd213pd     ymm4, ymm5, ymmword ptr [rcx + 8*r9] # ymm4 = (ymm5 * ymm4) + mem
        vmovupd ymmword ptr [rcx + 8*r9 - 96], ymm1
        vmovupd ymmword ptr [rcx + 8*r9 - 64], ymm2
        vmovupd ymmword ptr [rcx + 8*r9 - 32], ymm3
        vmovupd ymmword ptr [rcx + 8*r9], ymm4
        add     r9, 16
        cmp     r15, r9
        jne     L560
        cmp     rax, r15
        je      L320
        jmp     L384
CPU: AMD EPYC 7313
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7313 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver3)

Turbo:

L384:
        vbroadcastsd    ymm14, qword ptr [rsi]
        vmovupd ymm15, ymmword ptr [rcx]
        vmovupd ymm0, ymmword ptr [rcx + 32]
        prefetcht0      byte ptr [rcx + rbx]
        add     rcx, r11
        vfmadd231pd     ymm8, ymm15, ymm14      # ymm8 = (ymm15 * ymm14) + ymm8
        vfmadd231pd     ymm2, ymm0, ymm14       # ymm2 = (ymm0 * ymm14) + ymm2
        vbroadcastsd    ymm14, qword ptr [rsi + r13]
        vfmadd231pd     ymm9, ymm15, ymm14      # ymm9 = (ymm15 * ymm14) + ymm9
        vfmadd231pd     ymm3, ymm0, ymm14       # ymm3 = (ymm0 * ymm14) + ymm3
        vbroadcastsd    ymm14, qword ptr [rsi + 2*r13]
        vfmadd231pd     ymm10, ymm15, ymm14     # ymm10 = (ymm15 * ymm14) + ymm10
        vfmadd231pd     ymm4, ymm0, ymm14       # ymm4 = (ymm0 * ymm14) + ymm4
        vbroadcastsd    ymm14, qword ptr [rsi + r15]
        vfmadd231pd     ymm11, ymm15, ymm14     # ymm11 = (ymm15 * ymm14) + ymm11
        vfmadd231pd     ymm5, ymm0, ymm14       # ymm5 = (ymm0 * ymm14) + ymm5
        vbroadcastsd    ymm14, qword ptr [rsi + 4*r13]
        vfmadd231pd     ymm12, ymm15, ymm14     # ymm12 = (ymm15 * ymm14) + ymm12
        vfmadd231pd     ymm6, ymm0, ymm14       # ymm6 = (ymm0 * ymm14) + ymm6
        vbroadcastsd    ymm14, qword ptr [rsi + rax]
        add     rsi, 8
        vfmadd231pd     ymm13, ymm14, ymm15     # ymm13 = (ymm14 * ymm15) + ymm13
        vfmadd231pd     ymm7, ymm0, ymm14       # ymm7 = (ymm0 * ymm14) + ymm7
        cmp     rcx, rdx
        jbe     L384

Standard:

L560:
        vbroadcastsd    ymm1, xmm0
        vmulpd  ymm2, ymm1, ymmword ptr [r8 + 8*r9 - 96]
        vmulpd  ymm3, ymm1, ymmword ptr [r8 + 8*r9 - 64]
        vmulpd  ymm4, ymm1, ymmword ptr [r8 + 8*r9 - 32]
        vmulpd  ymm1, ymm1, ymmword ptr [r8 + 8*r9]
        vaddpd  ymm2, ymm2, ymmword ptr [rcx + 8*r9 - 96]
        vaddpd  ymm3, ymm3, ymmword ptr [rcx + 8*r9 - 64]
        vaddpd  ymm4, ymm4, ymmword ptr [rcx + 8*r9 - 32]
        vaddpd  ymm1, ymm1, ymmword ptr [rcx + 8*r9]
        vmovupd ymmword ptr [rcx + 8*r9 - 96], ymm2
        vmovupd ymmword ptr [rcx + 8*r9 - 64], ymm3
        vmovupd ymmword ptr [rcx + 8*r9 - 32], ymm4
        vmovupd ymmword ptr [rcx + 8*r9], ymm1
        add     r9, 16
        cmp     r15, r9
        jne     L560
        cmp     rax, r15
        je      L320
        jmp     L384

Fastmath:

L560:
        vbroadcastsd    ymm5, xmm0
        vmovupd ymm1, ymmword ptr [r8 + 8*r9 - 96]
        vmovupd ymm2, ymmword ptr [r8 + 8*r9 - 64]
        vmovupd ymm3, ymmword ptr [r8 + 8*r9 - 32]
        vmovupd ymm4, ymmword ptr [r8 + 8*r9]
        vfmadd213pd     ymm1, ymm5, ymmword ptr [rcx + 8*r9 - 96] # ymm1 = (ymm5 * ymm1) + mem
        vfmadd213pd     ymm2, ymm5, ymmword ptr [rcx + 8*r9 - 64] # ymm2 = (ymm5 * ymm2) + mem
        vfmadd213pd     ymm3, ymm5, ymmword ptr [rcx + 8*r9 - 32] # ymm3 = (ymm5 * ymm3) + mem
        vfmadd213pd     ymm4, ymm5, ymmword ptr [rcx + 8*r9] # ymm4 = (ymm5 * ymm4) + mem
        vmovupd ymmword ptr [rcx + 8*r9 - 96], ymm1
        vmovupd ymmword ptr [rcx + 8*r9 - 64], ymm2
        vmovupd ymmword ptr [rcx + 8*r9 - 32], ymm3
        vmovupd ymmword ptr [rcx + 8*r9], ymm4
        add     r9, 16
        cmp     r15, r9
        jne     L560

@maxhuebner
Author

> The relative difference will probably be much larger with smaller matrices. Try sizes like 72x72, 144x144, 216x216. This is because LoopVectorization.jl does not do any cache tiling, and hence does not do a good job reusing cache memory.

I forgot to mention that I use an approach similar to this, where LoopVectorization.jl is only used to multiply smaller matrices (Source).

function my_mul_turbo!(C,A,B)
    n,m = size(A)
    @turbo for j in 1:m 
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k]*B[k,j]
            end
        end
    end
end

UL(x, n) = @views x[1:div(n,2), 1:div(n,2)]
UR(x, n) = @views x[1:div(n,2), div(n,2)+1:end]
LL(x, n) = @views x[div(n,2)+1:end, 1:div(n,2)]
LR(x, n) = @views x[div(n,2)+1:end, div(n,2)+1:end]

function my_mul_tile!(C,A,B)
    n,m = size(A)
    @assert n == m
    thresh_size = 64
    if n <= thresh_size
        my_mul_turbo!(C,A,B)
        return
    end

    @sync begin
        Threads.@spawn my_mul_tile!(UL(C,n), UL(A,n), UL(B,n))
        Threads.@spawn my_mul_tile!(UR(C,n), UL(A,n), UR(B,n))
        Threads.@spawn my_mul_tile!(LL(C,n), LL(A,n), UL(B,n))
        Threads.@spawn my_mul_tile!(LR(C,n), LL(A,n), UR(B,n))
    end
    @sync begin
        Threads.@spawn my_mul_tile!(UL(C,n), UR(A,n), LL(B,n))
        Threads.@spawn my_mul_tile!(UR(C,n), UR(A,n), LR(B,n))
        Threads.@spawn my_mul_tile!(LL(C,n), LR(A,n), LL(B,n))
        Threads.@spawn my_mul_tile!(LR(C,n), LR(A,n), LR(B,n))
    end
end

This should improve cache behavior, at least compared to the naive algorithm.
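As a quick sanity check, the tiled version can be compared against Julia's built-in multiply. This is an illustrative sketch: it assumes the `my_mul_turbo!`, `my_mul_tile!`, and `UL`/`UR`/`LL`/`LR` definitions above are already loaded, and the size is arbitrary.

```julia
using LinearAlgebra, LoopVectorization

# The tiled multiply should agree with the built-in `*` up to
# floating-point roundoff. Note my_mul_tile! accumulates into C,
# so C must start zeroed.
n = 256
A, B = rand(n, n), rand(n, n)
C = zeros(n, n)
my_mul_tile!(C, A, B)
@assert C ≈ A * B
```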

> By dividing matrices into blocks and multiplying blockwise, you can get much better performance at larger sizes. Gaius.jl takes a naive/recursive approach to this: https://github.com/MasonProtter/Gaius.jl Octavian.jl takes a more sophisticated approach (but is still maybe a little lacking) and does much better: https://github.com/JuliaLinearAlgebra/Octavian.jl

I will take a look at those, thanks again.
Thanks for the included material and measurements as well!

> I'm working on it. =)
> https://github.com/JuliaSIMD/LoopModels
> is an ongoing rewrite that uses LLVM. It'll take a long time, but my goal would be to eventually allow many of these optimizations to happen by default, without needing a package.

I was able to get good performance out of Julia already, but using your package improved it significantly further. It would be great if someday this performance boost were available to users who aren't aware that a package like this exists.

@chriselrod
Member

Woah, something is going wrong on the 12900K. It's spilling registers.

> I forgot to mention, that I use an approach similar to this, where LoopVectorization.jl is only used to calculate smaller matrices

Have you played much with the threshold values? I'm guessing larger thresholds will do better.

If you're curious about how to optimize this further, you could take a look at: https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
Which describes the basic approach used by libraries like GotoBLAS, OpenBLAS, BLIS, and Octavian.

Of course, that function is already going to be vastly better than the naive approach.
What you're doing is basically the same as Gaius.jl.

Or, if you want to try something else fun (I know you have a dissertation to finish, but...), a Julia Strassen would be neat. Basically, you can replace those 8 matmuls inside my_mul_tile! with only 7 (but a bunch of matrix addition/subtractions).
For large enough sizes, that 1 matmul you save would take a lot longer than all the addition/subtractions, so above that size in the recursion you'd use Strassen, and below you'd use the more traditional O(N^3).
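A minimal sketch of what that could look like is below. The names `strassen!` and `cutoff` are hypothetical; the sketch assumes square matrices whose size stays divisible by 2 down to the cutoff, reuses `my_mul_turbo!` and the `UL`/`UR`/`LL`/`LR` quadrant helpers from earlier in the thread, and allocates temporaries freely, so it shows the structure rather than a tuned implementation.

```julia
# Recursive Strassen: 7 half-size multiplies instead of 8, at the
# cost of extra O(n^2) additions/subtractions. Falls back to the
# @turbo kernel below the cutoff.
function strassen!(C, A, B; cutoff = 128)
    n = size(A, 1)
    if n <= cutoff
        fill!(C, zero(eltype(C)))  # my_mul_turbo! accumulates into C
        my_mul_turbo!(C, A, B)
        return C
    end
    A11, A12, A21, A22 = UL(A, n), UR(A, n), LL(A, n), LR(A, n)
    B11, B12, B21, B22 = UL(B, n), UR(B, n), LL(B, n), LR(B, n)
    h = div(n, 2)
    M = [similar(C, h, h) for _ in 1:7]  # the 7 Strassen products
    strassen!(M[1], A11 .+ A22, B11 .+ B22; cutoff)
    strassen!(M[2], A21 .+ A22, copy(B11); cutoff)
    strassen!(M[3], copy(A11), B12 .- B22; cutoff)
    strassen!(M[4], copy(A22), B21 .- B11; cutoff)
    strassen!(M[5], A11 .+ A12, copy(B22); cutoff)
    strassen!(M[6], A21 .- A11, B11 .+ B12; cutoff)
    strassen!(M[7], A12 .- A22, B21 .+ B22; cutoff)
    # Recombine quadrants of C from the 7 products.
    UL(C, n) .= M[1] .+ M[4] .- M[5] .+ M[7]
    UR(C, n) .= M[3] .+ M[5]
    LL(C, n) .= M[2] .+ M[4]
    LR(C, n) .= M[1] .- M[3] .+ M[6] .+ M[7]
    return C
end
```

Above some crossover size, the multiply saved per level outweighs the extra additions; below it, the classical O(N^3) kernel wins.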

> I was able to get good performance out of Julia already, but using your package improved it further significantly. Would be great if someday this performance boost was available to a user, who maybe isn't aware that a package like this exists.

Yeah, that's one of my long term goals, but it'll be a long road, getting from here to there.

@maxhuebner
Copy link
Author

> Woah, something is going wrong on the 12900K. It's spilling registers.

Interesting. I have never actually used it for serious measurements because I didn't want to deal with the efficiency cores (E-cores), so I cannot provide further info. Is this worrisome?

> Have you played much with the threshold values? I'm guessing larger thresholds will do better.

You are correct: I tested thresholds from $2^4 = 16$ to $2^9 = 512$, and these were the results:
(Row 6 seems odd; this CPU should do much better.)

  device             n  time threads threshold
  <chr>          <dbl> <dbl>   <dbl>     <dbl>
1 Xeon Gold 6248  8192  1.78      80       128
2 Ryzen 9 5900x   8192  3.01      24       256
3 Xeon E5-2630    8192  5.21      16       256
4 ThunderX2       8192  6.72     224       256
5 Core i7-6950X   8192 11.5       20       128
6 AMD EPYC 7313   8192 15.0       32       256

> If you're curious about how to optimize this further, you could take a look at: https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
> Which describes the basic approach used by libraries like GotoBLAS, OpenBLAS, BLIS, and Octavian.

Looks interesting, I'll have a look, thanks!

> Or, if you want to try something else fun (I know you have a dissertation to finish, but...), a Julia Strassen would be neat.

This is interesting as well. I might try to implement it after the thesis is done to improve my Julia skills, especially since Julia works a bit differently from what I'm used to. I'll certainly try to use the language more often in the future, and particularly @turbo for compute-intensive work.

This issue was closed.