Can you see the complete assembly-code for a function? #446
I recommend Cthulhu.jl's @descend. I use
Cthulhu.CONFIG.asm_syntax = :intel
Cthulhu.CONFIG.enable_highlighter = true
because I prefer Intel ASM syntax. For example, I get this from your function for the primary inner-most loop:
L496:
vbroadcastsd zmm28, qword ptr [r8]
vmovupd zmm29, zmmword ptr [rdx]
vmovupd zmm30, zmmword ptr [rdx + 64]
vmovupd zmm27, zmmword ptr [rdx + 128]
prefetcht0 byte ptr [rdx + rax]
prefetcht0 byte ptr [rdx + rax + 64]
prefetcht0 byte ptr [rdx + rax + 128]
vfmadd231pd zmm18, zmm29, zmm28 # zmm18 = (zmm29 * zmm28) + zmm18
vfmadd231pd zmm9, zmm30, zmm28 # zmm9 = (zmm30 * zmm28) + zmm9
vbroadcastsd zmm31, qword ptr [r8 + r15]
vfmadd231pd zmm0, zmm27, zmm28 # zmm0 = (zmm27 * zmm28) + zmm0
vfmadd231pd zmm19, zmm29, zmm31 # zmm19 = (zmm29 * zmm31) + zmm19
vfmadd231pd zmm10, zmm30, zmm31 # zmm10 = (zmm30 * zmm31) + zmm10
vfmadd231pd zmm1, zmm27, zmm31 # zmm1 = (zmm27 * zmm31) + zmm1
vbroadcastsd zmm28, qword ptr [r8 + 2*r15]
vfmadd231pd zmm20, zmm29, zmm28 # zmm20 = (zmm29 * zmm28) + zmm20
vfmadd231pd zmm11, zmm30, zmm28 # zmm11 = (zmm30 * zmm28) + zmm11
vfmadd231pd zmm2, zmm27, zmm28 # zmm2 = (zmm27 * zmm28) + zmm2
vbroadcastsd zmm28, qword ptr [r8 + rbx]
vfmadd231pd zmm21, zmm29, zmm28 # zmm21 = (zmm29 * zmm28) + zmm21
vfmadd231pd zmm12, zmm30, zmm28 # zmm12 = (zmm30 * zmm28) + zmm12
vfmadd231pd zmm3, zmm27, zmm28 # zmm3 = (zmm27 * zmm28) + zmm3
vbroadcastsd zmm28, qword ptr [r8 + 4*r15]
vfmadd231pd zmm22, zmm29, zmm28 # zmm22 = (zmm29 * zmm28) + zmm22
vfmadd231pd zmm13, zmm30, zmm28 # zmm13 = (zmm30 * zmm28) + zmm13
vfmadd231pd zmm4, zmm27, zmm28 # zmm4 = (zmm27 * zmm28) + zmm4
vbroadcastsd zmm28, qword ptr [r8 + r13]
vfmadd231pd zmm23, zmm29, zmm28 # zmm23 = (zmm29 * zmm28) + zmm23
vfmadd231pd zmm14, zmm30, zmm28 # zmm14 = (zmm30 * zmm28) + zmm14
vfmadd231pd zmm5, zmm27, zmm28 # zmm5 = (zmm27 * zmm28) + zmm5
vbroadcastsd zmm28, qword ptr [r8 + 2*rbx]
vfmadd231pd zmm24, zmm29, zmm28 # zmm24 = (zmm29 * zmm28) + zmm24
vfmadd231pd zmm15, zmm30, zmm28 # zmm15 = (zmm30 * zmm28) + zmm15
vbroadcastsd zmm31, qword ptr [r8 + r12]
vfmadd231pd zmm6, zmm27, zmm28 # zmm6 = (zmm27 * zmm28) + zmm6
vfmadd231pd zmm25, zmm29, zmm31 # zmm25 = (zmm29 * zmm31) + zmm25
vfmadd231pd zmm16, zmm30, zmm31 # zmm16 = (zmm30 * zmm31) + zmm16
vfmadd231pd zmm7, zmm27, zmm31 # zmm7 = (zmm27 * zmm31) + zmm7
vbroadcastsd zmm28, qword ptr [r8 + 8*r15]
vfmadd231pd zmm26, zmm28, zmm29 # zmm26 = (zmm28 * zmm29) + zmm26
vfmadd231pd zmm17, zmm28, zmm30 # zmm17 = (zmm28 * zmm30) + zmm17
vfmadd231pd zmm8, zmm27, zmm28 # zmm8 = (zmm27 * zmm28) + zmm8
add rdx, r11
add r8, 8
cmp rdx, rcx
jbe L496
Note that this has 12 loads total (9 broadcasts + 3 vector loads) feeding 27 FMAs. If you compare it to the assembly of the loop below:
.LBB0_8: # %vector.body
# Parent Loop BB0_2 Depth=1
# Parent Loop BB0_3 Depth=2
# => This Inner Loop Header: Depth=3
vmovupd zmm1, zmmword ptr [rbp + 8*r15 - 192]
vmovupd zmm2, zmmword ptr [rbp + 8*r15 - 128]
vmovupd zmm3, zmmword ptr [rbp + 8*r15 - 64]
vmovupd zmm4, zmmword ptr [rbp + 8*r15]
vbroadcastsd zmm5, xmm0
vfmadd213pd zmm1, zmm5, zmmword ptr [rdx + 8*r15 - 192] # zmm1 = (zmm5 * zmm1) + mem
vfmadd213pd zmm2, zmm5, zmmword ptr [rdx + 8*r15 - 128] # zmm2 = (zmm5 * zmm2) + mem
vfmadd213pd zmm3, zmm5, zmmword ptr [rdx + 8*r15 - 64] # zmm3 = (zmm5 * zmm3) + mem
vfmadd213pd zmm4, zmm5, zmmword ptr [rdx + 8*r15] # zmm4 = (zmm5 * zmm4) + mem
vmovupd zmmword ptr [rdx + 8*r15 - 192], zmm1
vmovupd zmmword ptr [rdx + 8*r15 - 128], zmm2
vmovupd zmmword ptr [rdx + 8*r15 - 64], zmm3
vmovupd zmmword ptr [rdx + 8*r15], zmm4
add r15, 32
cmp r14, r15
jne .LBB0_8
We have 9 loads (4 vector loads, 1 broadcast, and 4 memory operands folded into the FMAs) and 4 stores, for only 4 FMAs.
As alluded to above, the compute density (FMAs vs. load and store instructions) of the @turbo kernel is far higher. I am working on rewriting LoopVectorization with the aim of removing many of its current limitations and hopefully further improving code generation in more cases (but many are already quite close to optimal).
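That density argument can be sketched with a back-of-envelope model (my own simplification, not LoopVectorization's actual cost function): for an mr×nr register tile of accumulators with SIMD width W doubles, each iteration of the k loop performs (mr÷W)·nr FMAs while issuing only (mr÷W) vector loads of A plus nr broadcasts of B.

```julia
# Back-of-envelope compute density of a register-tiled matmul micro-kernel.
# mr×nr is the tile of C kept in registers, W the number of doubles per vector.
# Assumed model: per k-iteration we do (mr ÷ W)*nr FMAs and need (mr ÷ W)
# vector loads of A plus nr broadcasts of B (C stays in registers).
density(mr, nr, W) = ((mr ÷ W) * nr) / ((mr ÷ W) + nr)

# The @turbo kernel above holds a 24×9 tile in zmm registers (W = 8):
density(24, 9, 8)  # 27 FMAs per 12 loads = 2.25

# A 32×1 tile (4 vectors, no broadcast reuse) gets far fewer FMAs per load:
density(32, 1, 8)  # 4 FMAs per 5 loads = 0.8
```

The LLVM loop is worse than even the second number suggests, because it also loads and stores C every iteration.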
I'm not sure.
Thanks for that thorough explanation, you really helped me out a lot! Following your example, I was able to extract almost identical assembly code to the one you showed. This is what it looks like:
L384:
vbroadcastsd ymm14, qword ptr [rsi]
vmovupd ymm15, ymmword ptr [rcx]
vmovupd ymm0, ymmword ptr [rcx + 32]
prefetcht0 byte ptr [rcx + rbx]
add rcx, r11
vfmadd231pd ymm8, ymm15, ymm14 # ymm8 = (ymm15 * ymm14) + ymm8
vfmadd231pd ymm2, ymm0, ymm14 # ymm2 = (ymm0 * ymm14) + ymm2
vbroadcastsd ymm14, qword ptr [rsi + r13]
vfmadd231pd ymm9, ymm15, ymm14 # ymm9 = (ymm15 * ymm14) + ymm9
vfmadd231pd ymm3, ymm0, ymm14 # ymm3 = (ymm0 * ymm14) + ymm3
vbroadcastsd ymm14, qword ptr [rsi + 2*r13]
vfmadd231pd ymm10, ymm15, ymm14 # ymm10 = (ymm15 * ymm14) + ymm10
vfmadd231pd ymm4, ymm0, ymm14 # ymm4 = (ymm0 * ymm14) + ymm4
vbroadcastsd ymm14, qword ptr [rsi + r15]
vfmadd231pd ymm11, ymm15, ymm14 # ymm11 = (ymm15 * ymm14) + ymm11
vfmadd231pd ymm5, ymm0, ymm14 # ymm5 = (ymm0 * ymm14) + ymm5
vbroadcastsd ymm14, qword ptr [rsi + 4*r13]
vfmadd231pd ymm12, ymm15, ymm14 # ymm12 = (ymm15 * ymm14) + ymm12
vfmadd231pd ymm6, ymm0, ymm14 # ymm6 = (ymm0 * ymm14) + ymm6
vbroadcastsd ymm14, qword ptr [rsi + rax]
add rsi, 8
vfmadd231pd ymm13, ymm14, ymm15 # ymm13 = (ymm14 * ymm15) + ymm13
vfmadd231pd ymm7, ymm0, ymm14 # ymm7 = (ymm0 * ymm14) + ymm7
cmp rcx, rdx
jbe L384
You mean that is what
But you are correct here -- when zmm registers are available,
Yes:
julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake-avx512)
Threads: 28 on 28 virtual cores
The CPU I used for this example is a Xeon Gold 6248 (Cascade Lake):
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
No, @turbo produced the following code, which looks just like yours and performs really well:
L496:
vbroadcastsd zmm28, qword ptr [r8]
vmovupd zmm29, zmmword ptr [rdx]
vmovupd zmm30, zmmword ptr [rdx + 64]
vmovupd zmm27, zmmword ptr [rdx + 128]
prefetcht0 byte ptr [rdx + rax]
prefetcht0 byte ptr [rdx + rax + 64]
prefetcht0 byte ptr [rdx + rax + 128]
vfmadd231pd zmm18, zmm29, zmm28 # zmm18 = (zmm29 * zmm28) + zmm18
vfmadd231pd zmm9, zmm30, zmm28 # zmm9 = (zmm30 * zmm28) + zmm9
vbroadcastsd zmm31, qword ptr [r8 + r15]
vfmadd231pd zmm0, zmm27, zmm28 # zmm0 = (zmm27 * zmm28) + zmm0
vfmadd231pd zmm19, zmm29, zmm31 # zmm19 = (zmm29 * zmm31) + zmm19
vfmadd231pd zmm10, zmm30, zmm31 # zmm10 = (zmm30 * zmm31) + zmm10
vfmadd231pd zmm1, zmm27, zmm31 # zmm1 = (zmm27 * zmm31) + zmm1
vbroadcastsd zmm28, qword ptr [r8 + 2*r15]
vfmadd231pd zmm20, zmm29, zmm28 # zmm20 = (zmm29 * zmm28) + zmm20
vfmadd231pd zmm11, zmm30, zmm28 # zmm11 = (zmm30 * zmm28) + zmm11
vfmadd231pd zmm2, zmm27, zmm28 # zmm2 = (zmm27 * zmm28) + zmm2
vbroadcastsd zmm28, qword ptr [r8 + rbx]
vfmadd231pd zmm21, zmm29, zmm28 # zmm21 = (zmm29 * zmm28) + zmm21
vfmadd231pd zmm12, zmm30, zmm28 # zmm12 = (zmm30 * zmm28) + zmm12
vfmadd231pd zmm3, zmm27, zmm28 # zmm3 = (zmm27 * zmm28) + zmm3
vbroadcastsd zmm28, qword ptr [r8 + 4*r15]
vfmadd231pd zmm22, zmm29, zmm28 # zmm22 = (zmm29 * zmm28) + zmm22
vfmadd231pd zmm13, zmm30, zmm28 # zmm13 = (zmm30 * zmm28) + zmm13
vfmadd231pd zmm4, zmm27, zmm28 # zmm4 = (zmm27 * zmm28) + zmm4
vbroadcastsd zmm28, qword ptr [r8 + r13]
vfmadd231pd zmm23, zmm29, zmm28 # zmm23 = (zmm29 * zmm28) + zmm23
vfmadd231pd zmm14, zmm30, zmm28 # zmm14 = (zmm30 * zmm28) + zmm14
vfmadd231pd zmm5, zmm27, zmm28 # zmm5 = (zmm27 * zmm28) + zmm5
vbroadcastsd zmm28, qword ptr [r8 + 2*rbx]
vfmadd231pd zmm24, zmm29, zmm28 # zmm24 = (zmm29 * zmm28) + zmm24
vfmadd231pd zmm15, zmm30, zmm28 # zmm15 = (zmm30 * zmm28) + zmm15
vbroadcastsd zmm31, qword ptr [r8 + r12]
vfmadd231pd zmm6, zmm27, zmm28 # zmm6 = (zmm27 * zmm28) + zmm6
vfmadd231pd zmm25, zmm29, zmm31 # zmm25 = (zmm29 * zmm31) + zmm25
vfmadd231pd zmm16, zmm30, zmm31 # zmm16 = (zmm30 * zmm31) + zmm16
vfmadd231pd zmm7, zmm27, zmm31 # zmm7 = (zmm27 * zmm31) + zmm7
vbroadcastsd zmm28, qword ptr [r8 + 8*r15]
vfmadd231pd zmm26, zmm28, zmm29 # zmm26 = (zmm28 * zmm29) + zmm26
vfmadd231pd zmm17, zmm28, zmm30 # zmm17 = (zmm28 * zmm30) + zmm17
vfmadd231pd zmm8, zmm27, zmm28 # zmm8 = (zmm27 * zmm28) + zmm8
add rdx, r11
add r8, 8
cmp rdx, rcx
jbe L496
The code from my last reply was produced without using LoopVectorization.jl; I was just wondering if you would know why Julia does not use AVX-512, although it is supported by the CPU. Using the suggestion from your previous reply and starting Julia accordingly, I get:
L496:
vmovupd zmm1, zmmword ptr [r10 + 8*rax - 192]
vmovupd zmm2, zmmword ptr [r10 + 8*rax - 128]
vmovupd zmm3, zmmword ptr [r10 + 8*rax - 64]
vmovupd zmm4, zmmword ptr [r10 + 8*rax]
vbroadcastsd zmm5, xmm0
vfmadd213pd zmm1, zmm5, zmmword ptr [r9 + 8*rax - 192] # zmm1 = (zmm5 * zmm1) + mem
vfmadd213pd zmm2, zmm5, zmmword ptr [r9 + 8*rax - 128] # zmm2 = (zmm5 * zmm2) + mem
vfmadd213pd zmm3, zmm5, zmmword ptr [r9 + 8*rax - 64] # zmm3 = (zmm5 * zmm3) + mem
vfmadd213pd zmm4, zmm5, zmmword ptr [r9 + 8*rax] # zmm4 = (zmm5 * zmm4) + mem
vmovupd zmmword ptr [r9 + 8*rax - 192], zmm1
vmovupd zmmword ptr [r9 + 8*rax - 128], zmm2
vmovupd zmmword ptr [r9 + 8*rax - 64], zmm3
vmovupd zmmword ptr [r9 + 8*rax], zmm4
If I only use the standard loop, I get:
L560:
vbroadcastsd ymm1, xmm0
vmulpd ymm2, ymm1, ymmword ptr [r8 + 8*r15 - 96]
vmulpd ymm3, ymm1, ymmword ptr [r8 + 8*r15 - 64]
vmulpd ymm4, ymm1, ymmword ptr [r8 + 8*r15 - 32]
vmulpd ymm1, ymm1, ymmword ptr [r8 + 8*r15]
vaddpd ymm2, ymm2, ymmword ptr [rcx + 8*r15 - 96]
vaddpd ymm3, ymm3, ymmword ptr [rcx + 8*r15 - 64]
vaddpd ymm4, ymm4, ymmword ptr [rcx + 8*r15 - 32]
vaddpd ymm1, ymm1, ymmword ptr [rcx + 8*r15]
vmovupd ymmword ptr [rcx + 8*r15 - 96], ymm2
vmovupd ymmword ptr [rcx + 8*r15 - 64], ymm3
vmovupd ymmword ptr [rcx + 8*r15 - 32], ymm4
vmovupd ymmword ptr [rcx + 8*r15], ymm1
add r15, 16
cmp r9, r15
jne L560
If you know the reason why Julia is not using AVX-512 by default, I would appreciate it if you'd let me know. Otherwise, thanks again for the in-depth answer.
I find it hard to believe that Julia without LoopVectorization.jl produced this. That linked code is a
Those look like what I would expect from Julia. Note the huge number of loads and stores relative to arithmetic instructions: 1 broadcast + 8 loads plus 4 stores, for 4 mul-adds (whether expressed as separate vmulpd/vaddpd pairs or as fused vfmadd instructions).
The relative difference will probably be much larger with smaller matrices. Try sizes like 72x72, 144x144, 216x216. By dividing matrices into blocks and multiplying blockwise, you can get much better performance at larger sizes.
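A minimal sketch of that blockwise idea (an illustrative tiling only, not the kernel any package uses; the tile size bs = 64 is an arbitrary assumption to tune per cache level):

```julia
# Multiply in bs×bs tiles so the working sets of A, B, and C stay cache-hot.
function blocked_matmul!(C, A, B; bs = 64)
    n = size(A, 1)
    fill!(C, zero(eltype(C)))
    @inbounds for jj in 1:bs:n, kk in 1:bs:n, ii in 1:bs:n
        # Accumulate the (ii, jj) tile of C from the (ii, kk) tile of A
        # and the (kk, jj) tile of B.
        for j in jj:min(jj + bs - 1, n), k in kk:min(kk + bs - 1, n)
            b = B[k, j]
            for i in ii:min(ii + bs - 1, n)
                C[i, j] += A[i, k] * b
            end
        end
    end
    return C
end
```

The min(…, n) bounds handle matrix sizes that are not multiples of the tile size.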
None of the major compilers use 512-bit vectors by default. Personally, I think it's because they suck at generating good vector code, and thus see little benefit.
using LoopVectorization, Random, BenchmarkTools
@inline function selfdotsimd(a)
s = zero(eltype(a))
@inbounds @simd for i ∈ eachindex(a)
s += a[i] * a[i]
end
s
end
@inline function selfdotturbo(a)
s = zero(eltype(a))
@turbo for i ∈ eachindex(a)
s += a[i] * a[i]
end
s
end
function forall!(f, y, x, Ns)
@inbounds for i = eachindex(y,Ns)
y[i] = f(@view(x[1:Ns[i]]))
end
end
# we want to call selfdot for all sizes in a random order (to be harder on the branch predictor)
N = 512;
x = rand(N); Ns = randperm(length(x)); y = similar(x);
@benchmark forall!(selfdotsimd, $y, $x, $Ns)
@benchmark forall!(selfdotturbo, $y, $x, $Ns)
N = 256;
x = rand(N); Ns = randperm(length(x)); y = similar(x);
@benchmark forall!(selfdotsimd, $y, $x, $Ns)
@benchmark forall!(selfdotturbo, $y, $x, $Ns)
With 512-bit vectors enabled:
julia> x = rand(N); Ns = randperm(length(x)); y = similar(x);
julia> @benchmark forall!(selfdotsimd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 12.153 μs … 25.980 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 12.227 μs ┊ GC (median): 0.00%
Time (mean ± σ): 12.248 μs ± 264.713 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█▂
▂▅███▃▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂ ▂
12.2 μs Histogram: frequency by time 13.5 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark forall!(selfdotturbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
Range (min … max): 6.112 μs … 14.994 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.137 μs ┊ GC (median): 0.00%
Time (mean ± σ): 6.146 μs ± 106.127 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅█▅
▃▄█████▄▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
6.11 μs Histogram: frequency by time 6.43 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> N = 256;
julia> x = rand(N); Ns = randperm(length(x)); y = similar(x);
julia> @benchmark forall!(selfdotsimd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max): 4.284 μs … 7.578 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.306 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.312 μs ± 49.389 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅█▅
▂▃▄█████▄▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
4.28 μs Histogram: frequency by time 4.51 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark forall!(selfdotturbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.903 μs … 4.970 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.912 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.916 μs ± 50.064 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▅▇█▅ ▂
██████▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▄▄▅▅▇█▇▆▆▄▆ █
1.9 μs Histogram: log(frequency) by time 2.06 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Default:
julia> # we want to call selfdot for all sizes in a random order (to be harder on the branch predictor)
N = 512;
julia> x = rand(N); Ns = randperm(length(x)); y = similar(x);
julia> @benchmark forall!(selfdotsimd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.382 μs … 23.362 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.525 μs ┊ GC (median): 0.00%
Time (mean ± σ): 13.623 μs ± 628.988 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
██
▄██▆▃▂▂▂▂▂▁▁▁▂▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂ ▂
13.4 μs Histogram: frequency by time 17.4 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark forall!(selfdotturbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
Range (min … max): 6.109 μs … 15.190 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.123 μs ┊ GC (median): 0.00%
Time (mean ± σ): 6.133 μs ± 110.947 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▇█▇▆▃ ▂
███████▄▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▄▅▆█████▇▆ █
6.11 μs Histogram: log(frequency) by time 6.41 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> N = 256;
julia> x = rand(N); Ns = randperm(length(x)); y = similar(x);
julia> @benchmark forall!(selfdotsimd, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
Range (min … max): 3.554 μs … 6.840 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.581 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.586 μs ± 49.125 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ █▂
▃▆▆█▅███▄▂▃▅▃▂▂▂▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
3.55 μs Histogram: frequency by time 3.8 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark forall!(selfdotturbo, $y, $x, $Ns)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.852 μs … 4.799 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.859 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.862 μs ± 35.680 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▆██▆▁ ▂
██████▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▅▃▆▇▇▇ █
1.85 μs Histogram: log(frequency) by time 1.99 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Note that dot products of all sizes up to N are computed here. Of course, LoopVectorization.jl is using 512-bit vectors here and takes 1.9 microseconds, making it substantially faster than both. LoopVectorization.jl would be slower were it to switch to 256-bit vectors. 512-bit gets a slight edge at N=512, but is still way behind LoopVectorization.jl. Make the arrays much longer, and it becomes dominated by memory bandwidth, so they'll all be equally fast.
If you want to know what's going on here, look at the plots here: https://juliasimd.github.io/LoopVectorization.jl/latest/examples/dot_product/#Dot-Products Julia/LLVM get slower the larger the vectors are, because of how the remainder iterations are handled. LoopVectorization.jl doesn't have that problem, so it does see a big improvement from full-size vectors and has little reason not to use them.
I'm working on it. =) Note that some transforms can only be applied when the compiler can prove they are legal. Note that "if Julia can identify that a loop satisfies the necessary limitations" is a big ask, but LoopModels aims to do this, as well as having much looser restrictions (these go hand in hand: if you can analyze what is legal, you can restrict yourself to a legal subset).
Well, it looks like you are right.
Recap: overall, I examined three functions.
Version Turbo:
function matmult_turbo!(A,B,C)
n,m = size(A)
@turbo for j in 1:n
for k in 1:n
for i in 1:n
C[i,j] += A[i,k]*B[k,j]
end
end
end
end
Version Standard:
function matmult_std!(A,B,C)
n,m = size(A)
for j in 1:n
for k in 1:n
for i in 1:n
C[i,j] += A[i,k]*B[k,j]
end
end
end
end
Version Fastmath:
function matmult_fm!(A,B,C)
n,m = size(A)
for j in 1:n
for k in 1:n
@fastmath for i in 1:n
C[i,j] += A[i,k]*B[k,j]
end
end
end
end
All of them are located in a script, which I start with:
using Cthulhu
using LoopVectorization
# [...]
# function definitions from above
n = 200
a = rand(n,n)
b = rand(n,n)
c = zeros(n,n)
Cthulhu.CONFIG.asm_syntax = :intel
Cthulhu.CONFIG.enable_highlighter = true
The assembly code is extracted by running @descend on each function.
CPU: Intel Xeon Gold 6248 (the same one as yesterday)
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Turbo:
L496:
vbroadcastsd zmm28, qword ptr [r8]
vmovupd zmm29, zmmword ptr [rdx]
vmovupd zmm30, zmmword ptr [rdx + 64]
vmovupd zmm27, zmmword ptr [rdx + 128]
prefetcht0 byte ptr [rdx + rax]
prefetcht0 byte ptr [rdx + rax + 64]
prefetcht0 byte ptr [rdx + rax + 128]
vfmadd231pd zmm18, zmm29, zmm28 # zmm18 = (zmm29 * zmm28) + zmm18
vfmadd231pd zmm9, zmm30, zmm28 # zmm9 = (zmm30 * zmm28) + zmm9
vbroadcastsd zmm31, qword ptr [r8 + r15]
vfmadd231pd zmm0, zmm27, zmm28 # zmm0 = (zmm27 * zmm28) + zmm0
vfmadd231pd zmm19, zmm29, zmm31 # zmm19 = (zmm29 * zmm31) + zmm19
vfmadd231pd zmm10, zmm30, zmm31 # zmm10 = (zmm30 * zmm31) + zmm10
vfmadd231pd zmm1, zmm27, zmm31 # zmm1 = (zmm27 * zmm31) + zmm1
vbroadcastsd zmm28, qword ptr [r8 + 2*r15]
vfmadd231pd zmm20, zmm29, zmm28 # zmm20 = (zmm29 * zmm28) + zmm20
vfmadd231pd zmm11, zmm30, zmm28 # zmm11 = (zmm30 * zmm28) + zmm11
vfmadd231pd zmm2, zmm27, zmm28 # zmm2 = (zmm27 * zmm28) + zmm2
vbroadcastsd zmm28, qword ptr [r8 + rbx]
vfmadd231pd zmm21, zmm29, zmm28 # zmm21 = (zmm29 * zmm28) + zmm21
vfmadd231pd zmm12, zmm30, zmm28 # zmm12 = (zmm30 * zmm28) + zmm12
vfmadd231pd zmm3, zmm27, zmm28 # zmm3 = (zmm27 * zmm28) + zmm3
vbroadcastsd zmm28, qword ptr [r8 + 4*r15]
vfmadd231pd zmm22, zmm29, zmm28 # zmm22 = (zmm29 * zmm28) + zmm22
vfmadd231pd zmm13, zmm30, zmm28 # zmm13 = (zmm30 * zmm28) + zmm13
vfmadd231pd zmm4, zmm27, zmm28 # zmm4 = (zmm27 * zmm28) + zmm4
vbroadcastsd zmm28, qword ptr [r8 + r13]
vfmadd231pd zmm23, zmm29, zmm28 # zmm23 = (zmm29 * zmm28) + zmm23
vfmadd231pd zmm14, zmm30, zmm28 # zmm14 = (zmm30 * zmm28) + zmm14
vfmadd231pd zmm5, zmm27, zmm28 # zmm5 = (zmm27 * zmm28) + zmm5
vbroadcastsd zmm28, qword ptr [r8 + 2*rbx]
vfmadd231pd zmm24, zmm29, zmm28 # zmm24 = (zmm29 * zmm28) + zmm24
vfmadd231pd zmm15, zmm30, zmm28 # zmm15 = (zmm30 * zmm28) + zmm15
vbroadcastsd zmm31, qword ptr [r8 + r12]
vfmadd231pd zmm6, zmm27, zmm28 # zmm6 = (zmm27 * zmm28) + zmm6
vfmadd231pd zmm25, zmm29, zmm31 # zmm25 = (zmm29 * zmm31) + zmm25
vfmadd231pd zmm16, zmm30, zmm31 # zmm16 = (zmm30 * zmm31) + zmm16
vfmadd231pd zmm7, zmm27, zmm31 # zmm7 = (zmm27 * zmm31) + zmm7
vbroadcastsd zmm28, qword ptr [r8 + 8*r15]
vfmadd231pd zmm26, zmm28, zmm29 # zmm26 = (zmm28 * zmm29) + zmm26
vfmadd231pd zmm17, zmm28, zmm30 # zmm17 = (zmm28 * zmm30) + zmm17
vfmadd231pd zmm8, zmm27, zmm28 # zmm8 = (zmm27 * zmm28) + zmm8
add rdx, r11
add r8, 8
cmp rdx, rcx
jbe L496
Standard:
L560:
vbroadcastsd ymm1, xmm0
vmulpd ymm2, ymm1, ymmword ptr [r8 + 8*r15 - 96]
vmulpd ymm3, ymm1, ymmword ptr [r8 + 8*r15 - 64]
vmulpd ymm4, ymm1, ymmword ptr [r8 + 8*r15 - 32]
vmulpd ymm1, ymm1, ymmword ptr [r8 + 8*r15]
vaddpd ymm2, ymm2, ymmword ptr [rcx + 8*r15 - 96]
vaddpd ymm3, ymm3, ymmword ptr [rcx + 8*r15 - 64]
vaddpd ymm4, ymm4, ymmword ptr [rcx + 8*r15 - 32]
vaddpd ymm1, ymm1, ymmword ptr [rcx + 8*r15]
vmovupd ymmword ptr [rcx + 8*r15 - 96], ymm2
vmovupd ymmword ptr [rcx + 8*r15 - 64], ymm3
vmovupd ymmword ptr [rcx + 8*r15 - 32], ymm4
vmovupd ymmword ptr [rcx + 8*r15], ymm1
add r15, 16
cmp r9, r15
jne L560
Fastmath:
L560:
vmovupd ymm1, ymmword ptr [r8 + 8*r15 - 96]
vmovupd ymm2, ymmword ptr [r8 + 8*r15 - 64]
vmovupd ymm3, ymmword ptr [r8 + 8*r15 - 32]
vmovupd ymm4, ymmword ptr [r8 + 8*r15]
vbroadcastsd ymm5, xmm0
vfmadd213pd ymm1, ymm5, ymmword ptr [rcx + 8*r15 - 96] # ymm1 = (ymm5 * ymm1) + mem
vfmadd213pd ymm2, ymm5, ymmword ptr [rcx + 8*r15 - 64] # ymm2 = (ymm5 * ymm2) + mem
vfmadd213pd ymm3, ymm5, ymmword ptr [rcx + 8*r15 - 32] # ymm3 = (ymm5 * ymm3) + mem
vfmadd213pd ymm4, ymm5, ymmword ptr [rcx + 8*r15] # ymm4 = (ymm5 * ymm4) + mem
vmovupd ymmword ptr [rcx + 8*r15 - 96], ymm1
vmovupd ymmword ptr [rcx + 8*r15 - 64], ymm2
vmovupd ymmword ptr [rcx + 8*r15 - 32], ymm3
vmovupd ymmword ptr [rcx + 8*r15], ymm4
add r15, 16
cmp r9, r15
jne L560
CPU: Intel Core i9-12900K
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: 12th Gen Intel(R) Core(TM) i9-12900K
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, goldmont)
Turbo:
L416:
vbroadcastsd ymm14, qword ptr [rsi]
vmovupd ymm15, ymmword ptr [rcx]
vmovupd ymm0, ymmword ptr [rcx + 32]
prefetcht0 byte ptr [rcx + rbx]
vmovupd ymm5, ymmword ptr [rsp - 96]
vmovupd ymm4, ymmword ptr [rsp - 64]
add rcx, r11
vbroadcastsd ymm1, qword ptr [rsi + r13]
vbroadcastsd ymm2, qword ptr [rsi + 2*r13]
vbroadcastsd ymm3, qword ptr [rsi + r15]
vfmadd231pd ymm4, ymm0, ymm14 # ymm4 = (ymm0 * ymm14) + ymm4
vfmadd231pd ymm8, ymm15, ymm14 # ymm8 = (ymm15 * ymm14) + ymm8
vbroadcastsd ymm14, qword ptr [rsi + 4*r13]
vfmadd231pd ymm5, ymm0, ymm1 # ymm5 = (ymm0 * ymm1) + ymm5
vmovupd ymmword ptr [rsp - 64], ymm4
vbroadcastsd ymm4, qword ptr [rsi + rax]
vfmadd231pd ymm10, ymm15, ymm2 # ymm10 = (ymm15 * ymm2) + ymm10
add rsi, 8
vfmadd231pd ymm9, ymm15, ymm1 # ymm9 = (ymm15 * ymm1) + ymm9
vfmadd231pd ymm11, ymm15, ymm3 # ymm11 = (ymm15 * ymm3) + ymm11
cmp rcx, rdx
vmovupd ymmword ptr [rsp - 96], ymm5
vmovupd ymm5, ymmword ptr [rsp + 32]
vfmadd231pd ymm12, ymm15, ymm14 # ymm12 = (ymm15 * ymm14) + ymm12
vfmadd231pd ymm6, ymm0, ymm14 # ymm6 = (ymm0 * ymm14) + ymm6
vfmadd231pd ymm5, ymm0, ymm2 # ymm5 = (ymm0 * ymm2) + ymm5
vmovupd ymm2, ymmword ptr [rsp + 96]
vfmadd231pd ymm13, ymm4, ymm15 # ymm13 = (ymm4 * ymm15) + ymm13
vfmadd231pd ymm7, ymm0, ymm4 # ymm7 = (ymm0 * ymm4) + ymm7
vmovupd ymmword ptr [rsp + 32], ymm5
vfmadd231pd ymm2, ymm0, ymm3 # ymm2 = (ymm0 * ymm3) + ymm2
vmovupd ymmword ptr [rsp + 96], ymm2
jbe L416
Standard:
L544:
vbroadcastsd ymm1, xmm0
vmulpd ymm2, ymm1, ymmword ptr [r8 + 8*r15 - 96]
vmulpd ymm3, ymm1, ymmword ptr [r8 + 8*r15 - 64]
vmulpd ymm4, ymm1, ymmword ptr [r8 + 8*r15 - 32]
vmulpd ymm1, ymm1, ymmword ptr [r8 + 8*r15]
vaddpd ymm2, ymm2, ymmword ptr [rcx + 8*r15 - 96]
vaddpd ymm3, ymm3, ymmword ptr [rcx + 8*r15 - 64]
vaddpd ymm4, ymm4, ymmword ptr [rcx + 8*r15 - 32]
vaddpd ymm1, ymm1, ymmword ptr [rcx + 8*r15]
vmovupd ymmword ptr [rcx + 8*r15 - 96], ymm2
vmovupd ymmword ptr [rcx + 8*r15 - 64], ymm3
vmovupd ymmword ptr [rcx + 8*r15 - 32], ymm4
vmovupd ymmword ptr [rcx + 8*r15], ymm1
add r15, 16
cmp r9, r15
jne L544
Fastmath:
L544:
vmovupd ymm1, ymmword ptr [r8 + 8*r15 - 96]
vmovupd ymm2, ymmword ptr [r8 + 8*r15 - 64]
vmovupd ymm3, ymmword ptr [r8 + 8*r15 - 32]
vmovupd ymm4, ymmword ptr [r8 + 8*r15]
vbroadcastsd ymm5, xmm0
vfmadd213pd ymm1, ymm5, ymmword ptr [rcx + 8*r15 - 96] # ymm1 = (ymm5 * ymm1) + mem
vfmadd213pd ymm2, ymm5, ymmword ptr [rcx + 8*r15 - 64] # ymm2 = (ymm5 * ymm2) + mem
vfmadd213pd ymm3, ymm5, ymmword ptr [rcx + 8*r15 - 32] # ymm3 = (ymm5 * ymm3) + mem
vfmadd213pd ymm4, ymm5, ymmword ptr [rcx + 8*r15] # ymm4 = (ymm5 * ymm4) + mem
vmovupd ymmword ptr [rcx + 8*r15 - 96], ymm1
vmovupd ymmword ptr [rcx + 8*r15 - 64], ymm2
vmovupd ymmword ptr [rcx + 8*r15 - 32], ymm3
vmovupd ymmword ptr [rcx + 8*r15], ymm4
add r15, 16
cmp r9, r15
jne L544
CPU: AMD Ryzen 9 5900X
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 9 5900X 12-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver3)
Turbo:
L384:
vbroadcastsd ymm14, qword ptr [rsi]
vmovupd ymm15, ymmword ptr [rcx]
vmovupd ymm0, ymmword ptr [rcx + 32]
prefetcht0 byte ptr [rcx + rbx]
add rcx, r11
vfmadd231pd ymm8, ymm15, ymm14 # ymm8 = (ymm15 * ymm14) + ymm8
vfmadd231pd ymm2, ymm0, ymm14 # ymm2 = (ymm0 * ymm14) + ymm2
vbroadcastsd ymm14, qword ptr [rsi + r13]
vfmadd231pd ymm9, ymm15, ymm14 # ymm9 = (ymm15 * ymm14) + ymm9
vfmadd231pd ymm3, ymm0, ymm14 # ymm3 = (ymm0 * ymm14) + ymm3
vbroadcastsd ymm14, qword ptr [rsi + 2*r13]
vfmadd231pd ymm10, ymm15, ymm14 # ymm10 = (ymm15 * ymm14) + ymm10
vfmadd231pd ymm4, ymm0, ymm14 # ymm4 = (ymm0 * ymm14) + ymm4
vbroadcastsd ymm14, qword ptr [rsi + r15]
vfmadd231pd ymm11, ymm15, ymm14 # ymm11 = (ymm15 * ymm14) + ymm11
vfmadd231pd ymm5, ymm0, ymm14 # ymm5 = (ymm0 * ymm14) + ymm5
vbroadcastsd ymm14, qword ptr [rsi + 4*r13]
vfmadd231pd ymm12, ymm15, ymm14 # ymm12 = (ymm15 * ymm14) + ymm12
vfmadd231pd ymm6, ymm0, ymm14 # ymm6 = (ymm0 * ymm14) + ymm6
vbroadcastsd ymm14, qword ptr [rsi + rax]
add rsi, 8
vfmadd231pd ymm13, ymm14, ymm15 # ymm13 = (ymm14 * ymm15) + ymm13
vfmadd231pd ymm7, ymm0, ymm14 # ymm7 = (ymm0 * ymm14) + ymm7
cmp rcx, rdx
jbe L384
Standard:
L560:
vbroadcastsd ymm1, xmm0
vmulpd ymm2, ymm1, ymmword ptr [r8 + 8*r9 - 96]
vmulpd ymm3, ymm1, ymmword ptr [r8 + 8*r9 - 64]
vmulpd ymm4, ymm1, ymmword ptr [r8 + 8*r9 - 32]
vmulpd ymm1, ymm1, ymmword ptr [r8 + 8*r9]
vaddpd ymm2, ymm2, ymmword ptr [rcx + 8*r9 - 96]
vaddpd ymm3, ymm3, ymmword ptr [rcx + 8*r9 - 64]
vaddpd ymm4, ymm4, ymmword ptr [rcx + 8*r9 - 32]
vaddpd ymm1, ymm1, ymmword ptr [rcx + 8*r9]
vmovupd ymmword ptr [rcx + 8*r9 - 96], ymm2
vmovupd ymmword ptr [rcx + 8*r9 - 64], ymm3
vmovupd ymmword ptr [rcx + 8*r9 - 32], ymm4
vmovupd ymmword ptr [rcx + 8*r9], ymm1
add r9, 16
cmp r15, r9
jne L560
Fastmath:
L560:
vbroadcastsd ymm5, xmm0
vmovupd ymm1, ymmword ptr [r8 + 8*r9 - 96]
vmovupd ymm2, ymmword ptr [r8 + 8*r9 - 64]
vmovupd ymm3, ymmword ptr [r8 + 8*r9 - 32]
vmovupd ymm4, ymmword ptr [r8 + 8*r9]
vfmadd213pd ymm1, ymm5, ymmword ptr [rcx + 8*r9 - 96] # ymm1 = (ymm5 * ymm1) + mem
vfmadd213pd ymm2, ymm5, ymmword ptr [rcx + 8*r9 - 64] # ymm2 = (ymm5 * ymm2) + mem
vfmadd213pd ymm3, ymm5, ymmword ptr [rcx + 8*r9 - 32] # ymm3 = (ymm5 * ymm3) + mem
vfmadd213pd ymm4, ymm5, ymmword ptr [rcx + 8*r9] # ymm4 = (ymm5 * ymm4) + mem
vmovupd ymmword ptr [rcx + 8*r9 - 96], ymm1
vmovupd ymmword ptr [rcx + 8*r9 - 64], ymm2
vmovupd ymmword ptr [rcx + 8*r9 - 32], ymm3
vmovupd ymmword ptr [rcx + 8*r9], ymm4
add r9, 16
cmp r15, r9
jne L560
cmp rax, r15
je L320
jmp L384
CPU: AMD EPYC 7313
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD EPYC 7313 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver3)
Turbo:
L384:
vbroadcastsd ymm14, qword ptr [rsi]
vmovupd ymm15, ymmword ptr [rcx]
vmovupd ymm0, ymmword ptr [rcx + 32]
prefetcht0 byte ptr [rcx + rbx]
add rcx, r11
vfmadd231pd ymm8, ymm15, ymm14 # ymm8 = (ymm15 * ymm14) + ymm8
vfmadd231pd ymm2, ymm0, ymm14 # ymm2 = (ymm0 * ymm14) + ymm2
vbroadcastsd ymm14, qword ptr [rsi + r13]
vfmadd231pd ymm9, ymm15, ymm14 # ymm9 = (ymm15 * ymm14) + ymm9
vfmadd231pd ymm3, ymm0, ymm14 # ymm3 = (ymm0 * ymm14) + ymm3
vbroadcastsd ymm14, qword ptr [rsi + 2*r13]
vfmadd231pd ymm10, ymm15, ymm14 # ymm10 = (ymm15 * ymm14) + ymm10
vfmadd231pd ymm4, ymm0, ymm14 # ymm4 = (ymm0 * ymm14) + ymm4
vbroadcastsd ymm14, qword ptr [rsi + r15]
vfmadd231pd ymm11, ymm15, ymm14 # ymm11 = (ymm15 * ymm14) + ymm11
vfmadd231pd ymm5, ymm0, ymm14 # ymm5 = (ymm0 * ymm14) + ymm5
vbroadcastsd ymm14, qword ptr [rsi + 4*r13]
vfmadd231pd ymm12, ymm15, ymm14 # ymm12 = (ymm15 * ymm14) + ymm12
vfmadd231pd ymm6, ymm0, ymm14 # ymm6 = (ymm0 * ymm14) + ymm6
vbroadcastsd ymm14, qword ptr [rsi + rax]
add rsi, 8
vfmadd231pd ymm13, ymm14, ymm15 # ymm13 = (ymm14 * ymm15) + ymm13
vfmadd231pd ymm7, ymm0, ymm14 # ymm7 = (ymm0 * ymm14) + ymm7
cmp rcx, rdx
jbe L384
Standard:
L560:
vbroadcastsd ymm1, xmm0
vmulpd ymm2, ymm1, ymmword ptr [r8 + 8*r9 - 96]
vmulpd ymm3, ymm1, ymmword ptr [r8 + 8*r9 - 64]
vmulpd ymm4, ymm1, ymmword ptr [r8 + 8*r9 - 32]
vmulpd ymm1, ymm1, ymmword ptr [r8 + 8*r9]
vaddpd ymm2, ymm2, ymmword ptr [rcx + 8*r9 - 96]
vaddpd ymm3, ymm3, ymmword ptr [rcx + 8*r9 - 64]
vaddpd ymm4, ymm4, ymmword ptr [rcx + 8*r9 - 32]
vaddpd ymm1, ymm1, ymmword ptr [rcx + 8*r9]
vmovupd ymmword ptr [rcx + 8*r9 - 96], ymm2
vmovupd ymmword ptr [rcx + 8*r9 - 64], ymm3
vmovupd ymmword ptr [rcx + 8*r9 - 32], ymm4
vmovupd ymmword ptr [rcx + 8*r9], ymm1
add r9, 16
cmp r15, r9
jne L560
cmp rax, r15
je L320
jmp L384
Fastmath:
L560:
vbroadcastsd ymm5, xmm0
vmovupd ymm1, ymmword ptr [r8 + 8*r9 - 96]
vmovupd ymm2, ymmword ptr [r8 + 8*r9 - 64]
vmovupd ymm3, ymmword ptr [r8 + 8*r9 - 32]
vmovupd ymm4, ymmword ptr [r8 + 8*r9]
vfmadd213pd ymm1, ymm5, ymmword ptr [rcx + 8*r9 - 96] # ymm1 = (ymm5 * ymm1) + mem
vfmadd213pd ymm2, ymm5, ymmword ptr [rcx + 8*r9 - 64] # ymm2 = (ymm5 * ymm2) + mem
vfmadd213pd ymm3, ymm5, ymmword ptr [rcx + 8*r9 - 32] # ymm3 = (ymm5 * ymm3) + mem
vfmadd213pd ymm4, ymm5, ymmword ptr [rcx + 8*r9] # ymm4 = (ymm5 * ymm4) + mem
vmovupd ymmword ptr [rcx + 8*r9 - 96], ymm1
vmovupd ymmword ptr [rcx + 8*r9 - 64], ymm2
vmovupd ymmword ptr [rcx + 8*r9 - 32], ymm3
vmovupd ymmword ptr [rcx + 8*r9], ymm4
add r9, 16
cmp r15, r9
jne L560
I forgot to mention that I use an approach similar to this, where LoopVectorization.jl is only used to multiply the smaller sub-matrices (Source).
function my_mul_turbo!(C,A,B)
n,m = size(A)
@turbo for j in 1:m
for k in 1:m
for i in 1:n
C[i,j] += A[i,k]*B[k,j]
end
end
end
end
UL(x, n) = @views x[1:div(n,2), 1:div(n,2)]
UR(x, n) = @views x[1:div(n,2), div(n,2)+1:end]
LL(x, n) = @views x[div(n,2)+1:end, 1:div(n,2)]
LR(x, n) = @views x[div(n,2)+1:end, div(n,2)+1:end]
function my_mul_tile!(C,A,B)
n,m = size(A)
@assert n == m
thresh_size = 64
if n <= thresh_size
my_mul_turbo!(C,A,B)
return
end
@sync begin
Threads.@spawn my_mul_tile!(UL(C,n), UL(A,n), UL(B,n))
Threads.@spawn my_mul_tile!(UR(C,n), UL(A,n), UR(B,n))
Threads.@spawn my_mul_tile!(LL(C,n), LL(A,n), UL(B,n))
Threads.@spawn my_mul_tile!(LR(C,n), LL(A,n), UR(B,n))
end
@sync begin
Threads.@spawn my_mul_tile!(UL(C,n), UR(A,n), LL(B,n))
Threads.@spawn my_mul_tile!(UR(C,n), UR(A,n), LR(B,n))
Threads.@spawn my_mul_tile!(LL(C,n), LR(A,n), LL(B,n))
Threads.@spawn my_mul_tile!(LR(C,n), LR(A,n), LR(B,n))
end
end
This should improve the cache behavior (at least compared to the naive algorithm).
I will take a look at those, thanks again.
I was able to get good performance out of Julia already, but using your package improved it significantly further. It would be great if someday this performance boost were available to a user who maybe isn't aware that a package like this exists.
Woah, something is going wrong on the 12900K. It's spilling registers.
Have you played much with the threshold values? I'm guessing larger thresholds will do better. If you're curious about how to optimize this further, you could take a look at: https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf Of course, that function is already going to be vastly better than the naive approach. Or, if you want to try something else fun (I know you have a dissertation to finish, but...), a Julia Strassen would be neat. Basically, you can replace those 8 recursive matmuls with 7, at the cost of some extra additions.
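For reference, that 8-to-7 trick is Strassen's identity. A deliberately naive, allocating sketch (my own illustration: it assumes square matrices whose size is a power of two times the base-case threshold; a serious version would recurse into a tuned kernel such as the my_mul_turbo! above and reuse buffers instead of allocating temporaries):

```julia
# Naive Strassen: 7 recursive multiplications instead of 8, plus extra adds.
# Assumes n is (threshold)*2^k for some k; falls back to built-in `*` below 64.
function strassen(A, B)
    n = size(A, 1)
    n <= 64 && return A * B            # base case: use a fast kernel here
    h = n ÷ 2
    A11 = A[1:h, 1:h];   A12 = A[1:h, h+1:n]
    A21 = A[h+1:n, 1:h]; A22 = A[h+1:n, h+1:n]
    B11 = B[1:h, 1:h];   B12 = B[1:h, h+1:n]
    B21 = B[h+1:n, 1:h]; B22 = B[h+1:n, h+1:n]
    M1 = strassen(A11 + A22, B11 + B22)   # the 7 Strassen products
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    C = similar(A)
    C[1:h, 1:h]     .= M1 .+ M4 .- M5 .+ M7
    C[1:h, h+1:n]   .= M3 .+ M5
    C[h+1:n, 1:h]   .= M2 .+ M4
    C[h+1:n, h+1:n] .= M1 .- M2 .+ M3 .+ M6
    return C
end
```

The asymptotic win (O(n^2.807) vs. O(n^3)) only pays off once the matrices are large enough to amortize the extra additions and allocations.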
Yeah, that's one of my long term goals, but it'll be a long road, getting from here to there. |
Interesting. I have never actually used it for serious measurements because I didn't want to deal with the Efficient-cores, so I cannot provide further info. Is this worrisome?
You are correct, I tested
Looks interesting, I'll have a look, thanks!
This is interesting as well. I might try to implement this after the thesis is done to improve my Julia skills, as I feel I need to improve in that regard, especially since Julia works a bit differently from what I'm used to. I'll certainly try to use the language more often in the future, and particularly
Is there a way for me to see the generated LLVM or assembly code while using @turbo?
Background:
I'm currently writing my thesis using Julia and similar projects. The goal is basically to compare the performance as well as apply optimizations and see how well they do.
Using this package I get great results in terms of performance, but am currently unable to figure out how the generated assembly code looks.
If I run @code_native (or @code_llvm) on a function that uses @turbo, I'm not getting the full assembly code. For example, if I run this on the function below, I cannot find a single instruction that would be performing multiplication, addition, or FMA.
Other related side questions: