Add Cartesian-related SIMD benchmarks #284
Conversation
add Cartesian-related benchmarks
fix 1.0
LLVM is generally going to unroll by 4x the SIMD vector width. FWIW, the dim1 = 31, 32, and 63 times will fail to vectorize. However, that doesn't actually matter here, because the benchmark is totally dominated by memory bandwidth.

julia> size(v), typeof(v)
((63, 8, 8, 65), SubArray{Float32, 4, Array{Float32, 4}, NTuple{4, Base.OneTo{Int64}}, false})
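For context, a minimal sketch of what the pieces used in these measurements might look like. `perf_axpy!`, `perf_turbo!`, and `foreachf` are only named in this thread, so the bodies and argument order below are assumptions, and `@pstats` is the event-counting macro from LinuxPerf.jl.

```julia
# Hypothetical reconstructions -- the real definitions live in the benchmark
# suite and may differ.
using LoopVectorization  # provides @turbo

# axpy-style kernel left to the compiler; `eachindex` over a non-contiguous
# SubArray yields CartesianIndices, which is the case being benchmarked here.
function perf_axpy!(a, x, y)
    @inbounds @simd for i in eachindex(x, y)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end

# the same kernel, but vectorized by LoopVectorization.jl instead of @simd
function perf_turbo!(a, x, y)
    @turbo for i in eachindex(x, y)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end

# run f(args...) many times so the hardware counters integrate a long window
function foreachf(f::F, N::Int, args::Vararg{Any,K}) where {F,K}
    for _ in 1:N
        f(args...)
    end
    return nothing
end
```

With definitions along those lines, `foreachf(perf_axpy!, 10_000, n, v, x)` simply applies the kernel 10,000 times while the counters run.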
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
foreachf(perf_axpy!, 10_000, n, v, x)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.31e+09 50.0% # 4.1 cycles per ns
┌ instructions 5.94e+09 75.0% # 1.8 insns per cycle
│ branch-instructions 9.23e+08 75.0% # 15.6% of instructions
└ branch-misses 7.38e+06 75.0% # 0.8% of branch instructions
┌ task-clock 8.08e+08 100.0% # 808.5 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 3.32e+08 25.0% # 20.0% of dcache loads
│ L1-dcache-loads 1.66e+09 25.0%
└ L1-icache-load-misses 2.03e+05 25.0%
┌ dTLB-load-misses 1.96e+02 25.0% # 0.0% of dTLB loads
└ dTLB-loads 1.66e+09 25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
foreachf(perf_turbo!, 10_000, n, v, x)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.05e+09 50.0% # 3.8 cycles per ns
┌ instructions 9.30e+08 75.0% # 0.3 insns per cycle
│ branch-instructions 9.70e+07 75.0% # 10.4% of instructions
└ branch-misses 5.71e+04 75.0% # 0.1% of branch instructions
┌ task-clock 7.92e+08 100.0% # 791.6 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 3.30e+08 25.0% # 91.1% of dcache loads
│ L1-dcache-loads 3.62e+08 25.0%
└ L1-icache-load-misses 6.61e+04 25.0%
┌ dTLB-load-misses 1.28e+02 25.0% # 0.0% of dTLB loads
└ dTLB-loads 3.62e+08 25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

We see LoopVectorization required less than 1/6 the total number of instructions, but missed 91.1% of dcache loads. For comparison, using parent(v) and parent(x):

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
foreachf(perf_axpy!, 10_000, n, parent(v), parent(x))
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.07e+09 50.0% # 3.9 cycles per ns
┌ instructions 6.21e+08 75.0% # 0.2 insns per cycle
│ branch-instructions 4.23e+07 75.0% # 6.8% of instructions
└ branch-misses 3.20e+04 75.0% # 0.1% of branch instructions
┌ task-clock 7.97e+08 100.0% # 797.0 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 3.29e+08 25.0% # 99.6% of dcache loads
│ L1-dcache-loads 3.30e+08 25.0%
└ L1-icache-load-misses 1.32e+05 25.0%
┌ dTLB-load-misses 2.00e+01 25.0% # 0.0% of dTLB loads
└ dTLB-loads 3.30e+08 25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
foreachf(perf_turbo!, 10_000, n, parent(v), parent(x))
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.07e+09 50.0% # 3.9 cycles per ns
┌ instructions 6.21e+08 75.0% # 0.2 insns per cycle
│ branch-instructions 4.22e+07 75.0% # 6.8% of instructions
└ branch-misses 3.69e+04 75.0% # 0.1% of branch instructions
┌ task-clock 7.96e+08 100.0% # 795.8 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 3.29e+08 25.0% # 99.6% of dcache loads
│ L1-dcache-loads 3.30e+08 25.0%
└ L1-icache-load-misses 5.41e+04 25.0%
┌ dTLB-load-misses 1.60e+01 25.0% # 0.0% of dTLB loads
└ dTLB-loads 3.30e+08 25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

They both now experience 99.6% dcache load misses. It is 8x faster if I cut the array sizes down.
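As a rough sanity check of the "dominated by memory bandwidth" point made earlier, here is a back-of-the-envelope estimate using only numbers shown above (`size(v)` and the ~0.8 s task-clock), and assuming `x` has the same shape as `v`; the 3x traffic factor (read both arrays, write one) ignores write-allocate details, so treat it as an order-of-magnitude figure.

```julia
elems   = 63 * 8 * 8 * 65            # elements per array, from size(v)
bytes   = elems * sizeof(Float32)    # ≈ 1.05 MB per array
traffic = 3 * bytes * 10_000         # read v and x, write v, over 10_000 calls
traffic / 0.8 / 1e9                  # ≈ 39 GB/s sustained -> bandwidth-bound
```

All three variants land near 0.8 s despite a 6-10x spread in instruction counts, which is what you would expect when throughput is set by memory traffic rather than by the core.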
make nbytes smaller to fit in L2 caches
src/simd/SIMDBenchmarks.jl (Outdated)
end
end
const nbytes = 1 << 18
Yeah, 512 KiB combined for the two arrays should be good for Intel server processors Skylake and newer, client processors Ice Lake and newer, and for AMD Zen.
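For reference, the arithmetic behind that figure, assuming the `nbytes` constant from the diff above is the per-array size in bytes:

```julia
nbytes = 1 << 18              # 262144 bytes = 256 KiB per array
nbytes ÷ sizeof(Float32)      # 65536 Float32 elements per array
2 * nbytes ÷ 1024             # 512 KiB combined for the two arrays
```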
1. avoid `view` in `CartesianPartition`'s bench. 2. make the bench size smaller. 3. add a manually partitioned sum bench
The maximum benchmarked size is reduced to 32 kB per array.
Co-authored-by: Jameson Nash <[email protected]>
Some of the original perf-test functions are extended to bench 2/3/4-d Cartesian SIMD.
Since the length of the 1st dim definitely influences the performance, I'm not confident about the representativeness of the chosen bench sizes.
Pinging @chriselrod for advice.
see also JuliaLang/julia#42736
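Regarding the bench-size concern, one option (a sketch only; the constants and layout below are assumptions, not what this PR does) is to sweep the first-dimension length across values just below, at, and above a multiple of the SIMD width, so that both the vectorized body and the remainder handling are exercised while the per-array footprint stays at the reduced 32 kB:

```julia
# Hypothetical size sweep covering the awkward first-dimension lengths
# mentioned earlier (31, 32, 63) plus a power of two.
const DIM1S  = (31, 32, 63, 64)
const NBYTES = 1 << 15                            # ~32 kB of Float32 per array

for d1 in DIM1S
    d2 = max(NBYTES ÷ sizeof(Float32) ÷ d1, 1)    # fold the rest into dim 2
    x  = rand(Float32, d1, d2)
    v  = similar(x)
    # e.g. SUITE["Cartesian axpy!", (d1, d2)] = @benchmarkable perf_axpy!(1f0, $v, $x)
    @show (d1, d2)
end
```

This only addresses how the sizes are chosen; the views/CartesianIndices setup of the actual benchmarks would stay as it is in the PR.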