Slow vmapntt! on Ryzen #141
Comments
Thanks for the issue. Following a change, multithreaded
julia> @benchmark map!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 425.385 ms (0.00% GC)
median time: 425.683 ms (0.00% GC)
mean time: 425.701 ms (0.00% GC)
maximum time: 426.033 ms (0.00% GC)
--------------
samples: 12
evals/sample: 1
julia> @benchmark vmap!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 178.897 ms (0.00% GC)
median time: 179.035 ms (0.00% GC)
mean time: 179.077 ms (0.00% GC)
maximum time: 179.510 ms (0.00% GC)
--------------
samples: 28
evals/sample: 1
julia> @benchmark vmapnt!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 147.471 ms (0.00% GC)
median time: 147.832 ms (0.00% GC)
mean time: 147.854 ms (0.00% GC)
maximum time: 148.343 ms (0.00% GC)
--------------
samples: 34
evals/sample: 1
julia> Threads.nthreads()
18
julia> @benchmark vmapt!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 13.19 KiB
allocs estimate: 91
--------------
minimum time: 40.678 ms (0.00% GC)
median time: 40.781 ms (0.00% GC)
mean time: 40.787 ms (0.00% GC)
maximum time: 41.057 ms (0.00% GC)
--------------
samples: 123
evals/sample: 1
julia> @benchmark vmapntt!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 13.19 KiB
allocs estimate: 91
--------------
minimum time: 29.133 ms (0.00% GC)
median time: 29.234 ms (0.00% GC)
mean time: 29.240 ms (0.00% GC)
maximum time: 29.557 ms (0.00% GC)
--------------
samples: 171
evals/sample: 1
EDIT: Crazy that you can get that kind of performance on a laptop. |
Yeah, I just got it today! Very pleased with the performance so far. Are you using any chip family-specific instruction cost modeling? |
Yes. The tables don't have Zen2, but they do have Zen1 (based on an 1800X). I'd be happy to accept a PR that changes the costs based on architecture, one adding more functions, etc. Did you ask out of curiosity, or out of suspicion of LoopVectorization making some sub-optimal decisions? |
Just out of curiosity! I don't have any handy side-by-side comparisons with other compilers/frameworks. I tried to run your benchmarks, but the manifest's version of LoopVectorization is hardcoded to your machine:
Regarding the original issue, I'm seeing the expected timings on
|
I updated the Project and Manifest. Note that the benchmarks require the Intel compilers (free neither as in freedom nor as in beer [unless you can get an academic or open source contributor license]), gcc, clang, and Eigen, so they're probably not easy to run on Windows. A PR to refactor the benchmark code to only run those with supported compilers would be welcome. Otherwise, I'll get around to it eventually. EDIT: The operation is memory bound, so 60 vs 30 ms more or less matches dual vs quad-channel memory. |
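The memory-bound claim can be sanity-checked with some rough arithmetic. The sketch below is only an estimate and rests on assumptions not stated in this thread: that each array holds 10^8 Float64 values, and that the non-temporal stores in vmapntt! write z without reading it first, so only three arrays' worth of data cross the memory bus. The helper name `bandwidth_gbs` is made up for illustration.

```julia
# Rough memory-traffic estimate for an elementwise map over x and y into z.
# ASSUMPTIONS (not from the thread): 10^8 Float64 elements per array, and
# non-temporal stores, so z is written without being read first.
n = 10^8
bytes_per_element = sizeof(Float64)   # 8 bytes
arrays_touched = 3                    # read x, read y, write z
total_bytes = arrays_touched * n * bytes_per_element

# Effective bandwidth in GB/s (1 GB = 10^9 bytes) for a runtime given in ms.
bandwidth_gbs(time_ms) = total_bytes / (time_ms * 1e-3) / 1e9

# 29.2 ms is roughly the vmapntt! median reported above; 60 ms is the
# other timing mentioned in the comment.
for t in (29.2, 60.0)
    println(t, " ms -> ", round(bandwidth_gbs(t); digits = 1), " GB/s")
end
```

Under those assumptions the two timings correspond to roughly 80 GB/s and 40 GB/s of effective bandwidth. That factor of two is what one would expect if the runtime were set by memory bandwidth rather than by compute, which fits the dual- versus quad-channel explanation.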
I don't think your update touched the relevant hardcoded line (last updated 2 months ago) |
You're right. Fixed, and I issued a new release. |
Following the example presented in the documentation, I see a ~20x slowdown between vmapnt! and vmapntt! on a Ryzen 4900HS using 8 threads (corresponding to the number of physical cores).