Slow vmapntt! on Ryzen #141

Closed
stillyslalom opened this issue Aug 2, 2020 · 7 comments

@stillyslalom

Following the example presented in the documentation, I see vmapntt! running ~20x slower than vmapnt! on a Ryzen 4900HS using 8 threads (corresponding to the number of physical cores).

julia> using LoopVectorization, BenchmarkTools

julia> f(x,y) = exp(-0.5abs2(x - y))
f (generic function with 1 method)

julia> x = rand(10^8); y = rand(10^8); z = similar(x);

julia> @benchmark map!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     423.404 ms (0.00% GC)
  median time:      424.123 ms (0.00% GC)
  mean time:        424.162 ms (0.00% GC)
  maximum time:     425.239 ms (0.00% GC)
  --------------
  samples:          12
  evals/sample:     1

julia> @benchmark vmap!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     194.848 ms (0.00% GC)
  median time:      195.361 ms (0.00% GC)
  mean time:        195.444 ms (0.00% GC)
  maximum time:     197.766 ms (0.00% GC)
  --------------
  samples:          26
  evals/sample:     1

julia> @benchmark vmapnt!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     197.531 ms (0.00% GC)
  median time:      197.992 ms (0.00% GC)
  mean time:        197.966 ms (0.00% GC)
  maximum time:     198.306 ms (0.00% GC)
  --------------
  samples:          26
  evals/sample:     1

julia> Threads.nthreads()
8

julia> @benchmark vmapntt!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  10.62 GiB
  allocs estimate:  324999792
  --------------
  minimum time:     4.337 s (19.72% GC)
  median time:      4.404 s (21.41% GC)
  mean time:        4.404 s (21.41% GC)
  maximum time:     4.471 s (23.06% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 4900HS with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 8

(@v1.4) pkg> st
Status `C:\Users\alexa\.julia\environments\v1.4\Project.toml`
  [6e4b80f9] BenchmarkTools v0.5.0
  [052768ef] CUDA v1.2.1
  [bdcacae8] LoopVectorization v0.8.20

@chriselrod
Member

chriselrod commented Aug 2, 2020

Thanks for the issue. Following a change, multithreaded LoopVectorization.vmap variants weren't type stable. It should be fixed now. I'll issue a new release soon.
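(As a quick way to check for this kind of problem yourself, `@code_warntype` or `Test.@inferred` on the call will flag the instability. A minimal sketch, reusing the `f`, `x`, `y`, `z` from above but with much smaller arrays, since inference doesn't depend on size:)

```julia
using LoopVectorization, InteractiveUtils, Test

f(x, y) = exp(-0.5abs2(x - y))
x = rand(10^4); y = rand(10^4); z = similar(x);

# Any `::Any` in the output (printed in red at the REPL) marks a type instability.
@code_warntype vmapntt!(f, z, x, y)

# Or assert inferability directly; this throws if the return type cannot be
# inferred, so it also works well inside a test suite.
@inferred vmapntt!(f, z, x, y)
```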

julia> @benchmark map!(f, $z, $x, $y)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     425.385 ms (0.00% GC)
  median time:      425.683 ms (0.00% GC)
  mean time:        425.701 ms (0.00% GC)
  maximum time:     426.033 ms (0.00% GC)
  --------------
  samples:          12
  evals/sample:     1

julia> @benchmark vmap!(f, $z, $x, $y)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     178.897 ms (0.00% GC)
  median time:      179.035 ms (0.00% GC)
  mean time:        179.077 ms (0.00% GC)
  maximum time:     179.510 ms (0.00% GC)
  --------------
  samples:          28
  evals/sample:     1

julia> @benchmark vmapnt!(f, $z, $x, $y)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     147.471 ms (0.00% GC)
  median time:      147.832 ms (0.00% GC)
  mean time:        147.854 ms (0.00% GC)
  maximum time:     148.343 ms (0.00% GC)
  --------------
  samples:          34
  evals/sample:     1

julia> Threads.nthreads()
 18

julia> @benchmark vmapt!(f, $z, $x, $y)
 BenchmarkTools.Trial:
  memory estimate:  13.19 KiB
  allocs estimate:  91
  --------------
  minimum time:     40.678 ms (0.00% GC)
  median time:      40.781 ms (0.00% GC)
  mean time:        40.787 ms (0.00% GC)
  maximum time:     41.057 ms (0.00% GC)
  --------------
  samples:          123
  evals/sample:     1

julia> @benchmark vmapntt!(f, $z, $x, $y)
 BenchmarkTools.Trial:
  memory estimate:  13.19 KiB
  allocs estimate:  91
  --------------
  minimum time:     29.133 ms (0.00% GC)
  median time:      29.234 ms (0.00% GC)
  mean time:        29.240 ms (0.00% GC)
  maximum time:     29.557 ms (0.00% GC)
  --------------
  samples:          171
  evals/sample:     1

EDIT: Crazy that you can get that kind of performance on a laptop.

@stillyslalom
Author

Yeah, I just got it today! Very pleased with the performance so far. Are you using any chip family-specific instruction cost modeling?

@chriselrod
Member

Are you using any chip family-specific instruction cost modeling?

Yes.
For the most part, the costs come from Agner Fog's instruction tables, specifically the Skylake-X entries, but they are at least adjusted automatically to reflect the SIMD vector width.

The tables don't have Zen2, but they do have Zen1 (based on an 1800X).
You could estimate costs following the approach from here.
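
(For a rough sense of what such an estimate looks like, you can time a vectorized kernel and convert the result to cycles per element. This is only an illustrative sketch, not the cost-table machinery LoopVectorization actually uses, and `GHZ` below is an assumed clock speed you'd replace with your own chip's:)

```julia
using BenchmarkTools, LoopVectorization

# Assumed boost clock in GHz; substitute your CPU's actual frequency.
const GHZ = 4.3

# A simple @avx kernel whose per-element cost we want to estimate.
function vexp!(y, x)
    @avx for i in eachindex(x)
        y[i] = exp(x[i])
    end
    return y
end

x = rand(4096); y = similar(x);
t = @belapsed vexp!($y, $x)                    # seconds per call
cycles_per_element = t * GHZ * 1e9 / length(x)
println("≈ ", round(cycles_per_element, digits = 2), " cycles per element for exp")
```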

I'd be happy to accept a PR that changes the costs based on architecture, one that adds more functions, etc.

Did you ask out of curiosity, or out of suspicion of LoopVectorization making some sub-optimal decisions?

@stillyslalom
Author

Just out of curiosity! I don't have any handy side-by-side comparisons with other compilers/frameworks. I tried to run your benchmarks, but the manifest's version of LoopVectorization is hardcoded to your machine:

[[LoopVectorization]]
deps = ["DocStringExtensions", "LinearAlgebra", "OffsetArrays", "SIMDPirates", "SLEEFPirates", "UnPack", "VectorizationBase"]
path = "/home/chriselrod/.julia/dev/LoopVectorization"
uuid = "bdcacae8-1622-11e9-2a5c-532679323890"
version = "0.8.2"

Regarding the original issue, I'm seeing the expected timings on master:

julia> @benchmark vmapntt!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  6.66 KiB
  allocs estimate:  43
  --------------
  minimum time:     63.950 ms (0.00% GC)
  median time:      64.548 ms (0.00% GC)
  mean time:        64.991 ms (0.00% GC)
  maximum time:     72.985 ms (0.00% GC)
  --------------
  samples:          77
  evals/sample:     1

@chriselrod
Member

chriselrod commented Aug 2, 2020

I updated the Project and Manifest.

Note that the benchmarks require the Intel compilers (free neither as in freedom nor as in beer, unless you can get an academic or open-source contributor license), as well as gcc, clang, and Eigen, so they're probably not easy to run on Windows.

A PR to refactor the benchmark code so it only runs the cases whose compilers are available would be welcome (something along the lines of the sketch below). Otherwise, I'll get around to it eventually.
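
(One possible shape for that check, as a hypothetical sketch only; the compiler list and the `run_compiled_benchmarks` entry point are illustrative names, not the actual benchmark scripts:)

```julia
# Only run the compiled-language benchmarks whose compilers are on the PATH.
# `Sys.which` returns `nothing` when an executable cannot be found.
candidate_compilers = ["gcc", "g++", "gfortran", "clang", "icc", "ifort"]

available = filter(c -> Sys.which(c) !== nothing, candidate_compilers)

if isempty(available)
    @warn "No supported compilers found; skipping the compiled-language benchmarks."
else
    @info "Found compilers: $(join(available, ", "))"
    # run_compiled_benchmarks(available)   # hypothetical hook into the benchmark scripts
end
```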

EDIT: The operation is memory bound, so the 60 ms vs. 30 ms difference more or less matches dual- vs. quad-channel memory.
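
(Back-of-the-envelope, assuming the traffic per call is roughly two array reads plus one non-temporal array write of 10^8 Float64s each:)

```julia
# Rough effective-bandwidth check for z .= f.(x, y) with 10^8 Float64 elements:
# traffic ≈ 2 array reads + 1 (non-temporal) array write = 3 × 8 bytes per element.
bytes = 3 * 10^8 * 8                     # ≈ 2.4 GB moved per call

for (label, t) in (("≈60 ms (dual-channel laptop)", 0.060),
                   ("≈30 ms (quad-channel desktop)", 0.030))
    println(label, ": ≈ ", round(bytes / t / 1e9, digits = 1), " GB/s effective bandwidth")
end
```

That works out to roughly 40 GB/s and 80 GB/s respectively, which are plausible sustained figures for dual- and quad-channel DDR4.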

@stillyslalom
Author

I don't think your update touched the relevant hardcoded line (last updated 2 months ago):
https://github.com/chriselrod/LoopVectorization.jl/blame/1d7c2c103160ad3a85ae14bf7b0e1a29b2e047d7/benchmark/Manifest.toml#L426

@chriselrod
Member

You're right. Fixed, and I issued a new release.
