Slow vmapntt! on Ryzen #141

Closed
stillyslalom opened this issue Aug 2, 2020 · 7 comments

@stillyslalom

Following the example presented in the documentation, I see vmapntt! running ~20x slower than vmapnt! on a Ryzen 4900HS using 8 threads (corresponding to the number of physical cores).

julia> using LoopVectorization, BenchmarkTools

julia> f(x,y) = exp(-0.5abs2(x - y))
f (generic function with 1 method)

julia> x = rand(10^8); y = rand(10^8); z = similar(x);

julia> @benchmark map!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     423.404 ms (0.00% GC)
  median time:      424.123 ms (0.00% GC)
  mean time:        424.162 ms (0.00% GC)
  maximum time:     425.239 ms (0.00% GC)
  --------------
  samples:          12
  evals/sample:     1

julia> @benchmark vmap!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     194.848 ms (0.00% GC)
  median time:      195.361 ms (0.00% GC)
  mean time:        195.444 ms (0.00% GC)
  maximum time:     197.766 ms (0.00% GC)
  --------------
  samples:          26
  evals/sample:     1

julia> @benchmark vmapnt!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     197.531 ms (0.00% GC)
  median time:      197.992 ms (0.00% GC)
  mean time:        197.966 ms (0.00% GC)
  maximum time:     198.306 ms (0.00% GC)
  --------------
  samples:          26
  evals/sample:     1

julia> Threads.nthreads()
8

julia> @benchmark vmapntt!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  10.62 GiB
  allocs estimate:  324999792
  --------------
  minimum time:     4.337 s (19.72% GC)
  median time:      4.404 s (21.41% GC)
  mean time:        4.404 s (21.41% GC)
  maximum time:     4.471 s (23.06% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 4900HS with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 8

(@v1.4) pkg> st
Status `C:\Users\alexa\.julia\environments\v1.4\Project.toml`
  [6e4b80f9] BenchmarkTools v0.5.0
  [052768ef] CUDA v1.2.1
  [bdcacae8] LoopVectorization v0.8.20

@chriselrod
Member

chriselrod commented Aug 2, 2020

Thanks for the issue. Following a change, multithreaded LoopVectorization.vmap variants weren't type stable. It should be fixed now. I'll issue a new release soon.
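(As a quick way to check for this kind of problem yourself, `@code_warntype` or `Test.@inferred` on the call will flag the instability. A minimal sketch, reusing the `f`, `x`, `y`, `z` from above but with much smaller arrays, since inference doesn't depend on size:)

```julia
using LoopVectorization, InteractiveUtils, Test

f(x, y) = exp(-0.5abs2(x - y))
x = rand(10^4); y = rand(10^4); z = similar(x);

# Any `::Any` in the output (printed in red at the REPL) marks a type instability.
@code_warntype vmapntt!(f, z, x, y)

# Or assert inferability directly; this throws if the return type cannot be
# inferred, so it also works well inside a test suite.
@inferred vmapntt!(f, z, x, y)
```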

julia> @benchmark map!(f, $z, $x, $y)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     425.385 ms (0.00% GC)
  median time:      425.683 ms (0.00% GC)
  mean time:        425.701 ms (0.00% GC)
  maximum time:     426.033 ms (0.00% GC)
  --------------
  samples:          12
  evals/sample:     1

julia> @benchmark vmap!(f, $z, $x, $y)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     178.897 ms (0.00% GC)
  median time:      179.035 ms (0.00% GC)
  mean time:        179.077 ms (0.00% GC)
  maximum time:     179.510 ms (0.00% GC)
  --------------
  samples:          28
  evals/sample:     1

julia> @benchmark vmapnt!(f, $z, $x, $y)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     147.471 ms (0.00% GC)
  median time:      147.832 ms (0.00% GC)
  mean time:        147.854 ms (0.00% GC)
  maximum time:     148.343 ms (0.00% GC)
  --------------
  samples:          34
  evals/sample:     1

julia> Threads.nthreads()
 18

julia> @benchmark vmapt!(f, $z, $x, $y)
 BenchmarkTools.Trial:
  memory estimate:  13.19 KiB
  allocs estimate:  91
  --------------
  minimum time:     40.678 ms (0.00% GC)
  median time:      40.781 ms (0.00% GC)
  mean time:        40.787 ms (0.00% GC)
  maximum time:     41.057 ms (0.00% GC)
  --------------
  samples:          123
  evals/sample:     1

julia> @benchmark vmapntt!(f, $z, $x, $y)
 BenchmarkTools.Trial:
  memory estimate:  13.19 KiB
  allocs estimate:  91
  --------------
  minimum time:     29.133 ms (0.00% GC)
  median time:      29.234 ms (0.00% GC)
  mean time:        29.240 ms (0.00% GC)
  maximum time:     29.557 ms (0.00% GC)
  --------------
  samples:          171
  evals/sample:     1

EDIT: Crazy that you can get that kind of performance on a laptop.

@stillyslalom
Author

Yeah, I just got it today! Very pleased with the performance so far. Are you using any chip family-specific instruction cost modeling?

@chriselrod
Member

Are you using any chip family-specific instruction cost modeling?

Yes.
For the most part, the costs come from Agner Fog's instruction tables, specifically the Skylake-X entries, but they are at least adjusted automatically to reflect the SIMD vector width.

The tables don't have Zen2, but they do have Zen1 (based on an 1800X).
You could estimate costs following the approach from here.
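
(For a rough sense of what such an estimate looks like, you can time a vectorized kernel and convert the result to cycles per element. This is only an illustrative sketch, not the cost-table machinery LoopVectorization actually uses, and `GHZ` below is an assumed clock speed you'd replace with your own chip's:)

```julia
using BenchmarkTools, LoopVectorization

# Assumed boost clock in GHz; substitute your CPU's actual frequency.
const GHZ = 4.3

# A simple @avx kernel whose per-element cost we want to estimate.
function vexp!(y, x)
    @avx for i in eachindex(x)
        y[i] = exp(x[i])
    end
    return y
end

x = rand(4096); y = similar(x);
t = @belapsed vexp!($y, $x)                    # seconds per call
cycles_per_element = t * GHZ * 1e9 / length(x)
println("≈ ", round(cycles_per_element, digits = 2), " cycles per element for exp")
```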

I'd be happy to accept a PR that changes the costs based on architecture, one that adds more functions, etc.

Did you ask out of curiosity, or out of suspicion of LoopVectorization making some sub-optimal decisions?

@stillyslalom
Author

Just out of curiosity! I don't have any handy side-by-side comparisons with other compilers/frameworks. I tried to run your benchmarks, but the manifest's version of LoopVectorization is hardcoded to your machine:

[[LoopVectorization]]
deps = ["DocStringExtensions", "LinearAlgebra", "OffsetArrays", "SIMDPirates", "SLEEFPirates", "UnPack", "VectorizationBase"]
path = "/home/chriselrod/.julia/dev/LoopVectorization"
uuid = "bdcacae8-1622-11e9-2a5c-532679323890"
version = "0.8.2"

Regarding the original issue, I'm seeing the expected timings on master:

julia> @benchmark vmapntt!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  6.66 KiB
  allocs estimate:  43
  --------------
  minimum time:     63.950 ms (0.00% GC)
  median time:      64.548 ms (0.00% GC)
  mean time:        64.991 ms (0.00% GC)
  maximum time:     72.985 ms (0.00% GC)
  --------------
  samples:          77
  evals/sample:     1

@chriselrod
Member

chriselrod commented Aug 2, 2020

I updated the Project and Manifest.

Note that the benchmarks require the Intel compilers (free neither as in freedom nor as in beer, unless you can get an academic or open-source contributor license), as well as gcc, clang, and Eigen, so they're probably not easy to run on Windows.

A PR to refactor the benchmark code so it only runs the cases whose compilers are available would be welcome (something along the lines of the sketch below). Otherwise, I'll get around to it eventually.
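
(One possible shape for that check, as a hypothetical sketch only; the compiler list and the `run_compiled_benchmarks` entry point are illustrative names, not the actual benchmark scripts:)

```julia
# Only run the compiled-language benchmarks whose compilers are on the PATH.
# `Sys.which` returns `nothing` when an executable cannot be found.
candidate_compilers = ["gcc", "g++", "gfortran", "clang", "icc", "ifort"]

available = filter(c -> Sys.which(c) !== nothing, candidate_compilers)

if isempty(available)
    @warn "No supported compilers found; skipping the compiled-language benchmarks."
else
    @info "Found compilers: $(join(available, ", "))"
    # run_compiled_benchmarks(available)   # hypothetical hook into the benchmark scripts
end
```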

EDIT: The operation is memory bound, so the 60 ms vs. 30 ms difference more or less matches dual- vs. quad-channel memory.
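
(Back-of-the-envelope, assuming the traffic per call is roughly two array reads plus one non-temporal array write of 10^8 Float64s each:)

```julia
# Rough effective-bandwidth check for z .= f.(x, y) with 10^8 Float64 elements:
# traffic ≈ 2 array reads + 1 (non-temporal) array write = 3 × 8 bytes per element.
bytes = 3 * 10^8 * 8                     # ≈ 2.4 GB moved per call

for (label, t) in (("≈60 ms (dual-channel laptop)", 0.060),
                   ("≈30 ms (quad-channel desktop)", 0.030))
    println(label, ": ≈ ", round(bytes / t / 1e9, digits = 1), " GB/s effective bandwidth")
end
```

That works out to roughly 40 GB/s and 80 GB/s respectively, which are plausible sustained figures for dual- and quad-channel DDR4.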

@stillyslalom
Author

I don't think your update touched the relevant hardcoded line (last updated 2 months ago):
https://github.com/chriselrod/LoopVectorization.jl/blame/1d7c2c103160ad3a85ae14bf7b0e1a29b2e047d7/benchmark/Manifest.toml#L426

@chriselrod
Member

You're right. Fixed, and I issued a new release.
