Tweaks and more LoopVectorization support #1

chriselrod · 2021-03-19T00:17:16Z

Also some fixes, like zero_vecunroll should return a bunch of Vecs of -Inf.
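
For context on why -Inf is the right fill value: in the max-plus (tropical) semiring, addition is `max` and multiplication is ordinary `+`, so the additive identity is -Inf rather than 0. Here is a minimal sketch, using a hypothetical stand-in type `Trop` instead of the real `Tropical` from TropicalNumbers:

```julia
# Hypothetical stand-in for TropicalNumbers.Tropical; illustration only.
struct Trop{T<:Real}
    n::T
end

# Max-plus semiring: "addition" is max, "multiplication" is ordinary +.
Base.:+(a::Trop, b::Trop) = Trop(max(a.n, b.n))
Base.:*(a::Trop, b::Trop) = Trop(a.n + b.n)

# The additive identity must satisfy zero + x == x, hence -Inf.
Base.zero(::Type{Trop{T}}) where {T} = Trop(typemin(T))
# The multiplicative identity must satisfy one * x == x, hence 0.
Base.one(::Type{Trop{T}}) where {T} = Trop(zero(T))

x = Trop(-3.5)
@assert zero(Trop{Float64}) + x == x   # -Inf is neutral under max
@assert one(Trop{Float64}) * x == x    # 0.0 is neutral under +
@assert Trop(0.0) + x != x             # an accumulator starting at 0.0 would be wrong
```

An accumulator initialized with ordinary zeros would silently clamp every negative result to 0.0, which is presumably why the zeroed unrolled vectors need to hold -Inf.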

I also really don't like those promote definitions.
When were they needed? Do you have an example?
I suspect they aren't, but didn't want to touch them without some test case I could use to confirm.

I am also using Base.FastMath.max_fast instead of max, as this cuts out a few instructions. It didn't have much of an impact on performance, but it cleans up the assembly nicely.
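
A small sketch of the difference, as far as I understand the Base definitions: `max` has to propagate NaN and handle signed zeros, while `max_fast` is allowed to skip those checks. For ordinary finite inputs the two should agree.

```julia
# Base.max must propagate NaN regardless of argument order.
@assert isnan(max(NaN, 1.0))
@assert isnan(max(1.0, NaN))

# Base.FastMath.max_fast makes no such promise, so its result for NaN inputs
# should not be relied on. For non-NaN inputs it matches max.
xs = randn(1000)
ys = randn(1000)
@assert all(Base.FastMath.max_fast.(xs, ys) .== max.(xs, ys))
```

The extra instructions can be inspected with `@code_native max(1.0, 2.0)` versus `@code_native Base.FastMath.max_fast(1.0, 2.0)`; the exact output depends on the CPU and Julia version.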

Benchmarks on my computer:

julia> A = randn(1000,1000);

julia> B = randn(1000,1000);

julia> Cref = A*B; C = similar(Cref);

julia> At = Tropical.(A);

julia> Bt = Tropical.(B);

julia> Ctref = At * Bt; Ct = similar(Ctref);

julia> @benchmark Octavian.matmul_serial!($C,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.310 ms (0.00% GC)
  median time:      17.409 ms (0.00% GC)
  mean time:        17.409 ms (0.00% GC)
  maximum time:     18.099 ms (0.00% GC)
  --------------
  samples:          288
  evals/sample:     1

julia> 2e-9*1000^3 / 17.31e-3
115.54015020219526

julia> C ≈ Cref
true

julia> @benchmark Octavian.matmul_serial!($Ct,$At,$Bt)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.906 ms (0.00% GC)
  median time:      31.946 ms (0.00% GC)
  mean time:        31.960 ms (0.00% GC)
  maximum time:     32.764 ms (0.00% GC)
  --------------
  samples:          157
  evals/sample:     1

julia> 2e-9*1000^3 / 31.906e-3
62.684134645521226

julia> Ct ≈ Ctref
true

julia> 4.1 * 2 * 16 # peak theoretical `Float64` gflops
131.2

julia> 4.1 * 2 * 8 # peak theoretical `Tropical{Float64}` gflops
65.6

We get 62.7 out of the theoretical peak of 65.6 gflops on my machine.
I think that's pretty good; very little performance left on the table.

Peak GFLOPS is calculated as clock speed * instructions/clock * ops/instruction.
Float64 has twice the theoretical peak thanks to fused multiply-add instructions: with AVX512, one fma performs 16 Float64 operations (8 multiplies and 8 adds), whereas for Tropical{Float64} each max and each + performs only 8.
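
The arithmetic above can be wrapped in two small helpers. This is only a sketch: the function names are made up for illustration, and reading 4.1 as the clock speed in GHz and 2 as the instructions per clock is inferred from the formula and the REPL lines above.

```julia
# Peak GFLOPS = clock (GHz) * instructions per clock * floating-point ops per instruction.
peak_gflops(clock_ghz, instr_per_clock, ops_per_instr) =
    clock_ghz * instr_per_clock * ops_per_instr

# An n-by-n matrix product does 2n^3 operations (one multiply and one add per term).
achieved_gflops(n, seconds) = 2e-9 * n^3 / seconds

peak_f64  = peak_gflops(4.1, 2, 16)  # AVX512 fma: 8 lanes * 2 ops
peak_trop = peak_gflops(4.1, 2, 8)   # max or +: 8 lanes * 1 op

# Fraction of peak reached, using the minimum times from the benchmarks above.
eff_f64  = achieved_gflops(1000, 17.31e-3) / peak_f64
eff_trop = achieved_gflops(1000, 31.906e-3) / peak_trop

println("Float64:           ", round(100 * eff_f64; digits = 1), "% of peak")
println("Tropical{Float64}: ", round(100 * eff_trop; digits = 1), "% of peak")
```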

codecov · 2021-03-19T00:20:17Z

Codecov Report

Merging #1 (0fe1d36) into master (56cc43a) will decrease coverage by 11.01%.
The diff coverage is 77.27%.

@@             Coverage Diff             @@
##           master       #1       +/-   ##
===========================================
- Coverage   93.93%   82.92%   -11.02%     
===========================================
  Files           2        2               
  Lines          33       41        +8     
===========================================
+ Hits           31       34        +3     
- Misses          2        7        +5

Impacted Files	Coverage Δ
src/gemm.jl	`82.50% <77.27%> (-11.25%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

GiggleLiu

Thank you for investing time in this project! I also checked locally and confirmed this remarkable performance improvement.

Before

julia> using TropicalNumbers, Octavian, TropicalGEMM, BenchmarkTools; a = Tropical.(randn(1000, 1000)); (@benchmark Octavian.matmul_serial($a, $a))
┌ Warning: Your system has static(6) physical cores, but `Octavian.jl` only has 1 thread available. For the best performance, you should start Julia with at least static(6) threads.
└ @ Octavian ~/.julia/packages/Octavian/1LTHQ/src/init.jl:11
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     137.420 ms (0.00% GC)
  median time:      138.857 ms (0.00% GC)
  mean time:        140.412 ms (0.09% GC)
  maximum time:     157.847 ms (0.00% GC)
  --------------
  samples:          36
  evals/sample:     1

After

julia> @benchmark Octavian.matmul_serial($a, $a)
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     66.231 ms (0.00% GC)
  median time:      67.523 ms (0.00% GC)
  mean time:        67.629 ms (0.28% GC)
  maximum time:     70.431 ms (0.00% GC)
  --------------
  samples:          74
  evals/sample:     1
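
As a quick check on the two runs above, the speedup and the achieved throughput can be computed from the minimum times. The variable names are illustrative only.

```julia
# Minimum times from the "Before" and "After" benchmarks, in milliseconds.
before_ms, after_ms = 137.420, 66.231

speedup = before_ms / after_ms
# 2n^3 operations for n = 1000, converted to GFLOPS.
gflops_after = 2e-9 * 1000^3 / (after_ms * 1e-3)

println("speedup: ", round(speedup; digits = 2), "x")
println("after:   ", round(gflops_after; digits = 1), " GFLOPS")
```

These runs use the allocating `matmul_serial` on a different machine, so the numbers are not directly comparable with the `matmul_serial!` results earlier in the thread.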

Please allow me to take this chance to ask some technical questions in the PR comments. Thanks again.

GiggleLiu · 2021-03-19T03:37:58Z

src/gemm.jl

+    Tropical(VectorizationBase.collapse_max(content(vu)))
+end
+@inline function VectorizationBase.contract_add(vu::Tropical{VecUnroll{N,W,T,V}}, ::StaticInt{K}) where {N,W,T,V,K}
+    Tropical(VectorizationBase.contract_max(content(vu), StaticInt{K}()))


I want to get contract_max covered by tests, but I cannot figure out when it is used. It is defined as

julia> VectorizationBase.collapse_expr(3, :max, 1)
quote
    #= /home/leo/.julia/packages/VectorizationBase/p0bvq/src/vecunroll/fmap.jl:173 =#
    $(Expr(:meta, :inline))
    #= /home/leo/.julia/packages/VectorizationBase/p0bvq/src/vecunroll/fmap.jl:174 =#
    (v_1, v_2, v_3, v_4) = data(vu)
    v_1 = max(v_1, v_3)
    v_2 = max(v_2, v_4)
    v_1 = max(v_1, v_2)
end

in VectorizationBase. It is similar to collapse_add. When do we need this function?

I'll add the test that needed this, vecmemaybe, and the ifelse definition.
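
The generated expression quoted above is a pairwise (tree) reduction over the unrolled vectors. Here is a minimal sketch of the same pattern on a plain tuple; the `collapse` helper is hypothetical and is not VectorizationBase's implementation.

```julia
# Pairwise (tree) reduction of a tuple with a binary op, halving the length each step.
# Illustration only; not how VectorizationBase generates its code.
function collapse(op, vs::Tuple)
    n = length(vs)
    n == 1 && return vs[1]
    h = n ÷ 2
    # Combine element i with element i + h; carry over an unpaired trailing element.
    paired = ntuple(i -> op(vs[i], vs[i + h]), h)
    rest = isodd(n) ? (vs[n],) : ()
    return collapse(op, (paired..., rest...))
end

# Four inputs: max(v_1, v_3), max(v_2, v_4), then max of the two results.
@assert collapse(max, (1.0, 4.0, 2.0, 3.0)) == 4.0
@assert collapse(+, (1, 2, 3, 4, 5)) == 15
```

Pairing elements this way keeps the dependency chain logarithmic in the number of vectors, which is the usual reason to prefer it over a left-to-right fold.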

GiggleLiu · 2021-03-19T03:54:05Z

src/gemm.jl

+for f ∈ [:(Base.:(+)), :(Base.FastMath.add_fast)]
+    @eval begin
+        @inline $f(::StaticInt{0}, vx::Tropical{T}) where {T<:AbstractSIMD} = vx
+        @inline $f(vx::Tropical{T}, ::StaticInt{0}) where {T<:AbstractSIMD} = vx


Nice, this is way more elegant!

Lines 94, 95, and 101 are not covered. Is that because they are defined for completeness but never used in practice?

Also, I need the following patch to pass the tests on my local host. All related packages are consistent with the CI machine. I cannot figure out why my host is so different.

I set the lower bound of VectorizationBase to 0.19.11, but you're on 0.19.9. Could you try 0.19.11 and see if you still need the patch?

You are right, we do not need that patch. Thanks!
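
For readers unfamiliar with the `@eval` loop in the hunk above, here is a minimal sketch of the same pattern with hypothetical stand-in types: `Zero` plays the role of `StaticInt{0}` and `Trop` the role of `Tropical{<:AbstractSIMD}`. Treating the static zero as the additive identity is my reading of the diff, not something stated in the thread.

```julia
# Hypothetical stand-ins for StaticInt{0} and Tropical; illustration only.
struct Zero end
struct Trop{T<:Real}
    n::T
end

# Generate the same pair of methods for both + and its fast-math variant.
for f in [:(Base.:(+)), :(Base.FastMath.add_fast)]
    @eval begin
        # Adding the static zero leaves the tropical value unchanged.
        $f(::Zero, x::Trop) = x
        $f(x::Trop, ::Zero) = x
    end
end

@assert Zero() + Trop(2.0) === Trop(2.0)
@assert Base.FastMath.add_fast(Trop(2.0), Zero()) === Trop(2.0)
```

Dispatching on the static type means the identity case is resolved at compile time, with no runtime comparison against zero.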

GiggleLiu · 2021-03-19T03:57:01Z

src/gemm.jl

-# julia 1.5 patch
-@inline function VectorizationBase.VecUnroll(data::Tuple{T,Vararg{T,N}}) where {N,T<:Tropical}
-    Tropical.(VecUnroll(content.(data)))
+@inline function VectorizationBase.ifelse(f::F, m::AbstractMask, v1::Tropical, v2::Tropical, v3::Tropical) where {F}


If I comment out this function on my local host, the tests still pass. What is the purpose of this function?
(Although I have the latest VectorizationBase and LoopVectorization, my local host behaves slightly differently from the CI machine; maybe some underlying package differs.)

GiggleLiu · 2021-03-19T04:19:59Z

src/gemm.jl

-@inline function Base.promote(a::Int, b::Tropical{T}, c::Tropical{T}) where {T<:VecUnroll}
-    elem = a == 0 ? -Inf : 0.0
-    Tropical(T(elem)), b, c
+@inline LoopVectorization.vecmemaybe(x::Tropical) = x


If I comment out the vecmemaybe function on my local host, the tests still pass. What is the purpose of this function?

GiggleLiu · 2021-03-19T04:23:16Z

src/gemm.jl

-@inline function VectorizationBase._vload(ptr::AbstractStridedPointer{Tropical{T}}, u::Unroll, a::A, si::StaticInt{RS}) where {T,A<:StaticBool,RS}
-    res = VectorizationBase._vload(notropical(ptr), u, a, si)
+@inline function VectorizationBase._vload(ptr::AbstractStridedPointer{Tropical{T}}, u::Unroll, ::A, ::StaticInt{RS}) where {T,A<:StaticBool,RS}
+    res = VectorizationBase._vload(notropical(ptr), u, A(), StaticInt{RS}())


I have always wondered: what is the difference between passing an argument through and writing the constructor explicitly?
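
The thread does not state the motivation for this change. As a general Julia point, the two forms should be interchangeable when the argument's type is a singleton, because the value is fully determined by the type. Here is a minimal sketch with hypothetical types standing in for `StaticBool`:

```julia
# Hypothetical singleton types standing in for StaticBool; illustration only.
abstract type MyBool end
struct MyTrue <: MyBool end
struct MyFalse <: MyBool end

inner(::MyTrue) = "aligned"
inner(::MyFalse) = "unaligned"

# Form 1: bind the argument to a name and forward it.
outer_named(a::A) where {A<:MyBool} = inner(a)
# Form 2: leave the argument unnamed and rebuild it from its type parameter.
outer_rebuilt(::A) where {A<:MyBool} = inner(A())

@assert Base.issingletontype(MyTrue)        # exactly one instance per type
@assert MyTrue() === MyTrue()
@assert outer_named(MyFalse()) == outer_rebuilt(MyFalse())
```

The second form makes it explicit that only the type carries information, so nothing needs to be read from the argument at runtime.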

GiggleLiu · 2021-03-19T04:31:48Z

Also, I need the following patch to pass the tests on my local host. All related packages are consistent with the CI machine. I can not figure out why my host is so different.

#2

FYI:

(@v1.6) pkg> st
      Status `~/.julia/environments/v1.6/Project.toml`
  [1520ce14] AbstractTrees v0.3.4
  [a9ab73d0] BatchedRoutines v0.2.1
  [6e4b80f9] BenchmarkTools v0.5.0
  [d7b10767] BinarySparseTensors v0.1.0 `~/.julia/dev/BinarySparseTensors`
  [50ba71b6] BitBasis v0.7.2
  [81fc84e3] BliContractor v1.1.0 `~/.julia/dev/BliContractor`
  [336ed68f] CSV v0.8.4
  [052768ef] CUDA v2.6.2
  [159f3aea] Cairo v1.0.5
  [5ae59095] Colors v0.12.6
  [a81c6b42] Compose v0.9.2
  [150eb455] CoordinateTransformations v0.6.1
  [717857b8] DSP v0.6.10
  [a93c6f00] DataFrames v0.22.5
  [1313f7d8] DataFramesMeta v0.6.0
  [e30172f5] Documenter v0.26.3
  [35a29f4d] DocumenterTools v0.1.9
  [497a8b3b] DoubleFloats v1.1.18
  [b3ff564c] EliminateGraphs v0.1.0
  [d4d017d3] ExponentialUtilities v1.8.0 `~/.julia/dev/ExponentialUtilities`
  [7a1cc6ca] FFTW v1.3.2
  [5789e2e9] FileIO v1.6.4
  [652a1917] Fire v0.1.1
  [53c48c17] FixedPointNumbers v0.8.4
  [f6369f11] ForwardDiff v0.10.17
  [01680d73] GenericSVD v0.3.0
  [a2cc645c] GraphPlot v0.4.4
  [708ec375] Gumbo v0.8.0
  [f67ccb44] HDF5 v0.15.4
  [e47d643f] HierarchicalBipartition v0.1.0 `../../dev/HierarchicalBipartition`
  [9136182c] ITensors v0.1.40
  [42fd0dbc] IterativeSolvers v0.9.0
  [033835bb] JLD2 v0.2.4
  [682c06a0] JSON v0.21.1
  [e5e0dc1b] Juno v0.8.4
  [63c18a36] KernelAbstractions v0.5.3
  [b964fa9f] LaTeXStrings v1.2.1
  [23fbe1c1] Latexify v0.14.11
  [ac33ec98] LightBayesian v0.2.3 `~/.julia/dev/LightBayesian`
  [093fc24a] LightGraphs v1.3.5
  [98b081ad] Literate v2.8.0
  [aa2f6b4e] LogarithmicNumbers v0.4.0
  [bdcacae8] LoopVectorization v0.12.2
  [ae8d54c2] Luxor v2.10.0
  [0fe46d8b] MISExperimentUtils v0.5.6 `~/.julia/dev/MISExperimentUtils`
  [33e6dc65] MKL v0.4.0 `~/.julia/dev/MKL`
  [5dd3f0b1] MatchCore v0.1.0
  [ab4ef3a6] NiLang v0.8.2 `~/.julia/dev/NiLang`
  [575d3204] NiLangCore v0.8.2 `~/.julia/dev/NiLangCore`
  [ebe7aa44] OMEinsum v0.3.3 `~/.julia/dev/OMEinsum`
  [6fd5a793] Octavian v0.2.11
  [429524aa] Optim v1.2.4
  [32113eaa] PkgBenchmark v0.2.10
  [14b8a8f1] PkgTemplates v0.7.16
  [91a5bcdd] Plots v1.10.6
  [c3e4b0f8] Pluto v0.12.21
  [ef6358c6] PlutoMustache v0.2.4 `~/.julia/dev/PlutoMustache`
  [7f904dfe] PlutoUI v0.7.1
  [a5ec66cb] PlutoUtils v0.1.0 `https://github.com/GiggleLiu/PlutoUtils.jl#static-export`
  [ade5400d] Pomicon v0.3.4 `~/.julia/dev/Pomicon`
  [c46f51b8] ProfileView v0.6.9
  [438e738f] PyCall v1.92.2
  [d330b81b] PyPlot v2.9.0
  [ae029012] Requires v1.1.3
  [37e2e3b7] ReverseDiff v1.7.0
  [d4ee886f] ReversibleSeismic v0.1.0 `../../dev/ReversibleSeismic`
  [295af30f] Revise v3.1.14
  [6038ab10] Rotations v1.0.2
  [476501e8] SLEEFPirates v0.6.12
  [4456351a] SimpleTensorNetworks v0.2.1 `~/.julia/dev/SimpleTensorNetworks`
  [aa65fe97] SnoopCompile v2.6.0
  [90137ffa] StaticArrays v0.12.5
  [f3b207a7] StatsPlots v0.14.19
  [6d0aa2be] StochasticOptimizers v0.8.2 `~/.julia/dev/StochasticOptimizers`
  [123dc426] SymEngine v0.8.3
  [6aa20fa7] TensorOperations v3.1.0
  [8290d209] ThreadingUtilities v0.4.1
  [a4ad3063] TropicalGEMM v0.1.0 `../../dev/TropicalGEMM`
  [30ce92b6] TropicalMIS v0.1.2 `../../dev/TropicalMIS`
  [b3a74e9c] TropicalNumbers v0.2.2
  [d36a0d72] TropicalTensors v0.2.0 `~/.julia/dev/TropicalTensors`
  [9d95972d] TupleTools v1.2.0
  [1986cc42] Unitful v1.6.0
  [3d5dd08c] VectorizationBase v0.19.9
  [52a3aca4] Viznet v0.3.2 `~/.julia/dev/Viznet`
  [5872b779] Yao v0.6.3
  [32cfe2d9] YaoPlots v0.6.0 `~/.julia/dev/YaoPlots`

GiggleLiu · 2021-03-20T02:06:42Z

Let's get this PR merged, I also added some tests locally. We can always add more tests later. Cheers!

Tweaks and more LoopVectorization support

6187673

chriselrod requested a review from GiggleLiu March 19, 2021 00:17

chriselrod added 3 commits March 18, 2021 21:03

Drop promote functions, add StaticInt methods

c0f1bd9

Define mul_fast and add_fast for static numbers as well

9fdaa51

Define fma method for StaticInt

0fe1d36

GiggleLiu approved these changes Mar 19, 2021

GiggleLiu mentioned this pull request Mar 19, 2021

Supporting Tropical numbers JuliaSIMD/LoopVectorization.jl#201

Closed

GiggleLiu merged commit 54dc867 into master Mar 20, 2021

This pull request was closed.
