Tweaks and more LoopVectorization support #1

Merged
merged 4 commits into from
Mar 20, 2021

chriselrod (Collaborator) commented Mar 19, 2021

Also some fixes, like zero_vecunroll should return a bunch of Vecs of -Inf.
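
Why the zero has to be -Inf can be seen in a minimal sketch of the tropical (max-plus) semiring. `Trop` below is a hypothetical stand-in type defined only for this example, not the `Tropical` type from TropicalNumbers, and it assumes the max-plus convention used throughout this PR:

```julia
# Hypothetical stand-in for a tropical number, defined here so the
# sketch runs without any packages.
struct Trop
    n::Float64
end

# Tropical "addition" is max; tropical "multiplication" is ordinary +.
Base.:+(a::Trop, b::Trop) = Trop(max(a.n, b.n))
Base.:*(a::Trop, b::Trop) = Trop(a.n + b.n)

# The additive identity must leave max unchanged, hence -Inf.
Base.zero(::Type{Trop}) = Trop(-Inf)
# The multiplicative identity must leave + unchanged, hence 0.0.
Base.one(::Type{Trop}) = Trop(0.0)

x = Trop(-3.5)
zero(Trop) + x == x   # true: max(-Inf, -3.5) == -3.5
one(Trop) * x == x    # true: 0.0 + (-3.5) == -3.5
Trop(0.0) + x == x    # false: an accumulator seeded with 0.0 hides negative entries
```

An accumulator initialised with 0.0 instead of -Inf would silently give wrong results whenever every candidate sum is negative.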

I also really don't like those promote definitions.
When were they needed? Do you have an example?
I suspect they aren't, but didn't want to touch them without some test case I could use to confirm.

I am also using Base.FastMath.max_fast instead of max, as this cuts out a few instructions. It didn't have much of an impact on performance, but it cleans up the assembly nicely.
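
As a small illustration (not code from this PR), the two functions agree on ordinary finite inputs. The extra instructions in `max` most likely come from its handling of special cases such as NaN and signed zeros, which `max_fast` is allowed to skip:

```julia
a, b = 1.5, -2.0

# For ordinary finite inputs both give the same answer.
same_direct = max(a, b) == Base.FastMath.max_fast(a, b)

# `@fastmath` rewrites `max` into its fast counterpart at the call site.
same_macro = @fastmath(max(a, b)) == max(a, b)

println(same_direct && same_macro)
```

To compare the generated code yourself, `@code_native` from the InteractiveUtils standard library can be applied to both calls.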

Benchmarks on my computer:

julia> A = randn(1000,1000);

julia> B = randn(1000,1000);

julia> Cref = A*B; C = similar(Cref);

julia> At = Tropical.(A);

julia> Bt = Tropical.(B);

julia> Ctref = At * Bt; Ct = similar(Ctref);

julia> @benchmark Octavian.matmul_serial!($C,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.310 ms (0.00% GC)
  median time:      17.409 ms (0.00% GC)
  mean time:        17.409 ms (0.00% GC)
  maximum time:     18.099 ms (0.00% GC)
  --------------
  samples:          288
  evals/sample:     1

julia> 2e-9*1000^3 / 17.31e-3
115.54015020219526

julia> C ≈ Cref
true

julia> @benchmark Octavian.matmul_serial!($Ct,$At,$Bt)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.906 ms (0.00% GC)
  median time:      31.946 ms (0.00% GC)
  mean time:        31.960 ms (0.00% GC)
  maximum time:     32.764 ms (0.00% GC)
  --------------
  samples:          157
  evals/sample:     1

julia> 2e-9*1000^3 / 31.906e-3
62.684134645521226

julia> Ct ≈ Ctref
true

julia> 4.1 * 2 * 16 # peak theoretical `Float64` gflops
131.2

julia> 4.1 * 2 * 8 # peak theoretical `Tropical{Float64}` gflops
65.6

We get 62.7 out of the theoretical peak of 65.6 gflops on my machine.
I think that's pretty good; very little performance left on the table.

Peak GFLOPS is calculated as clock speed * instructions/clock * ops/instruction.
Float64 has twice the theoretical peak thanks to fused multiply-add instructions: with AVX512, each fma performs 16 Float64 operations (8 lanes, one multiply and one add per lane), while Tropical{Float64} gets only 8 operations per max and 8 per +.
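
The arithmetic above can be reproduced with a short sketch. The helper names `peak_gflops` and `achieved_gflops` are made up for this example; the inputs (4.1 GHz, 2 instructions per clock, and the minimum benchmark times) are the ones quoted in this comment:

```julia
# Peak GFLOPS = clock speed (GHz) * instructions per clock * operations per instruction.
peak_gflops(ghz, instr_per_clock, ops_per_instr) = ghz * instr_per_clock * ops_per_instr

# Achieved GFLOPS for an n×n by n×n matmul: 2n^3 operations over the elapsed time.
achieved_gflops(n, seconds) = 2e-9 * n^3 / seconds

peak_f64  = peak_gflops(4.1, 2, 16)  # fma: 8 lanes * 2 operations
peak_trop = peak_gflops(4.1, 2, 8)   # max or +: 8 lanes * 1 operation

eff_f64  = achieved_gflops(1000, 17.31e-3) / peak_f64
eff_trop = achieved_gflops(1000, 31.906e-3) / peak_trop

println(round(100 * eff_f64; digits = 1), "% of peak for Float64")
println(round(100 * eff_trop; digits = 1), "% of peak for Tropical{Float64}")
```

On these numbers the Float64 kernel reaches roughly 88% of its peak and the tropical kernel roughly 96%, so the tropical kernel is the one running closer to its hardware limit.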

codecov bot commented Mar 19, 2021

Codecov Report

Merging #1 (0fe1d36) into master (56cc43a) will decrease coverage by 11.01%.
The diff coverage is 77.27%.

@@             Coverage Diff             @@
##           master       #1       +/-   ##
===========================================
- Coverage   93.93%   82.92%   -11.02%     
===========================================
  Files           2        2               
  Lines          33       41        +8     
===========================================
+ Hits           31       34        +3     
- Misses          2        7        +5     

Impacted Files    Coverage Δ
src/gemm.jl       82.50% <77.27%> (-11.25%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

GiggleLiu (Member) left a comment

Thank you for investing time in this project! I also checked locally and confirmed this remarkable performance improvement.

Before

julia> using TropicalNumbers, Octavian, TropicalGEMM, BenchmarkTools; a = Tropical.(randn(1000, 1000)); (@benchmark Octavian.matmul_serial($a, $a))
┌ Warning: Your system has static(6) physical cores, but `Octavian.jl` only has 1 thread available. For the best performance, you should start Julia with at least static(6) threads.
└ @ Octavian ~/.julia/packages/Octavian/1LTHQ/src/init.jl:11
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     137.420 ms (0.00% GC)
  median time:      138.857 ms (0.00% GC)
  mean time:        140.412 ms (0.09% GC)
  maximum time:     157.847 ms (0.00% GC)
  --------------
  samples:          36
  evals/sample:     1

After

julia> @benchmark Octavian.matmul_serial($a, $a)
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     66.231 ms (0.00% GC)
  median time:      67.523 ms (0.00% GC)
  mean time:        67.629 ms (0.28% GC)
  maximum time:     70.431 ms (0.00% GC)
  --------------
  samples:          74
  evals/sample:     1

Please allow me to take this chance to ask some technical questions in the PR comments. Thanks again.

Tropical(VectorizationBase.collapse_max(content(vu)))
end
@inline function VectorizationBase.contract_add(vu::Tropical{VecUnroll{N,W,T,V}}, ::StaticInt{K}) where {N,W,T,V,K}
Tropical(VectorizationBase.contract_max(content(vu), StaticInt{K}()))
GiggleLiu (Member):

I want to get contract_max covered by tests, but I cannot figure out when it is used. It is defined as

julia> VectorizationBase.collapse_expr(3, :max, 1)
quote
    #= /home/leo/.julia/packages/VectorizationBase/p0bvq/src/vecunroll/fmap.jl:173 =#
    $(Expr(:meta, :inline))
    #= /home/leo/.julia/packages/VectorizationBase/p0bvq/src/vecunroll/fmap.jl:174 =#
    (v_1, v_2, v_3, v_4) = data(vu)
    v_1 = max(v_1, v_3)
    v_2 = max(v_2, v_4)
    v_1 = max(v_1, v_2)
end

in VectorizationBase. It is similar to collapse_add. When do we need this function?
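
For readers following along, the generated code above is a pairwise reduction of several unrolled vectors into one. A plausible use, stated here as an assumption rather than something confirmed in this thread, is combining multiple accumulators at the end of an unrolled reduction loop; since tropical addition is max, `collapse_add` on `Tropical` values maps to `collapse_max`. The sketch below mirrors the generated code, using plain tuples as stand-ins for SIMD vectors, and `collapse_max_sketch` is a hypothetical name:

```julia
# Tuples of Float64 stand in for SIMD vectors; `max.` is applied elementwise.
function collapse_max_sketch(vs::NTuple{4,NTuple{2,Float64}})
    v_1, v_2, v_3, v_4 = vs
    v_1 = max.(v_1, v_3)   # 4 accumulators -> 2
    v_2 = max.(v_2, v_4)
    v_1 = max.(v_1, v_2)   # 2 accumulators -> 1
    return v_1
end

accs = ((1.0, -2.0), (0.5, 7.0), (3.0, -1.0), (-4.0, 2.0))
collapse_max_sketch(accs) == (3.0, 7.0)   # true
```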

chriselrod (Collaborator, Author):

I'll add the test that needed this, vecmemaybe, and the ifelse definition.

for f ∈ [:(Base.:(+)), :(Base.FastMath.add_fast)]
@eval begin
@inline $f(::StaticInt{0}, vx::Tropical{T}) where {T<:AbstractSIMD} = vx
@inline $f(vx::Tropical{T}, ::StaticInt{0}) where {T<:AbstractSIMD} = vx
GiggleLiu (Member):

Nice, this is way more elegant!

Lines 94, 95, and 101 are not covered. Is that because they are defined for completeness but never used in practice?

chriselrod (Collaborator, Author):

> Also, I need the following patch to pass the tests on my local host. All related packages are consistent with the CI machine. I cannot figure out why my host is so different.

I set the lower bound of VectorizationBase to 0.19.11, but you're on 0.19.9. Could you try 0.19.11 and see if you still need the patch?
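
As a side note on the version comparison, 0.19.9 really is below 0.19.11 even though it looks larger when read as a string. Julia's built-in `VersionNumber` literals compare field by field:

```julia
installed = v"0.19.9"
required  = v"0.19.11"

installed < required    # true: 9 < 11 in the patch field
"0.19.9" < "0.19.11"    # false: plain strings compare character by character
```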

GiggleLiu (Member):

You are right, we do not need that patch. Thanks!

# julia 1.5 patch
@inline function VectorizationBase.VecUnroll(data::Tuple{T,Vararg{T,N}}) where {N,T<:Tropical}
Tropical.(VecUnroll(content.(data)))
@inline function VectorizationBase.ifelse(f::F, m::AbstractMask, v1::Tropical, v2::Tropical, v3::Tropical) where {F}
GiggleLiu (Member):

If I comment out this function on my local host, the tests still pass. What is the purpose of this function?
(Although I have the latest VectorizationBase and LoopVectorization, my local host behaves slightly differently from the CI machine. Maybe some underlying package is different.)

@inline function Base.promote(a::Int, b::Tropical{T}, c::Tropical{T}) where {T<:VecUnroll}
elem = a == 0 ? -Inf : 0.0
Tropical(T(elem)), b, c
@inline LoopVectorization.vecmemaybe(x::Tropical) = x
GiggleLiu (Member):

If I comment out the vecmemaybe function on my local host, the tests still pass. What is the purpose of this function?

@inline function VectorizationBase._vload(ptr::AbstractStridedPointer{Tropical{T}}, u::Unroll, a::A, si::StaticInt{RS}) where {T,A<:StaticBool,RS}
res = VectorizationBase._vload(notropical(ptr), u, a, si)
@inline function VectorizationBase._vload(ptr::AbstractStridedPointer{Tropical{T}}, u::Unroll, ::A, ::StaticInt{RS}) where {T,A<:StaticBool,RS}
res = VectorizationBase._vload(notropical(ptr), u, A(), StaticInt{RS}())
GiggleLiu (Member):

I have always wondered: what is the difference between passing the argument along and constructing it explicitly from its type?
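
The two styles in the diff above can be compared in isolation. The sketch below uses hypothetical singleton types (`MyTrue`, `MyStaticInt`) as stand-ins for the static types in the hunk, and only demonstrates that the two forms return identical values; it does not establish why the PR switched from one to the other:

```julia
# Hypothetical singleton types standing in for static bool / static int types.
struct MyTrue end
struct MyStaticInt{N} end

# Style 1: bind the arguments to names and pass them along.
forward_value(a::A, si::MyStaticInt{RS}) where {A,RS} = (a, si)

# Style 2: ignore the values and rebuild them from the type parameters.
forward_type(::A, ::MyStaticInt{RS}) where {A,RS} = (A(), MyStaticInt{RS}())

args = (MyTrue(), MyStaticInt{8}())
forward_value(args...) === forward_type(args...)   # true: a singleton type has exactly one instance
```

Because all the information lives in the type, either form yields the same result; the second merely makes it explicit that the runtime value is never read.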

GiggleLiu (Member) commented Mar 19, 2021

Also, I need the following patch to pass the tests on my local host. All related packages are consistent with the CI machine. I cannot figure out why my host is so different.

#2

FYI:

(@v1.6) pkg> st
      Status `~/.julia/environments/v1.6/Project.toml`
  [1520ce14] AbstractTrees v0.3.4
  [a9ab73d0] BatchedRoutines v0.2.1
  [6e4b80f9] BenchmarkTools v0.5.0
  [d7b10767] BinarySparseTensors v0.1.0 `~/.julia/dev/BinarySparseTensors`
  [50ba71b6] BitBasis v0.7.2
  [81fc84e3] BliContractor v1.1.0 `~/.julia/dev/BliContractor`
  [336ed68f] CSV v0.8.4
  [052768ef] CUDA v2.6.2
  [159f3aea] Cairo v1.0.5
  [5ae59095] Colors v0.12.6
  [a81c6b42] Compose v0.9.2
  [150eb455] CoordinateTransformations v0.6.1
  [717857b8] DSP v0.6.10
  [a93c6f00] DataFrames v0.22.5
  [1313f7d8] DataFramesMeta v0.6.0
  [e30172f5] Documenter v0.26.3
  [35a29f4d] DocumenterTools v0.1.9
  [497a8b3b] DoubleFloats v1.1.18
  [b3ff564c] EliminateGraphs v0.1.0
  [d4d017d3] ExponentialUtilities v1.8.0 `~/.julia/dev/ExponentialUtilities`
  [7a1cc6ca] FFTW v1.3.2
  [5789e2e9] FileIO v1.6.4
  [652a1917] Fire v0.1.1
  [53c48c17] FixedPointNumbers v0.8.4
  [f6369f11] ForwardDiff v0.10.17
  [01680d73] GenericSVD v0.3.0
  [a2cc645c] GraphPlot v0.4.4
  [708ec375] Gumbo v0.8.0
  [f67ccb44] HDF5 v0.15.4
  [e47d643f] HierarchicalBipartition v0.1.0 `../../dev/HierarchicalBipartition`
  [9136182c] ITensors v0.1.40
  [42fd0dbc] IterativeSolvers v0.9.0
  [033835bb] JLD2 v0.2.4
  [682c06a0] JSON v0.21.1
  [e5e0dc1b] Juno v0.8.4
  [63c18a36] KernelAbstractions v0.5.3
  [b964fa9f] LaTeXStrings v1.2.1
  [23fbe1c1] Latexify v0.14.11
  [ac33ec98] LightBayesian v0.2.3 `~/.julia/dev/LightBayesian`
  [093fc24a] LightGraphs v1.3.5
  [98b081ad] Literate v2.8.0
  [aa2f6b4e] LogarithmicNumbers v0.4.0
  [bdcacae8] LoopVectorization v0.12.2
  [ae8d54c2] Luxor v2.10.0
  [0fe46d8b] MISExperimentUtils v0.5.6 `~/.julia/dev/MISExperimentUtils`
  [33e6dc65] MKL v0.4.0 `~/.julia/dev/MKL`
  [5dd3f0b1] MatchCore v0.1.0
  [ab4ef3a6] NiLang v0.8.2 `~/.julia/dev/NiLang`
  [575d3204] NiLangCore v0.8.2 `~/.julia/dev/NiLangCore`
  [ebe7aa44] OMEinsum v0.3.3 `~/.julia/dev/OMEinsum`
  [6fd5a793] Octavian v0.2.11
  [429524aa] Optim v1.2.4
  [32113eaa] PkgBenchmark v0.2.10
  [14b8a8f1] PkgTemplates v0.7.16
  [91a5bcdd] Plots v1.10.6
  [c3e4b0f8] Pluto v0.12.21
  [ef6358c6] PlutoMustache v0.2.4 `~/.julia/dev/PlutoMustache`
  [7f904dfe] PlutoUI v0.7.1
  [a5ec66cb] PlutoUtils v0.1.0 `https://github.com/GiggleLiu/PlutoUtils.jl#static-export`
  [ade5400d] Pomicon v0.3.4 `~/.julia/dev/Pomicon`
  [c46f51b8] ProfileView v0.6.9
  [438e738f] PyCall v1.92.2
  [d330b81b] PyPlot v2.9.0
  [ae029012] Requires v1.1.3
  [37e2e3b7] ReverseDiff v1.7.0
  [d4ee886f] ReversibleSeismic v0.1.0 `../../dev/ReversibleSeismic`
  [295af30f] Revise v3.1.14
  [6038ab10] Rotations v1.0.2
  [476501e8] SLEEFPirates v0.6.12
  [4456351a] SimpleTensorNetworks v0.2.1 `~/.julia/dev/SimpleTensorNetworks`
  [aa65fe97] SnoopCompile v2.6.0
  [90137ffa] StaticArrays v0.12.5
  [f3b207a7] StatsPlots v0.14.19
  [6d0aa2be] StochasticOptimizers v0.8.2 `~/.julia/dev/StochasticOptimizers`
  [123dc426] SymEngine v0.8.3
  [6aa20fa7] TensorOperations v3.1.0
  [8290d209] ThreadingUtilities v0.4.1
  [a4ad3063] TropicalGEMM v0.1.0 `../../dev/TropicalGEMM`
  [30ce92b6] TropicalMIS v0.1.2 `../../dev/TropicalMIS`
  [b3a74e9c] TropicalNumbers v0.2.2
  [d36a0d72] TropicalTensors v0.2.0 `~/.julia/dev/TropicalTensors`
  [9d95972d] TupleTools v1.2.0
  [1986cc42] Unitful v1.6.0
  [3d5dd08c] VectorizationBase v0.19.9
  [52a3aca4] Viznet v0.3.2 `~/.julia/dev/Viznet`
  [5872b779] Yao v0.6.3
  [32cfe2d9] YaoPlots v0.6.0 `~/.julia/dev/YaoPlots`

GiggleLiu (Member) commented

Let's get this PR merged. I also added some tests locally, and we can always add more later. Cheers!

@GiggleLiu GiggleLiu merged commit 54dc867 into master Mar 20, 2021
This pull request was closed.