-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tweaks and more LoopVectorization support #1
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1 +/- ##
===========================================
- Coverage 93.93% 82.92% -11.02%
===========================================
Files 2 2
Lines 33 41 +8
===========================================
+ Hits 31 34 +3
- Misses 2 7 +5
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for investing time in this project! I also checked locally and confirmed this remarkable performance improvement.
Before
julia> using TropicalNumbers, Octavian, TropicalGEMM, BenchmarkTools; a = Tropical.(randn(1000, 1000)); (@benchmark Octavian.matmul_serial($a, $a))
┌ Warning: Your system has static(6) physical cores, but `Octavian.jl` only has 1 thread available. For the best performance, you should start Julia with at least static(6) threads.
└ @ Octavian ~/.julia/packages/Octavian/1LTHQ/src/init.jl:11
BenchmarkTools.Trial:
memory estimate: 7.63 MiB
allocs estimate: 2
--------------
minimum time: 137.420 ms (0.00% GC)
median time: 138.857 ms (0.00% GC)
mean time: 140.412 ms (0.09% GC)
maximum time: 157.847 ms (0.00% GC)
--------------
samples: 36
evals/sample: 1%
After
julia> @benchmark Octavian.matmul_serial($a, $a)
BenchmarkTools.Trial:
memory estimate: 7.63 MiB
allocs estimate: 2
--------------
minimum time: 66.231 ms (0.00% GC)
median time: 67.523 ms (0.00% GC)
mean time: 67.629 ms (0.28% GC)
maximum time: 70.431 ms (0.00% GC)
--------------
samples: 74
evals/sample: 1
Please allow me taking the chance to ask some technical questions in the PR comments. Thanks again.
Tropical(VectorizationBase.collapse_max(content(vu))) | ||
end | ||
@inline function VectorizationBase.contract_add(vu::Tropical{VecUnroll{N,W,T,V}}, ::StaticInt{K}) where {N,W,T,V,K} | ||
Tropical(VectorizationBase.contract_max(content(vu), StaticInt{K}())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to make contract_max
covered, but I can not figure out when it is used. it is defined as
julia> VectorizationBase.collapse_expr(3, :max, 1)
quote
#= /home/leo/.julia/packages/VectorizationBase/p0bvq/src/vecunroll/fmap.jl:173 =#
$(Expr(:meta, :inline))
#= /home/leo/.julia/packages/VectorizationBase/p0bvq/src/vecunroll/fmap.jl:174 =#
(v_1, v_2, v_3, v_4) = data(vu)
v_1 = max(v_1, v_3)
v_2 = max(v_2, v_4)
v_1 = max(v_1, v_2)
end
in VectorizationBase. It is similar to collapse_add
, wondering when do we need this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add the test that needed this, vecmemaybe
, and the ifelse
definition.
for f ∈ [:(Base.:(+)), :(Base.FastMath.add_fast)] | ||
@eval begin | ||
@inline $f(::StaticInt{0}, vx::Tropical{T}) where {T<:AbstractSIMD} = vx | ||
@inline $f(vx::Tropical{T}, ::StaticInt{0}) where {T<:AbstractSIMD} = vx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this is way more elegant!
lines 94,95,101 are not covered. Is it because they are defined for completeness of definition, but never used in practise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I need the following patch to pass the tests on my local host. All related packages are consistent with the CI machine. I can not figure out why my host is so different.
I set the lower bound of VectorizationBase to 0.19.11, but you're on 0.19.9. Could you try 0.19.11 and see if you still need the patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, we do not need that patch. thanks!
# julia 1.5 patch | ||
@inline function VectorizationBase.VecUnroll(data::Tuple{T,Vararg{T,N}}) where {N,T<:Tropical} | ||
Tropical.(VecUnroll(content.(data))) | ||
@inline function VectorizationBase.ifelse(f::F, m::AbstractMask, v1::Tropical, v2::Tropical, v3::Tropical) where {F} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I comment out this function on my local host, tests still pass. What is the purpose of this function?
(although I am having latest VectorizationBase and LoopVectorization, the behavior of my local host is slightly different from the CI machine, maybe some underlying package is different.)
@inline function Base.promote(a::Int, b::Tropical{T}, c::Tropical{T}) where {T<:VecUnroll} | ||
elem = a == 0 ? -Inf : 0.0 | ||
Tropical(T(elem)), b, c | ||
@inline LoopVectorization.vecmemaybe(x::Tropical) = x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I comment out vecmaybe
function on my local host, tests still pass. What is the purpose of this function?
@inline function VectorizationBase._vload(ptr::AbstractStridedPointer{Tropical{T}}, u::Unroll, a::A, si::StaticInt{RS}) where {T,A<:StaticBool,RS} | ||
res = VectorizationBase._vload(notropical(ptr), u, a, si) | ||
@inline function VectorizationBase._vload(ptr::AbstractStridedPointer{Tropical{T}}, u::Unroll, ::A, ::StaticInt{RS}) where {T,A<:StaticBool,RS} | ||
res = VectorizationBase._vload(notropical(ptr), u, A(), StaticInt{RS}()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always wonder what is the difference between passing an argument and writing the constructor explicitly?
Also, I need the following patch to pass the tests on my local host. All related packages are consistent with the CI machine. I can not figure out why my host is so different. FYI:
|
Let's get this PR merged, I also added some tests locally. We can always add more tests later. Cheers! |
Also some fixes, like
zero_vecunroll
should return a bunch ofVec
s of-Inf
.I also really don't like those promote definitions.
When where they needed? Do you have an example?
I suspect they aren't, but didn't want to touch them without some test case I could use to confirm.
I am also use
Base.FastMath.max_fast
instead ofmax
, as this cuts out a few instructions. It didn't have much of an impact on performance, but it cleans up the assembly nicely.Benchmarks on my computer:
We get
62.7
out of the theoretical peak of65.6
gflops on my machine.I think that's pretty good; very little performance left on the table.
Peak GFLOPS is calculated as clock speed * instructions/clock * ops/instruction.
Float64
has twice the theoretical peak thanks to fused multiply add instructions, i.e. 16Float64
operations per fma with AVX512, while forTropical{Float64}
it has 8 operations permax
and per+
.