Move all platforms to use llvm.minimum/llvm.maximum for fmin/fmax #56371
Conversation
---

oh wow

```julia
julia> versioninfo()
Julia Version 1.12.0-DEV.1506
Commit 2cdfe062952 (2024-10-28 11:32 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 16 default, 0 interactive, 16 GC (on 16 virtual cores)

julia> function f(dij)
           vmin = Inf
           for x in dij
               vmin = min(x, vmin)
           end
           return vmin
       end
f (generic function with 2 methods)

julia> @be rand(Float64, 512000) f samples=500 evals=100
Benchmark: 6 samples with 100 evaluations
 min    1.764 ms
 median 1.768 ms
 mean   1.775 ms
 max    1.805 ms

julia> function f_llvm(dij)
           vmin = Inf
           for x in dij
               vmin = llvm_min(x, vmin)
           end
           return vmin
       end
f_llvm (generic function with 1 method)

julia> Base.@assume_effects :total @inline llvm_min(x::Float64, y::Float64) = ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)
llvm_min (generic function with 1 method)

julia> @be rand(Float64, 512000) f_llvm samples=500 evals=100
Benchmark: 163 samples with 100 evaluations
 min    49.408 μs
 median 51.382 μs
 mean   51.667 μs
 max    55.443 μs
```
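For context (my sketch, not from the thread): `llvm.minimum.f64` implements IEEE-754 minimum, which propagates NaN and orders `-0.0` below `+0.0` — the same semantics Julia's `min` provides. A quick check that the two agree on the edge cases that normally force `min` to branch:

```julia
# Sketch (assumes the `llvm.minimum.f64` intrinsic is available, as in the
# benchmark above): check that it agrees with Base.min on the IEEE edge
# cases that normally force `min` to branch — NaNs and signed zeros.
llvm_min(x::Float64, y::Float64) =
    ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)

for (x, y) in ((NaN, 1.0), (1.0, NaN), (-0.0, 0.0), (0.0, -0.0), (-Inf, 2.0))
    # `isequal` distinguishes -0.0 from 0.0 and treats NaN as equal to NaN
    @assert isequal(llvm_min(x, y), min(x, y))
end
```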
---

related question: what would it take to make the `isless` version below as fast as the `llvm_min` one?

```julia
julia> function f_isless(dij)
           vmin = Inf
           for x in dij
               pred = isless(x, vmin)
               vmin = pred ? x : vmin
           end
           return vmin
       end
f_isless (generic function with 2 methods)

julia> @be rand(Float64, 512000) f_isless samples=500 evals=100
Benchmark: 5 samples with 100 evaluations
 min    2.108 ms
 median 2.114 ms
 mean   2.118 ms
 max    2.140 ms

julia> function f_llvm(dij)
           vmin = Inf
           for x in dij
               vmin = llvm_min(x, vmin)
           end
           return vmin
       end
f_llvm (generic function with 5 methods)

julia> @be rand(Float64, 512000) f_llvm samples=500 evals=100
Benchmark: 162 samples with 100 evaluations
 min    49.071 μs
 median 51.354 μs
 mean   51.752 μs
 max    71.813 μs
```
---

Are you sure the difference isn't just due to branching?

---
it's not because of branching:

```julia
julia> function f_isless(dij)
           vmin = Inf
           for x in dij
               pred = isless(x, vmin)
               vmin = ifelse(pred, x, vmin)
           end
           return vmin
       end
f_isless (generic function with 1 method)

julia> @be rand(512000) f_isless samples=100 evals=50
Benchmark: 10 samples with 50 evaluations
 min    2.090 ms
 median 2.101 ms
 mean   2.116 ms
 max    2.211 ms
```
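As an aside (my sketch, not from the thread): one quick way to see whether a reduction loop like the one above vectorized is to look for SIMD vector types such as `<4 x double>` in the emitted LLVM IR:

```julia
# Sketch: inspect the IR of the branch-free loop above for vector types.
using InteractiveUtils  # provides code_llvm

function f_isless(dij)
    vmin = Inf
    for x in dij
        vmin = ifelse(isless(x, vmin), x, vmin)
    end
    return vmin
end

# Vectorized reductions show up as `<N x double>` operations in the IR.
ir = sprint(code_llvm, f_isless, Tuple{Vector{Float64}})
println(occursin(" x double>", ir) ? "vectorized" : "scalar")
```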
---

PowerPC failures are fixed in LLVM 19, so we may want to wait for that to land before merging this.
---

Should these get tfuncs? IIUC, making these be …
---

Maybe. I'm not sure we gain much by making them tfuncs. They get annotated …
---

This fixes #48487 (CC: @mikmoore) on x86_64 (I think it was already fixed on aarch64):

```asm
# [...]
# %bb.0:                                # %top
	#DEBUG_VALUE: #2:x <- $xmm0
	push	rbp
	mov	rbp, rsp
	movabs	rax, offset .LCPI0_0
	vmovsd	xmm1, qword ptr [rax]   # xmm1 = mem[0],zero
	vmaxsd	xmm0, xmm1, xmm0
	pop	rbp
	ret
```

```asm
# [...]
# %bb.0:                                # %top
	#DEBUG_VALUE: #5:x <- $xmm0
	push	rbp
	mov	rbp, rsp
	movabs	rax, offset .LCPI0_0
	vmovsd	xmm1, qword ptr [rax]   # xmm1 = mem[0],zero
	vmaxsd	xmm0, xmm1, xmm0
	pop	rbp
	ret
```
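A minimal way to reproduce this check locally (my sketch; the function name is made up):

```julia
# Sketch: dump the native code for max(-Inf, x). On a build with this PR it
# should be a handful of instructions with no branches (a single vmaxsd on
# x86_64), matching the listings above.
using InteractiveUtils  # provides code_native

clamp_lo(x) = max(-Inf, x)
code_native(clamp_lo, Tuple{Float64}; debuginfo=:none, syntax=:intel)
```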
---

Hopefully this also gives us a path to better performance on …
---

as far as I can tell … However, I'm continuously interested in what we can do for …
---

@Moelf part of the issue is that … For example, if one eliminates the branches here, then a factor of 2 speedup is possible. Alas, the ordering of floats becomes …
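To illustrate the trade-off being described (my example, not from the thread): a branch-free "min" built on a plain `<` comparison vectorizes well, but its ordering of floats is wrong on the IEEE edge cases:

```julia
# Sketch: a branch-free "min" using plain `<`. It vectorizes nicely, but the
# ordering of floats becomes wrong for NaNs and signed zeros.
unsafe_min(x, y) = ifelse(x < y, x, y)

@assert unsafe_min(-0.0, 0.0) === 0.0        # wrong: min should give -0.0
@assert unsafe_min(NaN, 1.0) === 1.0         # wrong: the NaN is dropped
@assert isequal(unsafe_min(1.0, NaN), NaN)   # ...but only in one argument order
```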
---

Right, but I don't think …
---

I think this discussion should continue in a dedicated ticket 🙂
---

Given that LLVM 19 is taking a while to merge and PowerPC is currently listed as tier 4 (used to build), can we rebase and merge this?
---

I've pushed 1 additional commit that removes some of the remaining complexity from when this was only ccalls.
---

giordano left a comment:
Looks good to me as far as I can tell.

I can confirm this works also on riscv64. On this PR:

```julia
julia> code_llvm(x->max(-Inf,x), Tuple{Float64}; debuginfo=:none)
; Function Signature: var"#20"(Float64)
define double @"julia_#20_2131"(double %"x::Float64") #0 {
top:
  #dbg_value(double %"x::Float64", !2, !DIExpression(), !14)
  ret double %"x::Float64"
}
```

while on current master:

```julia
julia> code_llvm(x->max(-Inf,x), Tuple{Float64}; debuginfo=:none)
; Function Signature: var"#26"(Float64)
define double @"julia_#26_4754"(double %"x::Float64") #0 {
top:
  #dbg_value(double %"x::Float64", !3, !DIExpression(), !15)
  %0 = fsub double 0xFFF0000000000000, %"x::Float64"
  %bitcast_coercion = bitcast double %0 to i64
  %1 = icmp sgt i64 %bitcast_coercion, -1
  %2 = select i1 %1, double 0xFFF0000000000000, double %"x::Float64"
  %3 = fcmp ord double %"x::Float64", 0.000000e+00
  %4 = select i1 %3, double %2, double %0
  ret double %4
}
```

which is another confirmation #48487 is indeed fixed. I see exactly the same improvement on x86_64 (on aarch64 we were already using the llvm functions so there's no change there).

Note that tests are passing on all platforms, the x86_64-apple-darwin job is only erroring during a cleanup step at the end, and upload jobs are failing because of the ongoing GitHub outage.
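The IR comparison above can also be checked semantically (my sketch): folding `max(-Inf, x)` down to `ret double %x` is only legal because `max` behaves as the identity for every input, including NaN (which both Julia's `max` and `llvm.maximum` propagate) and signed zeros:

```julia
# Sketch: max(-Inf, x) must act as the identity for the fold above to be
# legal, including for NaN and -0.0.
clamp_id(x) = max(-Inf, x)

for x in (1.5, -Inf, Inf, -0.0, 0.0, NaN)
    @assert isequal(clamp_id(x), x)
end
```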
---

Move all platforms to use llvm.minimum/llvm.maximum for fmin/fmax (JuliaLang#56371)

This used to not work but LLVM now has support for this on all platforms we care about.

Maybe this should be a builtin.

This allows for more vectorization opportunities since LLVM understands the code better.

Fix #48487.

Co-authored-by: Mosè Giordano <[email protected]>
Co-authored-by: oscarddssmith <[email protected]>