
Conversation

@gbaraldi (Member) commented Oct 28, 2024

This used to not work, but LLVM now has support for it on all platforms we care about.

Maybe this should be a builtin. This allows for more vectorization opportunities, since LLVM understands the code better.

Fix #48487.
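For reference, this is the kind of intrinsic binding the PR is about. A minimal sketch (the name `llvm_max` is illustrative, not an API added by this PR; the same pattern with `llvm.minimum.f64` appears in the benchmarks below):

```julia
# Hedged sketch: binding an LLVM float intrinsic directly via llvmcall.
# llvm.maximum.f64 implements the IEEE maximum operation: it propagates
# NaN and treats -0.0 as smaller than +0.0.
Base.@assume_effects :total @inline llvm_max(x::Float64, y::Float64) =
    ccall("llvm.maximum.f64", llvmcall, Float64, (Float64, Float64), x, y)
```

Because LLVM sees a single intrinsic instead of Julia's branchy `min`/`max` lowering, reduction loops over such calls are much easier for it to vectorize.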

@Moelf (Contributor) commented Oct 28, 2024

oh wow

julia> versioninfo()
Julia Version 1.12.0-DEV.1506
Commit 2cdfe062952 (2024-10-28 11:32 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 16 default, 0 interactive, 16 GC (on 16 virtual cores)

julia> function f(dij)
           vmin = Inf
           for x in dij
               vmin = min(x,vmin)
           end
           return vmin
       end
f (generic function with 2 methods)

julia> @be rand(Float64, 512000) f samples=500 evals=100
Benchmark: 6 samples with 100 evaluations
 min    1.764 ms
 median 1.768 ms
 mean   1.775 ms
 max    1.805 ms

julia> function f_llvm(dij)
           vmin = Inf
           for x in dij
               vmin = llvm_min(x,vmin)
           end
           return vmin
       end
f_llvm (generic function with 1 method)

julia> Base.@assume_effects :total @inline llvm_min(x::Float64, y::Float64) = ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)
llvm_min (generic function with 1 method)

julia> @be rand(Float64, 512000) f_llvm samples=500 evals=100
Benchmark: 163 samples with 100 evaluations
 min    49.408 μs
 median 51.382 μs
 mean   51.667 μs
 max    55.443 μs

@Moelf (Contributor) commented Oct 28, 2024

related question, what would it take to make isless() faster? (it's used for our findmin())

julia> function f_isless(dij)
           vmin = Inf
           for x in dij
               pred = isless(x, vmin)
               vmin = pred ? x : vmin
           end
           return vmin
       end
f_isless (generic function with 2 methods)

julia> @be rand(Float64, 512000) f_isless samples=500 evals=100
Benchmark: 5 samples with 100 evaluations
 min    2.108 ms
 median 2.114 ms
 mean   2.118 ms
 max    2.140 ms

julia> function f_llvm(dij)
           vmin = Inf
           for x in dij
               vmin = llvm_min(x, vmin)
           end
           return vmin
       end
f_llvm (generic function with 5 methods)

julia> @be rand(Float64, 512000) f_llvm samples=500 evals=100
Benchmark: 162 samples with 100 evaluations
 min    49.071 μs
 median 51.354 μs
 mean   51.752 μs
 max    71.813 μs

@Zentrik (Member) commented Oct 29, 2024

Are you sure the isless version isn't slow because it's branching?

@Moelf (Contributor) commented Oct 29, 2024

It's not because of branching:

julia> function f_isless(dij)
           vmin = Inf
           for x in dij
               pred = isless(x, vmin)
               vmin = ifelse(pred, x, vmin)
           end
           return vmin
       end
f_isless (generic function with 1 method)

julia> @be rand(512000) f_isless samples=100 evals=50
Benchmark: 10 samples with 50 evaluations
 min    2.090 ms
 median 2.101 ms
 mean   2.116 ms
 max    2.211 ms

@gbaraldi (Member, Author) commented: PowerPC failures are fixed in LLVM 19, so we may want to wait for that upgrade to land before merging this.

@oscardssmith (Member) commented: Should these get tfuncs? IIUC, making these be ccalls gives them inlining cost 10, which is probably higher than we want.

@gbaraldi (Member, Author) commented Oct 29, 2024

Maybe. I'm not sure we gain much by making them tfuncs; they get annotated @inline anyway. But I'm not against it.

@giordano (Member) commented: This fixes #48487 (CC: @mikmoore) on x86_64 (I think it was already fixed on aarch64):

% julia +pr56371 -q  
julia> code_native(x->max(1.0,x),Tuple{Float64};debuginfo=:none)
# [...]
# %bb.0:                                # %top
        #DEBUG_VALUE: #2:x <- $xmm0
        push    rbp
        mov     rbp, rsp
        movabs  rax, offset .LCPI0_0
        vmovsd  xmm1, qword ptr [rax]           # xmm1 = mem[0],zero
        vmaxsd  xmm0, xmm1, xmm0
        pop     rbp
        ret
julia> code_native(x->ifelse(1.0>x,1.0,x),Tuple{Float64};debuginfo=:none)
# [...]
# %bb.0:                                # %top
        #DEBUG_VALUE: #5:x <- $xmm0
        push    rbp
        mov     rbp, rsp
        movabs  rax, offset .LCPI0_0
        vmovsd  xmm1, qword ptr [rax]           # xmm1 = mem[0],zero
        vmaxsd  xmm0, xmm1, xmm0
        pop     rbp
        ret

@mikmoore (Contributor) commented: Hopefully this also gives us a path to better performance on minimum/maximum/extrema without the pitfalls of the current implementation or the complexity of #45581 (although I think that PR would be faster, it was too complicated to ever quite finish and would be annoying to maintain).

@Moelf (Contributor) commented Oct 30, 2024

As far as I can tell, minimum/maximum is as fast as it can be -- I don't know if you can get the same performance as Base.fast_minimum without fastmath.

However, I'm continuously interested in what we can do for argmin/findmin -- Mosè and I have yet to find a way to make findmin() SIMD without using SIMD.jl.
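For concreteness, the fastmath reduction being referenced looks roughly like this (a sketch; `fast_min_sketch` is an illustrative name, and fast-math relaxes NaN handling, unlike `Base.minimum`):

```julia
# Hedged sketch of a fast-math minimum reduction. @fastmath lets LLVM
# reassociate the reduction and use vector min instructions, but it
# assumes the input contains no NaNs.
function fast_min_sketch(v::AbstractVector{Float64})
    vmin = Inf
    @inbounds @simd for i in eachindex(v)
        vmin = @fastmath min(v[i], vmin)
    end
    return vmin
end
```

Tracking the index as well (as `findmin` must) is what breaks this vectorization pattern, since the reduction then carries two interdependent accumulators.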

@andrewjradcliffe (Contributor) commented: @Moelf, part of the issue is that isless treats NaNs specially. Ideally, it would follow the totalOrder predicate (Wikipedia is a bit vague; see IEEE 754-2019 for details). However, historical precedent means that we have weird treatment of NaN values.

For example, if one eliminates the branches here, then a factor of 2 speedup is possible. Alas, the ordering of floats becomes [-NaN, -Inf, ..., +Inf, +NaN] and this would be breaking.
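A branch-free total-order comparison along those lines can be sketched as follows (names are illustrative; note it orders floats as [-NaN, -Inf, ..., -0.0, +0.0, ..., +Inf, +NaN], which is exactly why it would be a breaking replacement for `isless`):

```julia
# Hedged sketch of a branchless IEEE totalOrder-style comparison.
# Reinterpret each float as a signed integer and flip the low 63 bits
# of negative values, mapping the sign-magnitude float encoding onto
# two's-complement order. NaNs then sort by sign bit: -NaN first,
# +NaN last.
@inline function total_order_lt(a::Float64, b::Float64)
    ia = reinterpret(Int64, a)
    ib = reinterpret(Int64, b)
    ia ⊻= (ia >> 63) >>> 1   # XOR with 0x7fff… when negative, 0 otherwise
    ib ⊻= (ib >> 63) >>> 1
    return ia < ib
end
```

In a reduction loop this compiles to a handful of integer ops with no branches, which is what makes vectorization straightforward.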

@Moelf (Contributor) commented Oct 31, 2024

Right, but I don't think using @fastmath x < vmin helps either; the loop just doesn't SIMD no matter what you try.

@giordano (Member) commented: I think this discussion should continue in a dedicated ticket 🙂

@oscardssmith added the labels performance (Must go faster) and compiler:codegen (Generation of LLVM IR and native code) on Nov 8, 2024
@oscardssmith (Member) commented: Given that LLVM 19 is taking a while to merge and PowerPC is currently listed as tier 4 (used to build), can we rebase and merge this?

@oscardssmith (Member) commented: I've pushed one additional commit that removes some of the remaining complexity from when this was only ccalls.

@giordano (Member) left a review comment:

Looks good to me as far as I can tell.

I can confirm this also works on riscv64. On this PR:

julia> code_llvm(x->max(-Inf,x),Tuple{Float64};debuginfo=:none)
; Function Signature: var"#20"(Float64)
define double @"julia_#20_2131"(double %"x::Float64") #0 {
top:
    #dbg_value(double %"x::Float64", !2, !DIExpression(), !14)
  ret double %"x::Float64"
}

while on current master:

julia> code_llvm(x->max(-Inf,x),Tuple{Float64};debuginfo=:none)
; Function Signature: var"#26"(Float64)
define double @"julia_#26_4754"(double %"x::Float64") #0 {
top:
    #dbg_value(double %"x::Float64", !3, !DIExpression(), !15)
  %0 = fsub double 0xFFF0000000000000, %"x::Float64"
  %bitcast_coercion = bitcast double %0 to i64
  %1 = icmp sgt i64 %bitcast_coercion, -1
  %2 = select i1 %1, double 0xFFF0000000000000, double %"x::Float64"
  %3 = fcmp ord double %"x::Float64", 0.000000e+00
  %4 = select i1 %3, double %2, double %0
  ret double %4
}

which is another confirmation that #48487 is indeed fixed. I see exactly the same improvement on x86_64 (on aarch64 we were already using the LLVM functions, so there's no change there).

Note that tests are passing on all platforms: the x86_64-apple-darwin job is only erroring during a cleanup step at the end, and upload jobs are failing because of the ongoing GitHub outage.

@oscardssmith merged commit a861a55 into master on Jan 14, 2025 (5 of 7 checks passed).
@oscardssmith deleted the gb/fmin-fun branch on January 14, 2025.
serenity4 pushed a commit to serenity4/julia that referenced this pull request on May 1, 2025: …liaLang#56371)

Co-authored-by: Mosè Giordano <[email protected]>
Co-authored-by: oscarddssmith <[email protected]>
@mbauman mentioned this pull request on May 2, 2025

Successfully merging this pull request may close this issue: Optimize float min/max with constant propagation (#48487)

8 participants