Describe the bug
I'm trying to run https://github.com/WaterLily-jl/WaterLily-Benchmarks on GB200 with Julia v1.12, and I'm getting:
ERROR: LoadError: Failed to compile PTX code (ptxas exited with code 255)
Invocation arguments: --generate-line-info --verbose --gpu-name sm_100 --output-file /tmp/jl_MjiXQV2lej.cubin /tmp/jl_kD4L6b8dQb.ptx
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 119; error : Illegal modifier '.NaN' for instruction 'max'
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 131; error : Illegal modifier '.NaN' for instruction 'max'
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 147; error : Illegal modifier '.NaN' for instruction 'max'
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 160; error : Illegal modifier '.NaN' for instruction 'max'
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 181; error : Illegal modifier '.NaN' for instruction 'max'
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 194; error : Illegal modifier '.NaN' for instruction 'max'
ptxas fatal : Ptx assembly aborted due to errors
If you think this is a bug, please file an issue and attach /tmp/jl_kD4L6b8dQb.ptx
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:44
[2] compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/packages/CUDA/gqNji/src/compiler/compilation.jl:356
[3] actual_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:245
[4] cached_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:159
[5] macro expansion
@ ~/.julia/packages/CUDA/gqNji/src/compiler/execution.jl:373 [inlined]
[6] macro expansion
@ ./lock.jl:376 [inlined]
[7] cufunction(f::WaterLily.var"#gpu_##kern_#606#117", tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{3, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(64, 1, 1)}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, Nothing}}, CUDA.CuDeviceArray{Float32, 3, 1}, CUDA.CuDeviceArray{Float32, 4, 1}, CartesianIndex{3}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Int64})
@ CUDA ~/.julia/packages/CUDA/gqNji/src/compiler/execution.jl:368
[8] macro expansion
@ ~/.julia/packages/CUDA/gqNji/src/compiler/execution.jl:112 [inlined]
[9] (::KernelAbstractions.Kernel{CUDA.CUDAKernels.CUDABackend, KernelAbstractions.NDIteration.StaticSize{(64,)}, KernelAbstractions.NDIteration.DynamicSize, WaterLily.var"#gpu_##kern_#606#117"})(::CuArray{Float32, 3, CUDA.DeviceMemory}, ::Vararg{Any}; ndrange::Tuple{Int64, Int64, Int64}, workgroupsize::Nothing)
@ CUDA.CUDAKernels ~/.julia/packages/CUDA/gqNji/src/CUDAKernels.jl:124
[10] (::WaterLily.var"##kern#605#120"{Flow{3, Float32, CuArray{Float32, 3, CUDA.DeviceMemory}, CuArray{Float32, 4, CUDA.DeviceMemory}, CuArray{Float32, 5, CUDA.DeviceMemory}}})(σ::CuArray{Float32, 3, CUDA.DeviceMemory}, u::CuArray{Float32, 4, CUDA.DeviceMemory})
@ WaterLily ~/.julia/dev/WaterLily/src/util.jl:149
[11] macro expansion
@ ~/.julia/dev/WaterLily/src/util.jl:151 [inlined]
[12] #CFL#115
@ ~/.julia/dev/WaterLily/src/Flow.jl:169 [inlined]
[13] CFL
@ ~/.julia/dev/WaterLily/src/Flow.jl:168 [inlined]
[14] mom_step!(a::Flow{3, Float32, CuArray{Float32, 3, CUDA.DeviceMemory}, CuArray{Float32, 4, CUDA.DeviceMemory}, CuArray{Float32, 5, CUDA.DeviceMemory}}, b::MultiLevelPoisson{Float32, CuArray{Float32, 3, CUDA.DeviceMemory}, CuArray{Float32, 4, CUDA.DeviceMemory}}; λ::Function, udf::Nothing, kwargs::@Kwargs{})
@ WaterLily ~/.julia/dev/WaterLily/src/Flow.jl:164
[15] sim_step!(sim::Simulation; remeasure::Bool, λ::Function, udf::Nothing, kwargs::@Kwargs{})
@ WaterLily ~/.julia/dev/WaterLily/src/WaterLily.jl:112
[16] sim_step!
@ ~/.julia/dev/WaterLily/src/WaterLily.jl:110 [inlined]
[17] sim_step!(sim::Simulation, t_end::Float32; remeasure::Bool, λ::Function, max_steps::Int64, verbose::Bool, udf::Nothing, kwargs::@Kwargs{})
@ WaterLily ~/.julia/dev/WaterLily/src/WaterLily.jl:106
[18] add_to_suite!(suite::BenchmarkGroup, sim_function::typeof(tgv); p::Tuple{Int64, Int64}, s::Int64, ft::Type, backend::Type, bstr::String, remeasure::Bool)
@ Main ~/mose/waterlily/WaterLily-Benchmarks/util.jl:37
[19] run_benchmarks(cases::Vector{String}, log2p::Vector{Tuple{Int64, Int64}}, max_steps::Vector{Int64}, ftype::Vector{DataType}, backend::Type, bstr::String; data_dir::String)
@ Main ~/mose/waterlily/WaterLily-Benchmarks/benchmark.jl:10
[20] top-level scope
@ ~/mose/waterlily/WaterLily-Benchmarks/benchmark.jl:26
[21] include(mod::Module, _path::String)
@ Base ./Base.jl:305
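For what it's worth, the stack trace points at WaterLily's CFL kernel (src/Flow.jl, via the kernel machinery in src/util.jl), which applies max to values on the device. I haven't minimized the failure yet, but a standalone kernel along these lines should exercise the same codepath; this is an untested sketch, and the kernel and variable names are mine, not WaterLily's:

using CUDA, KernelAbstractions

# Minimal kernel calling `max` on the device; I suspect this is the
# operation that lowers to the `max.NaN` PTX that ptxas rejects.
@kernel function max_kernel!(dst, src)
    i = @index(Global)
    @inbounds dst[i] = max(src[i], 0f0)  # also worth trying a Float64 literal, max(src[i], 0.0)
end

src = CUDA.rand(Float32, 1024)
dst = similar(src)
max_kernel!(CUDABackend())(dst, src; ndrange = length(src))
KernelAbstractions.synchronize(CUDABackend())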
To reproduce
Not very minimal, but I'm running the benchmarks in https://github.com/WaterLily-jl/WaterLily-Benchmarks; a rough recipe follows.
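From a checkout of that repository (this is the shape of my invocation rather than an exact recipe; the script's arguments are described in its README):

using Pkg; Pkg.activate("."); Pkg.instantiate()  # in the WaterLily-Benchmarks checkout
include("benchmark.jl")                          # fails at benchmark.jl:26 with the error above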
Expected behavior
The kernels compile successfully and the benchmark runs to completion, without ptxas errors.
Version info
Details on Julia:
julia> versioninfo()
Julia Version 1.12.0-rc2
Commit 72cbf019d04 (2025-09-06 12:00 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 224 × INTEL(R) XEON(R) PLATINUM 8570
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, sapphirerapids)
GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 224 virtual cores)
Details on CUDA:
julia> CUDA.versioninfo()
CUDA toolchain:
- runtime 13.0, artifact installation
- driver 570.124.6 for 13.0
- compiler 13.0
CUDA libraries:
- CUBLAS: 13.0.2
- CURAND: 10.4.0
- CUFFT: 12.0.0
- CUSOLVER: 12.0.4
- CUSPARSE: 12.6.3
- CUPTI: 2025.3.1 (API 13.0.1)
- NVML: 12.0.0+570.124.6
Julia packages:
- CUDA: 5.9.0
- CUDA_Driver_jll: 13.0.1+0
- CUDA_Compiler_jll: 0.2.1+0
- CUDA_Runtime_jll: 0.19.1+0
Toolchain:
- Julia: 1.12.0-rc2
- LLVM: 18.1.7
8 devices:
0: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
1: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
2: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
3: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
4: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
5: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
6: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
7: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
Additional context
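Possibly relevant: as far as I can tell from the PTX ISA documentation, the .NaN modifier on max/min is only defined for the f16/bf16/f32 variants, so a max.NaN emitted for f64 operands would be rejected on any architecture. I haven't been able to check which type the failing instructions use, since the temporary PTX file is gone, but the PTX can be regenerated with CUDA.jl's @device_code_ptx, e.g. around the sketch kernel from "Describe the bug" above (assuming the hook prints before ptxas rejects the module):

# Print the generated PTX for every kernel compiled by the enclosed call,
# so the offending max.NaN instructions can be located.
CUDA.@device_code_ptx max_kernel!(CUDABackend())(dst, src; ndrange = length(src))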