
LLVM 18 generates non-existent min.NaN.f64/max.NaN.f64 instructions #2886

@giordano

Description

Describe the bug

I'm trying to run https://github.com/WaterLily-jl/WaterLily-Benchmarks on GB200 with Julia v1.12 and I'm getting:

ERROR: LoadError: Failed to compile PTX code (ptxas exited with code 255)
Invocation arguments: --generate-line-info --verbose --gpu-name sm_100 --output-file /tmp/jl_MjiXQV2lej.cubin /tmp/jl_kD4L6b8dQb.ptx                                                                                                        
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 119; error   : Illegal modifier '.NaN' for instruction 'max'                                                                                                                                             
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 131; error   : Illegal modifier '.NaN' for instruction 'max'                                                                                                                                             
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 147; error   : Illegal modifier '.NaN' for instruction 'max'                                                                                                                                             
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 160; error   : Illegal modifier '.NaN' for instruction 'max'                                                                                                                                             
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 181; error   : Illegal modifier '.NaN' for instruction 'max'                                                                                                                                             
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 194; error   : Illegal modifier '.NaN' for instruction 'max'
ptxas fatal   : Ptx assembly aborted due to errors
If you think this is a bug, please file an issue and attach /tmp/jl_kD4L6b8dQb.ptx
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:44
  [2] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/gqNji/src/compiler/compilation.jl:356
  [3] actual_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:245
  [4] cached_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:159
  [5] macro expansion
    @ ~/.julia/packages/CUDA/gqNji/src/compiler/execution.jl:373 [inlined]
  [6] macro expansion
    @ ./lock.jl:376 [inlined]
  [7] cufunction(f::WaterLily.var"#gpu_##kern_#606#117", tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{3, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(64, 1, 1)}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, Nothing}}, CUDA.CuDeviceArray{Float32, 3, 1}, CUDA.CuDeviceArray{Float32, 4, 1}, CartesianIndex{3}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Int64})
    @ CUDA ~/.julia/packages/CUDA/gqNji/src/compiler/execution.jl:368
  [8] macro expansion
    @ ~/.julia/packages/CUDA/gqNji/src/compiler/execution.jl:112 [inlined]
  [9] (::KernelAbstractions.Kernel{CUDA.CUDAKernels.CUDABackend, KernelAbstractions.NDIteration.StaticSize{(64,)}, KernelAbstractions.NDIteration.DynamicSize, WaterLily.var"#gpu_##kern_#606#117"})(::CuArray{Float32, 3, CUDA.DeviceMemory}, ::Vararg{Any}; ndrange::Tuple{Int64, Int64, Int64}, workgroupsize::Nothing)
    @ CUDA.CUDAKernels ~/.julia/packages/CUDA/gqNji/src/CUDAKernels.jl:124
 [10] (::WaterLily.var"##kern#605#120"{Flow{3, Float32, CuArray{Float32, 3, CUDA.DeviceMemory}, CuArray{Float32, 4, CUDA.DeviceMemory}, CuArray{Float32, 5, CUDA.DeviceMemory}}})(σ::CuArray{Float32, 3, CUDA.DeviceMemory}, u::CuArray{Float32, 4, CUDA.DeviceMemory})
    @ WaterLily ~/.julia/dev/WaterLily/src/util.jl:149
 [11] macro expansion
    @ ~/.julia/dev/WaterLily/src/util.jl:151 [inlined]
 [12] #CFL#115
    @ ~/.julia/dev/WaterLily/src/Flow.jl:169 [inlined]
 [13] CFL
    @ ~/.julia/dev/WaterLily/src/Flow.jl:168 [inlined]
 [14] mom_step!(a::Flow{3, Float32, CuArray{Float32, 3, CUDA.DeviceMemory}, CuArray{Float32, 4, CUDA.DeviceMemory}, CuArray{Float32, 5, CUDA.DeviceMemory}}, b::MultiLevelPoisson{Float32, CuArray{Float32, 3, CUDA.DeviceMemory}, CuArray{Float32, 4, CUDA.DeviceMemory}}; λ::Function, udf::Nothing, kwargs::@Kwargs{})
    @ WaterLily ~/.julia/dev/WaterLily/src/Flow.jl:164
 [15] sim_step!(sim::Simulation; remeasure::Bool, λ::Function, udf::Nothing, kwargs::@Kwargs{})
    @ WaterLily ~/.julia/dev/WaterLily/src/WaterLily.jl:112
 [16] sim_step!
    @ ~/.julia/dev/WaterLily/src/WaterLily.jl:110 [inlined]
 [17] sim_step!(sim::Simulation, t_end::Float32; remeasure::Bool, λ::Function, max_steps::Int64, verbose::Bool, udf::Nothing, kwargs::@Kwargs{})
    @ WaterLily ~/.julia/dev/WaterLily/src/WaterLily.jl:106
 [18] add_to_suite!(suite::BenchmarkGroup, sim_function::typeof(tgv); p::Tuple{Int64, Int64}, s::Int64, ft::Type, backend::Type, bstr::String, remeasure::Bool)
    @ Main ~/mose/waterlily/WaterLily-Benchmarks/util.jl:37
 [19] run_benchmarks(cases::Vector{String}, log2p::Vector{Tuple{Int64, Int64}}, max_steps::Vector{Int64}, ftype::Vector{DataType}, backend::Type, bstr::String; data_dir::String)
    @ Main ~/mose/waterlily/WaterLily-Benchmarks/benchmark.jl:10
 [20] top-level scope
    @ ~/mose/waterlily/WaterLily-Benchmarks/benchmark.jl:26
 [21] include(mod::Module, _path::String)
    @ Base ./Base.jl:305

To reproduce

This is not very minimal, but I'm running the benchmarks from https://github.com/WaterLily-jl/WaterLily-Benchmarks. A smaller sketch of what I suspect triggers it is below.
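
The sketch assumes the trigger is a NaN-propagating max reduction reaching the GPU compiler (the failing launch comes from CFL, frames [12]–[13] in the stack trace); I haven't verified that this exact snippet reproduces the ptxas error on this machine, so treat it as a guess rather than a confirmed reproducer.

using CUDA

# Hypothetical minimal reproducer (not verified on the GB200 node).
# Julia's `max` for floats propagates NaN, which LLVM 18 can lower to the
# PTX `max.NaN` form; for Float64 that modifier does not exist, so ptxas
# should reject it just like in the log above.
x = CUDA.rand(Float64, 1024)
maximum(x)

# The generated PTX can be inspected directly, without going through ptxas:
# @device_code_ptx maximum(x)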

Expected behavior

The benchmark compiles and runs: the PTX emitted for these kernels should not use the `.NaN` modifier on `min`/`max` where ptxas for this target rejects it.

Version info

Details on Julia:

julia> versioninfo()
Julia Version 1.12.0-rc2
Commit 72cbf019d04 (2025-09-06 12:00 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 224 × INTEL(R) XEON(R) PLATINUM 8570
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, sapphirerapids)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 224 virtual cores)

Details on CUDA:

julia> CUDA.versioninfo()
CUDA toolchain: 
- runtime 13.0, artifact installation
- driver 570.124.6 for 13.0
- compiler 13.0

CUDA libraries: 
- CUBLAS: 13.0.2
- CURAND: 10.4.0
- CUFFT: 12.0.0
- CUSOLVER: 12.0.4
- CUSPARSE: 12.6.3
- CUPTI: 2025.3.1 (API 13.0.1)
- NVML: 12.0.0+570.124.6

Julia packages: 
- CUDA: 5.9.0
- CUDA_Driver_jll: 13.0.1+0
- CUDA_Compiler_jll: 0.2.1+0
- CUDA_Runtime_jll: 0.19.1+0

Toolchain:
- Julia: 1.12.0-rc2
- LLVM: 18.1.7

8 devices:
  0: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
  1: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
  2: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
  3: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
  4: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
  5: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
  6: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
  7: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)

Additional context

jl_kD4L6b8dQb.ptx.txt
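
A possible local workaround (untested on this setup, and not a fix for the underlying lowering) would be to avoid the NaN-propagating max in the reduction; a sketch, where nanless_max is a hypothetical helper of mine, not a CUDA.jl or WaterLily API:

using CUDA

# Hypothetical workaround: a comparison-based max that does not propagate NaN,
# so LLVM has no reason to emit the `.NaN`-modified PTX instruction.
nanless_max(a, b) = ifelse(a > b, a, b)

x = CUDA.rand(Float64, 1024)
reduce(nanless_max, x; init = -Inf)   # instead of maximum(x)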

Metadata

Assignees: No one assigned

Labels: cuda kernels (Stuff about writing CUDA kernels.), upstream (Somebody else's problem.)
