Describe the bug
I'm trying to run https://github.com/WaterLily-jl/WaterLily-Benchmarks on GB200 with Julia v1.12, and I'm getting:
ERROR: LoadError: Failed to compile PTX code (ptxas exited with code 255)
Invocation arguments: --generate-line-info --verbose --gpu-name sm_100 --output-file /tmp/jl_MjiXQV2lej.cubin /tmp/jl_kD4L6b8dQb.ptx
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 119; error : Illegal modifier '.NaN' for instruction 'max'
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 131; error : Illegal modifier '.NaN' for instruction 'max'
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 147; error : Illegal modifier '.NaN' for instruction 'max'
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 160; error : Illegal modifier '.NaN' for instruction 'max'
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 181; error : Illegal modifier '.NaN' for instruction 'max'
ptxas /tmp/jl_kD4L6b8dQb.ptx, line 194; error : Illegal modifier '.NaN' for instruction 'max'
ptxas fatal : Ptx assembly aborted due to errors
If you think this is a bug, please file an issue and attach /tmp/jl_kD4L6b8dQb.ptx
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:44
[2] compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/packages/CUDA/gqNji/src/compiler/compilation.jl:356
[3] actual_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:245
[4] cached_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:159
[5] macro expansion
@ ~/.julia/packages/CUDA/gqNji/src/compiler/execution.jl:373 [inlined]
[6] macro expansion
@ ./lock.jl:376 [inlined]
[7] cufunction(f::WaterLily.var"#gpu_##kern_#606#117", tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{3, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(64, 1, 1)}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, Nothing}}, CUDA.CuDeviceArray{Float32, 3, 1}, CUDA.CuDeviceArray{Float32, 4, 1}, CartesianIndex{3}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Int64})
@ CUDA ~/.julia/packages/CUDA/gqNji/src/compiler/execution.jl:368
[8] macro expansion
@ ~/.julia/packages/CUDA/gqNji/src/compiler/execution.jl:112 [inlined]
[9] (::KernelAbstractions.Kernel{CUDA.CUDAKernels.CUDABackend, KernelAbstractions.NDIteration.StaticSize{(64,)}, KernelAbstractions.NDIteration.DynamicSize, WaterLily.var"#gpu_##kern_#606#117"})(::CuArray{Float32, 3, CUDA.DeviceMemory}, ::Vararg{Any}; ndrange::Tuple{Int64, Int64, Int64}, workgroupsize::Nothing)
@ CUDA.CUDAKernels ~/.julia/packages/CUDA/gqNji/src/CUDAKernels.jl:124
[10] (::WaterLily.var"##kern#605#120"{Flow{3, Float32, CuArray{Float32, 3, CUDA.DeviceMemory}, CuArray{Float32, 4, CUDA.DeviceMemory}, CuArray{Float32, 5, CUDA.DeviceMemory}}})(σ::CuArray{Float32, 3, CUDA.DeviceMemory}, u::CuArray{Float32, 4, CUDA.DeviceMemory})
@ WaterLily ~/.julia/dev/WaterLily/src/util.jl:149
[11] macro expansion
@ ~/.julia/dev/WaterLily/src/util.jl:151 [inlined]
[12] #CFL#115
@ ~/.julia/dev/WaterLily/src/Flow.jl:169 [inlined]
[13] CFL
@ ~/.julia/dev/WaterLily/src/Flow.jl:168 [inlined]
[14] mom_step!(a::Flow{3, Float32, CuArray{Float32, 3, CUDA.DeviceMemory}, CuArray{Float32, 4, CUDA.DeviceMemory}, CuArray{Float32, 5, CUDA.DeviceMemory}}, b::MultiLevelPoisson{Float32, CuArray{Float32, 3, CUDA.DeviceMemory}, CuArray{Float32, 4, CUDA.DeviceMemory}}; λ::Function, udf::Nothing, kwargs::@Kwargs{})
@ WaterLily ~/.julia/dev/WaterLily/src/Flow.jl:164
[15] sim_step!(sim::Simulation; remeasure::Bool, λ::Function, udf::Nothing, kwargs::@Kwargs{})
@ WaterLily ~/.julia/dev/WaterLily/src/WaterLily.jl:112
[16] sim_step!
@ ~/.julia/dev/WaterLily/src/WaterLily.jl:110 [inlined]
[17] sim_step!(sim::Simulation, t_end::Float32; remeasure::Bool, λ::Function, max_steps::Int64, verbose::Bool, udf::Nothing, kwargs::@Kwargs{})
@ WaterLily ~/.julia/dev/WaterLily/src/WaterLily.jl:106
[18] add_to_suite!(suite::BenchmarkGroup, sim_function::typeof(tgv); p::Tuple{Int64, Int64}, s::Int64, ft::Type, backend::Type, bstr::String, remeasure::Bool)
@ Main ~/mose/waterlily/WaterLily-Benchmarks/util.jl:37
[19] run_benchmarks(cases::Vector{String}, log2p::Vector{Tuple{Int64, Int64}}, max_steps::Vector{Int64}, ftype::Vector{DataType}, backend::Type, bstr::String; data_dir::String)
@ Main ~/mose/waterlily/WaterLily-Benchmarks/benchmark.jl:10
[20] top-level scope
@ ~/mose/waterlily/WaterLily-Benchmarks/benchmark.jl:26
[21] include(mod::Module, _path::String)
@ Base ./Base.jl:305
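For what it's worth, the stack trace points at WaterLily's CFL kernel (src/Flow.jl, via the kernel machinery in src/util.jl), which applies max to values on the device. I haven't minimized the failure yet, but a standalone kernel along these lines should exercise the same codepath; this is an untested sketch, and the kernel and variable names are mine, not WaterLily's:

using CUDA, KernelAbstractions

# Minimal kernel calling `max` on the device; I suspect this is the
# operation that lowers to the `max.NaN` PTX that ptxas rejects.
@kernel function max_kernel!(dst, src)
    i = @index(Global)
    @inbounds dst[i] = max(src[i], 0f0)  # also worth trying a Float64 literal, max(src[i], 0.0)
end

src = CUDA.rand(Float32, 1024)
dst = similar(src)
max_kernel!(CUDABackend())(dst, src; ndrange = length(src))
KernelAbstractions.synchronize(CUDABackend())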
To reproduce
Not very minimal, but I'm running the benchmarks in https://github.com/WaterLily-jl/WaterLily-Benchmarks; a rough recipe follows.
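From a checkout of that repository (this is the shape of my invocation rather than an exact recipe; the script's arguments are described in its README):

using Pkg; Pkg.activate("."); Pkg.instantiate()  # in the WaterLily-Benchmarks checkout
include("benchmark.jl")                          # fails at benchmark.jl:26 with the error above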
Expected behavior
The kernels compile successfully and the benchmark runs to completion, without ptxas errors.
Version info
Details on Julia:
julia> versioninfo()
Julia Version 1.12.0-rc2
Commit 72cbf019d04 (2025-09-06 12:00 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 224 × INTEL(R) XEON(R) PLATINUM 8570
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, sapphirerapids)
GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 224 virtual cores)
Details on CUDA:
julia> CUDA.versioninfo()
CUDA toolchain:
- runtime 13.0, artifact installation
- driver 570.124.6 for 13.0
- compiler 13.0
CUDA libraries:
- CUBLAS: 13.0.2
- CURAND: 10.4.0
- CUFFT: 12.0.0
- CUSOLVER: 12.0.4
- CUSPARSE: 12.6.3
- CUPTI: 2025.3.1 (API 13.0.1)
- NVML: 12.0.0+570.124.6
Julia packages:
- CUDA: 5.9.0
- CUDA_Driver_jll: 13.0.1+0
- CUDA_Compiler_jll: 0.2.1+0
- CUDA_Runtime_jll: 0.19.1+0
Toolchain:
- Julia: 1.12.0-rc2
- LLVM: 18.1.7
8 devices:
0: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
1: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
2: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
3: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
4: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
5: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
6: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
7: NVIDIA B200 (sm_100, 178.358 GiB / 179.062 GiB available)
Additional context
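Possibly relevant: as far as I can tell from the PTX ISA documentation, the .NaN modifier on max/min is only defined for the f16/bf16/f32 variants, so a max.NaN emitted for f64 operands would be rejected on any architecture. I haven't been able to check which type the failing instructions use, since the temporary PTX file is gone, but the PTX can be regenerated with CUDA.jl's @device_code_ptx, e.g. around the sketch kernel from "Describe the bug" above (assuming the hook prints before ptxas rejects the module):

# Print the generated PTX for every kernel compiled by the enclosed call,
# so the offending max.NaN instructions can be located.
CUDA.@device_code_ptx max_kernel!(CUDABackend())(dst, src; ndrange = length(src))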