CUDA: Check PTX version on host side to guard PDL dispatch #23530

Merged: JohannesGaessler merged 6 commits into ggml-org:master from ORippler:osimons/ptx_pdl_checks_cuda on May 29, 2026

@ORippler

Collaborator

Overview

Checking `__CUDA_ARCH_LIST__` alone is insufficient when kernels are JIT-compiled, because this macro does not distinguish between compiling for, say, sm_90, sm_90a, or sm_90f (that is, forward-JIT-able PTX vs. arch- or family-specific PTX).

As a result, a build configured with `-DCMAKE_CUDA_ARCHITECTURES="89;90a"` hits a bug: the current code wrongly dispatches to PDL on sm_90/sm_120 in forward-JIT mode.

This PR fixes the issue by checking `cudaFuncAttributes::ptxVersion` of the incoming kernel at runtime. Checking `ptxVersion` alone is sufficient, because the architecture of the loaded device code is always >= `ptxVersion` (any violation of this would be a severe bug in CUDA/nvcc), see:
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code
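
For illustration, the runtime guard described above could look roughly like the following host-only C++ sketch. It is not the code in this PR: `ptx_version_query` and `kernel_can_use_pdl` are hypothetical names, and the callback stands in for a call to `cudaFuncGetAttributes` so that the sketch compiles without the CUDA toolkit. It assumes `ptxVersion` is encoded as 10*major + minor, so compute_90 is reported as 90.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical callback: returns the PTX version of the kernel as loaded on the
// given device, i.e. the value cudaFuncGetAttributes would report in
// cudaFuncAttributes::ptxVersion (10*major + minor, e.g. 89 or 90).
using ptx_version_query = std::function<int(const void * kernel, int device)>;

// Decide once per (kernel, device) whether PDL may be used and memoize the answer,
// so the attribute query is not repeated on every launch.
static bool kernel_can_use_pdl(const void * kernel, int device, const ptx_version_query & query) {
    static std::map<std::pair<std::uintptr_t, int>, bool> cache;

    const auto key = std::make_pair(reinterpret_cast<std::uintptr_t>(kernel), device);
    const auto it  = cache.find(key);
    if (it != cache.end()) {
        return it->second;
    }

    // PDL requires Hopper or newer, so PTX older than compute_90 must not opt in,
    // even if it was forward-JIT'ed onto a newer GPU.
    const bool can_use_pdl = query(kernel, device) >= 90;
    cache.emplace(key, can_use_pdl);
    return can_use_pdl;
}
```

In a real build the callback would wrap `cudaFuncGetAttributes` and return `attr.ptxVersion`; the cache here is keyed by an ordered map purely to keep the sketch short.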

Additional information

Follow-up to #23471 that implements a more complete fix.

To verify, apply

diff --git a/ggml/src/ggml-cuda/common.cuh b/ggml/src/ggml-cuda/common.cuh
index 20b1846c0..19dd53a72 100644
--- a/ggml/src/ggml-cuda/common.cuh
+++ b/ggml/src/ggml-cuda/common.cuh
@@ -1584,6 +1584,9 @@ static bool ggml_cuda_kernel_can_use_pdl(const void * kernel) {
     // We have to guard on a loaded kernel's PTX version so a kernel forward-JIT'ed
     // from pre-Hopper PTX to a Hopper-or-newer GPU does not opt into PDL.
     const bool can_use_pdl = attr.ptxVersion >= 90;
+    printf("Kernel %p on device %d has PTX version %d -> can_use_pdl = %d, highest compiled_arch = %d\n", kernel,
+           device, attr.ptxVersion, can_use_pdl,
+           ggml_cuda_highest_compiled_arch(ggml_cuda_info().devices[ggml_cuda_get_device()].cc));
     cache.emplace(key, can_use_pdl);
     return can_use_pdl;
 }

and compile with something like

cmake -S . -B build_test_ptx -G Ninja \
  -DLLAMA_FATAL_WARNINGS=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES="89;90a" \
  -DGGML_CUDA=ON

Running test-backend-ops -o MUL_MAT on a GPU with CC > 9.0 now correctly disables PDL and prints log lines like:
Kernel 0x7cf22ed84aa0 on device 0 has PTX version 89 -> can_use_pdl = 0, highest compiled_arch = 900
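
The mismatch visible in that log line can be reproduced with a small host-only C++ model. This is a simplified sketch, not the ggml implementation: `highest_listed_arch`, `arch_list_guard` and `ptx_version_guard` are hypothetical names, and the sketch assumes that an arch-list-only guard amounts to comparing the highest listed architecture not exceeding the device's compute capability against 900.

```cpp
#include <algorithm>
#include <vector>

// Simplified, hypothetical model of a compile-time lookup: the highest entry of the
// build's architecture list that does not exceed the device's compute capability.
// Architectures are encoded as 100*major + 10*minor (e.g. 890, 900, 1200).
static int highest_listed_arch(const std::vector<int> & arch_list, int device_cc) {
    int best = 0;
    for (const int arch : arch_list) {
        if (arch <= device_cc) {
            best = std::max(best, arch);
        }
    }
    return best;
}

// Assumed shape of an arch-list-only guard: PDL needs Hopper (900) or newer.
// The list cannot tell 90 from 90a, so "89;90a" is seen as {890, 900}.
static bool arch_list_guard(const std::vector<int> & arch_list, int device_cc) {
    return highest_listed_arch(arch_list, device_cc) >= 900;
}

// Guard on the PTX version of the kernel that was actually loaded
// (encoded as 10*major + minor, matching "PTX version 89" in the log).
static bool ptx_version_guard(int ptx_version) {
    return ptx_version >= 90;
}
```

With the list {890, 900} and a device of compute capability 1200, `arch_list_guard` says yes while `ptx_version_guard(89)` says no, which is the situation the log reports: the 90a PTX cannot be forward-JIT'ed, so the kernel that is loaded comes from the compute_89 PTX.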

Requirements

@ORippler ORippler requested a review from a team as a code owner May 22, 2026 13:08
@ORippler ORippler requested a review from JohannesGaessler May 22, 2026 13:09
@github-actions github-actions Bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels May 22, 2026
Comment thread ggml/src/ggml-cuda/common.cuh Outdated
@ORippler ORippler requested a review from JohannesGaessler May 27, 2026 09:08
Comment thread ggml/src/ggml-cuda/common.cuh Outdated
Comment thread ggml/src/ggml-cuda/common.cuh Outdated
Comment thread ggml/src/ggml-cuda/common.cuh Outdated
@ORippler ORippler requested a review from JohannesGaessler May 27, 2026 15:58

@JohannesGaessler JohannesGaessler left a comment

Contributor

The convention in the surrounding code is size_t rather than e.g. std::size_t, but either way is fine I think.

@ORippler ORippler requested a review from gaugarg-nv May 28, 2026 08:44
@JohannesGaessler JohannesGaessler merged commit 6ed481e into ggml-org:master May 29, 2026
28 checks passed
@ORippler ORippler deleted the osimons/ptx_pdl_checks_cuda branch May 29, 2026 11:50
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 29, 2026
* origin/master:
vocab : support tokenizer for LFM2.5-8B-A1B (ggml-org#23826)
graph : ensure DS32 kq_mask_lid is F32 (ggml-org#23864)
server: remove obsolete scripts (ggml-org#23870)
ci : update macos release to use macos-26 runner (ggml-org#23878)
download: add option to skip_download (ggml-org#23059)
mtmd: Add DeepSeekOCR 2 Support (ggml-org#20975)
CUDA: Check PTX version on host side to guard PDL dispatch (ggml-org#23530)
server: bump timeout to 3600s (ggml-org#23842)
model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (ggml-org#23346)
llama: use f16 mask for FA to save VRAM (ggml-org#23764)
sync : ggml
ggml : bump version to 0.13.1 (ggml/1523)
ngram-mod : Add missing include (ggml-org#23857)
llama: add llm_graph_input_mtp (ggml-org#23643)
app : move licences to llama-app (ggml-org#23824)
cuda : disables launch_fattn PDL enrollment due to compiler bug (ggml-org#23825)
meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (ggml-org#23480)
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
…23530)

* CUDA: Check PTX version on host side to guard PDL dispatch

Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for say sm_90, sm_90a
or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX).

Thus, one can have a bug when compiling with
`DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
 https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

Magic constants were taken from boost:
https://github.com/boostorg/container_hash/blob/2698b43803c012601e6bb1a6116e83767b97986c/include/boost/container_hash/detail/hash_mix.hpp#L19-L65

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero

* Apply code-formatting

* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
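
As a rough illustration of the hash-mixing commits listed above, the following C++ sketch hashes a (kernel pointer, device) cache key with the 64-bit finalizer from the public-domain MurmurHash3 reference code. The names `pdl_cache_key` and `pdl_cache_key_hash` are hypothetical, the seed value is arbitrary, and the constants merged in this PR were taken from boost and may differ from the ones shown here.

```cpp
#include <cstddef>
#include <cstdint>

// 64-bit finalizer ("fmix64") from the MurmurHash3 reference implementation.
// Every step is invertible, so distinct inputs give distinct outputs.
static inline std::uint64_t fmix64(std::uint64_t x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

// Hypothetical cache key: which kernel, on which device.
struct pdl_cache_key {
    const void * kernel;
    int          device;

    bool operator==(const pdl_cache_key & other) const {
        return kernel == other.kernel && device == other.device;
    }
};

// Hypothetical hasher, usable as the Hash parameter of std::unordered_map.
struct pdl_cache_key_hash {
    std::size_t operator()(const pdl_cache_key & key) const {
        // Arbitrary non-zero seed: fmix64(0) == 0, so a zero seed would map an
        // all-zero key straight to hash 0.
        std::uint64_t h = 0x9e3779b97f4a7c15ULL;
        h = fmix64(h + reinterpret_cast<std::uintptr_t>(key.kernel));
        h = fmix64(h + static_cast<std::uint64_t>(key.device));
        return static_cast<std::size_t>(h);
    }
};
```

The fixed point at zero is one plausible reason for a non-zero seed; the PR's review threads are collapsed here, so the actual motivation is not shown.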
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
…23530)

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)

3 participants