Avoid PDL race conditions by disabling __restrict__ when PDL is used#24030
Conversation
PDL. Adds preprocessor directives based on arch in kernel body to add __restrict__ to retain performance on older architectures.
There was a problem hiding this comment.
Thanks for your work. Adding some more context:
- For the interested, a minimal repro with workaround is here: https://godbolt.org/z/zzrrarWss
- In
broken, SASS from sm_100 has LDG hoisted before ACQBULK, which effectively introduces the data-race. fixedconditionally applies restrict to b based on CUDA_ARCH and does not show this pattern on post-Hopper GPUS while maintaining restrict perf gains on older HW.
- In
- While we could simply remove
__restrict__without perf penalty on newer hardware that have iso-VRAM throughput for texture and compute memory-paths, we saw some perf regression on Ampere-based A40 and hence opted for device-side restriction. Fun-fact: CCCL does not use restrict anywhere and still manages to achieve SOL perf across gens. - The conditional restrict cannot happen in the function signature of the kernel, as this will fail for templated
__global__kernels - We will file an update to CUDA docs to explicitly state that PDL and restrict are mutually exclusive for the time being.
From my side this is generally good to go, but I'd love for us to conditionally #define the restrict once, which should make the kernels significantly easier to read (see the godbolt I shared). Given PDL is currently potentially bugged, we can do this in a follow-up PR.
JohannesGaessler
left a comment
There was a problem hiding this comment.
The current solution is quite cumbersome. As suggested by Oliver, please define a macro like GGML_CUDA_RESTRICT.
More generally, beyond just documentation my opinion is that the combination of __restrict__ and PDL in a single kernel should result in a compiler warning since evidently even for NVIDIA engineers this is a non-trivial issue to track down.
Is that a Feature Request for NVCC/CUDA compiler chain? Or something you would like to see implemented in ggml's CUDA backend. For the former, adding a warning when the hoisting happens is something we are discussing as one of the ways to improve developer UX. |
It's not so much a feature request (since I personally am now aware of the problem) but rather a feature suggestion to avoid time spent on debugging in other projects. For ggml I would consider a macro to be 100% sufficient. |
Is this "working as designed" or is it just a compiler bug that needs to be fixed? |
|
@jeffbolznv everything works as designed, we need to add a big disclaimer that PDL cannot be mixed with |
|
@ORippler @JohannesGaessler ready for review. |
| float * dst_ptr, | ||
| const float2 * dst_fixup_ptr, |
There was a problem hiding this comment.
| float * dst_ptr, | |
| const float2 * dst_fixup_ptr, | |
| float * GGML_CUDA_RESTRICT dst_ptr, | |
| const float2 * GGML_CUDA_RESTRICT dst_fixup_ptr, |
I would say to just use the macro in the declaration. Also, if __restrict__ has a negligible impact on Blackwell anyways, please replace all instances of __restrict__ in the CUDA backend with GGML_CUDA_RESTRICT, even if those kernels currently do not use PDL.
There was a problem hiding this comment.
That sadly does not work in templated function headers which are __global__.
The reason is that __CUDA_ARCH__ is undefined on host compilation passes, but defined on device compilation passes, and this leads to ABI mismatches.
We could do that on some of the function headers (e.g. quantize_q8_1), but it is discouraged by our engineering and possibly/likely unstable.
There was a problem hiding this comment.
We could do that on some of the function headers (e.g. quantize_q8_1), but it is discouraged by our engineering and possibly/likely unstable.
This is not supported, quoting from the cuda docs:
https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/cpp-language-extensions.html#cuda-arch
There was a problem hiding this comment.
Also, if
__restrict__has a negligible impact on Blackwell anyways, please replace all instances of__restrict__in the CUDA backend with GGML_CUDA_RESTRICT, even if those kernels currently do not use PDL.
Would you still want me to do this by adding code outside of the function header?
I assume that because its not a seamless replacement, it does not make sense at this point.
There was a problem hiding this comment.
- Please include Hopper in the fix
- We could think about naming this to
GGML_CUDA_RESTRICT_PREHOPPER, but that may be too verbose (I don't have any stance on this)
I'd also recommend a comment linking this PR to where we #define GGML_CUDA_RESTRICT so people can understand why this is necessary until the CUDA documentation update is in place
Co-authored-by: Oliver Simons <osimons@nvidia.com>
ORippler
left a comment
There was a problem hiding this comment.
LGTM. I'd personally recommend to fix PDL first and do another round of mechanical __restrict__ changes + associated perf analysis in a follow-up PR
JohannesGaessler
left a comment
There was a problem hiding this comment.
I don't really like this syntax but it seems it's the least bad option we have. Thank you for debugging this!
Overview
Follow up to #23825.
Together with CUDA engineers, we identified the suspected bug of #23825; PDL and
__restrict__cannot coexist, as__restrict__can move data reads before the PDL barrier and cause race conditions in the GPU byte code.This PR disables
__restrict__for device code which leverages PDL, and retains__restrict__(and thus performance) for all other GPU architectures.Performance
I tested performance on an RTX PRO 6000 Blackwell Max-Q and an RTX 6000 Ada Generation. On Ada, there is no impact on performance discernible from noise. This aligns with expectations as it does not use PDL, and because it benefits from
__restrict__On Blackwell, we see a small improvement on Qwen3.6 (~1%) because we re-enroll the flash-attention kernels disabled in #23825.
Detailed Bug Report
Using
__restrict__can lead to data loads which leverage the read-only texture cache path. This is a non-coherent cache.An example load into this cache is via the
ld.global.ncSASS byte code instruction. It is explicitly excluded from the CUDA/PTX memory consistency model in which PDL works (https://docs.nvidia.com/cuda/parallel-thread-execution/#scope-and-applicability).This means that the compiler is free to move any
__restrict__input load compiled intold.global.ncahead of the PDL synchronization barrier (cudaGridDependencySynchronize()is compiled toACQBULK). Race conditions are possible.Until now, this was not common knowledge, we'll improve tooling and guides.
The bug addressed in #23825 was caused by a race condition like this. A
__restrict__ptr was hoisted over (=moved in front of) the SASS instruction of theggml_pdl_cuda_sync()barrier. The data read was thus executed before the PDL synchronization barrier in GPU byte code, even though in C++, the synchronization was placed correctly before the data read.Requirements
@ORippler for viz