Removes PDL enrollment of launch_fattn kernels to fix bug on DGX Spark #23825
Conversation
Is the bug internal, i.e. does PDL itself have an issue, or was the placement of this particular instance wrong?
@am17an it is an internal PDL issue, otherwise the fix would've been to move In essence, a global load, located behind the barrier in C++, is moved ahead of the barrier in bytecode during compilation, which causes an invalid read.
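For readers unfamiliar with the pattern under discussion, here is a minimal sketch of a kernel enrolled in Programmatic Dependent Launch (PDL). It is not the llama.cpp code: the `producer` and `consumer` kernels and the buffer names are hypothetical, and it assumes a GPU with compute capability 9.0 or newer and CUDA 11.8 or newer. The point is only to show where the barrier sits relative to the global load that the comment above describes as being reordered.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(err_));                          \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Hypothetical primary kernel: writes the data the consumer depends on.
__global__ void producer(float * buf, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[i] = (float) i;
    }
}

// Hypothetical secondary kernel, launched with the PDL attribute.
__global__ void consumer(const float * src, float * dst, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Work that does not depend on the producer may run before the barrier
    // and overlap with the tail of the previous kernel.

#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    // Barrier: wait until the preceding kernel's results are visible.
    cudaGridDependencySynchronize();
#endif

    // This global load is only valid after the barrier. If the compiler
    // hoisted it above the barrier, it could read stale data.
    if (i < n) {
        dst[i] = 2.0f * src[i];
    }
}

int main() {
    const int n = 1 << 20;
    const int block = 256;
    const int grid = (n + block - 1) / block;

    float * src = nullptr;
    float * dst = nullptr;
    CHECK(cudaMalloc(&src, n * sizeof(float)));
    CHECK(cudaMalloc(&dst, n * sizeof(float)));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Regular launch of the primary kernel.
    producer<<<grid, block, 0, stream>>>(src, n);
    CHECK(cudaGetLastError());

    // Enroll the secondary kernel in PDL via a launch attribute.
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = dim3(grid);
    cfg.blockDim = dim3(block);
    cfg.dynamicSmemBytes = 0;
    cfg.stream = stream;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;

    CHECK(cudaLaunchKernelEx(&cfg, consumer, (const float *) src, dst, n));
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}
```

Launching `consumer` without the attribute (a plain `<<<...>>>` launch) restores ordinary stream ordering, which is the kind of opt-out this PR applies to the kernels launched through `launch_fattn()`.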
Are we sure this bug wouldn't affect other placements?
JohannesGaessler
left a comment
Please also link this PR in the inline comment and document the affected compiler versions if possible.
Is it known to which CUDA versions fixes for PDL will be backported? As of right now we are enabling PDL for CUDA versions as old as 11.8 by default, but if those remain unpatched we can't do that.
Generally, this depends on the severity of the issue that was fixed. Will let you know once we know more.
ggerganov
left a comment
Works on my end now.
Btw, for a few days now we have had a DGX Spark doing some of the CUDA CI, so we have this covered continuously.
Regarding the concern about affecting other kernels: it's a valid concern, but I think it is worth keeping PDL enabled so we can surface such potential problems faster.
I've fast-tracked this to include it in the ggml and whisper.cpp releases.
* origin/master:
  vocab : support tokenizer for LFM2.5-8B-A1B (ggml-org#23826)
  graph : ensure DS32 kq_mask_lid is F32 (ggml-org#23864)
  server: remove obsolete scripts (ggml-org#23870)
  ci : update macos release to use macos-26 runner (ggml-org#23878)
  download: add option to skip_download (ggml-org#23059)
  mtmd: Add DeepSeekOCR 2 Support (ggml-org#20975)
  CUDA: Check PTX version on host side to guard PDL dispatch (ggml-org#23530)
  server: bump timeout to 3600s (ggml-org#23842)
  model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (ggml-org#23346)
  llama: use f16 mask for FA to save VRAM (ggml-org#23764)
  sync : ggml
  ggml : bump version to 0.13.1 (ggml/1523)
  ngram-mod : Add missing include (ggml-org#23857)
  llama: add llm_graph_input_mtp (ggml-org#23643)
  app : move licences to llama-app (ggml-org#23824)
  cuda : disables launch_fattn PDL enrollment due to compiler bug (ggml-org#23825)
  meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (ggml-org#23480)
Overview
On DGX Spark, we saw spurious test failures when running `test-backend-ops -o FLASH_ATTN_EXT` with PDL enabled. We identified an internal bug which caused a race condition in a kernel launched with `launch_fattn()`. For now, moving these kernels out of PDL enrollment fixes this bug in my testing.
Performance Impact
The negative perf impact is limited: I saw a perf loss of around 0.2% on both DGX Spark and RTX Pro 6000 for the models gpt-oss20B and qwen35moe 35B.A3B Q4_K.
Requirements
@ORippler @ggerganov let me know if this fixes the bug on your setups.