Removes PDL enrollment of launch_fattn kernels to fix bug on DGX Spark #23825

Merged
ggerganov merged 1 commit into ggml-org:master from aendk:akieslinger/pdl-fattn-fix on May 29, 2026

@aendk (Contributor) commented May 28, 2026

Overview

On DGX Spark, we saw spurious test failures when running test-backend-ops -o FLASH_ATTN_EXT with PDL enabled.
We identified an internal bug which caused a race condition in a kernel launched with launch_fattn().
For now, moving these kernels out of PDL enrollment fixes this bug in my testing.

Performance Impact

The negative performance impact is limited: I saw roughly 0.2% perf loss on both DGX Spark and RTX Pro 6000 for the models gpt-oss20B and qwen35moe 35B.A3B Q4_K.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, for debugging. Every line of code proposed here was manually checked and tested before commit.

@ORippler @ggerganov let me know if this fixes the bug on your setups.

@aendk aendk requested a review from a team as a code owner May 28, 2026 15:35
@github-actions (bot) added labels May 28, 2026: "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning)
@am17an (Contributor) commented May 28, 2026

Is the bug internal, i.e. does PDL itself have an issue, or was the placement of this particular instance wrong?

@aendk (Contributor, Author) commented May 28, 2026

@am17an it is an internal PDL issue, otherwise the fix would've been to move ggml_cuda_pdl_sync() to the correct place.

In essence, a global load that sits after the barrier in the C++ source is moved ahead of the barrier in the bytecode during compilation, which causes an invalid read.
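
To make the reordering concrete, here is a minimal host-side C++ model. It is not CUDA and not code from this PR: `FakeDevice`, `pdl_barrier`, `load_after_barrier` and `load_hoisted` are hypothetical names, and the "barrier" is just a function that makes a pending producer write visible. It only sketches why a load that executes before the barrier can observe data the preceding kernel has not written yet.

```cpp
#include <cassert>

// Hypothetical stand-in for global memory shared by two back-to-back kernels.
struct FakeDevice {
    int slot    = -1;  // stale value present before the producer kernel finishes
    int pending = 42;  // value the producer kernel will have written once it completes

    // Models the dependency barrier: once it returns, the producer's write is visible.
    void pdl_barrier() { slot = pending; }
};

// Order as written in the source: barrier first, then the global load.
int load_after_barrier(FakeDevice & dev) {
    dev.pdl_barrier();
    return dev.slot;
}

// Order after the reordering described in the comment: the load runs before the barrier.
int load_hoisted(FakeDevice & dev) {
    const int value = dev.slot;  // read happens before the producer's write is visible
    dev.pdl_barrier();
    return value;
}
```

In this model `load_after_barrier` sees the producer's value while `load_hoisted` returns the stale one, even though both functions contain the same two operations. On real hardware the outcome would additionally depend on timing, which would be consistent with the failures being spurious.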

@am17an (Contributor) commented May 28, 2026

Are we sure this bug wouldn't affect other placements? FLASH_ATTN_EXT has quite an extensive suite of shapes which exercise a lot of paths, whereas other tests may be relatively sparse.

@JohannesGaessler (Contributor) left a comment

Please also link this PR in the inline comment and document the affected compiler versions if possible.

@JohannesGaessler (Contributor) commented

Is it known to which CUDA versions fixes for PDL will be backported? Right now we enable PDL by default for CUDA versions as old as 11.8, but we can't keep doing that if those versions remain unpatched.
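
The version concern can be sketched as a small compile-time policy check. This is a hypothetical helper, not part of ggml: `cuda_version` and `pdl_enabled_by_default` are illustrative names, and `first_patched` is an assumption the caller must supply, since the thread does not say which toolkit versions contain a fix. The only facts used are the 11.8 lower bound mentioned in the comment and the `major * 1000 + minor * 10` encoding that CUDA uses for `CUDART_VERSION`. Treating "patched" as a single threshold is also a simplification if fixes are backported to some release lines only.

```cpp
#include <cassert>

// CUDA encodes toolkit versions as major * 1000 + minor * 10 (11.8 -> 11080),
// which is the format of the CUDART_VERSION macro.
constexpr int cuda_version(int major, int minor) {
    return major * 1000 + minor * 10;
}

// Hypothetical policy helper (assumption, not ggml code).
// toolkit:       version the code is being built with
// first_patched: first toolkit version assumed to contain the compiler fix;
//                not stated anywhere in this thread
// min_supported: oldest toolkit for which PDL is enabled at all (11.8 per the discussion)
constexpr bool pdl_enabled_by_default(int toolkit, int first_patched,
                                      int min_supported = cuda_version(11, 8)) {
    return toolkit >= min_supported && toolkit >= first_patched;
}

static_assert(cuda_version(11, 8) == 11080, "matches the CUDART_VERSION encoding");
```

With such a gate, a build against 11.8 would keep PDL on only if `first_patched` is 11.8 or older. Any later threshold would switch it off for that toolkit, which is the trade-off the comment is pointing at.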

@ORippler (Collaborator) left a comment

Thanks for the fix!

@ORippler (Collaborator) commented

> Is it known to which CUDA versions fixes for PDL will be backported?

Generally, this depends on the severity of the issue that was fixed. I will let you know once we know more.

@ggerganov (Member) left a comment

Works on my end now.

Btw, as of a few days ago, we have a DGX Spark running some of the CUDA CI, so this is now covered continuously.

Regarding the concern about affecting other kernels: it's a valid concern, but I think it is worth keeping PDL enabled so we can surface such potential problems faster.

@ggerganov ggerganov merged commit 241cbd4 into ggml-org:master May 29, 2026
28 checks passed
@ggerganov (Member) commented

I've fast-tracked this to include it in the ggml and whisper.cpp releases.

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 29, 2026
* origin/master:
vocab : support tokenizer for LFM2.5-8B-A1B (ggml-org#23826)
graph : ensure DS32 kq_mask_lid is F32 (ggml-org#23864)
server: remove obsolete scripts (ggml-org#23870)
ci : update macos release to use macos-26 runner (ggml-org#23878)
download: add option to skip_download (ggml-org#23059)
mtmd: Add DeepSeekOCR 2 Support (ggml-org#20975)
CUDA: Check PTX version on host side to guard PDL dispatch (ggml-org#23530)
server: bump timeout to 3600s (ggml-org#23842)
model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (ggml-org#23346)
llama: use f16 mask for FA to save VRAM (ggml-org#23764)
sync : ggml
ggml : bump version to 0.13.1 (ggml/1523)
ngram-mod : Add missing include (ggml-org#23857)
llama: add llm_graph_input_mtp (ggml-org#23643)
app : move licences to llama-app (ggml-org#23824)
cuda : disables launch_fattn PDL enrollment due to compiler bug (ggml-org#23825)
meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (ggml-org#23480)
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026

5 participants