
vulkan: fix SSM_CONV crash on multi-GPU (#20462)#20495

Closed
ProgenyAlpha wants to merge 1 commit into ggml-org:master from ProgenyAlpha:fix/ssm-conv-multi-gpu

Conversation


ProgenyAlpha (Contributor) commented Mar 13, 2026

Reverts the 2D workgroup tiling (32x16) from #20379, which causes vk::DeviceLostError on multi-GPU RADV setups. Keeps the vec4 dot-product fast path for nc=4.

The 2D tiling reduced workgroup launch overhead at large ubatch sizes, but it triggers a driver-level fault on multi-GPU configurations (confirmed on dual 7900 XTX; bisected to 40c550d).

Based on master at 983df14.

Fixes #20462

Test plan

  • @itterative: confirm crash reproduces on master (983df14) with multi-GPU
  • @itterative: confirm this branch fixes the crash
  • Single-GPU SSM_CONV test-backend-ops pass (verified, 45/45)
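The single-GPU check in the test plan can be reproduced with llama.cpp's backend test harness; the path below assumes a default CMake build directory:

```
./build/bin/test-backend-ops test -o SSM_CONV
```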

The 2D tiling (32x16 workgroups) from ggml-org#20379 causes DeviceLost on
multi-GPU RADV setups. Revert to 1D dispatch but keep the vec4 dot
product fast path for nc=4.

Fixes ggml-org#20462
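The dispatch change described above can be sketched as follows. The tile shape (32x16) is the one quoted in this PR, but the workgroup size and the grid math here are illustrative assumptions, not the actual ggml-vulkan shader code.

```python
import math

WG_1D = 128              # assumed threads per 1D workgroup (illustrative)
TILE_X, TILE_Y = 32, 16  # 2D tile shape from #20379

def dispatch_1d(n_elems):
    # flat grid: one workgroup per WG_1D elements; more launches,
    # but this is the shape this PR reverts to
    return (math.ceil(n_elems / WG_1D), 1, 1)

def dispatch_2d(rows, cols):
    # tiled grid: fewer, fatter launches at large ubatch sizes,
    # but the shape that faulted on multi-GPU RADV in #20462
    return (math.ceil(rows / TILE_X), math.ceil(cols / TILE_Y), 1)

print(dispatch_1d(4096))      # (32, 1, 1)
print(dispatch_2d(4096, 64))  # (128, 4, 1)
```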
github-actions bot added the labels Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) on Mar 13, 2026

itterative commented Mar 13, 2026

Tested against the PR, and it is still failing (same as master). Master fails at the same point as the second log file (vulkan-issue-0a2088a-2.txt).

$ llama-cli --version
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
version: 8325 (0a2088a8c)
built with GNU 13.3.0 for Linux x86_64

Ran it twice after setting the temperature to 0. Here are the logs:
vulkan-issue-0a2088a.txt
vulkan-issue-0a2088a-2.txt

Some other notes:

  • I tried to run llama-bench with context size 4100 (basically, 4 trailing tokens), and neither this PR nor master fails
  • there's a performance regression for tg128 (master: 90.70 t/s; this PR: 76.19 t/s), and I've seen it when running llama-server as well (pp4100 was approximately the same)

@ProgenyAlpha is there a way I could give you better stack traces for the failing Vulkan shaders?

ProgenyAlpha (Contributor, Author) commented:

Closing, the revert didn't fix it. @itterative if you want to keep debugging this, the Vulkan debug output might help narrow down which op is faulting. You can enable it by adding target_compile_definitions(ggml-vulkan PRIVATE GGML_VULKAN_DEBUG) to ggml/src/ggml-vulkan/CMakeLists.txt and rebuilding.

ProgenyAlpha (Contributor, Author) commented:

Correction: an easier way to enable it is the CMake flag:

cmake -B build -DGGML_VULKAN=ON -DGGML_VULKAN_DEBUG=ON

No need to edit CMakeLists.txt.



Merging this pull request may close: Eval bug: Vulkan throws vk::DeviceLostError on Qwen3.5 35B A3B (#20462)
