
vulkan: fix SSM_CONV crash on multi-GPU (#20462)#20495

Closed
ProgenyAlpha wants to merge 1 commit into ggml-org:master from ProgenyAlpha:fix/ssm-conv-multi-gpu

Conversation


ProgenyAlpha (Contributor) commented Mar 13, 2026

Reverts the 2D workgroup tiling (32x16) from #20379, which causes vk::DeviceLostError on multi-GPU RADV setups. Keeps the vec4 dot-product fast path for nc=4.

The 2D tiling reduced workgroup launch overhead at large ubatch sizes, but it triggers a driver-level fault on multi-GPU configurations (confirmed on dual 7900 XTX; bisected to 40c550d).

Based on master at 983df14.

Fixes #20462

Test plan

  • @itterative: confirm crash reproduces on master (983df14) with multi-GPU
  • @itterative: confirm this branch fixes the crash
  • Single-GPU SSM_CONV test-backend-ops pass (verified, 45/45)
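The single-GPU check in the test plan can be reproduced with llama.cpp's backend test harness; the path below assumes a default CMake build directory:

```
./build/bin/test-backend-ops test -o SSM_CONV
```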

The 2D tiling (32x16 workgroups) from ggml-org#20379 causes DeviceLost on
multi-GPU RADV setups. Revert to 1D dispatch but keep the vec4 dot
product fast path for nc=4.

Fixes ggml-org#20462
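The dispatch change described above can be sketched as follows. The tile shape (32x16) is the one quoted in this PR, but the workgroup size and the grid math here are illustrative assumptions, not the actual ggml-vulkan shader code.

```python
import math

WG_1D = 128              # assumed threads per 1D workgroup (illustrative)
TILE_X, TILE_Y = 32, 16  # 2D tile shape from #20379

def dispatch_1d(n_elems):
    # flat grid: one workgroup per WG_1D elements; more launches,
    # but this is the shape this PR reverts to
    return (math.ceil(n_elems / WG_1D), 1, 1)

def dispatch_2d(rows, cols):
    # tiled grid: fewer, fatter launches at large ubatch sizes,
    # but the shape that faulted on multi-GPU RADV in #20462
    return (math.ceil(rows / TILE_X), math.ceil(cols / TILE_Y), 1)

print(dispatch_1d(4096))      # (32, 1, 1)
print(dispatch_2d(4096, 64))  # (128, 4, 1)
```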
github-actions bot added the labels Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) on Mar 13, 2026

itterative commented Mar 13, 2026

Tested against the PR, and it is still failing (same as master). Master fails at the same point as the second log file (vulkan-issue-0a2088a-2.txt).

$ llama-cli --version
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
version: 8325 (0a2088a8c)
built with GNU 13.3.0 for Linux x86_64

Ran it twice after setting the temperature to 0. Here are the logs:
vulkan-issue-0a2088a.txt
vulkan-issue-0a2088a-2.txt

Some other notes:

  • I tried to run llama-bench with context size 4100 (basically, 4 trailing tokens), and neither this PR nor master fails
  • there's a performance regression for tg128 (master: 90.70 t/s; this PR: 76.19 t/s), and I've seen it when running llama-server as well (pp4100 was approximately the same)

@ProgenyAlpha is there a way I could give you better stack traces for the failing Vulkan shaders?

ProgenyAlpha (Contributor, Author) commented:

Closing, the revert didn't fix it. @itterative if you want to keep debugging this, the Vulkan debug output might help narrow down which op is faulting. You can enable it by adding target_compile_definitions(ggml-vulkan PRIVATE GGML_VULKAN_DEBUG) to ggml/src/ggml-vulkan/CMakeLists.txt and rebuilding.

ProgenyAlpha (Contributor, Author) commented:

Correction: an easier way to enable it is the CMake flag:

cmake -B build -DGGML_VULKAN=ON -DGGML_VULKAN_DEBUG=ON

No need to edit CMakeLists.txt.



Merging this pull request may close: Eval bug: Vulkan throws vk::DeviceLostError on Qwen3.5 35B A3B (#20462)
