vulkan: fix SSM_CONV crash on multi-GPU (#20462) #20495
ProgenyAlpha wants to merge 1 commit into ggml-org:master
The 2D tiling (32x16 workgroups) from ggml-org#20379 causes DeviceLost on multi-GPU RADV setups. Revert to 1D dispatch but keep the vec4 dot product fast path for nc=4. Fixes ggml-org#20462
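For readers unfamiliar with the op, here is a minimal CPU-side sketch of what SSM_CONV computes and what an nc=4 fast path amounts to. It is not the Vulkan shader from this PR: the function name `ssm_conv_ref`, the single-sequence layout, and the output ordering are assumptions made for illustration, and the unrolled four-term sum only stands in for the shader's vec4 dot product.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one sequence of SSM_CONV (illustration only).
//   x : nr rows, each of length ncs = nc - 1 + n_t (sliding-window input)
//   w : nr rows, each of length nc (per-row convolution weights)
// Returns n_t * nr values laid out as out[t * nr + r].
std::vector<float> ssm_conv_ref(const std::vector<float>& x,
                                const std::vector<float>& w,
                                std::size_t nc, std::size_t nr, std::size_t n_t) {
    const std::size_t ncs = nc - 1 + n_t;
    std::vector<float> out(n_t * nr, 0.0f);
    for (std::size_t t = 0; t < n_t; ++t) {
        for (std::size_t r = 0; r < nr; ++r) {
            const float* s = &x[r * ncs + t];  // window starting at token t
            const float* c = &w[r * nc];       // weights for row r
            float sum = 0.0f;
            if (nc == 4) {
                // Fast path: a single 4-wide dot product, analogous to
                // dot(vec4, vec4) in a shader.
                sum = s[0] * c[0] + s[1] * c[1] + s[2] * c[2] + s[3] * c[3];
            } else {
                // Generic path: scalar loop over the window.
                for (std::size_t i = 0; i < nc; ++i) {
                    sum += s[i] * c[i];
                }
            }
            out[t * nr + r] = sum;
        }
    }
    return out;
}
```

The fast path changes only how each output element is accumulated, not which invocation produces it, so it does not depend on whether the dispatch is 1D or 2D. That is why it can be kept while the tiling is reverted.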
Tested against this PR, and it still fails (same as master). Master fails at the same point as the second log file (vulkan-issue-0a2088a-2.txt). I ran it twice after setting the temperature to 0. Here are the logs: Some other notes:
@ProgenyAlpha is there a way I could give you better stack traces for the failing Vulkan shaders?
Closing; the revert didn't fix it. @itterative if you want to keep debugging this, the Vulkan debug output might help narrow down which op is faulting. You can enable it by adding
Correction: an easier way to enable it is just the CMake flag: No need to edit CMakeLists.txt.
Reverts the 2D workgroup tiling (32x16) from #20379, which causes vk::DeviceLostError on multi-GPU RADV setups. Keeps the vec4 dot product fast path for nc=4.

The 2D tiling reduced workgroup launch overhead at large ubatch sizes, but triggers a driver-level fault on multi-GPU configurations (confirmed on dual 7900 XTX, bisected to 40c550d).
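To make the launch-overhead trade-off concrete, the sketch below counts workgroups under the two schemes. Everything in it is an assumption for illustration: the helper names, a 1D scheme with 32 rows per workgroup and one workgroup per token, and a 2D scheme where each 32x16 tile covers 32 rows and 16 tokens. The actual shader's dispatch parameters may differ.

```cpp
#include <cstdint>

// Hypothetical dispatch-size helpers (not taken from the ggml Vulkan backend).
struct DispatchDims {
    std::uint32_t x, y, z;
};

inline std::uint32_t ceil_div(std::uint32_t a, std::uint32_t b) {
    return (a + b - 1) / b;
}

// Assumed 1D scheme: each workgroup covers `block` rows of a single token.
inline DispatchDims dispatch_1d(std::uint32_t nr, std::uint32_t n_t,
                                std::uint32_t n_s, std::uint32_t block = 32) {
    return {ceil_div(nr, block), n_t, n_s};
}

// Assumed 2D scheme: each workgroup covers a tile_r x tile_t (rows x tokens) tile.
inline DispatchDims dispatch_2d(std::uint32_t nr, std::uint32_t n_t,
                                std::uint32_t n_s, std::uint32_t tile_r = 32,
                                std::uint32_t tile_t = 16) {
    return {ceil_div(nr, tile_r), ceil_div(n_t, tile_t), n_s};
}

// Total number of workgroups launched for a given grid.
inline std::uint64_t total_workgroups(const DispatchDims& d) {
    return static_cast<std::uint64_t>(d.x) * d.y * d.z;
}
```

Under these assumptions, nr = 4096 rows, n_t = 512 tokens, and one sequence give 128 * 512 = 65536 workgroups for the 1D grid and 128 * 32 = 4096 for the 2D grid. The 16x reduction is the kind of saving the tiling was after at large ubatch sizes.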
Based on master at 983df14.
Fixes #20462
Test plan