Ggml/cuda snake fusion hardening #22912

Merged: ServeurpersoCom merged 5 commits into ggml-org:master from ServeurpersoCom:ggml/cuda-snake-fusion-hardening on May 11, 2026

@ServeurpersoCom
Contributor

Overview

Tightens edge cases in the CUDA Snake fusion pattern matcher, mirroring the Vulkan PR. Thanks to @jeffbolznv for the review remarks.

Additional information

Vulkan counterpart: #22855

  1. All Snake fusion operands and intermediates now share x's type, matching the kernel's single-T template and the float cast on a / inv_b. Mixed-precision chains cleanly fall back to the naive path. Mirrors the Vulkan fix.

  2. Reject Snake fusion when ne[2] > 1 or ne[3] > 1. The kernel only iterates over the first two dimensions, so higher-rank tensors would silently produce garbage on the upper dims. The matcher now falls back to the naive chain, mirroring the Vulkan fix.
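
The two matcher rules above can be sketched as a small eligibility check. This is only an illustration under stated assumptions: `DType`, `TensorDesc`, `types_ok`, `dims_ok`, and `snake_can_fuse` are hypothetical stand-ins, not the actual ggml types or the code in `ggml-cuda.cu`.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for a tensor's element type and shape (not the ggml API).
enum class DType { F32, F16, BF16 };

struct TensorDesc {
    DType type;
    std::array<int64_t, 4> ne; // element count per dimension, innermost first
};

// Rule 1: every operand and intermediate in the chain must share x's type,
// because the fused kernel is assumed to be instantiated for a single type T.
bool types_ok(const TensorDesc & x, const std::vector<const TensorDesc *> & others) {
    for (const TensorDesc * t : others) {
        if (t == nullptr || t->type != x.type) {
            return false;
        }
    }
    return true;
}

// Rule 2: the fused kernel only iterates over ne[0] x ne[1],
// so any tensor with ne[2] > 1 or ne[3] > 1 must be rejected.
bool dims_ok(const TensorDesc & x) {
    return x.ne[2] == 1 && x.ne[3] == 1;
}

// The chain is fused only when both rules hold; otherwise the caller
// is expected to run the unfused (naive) op chain instead.
bool snake_can_fuse(const TensorDesc & x, const std::vector<const TensorDesc *> & others) {
    return types_ok(x, others) && dims_ok(x);
}
```

With this sketch, an F16 `x` paired with an F32 `a`, or an `x` of shape `[64,32,2,1]`, returns `false` and would take the naive path.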


@ServeurpersoCom ServeurpersoCom requested a review from a team as a code owner May 10, 2026 16:21
@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels May 10, 2026
@ServeurpersoCom
Contributor Author

ADD/SUB/MUL/DIV in CUDA don't support BF16 in bin_bcast, but supports_op blindly returns true, so my PR crashes the CI when the fallback kicks in. Should I fix supports_op to tell the truth (mirroring Vulkan), or would you prefer extending bin_bcast to BF16 in a dedicated backend PR?

diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 9f90b656d..e25be3592 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -5306,12 +5306,8 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
         case GGML_OP_VIEW:
         case GGML_OP_PERMUTE:
         case GGML_OP_TRANSPOSE:
-        case GGML_OP_ADD:
         case GGML_OP_ADD_ID:
         case GGML_OP_ADD1:
-        case GGML_OP_SUB:
-        case GGML_OP_MUL:
-        case GGML_OP_DIV:
         case GGML_OP_SCALE:
         case GGML_OP_SQR:
         case GGML_OP_SQRT:
@@ -5320,6 +5316,13 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
         case GGML_OP_CLAMP:
         case GGML_OP_LOG:
             return true;
+        case GGML_OP_ADD:
+        case GGML_OP_SUB:
+        case GGML_OP_MUL:
+        case GGML_OP_DIV:
+            return (op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16) &&
+                   (op->src[1]->type == GGML_TYPE_F32 || op->src[1]->type == GGML_TYPE_F16) &&
+                   (op->type         == GGML_TYPE_F32 || op->type         == GGML_TYPE_F16);
         case GGML_OP_SSM_SCAN: {
             if (op->src[3]->ne[0] == 1) {
                 // Mamba2

@am17an
Contributor

am17an commented May 11, 2026

> ADD/SUB/MUL/DIV in CUDA don't support BF16 in bin_bcast, but supports_op blindly returns true, so my PR crashes the CI when the fallback kicks in. Should I fix supports_op to tell the truth (mirroring Vulkan)

I think you can just add the supports_op fallback here; we can add BF16 support later if you'd like.

bin_bcast only dispatches F32/F16 type triplets, mirror the
vulkan filter so unsupported types fall back through cpy
instead of aborting.
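
The filter described in this commit message can be expressed as a standalone predicate. The sketch below is an assumption-laden illustration: `ElemType`, `BinOpTypes`, and `bin_bcast_supported` are hypothetical names, and the real check lives inside `ggml_backend_cuda_device_supports_op` as shown in the diff earlier in this thread.

```cpp
// Hypothetical element-type enum and type triplet (not the ggml API).
enum class ElemType { F32, F16, BF16 };

struct BinOpTypes {
    ElemType src0; // first operand
    ElemType src1; // second operand
    ElemType dst;  // result
};

static bool is_f32_or_f16(ElemType t) {
    return t == ElemType::F32 || t == ElemType::F16;
}

// Report support only when all three types are F32 or F16, so that
// any other combination is declined up front instead of aborting later.
bool bin_bcast_supported(const BinOpTypes & op) {
    return is_f32_or_f16(op.src0) && is_f32_or_f16(op.src1) && is_f32_or_f16(op.dst);
}
```

Any triplet containing BF16 is reported as unsupported here, which matches the "not supported" BF16 lines in the CUDA test log.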
@ORippler
Collaborator

> 1. All Snake fusion operands and intermediates now share x's type, matching the kernel's single-T template and the float cast on a / inv_b. Mixed-precision chains cleanly fall back to the naive path. Mirrors the Vulkan fix.
> 2. Reject Snake fusion when ne[2] > 1 or ne[3] > 1. The kernel only iterates over the first two dimensions, so higher-rank tensors would silently produce garbage on the upper dims. The matcher now falls back to the naive chain, mirroring the Vulkan fix.

Does it make sense to add those checks as test cases to test-backend-ops? Seems like the same fusion pattern is implemented independently in multiple backends.

@ServeurpersoCom
Contributor Author

> Does it make sense to add those checks as test cases to test-backend-ops? Seems like the same fusion pattern is implemented independently in multiple backends.

Done: added rank-3/rank-4 shapes to test_snake_fuse. Mixed-precision is already covered by the existing F16/BF16 variants (x typed, a/inv_b in F32), which exercise the same types_ok rejection path.
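
To illustrate why the rank-3/rank-4 shapes matter, here is a CPU-only sketch comparing a full four-dimensional reference against a loop that, like the fused kernel described above, only walks the first two dimensions. Everything here is an assumption for illustration: the formula `y = x + inv_b * sin(a * x)^2` with `a` and `inv_b` indexed along `ne[1]`, and the names `snake`, `snake_reference`, `snake_2d_only`, and `count_mismatches`. It is not the code of `test_snake_fuse`.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed Snake formula: y = x + inv_b * sin(a * x)^2.
static float snake(float x, float a, float inv_b) {
    const float s = std::sin(a * x);
    return x + inv_b * s * s;
}

// Reference: visits all four dimensions; a and inv_b are indexed by i1.
std::vector<float> snake_reference(const std::vector<float> & x, const std::vector<float> & a,
                                   const std::vector<float> & inv_b,
                                   const std::array<int64_t, 4> & ne) {
    std::vector<float> y(x.size());
    for (int64_t i3 = 0; i3 < ne[3]; ++i3)
    for (int64_t i2 = 0; i2 < ne[2]; ++i2)
    for (int64_t i1 = 0; i1 < ne[1]; ++i1)
    for (int64_t i0 = 0; i0 < ne[0]; ++i0) {
        const size_t idx = static_cast<size_t>(((i3 * ne[2] + i2) * ne[1] + i1) * ne[0] + i0);
        y[idx] = snake(x[idx], a[static_cast<size_t>(i1)], inv_b[static_cast<size_t>(i1)]);
    }
    return y;
}

// Mimics a kernel that only iterates ne[0] x ne[1]. Elements in the upper
// dims are never written; they stay zero here, whereas on a GPU they would
// hold whatever was already in the output buffer.
std::vector<float> snake_2d_only(const std::vector<float> & x, const std::vector<float> & a,
                                 const std::vector<float> & inv_b,
                                 const std::array<int64_t, 4> & ne) {
    std::vector<float> y(x.size(), 0.0f);
    for (int64_t i1 = 0; i1 < ne[1]; ++i1)
    for (int64_t i0 = 0; i0 < ne[0]; ++i0) {
        const size_t idx = static_cast<size_t>(i1 * ne[0] + i0);
        y[idx] = snake(x[idx], a[static_cast<size_t>(i1)], inv_b[static_cast<size_t>(i1)]);
    }
    return y;
}

// Counts elements that differ by more than a small tolerance.
int64_t count_mismatches(const std::vector<float> & lhs, const std::vector<float> & rhs) {
    int64_t n = 0;
    for (size_t i = 0; i < lhs.size(); ++i) {
        if (std::fabs(lhs[i] - rhs[i]) > 1e-4f) {
            ++n;
        }
    }
    return n;
}
```

For a strictly positive input of shape `[4,3,2,1]`, the two functions agree on the first 12 elements and disagree on the remaining 12, which is the kind of silent error the `ne[2]`/`ne[3]` test shapes are meant to catch if a backend fuses when it should not.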

CUDA: 16/16 SNAKE_FUSE tests passed (F32 + F16, BF16 not supported)
Vulkan: 8/8 SNAKE_FUSE tests passed (F32, F16/BF16 not supported)

root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -o SNAKE_FUSE
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  Device memory: 97247 MB (95403 MB free)

  SNAKE_FUSE(type=f32,ne=[5,7,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[33,32,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[1025,13,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[128,16,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[256,192,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,2,1]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,1,2]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,2,3]): OK
  SNAKE_FUSE(type=f16,ne=[5,7,1,1]): OK
  SNAKE_FUSE(type=f16,ne=[33,32,1,1]): OK
  SNAKE_FUSE(type=f16,ne=[1025,13,1,1]): OK
  SNAKE_FUSE(type=f16,ne=[128,16,1,1]): OK
  SNAKE_FUSE(type=f16,ne=[256,192,1,1]): OK
  SNAKE_FUSE(type=f16,ne=[64,32,2,1]): OK
  SNAKE_FUSE(type=f16,ne=[64,32,1,2]): OK
  SNAKE_FUSE(type=f16,ne=[64,32,2,3]): OK
  SNAKE_FUSE(type=bf16,ne=[5,7,1,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[33,32,1,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[1025,13,1,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[128,16,1,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[256,192,1,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[64,32,2,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[64,32,1,2]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[64,32,2,3]): not supported [CUDA0]
  16/16 tests passed
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK
root@pod:/mnt/workspace/llama.cpp#
root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -o SNAKE_FUSE
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA RTX PRO 6000 Blackwell Workstation Edition (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  Device memory: 97887 MB (95904 MB free)

  SNAKE_FUSE(type=f32,ne=[5,7,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[33,32,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[1025,13,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[128,16,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[256,192,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,2,1]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,1,2]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,2,3]): OK
  SNAKE_FUSE(type=f16,ne=[5,7,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[33,32,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[1025,13,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[128,16,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[256,192,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[64,32,2,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[64,32,1,2]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[64,32,2,3]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[5,7,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[33,32,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[1025,13,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[128,16,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[256,192,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[64,32,2,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[64,32,1,2]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[64,32,2,3]): not supported [Vulkan0]
  8/8 tests passed
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

@github-actions github-actions Bot added the testing Everything test related label May 11, 2026
@ServeurpersoCom
Contributor Author

@ggml-org/maintainers Need a re-approval, please.

@ServeurpersoCom ServeurpersoCom merged commit e936660 into ggml-org:master May 11, 2026
47 checks passed
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 12, 2026
* cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan)

* cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review)

* cuda: merge type_ok and types_ok into a single types_ok (address am17an review)

* cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16

bin_bcast only dispatches F32/F16 type triplets, mirror the
vulkan filter so unsupported types fall back through cpy
instead of aborting.

* test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases
Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs testing Everything test related

5 participants