Ggml/cuda snake fusion hardening by ServeurpersoCom · Pull Request #22912 · ggml-org/llama.cpp

ServeurpersoCom · 2026-05-10T16:21:43Z

Overview

Tightening of fusion pattern matching edge cases, mirroring the Vulkan PR. Thanks to @jeffbolznv for the review remarks.

Additional information

Vulkan counterpart: #22855

All Snake fusion operands and intermediates now share x's type, matching the kernel's single-T template and the float cast on a / inv_b. Mixed-precision chains cleanly fall back to the naive path. Mirrors the Vulkan fix.
Reject Snake fusion when ne[2] > 1 or ne[3] > 1. The kernel only iterates over the first two dimensions, so higher-rank tensors would silently produce garbage on the upper dims. The matcher now falls back to the naive chain, mirroring the Vulkan fix.
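
The two conditions above can be sketched as a single matcher predicate. This is only an illustrative sketch, not the code in ggml-cuda.cu: `dtype`, `tensor_info`, `rank_ok` and `can_fuse_snake` are hypothetical names, and only `types_ok` and `ne` are taken from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical stand-ins for illustration; these are NOT the real ggml types.
enum class dtype { f32, f16, bf16 };

struct tensor_info {
    dtype   type;
    int64_t ne[4]; // number of elements per dimension, ggml-style
};

// True when every listed operand/intermediate has the same element type as x.
static bool types_ok(const tensor_info & x, std::initializer_list<const tensor_info *> others) {
    for (const tensor_info * t : others) {
        if (t->type != x.type) {
            return false;
        }
    }
    return true;
}

// True when x is effectively 2D, which is all the fused kernel iterates over.
static bool rank_ok(const tensor_info & x) {
    return x.ne[2] == 1 && x.ne[3] == 1;
}

// Fuse only if both conditions hold; otherwise the caller runs the naive op chain.
static bool can_fuse_snake(const tensor_info & x, std::initializer_list<const tensor_info *> others) {
    return types_ok(x, others) && rank_ok(x);
}
```

Under these assumptions, an F16 x paired with an F32 a is rejected, as is any x with ne[2] > 1 or ne[3] > 1; both cases simply take the unfused path.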

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES Opus 4.7 and MCP local container

ServeurpersoCom · 2026-05-11T05:41:18Z

ADD/SUB/MUL/DIV in CUDA don't support BF16 in bin_bcast, but supports_op blindly returns true, so my PR crashes the CI when the fallback kicks in. Should I fix supports_op to tell the truth (mirroring Vulkan), or would you prefer extending bin_bcast to BF16 in a dedicated backend PR?

diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 9f90b656d..e25be3592 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -5306,12 +5306,8 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
         case GGML_OP_VIEW:
         case GGML_OP_PERMUTE:
         case GGML_OP_TRANSPOSE:
-        case GGML_OP_ADD:
         case GGML_OP_ADD_ID:
         case GGML_OP_ADD1:
-        case GGML_OP_SUB:
-        case GGML_OP_MUL:
-        case GGML_OP_DIV:
         case GGML_OP_SCALE:
         case GGML_OP_SQR:
         case GGML_OP_SQRT:
@@ -5320,6 +5316,13 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
         case GGML_OP_CLAMP:
         case GGML_OP_LOG:
             return true;
+        case GGML_OP_ADD:
+        case GGML_OP_SUB:
+        case GGML_OP_MUL:
+        case GGML_OP_DIV:
+            return (op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16) &&
+                   (op->src[1]->type == GGML_TYPE_F32 || op->src[1]->type == GGML_TYPE_F16) &&
+                   (op->type         == GGML_TYPE_F32 || op->type         == GGML_TYPE_F16);
         case GGML_OP_SSM_SCAN: {
             if (op->src[3]->ne[0] == 1) {
                 // Mamba2
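
For readers who want to try the type filter outside the backend, the condition added in the diff can be isolated as a small standalone function. This is a sketch under assumptions: `dtype`, `is_f32_or_f16` and `bin_bcast_supported` are hypothetical names, not ggml identifiers; only the F32/F16 condition on src[0], src[1] and the result type comes from the diff.

```cpp
#include <cassert>

// Hypothetical stand-in for illustration; NOT the real ggml_type enum.
enum class dtype { f32, f16, bf16 };

static bool is_f32_or_f16(dtype t) {
    return t == dtype::f32 || t == dtype::f16;
}

// Same shape as the check in the diff: both sources and the destination
// must each be F32 or F16 for ADD/SUB/MUL/DIV to be reported as supported.
static bool bin_bcast_supported(dtype src0, dtype src1, dtype dst) {
    return is_f32_or_f16(src0) && is_f32_or_f16(src1) && is_f32_or_f16(dst);
}
```

With this filter, any triplet containing BF16 is reported as unsupported up front instead of reaching bin_bcast and aborting.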

am17an · 2026-05-11T09:14:09Z

ADD/SUB/MUL/DIV in CUDA don't support BF16 in bin_bcast, but supports_op blindly returns true, so my PR crashes the CI when the fallback kicks in. Should I fix supports_op to tell the truth (mirroring Vulkan)

I think you can just add the supports_op fallback here; we can add BF16 support later if you'd like.

bin_bcast only dispatches F32/F16 type triplets, mirror the vulkan filter so unsupported types fall back through cpy instead of aborting.

ORippler · 2026-05-11T09:36:56Z

All Snake fusion operands and intermediates now share x's type, matching the kernel's single-T template and the float cast on a / inv_b. Mixed-precision chains cleanly fall back to the naive path. Mirrors the Vulkan fix.

Reject Snake fusion when ne[2] > 1 or ne[3] > 1. The kernel only iterates over the first two dimensions, so higher-rank tensors would silently produce garbage on the upper dims. The matcher now falls back to the naive chain, mirroring the Vulkan fix.

Does it make sense to add those checks as test cases to test-backend-ops? Seems like the same fusion pattern is implemented independently in multiple backends.

ServeurpersoCom · 2026-05-11T10:33:40Z

Does it make sense to add those checks as test cases to test-backend-ops? Seems like the same fusion pattern is implemented independently in multiple backends.

Done: added rank-3/rank-4 shapes to test_snake_fuse. Mixed-precision is already covered by the existing F16/BF16 variants (x typed, a/inv_b in F32), which exercise the same types_ok rejection path.
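
To illustrate why the new shapes matter, here is a rough sketch of how many elements a loop over only the first two dimensions would touch compared with the full tensor. The helper names (`shape4`, `total_elements`, `elements_seen_by_2d_loop`, `needs_fallback`) are hypothetical and are not part of test-backend-ops.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using shape4 = std::array<int64_t, 4>; // ne[0..3], ggml-style

// Number of elements in the whole tensor.
static int64_t total_elements(const shape4 & ne) {
    return ne[0] * ne[1] * ne[2] * ne[3];
}

// Number of elements a kernel visits if it only iterates over ne[0] and ne[1].
static int64_t elements_seen_by_2d_loop(const shape4 & ne) {
    return ne[0] * ne[1];
}

// Shapes with ne[2] > 1 or ne[3] > 1 must not be fused and take the naive chain.
static bool needs_fallback(const shape4 & ne) {
    return ne[2] > 1 || ne[3] > 1;
}
```

For ne=[64,32,2,3], a 2D-only loop would cover 2048 of 12288 elements, which is why these shapes have to be rejected by the matcher rather than fused.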

CUDA: 16/16 SNAKE_FUSE tests passed (F32 + F16, BF16 not supported)
Vulkan: 8/8 SNAKE_FUSE tests passed (F32, F16/BF16 not supported)

root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -o SNAKE_FUSE
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  Device memory: 97247 MB (95403 MB free)

  SNAKE_FUSE(type=f32,ne=[5,7,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[33,32,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[1025,13,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[128,16,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[256,192,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,2,1]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,1,2]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,2,3]): OK
  SNAKE_FUSE(type=f16,ne=[5,7,1,1]): OK
  SNAKE_FUSE(type=f16,ne=[33,32,1,1]): OK
  SNAKE_FUSE(type=f16,ne=[1025,13,1,1]): OK
  SNAKE_FUSE(type=f16,ne=[128,16,1,1]): OK
  SNAKE_FUSE(type=f16,ne=[256,192,1,1]): OK
  SNAKE_FUSE(type=f16,ne=[64,32,2,1]): OK
  SNAKE_FUSE(type=f16,ne=[64,32,1,2]): OK
  SNAKE_FUSE(type=f16,ne=[64,32,2,3]): OK
  SNAKE_FUSE(type=bf16,ne=[5,7,1,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[33,32,1,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[1025,13,1,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[128,16,1,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[256,192,1,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[64,32,2,1]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[64,32,1,2]): not supported [CUDA0]
  SNAKE_FUSE(type=bf16,ne=[64,32,2,3]): not supported [CUDA0]
  16/16 tests passed
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK
root@pod:/mnt/workspace/llama.cpp#
root@pod:/mnt/workspace/llama.cpp# ./build/bin/test-backend-ops -o SNAKE_FUSE
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA RTX PRO 6000 Blackwell Workstation Edition (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  Device memory: 97887 MB (95904 MB free)

  SNAKE_FUSE(type=f32,ne=[5,7,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[33,32,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[1025,13,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[128,16,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[256,192,1,1]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,2,1]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,1,2]): OK
  SNAKE_FUSE(type=f32,ne=[64,32,2,3]): OK
  SNAKE_FUSE(type=f16,ne=[5,7,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[33,32,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[1025,13,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[128,16,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[256,192,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[64,32,2,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[64,32,1,2]): not supported [Vulkan0]
  SNAKE_FUSE(type=f16,ne=[64,32,2,3]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[5,7,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[33,32,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[1025,13,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[128,16,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[256,192,1,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[64,32,2,1]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[64,32,1,2]): not supported [Vulkan0]
  SNAKE_FUSE(type=bf16,ne=[64,32,2,3]): not supported [Vulkan0]
  8/8 tests passed
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

ServeurpersoCom · 2026-05-11T16:37:49Z

@ggml-org/maintainers Need a re-approval, please.

* cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan)
* cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review)
* cuda: merge type_ok and types_ok into a single types_ok (address am17an review)
* cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16

  bin_bcast only dispatches F32/F16 type triplets, mirror the vulkan filter so unsupported types fall back through cpy instead of aborting.

* test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases

ServeurpersoCom added 2 commits May 10, 2026 18:09

cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan)

e4f8ba3

cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review)

742dda0

ServeurpersoCom requested a review from a team as a code owner May 10, 2026 16:21

am17an reviewed May 10, 2026

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated

cuda: merge type_ok and types_ok into a single types_ok (address am17an review)

4cd7e1d

github-actions Bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels May 10, 2026

am17an approved these changes May 11, 2026

JohannesGaessler approved these changes May 11, 2026

cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16

5d32517

bin_bcast only dispatches F32/F16 type triplets, mirror the vulkan filter so unsupported types fall back through cpy instead of aborting.

test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases

c7d1a68

ServeurpersoCom requested a review from ggerganov as a code owner May 11, 2026 10:28

github-actions Bot added the testing (Everything test related) label May 11, 2026

pwilkin approved these changes May 11, 2026

ServeurpersoCom merged commit e936660 into ggml-org:master May 11, 2026
47 checks passed

albertnsoliz mentioned this pull request May 24, 2026

Misc. bug: Performance regression from Snake Fusion Hardening on RTX 8000 GPU #23626

Closed

ServeurpersoCom merged 5 commits into ggml-org:master from ServeurpersoCom:ggml/cuda-snake-fusion-hardening

5 participants