CUDA: fuse SSM_CONV + ADD(bias) + SILU #22478

Merged
am17an merged 4 commits into ggml-org:master from anavp-nvidia:ssm_conv_bias_silu_fusion on Apr 29, 2026

Conversation

@anavp-nvidia (Contributor) commented on Apr 28, 2026

Overview

Adds a CUDA fusion for SSM_CONV + ADD(bias) + SILU. The existing SSM_CONV + SILU fusion didn't match on Mamba-1 and Mamba-2 layers (used by Nemotron-H, Granite-Hybrid, Jamba, and other Mamba-style hybrids) because of a bias ADD operation between the conv and the SILU.
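
For intuition, here is a minimal sketch of what the fused op computes in a single pass (not the merged kernel; tensor layouts are simplified and all names are illustrative). The win is launching one kernel instead of three and writing `dst` once instead of materializing two intermediate tensors:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One thread per (channel, token) output: depthwise causal conv over a
// d_conv window, then the bias ADD and SiLU applied in-register.
__global__ void ssm_conv_bias_silu_sketch(const float * __restrict__ x,    // [n_t + d_conv - 1, n_c] padded input
                                          const float * __restrict__ w,    // [d_conv, n_c] conv weights
                                          const float * __restrict__ bias, // [n_c] per-channel bias
                                          float * __restrict__ dst,        // [n_t, n_c]
                                          int n_c, int n_t, int d_conv) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_c * n_t) {
        return;
    }
    const int c = i % n_c; // channel
    const int t = i / n_c; // token

    float acc = 0.0f;
    for (int k = 0; k < d_conv; ++k) {
        acc += x[(t + k) * n_c + c] * w[k * n_c + c];
    }
    acc += bias[c];                     // the fused ADD(bias)
    dst[i] = acc / (1.0f + expf(-acc)); // the fused SiLU: x * sigmoid(x)
}
```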

Additional information

| Model | Test | t/s master | t/s ssm_conv-bias-silu-fusion | Speedup |
| --- | --- | --- | --- | --- |
| State Spaces Mamba 2.8B Q4_K_M | tg128 | 292.79 | 307.65 | 1.05 |
| Mamba Codestral 7B v0.1 Q4_K_M | tg128 | 193.84 | 200.32 | 1.03 |
| Nemotron 3 Nano 30B-A3B Q4_K_M | tg128 | 304.77 | 312.37 | 1.02 |
| Nemotron H 8B Reasoning 128K Q4_K_M | tg128 | 224.33 | 228.94 | 1.02 |
| Granite 4.0 H Small Q4_K_M | tg128 | 149.26 | 151.04 | 1.01 |
| Qwen 3 Next 80B-A3B Q4_K_M | tg128 | 191.56 | 191.33 | 1.00 |
| Mistral Nemo Minitron 8B Base Q4_K_M | tg128 | 221.41 | 221.23 | 1.00 |
  • All testing was done on a Windows system with an RTX Pro 6000 Blackwell GPU.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, AI tools were used for code review.

@anavp-nvidia requested review from a team and ggerganov as code owners on April 28, 2026 14:38
Review comment thread: tests/test-backend-ops.cpp
@github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) on Apr 28, 2026
@gaugarg-nv (Contributor) left a comment:

I think we need to verify the actual unary subtype with ggml_get_unary_op(silu) in both the previous and the new pattern-matching code.
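
For context, a GGML_OP_UNARY node can carry any unary activation (SiLU, GELU, RELU, ...), so matching on the op alone could fuse the wrong epilogue. A minimal sketch of the guard being asked for, with an illustrative helper name:

```cpp
#include "ggml.h"

// Sketch only: without the subtype check, any unary op following the
// bias ADD (e.g. a GELU) would be mistaken for a SiLU and fused incorrectly.
static bool node_is_silu(const struct ggml_tensor * node) {
    return node->op == GGML_OP_UNARY &&
           ggml_get_unary_op(node) == GGML_UNARY_OP_SILU;
}
```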

Two review comment threads: ggml/src/ggml-cuda/ggml-cuda.cu
@anavp-nvidia force-pushed the ssm_conv_bias_silu_fusion branch from 6410eb7 to 3a7085f on April 29, 2026 08:54
Review comment thread: ggml/src/ggml-cuda/ssm-conv.cu (outdated), comment on lines +4 to +6:

```cuda
template <bool apply_bias, bool apply_silu, size_t split_d_inner, size_t d_conv>
static __global__ void ssm_conv_f32(const float * __restrict__ src0, const float * __restrict__ src1,
                                    const float * __restrict__ bias,
```
A collaborator left a comment:
Is it really necessary to template the kernel from a perf perspective, as opposed to checking bias against nullptr (this can be done in the same ternary expression)? We should be mindful of binary bloat and only template that which is truly necessary for performance.

I'd imagine the same can potentially apply to apply_silu as well, but that's beyond the scope of this PR
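
A sketch of the suggested shape, assuming the host code simply passes bias = nullptr when the graph has no bias ADD (illustrative, not the merged kernel):

```cuda
#include <math.h>

// One instantiation handles both the biased and bias-free paths by branching
// on the pointer, instead of a template <bool apply_bias> parameter that
// doubles the number of kernels in the binary. The branch is uniform across
// the grid, so it is cheap.
__device__ __forceinline__ float conv_epilogue(float acc, const float * bias,
                                               int channel, bool apply_silu) {
    acc += bias ? bias[channel] : 0.0f; // same ternary covers both cases
    return apply_silu ? acc / (1.0f + expf(-acc)) : acc;
}
```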

@anavp-nvidia (Contributor, Author) replied:
Makes sense. I've now done as you suggested.

Review comment thread: ggml/src/ggml-cuda/ssm-conv.cu (outdated)
Review comment thread: tests/test-backend-ops.cpp (outdated)
@ggerganov (Member) left a comment:
The test-backend-ops changes are OK
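
For reference, the op chain such a test has to exercise can be built from public ggml API calls; a minimal sketch (not the actual test-backend-ops change; dimensions and names are illustrative):

```cpp
#include "ggml.h"

// Build the conv -> add(bias) -> silu chain that the CUDA backend is
// expected to fuse into a single kernel.
static struct ggml_tensor * build_fused_chain(struct ggml_context * ctx,
                                              struct ggml_tensor * sx,   // conv input
                                              struct ggml_tensor * c,    // conv weights
                                              struct ggml_tensor * bias) {
    struct ggml_tensor * conv   = ggml_ssm_conv(ctx, sx, c);
    struct ggml_tensor * biased = ggml_add(ctx, conv, bias);
    return ggml_silu(ctx, biased);
}
```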

@am17an merged commit 098705a into ggml-org:master on Apr 29, 2026
47 checks passed
tekintian added a commit to tekintian/llama.cpp that referenced this pull request May 1, 2026
* 'master' of github.com:tekintian/llama.cpp: (659 commits)
  ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (ggml-org#22464)
  Update llama-mmap to use ftello/fseeko (ggml-org#22497)
  common : check for null getpwuid in hf-cache (ggml-org#22550)
  vulkan: add get/set tensor 2d functions (ggml-org#22514)
  spec: fix argument typo (ggml-org#22552)
  ci : bump ty to 0.0.33 (ggml-org#22535)
  vendor : update cpp-httplib to 0.43.2 (ggml-org#22548)
  CUDA: fix tile FA kernel on Pascal (ggml-org#22541)
  scripts : add wc2wt.sh - create worktree from current HEAD (ggml-org#22513)
  add fast matmul iquants (ggml-org#22504)
  spec : fix draft model checkpoints (ggml-org#22521)
  spec : fix vocab compat checks in spec example (ggml-org#22426)
  common : do not pass prompt tokens to reasoning budget sampler (ggml-org#22488)
  hexagon: make vmem and buffer-size configurable (ggml-org#22487)
  CUDA: fuse SSM_CONV + ADD(bias) + SILU (ggml-org#22478)
  spec : disacard last drafted token with low prob (ggml-org#22506)
  sync : ggml
  ggml : bump version to 0.10.1 (ggml/1469)
  webui: fix slow mic stop and WAV encode (ggml-org#22480)
  ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (ggml-org#22293)
  ...

# Conflicts:
#	.gitignore
cnsiva added a commit to saas-home/llama.cpp that referenced this pull request May 1, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
Crssz pushed a commit to Crssz/buun-llama-cpp that referenced this pull request May 1, 2026
Major upstream additions:
- CUDA graph improvements: LRU eviction, node property tracking, uid-based reuse
- Flash attention: stream-k fixup kernel, DKQ=320/DV=256 support, Pascal fix
- SSM_CONV + ADD + SILU 3-node fusion (ggml-org#22478)
- Blackwell native NVFP4 support (ggml-org#22196)
- Q1_0 1-bit quantization (CPU, CUDA, Metal, Vulkan, WebGPU)
- Backend-agnostic tensor parallelism (ggml-org#19378)
- Speculative decoding: checkpointing, param refactoring, low-prob discard
- libcommon renamed to libllama-common (ggml-org#21936)
- Server: /api endpoints removed, checkpoint support, CVE-2026-21869 fix
- Model refactors: build_qkv/create_tensor_qkv helpers, cmake glob for models
- Recurrent state serialization fix for partial reads/writes (ggml-org#22362)
- Fast mat-vec kernels for i-quants (ggml-org#22344, ggml-org#22504)

Conflict resolution (22 files):
- Turbo quant type IDs shifted +1 (42-46) to accommodate Q1_0 (41)
- SSM_CONV tree kernels preserved alongside new fusion
- DFlash spec decode coexists with upstream checkpointing
- Server slot fields renamed: drafted→spec_draft, i_batch_dft→spec_i_batch
- Qwen3.5/DeltaNet model registration uses new create_tensor_qkv helper
- Gemma4 BF16 precision fix preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026

Labels

  • ggml: changes relating to the ggml tensor library for machine learning
  • Nvidia GPU: Issues specific to Nvidia GPUs
  • testing: Everything test related

5 participants