@RishiAstra commented Oct 21, 2025

Purpose

This PR speeds up the Mamba2 SSD prefill by about 1.5-2.5x (depending on state dtype) by adding a fully fused Triton kernel. This fused kernel can be used in place of the original Chunk Cumsum, BMM, Chunk State, State Passing, and Chunk Scan kernels. The fusion reorders work and uses synchronization to eliminate some intermediate VRAM writes/reads and to increase cache locality. For Mamba2-2.7B with 64k context (tested in state-spaces/mamba), this results in an end-to-end speedup of ~15-17% on A100 and H100 GPUs.
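
For readers who want a mental model of what is being fused, here is a minimal, unoptimized PyTorch sketch of the chunked SSD recurrence (single sequence, single head, single group; the optional z gating and D skip terms are omitted). It is only an illustration of the five stages listed above under assumed shapes, not the PR's Triton kernel:

```python
import torch

def ssd_chunked_reference(x, dt, A, B, C, chunk_size):
    """Unoptimized reference of the chunked SSD recurrence.
    Shapes (assumed): x (L, P), dt (L,), A scalar, B/C (L, N).
    The numbered comments mark the five stages that the fused kernel combines."""
    L, P = x.shape
    N = B.shape[-1]
    y = torch.zeros_like(x)
    state = torch.zeros(N, P, dtype=x.dtype, device=x.device)
    for start in range(0, L, chunk_size):
        sl = slice(start, min(start + chunk_size, L))
        xc, dtc, Bc, Cc = x[sl], dt[sl], B[sl], C[sl]
        # 1) chunk cumsum: cumulative decay within the chunk
        dA_cs = torch.cumsum(dtc * A, dim=0)                         # (c,)
        # 2) BMM: C @ B^T, masked causally and weighted by decay
        decay = torch.exp(dA_cs[:, None] - dA_cs[None, :])
        attn = (Cc @ Bc.T) * decay * torch.tril(torch.ones_like(decay))
        # 3) chunk scan: intra-chunk output plus contribution of the incoming state
        y_intra = attn @ (dtc[:, None] * xc)
        y_inter = torch.exp(dA_cs)[:, None] * (Cc @ state)
        y[sl] = y_intra + y_inter
        # 4) chunk state: state contribution produced by this chunk alone
        w = (torch.exp(dA_cs[-1] - dA_cs) * dtc)[:, None]            # (c, 1)
        chunk_state = (w * Bc).T @ xc                                # (N, P)
        # 5) state passing: carry the running state across chunks
        state = torch.exp(dA_cs[-1]) * state + chunk_state
    return y
```

Roughly speaking, the per-chunk intermediates in this sketch (the cumulative decay, the C·B^T matrix, the per-chunk state, and the running state) correspond to the tensors that the separate kernels pass through VRAM; the fused kernel keeps more of that work on-chip.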

Test Plan

More tests from @cyang49 are included below.

  • Run all SSD tests from tests/kernels/mamba/test_mamba_ssm_ssd.py against the fused kernel. This effectively doubles the number of tests in that file, so it may reduce the CI speedup benefit from #26538.
  • 16 extra tests are added as test_mamba_chunk_scan_cont_batch_z_d in tests/kernels/mamba/test_mamba_ssm_ssd.py to verify that the kernels work with the optional args z and d.

I also tested an end-to-end model:

vllm serve ibm-ai-platform/Bamba-9B-v1 --hf-overrides '{"mamba2_fast_kernel": true}'
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-ai-platform/Bamba-9B-v1",
    "prompt": "ROMEO AND JULIET\n\nby William Shakespeare\n\nPROLOGUE\n\nTwo households, both alike in dignity,\nIn fair Verona, where we lay our scene,",
    "temperature": 0,
    "top_p": 1,
    "top_k": 1,
    "max_tokens": 100
  }'

Test Result

See more in-depth benchmark and accuracy results from @cyang49 below.

Tests pass locally on an RTX 4090.

For the vllm serve test, I get:

{"id":"cmpl-f64ef265c09943b384fd1acf8579d8f7","object":"text_completion","created":1761083880,"model":"ibm-ai-platform/Bamba-9B-v1","choices":[{"index":0,"text":"--\nFrom ancient grudge break to new mutiny,\nWhere civil blood makes civil hands unclean.\nFrom forth the fatal loins of these two foes\nA pair of star-cross'd lovers take their life;\nWhose misadventured piteous overthrows\nDo with their death bury their parents' strife.\nThe fearful passage of their death-mark'd love,\nAnd the continuance of their parents' rage,\nWhich, but their children's end, nought could remove,\nIs now the","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":35,"total_tokens":135,"completion_tokens":100,"prompt_tokens_details":null},"kv_transfer_params":null}(vllm_mamba_fused)

Questions

  • Is --hf-overrides '{"mamba2_fast_kernel": true}' the most appropriate way to give users the option to use the fused kernel? If so, how can we document it?
  • Can we change the chunk_size to 128 for this kernel? It gives a relative ~5-15% speedup; is that worth it?



@cyang49 commented Oct 25, 2025

I collected some supporting performance results on a single H100.
Short summary:

  • The microbenchmark shows a significant throughput improvement for all problem sizes tested, especially when using CHUNK_SIZE_FUSED=128.
  • The latency improvement for both Granite models is around 2-3% at longer contexts. For nvidia/NVIDIA-Nemotron-Nano-12B-v2 there is no significant improvement, but no significant degradation either.

Microbenchmark

The script was provided by @RishiAstra. I modified the size configs to match granite-4.0-h-small. The metric is throughput in tokens/s.

default (CHUNK_SIZE_FUSED=128)

      dims_b_seq_nh_hd_ng_ds      Original         Fused
0    (1024, 128, 64, 1, 128)  3.452741e+06  1.453885e+07
1    (2048, 128, 64, 1, 128)  7.044192e+06  1.842257e+07
2    (4096, 128, 64, 1, 128)  1.206409e+07  2.090137e+07
3    (8192, 128, 64, 1, 128)  1.262327e+07  2.265086e+07
4   (16384, 128, 64, 1, 128)  1.292440e+07  2.372347e+07
5   (32768, 128, 64, 1, 128)  1.180002e+07  2.272778e+07
6   (65536, 128, 64, 1, 128)  1.106830e+07  2.340398e+07
7  (131072, 128, 64, 1, 128)  1.070444e+07  2.304179e+07
8  (262144, 128, 64, 1, 128)  1.081547e+07  2.386912e+07

matching mamba config (CHUNK_SIZE_FUSED=256)

      dims_b_seq_nh_hd_ng_ds      Original         Fused
0    (1024, 128, 64, 1, 128)  3.547672e+06  1.318500e+07
1    (2048, 128, 64, 1, 128)  7.195458e+06  1.558695e+07
2    (4096, 128, 64, 1, 128)  1.206637e+07  1.756071e+07
3    (8192, 128, 64, 1, 128)  1.262483e+07  1.895314e+07
4   (16384, 128, 64, 1, 128)  1.291429e+07  1.997308e+07
5   (32768, 128, 64, 1, 128)  1.210910e+07  1.947564e+07
6   (65536, 128, 64, 1, 128)  1.130736e+07  2.048072e+07
7  (131072, 128, 64, 1, 128)  1.086217e+07  1.851559e+07
8  (262144, 128, 64, 1, 128)  1.071730e+07  2.051457e+07
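
For context on the metric, a tokens/s number for a kernel microbenchmark is typically derived from wall-clock timing of repeated launches. A generic sketch (not the actual script used above; the callable fn is a placeholder) might look like:

```python
import torch

def tokens_per_second(fn, batch: int, seqlen: int, iters: int = 20, warmup: int = 5) -> float:
    """Rough tokens/s for a CUDA workload fn (a zero-argument callable)."""
    for _ in range(warmup):            # warm up compilation, autotuning, and caches
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3   # elapsed_time() returns milliseconds
    return batch * seqlen * iters / seconds
```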

Latency benchmark

Since the changes mainly affect the Mamba2 SSD path used in prefill, I collected latency measurements with the fused SSD kernel on and off.

  • input-len=131071
  • output-len=1
  • batch-size=1
  • max_num_batched_tokens is varied to control the chunked-prefill size

Test command examples

vllm bench latency --model $MODEL --input-len=131071 --output-len=1 --batch-size=1 --max_num_batched_tokens=16384
vllm bench latency --model $MODEL --input-len=16384 --output-len=1 --batch-size=1 --max_num_batched_tokens=16384 --hf-overrides '{"mamba2_fast_kernel": true}'

ibm-granite/granite-4.0-h-tiny

chunk size   regular (s)   fused5 (s)   Speedup
8192         2.16256852    2.10809545   1.026
16384        2.06876952    2.02572432   1.021
32768        2.04912627    1.98667002   1.031

ibm-granite/granite-4.0-h-small

chunk size   regular (s)   fused5 (s)   Speedup
8192         7.21763917    6.94847805   1.039
16384        6.91420193    6.78786916   1.019
32768        6.95490595    6.78407113   1.025

nvidia/NVIDIA-Nemotron-Nano-12B-v2

chunk size   regular (s)   fused5 (s)   Speedup
8192         7.39740898    7.33307095   1.009
16384        7.3228417     7.26544746   1.008
32768        7.48103932    7.56693773   0.989


@cyang49 commented Oct 25, 2025

lm_eval Results

  • No loss of quality

ibm-granite/granite-4.0-h-tiny

fused off

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.7961   ± 0.0111
                  strict-match       5        exact_match   0.7945   ± 0.0111

fused on

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.8059   ± 0.0109
                  strict-match       5        exact_match   0.8044   ± 0.0109

ibm-granite/granite-4.0-h-small

fused off

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.8560   ± 0.0097
                  strict-match       5        exact_match   0.8552   ± 0.0097

fused on

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.8544   ± 0.0097
                  strict-match       5        exact_match   0.8537   ± 0.0097

nvidia/NVIDIA-Nemotron-Nano-12B-v2

fused off

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.8734   ± 0.0092
                  strict-match       5        exact_match   0.8741   ± 0.0091

fused on

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.8787   ± 0.0090
                  strict-match       5        exact_match   0.8802   ± 0.0089

@RishiAstra marked this pull request as ready for review on October 25, 2025 at 18:10
@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@RishiAstra commented

@codex review

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Signed-off-by: Rishi Astra <[email protected]>
Signed-off-by: Rishi Astra <[email protected]>
Signed-off-by: Rishi Astra <[email protected]>
@RishiAstra commented

@codex review

@chatgpt-codex-connector (bot) commented

Codex Review: Didn't find any major issues. Hooray!



@tdoublep commented Nov 11, 2025

Amazing work!

Is --hf-overrides '{"mamba2_fast_kernel": true}' the most appropriate way to give users the option to use the fused kernel? If so, how can we document it?

I haven't encountered a feature flag being implemented using hf-overrides in vLLM until now. I've seen plenty of environment variables introduced for doing things like this, although I don't necessarily think that is the right approach either.

A more pressing concern, though, is the maintenance cost of introducing so much new code for relatively small latency improvements. Do we understand why the bigger gains seen in the microbenchmarks (up to 2x, if I'm reading correctly?) don't translate into E2E speedups?

@RishiAstra commented

A more pressing concern, though, is the maintenance cost of introducing so much new code for relatively small latency improvements. Do we understand why the bigger gains seen in the microbenchmarks (up to 2x, if I'm reading correctly?) don't translate into E2E speedups?

This kernel speeds up the Mamba2 SSD layer, and the microbenchmarks show that the speedup there is ~2-3x. However, most models that contain Mamba2 SSD layers also contain many other layers. For example, even Mamba2-2.7B contains linear projections, which dilute the 2-3x kernel speedup down to ~15% end-to-end. Other models like granite-4.0-h-tiny, granite-4.0-h-small, and NVIDIA-Nemotron-Nano-12B-v2 contain even more (or heavier) non-SSD layers.

As a concrete example using Amdahl's law, imagine that the Mamba2 SSD layers take up 1/4 of the total runtime in Mamba2-2.7B and we speed them up by 2x. The predicted speedup is:
1 / (other_fraction + SSD_fraction / speedup) = 1 / (3/4 + 1/4 * 1/2) = 1 / (7/8) = 8/7 ≈ 1.14x e2e speedup.
This roughly matches the measured speedup for Mamba2-2.7B. For other models, the Mamba2 SSD layers make up a smaller fraction of the e2e runtime, so the e2e speedup is smaller.
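
For anyone who wants to play with these (assumed) fractions, the estimate is a one-liner:

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when only a fraction of the runtime is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)

print(amdahl_speedup(0.25, 2.0))  # ~1.14x, the Mamba2-2.7B estimate above
print(amdahl_speedup(0.10, 2.0))  # ~1.05x, a model where SSD is a smaller share
```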

Although the kernel is a lot of code, it's mostly contained in 1 file. There is a chance that future variants of Mamba would require modifying this kernel, adding maintenance cost, but there is also a chance that future models will use more or larger Mamba2 layers, causing larger speedups and more benefit.
