@RishiAstra commented Oct 21, 2025

Purpose

This PR speeds up the Mamba2 SSD prefill by about 1.5-2.5x (depending on state dtype) by adding a fully fused Triton kernel. This fused kernel can be used in place of the original Chunk Cumsum, BMM, Chunk State, State Passing, and Chunk Scan kernels. The fusion reorders work and uses synchronization to eliminate some intermediate VRAM writes/reads and to increase cache locality. For Mamba2-2.7B with 64k context (tested in state-spaces/mamba), this results in an end-to-end speedup of ~15-17% on A100 and H100 GPUs.
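
For readers who want a mental model of what is being fused, here is a minimal, unoptimized PyTorch sketch of the chunked SSD recurrence (single sequence, single head, single group; the optional z gating and D skip terms are omitted). It is only an illustration of the five stages listed above under assumed shapes, not the PR's Triton kernel:

```python
import torch

def ssd_chunked_reference(x, dt, A, B, C, chunk_size):
    """Unoptimized reference of the chunked SSD recurrence.
    Shapes (assumed): x (L, P), dt (L,), A scalar, B/C (L, N).
    The numbered comments mark the five stages that the fused kernel combines."""
    L, P = x.shape
    N = B.shape[-1]
    y = torch.zeros_like(x)
    state = torch.zeros(N, P, dtype=x.dtype, device=x.device)
    for start in range(0, L, chunk_size):
        sl = slice(start, min(start + chunk_size, L))
        xc, dtc, Bc, Cc = x[sl], dt[sl], B[sl], C[sl]
        # 1) chunk cumsum: cumulative decay within the chunk
        dA_cs = torch.cumsum(dtc * A, dim=0)                         # (c,)
        # 2) BMM: C @ B^T, masked causally and weighted by decay
        decay = torch.exp(dA_cs[:, None] - dA_cs[None, :])
        attn = (Cc @ Bc.T) * decay * torch.tril(torch.ones_like(decay))
        # 3) chunk scan: intra-chunk output plus contribution of the incoming state
        y_intra = attn @ (dtc[:, None] * xc)
        y_inter = torch.exp(dA_cs)[:, None] * (Cc @ state)
        y[sl] = y_intra + y_inter
        # 4) chunk state: state contribution produced by this chunk alone
        w = (torch.exp(dA_cs[-1] - dA_cs) * dtc)[:, None]            # (c, 1)
        chunk_state = (w * Bc).T @ xc                                # (N, P)
        # 5) state passing: carry the running state across chunks
        state = torch.exp(dA_cs[-1]) * state + chunk_state
    return y
```

Roughly speaking, the per-chunk intermediates in this sketch (the cumulative decay, the C·B^T matrix, the per-chunk state, and the running state) correspond to the tensors that the separate kernels pass through VRAM; the fused kernel keeps more of that work on-chip.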

Test Plan

More tests from @cyang49 are included below.

  • Run all SSD tests from tests/kernels/mamba/test_mamba_ssm_ssd.py against the fused kernel. This effectively doubles the number of tests in that file, so it may reduce the CI speedup benefit from #26538.
  • 16 extra tests are added as test_mamba_chunk_scan_cont_batch_z_d in tests/kernels/mamba/test_mamba_ssm_ssd.py to verify that the kernels work with the optional args z and d.

I also tested an end-to-end model:

vllm serve ibm-ai-platform/Bamba-9B-v1 --hf-overrides '{"mamba2_fast_kernel": true}'
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-ai-platform/Bamba-9B-v1",
    "prompt": "ROMEO AND JULIET\n\nby William Shakespeare\n\nPROLOGUE\n\nTwo households, both alike in dignity,\nIn fair Verona, where we lay our scene,",
    "temperature": 0,
    "top_p": 1,
    "top_k": 1,
    "max_tokens": 100
  }'

Test Result

See more in-depth benchmark and accuracy results from @cyang49 below.

Tests pass locally on an RTX 4090.

For the vllm serve test, I get:

{"id":"cmpl-f64ef265c09943b384fd1acf8579d8f7","object":"text_completion","created":1761083880,"model":"ibm-ai-platform/Bamba-9B-v1","choices":[{"index":0,"text":"--\nFrom ancient grudge break to new mutiny,\nWhere civil blood makes civil hands unclean.\nFrom forth the fatal loins of these two foes\nA pair of star-cross'd lovers take their life;\nWhose misadventured piteous overthrows\nDo with their death bury their parents' strife.\nThe fearful passage of their death-mark'd love,\nAnd the continuance of their parents' rage,\nWhich, but their children's end, nought could remove,\nIs now the","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":35,"total_tokens":135,"completion_tokens":100,"prompt_tokens_details":null},"kv_transfer_params":null}(vllm_mamba_fused)

Questions

  • Is --hf-overrides '{"mamba2_fast_kernel": true}' the most appropriate way to give users the option to use the fused kernel? If so, how can we document it?
  • Can we change the chunk_size to 128 for this kernel? It gives a relative ~5-15% speedup; is that worth it?



@cyang49 commented Oct 25, 2025

I collected some supporting performance results on a single H100.
Short summary:

  • The microbenchmark shows a significant throughput improvement for all problem sizes tested, especially when using CHUNK_SIZE_FUSED=128.
  • The latency improvement for both Granite models is around 2-3% at longer contexts. For nvidia/NVIDIA-Nemotron-Nano-12B-v2 there is no significant improvement, but no significant degradation either.

Microbenchmark

The script was provided by @RishiAstra. I modified the size configs to match granite-4.0-h-small. The metric is throughput in tokens/s.

default (CHUNK_SIZE_FUSED=128)

      dims_b_seq_nh_hd_ng_ds      Original         Fused
0    (1024, 128, 64, 1, 128)  3.452741e+06  1.453885e+07
1    (2048, 128, 64, 1, 128)  7.044192e+06  1.842257e+07
2    (4096, 128, 64, 1, 128)  1.206409e+07  2.090137e+07
3    (8192, 128, 64, 1, 128)  1.262327e+07  2.265086e+07
4   (16384, 128, 64, 1, 128)  1.292440e+07  2.372347e+07
5   (32768, 128, 64, 1, 128)  1.180002e+07  2.272778e+07
6   (65536, 128, 64, 1, 128)  1.106830e+07  2.340398e+07
7  (131072, 128, 64, 1, 128)  1.070444e+07  2.304179e+07
8  (262144, 128, 64, 1, 128)  1.081547e+07  2.386912e+07

matching mamba config (CHUNK_SIZE_FUSED=256)

      dims_b_seq_nh_hd_ng_ds      Original         Fused
0    (1024, 128, 64, 1, 128)  3.547672e+06  1.318500e+07
1    (2048, 128, 64, 1, 128)  7.195458e+06  1.558695e+07
2    (4096, 128, 64, 1, 128)  1.206637e+07  1.756071e+07
3    (8192, 128, 64, 1, 128)  1.262483e+07  1.895314e+07
4   (16384, 128, 64, 1, 128)  1.291429e+07  1.997308e+07
5   (32768, 128, 64, 1, 128)  1.210910e+07  1.947564e+07
6   (65536, 128, 64, 1, 128)  1.130736e+07  2.048072e+07
7  (131072, 128, 64, 1, 128)  1.086217e+07  1.851559e+07
8  (262144, 128, 64, 1, 128)  1.071730e+07  2.051457e+07
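
For context on the metric, a tokens/s number for a kernel microbenchmark is typically derived from wall-clock timing of repeated launches. A generic sketch (not the actual script used above; the callable fn is a placeholder) might look like:

```python
import torch

def tokens_per_second(fn, batch: int, seqlen: int, iters: int = 20, warmup: int = 5) -> float:
    """Rough tokens/s for a CUDA workload fn (a zero-argument callable)."""
    for _ in range(warmup):            # warm up compilation, autotuning, and caches
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3   # elapsed_time() returns milliseconds
    return batch * seqlen * iters / seconds
```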

Latency benchmark

Since the changes mainly affect the Mamba2 SSD path used in prefill, I collected latency measurements with the fused SSD kernel on and off.

  • input-len=131071
  • output-len=1
  • batch-size=1
  • max_num_batched_tokens is varied to control the chunked-prefill size

Test command examples

vllm bench latency --model $MODEL --input-len=131071 --output-len=1 --batch-size=1 --max_num_batched_tokens=16384
vllm bench latency --model $MODEL --input-len=16384 --output-len=1 --batch-size=1 --max_num_batched_tokens=16384 --hf-overrides '{"mamba2_fast_kernel": true}'

ibm-granite/granite-4.0-h-tiny

chunk size   regular (s)   fused5 (s)   Speedup
8192         2.16256852    2.10809545   1.026
16384        2.06876952    2.02572432   1.021
32768        2.04912627    1.98667002   1.031

ibm-granite/granite-4.0-h-small

chunk size   regular (s)   fused5 (s)   Speedup
8192         7.21763917    6.94847805   1.039
16384        6.91420193    6.78786916   1.019
32768        6.95490595    6.78407113   1.025

nvidia/NVIDIA-Nemotron-Nano-12B-v2

chunk size   regular (s)   fused5 (s)   Speedup
8192         7.39740898    7.33307095   1.009
16384        7.3228417     7.26544746   1.008
32768        7.48103932    7.56693773   0.989


@cyang49 commented Oct 25, 2025

lm_eval Results

  • No loss of quality

ibm-granite/granite-4.0-h-tiny

fused off

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.7961   ± 0.0111
                  strict-match       5        exact_match   0.7945   ± 0.0111

fused on

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.8059   ± 0.0109
                  strict-match       5        exact_match   0.8044   ± 0.0109

ibm-granite/granite-4.0-h-small

fused off

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.8560   ± 0.0097
                  strict-match       5        exact_match   0.8552   ± 0.0097

fused on

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.8544   ± 0.0097
                  strict-match       5        exact_match   0.8537   ± 0.0097

nvidia/NVIDIA-Nemotron-Nano-12B-v2

fused off

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.8734   ± 0.0092
                  strict-match       5        exact_match   0.8741   ± 0.0091

fused on

Tasks   Version   Filter             n-shot   Metric        Value    Stderr
gsm8k   3         flexible-extract   5        exact_match   0.8787   ± 0.0090
                  strict-match       5        exact_match   0.8802   ± 0.0089

@RishiAstra marked this pull request as ready for review on October 25, 2025 at 18:10
@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@RishiAstra commented

@codex review

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Signed-off-by: Rishi Astra <[email protected]>
Signed-off-by: Rishi Astra <[email protected]>
Signed-off-by: Rishi Astra <[email protected]>
@RishiAstra commented

@codex review

@chatgpt-codex-connector (bot) commented

Codex Review: Didn't find any major issues. Hooray!



@tdoublep commented Nov 11, 2025

Amazing work!

Is --hf-overrides '{"mamba2_fast_kernel": true}' the most appropriate way to give users the option to use the fused kernel? If so, how can we document it?

I haven't encountered a feature flag being implemented using hf-overrides in vLLM until now. I've seen plenty of environment variables introduced for doing things like this, although I don't necessarily think that is the right approach either.

A more pressing concern, though, is the maintenance cost of introducing so much new code for relatively small latency improvements. Do we understand why the bigger gains seen in the microbenchmarks (up to 2x, if I'm reading correctly?) don't translate into E2E speedups?

@RishiAstra commented

A more pressing concern, though, is the maintenance cost of introducing so much new code for relatively small latency improvements. Do we understand why the bigger gains seen in the microbenchmarks (up to 2x, if I'm reading correctly?) don't translate into E2E speedups?

This kernel speeds up the Mamba2 SSD layer, and the microbenchmarks show that the speedup there is ~2-3x. However, most models that contain Mamba2 SSD layers also contain many other layers. For example, even Mamba2-2.7B contains linear projections, which dilute the 2-3x kernel speedup down to ~15% end-to-end. Other models like granite-4.0-h-tiny, granite-4.0-h-small, and NVIDIA-Nemotron-Nano-12B-v2 contain even more (or heavier) non-SSD layers.

As a concrete example using Amdahl's law, imagine that the Mamba2 SSD layers take up 1/4 of the total runtime in Mamba2-2.7B and we speed them up by 2x. The predicted speedup is:
1 / (other_fraction + SSD_fraction / speedup) = 1 / (3/4 + 1/4 * 1/2) = 1 / (7/8) = 8/7 ≈ 1.14x e2e speedup.
This roughly matches the measured speedup for Mamba2-2.7B. For other models, the Mamba2 SSD layers make up a smaller fraction of the e2e runtime, so the e2e speedup is smaller.
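
For anyone who wants to play with these (assumed) fractions, the estimate is a one-liner:

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when only a fraction of the runtime is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)

print(amdahl_speedup(0.25, 2.0))  # ~1.14x, the Mamba2-2.7B estimate above
print(amdahl_speedup(0.10, 2.0))  # ~1.05x, a model where SSD is a smaller share
```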

Although the kernel is a lot of code, it's mostly contained in 1 file. There is a chance that future variants of Mamba would require modifying this kernel, adding maintenance cost, but there is also a chance that future models will use more or larger Mamba2 layers, causing larger speedups and more benefit.
