[Mamba] Flashinfer selective_state_update#36162

Merged
mgoin merged 13 commits into vllm-project:main from
roikoren755:feat/flashinfer-selective-state-update
Apr 14, 2026

Conversation

@roikoren755
Contributor

@roikoren755 roikoren755 commented Mar 5, 2026

Purpose

Add a wrapper for FlashInfer's selective_state_update kernel, together with a runtime dispatcher connected to a new config field, to select between it and the existing Triton implementation.

As suggested by @tdoublep in #35753, I've introduced MambaConfig in this PR; a follow-up (or this PR, if you'd prefer) could move the config fields relevant to Mamba into it.
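
For illustration, a minimal sketch of this dispatch pattern; the registry and the register_ssu_backend / selective_state_update names here are hypothetical, not the exact identifiers added in this PR:

from typing import Callable

# Registry mapping backend name -> kernel implementation. The PR wires in
# the real Triton and FlashInfer kernels; entries here are placeholders.
_SSU_BACKENDS: dict[str, Callable] = {}

def register_ssu_backend(name: str):
    """Register a selective_state_update implementation under a backend name."""
    def decorator(fn: Callable) -> Callable:
        _SSU_BACKENDS[name] = fn
        return fn
    return decorator

def selective_state_update(*args, backend: str = "triton", **kwargs):
    """Dispatch to the kernel selected by the config-driven backend field."""
    try:
        impl = _SSU_BACKENDS[backend]
    except KeyError:
        raise ValueError(
            f"Unknown mamba backend {backend!r}; "
            f"available: {sorted(_SSU_BACKENDS)}") from None
    return impl(*args, **kwargs)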

Test Plan

Add a new test file covering the dispatcher's functionality (see the sketch after this list).
Add e2e tests for Nemotron 3 Nano that use FlashInfer's kernel.
Benchmark the two backends.
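
A rough sketch of what the dispatcher test could look like, reusing the hypothetical registry names from the sketch above (the real tests live in tests/kernels/mamba/test_ssu_dispatch.py):

import pytest

@pytest.mark.parametrize("backend", ["triton", "flashinfer"])
def test_dispatch_selects_backend(backend):
    calls = []

    # Register a fake kernel under the requested name so the test can
    # observe which implementation the dispatcher picks.
    @register_ssu_backend(backend)
    def fake_kernel(*args, **kwargs):
        calls.append(backend)

    selective_state_update(backend=backend)
    assert calls == [backend]

def test_unknown_backend_raises():
    with pytest.raises(ValueError):
        selective_state_update(backend="no-such-backend")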

Test Result

Tests pass; e2e runs show the same GSM8K score for both backends on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.
I served nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with the following command (swapping the --mamba-backend argument for the flashinfer measurement):

vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --tensor-parallel-size 4 --max-model-len 8192 --mamba-backend triton \
    --trust-remote-code

And benchmarked with:

vllm bench serve \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --num-prompts 500 --request-rate inf \
    --input-len 256 --output-len 256

Got the following results:
Triton:

============ Serving Benchmark Result ============
Successful requests:                     500
Failed requests:                         0
Benchmark duration (s):                  5.88
Total input tokens:                      128002
Total generated tokens:                  128000
Request throughput (req/s):              85.02
Output token throughput (tok/s):         21765.03
Peak output token throughput (tok/s):    31129.00
Peak concurrent requests:                500.00
Total token throughput (tok/s):          43530.41
---------------Time to First Token----------------
Mean TTFT (ms):                          1256.51
Median TTFT (ms):                        1233.83
P99 TTFT (ms):                           1770.09
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.77
Median TPOT (ms):                        17.90
P99 TPOT (ms):                           19.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.82
Median ITL (ms):                         15.93
P99 ITL (ms):                            69.00
==================================================

Flashinfer:

============ Serving Benchmark Result ============
Successful requests:                     500
Failed requests:                         0
Benchmark duration (s):                  5.64
Total input tokens:                      128002
Total generated tokens:                  128000
Request throughput (req/s):              88.60
Output token throughput (tok/s):         22682.25
Peak output token throughput (tok/s):    32938.00
Peak concurrent requests:                500.00
Total token throughput (tok/s):          45364.86
---------------Time to First Token----------------
Mean TTFT (ms):                          1183.95
Median TTFT (ms):                        1161.50
P99 TTFT (ms):                           1656.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.13
Median TPOT (ms):                        17.26
P99 TPOT (ms):                           18.60
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.18
Median ITL (ms):                         15.53
P99 ITL (ms):                            63.94
==================================================

This works out to a ~4-5% speedup across nearly all metrics:

  ┌────────────────────────┬──────────────┬──────────────┬───────┐
  │         Metric         │    Triton    │  FlashInfer  │ Delta │
  ├────────────────────────┼──────────────┼──────────────┼───────┤
  │ Duration               │ 5.88s        │ 5.64s        │ -4.1% │
  ├────────────────────────┼──────────────┼──────────────┼───────┤
  │ Request throughput     │ 85.02 req/s  │ 88.60 req/s  │ +4.2% │
  ├────────────────────────┼──────────────┼──────────────┼───────┤
  │ Output throughput      │ 21,765 tok/s │ 22,682 tok/s │ +4.2% │
  ├────────────────────────┼──────────────┼──────────────┼───────┤
  │ Peak output throughput │ 31,129 tok/s │ 32,938 tok/s │ +5.8% │
  ├────────────────────────┼──────────────┼──────────────┼───────┤
  │ TPOT median            │ 17.90 ms     │ 17.26 ms     │ -3.6% │
  ├────────────────────────┼──────────────┼──────────────┼───────┤
  │ TPOT P99               │ 19.41 ms     │ 18.60 ms     │ -4.2% │
  ├────────────────────────┼──────────────┼──────────────┼───────┤
  │ ITL P99                │ 69.00 ms     │ 63.94 ms     │ -7.3% │
  └────────────────────────┴──────────────┴──────────────┴───────┘

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for the Flashinfer selective_state_update kernel as an alternative to the existing Triton implementation. A dispatcher has been added to select between the two backends at runtime, controlled by a new MambaConfig configuration object. The changes are well-structured, with the new backend logic cleanly integrated into the existing architecture. The pull request also includes comprehensive tests for the new dispatcher and backend, including checks for unsupported features in the Flashinfer kernel, ensuring robustness. The argument parsing and engine configuration have been updated appropriately to expose the new backend option. Overall, this is a high-quality contribution that significantly improves performance for Mamba models.

@roikoren755 roikoren755 force-pushed the feat/flashinfer-selective-state-update branch from c8b1288 to c2e89c3 on March 5, 2026 17:49
@mergify mergify Bot added the ci/build label Mar 5, 2026
@roikoren755 roikoren755 force-pushed the feat/flashinfer-selective-state-update branch from 10c42a1 to 2761ad6 on March 15, 2026 13:53
Member

@hmellor hmellor left a comment


Could we not introduce an enum to vllm.config, as we have been reserving this namespace for config classes only?

Comment thread vllm/config/__init__.py Outdated
@@ -16,6 +16,7 @@
from vllm.config.kv_transfer import KVTransferConfig
from vllm.config.load import LoadConfig
from vllm.config.lora import LoRAConfig
from vllm.config.mamba import MambaBackendEnum, MambaConfig
Member


Suggested change
from vllm.config.mamba import MambaBackendEnum, MambaConfig
from vllm.config.mamba import MambaConfig

Comment thread vllm/config/__init__.py Outdated
@@ -82,6 +83,9 @@
"LoadConfig",
# From vllm.config.lora
"LoRAConfig",
# From vllm.config.mamba
"MambaBackendEnum",
Member


Suggested change
"MambaBackendEnum",

@mgoin mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label Mar 28, 2026
@mergify
Contributor

mergify Bot commented Mar 30, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @roikoren755.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 30, 2026
@roikoren755 roikoren755 force-pushed the feat/flashinfer-selective-state-update branch from 2761ad6 to 5bdac84 on April 5, 2026 18:45
@mergify mergify Bot removed the needs-rebase label Apr 5, 2026
@roikoren755 roikoren755 requested a review from vadiklyutiy as a code owner April 5, 2026 19:02
Comment thread tests/kernels/mamba/test_ssu_dispatch.py
Comment thread tests/kernels/mamba/test_ssu_dispatch.py
Comment thread vllm/config/mamba.py
Comment thread vllm/config/mamba.py
Comment thread vllm/model_executor/layers/mamba/ops/ssu_dispatch.py Outdated
Copy link
Copy Markdown
Member

@tomeras91 tomeras91 left a comment


Overall looks good!

I do have some comments though:

  1. I agree with @amirkl94 we should have the triton backend as default so we don't need to deal with None
  2. I'm not sure the current dispatch mechanism to choose between triton/FlashInfer is the most fitting one. I understand the pros and cons, but worth thinking if we should use a pattern that is already used elsewhere for backend selection instead of introducing something new. Maybe the recently added "Custom Ops IR" is more fitting?
  3. [most important] You mention in the description that the current FlashInfer impl lacks support for some features (specifically SpecDec). Yet, these are not reflected in the code. So if the user selects the flashinfer backend and runs with SpecDec, they will get an opaque error from flashinfer (or worse - silent failures resulting in bad outputs). Maybe we should add some defensive runtime checks? Something like this:
  if dst_state_batch_indices is not None and \
     dst_state_batch_indices is not state_batch_indices:
      raise NotImplementedError(
          "FlashInfer SSU backend does not yet support separate "
          "dst_state_batch_indices. Use --mamba-backend triton.")

  if num_accepted_tokens is not None:
      raise NotImplementedError(
          "FlashInfer SSU backend does not yet support "
          "num_accepted_tokens (cache rewind). Use --mamba-backend triton.")

@roikoren755
Contributor Author

Overall looks good!

I do have some comments though:

  1. I agree with @amirkl94 we should have the triton backend as default so we don't need to deal with None
  2. I'm not sure the current dispatch mechanism to choose between triton/FlashInfer is the most fitting one. I understand the pros and cons, but worth thinking if we should use a pattern that is already used elsewhere for backend selection instead of introducing something new. Maybe the recently added "Custom Ops IR" is more fitting?
  3. [most important] You mention in the description that the current FlashInfer impl lacks support for some features (specifically SpecDec). Yet, these are not reflected in the code. So if the user selects the flashinfer backend and runs with SpecDec, they will get an opaque error from flashinfer (or worse - silent failures resulting in bad outputs). Maybe we should add some defensive runtime checks? Something like this:
  if dst_state_batch_indices is not None and \
     dst_state_batch_indices is not state_batch_indices:
      raise NotImplementedError(
          "FlashInfer SSU backend does not yet support separate "
          "dst_state_batch_indices. Use --mamba-backend triton.")

  if num_accepted_tokens is not None:
      raise NotImplementedError(
          "FlashInfer SSU backend does not yet support "
          "num_accepted_tokens (cache rewind). Use --mamba-backend triton.")
  1. Dealing with None was actually simpler, but I'll fix that now.
  2. I may be mistaken, but it looks like this doesn't quite fit: there isn't (currently) a native PyTorch implementation, and the triton and flashinfer kernels support the same features, so it would boil down to an implementation priority list, which I don't think is what we want to aim for...
  3. That's a mistake, and I've deleted that part of the description. SpecDec (and all other features) is fully supported, and is covered in tests after this PR, so there's no need for the NotImplementedErrors.

@roikoren755 roikoren755 force-pushed the feat/flashinfer-selective-state-update branch from 539b43e to a6ced90 on April 13, 2026 11:02
Member

@tdoublep tdoublep left a comment


LGTM

We might want to revisit whether the dispatch could be better handled by vLLM IR at a later stage.

@mgoin mgoin merged commit ecf5ff7 into vllm-project:main Apr 14, 2026
80 checks passed
@roikoren755 roikoren755 deleted the feat/flashinfer-selective-state-update branch April 15, 2026 08:21
@roikoren755
Contributor Author

LGTM

We might want to revisit whether the dispatch could be better handled by vLLM IR at a later stage.

That was @tomeras91 's suggestion as well, but from what I saw there are a couple of issues there:

  • We need a native PyTorch base implementation ([CPU] Enable Granite 4 / Mamba models on CPU backend #39157 will add it)
  • The main differentiator in IR seems to be which args the different implementations support. Since the flashinfer and triton kernels have the same support, I'm not sure this is quite the right fit...

zxd1997066 pushed a commit to zxd1997066/vllm that referenced this pull request Apr 15, 2026
@tomeras91
Member

  • The main differentiator in IR seems to be which args the different implementations support. Since the flashinfer and triton kernels have the same support, I'm not sure this is quite the right fit...

I think in #39262 IR is also used as a priority mechanism. It's not exactly what we want here (control via a CLI arg), but it shows IR isn't only about different arg support across implementations.

whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), v1
