[Perf] Add tuned selective_state_update configs for H200 and RTX PRO … by Majid-Taheri · Pull Request #44251 · vllm-project/vllm

Majid-Taheri · 2026-06-01T20:35:51Z

Purpose

Follow-up to PR #43083.

Add tuned selective_state_update configs for two additional GPUs not yet
covered:

NVIDIA H200 (SM 9.0)
NVIDIA RTX PRO 6000 Blackwell Server Edition (SM 12.0)

The merged configs in #43083 cover B200, GB200, and H100_80GB_HBM3. On the
two devices above, the loader falls back to the Triton built-in heuristic,
which leaves measurable performance on the table.

Test Plan

Generate configs for H200 and RTX PRO 6000 Blackwell:

python3 -m benchmarks.kernels.benchmark_selective_state_update \
  --ngroups 8 \
  --headdim 64 \
  --dstate 128 \
  --nheads 8 16 32 64 128 256 \
  --mamba-ssm-cache-dtype float32 \
  --compare --validate --save-configs

This PR includes the generated JSON files.

Output of tuning

H200 — cache_dtype=float32

Heuristic: BLOCK_SIZE_M=4, num_warps=4

EffBatch	Heur (µs)	Tuned (µs)	Speedup
512	12.74	8.15	1.56×
1024	23.85	15.31	1.56×
4096	94.81	73.64	1.29×
8192	188.20	142.70	1.32×
32768	2012.71	1405.94	1.43×
131072	8874.61	5031.20	1.76×
196608	9090.18	3373.35	2.69×

21/21 configs passed validation (atol=0.01).
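The Speedup column is consistent with the heuristic latency divided by the tuned latency, rounded to two decimals. A quick check in Python, using the H200 float32 rows above (the helper and variable names are illustrative only, not part of the benchmark script):

```python
# (EffBatch, heuristic µs, tuned µs) copied from the H200 float32 table.
rows = [
    (512, 12.74, 8.15),
    (1024, 23.85, 15.31),
    (4096, 94.81, 73.64),
    (8192, 188.20, 142.70),
    (32768, 2012.71, 1405.94),
    (131072, 8874.61, 5031.20),
    (196608, 9090.18, 3373.35),
]


def speedup(heur_us: float, tuned_us: float) -> float:
    """Ratio of heuristic latency to tuned latency (higher is better)."""
    return heur_us / tuned_us


# Map each effective batch size to its speedup, rounded like the table.
speedups = {eff_batch: round(speedup(h, t), 2) for eff_batch, h, t in rows}

for eff_batch, s in speedups.items():
    print(f"{eff_batch:>7}  {s:.2f}x")
```

The same calculation applies to the other tables in this section.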

H200 — cache_dtype=float16

Heuristic: BLOCK_SIZE_M=4, num_warps=4

EffBatch	Heur (µs)	Tuned (µs)	Speedup
512	12.74	6.39	1.99×
1024	23.85	10.79	2.21×
4096	94.81	43.90	2.16×
8192	188.20	83.01	2.27×
32768	2012.71	695.06	2.90×

21/21 configs passed validation (atol=0.01).

RTX PRO 6000 Blackwell — cache_dtype=float16

Heuristic: BLOCK_SIZE_M=32, num_warps=8

EffBatch	Heur (µs)	Tuned (µs)	Speedup
8	2.22	1.50	1.49×
16	2.26	1.62	1.39×
512	7.31	6.37	1.15×
1024	13.18	10.87	1.21×
2048	24.85	19.99	1.24×
4096	46.28	39.06	1.18×

21/21 configs passed validation (atol=0.01).

RTX PRO 6000 Blackwell — cache_dtype=float32

Heuristic: BLOCK_SIZE_M=4, num_warps=4

Speedups are 1.00–1.08× (heuristic is already near-optimal for this device
on float32). JSON shipped to lock in the choice across Triton releases.

21/21 configs passed validation (atol=0.01).
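The PR does not show how the benchmark's `--validate` check is implemented. As a rough sketch of what an absolute-tolerance check with atol=0.01 means, assuming an elementwise comparison of a tuned kernel's output against a reference output (the function name and sample values below are hypothetical):

```python
def passes_validation(reference, candidate, atol=0.01):
    """Return True if every candidate value is within atol of the reference."""
    if len(reference) != len(candidate):
        return False
    return all(abs(r - c) <= atol for r, c in zip(reference, candidate))


# Hypothetical outputs, for illustration only.
reference = [0.125, -1.500, 3.250]
ok = passes_validation(reference, [0.128, -1.495, 3.245])   # largest diff ~0.005
bad = passes_validation(reference, [0.125, -1.500, 3.300])  # one diff ~0.05

print(ok, bad)
```

A config would count toward the "21/21" tally only if a check of this kind succeeds for it.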

Test Result

E2E performance — H200, Nemotron-Nano-9B-v2, TP=1

cache_dtype=float32 (Nemotron-H default). Measured with vllm bench serve,
sharegpt-style dataset (mixed prefill/decode), ISL≈250 / OSL≈200 tokens.

Metric	Baseline (no config)	With H200 config	Δ
Output throughput (tok/s)	1,847	1,895	+2.6% ✅
Mean TTFT (ms)	42.1	41.0	−2.6% ✅
Mean ITL (ms)	18.3	17.9	−2.2% ✅

Note: the tables above ("Output of tuning") show kernel-level speedups
(µs per SSM kernel call). The end-to-end gain is smaller because the SSM
kernel is one of many ops in the full model.
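The gap between kernel-level and end-to-end gains follows Amdahl's law: if the SSM kernel accounts for a fraction f of total time and becomes s times faster, the overall speedup is 1 / ((1 - f) + f / s). A minimal sketch follows. The 8% time fraction is a made-up illustrative value, not a measurement from this PR:

```python
def end_to_end_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the runtime gets faster."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


# Hypothetical: the kernel is 8% of step time and becomes 1.5x faster.
overall = end_to_end_speedup(kernel_fraction=0.08, kernel_speedup=1.5)
print(f"{(overall - 1.0) * 100:.1f}% end-to-end gain")
```

Even a large kernel-level win therefore yields only a few percent end to end when the kernel is a small share of the total runtime.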

E2E performance — RTX PRO 6000 Blackwell, cache_dtype=float32

End-to-end is within noise on the current default float32 path (heuristic
was already near-optimal for this device). The fp16 path shows 1.15–1.49×
kernel-level wins (see table above).

Files shipped:

vllm/model_executor/layers/mamba/ops/configs/selective_state_update/
├── headdim=64,dstate=128,device_name=NVIDIA_H200,cache_dtype=float16.json
├── headdim=64,dstate=128,device_name=NVIDIA_H200,cache_dtype=float32.json
├── headdim=64,dstate=128,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition,cache_dtype=float16.json
└── headdim=64,dstate=128,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition,cache_dtype=float32.json

The device-name strings exactly match
current_platform.get_device_name().replace(" ", "_") on each host.
Triton version recorded in each JSON: 3.6.0.
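To make the naming scheme concrete, here is a small sketch that rebuilds the filenames listed above from their parts. `config_filename` is a hypothetical helper written for illustration. It is not the loader in vllm/model_executor/layers/mamba/ops/mamba_ssm.py, and the raw device-name strings passed in are assumed from the filenames shown:

```python
def config_filename(headdim: int, dstate: int, device_name: str, cache_dtype: str) -> str:
    """Build a config filename in the pattern used by the shipped JSON files."""
    # Spaces in the reported device name are replaced with underscores.
    device = device_name.replace(" ", "_")
    return (
        f"headdim={headdim},dstate={dstate},"
        f"device_name={device},cache_dtype={cache_dtype}.json"
    )


h200_name = config_filename(64, 128, "NVIDIA H200", "float16")
rtx_name = config_filename(
    64, 128, "NVIDIA RTX PRO 6000 Blackwell Server Edition", "float32"
)
print(h200_name)
print(rtx_name)
```

If the device name reported on a host differs by even one character, the resulting filename would not match any shipped file, which is why the exact-match note above matters.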

Before submitting the PR, I acknowledge the contributions guidelines and I put my PR in the Draft state if it is not ready to be merged

Did you read the contributor guidelines?
Did you make sure to update the docs if necessary?
Did you make sure there's a corresponding issue, discussion or RFC for your PR?
Did you make sure your PR does not introduce a regression according to vllm's accuracy tests?

Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>

…6000 Blackwell

The merged set from vllm-project#43083 only ships configs for B200, GB200, and H100_80GB_HBM3. On H200 and RTX PRO 6000 Blackwell Server Edition the loader falls back to the kernel's built-in defaults, leaving measurable performance on the table.

This adds 4 JSON config files (no code change) generated by the existing benchmarks/kernels/benchmark_selective_state_update.py --save-configs script, matching the loader filename pattern in vllm/model_executor/layers/mamba/ops/mamba_ssm.py.

Devices added (headdim=64, dstate=128, same shape Nemotron-H/Nano/Super uses):
- NVIDIA_H200 (float16, float32)
- NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition (float16, float32)

Validation:
- H200 (p5en.48xlarge): +2.6% end-to-end serving throughput on Nemotron-Nano-9B-v2 at TP=1; kernel-level 1.2-1.5x (fp32) and ~2x (fp16) vs the default fallback.
- RTX PRO 6000 Blackwell (g7e): end-to-end neutral on the current default fp32 path (Triton's heuristic already happened to pick a near-optimal config); fp16 kernel-level shows ~2x. JSON shipped to lock the choice across Triton releases.

Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>

tomeras91 · 2026-06-02T07:46:30Z

@danisereb - FYI

tomeras91

LGTM!


Majid-Taheri · 2026-06-03T05:11:12Z

@tomeras91 thanks for the approval. Could you please help me merge this PR? I don't see the merge button.

vllm-project#44251) Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com> Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>


Majid-Taheri requested review from tdoublep and tomeras91 as code owners June 1, 2026 20:35

tomeras91 approved these changes Jun 2, 2026


tomeras91 added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 2, 2026

tomeras91 and others added 3 commits June 2, 2026 10:49

Merge branch 'main' into perf/mamba-ssu-h200-rtx-pro-6000-blackwell-c…

6ca9869


Merge branch 'main' into perf/mamba-ssu-h200-rtx-pro-6000-blackwell-c…

a8d4266


Merge branch 'main' into perf/mamba-ssu-h200-rtx-pro-6000-blackwell-c…

0593aea


tomeras91 enabled auto-merge (squash) June 3, 2026 06:06

tomeras91 merged commit 9af53a3 into vllm-project:main Jun 3, 2026
67 checks passed

Majid-Taheri deleted the perf/mamba-ssu-h200-rtx-pro-6000-blackwell-configs branch June 3, 2026 20:03

tomeras91 mentioned this pull request Jun 7, 2026

Add Thor selective state update configs #44590

Open

4 tasks

tomeras91 merged 4 commits into vllm-project:main from Majid-Taheri:perf/mamba-ssu-h200-rtx-pro-6000-blackwell-configs
