
[Perf] Add tuned selective_state_update configs for H200 and RTX PRO …#44251

Merged
tomeras91 merged 4 commits into vllm-project:main from Majid-Taheri:perf/mamba-ssu-h200-rtx-pro-6000-blackwell-configs
Jun 3, 2026


@Majid-Taheri Majid-Taheri commented Jun 1, 2026


Purpose

Follow-up to PR #43083.

Add tuned selective_state_update configs for two additional GPUs not yet
covered:

  • NVIDIA H200 (SM 9.0)
  • NVIDIA RTX PRO 6000 Blackwell Server Edition (SM 12.0)

The configs merged in #43083 cover B200, GB200, and H100_80GB_HBM3. On the
two devices above, the loader falls back to the Triton built-in heuristic and
leaves measurable performance on the table.
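
The fallback described above can be pictured with a minimal sketch. This is not the loader in vLLM: the function names are hypothetical, and the flat `{"BLOCK_SIZE_M": ..., "num_warps": ...}` layout is an assumption made for illustration (the real JSON schema is not shown in this PR). The heuristic values are the ones reported for H200 in the tuning output.

```python
import json
from pathlib import Path
from typing import Optional


def load_tuned_config(path: Path) -> Optional[dict]:
    """Return the parsed JSON config, or None when no file exists for this device."""
    if not path.is_file():
        return None
    with path.open() as f:
        return json.load(f)


def pick_launch_params(tuned: Optional[dict], heuristic: dict) -> dict:
    """Prefer tuned parameters; otherwise fall back to the heuristic defaults."""
    return tuned if tuned is not None else heuristic


if __name__ == "__main__":
    # Heuristic defaults reported for H200 in this PR.
    heuristic = {"BLOCK_SIZE_M": 4, "num_warps": 4}
    # Hypothetical path: no tuned file is present, so the heuristic is used.
    tuned = load_tuned_config(Path("no_such_device_config.json"))
    print(pick_launch_params(tuned, heuristic))
```

With a config file present for the device, the same call path would return the tuned values instead, which is the behavior the new JSON files enable.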

Test Plan

Generate configs for H200 and RTX PRO 6000 Blackwell:

python3 -m benchmarks.kernels.benchmark_selective_state_update \
  --ngroups 8 \
  --headdim 64 \
  --dstate 128 \
  --nheads 8 16 32 64 128 256 \
  --mamba-ssm-cache-dtype float32 \
  --compare --validate --save-configs

PR includes the generated JSON files.

Output of tuning

H200 — cache_dtype=float32

Heuristic: BLOCK_SIZE_M=4, num_warps=4

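| EffBatch | Heur (µs) | Tuned (µs) | Speedup |
|---------:|----------:|-----------:|--------:|
| 512      | 12.74     | 8.15       | 1.56×   |
| 1024     | 23.85     | 15.31      | 1.56×   |
| 4096     | 94.81     | 73.64      | 1.29×   |
| 8192     | 188.20    | 142.70     | 1.32×   |
| 32768    | 2012.71   | 1405.94    | 1.43×   |
| 131072   | 8874.61   | 5031.20    | 1.76×   |
| 196608   | 9090.18   | 3373.35    | 2.69×   |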

21/21 configs passed validation (atol=0.01).
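
For readers unfamiliar with the `atol=0.01` criterion, here is a minimal sketch of an absolute-tolerance check. The benchmark script's actual validation code is not shown in this PR, so the function names and the element-wise comparison below are assumptions; only the tolerance value comes from the output above.

```python
from typing import Sequence


def max_abs_error(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference (0.0 for empty inputs)."""
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def passes_validation(
    reference: Sequence[float], candidate: Sequence[float], atol: float = 0.01
) -> bool:
    """True when both outputs have the same length and differ by at most atol."""
    if len(reference) != len(candidate):
        return False
    return max_abs_error(reference, candidate) <= atol


if __name__ == "__main__":
    # Made-up values, not kernel outputs from the PR.
    reference = [1.000, 2.000, 3.000]
    tuned_out = [1.005, 1.996, 3.000]
    print(passes_validation(reference, tuned_out))
```

In this reading, a config passes when the tuned kernel's output stays within 0.01 of the reference output at every element.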

H200 — cache_dtype=float16

Heuristic: BLOCK_SIZE_M=4, num_warps=4

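| EffBatch | Heur (µs) | Tuned (µs) | Speedup |
|---------:|----------:|-----------:|--------:|
| 512      | 12.74     | 6.39       | 1.99×   |
| 1024     | 23.85     | 10.79      | 2.21×   |
| 4096     | 94.81     | 43.90      | 2.16×   |
| 8192     | 188.20    | 83.01      | 2.27×   |
| 32768    | 2012.71   | 695.06     | 2.90×   |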

21/21 configs passed validation (atol=0.01).

RTX PRO 6000 Blackwell — cache_dtype=float16

Heuristic: BLOCK_SIZE_M=32, num_warps=8

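| EffBatch | Heur (µs) | Tuned (µs) | Speedup |
|---------:|----------:|-----------:|--------:|
| 8        | 2.22      | 1.50       | 1.49×   |
| 16       | 2.26      | 1.62       | 1.39×   |
| 512      | 7.31      | 6.37       | 1.15×   |
| 1024     | 13.18     | 10.87      | 1.21×   |
| 2048     | 24.85     | 19.99      | 1.24×   |
| 4096     | 46.28     | 39.06      | 1.18×   |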

21/21 configs passed validation (atol=0.01).

RTX PRO 6000 Blackwell — cache_dtype=float32

Heuristic: BLOCK_SIZE_M=4, num_warps=4

Speedups are 1.00–1.08× (heuristic is already near-optimal for this device
on float32). JSON shipped to lock in the choice across Triton releases.

21/21 configs passed validation (atol=0.01).

Test Result

E2E performance — H200, Nemotron-Nano-9B-v2, TP=1

cache_dtype=float32 (Nemotron-H default). Measured with vllm bench serve on a
ShareGPT-style dataset (mixed prefill/decode), ISL≈250 / OSL≈200 tokens.

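| Metric                    | Baseline (no config) | With H200 config | Δ     |
|---------------------------|---------------------:|-----------------:|------:|
| Output throughput (tok/s) | 1,847                | 1,895            | +2.6% |
| Mean TTFT (ms)            | 42.1                 | 41.0             | −2.6% |
| Mean ITL (ms)             | 18.3                 | 17.9             | −2.2% |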

Note: the tables above ("Output of tuning") show kernel-level speedups
(µs per SSM kernel call). The end-to-end gain is smaller because the SSM
kernel is one of many ops in the full model.
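
The gap between kernel-level and end-to-end numbers follows Amdahl's law. As a rough sketch, the snippet below inverts that law to estimate what share of runtime the SSM kernel would need to occupy for a given kernel speedup to yield the measured throughput gain. Treating the throughput ratio (1,895 / 1,847) as an overall runtime speedup is an assumption, and the effective batch sizes hit during serving are not reported, so the kernel speedups are simply sampled from the H200 float32 table.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when `fraction` of runtime is accelerated by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def implied_fraction(overall_speedup: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law: share of baseline runtime spent in the accelerated kernel."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    overall = 1895 / 1847  # throughput ratio from the E2E table
    for s in (1.29, 1.56, 2.69):  # kernel speedups taken from the H200 float32 table
        f = implied_fraction(overall, s)
        print(f"kernel speedup {s:.2f}x -> implied SSM share of runtime: {f:.1%}")
```

Under these assumptions the implied share stays in the single-digit to low-double-digit percent range, which is consistent with the SSM kernel being one of many ops in the model.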

E2E performance — RTX PRO 6000 Blackwell, cache_dtype=float32

End-to-end performance is within noise on the current default float32 path
(the heuristic was already near-optimal for this device). The fp16 path shows
1.15–1.49× kernel-level wins (see table above).

Files shipped:

vllm/model_executor/layers/mamba/ops/configs/selective_state_update/
├── headdim=64,dstate=128,device_name=NVIDIA_H200,cache_dtype=float16.json
├── headdim=64,dstate=128,device_name=NVIDIA_H200,cache_dtype=float32.json
├── headdim=64,dstate=128,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition,cache_dtype=float16.json
└── headdim=64,dstate=128,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition,cache_dtype=float32.json

The device-name strings exactly match
current_platform.get_device_name().replace(" ", "_") on each host.
Triton version recorded in each JSON: 3.6.0.
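
As an illustration of the naming scheme, the sketch below rebuilds the shipped filenames from their parts. The helper name `config_filename` is hypothetical and this is not the loader's code; the raw device-name strings with spaces are an assumption inferred from the `.replace(" ", "_")` note above.

```python
def config_filename(headdim: int, dstate: int, device_name: str, cache_dtype: str) -> str:
    """Build a config filename following the pattern of the files listed above."""
    normalized = device_name.replace(" ", "_")  # same normalization as quoted in the PR
    return (
        f"headdim={headdim},dstate={dstate},"
        f"device_name={normalized},cache_dtype={cache_dtype}.json"
    )


if __name__ == "__main__":
    # Assumed raw device names (spaces instead of underscores).
    for device in ("NVIDIA H200", "NVIDIA RTX PRO 6000 Blackwell Server Edition"):
        for dtype in ("float16", "float32"):
            print(config_filename(64, 128, device, dtype))
```

Because the lookup key includes the device name verbatim, a host whose reported name differs even slightly would miss these files and use the heuristic instead.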

Before submitting the PR, I acknowledge the contributions guidelines and I put my PR in the Draft state if it is not ready to be merged

  • Did you read the contributor guidelines?
  • Did you make sure to update the docs if necessary?
  • Did you make sure there's a corresponding issue, discussion or RFC for your PR?
  • Did you make sure your PR does not introduce a regression according to vllm's accuracy tests?

Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>

…6000 Blackwell

The merged set from vllm-project#43083 only ships configs for B200, GB200, and
H100_80GB_HBM3. On H200 and RTX PRO 6000 Blackwell Server Edition the
loader falls back to the kernel's built-in defaults, leaving measurable
performance on the table.

This adds 4 JSON config files (no code change) generated by the existing
benchmarks/kernels/benchmark_selective_state_update.py --save-configs
script, matching the loader filename pattern in
vllm/model_executor/layers/mamba/ops/mamba_ssm.py.

Devices added (headdim=64, dstate=128, same shape Nemotron-H/Nano/Super uses):
- NVIDIA_H200 (float16, float32)
- NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition (float16, float32)

Validation:
- H200 (p5en.48xlarge): +2.6% end-to-end serving throughput on
  Nemotron-Nano-9B-v2 at TP=1; kernel-level 1.29-2.69x (fp32) and
  1.99-2.90x (fp16) vs the default fallback in the tables above.
- RTX PRO 6000 Blackwell (g7e): end-to-end neutral on the current default
  fp32 path (Triton's heuristic already happened to pick a near-optimal
  config); fp16 kernel-level shows 1.15-1.49x. JSON shipped to lock the
  choice across Triton releases.

Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>
@tomeras91


@danisereb - FYI

@tomeras91 tomeras91 left a comment


LGTM!

@tomeras91 tomeras91 added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 2, 2026
@Majid-Taheri


@tomeras91 thanks for the approval. Could you please help me merge this PR? I don't see the merge button.

@tomeras91 tomeras91 enabled auto-merge (squash) June 3, 2026 06:06
@tomeras91 tomeras91 merged commit 9af53a3 into vllm-project:main Jun 3, 2026
67 checks passed
@Majid-Taheri Majid-Taheri deleted the perf/mamba-ssu-h200-rtx-pro-6000-blackwell-configs branch June 3, 2026 20:03
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
vllm-project#44251)

Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
andakai pushed a commit to andakai/vllm that referenced this pull request Jun 4, 2026
vllm-project#44251)

Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
JisoLya pushed a commit to JisoLya/vllm that referenced this pull request Jun 5, 2026
vllm-project#44251)

Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: JisoLya <523420504@qq.com>
knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026
vllm-project#44251)

Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
vllm-project#44251)

Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>