[Perf] Add tuned selective_state_update configs for H200 and RTX PRO …#44251
Merged
tomeras91 merged 4 commits on Jun 3, 2026
…6000 Blackwell

The merged set from vllm-project#43083 only ships configs for B200, GB200, and H100_80GB_HBM3. On H200 and RTX PRO 6000 Blackwell Server Edition the loader falls back to the kernel's built-in defaults, leaving measurable performance on the table.

This adds 4 JSON config files (no code change) generated by the existing benchmarks/kernels/benchmark_selective_state_update.py --save-configs script, matching the loader filename pattern in vllm/model_executor/layers/mamba/ops/mamba_ssm.py.

Devices added (headdim=64, dstate=128, same shape Nemotron-H/Nano/Super uses):
- NVIDIA_H200 (float16, float32)
- NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition (float16, float32)

Validation:
- H200 (p5en.48xlarge): +2.6% end-to-end serving throughput on Nemotron-Nano-9B-v2 at TP=1; kernel-level 1.2-1.5x (fp32) and ~2x (fp16) vs the default fallback.
- RTX PRO 6000 Blackwell (g7e): end-to-end neutral on the current default fp32 path (Triton's heuristic already happened to pick a near-optimal config); fp16 kernel-level shows ~2x. JSON shipped to lock the choice across Triton releases.

Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>
Member
@danisereb - FYI
Contributor, Author
@tomeras91 thanks for the approval. Could you please help me merge this PR? I don't see the merge button.
Purpose
Follow-up to PR #43083.

Add tuned selective_state_update configs for two additional GPUs not yet covered:
- NVIDIA H200
- NVIDIA RTX PRO 6000 Blackwell Server Edition

The merged configs in #43083 cover B200, GB200, and H100_80GB_HBM3. On the two devices above the loader falls back to the Triton built-in heuristic and leaves measurable performance on the table.
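The fallback behaviour described above can be pictured with a minimal sketch. This is not vLLM's loader: the helper name `load_tuned_config`, the `device_name=...,cache_dtype=...` filename pattern, and the config values are all hypothetical, chosen only to show the idea "use the tuned file if one exists for this device, otherwise return nothing so the caller keeps the heuristic".

```python
import json
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def load_tuned_config(config_dir: Path, device_name: str, dtype: str) -> Optional[dict]:
    """Return the tuned config for (device, dtype), or None if no file is shipped.

    Hypothetical helper: the filename pattern is illustrative and is NOT the
    pattern used by the loader in mamba_ssm.py.
    """
    device_key = device_name.replace(" ", "_")
    path = config_dir / f"device_name={device_key},cache_dtype={dtype}.json"
    if not path.is_file():
        return None  # caller falls back to the kernel's built-in heuristic
    with path.open() as f:
        return json.load(f)


def demo() -> Tuple[Optional[dict], Optional[dict]]:
    """Write one illustrative config file, then look up a covered and an uncovered device."""
    with tempfile.TemporaryDirectory() as tmp:
        config_dir = Path(tmp)
        # Illustrative values only; not taken from the shipped JSON files.
        tuned = {"BLOCK_SIZE_M": 8, "num_warps": 2}
        target = config_dir / "device_name=NVIDIA_H200,cache_dtype=float32.json"
        target.write_text(json.dumps(tuned))
        covered = load_tuned_config(config_dir, "NVIDIA H200", "float32")
        uncovered = load_tuned_config(config_dir, "Some Other GPU", "float32")
        return covered, uncovered


if __name__ == "__main__":
    covered, uncovered = demo()
    print("tuned config found:", covered)
    print("no file, keep heuristic:", uncovered)
```

Under this scheme, shipping a JSON file for a new device changes behaviour without touching code, which is consistent with the "no code change" note in the commit message.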
Test Plan
Generate configs for H200 and RTX PRO 6000 Blackwell:
PR includes the generated JSON files.
Output of tuning
H200, cache_dtype=float32
Heuristic: BLOCK_SIZE_M=4, num_warps=4
21/21 configs passed validation (atol=0.01).

H200, cache_dtype=float16
Heuristic: BLOCK_SIZE_M=4, num_warps=4
21/21 configs passed validation (atol=0.01).

RTX PRO 6000 Blackwell, cache_dtype=float16
Heuristic: BLOCK_SIZE_M=32, num_warps=8
21/21 configs passed validation (atol=0.01).

RTX PRO 6000 Blackwell, cache_dtype=float32
Heuristic: BLOCK_SIZE_M=4, num_warps=4
Speedups are 1.00–1.08× (heuristic is already near-optimal for this device on float32). JSON shipped to lock in the choice across Triton releases.
21/21 configs passed validation (atol=0.01).
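The "passed validation (atol=0.01)" lines suggest that each candidate config's output was compared against a reference result under an absolute tolerance. A minimal sketch of such a check in plain Python follows, assuming an elementwise comparison; the helper name `within_atol` and the sample values are hypothetical, and the benchmark script may implement the check differently.

```python
from typing import Sequence


def within_atol(candidate: Sequence[float], reference: Sequence[float], atol: float = 0.01) -> bool:
    """True if every candidate element is within atol of the matching reference element."""
    if len(candidate) != len(reference):
        return False
    return all(abs(c - r) <= atol for c, r in zip(candidate, reference))


if __name__ == "__main__":
    reference = [0.500, -1.250, 3.000]
    # Every difference is below 0.01, so this candidate passes.
    print(within_atol([0.504, -1.245, 3.002], reference))
    # The first element is off by 0.02, so this candidate fails.
    print(within_atol([0.520, -1.250, 3.000], reference))
```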
Test Result
E2E performance: H200, Nemotron-Nano-9B-v2, TP=1
Note: the tables above ("Output of tuning") show kernel-level speedups
(µs per SSM kernel call). The end-to-end gain is smaller because the SSM
kernel is one of many ops in the full model.
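The gap between kernel-level and end-to-end gains follows the usual Amdahl's-law argument. The sketch below is a rough illustration only: the 8% share of step time spent in the SSM kernel is an assumed figure, not a measurement from this PR, and serving throughput does not map exactly onto single-step latency.

```python
def end_to_end_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only kernel_fraction of the time gets faster."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


if __name__ == "__main__":
    # Assumed, not measured: the SSM kernel takes 8% of step time and becomes 1.5x faster.
    gain = end_to_end_speedup(kernel_fraction=0.08, kernel_speedup=1.5)
    print(f"end-to-end speedup: {gain:.3f}x")
```

With these assumed inputs, a 1.5x kernel speedup yields only a few percent overall, the same order of magnitude as the reported end-to-end gain.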
E2E performance: RTX PRO 6000 Blackwell, cache_dtype=float32
End-to-end is within noise on the current default float32 path (heuristic was already near-optimal for this device). The fp16 path shows 1.15–1.49× kernel-level wins (see table above).
Files shipped:
The device-name strings exactly match current_platform.get_device_name().replace(" ", "_") on each host.
Triton version recorded in each JSON: 3.6.0.

Before submitting the PR, I acknowledge the contributions guidelines and I put my PR in the Draft state if it is not ready to be merged.

Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>
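As a quick check of the device-name normalization mentioned above, the sketch below applies the same `.replace(" ", "_")` step to plain strings. The raw names with spaces are an assumption, inferred by reversing the underscore-separated names in the commit message. On a real host the input would come from `current_platform.get_device_name()`; `device_key` is a hypothetical helper name.

```python
def device_key(device_name: str) -> str:
    """Replace spaces with underscores, as the PR describes for the config filenames."""
    return device_name.replace(" ", "_")


if __name__ == "__main__":
    # Assumed raw device names; plain strings stand in for the platform query.
    for name in ("NVIDIA H200", "NVIDIA RTX PRO 6000 Blackwell Server Edition"):
        print(device_key(name))
```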