fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels#33417
Conversation
Documentation preview: https://vllm--33417.org.readthedocs.build/en/33417/
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 7a32add to 9b71b86
@mgoin hopefully ok now.
YES! Confirmed working on the RTX 5090.
Hello all, I've just tested this on both the RTX 5090 and the RTX 6000 Pro Blackwell, but I am still facing an issue when running. The error I'm getting is:
Steps to reproduce:
Extend device capability checks to include the SM110 and SM120 GPU families, matching the approach used in flashinfer_cutlass_moe.py and cutlass_moe.py after PR vllm-project#33417. These files were not updated in vllm-project#33417 and still only checked for SM100:
- flashinfer_fp4_moe.py
- flashinfer_trtllm_moe.py
- flashinfer_cutedsl_moe.py
- flashinfer_utils.py
The fix adds explicit family checks for SM100/110/120 using any() for cleaner, more maintainable code, enabling support for:
- SM100-109: Blackwell data center (B100, B200)
- SM110-119: Future Blackwell variants
- SM120-129: Blackwell consumer/workstation (RTX 5090, DGX Spark GB10)
Tested on RTX 5090 (SM120) and DGX Spark GB10 (SM121) with nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
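The any()-based family check described in this commit message could look roughly like the sketch below. The helper name `is_blackwell_family` and its standalone form are assumptions for illustration, not the actual vLLM code:

```python
# Illustrative sketch only: is_blackwell_family is a hypothetical helper
# mirroring the any()-based SM100/110/120 check described above.

def is_blackwell_family(capability: int) -> bool:
    """Return True if `capability` (e.g. 120 for SM12.0) falls in any
    Blackwell family: SM100-109 (data center), SM110-119 (future
    variants), or SM120-129 (consumer/workstation)."""
    blackwell_families = (100, 110, 120)
    # A "family" groups minor revisions: SM121 (DGX Spark GB10) belongs
    # to the 120 family because 121 // 10 * 10 == 120.
    return any(capability // 10 * 10 == family for family in blackwell_families)

print(is_blackwell_family(120))  # RTX 5090 (SM120) -> True
print(is_blackwell_family(121))  # DGX Spark GB10 (SM121) -> True
print(is_blackwell_family(90))   # Hopper (SM90) -> False
```

Grouping by family rather than enumerating exact capabilities is what lets SM121 devices pass without another code change.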
…oE kernels (vllm-project#33417)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Pai <416932041@qq.com>
gpt-oss-20b getting OOM on v0.15.1
Hi @renehonig @shahizat @mgoin, does your PR also fix the following when running gpt-oss-20b? And what about PR #31089?
If it helps anyone, I run the command below, with an added hack of setting float32 matmul precision to "high" in torch. I'm getting 12,000-15,000 tokens per second read (prefill) and 250-280 tokens per second generation on Nemotron 3 Nano 30B-A3B, with a max 131,000-token context at around 88% VRAM allocation on the RTX 5090. The swap space is just in case, but it does a fairly good job as long as I don't overdo it with the same repetitive inputs.
This is with the git package release of 0.16, which includes the NVFP4 stuff, so this works. Also, if anyone knows about the scaling factor for the KV cache: there seems to be not much I came across as to what it even does. It could be something or not; if anyone has advice there, this works as-is, but if I can improve it I'm all ears.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
FLASHINFER_DISABLE_VERSION_CHECK=1 \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND=throughput \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--served-model-name nemotron \
--max-num-seqs 6 \
--tensor-parallel-size 1 \
--max-model-len 130000 \
--port 3337 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser nano_v3 \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--mamba-ssm-cache-dtype float16 \
--kv-cache-dtype fp8 \
--quantization modelopt_fp4 \
--gpu-memory-utilization 0.875 \
--max-num-batched-tokens 8192 \
--swap-space 4
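The "matmul precision" hack mentioned above is, as far as I can tell, PyTorch's float32 matmul precision setting; whether this exact call is what the commenter used is an assumption:

```python
import torch

# Assumed form of the "matmul precision to high" tweak mentioned above:
# "high" lets float32 matmuls use TF32 tensor cores on Ampere-class and
# newer GPUs, trading a small amount of precision for throughput.
torch.set_float32_matmul_precision("high")

print(torch.get_float32_matmul_precision())  # "high"
```

This would need to run in the serving process (e.g. via a plugin or patched entry point) before the model executes, since it is a process-wide torch setting.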
Summary
This PR adds SM120 (RTX Blackwell) device capability family support to the NVFP4 MoE kernel backend selection code. The NVFP4 quantization kernels check for specific GPU architecture families, but currently only recognize SM9.0 (Hopper) and SM10.x (B100/B200 data center Blackwell), missing SM12.0 (RTX Blackwell workstation GPUs).
Problem
On RTX Blackwell GPUs (e.g., RTX PRO 6000 Blackwell Workstation Edition with compute capability 12.0), vLLM v0.15.0 crashes when loading MiniMax-M2.1-NVFP4 or other NVFP4 MoE models with:
Root Cause
The `is_device_capability_family(100)` check returns `False` for SM12.0 devices. This is a regression introduced in commit 42135d6 ([MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority #32414).
Solution
Add `or current_platform.is_device_capability_family(120)` checks alongside the existing SM100 family checks in all NVFP4 MoE kernel selection code.
Files Changed
- vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py
- vllm/model_executor/layers/fused_moe/flashinfer_cutedsl_moe.py
- vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
- vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
Testing
Tested on RTX PRO 6000 Blackwell Workstation Edition with MiniMax-M2.1-NVFP4 model - inference working successfully after fix.
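A minimal sketch of the gating change described under Solution. `current_platform.is_device_capability_family` is the real vLLM API named in this PR, but the stub class below is an assumption so the example runs without a GPU:

```python
# Stand-in for vllm.platforms.current_platform so the sketch runs
# anywhere; the real object queries the CUDA device. Names hypothetical.

class FakePlatform:
    def __init__(self, capability: int) -> None:
        self.capability = capability  # e.g. 120 for SM12.0

    def is_device_capability_family(self, family: int) -> bool:
        # SM120 and SM121 both belong to the 120 family.
        return self.capability // 10 * 10 == family


def nvfp4_moe_supported(platform: FakePlatform) -> bool:
    # Before this PR: only the SM100 family passed the check.
    # After this PR: the SM120 (RTX Blackwell) family passes as well.
    return platform.is_device_capability_family(
        100
    ) or platform.is_device_capability_family(120)


print(nvfp4_moe_supported(FakePlatform(100)))  # B200 -> True
print(nvfp4_moe_supported(FakePlatform(120)))  # RTX 5090 -> True
print(nvfp4_moe_supported(FakePlatform(90)))   # Hopper -> False
```

The same `or`-extension pattern applies in each of the files listed above.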
Related Issue
Fixes #33416
For Maintainers
This is a regression bugfix affecting NVFP4 MoE models on RTX Blackwell GPUs (SM12.0).
Please consider cherry-picking this to releases/v0.15.0 for inclusion in v0.15.1.