fix: Exclude SM12x (desktop Blackwell) from CUDA_PTX_FP4FP6_CVT_ENABLED #3120
Closed
RobTand wants to merge 1 commit into NVIDIA:main from
Conversation
SM12x GPUs (RTX 5090/5080/PRO 6000, DGX Spark GB10) have `mma.e2m1` tensor cores but lack the `cvt.rn.satfinite.e2m1x2.f32` PTX instruction for native FP4/FP6 conversion; that instruction is SM100-family only. When SM120A/F or SM121A/F is included in the `CUDA_PTX_FP4FP6_CVT_ENABLED` guard, CUTLASS emits the missing PTX instruction, which produces NaN during NVFP4 inference.

This change removes all SM12x variants from the guard, so SM12x falls through to the existing software E2M1 conversion path.

Tested on DGX Spark (SM121) running Nemotron-3-Super-120B and Qwen3.5-122B NVFP4 models via vLLM + FlashInfer. Without this fix, all NVFP4 inference on SM12x produces NaN output.

Signed-off-by: Rob Tand <robert.tand@icloud.com>
cc @depaulmillz
Author
Withdrawing this PR. After further investigation prompted by @depaulmillz's comment on vllm-project/vllm#35947, we confirmed the root cause is that vLLM's cmake strips the … flag, so the fix belongs in vLLM's build system, not in CUTLASS. Thank you for the guidance.
Summary
SM12x GPUs (RTX 5090/5080/PRO 6000 = SM120, DGX Spark GB10 = SM121) have `mma.e2m1` tensor cores but lack the `cvt.rn.satfinite.e2m1x2.f32` PTX instruction for native FP4/FP6 conversion. This instruction is SM100-family only. When `SM120A/F` or `SM121A/F` is included in the `CUDA_PTX_FP4FP6_CVT_ENABLED` guard in `float_subbyte.h`, CUTLASS emits the missing PTX instruction, which produces NaN during all NVFP4 inference on SM12x hardware.

Fix
Remove all SM12x variants (`SM120A`, `SM120F`, `SM121A`, `SM121F`) from the `CUDA_PTX_FP4FP6_CVT_ENABLED` preprocessor guard. SM12x falls through to the existing software E2M1 conversion path, which works correctly.

Testing
Tested on DGX Spark (SM121, 128 GB unified LPDDR5X) running:

- Nemotron-3-Super-120B (NVFP4, via vLLM + FlashInfer)
- Qwen3.5-122B (NVFP4, via vLLM + FlashInfer)

Without this fix, both models produce NaN output on SM12x.
Related work
This CUTLASS fix addresses the root cause. Downstream projects have independent software E2M1 workarounds that complement it:

- `nvfp4_utils.cuh` (same root cause, different code path)
- `quantization_utils.cuh` (no PR yet)

cc @blake-snc: your vLLM PR #35947 addresses the same hardware limitation in vLLM's copy of the quantization code. Happy to withdraw this if you'd prefer to upstream the CUTLASS-side fix yourself; otherwise I'll work this through the review process.
Impact
This is a correctness blocker for all NVFP4 inference on desktop Blackwell GPUs (RTX 50-series and DGX Spark). Without this fix, no NVFP4 model produces valid output on SM12x.