fix: default FP4 GEMM backend to flashinfer_cudnn on SM120 (Blackwell) #20047
Conversation
The `flashinfer_cutlass` FP4 GEMM backend produces NaN values in dense MLP layers when processing heterogeneous batches on SM120 (Blackwell) GPUs. This causes `torch.multinomial` crashes under concurrent request load. The `flashinfer_cudnn` backend does not exhibit this issue — stress-tested with 3,200 concurrent requests (50 rounds x 64 concurrent) with zero NaN.

Changes:
- Change the `fp4_gemm_runner_backend` default from `"flashinfer_cutlass"` to `"auto"`
- Add SM120 detection in auto-resolve: select `flashinfer_cudnn` on Blackwell, `flashinfer_cutlass` on other architectures (preserving existing behavior)

Fixes: sgl-project#20043

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
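The auto-resolve behavior described above can be sketched as follows. This is an illustrative sketch, not SGLang's actual code: the function name and signature are made up, and in practice the compute capability would come from `torch.cuda.get_device_capability()` rather than being passed in.

```python
def resolve_fp4_gemm_backend(backend: str, capability: tuple) -> str:
    """Illustrative resolver: map 'auto' to a concrete FP4 GEMM backend.

    `capability` is the CUDA compute capability, e.g. (12, 0) for SM120.
    """
    if backend != "auto":
        # An explicit user choice always wins.
        return backend
    if capability[0] == 12:  # SM120 (Blackwell): cutlass backend produces NaNs
        return "flashinfer_cudnn"
    # Preserve existing behavior on all other architectures.
    return "flashinfer_cutlass"
```

For example, `resolve_fp4_gemm_backend("auto", (12, 0))` would return `"flashinfer_cudnn"`, while an explicit `"flashinfer_cutlass"` is passed through unchanged on any architecture.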
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a fix for numerical stability issues encountered on NVIDIA SM120 (Blackwell) GPUs when using FP4 GEMM operations. By changing the default backend selection to an 'auto' mode, the system now dynamically chooses the more stable backend for each architecture.
@b8zhong please check
Code Review
This pull request changes the default FP4 GEMM backend to auto to address a NaN issue on SM120 (Blackwell) GPUs by selecting flashinfer_cudnn. For other architectures, auto now defaults to flashinfer_cutlass. While this fixes the issue for Blackwell, I've raised a concern about a potential regression for non-Blackwell architectures. The previous behavior of auto was described as selecting between flashinfer_cudnn and flashinfer_cutlass based on CUDA/cuDNN versions, but the new implementation hardcodes flashinfer_cutlass for non-Blackwell systems. This could impact users who were benefiting from the dynamic selection.
```python
else:
    backend = "flashinfer_cutlass"
```
This change hardcodes the auto backend to flashinfer_cutlass for non-Blackwell architectures. However, the previous help text for this option stated: auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version. This suggests there might have been more complex logic for auto-selection that is now being removed, which could be a regression for users on non-Blackwell hardware who were relying on auto to potentially select flashinfer_cudnn.
While the PR description mentions that the behavior is unchanged for non-Blackwell, the discrepancy with the old help text is concerning. If the old help text was inaccurate and auto always resolved to flashinfer_cutlass, then this change is fine. Otherwise, the previous auto-selection logic should be preserved here for non-SM120 architectures.
Actually, this was already the case before (SM100/103 and SM120 both pick flashinfer_cutlass, due to a memory leak). So it's alright I think
b8zhong left a comment
@Fridge003, since #18350, do you still think this PR is hacky for setting it to auto? Because it looks like the SM120 CUTLASS-based implementation has a bug, so now SM120 and SM100 will resolve to different default backends
python/sglang/srt/server_args.py
Outdated
```diff
     help="Choose the runner backend for NVFP4 GEMM operations. "
-    "Options: 'flashinfer_cutlass' (default), "
-    "'auto' (auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), "
+    "Options: 'auto' (default; selects flashinfer_cudnn on SM120/Blackwell, flashinfer_cutlass otherwise), "
```
QQ: @Fridge003 do you still think this is hacky? Because it looks like the SM120 impl has a bug, and now the two devices will resolve to different backends. Otherwise, we can just hardcode cuDNN for SM120 temporarily...
The CUTLASS FP4 GEMM kernel on SM120 (Blackwell) intermittently skips writing certain output tiles, leaving uninitialized memory (NaN) in contiguous 128-aligned blocks. Pre-zeroing the output buffer ensures these unwritten tiles contain 0 instead of NaN.

Verified: 1,280 concurrent requests with zero NaN (vs hundreds without the fix). The pre-zeroing is applied to all backends as a safety measure.

Upstream bug report: flashinfer-ai/flashinfer#2708
Fixes: sgl-project#20043
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
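The pre-zeroing idea in this commit can be illustrated with a small sketch. NumPy stands in for the GPU output buffer, and `flaky_kernel` is a made-up stand-in for the buggy kernel that skips tiles; neither is the actual SGLang code.

```python
import numpy as np

def gemm_with_prezero(kernel, m, n):
    # np.zeros guarantees every output element starts at 0.0, so any tile
    # the kernel fails to write reads back as 0 instead of garbage/NaN.
    out = np.zeros((m, n), dtype=np.float32)
    kernel(out)
    return out

def flaky_kernel(out):
    # Stand-in for the SM120 bug: only the first half of the rows are written.
    out[: out.shape[0] // 2] = 1.0

result = gemm_with_prezero(flaky_kernel, 4, 4)
assert not np.isnan(result).any()  # unwritten tiles are 0.0, never NaN
```

Note this only masks the symptom: the unwritten tiles contain 0 rather than the correct GEMM output, which is why the change was later reverted in favor of the real fix.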
…itten tiles" This reverts commit 1b5fff4.
**Root cause found — FlashInfer missing GDC compile flags**

The NaN crash is caused by missing GDC compile flags in FlashInfer's CUTLASS JIT compilation. Fix PR: flashinfer-ai/flashinfer#2716

This PR (defaulting to cudnn on SM120) remains useful as a workaround until the FlashInfer fix is released. This also probably needs CUTLASS >= 4.3.0 (not sure about this, but that release also contains PDL fixes).
…ags cause PDL synchronization barriers to compile as no-ops (#2716)

## Summary

All CUTLASS GEMM templates use `enablePDL=true` (Programmatic Dependent Launch), but the JIT compilation is missing the `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` and `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` compile flags. Without these flags, `wait_on_dependent_grids()` and `launch_dependent_grids()` in CUTLASS `grid_dependency_control.h` compile as **empty no-ops**, eliminating the synchronization barriers needed for safe PDL execution.

## Root Cause

In `cutlass/include/cutlass/arch/grid_dependency_control.h`:

```cpp
CUTLASS_DEVICE
void wait_on_dependent_grids() {
#if (defined(CUTLASS_GDC_ENABLED))  // only defined when CUTLASS_ENABLE_GDC_FOR_SM100 is set
  asm volatile("griddepcontrol.wait;");
#endif
}
```

The `CUTLASS_GDC_ENABLED` macro is only defined when `CUTLASS_ENABLE_GDC_FOR_SM100` is passed as a compile flag. Without it, PDL launches kernels with overlap enabled at the host level (`cudaLaunchAttributeProgrammaticStreamSerialization`), but the device-side synchronization barriers are compiled out — creating a race condition.
## Symptoms

On SM120 (Blackwell RTX PRO 6000 / RTX 5090) with high concurrency (64+ simultaneous requests in SGLang with TP=8):

- CUTLASS FP4 GEMM intermittently fails to write output tiles
- Unwritten tiles contain uninitialized memory (NaN/garbage)
- NaN blocks are always contiguous and 128-aligned, matching CTA tile boundaries
- `CUDA_LAUNCH_BLOCKING=1` eliminates the bug (confirms race condition)
- The cudnn backend is unaffected (does not use CUTLASS PDL)
- Retry with identical inputs produces correct output

## Fix

Add `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` and `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` to all affected GEMM JIT modules:

- `fp4_gemm_cutlass` (SM100)
- `fp4_gemm_cutlass_sm103` (SM103)
- `fp4_gemm_cutlass_sm120` (SM120)
- `fp8_gemm_cutlass` (SM100)
- `mxfp8_gemm_cutlass` (SM100)
- `gemm_sm120` (SM120 FP8 groupwise)

The `tgv_gemm` module already had `DCUTLASS_ENABLE_GDC_FOR_SM100`.

Note: `DCUTLASS_ENABLE_GDC_FOR_SM90` is needed because the SM120 CUTLASS kernel (`sm120_gemm_tma_warpspecialized_cooperative_asymmetric_dma.hpp`) guards `launch_dependent_grids()` with `#ifdef CUTLASS_ENABLE_GDC_FOR_SM90` instead of `SM100` (upstream CUTLASS bug).
## Verification

| Configuration | Result |
|---|---|
| PDL=true, no GDC flags (current) | **NaN crash** under high concurrency |
| PDL=false (workaround) | OK |
| PDL=true + GDC flags (this PR) | **OK** — tested with 64 concurrent requests, multiple SGLang restarts from JIT cache |
| `CUDA_LAUNCH_BLOCKING=1` | OK (confirms race condition) |

## Environment

- Hardware: 8x NVIDIA RTX PRO 6000 Blackwell (SM120, 96GB)
- FlashInfer 0.6.4, CUTLASS 4.4.1
- SGLang with TP=8, EAGLE-v2, GLM-5-NVFP4-MTP model
- PyTorch 2.12.0.dev, CUDA 12.8+

## Related

- #2708
- sgl-project/sglang#20043
- sgl-project/sglang#20047

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
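The fix amounts to appending the two defines to each affected module's nvcc flag list. A minimal sketch — the helper name is hypothetical and not FlashInfer's actual JIT API; only the flag strings come from the fix description above:

```python
# Flag values are taken from the fix description; the helper is illustrative.
GDC_FLAGS = [
    "-DCUTLASS_ENABLE_GDC_FOR_SM100=1",
    "-DCUTLASS_ENABLE_GDC_FOR_SM90=1",  # SM120 kernel guards PDL on the SM90 macro
]

def with_gdc_flags(nvcc_flags):
    # Append each define only if it is not already present
    # (tgv_gemm already carries the SM100 one).
    return list(nvcc_flags) + [f for f in GDC_FLAGS if f not in nvcc_flags]

flags = with_gdc_flags(["-O3", "-DCUTLASS_ENABLE_GDC_FOR_SM100=1"])
```

The dedup step matters because passing a `-D` define twice is harmless to nvcc but noisy in JIT cache keys; appending idempotently keeps module cache entries stable.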
/rerun-stage stage-c-test-4-gpu-b200
✅ Triggered
sgl-project#20047) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
By the way, is there a performance difference between flashinfer_cudnn and flashinfer_cutlass?
Summary
- Change default `--fp4-gemm-backend` from `flashinfer_cutlass` to `auto`
- `auto` now selects `flashinfer_cudnn` on SM120 (Blackwell), `flashinfer_cutlass` on other architectures (no behavior change for non-Blackwell)

Problem
The `flashinfer_cutlass` FP4 GEMM backend produces NaN values in dense MLP layers when processing heterogeneous batches on SM120 (Blackwell) GPUs. This causes `torch.multinomial` to crash with `probability tensor contains either inf, nan or element < 0` under concurrent request load.

Key findings from debugging:
Fix
`flashinfer_cudnn` does not exhibit this issue. Stress-tested:

This PR changes the default to `auto`, which auto-selects `flashinfer_cudnn` on SM120 and preserves `flashinfer_cutlass` on all other architectures.

Test plan
- `flashinfer_cudnn` produces zero NaN on SM120 with 3,200 concurrent requests
- `--fp4-gemm-backend flashinfer_cutlass` still works as explicit override
- On non-Blackwell architectures, `auto` resolves to `flashinfer_cutlass` (existing behavior unchanged)

Environment tested

- `--quantization modelopt_fp4`, TP=8
- `main` branch

Full investigation details: Fix #20043
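The failure mode exercised by the test plan can be mirrored with a small guard. This sketch uses NumPy rather than `torch.multinomial` (so `check_probs` is a made-up harness helper), but the validity condition is the same one PyTorch's error message names:

```python
import numpy as np

def check_probs(probs):
    # torch.multinomial rejects tensors containing inf, NaN, or negative
    # entries; this reproduces that validity check for a stress-test harness.
    p = np.asarray(probs, dtype=np.float64)
    if np.isnan(p).any() or np.isinf(p).any() or (p < 0).any():
        raise ValueError("probability tensor contains either inf, nan or element < 0")
    return True

check_probs([0.2, 0.8])  # healthy sampler output passes
```

A stress harness can run this check on every decode step's probability tensor: with the buggy backend it trips under concurrent load, with `flashinfer_cudnn` it never fires.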
🤖 Generated with Claude Code