Merged
14 changes: 14 additions & 0 deletions python/sglang/srt/layers/quantization/fp4_utils.py
@@ -5,6 +5,7 @@
from typing import TYPE_CHECKING

from sglang.srt.environ import envs
from sglang.srt.utils.common import is_sm120_supported

if TYPE_CHECKING:
from sglang.srt.server_args import ServerArgs
@@ -75,6 +76,19 @@ def initialize_fp4_gemm_config(server_args: ServerArgs) -> None:
"Using server argument value."
)

if backend == "auto":
if is_sm120_supported():
# flashinfer_cutlass produces NaN in dense MLP layers with
# heterogeneous batches on SM120 (Blackwell). cudnn is stable.
# See: https://github.com/sgl-project/sglang/issues/20043
backend = "flashinfer_cudnn"
logger.info(
"SM120 (Blackwell) detected: auto-selecting "
"fp4-gemm-backend=flashinfer_cudnn"
)
else:
backend = "flashinfer_cutlass"
Comment on lines +89 to +90
Contributor
high

This change hardcodes the auto backend to flashinfer_cutlass for non-Blackwell architectures. However, the previous help text for this option stated: auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version. This suggests there might have been more complex logic for auto-selection that is now being removed, which could be a regression for users on non-Blackwell hardware who were relying on auto to potentially select flashinfer_cudnn.

While the PR description mentions that the behavior is unchanged for non-Blackwell, the discrepancy with the old help text is concerning. If the old help text was inaccurate and auto always resolved to flashinfer_cutlass, then this change is fine. Otherwise, the previous auto-selection logic should be preserved here for non-SM120 architectures.

Collaborator
Actually, this was already the case before (SM100/103 and SM120 would both pick flashinfer_cutlass, due to a memory leak). So it's alright, I think.


FP4_GEMM_RUNNER_BACKEND = Fp4GemmRunnerBackend(backend)


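The auto-selection logic added in the hunk above can be modeled as a pure function of the requested backend and the GPU's SM version. The sketch below is a hypothetical standalone illustration, not the actual sglang implementation: `select_fp4_backend` and the integer SM encoding (e.g. 120 for SM120) are assumptions made for clarity.

```python
def select_fp4_backend(requested: str, sm_version: int) -> str:
    """Resolve the fp4-gemm-runner-backend choice (illustrative sketch).

    sm_version encodes the compute capability as an integer,
    e.g. 120 for SM120 (Blackwell).
    """
    if requested != "auto":
        # An explicit backend choice is always honored as-is.
        return requested
    # Per the comment in the diff: flashinfer_cutlass produces NaN in dense
    # MLP layers with heterogeneous batches on SM120 (see sgl-project/sglang
    # issue #20043), so cuDNN is preferred there.
    return "flashinfer_cudnn" if sm_version == 120 else "flashinfer_cutlass"
```

This mirrors the reviewer discussion: non-SM120 architectures (e.g. SM100/103) still resolve `auto` to flashinfer_cutlass.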
6 changes: 3 additions & 3 deletions python/sglang/srt/server_args.py
@@ -466,7 +466,7 @@ class ServerArgs:
grammar_backend: Optional[str] = None
mm_attention_backend: Optional[str] = None
fp8_gemm_runner_backend: str = "auto"
fp4_gemm_runner_backend: str = "flashinfer_cutlass"
fp4_gemm_runner_backend: str = "auto"
nsa_prefill_backend: Optional[str] = (
None # None = auto-detect based on hardware/kv_cache_dtype
)
@@ -4308,8 +4308,8 @@ def add_cli_args(parser: argparse.ArgumentParser):
default=ServerArgs.fp4_gemm_runner_backend,
dest="fp4_gemm_runner_backend",
help="Choose the runner backend for NVFP4 GEMM operations. "
"Options: 'flashinfer_cutlass' (default), "
"'auto' (auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), "
"Options: 'auto' (default; selects flashinfer_cudnn on SM120, flashinfer_cutlass otherwise), "
"'flashinfer_cutlass' (CUTLASS backend), "
"'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), "
"'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). "
"NOTE: This replaces the deprecated environment variable "
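For context, the flag changed in the hunk above behaves like a standard argparse option whose default is now `"auto"`. This is a minimal self-contained sketch, not the real registration in `server_args.py` (which defines many more options); the `choices` list here is inferred from the help text.

```python
import argparse

# Simplified stand-in for the --fp4-gemm-runner-backend registration.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--fp4-gemm-runner-backend",
    type=str,
    choices=["auto", "flashinfer_cutlass", "flashinfer_cudnn", "flashinfer_trtllm"],
    default="auto",  # new default introduced by this PR (was flashinfer_cutlass)
    dest="fp4_gemm_runner_backend",
    help="Runner backend for NVFP4 GEMM operations.",
)

# With no flag passed, the backend resolves to "auto" and the SM-based
# selection in fp4_utils.py decides the concrete backend at startup.
args = parser.parse_args([])
print(args.fp4_gemm_runner_backend)
```

Users who relied on the old default can pin the previous behavior explicitly with `--fp4-gemm-runner-backend flashinfer_cutlass`.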