Merged
4 changes: 2 additions & 2 deletions docs/advanced_features/server_arguments.md
@@ -268,8 +268,8 @@ Please consult the documentation below and [server_args.py](https://github.com/s
 | `--mm-attention-backend` | Set multimodal attention backend. | `None` | `sdpa`, `fa3`, `fa4`, `triton_attn`, `ascend_attn`, `aiter_attn` |
 | `--nsa-prefill-backend` | Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention). | `flashmla_sparse` | `flashmla_sparse`, `flashmla_kv`, `flashmla_auto`, `fa3`, `tilelang`, `aiter`, `trtllm` |
 | `--nsa-decode-backend` | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding. | `fa3` | `flashmla_sparse`, `flashmla_kv`, `fa3`, `tilelang`, `aiter`, `trtllm` |
-| `--fp8-gemm-backend` | Choose the runner backend for Blockwise FP8 GEMM operations. Options: 'auto' (default, auto-selects based on hardware), 'deep_gemm' (JIT-compiled; enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed), 'flashinfer_trtllm' (FlashInfer TRTLLM backend; SM100/SM103 only), 'flashinfer_cutlass' (FlashInfer CUTLASS backend, SM120 only), 'flashinfer_deepgemm' (Hopper SM90 only, uses swapAB optimization for small M dimensions in decoding), 'cutlass' (optimal for Hopper/Blackwell GPUs and high-throughput), 'triton' (fallback, widely compatible), 'aiter' (ROCm only). **NOTE**: This replaces the deprecated environment variables SGLANG_ENABLE_FLASHINFER_FP8_GEMM and SGLANG_SUPPORT_CUTLASS_BLOCK_FP8. | `auto` | `auto`, `deep_gemm`, `flashinfer_trtllm`, `flashinfer_cutlass`, `flashinfer_deepgemm`, `cutlass`, `triton`, `aiter` |
-| `--fp4-gemm-backend` | Choose the runner backend for NVFP4 GEMM operations. Options: 'flashinfer_cutlass' (default), 'auto' (auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), 'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), 'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). All backends are from FlashInfer; when FlashInfer is unavailable, sgl-kernel CUTLASS is used as an automatic fallback. **NOTE**: This replaces the deprecated environment variable SGLANG_FLASHINFER_FP4_GEMM_BACKEND. | `flashinfer_cutlass` | `auto`, `flashinfer_cudnn`, `flashinfer_cutlass`, `flashinfer_trtllm` |
+| `--fp8-gemm-backend` | Choose the runner backend for Blockwise FP8 GEMM operations. Options: 'auto' (default, auto-selects based on hardware), 'deep_gemm' (JIT-compiled; enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed), 'flashinfer_trtllm' (FlashInfer TRTLLM backend; SM100/SM103 only), 'flashinfer_cutlass' (FlashInfer CUTLASS backend, SM120 only), 'flashinfer_deepgemm' (Hopper SM90 only, uses swapAB optimization for small M dimensions in decoding), 'cutlass' (optimal for Hopper/Blackwell GPUs and high-throughput), 'triton' (fallback, widely compatible), 'aiter' (ROCm only).| `auto` | `auto`, `deep_gemm`, `flashinfer_trtllm`, `flashinfer_cutlass`, `flashinfer_deepgemm`, `cutlass`, `triton`, `aiter` |
+| `--fp4-gemm-backend` | Choose the runner backend for NVFP4 GEMM operations. Options: 'flashinfer_cutlass' (default), 'auto' (auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), 'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), 'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). All backends are from FlashInfer; when FlashInfer is unavailable, sgl-kernel CUTLASS is used as an automatic fallback.| `flashinfer_cutlass` | `auto`, `flashinfer_cudnn`, `flashinfer_cutlass`, `flashinfer_trtllm` |
 | `--disable-flashinfer-autotune` | Flashinfer autotune is enabled by default. Set this flag to disable the autotune. | `False` | bool flag (set to enable) |

 ## Speculative decoding
3 changes: 0 additions & 3 deletions docs/references/environment_variables.md
@@ -119,12 +119,9 @@ SGLang supports various environment variables that can be used to configure its
 | `SGLANG_INT4_WEIGHT` | Enable INT4 weight quantization | `false` |
 | `SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2` | Apply per token group quantization kernel with fused silu and mul and masked m | `false` |
 | `SGLANG_FORCE_FP8_MARLIN` | Force using FP8 MARLIN kernels even if other FP8 kernels are available | `false` |
-| `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` (deprecated) | Select backend for `mm_fp4` on Blackwell GPUs. **DEPRECATED**: Please use `--fp4-gemm-backend` instead. | `` |
 | `SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN` | Quantize q_b_proj from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | `false` |
 | `SGLANG_MOE_NVFP4_DISPATCH` | Use nvfp4 for moe dispatch (on flashinfer_cutlass or flashinfer_cutedsl moe runner backend) | `"false"` |
 | `SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE` | Quantize moe of nextn layer from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | `false` |
-| `SGLANG_ENABLE_FLASHINFER_FP8_GEMM` (deprecated) | Use flashinfer kernels when running blockwise fp8 GEMM on Blackwell GPUs. **DEPRECATED**: Please use `--fp8-gemm-backend=flashinfer_trtllm` (SM100/SM103) or `--fp8-gemm-backend=flashinfer_cutlass` (SM120/SM121 and newer) instead. | `false` |
-| `SGLANG_SUPPORT_CUTLASS_BLOCK_FP8` (deprecated) | Use Cutlass kernels when running blockwise fp8 GEMM on Hopper or Blackwell GPUs. **DEPRECATED**: Please use `--fp8-gemm-backend=cutlass` instead. | `false` |
 | `SGLANG_QUANT_ALLOW_DOWNCASTING` | Allow weight dtype downcasting during loading (e.g., fp32 → fp16). By default, SGLang rejects this kind of downcasting when using quantization. | `false` |
 | `SGLANG_FP8_IGNORED_LAYERS` | A comma-separated list of layer names to ignore during FP8 quantization. For example: `model.layers.0,model.layers.1.,qkv_proj`. | `""` |
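
Migration note: the three removed variables map onto the two CLI flags as described in the deprecation text above. The snippet below is a minimal sketch of a pre-launch check that turns leftover environment settings into flag suggestions. The helper `suggest_cli_flags` and the set of accepted truthy strings are hypothetical and not part of SGLang; the mapping and the `flashinfer_` prefixing mirror the compatibility code this PR deletes from `fp8_utils.py` and `fp4_utils.py`.

```python
from typing import List, Mapping

# Mapping taken from the compatibility code removed in this PR.
# Order matters: the removed code checked the FlashInfer variable first.
# Note: the removed docs also suggested flashinfer_cutlass on SM120/SM121;
# this sketch follows the removed fp8_utils.py code path instead.
_FP8_ENV_TO_BACKEND = {
    "SGLANG_ENABLE_FLASHINFER_FP8_GEMM": "flashinfer_trtllm",
    "SGLANG_SUPPORT_CUTLASS_BLOCK_FP8": "cutlass",
}
_FP4_ENV = "SGLANG_FLASHINFER_FP4_GEMM_BACKEND"

# Assumption: which strings count as "enabled" for a boolean variable.
_TRUTHY = {"1", "true", "yes", "y", "on"}


def suggest_cli_flags(environ: Mapping[str, str]) -> List[str]:
    """Return CLI flags equivalent to the removed environment variables."""
    flags: List[str] = []

    # FP8: the first enabled variable wins, as in the removed if/elif chain.
    for name, backend in _FP8_ENV_TO_BACKEND.items():
        if environ.get(name, "").strip().lower() in _TRUTHY:
            flags.append(f"--fp8-gemm-backend={backend}")
            break

    # FP4: the removed code prefixed bare names such as "cudnn".
    fp4 = environ.get(_FP4_ENV, "").strip()
    if fp4:
        if not fp4.startswith("flashinfer_"):
            fp4 = "flashinfer_" + fp4
        flags.append(f"--fp4-gemm-backend={fp4}")

    return flags
```

Calling `suggest_cli_flags(os.environ)` before launching would list the flags to add to the server command line; after this PR the variables themselves have no effect on backend selection.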

23 changes: 0 additions & 23 deletions python/sglang/srt/environ.py
@@ -337,9 +337,7 @@ class Envs:

     # Flashinfer
     SGLANG_IS_FLASHINFER_AVAILABLE = EnvBool(True)
-    SGLANG_ENABLE_FLASHINFER_FP8_GEMM = EnvBool(False)
     # Default to the pick from flashinfer
-    SGLANG_FLASHINFER_FP4_GEMM_BACKEND = EnvStr("")
     SGLANG_FLASHINFER_WORKSPACE_SIZE = EnvInt(384 * 1024 * 1024)
     # TODO(mmangkad): Remove this once the FlashInfer unified allreduce-fusion
     # transport issue on GB200/GB300 platforms is fixed and verified resolved.
@@ -408,7 +406,6 @@ class Envs:
     DISABLE_OPENAPI_DOC = EnvBool(False)
     SGLANG_ENABLE_TORCH_INFERENCE_MODE = EnvBool(False)
     SGLANG_IS_FIRST_RANK_ON_NODE = EnvBool(True)
-    SGLANG_SUPPORT_CUTLASS_BLOCK_FP8 = EnvBool(False)
     SGLANG_SYNC_TOKEN_IDS_ACROSS_TP = EnvBool(False)
     SGLANG_ENABLE_COLOCATED_BATCH_GEN = EnvBool(False)

@@ -548,9 +545,6 @@ def _warn_deprecated_env_to_cli_flag(env_name: str, suggestion: str):

 def _convert_SGL_to_SGLANG():
     _print_deprecated_env("SGLANG_LOG_GC", "SGLANG_GC_LOG")
-    _print_deprecated_env(
-        "SGLANG_ENABLE_FLASHINFER_FP8_GEMM", "SGLANG_ENABLE_FLASHINFER_GEMM"
-    )
     _print_deprecated_env(
         "SGLANG_MOE_NVFP4_DISPATCH", "SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH"
     )
@@ -581,23 +575,6 @@ def _convert_SGL_to_SGLANG():


 _convert_SGL_to_SGLANG()
-
-_warn_deprecated_env_to_cli_flag(
-    "SGLANG_ENABLE_FLASHINFER_FP8_GEMM",
-    "It will be completely removed in 0.5.7. Please use '--fp8-gemm-backend=flashinfer_trtllm' instead.",
-)
-_warn_deprecated_env_to_cli_flag(
-    "SGLANG_ENABLE_FLASHINFER_GEMM",
-    "It will be completely removed in 0.5.7. Please use '--fp8-gemm-backend=flashinfer_trtllm' instead.",
-)
-_warn_deprecated_env_to_cli_flag(
-    "SGLANG_SUPPORT_CUTLASS_BLOCK_FP8",
-    "It will be completely removed in 0.5.7. Please use '--fp8-gemm-backend=cutlass' instead.",
-)
-_warn_deprecated_env_to_cli_flag(
-    "SGLANG_FLASHINFER_FP4_GEMM_BACKEND",
-    "It will be completely removed in 0.5.9. Please use '--fp4-gemm-backend' instead.",
-)
 _warn_deprecated_env_to_cli_flag(
     "SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE",
     "Please use '--enable-prefill-delayer' instead.",
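
For readers unfamiliar with the pattern: the body of `_warn_deprecated_env_to_cli_flag` is not shown in this diff, only its signature `(env_name: str, suggestion: str)`. The following is a rough sketch of what such a helper might do, assuming it checks whether the variable is set and emits a warning. The function name without the leading underscore, the extra `environ` parameter, and the use of the `warnings` module are illustrative choices, not the SGLang implementation.

```python
import os
import warnings
from typing import Mapping, Optional


def warn_deprecated_env_to_cli_flag(
    env_name: str,
    suggestion: str,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Warn if a deprecated environment variable is set.

    Returns True when a warning was emitted. The `environ` argument is a
    hypothetical hook that keeps the sketch testable without touching the
    real process environment.
    """
    env = os.environ if environ is None else environ
    if env_name not in env:
        return False
    warnings.warn(
        f"Environment variable {env_name} is deprecated. {suggestion}",
        DeprecationWarning,
        stacklevel=2,
    )
    return True
```

With a helper of this shape, deleting a call site (as this hunk does for four variables) simply means the variable is no longer recognized or warned about.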
21 changes: 0 additions & 21 deletions python/sglang/srt/layers/quantization/fp4_utils.py
@@ -4,7 +4,6 @@
 from enum import Enum
 from typing import TYPE_CHECKING

-from sglang.srt.environ import envs
 from sglang.srt.utils.common import is_sm120_supported

 if TYPE_CHECKING:
@@ -56,26 +55,6 @@ def initialize_fp4_gemm_config(server_args: ServerArgs) -> None:
     global FP4_GEMM_RUNNER_BACKEND

     backend = server_args.fp4_gemm_runner_backend
-
-    # Handle deprecated env var for backward compatibility
-    # TODO: Remove this in a future version
-    if envs.SGLANG_FLASHINFER_FP4_GEMM_BACKEND.is_set():
-        env_backend = envs.SGLANG_FLASHINFER_FP4_GEMM_BACKEND.get()
-        if backend == "auto":
-            logger.warning(
-                "SGLANG_FLASHINFER_FP4_GEMM_BACKEND is deprecated. "
-                f"Please use '--fp4-gemm-backend={env_backend}' instead."
-            )
-            if not env_backend.startswith("flashinfer_"):
-                env_backend = "flashinfer_" + env_backend
-            backend = env_backend
-        else:
-            logger.warning(
-                f"FP4 GEMM backend set to '{backend}' via --fp4-gemm-backend overrides "
-                "environment variable SGLANG_FLASHINFER_FP4_GEMM_BACKEND. "
-                "Using server argument value."
-            )
-
     if backend == "auto":
         if is_sm120_supported():
             # flashinfer_cutlass produces NaN in dense MLP layers with
20 changes: 0 additions & 20 deletions python/sglang/srt/layers/quantization/fp8_utils.py
@@ -7,7 +7,6 @@

 import torch

-from sglang.srt.environ import envs
 from sglang.srt.layers import deep_gemm_wrapper
 from sglang.srt.layers.quantization.fp8_kernel import sglang_per_token_group_quant_fp8
 from sglang.srt.layers.quantization.mxfp4_tensor import MXFP4QuantizeUtil
@@ -453,25 +452,6 @@ def initialize_fp8_gemm_config(server_args: ServerArgs) -> None:
     global FP8_GEMM_RUNNER_BACKEND

     backend = server_args.fp8_gemm_runner_backend
-
-    # TODO(brayden): Remove env-based overrides in v0.5.7, they will be fully removed in v0.5.7.
-    # Only check environment variables when the server args is not set, server args should take priority.
-    if backend == "auto":
-        if envs.SGLANG_ENABLE_FLASHINFER_FP8_GEMM.get():
-            backend = "flashinfer_trtllm"
-        elif envs.SGLANG_SUPPORT_CUTLASS_BLOCK_FP8.get():
-            backend = "cutlass"
-    else:
-        if (
-            envs.SGLANG_ENABLE_FLASHINFER_FP8_GEMM.get()
-            or envs.SGLANG_SUPPORT_CUTLASS_BLOCK_FP8.get()
-        ):
-            logger.warning(
-                f"FP8 GEMM backend set to '{backend}' via --fp8-gemm-backend overrides "
-                "environment variables SGLANG_ENABLE_FLASHINFER_FP8_GEMM and "
-                "SGLANG_SUPPORT_CUTLASS_BLOCK_FP8. Using server argument value."
-            )
-
     if backend == "auto" and is_sm120_supported():
         # TODO(brayden): Verify if CUTLASS can be set by default once SwapAB is supported
         backend = "triton"
8 changes: 2 additions & 6 deletions python/sglang/srt/server_args.py
@@ -4632,9 +4632,7 @@ def add_cli_args(parser: argparse.ArgumentParser):
             "'flashinfer_deepgemm' (Hopper SM90 only; uses swapAB optimization for small M dimensions in decoding), "
             "'cutlass' (optimal for Hopper/Blackwell GPUs and high-throughput), "
             "'triton' (fallback, widely compatible), "
-            "'aiter' (ROCm only). "
-            "NOTE: This replaces the deprecated environment variables "
-            "SGLANG_ENABLE_FLASHINFER_FP8_GEMM and SGLANG_SUPPORT_CUTLASS_BLOCK_FP8.",
+            "'aiter' (ROCm only). ",
         )
         parser.add_argument(
             "--fp4-gemm-backend",
@@ -4646,9 +4644,7 @@ def add_cli_args(parser: argparse.ArgumentParser):
             "Options: 'auto' (default; selects flashinfer_cudnn on SM120, flashinfer_cutlass otherwise), "
             "'flashinfer_cutlass' (CUTLASS backend), "
             "'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), "
-            "'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). "
-            "NOTE: This replaces the deprecated environment variable "
-            "SGLANG_FLASHINFER_FP4_GEMM_BACKEND.",
+            "'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). ",
         )
         parser.add_argument(
             "--disable-flashinfer-autotune",