feat: port SGLANG_JIT_DEEPGEMM_FAST_WARMUP to deepseek_v4 branch #23756

Merged
Fridge003 merged 1 commit into sgl-project:deepseek_v4 from parrot18:feat/fast-warmup-deepseek-v4
Apr 27, 2026

@parrot18

Background

When deploying DeepSeek-V4-Flash with TP>=2 and CUDA graph enabled, the server fails to start due to an NCCL timeout. The root cause:

  1. During CUDA graph capture warmup, Rank 0 blocks for minutes compiling all M values (1..16384) via DeepGEMM JIT, while the other ranks finish their GEMM quickly and wait at all-reduce.
  2. NCCL has a default 30-minute timeout. With the full M list, Rank 0 compilation takes 5-10+ minutes per kernel type, so compiling 6 kernel types sequentially takes 30-60+ minutes and easily exceeds the timeout.

Without CUDA graph, single-request decode TPOT is ~133ms/tok regardless of context length (kernel launch overhead dominates). With CUDA graph, single-request decode TPOT drops to ~22ms/tok (~6x improvement). So enabling CUDA graph is critical for decode performance.

What FAST_WARMUP does

Ported from main branch (PR #18111), this feature reduces the M list from ~16384 to ~2560 values:

  • M=1..1024: all compiled (covers decode batch sizes completely)
  • M=1025..max_prefill_bs: logarithmic sampling (the step doubles with each range), e.g. step 2 for [1024,2048), step 4 for [2048,4096), etc.

This reduces Rank 0 compilation time from ~5-10min to ~90s, avoiding the NCCL timeout. Total cold start with CUDA graph: ~5.5min.

Tradeoff: some prefill M values may not be pre-compiled, causing one-time JIT delay on first encounter. Decode is unaffected since all M<=1024 are always compiled.
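
The sampling rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in compile_utils.py: the function name sample_m_list, the dense_limit parameter, and the choice to make each sampled range include its upper endpoint are all made up for the example.

```python
def sample_m_list(max_m: int, dense_limit: int = 1024) -> list[int]:
    """Return a sparse list of M values to pre-compile (hypothetical helper).

    Every M in [1, dense_limit] is kept. Above dense_limit, each doubling
    range (lo, 2*lo] is sampled with a step that also doubles: 2, 4, 8, ...
    """
    # Dense part: all decode-sized M values.
    m_values = list(range(1, min(dense_limit, max_m) + 1))

    lo, step = dense_limit, 2
    while lo < max_m:
        hi = min(2 * lo, max_m)
        # Sampled part: lo + step, lo + 2*step, ... up to and including hi.
        m_values.extend(range(lo + step, hi + 1, step))
        lo, step = 2 * lo, 2 * step
    return m_values


if __name__ == "__main__":
    for cap in (8192, 16384):
        print(cap, len(sample_m_list(cap)))
```

With an assumed cap of 8192, this sketch yields 2560 values (1024 dense values plus 512 per doubled range); with a cap of 16384 it yields 3072. The exact count in the real implementation depends on how it handles range boundaries and on the configured prefill cap.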

Changes

  • compile_utils.py:
    • Add _FAST_WARMUP path in update_deep_gemm_config() with sampled M list generation
    • Add nullcontext/is_musa imports
    • Refactor deep_gemm_execution_hook into a plain function returning a context manager (MUSA compat)
    • Add hasattr guards for get_compile_mode/set_compile_mode (older DeepGEMM compat)
    • Defer _BUILTIN_M_LIST init to update_deep_gemm_config()
  • environ.py: Add SGLANG_JIT_DEEPGEMM_FAST_WARMUP (EnvBool, default False) and SGLANG_DEEPGEMM_SANITY_CHECK (EnvBool, default False)
  • entrypoint.py: Use envs.SGLANG_DEEPGEMM_SANITY_CHECK instead of get_bool_env_var; add ENABLE_JIT_DEEPGEMM guard in configure_deep_gemm_num_sms
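
The two compatibility changes in compile_utils.py (a plain function that returns a context manager, and hasattr guards around get_compile_mode/set_compile_mode) follow a common pattern that can be sketched roughly as below. The classes FakeDeepGemm and FakeOldDeepGemm and the helper compile_mode_hook are hypothetical stand-ins for illustration; they do not reproduce the real DeepGEMM module or the actual hook in this PR.

```python
from contextlib import contextmanager, nullcontext


class FakeDeepGemm:
    """Hypothetical stand-in exposing a compile-mode getter and setter."""

    def __init__(self):
        self._mode = 0

    def get_compile_mode(self):
        return self._mode

    def set_compile_mode(self, mode):
        self._mode = mode


class FakeOldDeepGemm:
    """Hypothetical stand-in for a version without the compile-mode API."""


@contextmanager
def _scoped_compile_mode(module, mode):
    # Remember the previous mode and restore it even if the body raises.
    previous = module.get_compile_mode()
    module.set_compile_mode(mode)
    try:
        yield
    finally:
        module.set_compile_mode(previous)


def compile_mode_hook(module, mode):
    """Plain function returning a context manager (a no-op if the API is absent)."""
    has_api = hasattr(module, "get_compile_mode") and hasattr(
        module, "set_compile_mode"
    )
    return _scoped_compile_mode(module, mode) if has_api else nullcontext()


if __name__ == "__main__":
    new_module = FakeDeepGemm()
    with compile_mode_hook(new_module, 1):
        print("inside:", new_module.get_compile_mode())
    print("after:", new_module.get_compile_mode())

    with compile_mode_hook(FakeOldDeepGemm(), 1):
        print("old module: guard skipped the call")
```

Returning nullcontext() keeps every call site a plain `with` statement regardless of which branch is taken, and the try/finally restores the previous mode on both normal exit and exceptions.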

Usage

SGLANG_JIT_DEEPGEMM_FAST_WARMUP=True python3 -m sglang.launch_server \
    --model-path --tp-size 4 --moe-runner-backend deep_gemm ...

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a fast warmup mode for DeepGEMM JIT compilation by sampling batch sizes, reducing initialization overhead. It also adds environment variables for configuration, improves MUSA support, and includes safety checks for DeepGEMM API calls. Reviewers identified a potential bug in compile mode restoration, suggested using dynamic environment variable lookups to support configuration overrides, and recommended capping the batch size sampling range to prevent redundant compilation.

Comment thread python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py Outdated
Comment thread python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py Outdated
Comment thread python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py Outdated
Comment thread python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py Outdated
Comment thread python/sglang/srt/mem_cache/hisparse_memory_pool.py Outdated
… time

In the deepseek_v4 branch, DeepGEMM JIT compiles up to 16K M values
during CUDA graph warmup. With TP=4 on B200, this exceeds NCCL timeout
thresholds and causes initialization failures.

SGLANG_JIT_DEEPGEMM_FAST_WARMUP=True replaces the full M-list with a
sparse sampled set (~2560 values): all M in [1,1024] for decode
performance, plus geometrically-spaced values up to chunked_prefill_size
for prefill coverage. This reduces cold-start time from >30min to ~5.5min
while preserving decode TPOT (~22ms/tok on B200).

Also guard get/set_compile_mode calls with hasattr() to support DeepGEMM
versions that do not expose this API.

Signed-off-by: yingru <yingru@baidu.com>
Comment thread python/sglang/srt/environ.py
@Fridge003 Fridge003 merged commit c409f44 into sgl-project:deepseek_v4 Apr 27, 2026
1 check passed
@liaol

liaol commented Apr 28, 2026

export SGLANG_JIT_DEEPGEMM_PRECOMPILE=1
export SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1

This does not work on B300:

  File "/workspace/sglang/python/sglang/srt/layers/moe/moe_runner/deep_gemm.py", line 186, in _run_contiguous_gemm
    deep_gemm_wrapper.grouped_gemm_nt_f8f8bf16_contig(
  File "/workspace/sglang/python/sglang/srt/layers/deep_gemm_wrapper/entrypoint.py", line 117, in grouped_gemm_nt_f8f8bf16_contig
    with compile_utils.deep_gemm_execution_hook(m, n, k, num_groups, kernel_type):
  File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/workspace/sglang/python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py", line 347, in deep_gemm_execution_hook
    _maybe_compile_deep_gemm_one_type_all(kernel_type, n, k, num_groups)
  File "/workspace/sglang/python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py", line 143, in _maybe_compile_deep_gemm_one_type_all
    _compile_deep_gemm_one_type_all(
  File "/workspace/sglang/python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py", line 207, in _compile_deep_gemm_one_type_all
    executor.execute(m=m)
  File "/workspace/sglang/python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py", line 304, in execute
    deep_gemm.m_grouped_fp8_gemm_nt_contiguous(
TypeError: m_grouped_fp8_fp4_gemm_nt_contiguous(): incompatible function arguments. The following argument types are supported:
    1. (a: tuple[torch.Tensor, torch.Tensor], b: tuple[torch.Tensor, torch.Tensor], d: torch.Tensor, grouped_layout: torch.Tensor, recipe: tuple[typing.SupportsInt, typing.SupportsInt, typing.SupportsInt] | None = None, recipe_a: tuple[typing.SupportsInt, typing.SupportsInt] | None = None, recipe_b: tuple[typing.SupportsInt, typing.SupportsInt] | None = None, compiled_dims: str = 'nk', disable_ue8m0_cast: bool = False, use_psum_layout: bool = False, expected_m_for_psum_layout: typing.SupportsInt | None = None) -> None

@parrot18
Author

Please use the latest branch code. After PR #23686, is_fp4_expert will use flashinfer_mxfp4 instead of deep_gemm.
