feat: port SGLANG_JIT_DEEPGEMM_FAST_WARMUP to deepseek_v4 branch by parrot18 · Pull Request #23756 · sgl-project/sglang

parrot18 · 2026-04-26T07:02:22Z

Background

When deploying DeepSeek-V4-Flash with TP>=2 and CUDA graph enabled, the server fails to start because of an NCCL timeout. The root cause:

During CUDA graph capture warmup, Rank 0 blocks for minutes compiling all M values (1..16384) via DeepGEMM JIT, while other ranks finish their GEMM quickly and wait at all-reduce.
NCCL has a default 30-minute timeout. With the full M list, Rank 0 compilation takes 5-10+ minutes per kernel type, easily exceeding the timeout when 6 kernel types are compiled sequentially.
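
As a rough check of the figures above, the sequential compile time on Rank 0 can be estimated with a few lines of arithmetic. This is a back-of-the-envelope sketch that uses only the numbers quoted in this description; the constant names are illustrative and do not come from the codebase.

```python
# Figures quoted in the PR description; the names are illustrative only.
KERNEL_TYPES = 6
PER_KERNEL_MINUTES = (5, 10)  # low and high estimate per kernel type
NCCL_TIMEOUT_MINUTES = 30

# Kernel types are compiled one after another, so the times add up.
low, high = (KERNEL_TYPES * minutes for minutes in PER_KERNEL_MINUTES)
timeout_at_risk = high > NCCL_TIMEOUT_MINUTES

print(f"Rank 0 compile: {low}-{high} min, timeout: {NCCL_TIMEOUT_MINUTES} min")
```

Even the low estimate reaches the timeout, so the other ranks waiting at all-reduce have no margin left.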

Without CUDA graph, single-request decode TPOT is ~133ms/tok regardless of context length (kernel launch overhead dominates). With CUDA graph, single-request decode TPOT drops to ~22ms/tok (~6x improvement). So enabling CUDA graph is critical for decode performance.

What FAST_WARMUP does

Ported from main branch (PR #18111), this feature reduces the M list from ~16384 to ~2560 values:

M=1..1024: all compiled (covers decode batch sizes completely)
M=1025..max_prefill_bs: logarithmic sampling (the step doubles in each range), e.g. step 2 for [1024,2048), step 4 for [2048,4096), etc.
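
The sampling rule above can be sketched as follows. This is a minimal illustration, not the code in compile_utils.py: the function name sampled_m_list, the dense_limit parameter, and the choice to always include max_m itself are assumptions made for the example, and the real implementation may treat range boundaries differently.

```python
def sampled_m_list(max_m: int, dense_limit: int = 1024) -> list[int]:
    """Hypothetical helper: every M up to dense_limit, then a step that
    doubles for each power-of-two range above it."""
    # Dense part: every decode-sized M is compiled.
    ms = list(range(1, min(dense_limit, max_m) + 1))

    # Sparse part: step 2 for [1024, 2048), step 4 for [2048, 4096), ...
    lo, step = dense_limit, 2
    while lo < max_m:
        hi = min(lo * 2, max_m + 1)
        ms.extend(m for m in range(lo, hi, step) if m > dense_limit)
        lo, step = lo * 2, step * 2

    # Assumption: make sure the largest prefill size itself is covered.
    if ms[-1] != max_m:
        ms.append(max_m)
    return ms


if __name__ == "__main__":
    print(len(sampled_m_list(8192)))  # 2560
```

Under these assumptions a maximum of 8192 yields 2560 values, which is consistent with the ~2560 figure quoted above; a larger maximum adds 512 values per extra doubling.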

This reduces Rank 0 compilation time from ~5-10min to ~90s, avoiding the NCCL timeout. Total cold start with CUDA graph: ~5.5min.

Tradeoff: some prefill M values may not be pre-compiled, causing one-time JIT delay on first encounter. Decode is unaffected since all M<=1024 are always compiled.

Changes

compile_utils.py:
- Add _FAST_WARMUP path in update_deep_gemm_config() with sampled M list generation
- Add nullcontext/is_musa imports
- Refactor deep_gemm_execution_hook to a plain function returning a context manager (MUSA compat)
- Add hasattr guards for get_compile_mode/set_compile_mode (older DeepGEMM compat)
- Defer _BUILTIN_M_LIST init to update_deep_gemm_config()
environ.py:
- Add SGLANG_JIT_DEEPGEMM_FAST_WARMUP (EnvBool, default False)
- Add SGLANG_DEEPGEMM_SANITY_CHECK (EnvBool, default False)
entrypoint.py:
- Use envs.SGLANG_DEEPGEMM_SANITY_CHECK instead of get_bool_env_var
- Add ENABLE_JIT_DEEPGEMM guard in configure_deep_gemm_num_sms
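
To illustrate the two compatibility changes in compile_utils.py (hasattr guards and a plain function that returns a context manager), here is a minimal sketch. It is not the merged code: guarded_compile_mode, execution_hook, the mode value 1, and the FakeDeepGemm stand-in are all hypothetical; only the attribute names get_compile_mode and set_compile_mode come from this PR.

```python
from contextlib import contextmanager, nullcontext


@contextmanager
def guarded_compile_mode(lib, mode):
    """Temporarily set the compile mode, but only if the library exposes the API."""
    has_api = hasattr(lib, "get_compile_mode") and hasattr(lib, "set_compile_mode")
    previous = lib.get_compile_mode() if has_api else None
    if has_api:
        lib.set_compile_mode(mode)
    try:
        yield
    finally:
        if has_api:
            # Restore the saved mode even if the body raised.
            lib.set_compile_mode(previous)


def execution_hook(lib, jit_enabled, mode=1):
    """Plain function returning a context manager; a no-op when JIT is disabled."""
    return guarded_compile_mode(lib, mode) if jit_enabled else nullcontext()


class FakeDeepGemm:
    """Stand-in for a library version that exposes the compile-mode API."""

    def __init__(self):
        self._mode = 0

    def get_compile_mode(self):
        return self._mode

    def set_compile_mode(self, mode):
        self._mode = mode
```

With a library object that lacks both methods, the hook enters and exits without touching anything, which is the behavior the hasattr guards are meant to provide for older DeepGEMM versions.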

Usage

SGLANG_JIT_DEEPGEMM_FAST_WARMUP=True python3 -m sglang.launch_server \
  --model-path --tp-size 4 --moe-runner-backend deep_gemm ...
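
The flag is set as True here and as 1 in a later comment on this PR. As a generic illustration of how a boolean environment flag can tolerate both spellings, consider the sketch below. It is not SGLang's EnvBool: read_bool_env and the accepted value sets are assumptions made for the example, so check environ.py for the spellings that are actually accepted.

```python
import os

# Assumed spellings for this sketch; the real EnvBool may accept a different set.
_TRUE_VALUES = {"1", "true", "yes", "y"}
_FALSE_VALUES = {"0", "false", "no", "n", ""}


def read_bool_env(name: str, default: bool = False) -> bool:
    """Hypothetical helper: parse a boolean flag from the environment."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    raise ValueError(f"{name}={raw!r} is not a recognized boolean value")
```

Rejecting unknown spellings rather than silently falling back to the default makes a mistyped flag visible at startup.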

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist

Code Review

This pull request implements a fast warmup mode for DeepGEMM JIT compilation by sampling batch sizes, reducing initialization overhead. It also adds environment variables for configuration, improves MUSA support, and includes safety checks for DeepGEMM API calls. Reviewers identified a potential bug in compile mode restoration, suggested using dynamic environment variable lookups to support configuration overrides, and recommended capping the batch size sampling range to prevent redundant compilation.

… time

In the deepseek_v4 branch, DeepGEMM JIT compiles up to 16K M values during CUDA graph warmup. With TP=4 on B200, this exceeds NCCL timeout thresholds and causes initialization failures.

SGLANG_JIT_DEEPGEMM_FAST_WARMUP=True replaces the full M-list with a sparse sampled set (~2560 values): all M in [1,1024] for decode performance, plus geometrically-spaced values up to chunked_prefill_size for prefill coverage. This reduces cold-start time from >30min to ~5.5min while preserving decode TPOT (~22ms/tok on B200).

Also guard get/set_compile_mode calls with hasattr() to support DeepGEMM versions that do not expose this API.

Signed-off-by: yingru <yingru@baidu.com>

liaol · 2026-04-28T09:23:05Z

export SGLANG_JIT_DEEPGEMM_PRECOMPILE=1
export SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1

With these settings, it does not work on B300:

  File "/workspace/sglang/python/sglang/srt/layers/moe/moe_runner/deep_gemm.py", line 186, in _run_contiguous_gemm
    deep_gemm_wrapper.grouped_gemm_nt_f8f8bf16_contig(
  File "/workspace/sglang/python/sglang/srt/layers/deep_gemm_wrapper/entrypoint.py", line 117, in grouped_gemm_nt_f8f8bf16_contig
    with compile_utils.deep_gemm_execution_hook(m, n, k, num_groups, kernel_type):
  File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/workspace/sglang/python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py", line 347, in deep_gemm_execution_hook
    _maybe_compile_deep_gemm_one_type_all(kernel_type, n, k, num_groups)
  File "/workspace/sglang/python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py", line 143, in _maybe_compile_deep_gemm_one_type_all
    _compile_deep_gemm_one_type_all(
  File "/workspace/sglang/python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py", line 207, in _compile_deep_gemm_one_type_all
    executor.execute(m=m)
  File "/workspace/sglang/python/sglang/srt/layers/deep_gemm_wrapper/compile_utils.py", line 304, in execute
    deep_gemm.m_grouped_fp8_gemm_nt_contiguous(
TypeError: m_grouped_fp8_fp4_gemm_nt_contiguous(): incompatible function arguments. The following argument types are supported:
    1. (a: tuple[torch.Tensor, torch.Tensor], b: tuple[torch.Tensor, torch.Tensor], d: torch.Tensor, grouped_layout: torch.Tensor, recipe: tuple[typing.SupportsInt, typing.SupportsInt, typing.SupportsInt] | None = None, recipe_a: tuple[typing.SupportsInt, typing.SupportsInt] | None = None, recipe_b: tuple[typing.SupportsInt, typing.SupportsInt] | None = None, compiled_dims: str = 'nk', disable_ue8m0_cast: bool = False, use_psum_layout: bool = False, expected_m_for_psum_layout: typing.SupportsInt | None = None) -> None

parrot18 · 2026-04-29T08:34:15Z

Please use the latest branch code. After PR #23686, is_fp4_expert will use flashinfer_mxfp4 instead of deep_gemm.

parrot18 requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners April 26, 2026 07:02

gemini-code-assist Bot reviewed Apr 26, 2026

parrot18 requested review from hanming-lu, hnyls2002, xiezhq-hermann and yizhang2077 as code owners April 26, 2026 16:16

Fridge003 mentioned this pull request Apr 26, 2026

DeepSeek V4 Roadmap #23602

Open

34 tasks

Fridge003 reviewed Apr 26, 2026

Comment thread python/sglang/srt/mem_cache/hisparse_memory_pool.py Outdated

parrot18 force-pushed the feat/fast-warmup-deepseek-v4 branch from 52e846f to 1e0defd Compare April 27, 2026 13:38

junliu-mde mentioned this pull request Apr 27, 2026

[Bug] compile_deep_gemm fails with DeepGEMM 2.4.2: missing get_compile_mode/set_compile_mode in DeepSeek-V4-Flash FP8 image #23843

Closed

5 tasks

Fridge003 reviewed Apr 27, 2026

Comment thread python/sglang/srt/environ.py

Fridge003 approved these changes Apr 27, 2026

Fridge003 mentioned this pull request Apr 27, 2026

Enable DeepGemm warmup in DeepSeek-V4 cookbook #23883

Merged

5 tasks

Fridge003 merged commit c409f44 into sgl-project:deepseek_v4 Apr 27, 2026
1 check passed
