feat: port SGLANG_JIT_DEEPGEMM_FAST_WARMUP to deepseek_v4 branch #23756
Code Review
This pull request implements a fast warmup mode for DeepGEMM JIT compilation by sampling batch sizes, reducing initialization overhead. It also adds environment variables for configuration, improves MUSA support, and includes safety checks for DeepGEMM API calls. Reviewers identified a potential bug in compile mode restoration, suggested using dynamic environment variable lookups to support configuration overrides, and recommended capping the batch size sampling range to prevent redundant compilation.
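The dynamic-lookup suggestion can be illustrated with a small sketch. It reads the flag at call time rather than caching it at import time, so an override set after import is still honored. The helper name `fast_warmup_enabled` and the accepted truthy strings are assumptions for illustration; SGLang's own environment parsing may differ.

```python
import os

# Assumed set of truthy spellings; the project's real parser may accept others.
_TRUTHY = {"1", "true", "yes", "on"}


def fast_warmup_enabled(environ=None):
    """Return True if SGLANG_JIT_DEEPGEMM_FAST_WARMUP is set to a truthy value.

    The environment is read on every call, so later overrides take effect.
    `environ` is a hypothetical injection point to make testing easier.
    """
    env = os.environ if environ is None else environ
    raw = env.get("SGLANG_JIT_DEEPGEMM_FAST_WARMUP", "false")
    return raw.strip().lower() in _TRUTHY
```

Reading a module-level constant instead would freeze the value at import, which is the situation the reviewers flagged.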
… time In the deepseek_v4 branch, DeepGEMM JIT compiles up to 16K M values during CUDA graph warmup. With TP=4 on B200, this exceeds NCCL timeout thresholds and causes initialization failures. SGLANG_JIT_DEEPGEMM_FAST_WARMUP=True replaces the full M-list with a sparse sampled set (~2560 values): all M in [1,1024] for decode performance, plus geometrically-spaced values up to chunked_prefill_size for prefill coverage. This reduces cold-start time from >30min to ~5.5min while preserving decode TPOT (~22ms/tok on B200). Also guard get/set_compile_mode calls with hasattr() to support DeepGEMM versions that do not expose this API. Signed-off-by: yingru <yingru@baidu.com>
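A minimal sketch of the hasattr() guard described in the commit message, combined with restoration of the previous compile mode, which the review flagged as a potential bug. The context manager name `compile_mode` and the `gemm_module` parameter are hypothetical; only the `get_compile_mode` and `set_compile_mode` names come from the commit message, and this is not necessarily how the PR implements it.

```python
from contextlib import contextmanager


@contextmanager
def compile_mode(gemm_module, mode):
    """Temporarily set the compile mode on `gemm_module`, if it supports it.

    `gemm_module` stands in for the imported DeepGEMM module (assumption).
    Versions without get/set_compile_mode are left untouched.
    """
    has_api = hasattr(gemm_module, "get_compile_mode") and hasattr(
        gemm_module, "set_compile_mode"
    )
    if not has_api:
        # Older API: nothing to set and nothing to restore.
        yield
        return

    previous = gemm_module.get_compile_mode()
    gemm_module.set_compile_mode(mode)
    try:
        yield
    finally:
        # Restore even if warmup raises, so the mode never leaks.
        gemm_module.set_compile_mode(previous)
```

Putting the restore step in `finally` means an exception during warmup cannot leave the module in the temporary mode.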
52e846f to 1e0defd
export SGLANG_JIT_DEEPGEMM_PRECOMPILE=1 does not work on B300.
Please use the latest branch code. After PR #23686, is_fp4_expert will use flashinfer_mxfp4 instead of deep_gemm.
Background
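When deploying DeepSeek-V4-Flash with TP>=2 and CUDA graph enabled, the server fails to start due to NCCL timeout. The root cause: in the deepseek_v4 branch, DeepGEMM JIT compiles up to 16K M values during CUDA graph warmup, and with TP=4 on B200 this takes long enough to exceed the NCCL timeout thresholds.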
Without CUDA graph, single-request decode TPOT is ~133ms/tok regardless of context length (kernel launch overhead dominates). With CUDA graph, single-request decode TPOT drops to ~22ms/tok (~6x improvement). So enabling CUDA graph is critical for decode performance.
What FAST_WARMUP does
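Ported from main branch (PR #18111), this feature reduces the M list from ~16384 to ~2560 values: all M in [1, 1024] for decode performance, plus geometrically spaced values up to chunked_prefill_size for prefill coverage.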
This reduces Rank 0 compilation time from ~5-10min to ~90s, avoiding the NCCL timeout. Total cold start with CUDA graph: ~5.5min.
Tradeoff: some prefill M values may not be pre-compiled, causing one-time JIT delay on first encounter. Decode is unaffected since all M<=1024 are always compiled.
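The sampling idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the function name `sample_m_values`, the `dense_limit` parameter, and the doubling-stride scheme (stride 2 in (1024, 2048], stride 4 in (2048, 4096], and so on) are all hypothetical. The scheme was chosen because it yields 2560 values when the cap is 8192, which is consistent with the ~2560 figure above.

```python
def sample_m_values(max_m, dense_limit=1024):
    """Return a sorted, duplicate-free list of M values to pre-compile.

    Every M up to `dense_limit` is kept (decode range). Above that, each
    power-of-two interval is sampled with a stride that doubles per
    interval (assumed scheme). `max_m` caps the range, e.g. at
    chunked_prefill_size.
    """
    m_values = list(range(1, min(dense_limit, max_m) + 1))

    start, step = dense_limit, 2
    while start < max_m:
        stop = min(2 * start, max_m)
        # Sample (start, stop] with the current stride.
        m_values.extend(range(start + step, stop + 1, step))
        start, step = 2 * start, 2 * step

    return m_values


if __name__ == "__main__":
    ms = sample_m_values(8192)
    print(len(ms))  # 2560
```

Capping at `max_m` keeps the list from growing past the largest prefill batch the server can see, which is in the spirit of the reviewers' request to bound the sampling range.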
Changes
Usage
SGLANG_JIT_DEEPGEMM_FAST_WARMUP=True python3 -m sglang.launch_server \
  --model-path --tp-size 4 --moe-runner-backend deep_gemm ...