[MUSA][16/N] Add MUSA backend support for layers and DeepSeek models (V2/V3/R1) #22774
Kangyan-Zhou merged 2 commits into sgl-project:main
Conversation
/label mthreads
Code Review
This pull request implements support for the MUSA (Moore Threads) architecture throughout the SGLang runtime, including updates to activation functions, layer normalization, MoE runners, and DeepSeek model implementations. The review feedback highlights a logic error in a platform check within the TopK selection and identifies several opportunities to improve performance by caching activation module instances instead of re-instantiating them during forward passes.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Motivation
Enable SGLang to run DeepSeek models on Moore Threads MUSA GPUs. This PR adds MUSA backend support across the inference stack, including layers, quantization, MoE, attention, speculative decoding, and custom op registration.
Modifications
Core Infrastructure:
- `utils/common.py`: Register custom ops on the `MUSA` dispatch key in `direct_register_custom_op`; extend `get_device_sm()` to detect MUSA SM versions via `torch.cuda.get_device_capability()` (works through torchada).
- `server_args.py`: Add `musa` as a recognized device in the CLI help string.
Layer Support:
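Because torchada routes `torch.cuda.get_device_capability()` to the MUSA driver, the `get_device_sm()` extension can reuse the usual CUDA arithmetic. A minimal sketch of that computation (helper names and the zero fallback are illustrative assumptions, not the PR's exact code):

```python
def sm_from_capability(capability):
    """Convert a (major, minor) device capability tuple into an integer
    SM version, e.g. (3, 1) -> 31. On MUSA, torchada makes
    torch.cuda.get_device_capability() return the MUSA capability, so the
    same arithmetic covers both CUDA and MUSA devices."""
    major, minor = capability
    return major * 10 + minor

def get_device_sm_sketch():
    """Hypothetical get_device_sm()-style helper: returns 0 when no
    accelerator (or no torch) is available."""
    try:
        import torch
        if torch.cuda.is_available():
            return sm_from_capability(torch.cuda.get_device_capability())
    except ImportError:
        pass
    return 0
```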
- `activation.py`: Add `forward_musa` for `SiluAndMul`: uses `forward_native` under piecewise CUDA graph, otherwise uses a lazily initialized `nn.SwishGLU` for better MUSA performance.
- `layernorm.py`: Add `forward_musa` for `RMSNorm` with fused `add_rmsnorm` and a native `nn.functional.rms_norm` fallback; enable the flashinfer layernorm import on MUSA.
- `fp8.py`, `fp8_kernel.py`, `fp8_utils.py`, `unquant.py`: Import `sgl_per_token_quant_fp8` from `sgl_kernel` on MUSA; force the v2 per-token group quant path by default; enable the DeepGEMM FP8 path (`deep_gemm_fp8_fp8_bf16_nt`); add `forward_musa` to `UnquantizedFusedMoEMethod` delegating to `forward_cuda`.
- `sampler.py`: Reformat the import block (no functional change).
MoE:
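The `forward_native` path that `forward_musa` falls back to under piecewise CUDA graph computes SiLU over the gate half of the input and multiplies it by the up half. A dependency-free reference of that computation for a single row (the real implementation operates on torch tensors; this is a sketch for illustration):

```python
import math

def silu(x: float) -> float:
    # SiLU (swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def silu_and_mul(x: list[float]) -> list[float]:
    """Split the input into [gate | up] halves along the last dimension
    and return silu(gate) * up, mirroring SiluAndMul.forward_native."""
    d = len(x) // 2
    gate, up = x[:d], x[d:]
    return [silu(g) * u for g, u in zip(gate, up)]
```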
- `fused_moe.py`, `moe_align_block_size.py`: Import `moe_sum_reduce` and `moe_align_block_size` on MUSA; use pre-instantiated `nn.SwishGLU`/`nn.GELU` for activation; skip pre-allocating `intermediate_cache2` on MUSA; enable `moe_sum_reduce` for combine.
- `moe_runner/triton.py`: Use pre-instantiated `nn.SwishGLU`/`nn.GELU` for activation; skip pre-allocating `intermediate_cache2` on MUSA.
- `moe_runner/deep_gemm.py`: Use the `DEEPGEMM_NEED_TMA_ALIGNED_SCALES` flag to skip TMA alignment on MUSA; skip `get_mn_major_tma_aligned_tensor` for scale tensors on MUSA.
- `ep_moe/kernels.py`: Add `ATOMIC_ADD_SEM="relaxed"` for Triton `tl.atomic_add` on MUSA for better performance; enable the `per_token_group_quant_fp8` import on MUSA.
- `topk.py`: Import `moe_fused_gate` and `topk_softmax` from `sgl_kernel` on MUSA; enable the `biased_grouped_topk_gpu` path.
DeepSeek Model:
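`moe_align_block_size` exists because the fused MoE GEMM tiles work in fixed-size blocks, so each expert's token count must be padded up to a block multiple. A simplified pure-Python illustration of that padding arithmetic (the real kernel also emits sorted token indices; the function names here are illustrative, not the kernel's API):

```python
def align_up(n: int, block_size: int) -> int:
    # Round n up to the next multiple of block_size.
    return (n + block_size - 1) // block_size * block_size

def padded_expert_counts(topk_ids: list[int], num_experts: int, block_size: int):
    """Count tokens routed to each expert, then pad every count to a
    block_size multiple; returns (padded_counts, total_padded_tokens)."""
    counts = [0] * num_experts
    for expert in topk_ids:
        counts[expert] += 1
    padded = [align_up(c, block_size) for c in counts]
    return padded, sum(padded)
```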
- `deepseek_weight_loader.py`: Enable FP8 block quant weight loading on MUSA; dequantize `w_kc`/`w_vc` to bfloat16 at load time (temporary until an FP8 bmm kernel is available on MUSA).
- `forward_mha.py`, `forward_mla.py`: Enable the MHA fused QK-NOPE path on MUSA; add an MLA `torch.bmm` path for the `w_vc` projection on MUSA.
- `deepseek_v2.py`: Import `dsv3_fused_a_gemm` and `dsv3_router_gemm` from `sgl_kernel`; enable `alt_stream`, shared expert fusion (SM >= 31), and `routed_scaling_factor` handling on MUSA.
- `utils.py`: Export the `_is_musa` flag for deepseek common modules.
Speculative Decoding:
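Dequantizing `w_kc`/`w_vc` at load time amounts to multiplying each quantized block by its per-block scale before casting to bfloat16, after which the MLA `torch.bmm` path runs entirely in bfloat16. A schematic of blockwise dequantization on a 1-D weight (the block layout and scale shape are simplifying assumptions for illustration):

```python
def dequantize_blockwise(q: list[float], scales: list[float], block_size: int) -> list[float]:
    """Multiply each contiguous block of quantized values by its scale.
    In this PR the equivalent step happens once at weight-load time, so
    MUSA can run subsequent bmms in bfloat16 until an FP8 bmm kernel
    is available."""
    return [v * scales[i // block_size] for i, v in enumerate(q)]
```

This trades extra memory (bfloat16 instead of FP8 storage) for compatibility, which is why the PR labels it temporary.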
- `eagle_utils.py`, `eagle_worker.py`: Import `build_tree_kernel_efficient` and `verify_tree_greedy` on MUSA; disable `torch.compile` on MUSA.
- `spec_utils.py`: Refactor a Triton boolean expression to avoid chained `or` operators (a MUSA Triton limitation).
Other:
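The `spec_utils.py` workaround rewrites a predicate built from chained `or`s into explicit element-wise operators, which is the form the MUSA Triton frontend accepts. A schematic of the transformation using plain Python booleans (inside a Triton kernel the same rewrite applies `&`/`|` to `tl` tensors; the function and parameter names here are illustrative):

```python
def in_range_chained(x: int, lo1: int, hi1: int, lo2: int, hi2: int) -> bool:
    # Original style: chained boolean `or`, which the MUSA Triton
    # frontend cannot compile reliably.
    return (lo1 <= x < hi1) or (lo2 <= x < hi2)

def in_range_bitwise(x: int, lo1: int, hi1: int, lo2: int, hi2: int) -> bool:
    # Refactored style: one explicit element-wise `&`/`|` per
    # parenthesized term, matching Triton's tensor semantics.
    return bool(((lo1 <= x) & (x < hi1)) | ((lo2 <= x) & (x < hi2)))
```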
- `scheduler_metrics_mixin.py`: Add a `musa` graph label.
- `environ.py`: Add the `SGLANG_DEEPGEMM_SANITY_CHECK` env var.
- `configurer.py`, `entrypoint.py`, `compile_utils.py`: Enable DeepGEMM on MUSA SM >= 31; skip the JIT compilation hook on MUSA (returns `nullcontext()`); introduce the `DEEPGEMM_NEED_TMA_ALIGNED_SCALES` flag; migrate the sanity check flag to the centralized `envs` config.
Accuracy Tests
Note: The accuracy tests below are run with `--disable-piecewise-cuda-graph` because some MUSA-optimized kernels cannot currently adapt to the piecewise CUDA graph mechanism. We will gradually fix this to support piecewise CUDA graphs on MUSA in future updates.
DeepSeek-Coder-V2-Lite-Instruct
run command:
result:
DeepSeek-V2-Lite-Chat-FP8
run command:
result:
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci