Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head #24775
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

/tag-and-rerun-ci
Co-authored-by: Cheng Wan <chwan@rice.edu>
Co-authored-by: Chunan Zeng <zcnrex@gmail.com>
Force-pushed from 78e5ce3 to abeb7f8
/rerun-stage stage-c-test-dsv4-4-gpu-b200

/rerun-stage stage-c-test-dsv4-8-gpu-h200

❌ Stage not found. NVIDIA stages:
AMD stages:
Other stages will be added soon. For now, use

❌ Stage not found. NVIDIA stages:
AMD stages:
Other stages will be added soon. For now, use
/rerun-test registered/4-gpu-models/test_deepseek_v4_flash_fp4_b200.py

❌ Known suites:
/rerun-stage stage-c-test-dsv4-4-gpu-b200

✅ Triggered

/rerun-stage stage-c-test-dsv4-8-gpu-h20

/rerun-stage stage-c-test-dsv4-8-gpu-h200

❌ Stage not found. NVIDIA stages:
AMD stages:
Other stages will be added soon. For now, use

✅ Triggered
T.alloc_fragment does not guarantee zero initialization. The sumsq_per_pos accumulator must be explicitly cleared before the pipelined loop to avoid garbage values corrupting the RMSNorm computation, which caused all-zero model output. Co-authored-by: Cheng Wan <chwan@rice.edu>
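For illustration, a minimal TileLang-style sketch of the pattern this fix restores. This mirrors the canonical tiled-GEMM structure rather than the actual `mhc_pre_big_fuse` kernel; the function and buffer names below are hypothetical. The point is the explicit `T.clear` on the fragment accumulator before the `T.Pipelined` loop.

```python
# Hedged sketch, not PR code: shows why an accumulator from T.alloc_fragment
# must be cleared explicitly before a pipelined reduction loop.
import tilelang.language as T

def matmul_like(M, N, K, block_M=64, block_N=64, block_K=32, dtype="float16"):
    @T.prim_func
    def main(A: T.Tensor((M, K), dtype),
             B: T.Tensor((K, N), dtype),
             C: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            acc = T.alloc_fragment((block_M, block_N), "float32")
            # T.alloc_fragment does NOT zero the registers. Without this clear
            # the accumulator starts from garbage, the same failure mode the
            # commit describes for sumsq_per_pos in the RMSNorm path.
            T.clear(acc)
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, acc)
            T.copy(acc, C[by * block_M, bx * block_N])
    return main
```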
/rerun-stage stage-c-test-dsv4-8-gpu-h200

/rerun-stage stage-c-test-dsv4-4-gpu-b200

✅ Triggered

✅ Triggered
All tests related to DSV4 have passed.
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU] Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#	python/sglang/srt/utils/common.py
Summary

- Use `tf32_hc_prenorm_gemm` (DeepGemm) for the mhc_pre GEMM when `SGLANG_OPT_DEEPGEMM_HC_PRENORM` is enabled
- Fuse the norm into the `mhc_pre_big_fuse` kernel (eliminates the separate norm kernel launch + HBM round-trip)
- Fused `hc_head` (fuses RMSNorm + Linear + Sigmoid-gate + weighted-sum into one kernel); see the reference sketch below
- Ported from:
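For orientation, a rough PyTorch reference of the operations `hc_head` fuses into one launch. The stream layout, weight shapes, and variable names here are assumptions for illustration, not the PR's actual implementation; only the op sequence (RMSNorm, Linear, Sigmoid-gate, weighted-sum) and the DSV4 params (hidden=7168, hc_mult=4) come from this description.

```python
import torch

def hc_head_reference(x, norm_weight, head_weight, eps=1e-6):
    # x: [tokens, hc_mult, hidden] hyper-connection streams (assumed layout)
    rms = torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + eps)
    x_norm = (x.float() * rms * norm_weight).to(x.dtype)   # RMSNorm over hidden
    gates = torch.sigmoid(x_norm @ head_weight)            # Linear + sigmoid gate, [tokens, hc_mult, 1]
    return (gates * x).sum(dim=1)                          # weighted sum over streams, [tokens, hidden]

tokens, hc_mult, hidden = 8, 4, 7168
x = torch.randn(tokens, hc_mult, hidden, dtype=torch.bfloat16)
norm_w = torch.ones(hidden, dtype=torch.bfloat16)
head_w = torch.randn(hidden, 1, dtype=torch.bfloat16)      # assumed per-stream gate projection
out = hc_head_reference(x, norm_w, head_w)                 # rough stand-in for the fused kernel's output
```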
Microbench
Standalone microbench of `mhc_pre` (with the real sglang `RMSNorm` kernel as baseline) and `hc_head`, DSV4 params: hidden=7168, hc_mult=4. CUDA event timing, 100 iters, trimmed top/bottom 10%.

norm + mhc_pre (called 2x per decoder layer, both prefill and decode):
hc_head (called 1x per forward on last PP rank):
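For context, the timing methodology above (CUDA events, 100 iters, trimmed top/bottom 10%) corresponds roughly to a harness like the following sketch; `fn` is a placeholder for the benchmarked kernel call and is not from the PR.

```python
import torch

def bench(fn, iters=100, warmup=10, trim=0.10):
    # Warm up so JIT/compile and cache effects do not pollute the measurement.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    times_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times_ms.append(start.elapsed_time(end))
    times_ms.sort()
    k = int(len(times_ms) * trim)            # drop top/bottom 10% as outliers
    kept = times_ms[k:len(times_ms) - k]
    return sum(kept) / len(kept)             # mean of trimmed samples, in ms
```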
Limitations
- The microbench measures `mhc_pre` and `hc_head` in isolation, not an end-to-end serving benchmark. Real-world gains depend on how much time these kernels contribute to total per-token latency.