Conversation
Code Review
This pull request unifies DeepSeek-V4 (dsv4) state handling with Sliding Window Attention (swa) by removing specialized dsv4 logic and types across the disaggregation modules. Feedback suggests clarifying a comment in mooncake/conn.py to specify that the restriction on different Tensor Parallel (TP) sizes applies only to non-MLA models, as the current wording is misleading following the unification.
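For readers skimming the change, below is a minimal, hypothetical sketch of the dispatch unification this PR performs. The helper function and the stub senders are illustrative assumptions; only the path names (`_send_state_pages_flat`, `_send_kvcache_generic`) and the `state_type` values come from the PR description itself.

```python
# Hypothetical sketch only: the helper and the stub senders are illustrative,
# not the actual disaggregation dispatch code.

def _send_state_pages_flat(*args, **kwargs):   # stub: old dedicated dsv4 path
    raise NotImplementedError

def _send_kvcache_generic(*args, **kwargs):    # stub: generic path used by SWA/NSA
    raise NotImplementedError

def pick_sender(state_type: str):
    # Before this PR (sketch): dsv4 carried its own discriminator and was routed
    # to a stricter, dedicated sender:
    #   if state_type == "dsv4":
    #       return _send_state_pages_flat
    # After this PR (sketch): V4 no longer has a separate state_type, so its
    # heterogeneous state list flows through the same branch as SWA/NSA.
    if state_type in ("swa", "nsa"):
        return _send_kvcache_generic
    raise ValueError(f"unknown state_type: {state_type}")
```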
/tag-and-rerun-ci

/rerun-test test/registered/disaggregation/test_disaggregation_basic.py::TestDisaggregationAccuracy test/registered/disaggregation/test_disaggregation_basic.py::TestDisaggregationMooncakeSpec test/registered/disaggregation/test_disaggregation_xpu.py::TestDisaggregationNixlBasic test/registered/distributed/test_disaggregation_different_tp.py test/registered/distributed/test_disaggregation_pp.py

🚀 🚀 🚀
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU]Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#   python/sglang/srt/utils/common.py
Motivation
PR #23882 introduced an independent `state_type="dsv4"` discriminator and a dedicated NIXL transport path (`_send_state_pages_flat`) for V4's heterogeneous state pool. PR #24878 then routed V4 mooncake through the existing `["swa", "nsa"]` branch's `_send_kvcache_generic`, proving empirically that V4's heterogeneous state list (SWA + compress + indexer ring buffers) works correctly with the same generic transfer path used by SWA.

The independent `state_type="dsv4"` is therefore redundant. Its sole non-trivial consumer, NIXL's `_send_state_pages_flat`, also hard-asserts `src_state_item_lens[i] == dst_state_item_lens[i]` per entry, which does not hold under MTP (the decode-side indexer pool carries an extra EAGLE draft layer). Removing the discriminator routes V4 + NIXL through the more permissive generic path on both backends.

Empirically, this also fixes a silent V4 + NIXL + MTP regression (gsm8k: 0.890 → 0.970).
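To make the MTP failure mode concrete, here is a small illustration of the per-entry length check described above. All byte counts are made up; only the variable names (`src_state_item_lens`, `dst_state_item_lens`) and the equality condition come from the description, and the real pools obviously hold different values.

```python
# Illustrative only: made-up lengths showing why a per-entry equality assert breaks
# under MTP, where the decode-side indexer pool carries an extra EAGLE draft layer.

src_state_item_lens = [4096, 1024, 512]   # prefill side: SWA, compress, indexer (bytes, made up)
dst_state_item_lens = [4096, 1024, 640]   # decode side: indexer entry is larger under MTP

# The removed _send_state_pages_flat path hard-asserted per-entry equality:
for i, (src_len, dst_len) in enumerate(zip(src_state_item_lens, dst_state_item_lens)):
    try:
        assert src_len == dst_len
    except AssertionError:
        print(f"entry {i}: src {src_len} != dst {dst_len} -> dedicated dsv4 path fails")

# The generic _send_kvcache_generic path does not impose this per-entry constraint,
# which is why routing V4 through it tolerates the MTP layout.
```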
Accuracy
1P+1D V4-Flash, TP=4, gsm8k 200 examples.
cc: @ShangmingCai @ch-wan @hnyls2002