Fix reduce_scatterv producer contract for SUM_LEN#24785
Merged
Conversation
Contributor
There was a problem hiding this comment.
Code Review
This pull request modifies the should_use_reduce_scatter method in communicator.py to permit reduce-scatter operations when should_use_dp_reduce_scatterv() is true, expanding the conditions beyond the previous requirement of maximum length padding. I have no feedback to provide.
Contributor
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Collaborator
Author
|
/tag-and-rerun-ci |
mmangkad
approved these changes
May 10, 2026
Contributor
mmangkad
left a comment
There was a problem hiding this comment.
Looks good
ChatCompletionSampler initialized with self.system_message=None self.temperature=1.0 self.max_tokens=512 self.reasoning_effort=None self.extra_body={'chat_template_kwargs': {'thinking': False}}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1299/1299 [01:08<00:00, 19.00it/s]
Total latency: 69.474 s
Score: 0.965
Output throughput: 2176.574 token/s
[METRIC] gsm8k_score=0.9645881447267128 labels={"model": "moonshotai/Kimi-K2.6", "eval": "gsm8k"}
[METRIC] gsm8k_latency=69.47386905899998 labels={"model": "moonshotai/Kimi-K2.6", "eval": "gsm8k"}
Writing report to /tmp/gsm8k_moonshotai_Kimi-K2.6.html
{'score:std': np.float64(0.18481844004154702), 'score': np.float64(0.9645881447267128), 'latency': 69.47386905899998, 'output_throughput': 2176.573754249705}
Writing results to /tmp/gsm8k_moonshotai_Kimi-K2.6.json
Collaborator
|
I can also confirm this fix |
b8zhong
approved these changes
May 10, 2026
ltcs11
added a commit
to ltcs11/sglang
that referenced
this pull request
May 11, 2026
* main: (87 commits) [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629) fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217) Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742) Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785) [NPU]Documentation update for communications quantization feature (sgl-project#24668) [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851) [Model] Add MiniCPM-V 4.6 support (sgl-project#24855) Support Intern-S2-Preview (sgl-project#24875) [PD] Unify dsv4 dispatch with swa (sgl-project#24888) Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775) Fix PD bootstrap failure handling (sgl-project#24772) [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881) [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878) [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859) [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696) [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826) [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456) [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854) [diffusion] CI: add cache-dit CI tests (sgl-project#19213) [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767) ... # Conflicts: # python/sglang/srt/utils/common.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Related to #23554.
_scatter_hidden_states()uses DPreduce_scattervwhenevershould_use_dp_reduce_scatterv()is true. However,LayerCommunicator.should_use_reduce_scatter()only reported true for_scatter_hidden_statesinMAX_LENmode. InSUM_LENprefill, producers such as dense MLP layers could still perform their TP all-reduce before the communicator performedreduce_scatterv, which double-reduced the hidden states and caused long-prompt accuracy regressions.Kimi K2.6 exposes this because it has an early dense MLP layer (
first_k_dense_replace=1), so the dense producer path must follow the same reduce-scatter contract as the communicator postprocess path.Modifications
Make
should_use_reduce_scatter()return true for_scatter_hidden_stateswhenevershould_use_dp_reduce_scatterv()is true. This aligns the producer contract with the communicator path: if the communicator will run DPreduce_scatterv, the producer skips its internal TP all-reduce.Accuracy Tests
GSM8K on Kimi K2.6, GB300, TP8/DP8/EP8, DP attention,
num_examples=1319,num_threads=1319,num_shots=20,max_tokens=16384.Bad baseline before this fix:
0.5642802155504234, invalid extracted answers422, empty responses410.This PR:
0.9753656658968437, correct1267, wrong32, invalid extracted answers0, empty responses0.Full final GSM8K log:
Speed Tests and Profiling
This is a correctness fix for the producer/communicator reduce-scatter contract, not a performance-oriented change. The GSM8K validation run above reported total latency
343.077 sand output throughput3616.264 token/s.Checklist
Review and Merge Process
/tag-and-rerun-ci,/tag-run-ci-label,/rerun-failed-ci