[Fix] Disable FlashInfer allreduce fusion under deterministic inference#24629
Merged
Jiminator merged 2 commits into sgl-project:main · May 11, 2026
Conversation
The model-specific auto-enable in `_handle_model_specific_adjustments` turns on `enable_flashinfer_allreduce_fusion` on SM90/SM100 with TP>1 for several MoE arches (DeepseekV3, GptOss, Glm4Moe, Qwen3MoE, Qwen3Next, KimiK2.5, Qwen3.5 MoE/non-MoE). The fused kernel is non-deterministic across batch shapes, which violates the determinism contract enforced by `--enable-deterministic-inference` and breaks the `nightly-test-general-4-gpu-h100 :: TestFlashInferDeterministic.test_prefix_with_logprobs` CI lane (regressed by PR sgl-project#22664 since 2026-04-19, still failing as of 2026-05-07). Mirror the existing aiter handling and additionally set `enforce_disable_flashinfer_allreduce_fusion=True` so any downstream re-check honors the deterministic mode.
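For concreteness, a minimal sketch of what that mirroring could look like inside `ServerArgs._handle_deterministic_inference` (the flag and method names come from this PR; the log wording and surrounding structure are illustrative, not the exact diff):

```python
import logging

logger = logging.getLogger(__name__)


class ServerArgs:  # sketch only; the real dataclass lives in python/sglang/srt/server_args.py
    def _handle_deterministic_inference(self):
        if not self.enable_deterministic_inference:
            return
        # Existing behavior: the aiter allreduce fusion is already force-disabled here.
        if self.enable_aiter_allreduce_fusion:
            logger.warning(
                "--enable-aiter-allreduce-fusion is disabled because "
                "--enable-deterministic-inference is set."
            )
            self.enable_aiter_allreduce_fusion = False
        # This PR: mirror the same handling for the FlashInfer fused allreduce path.
        if self.enable_flashinfer_allreduce_fusion:
            logger.warning(
                "--enable-flashinfer-allreduce-fusion is disabled because "
                "--enable-deterministic-inference is set."
            )
            self.enable_flashinfer_allreduce_fusion = False
```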
Contributor
Code Review
This pull request updates the `_handle_deterministic_inference` method in `server_args.py` to automatically disable `flashinfer_allreduce_fusion` when deterministic inference is enabled. The review feedback suggests two improvements: add the argument prefix to the warning for log-message consistency, and move the enforcement-flag assignment to an earlier point in initialization so it actually drives the override logic instead of being set redundantly.
- Hoist `enforce_disable_flashinfer_allreduce_fusion=True` to before `_handle_model_specific_adjustments` so the auto-enable's enforce check (line 2378) actually drives the override, instead of being set redundantly afterwards.
- Add the leading `--` prefix to the warning so it matches the existing aiter-fusion warning style.

Re-verified: `TestFlashInferDeterministic.test_prefix_with_logprobs` still passes locally (94.6s, OK, logprobs identical across batch sizes) on `Qwen/Qwen3-Next-80B-A3B-Instruct`, `--tp 4`, H100.
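A sketch of the `__post_init__` ordering the review asks for (method and flag names come from this PR; everything else in `__post_init__` is elided or assumed):

```python
def __post_init__(self):
    # ... argument normalization elided ...

    # Hoisted (per review): mark the enforce flag before the model-specific
    # adjustments run, so the enforce check that already exists at the end of
    # _handle_model_specific_adjustments (line ~2378) is the single place that
    # keeps the fusion disabled, rather than re-clearing it afterwards.
    if self.enable_deterministic_inference:
        self.enforce_disable_flashinfer_allreduce_fusion = True

    self._handle_model_specific_adjustments()

    # ... other handlers elided ...

    # Warns and clears enable_flashinfer_allreduce_fusion (and the aiter
    # counterpart) when deterministic inference is requested.
    self._handle_deterministic_inference()
```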
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Collaborator
Author
/tag-and-rerun-ci
BBuf approved these changes on May 8, 2026
ltcs11 added a commit to ltcs11/sglang that referenced this pull request on May 11, 2026
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU] Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#	python/sglang/srt/utils/common.py
Motivation
PR #22664 (commit `c6a45fab64`) added `Qwen3NextForCausalLM` to the model-arch list in `ServerArgs._handle_model_specific_adjustments` that auto-enables `enable_flashinfer_allreduce_fusion` on SM90/SM100 with `tp_size > 1`. The fused FlashInfer/TRTLLM allreduce kernel chooses different reduction shapes/orderings depending on batch size, which breaks bit-exact determinism. This violates the `--enable-deterministic-inference` contract, and has been failing the `nightly-test-general-4-gpu-h100 :: TestFlashInferDeterministic.test_prefix_with_logprobs` CI lane on every dispatched run from 2026-04-19 through today (e.g. run 24971499389, run 25469734855). The same auto-enable list also covers DeepseekV3, GptOss, Glm4Moe, Qwen3MoE, KimiK2.5, and Qwen3.5 MoE/non-MoE, so any other deterministic-inference test on those arches with TP>1 on H100/H200/B200 is silently affected as well. `_handle_deterministic_inference` currently force-disables only the aiter allreduce fusion path; the FlashInfer counterpart is missing.
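For context, a rough sketch of the auto-enable path described above (the suffixed architecture names, the `model_architecture` attribute, and the capability helper are illustrative assumptions; only the flags and the trailing enforce check are named in this PR):

```python
import torch

# Architectures for which the auto-enable fires (ForCausalLM suffixes are illustrative,
# not verbatim from server_args.py).
_FLASHINFER_ALLREDUCE_FUSION_ARCHS = {
    "DeepseekV3ForCausalLM",
    "GptOssForCausalLM",
    "Glm4MoeForCausalLM",
    "Qwen3MoeForCausalLM",
    "Qwen3NextForCausalLM",  # added by sgl-project#22664
    # ... KimiK2.5 / Qwen3.5 entries omitted ...
}


def _is_sm90_or_sm100() -> bool:
    # SM90 (Hopper) reports major 9, SM100 (Blackwell) reports major 10.
    major, _ = torch.cuda.get_device_capability()
    return major in (9, 10)


def _handle_model_specific_adjustments(self):  # sketch of a ServerArgs method
    if (
        self.model_architecture in _FLASHINFER_ALLREDUCE_FUSION_ARCHS  # assumed attribute name
        and _is_sm90_or_sm100()
        and self.tp_size > 1
    ):
        self.enable_flashinfer_allreduce_fusion = True

    # Existing enforce check at the end of the handler (around line 2378); this PR
    # makes deterministic inference set the enforce flag before we reach this point.
    if self.enforce_disable_flashinfer_allreduce_fusion:
        self.enable_flashinfer_allreduce_fusion = False
```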
Modifications

- In `python/sglang/srt/server_args.py:_handle_deterministic_inference`, mirror the existing `enable_aiter_allreduce_fusion` handling — when `enable_deterministic_inference` is set, warn and clear `enable_flashinfer_allreduce_fusion`.
- In `__post_init__`, set `enforce_disable_flashinfer_allreduce_fusion = True` before `_handle_model_specific_adjustments` runs, so the existing enforce check at the end of that handler (already present, line ~2378) is the single source of truth that drives the override. This addresses the concern that setting the enforce flag inside `_handle_deterministic_inference` would be redundant, given that `_handle_model_specific_adjustments` runs earlier in `__post_init__`.

No change to default (non-deterministic) behavior; `enable_flashinfer_allreduce_fusion` is still auto-enabled for the listed MoE arches when deterministic inference is not requested.
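A usage-level sketch of the intended end state, assuming an SM90/SM100 machine with `tp_size > 1`; the real `ServerArgs` constructor takes many more fields, so this is illustrative only:

```python
from sglang.srt.server_args import ServerArgs

# Deterministic mode: the FlashInfer fused allreduce stays off even though the
# model is in the MoE auto-enable list.
det_args = ServerArgs(
    model_path="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tp_size=4,
    enable_deterministic_inference=True,
)
assert det_args.enable_flashinfer_allreduce_fusion is False

# Default mode: behavior is unchanged, the fusion is still auto-enabled on SM90/SM100.
default_args = ServerArgs(
    model_path="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tp_size=4,
)
assert default_args.enable_flashinfer_allreduce_fusion is True
```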
Accuracy Tests

Local repro on 8× H100 80GB (driver 580.126.20), `--tp 4`, `Qwen/Qwen3-Next-80B-A3B-Instruct`, command identical to CI:

| Commit | `enable_flashinfer_allreduce_fusion` | Result |
| --- | --- | --- |
| `4839cecbb0` (parent of #22664) | `False` | Ran 1 test in 118.4s / OK / ✓✓✓ Logprobs are identical across all batch sizes! ✓✓✓ |
| `c6a45fab64` (#22664, introducing the auto-enable) | `True` | Ran 1 test in 102.6s / FAILED (errors=1) / ✗✗✗ Some logprobs differ across batch sizes! ✗✗✗, 244 per-sample mismatches |
| This PR (`5b589ed2e7`) | `True` → `False` (forced via enforce flag) | Ran 1 test in 94.6s / OK / ✓✓✓ Logprobs are identical across all batch sizes! ✓✓✓ |

Failing-vs-passing diffs reproduce the exact CI numerical fingerprint, e.g.
`Logprob mismatch at position 0: -2.3552... vs -2.3723... (diff 0.017068...)`, the same value seen in the failing nightly logs.
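For reference, a sketch of the kind of bit-exactness check the determinism test enforces. This is not the actual `TestFlashInferDeterministic` code; the endpoint, port, and payload fields are assumptions based on SGLang's native `/generate` API:

```python
import requests

PROMPT = "The capital of France is"
URL = "http://127.0.0.1:30000/generate"  # assumed local server launched with --enable-deterministic-inference


def prefill_logprobs(batch_size: int) -> list:
    """Return the input-token logprobs of the first request in a batch of `batch_size`."""
    resp = requests.post(
        URL,
        json={
            "text": [PROMPT] * batch_size,
            "sampling_params": {"temperature": 0.0, "max_new_tokens": 1},
            "return_logprob": True,
            "logprob_start_len": 0,
        },
    )
    first = resp.json()[0]
    # Each entry is (logprob, token_id, token_text); keep only the logprob.
    return [entry[0] for entry in first["meta_info"]["input_token_logprobs"]]


reference = prefill_logprobs(batch_size=1)
for bs in (2, 4, 8):
    # Deterministic mode requires bit-identical logprobs regardless of batch shape.
    assert prefill_logprobs(bs) == reference, f"Logprob mismatch at batch size {bs}"
```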
Speed Tests and Profiling

Not applicable — the change only fires when the user opts into `--enable-deterministic-inference`, where deterministic correctness already supersedes peak performance. No effect on the default (non-deterministic) path.

Checklist
Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`