
[Fix] Disable FlashInfer allreduce fusion under deterministic inference #24629

Merged
Jiminator merged 2 commits into sgl-project:main from Jiminator:fix/deterministic-disable-flashinfer-allreduce-fusion on May 11, 2026

Conversation

@Jiminator (Collaborator) commented May 7, 2026

Motivation

PR #22664 (commit c6a45fab64) added Qwen3NextForCausalLM to the model-arch list in ServerArgs._handle_model_specific_adjustments that auto-enables enable_flashinfer_allreduce_fusion on SM90/SM100 with tp_size > 1. The fused FlashInfer/TRTLLM allreduce kernel chooses different reduction shapes/orderings depending on batch size, which breaks bit-exact determinism.
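
For intuition: float32 addition is not associative, so a kernel that changes its reduction tree with batch size can legitimately return different low-order bits for identical inputs. A standalone illustration in plain NumPy (not sglang code):

```python
# Standalone demonstration (not sglang code): float32 addition is not
# associative, so different reduction orders can change the low-order bits.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

seq = np.float32(0.0)
for v in x:                                 # left-to-right reduction
    seq = seq + v
tree = x.reshape(-1, 2).sum(axis=1).sum()   # pairwise/tree-style reduction

print(seq, tree, seq == tree)               # bits typically differ
```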

This violates the --enable-deterministic-inference contract and has been failing the nightly-test-general-4-gpu-h100 :: TestFlashInferDeterministic.test_prefix_with_logprobs CI lane on every dispatched run from 2026-04-19 through today (e.g. run 24971499389, run 25469734855). The same auto-enable list also covers DeepseekV3, GptOss, Glm4Moe, Qwen3MoE, KimiK2.5, and Qwen3.5 MoE/non-MoE, so any other deterministic-inference test on those arches with TP>1 on H100/H200/B200 is silently affected as well.

_handle_deterministic_inference currently force-disables only the aiter allreduce fusion path; the FlashInfer counterpart is missing.

Modifications

python/sglang/srt/server_args.py:

  1. In _handle_deterministic_inference, mirror the existing enable_aiter_allreduce_fusion handling — when enable_deterministic_inference is set, warn and clear enable_flashinfer_allreduce_fusion.
  2. In __post_init__, set enforce_disable_flashinfer_allreduce_fusion = True before _handle_model_specific_adjustments runs, so the existing enforce check at the end of that handler (already present, line ~2378) is the single source of truth that drives the override. This addresses the concern that setting the enforce flag inside _handle_deterministic_inference would be redundant given _handle_model_specific_adjustments runs earlier in __post_init__.

No change to default (non-deterministic) behavior; enable_flashinfer_allreduce_fusion is still auto-enabled for the listed MoE arches when deterministic inference is not requested.
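
Taken together, a minimal sketch of both modifications, with ServerArgs heavily reduced: only the attribute and method names quoted above come from the PR; the dataclass framing, the call ordering shown, and the stubbed-out auto-enable logic are assumptions for illustration, not the real code.

```python
# Minimal sketch, assuming simplified ServerArgs fields; the real class,
# its full __post_init__, and the arch/SM/TP auto-enable logic are elided.
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class ServerArgs:  # heavily reduced stand-in for sglang's real ServerArgs
    enable_deterministic_inference: bool = False
    enable_aiter_allreduce_fusion: bool = False
    enable_flashinfer_allreduce_fusion: bool = False
    enforce_disable_flashinfer_allreduce_fusion: bool = False

    def __post_init__(self):
        # Change 2: set the enforce flag before the model-specific
        # adjustments run, so their existing enforce check is the single
        # source of truth for the override.
        if self.enable_deterministic_inference:
            self.enforce_disable_flashinfer_allreduce_fusion = True
        self._handle_model_specific_adjustments()
        self._handle_deterministic_inference()

    def _handle_model_specific_adjustments(self):
        # Stub: the real method auto-enables the fusion for the listed MoE
        # arches on SM90/SM100 with tp_size > 1, then honors the enforce
        # flag; only the enforce check is modeled here.
        if self.enforce_disable_flashinfer_allreduce_fusion:
            self.enable_flashinfer_allreduce_fusion = False

    def _handle_deterministic_inference(self):
        if not self.enable_deterministic_inference:
            return
        # Change 1: mirror the existing aiter handling for FlashInfer.
        if self.enable_flashinfer_allreduce_fusion:
            logger.warning(
                "Disable --enable-flashinfer-allreduce-fusion for "
                "deterministic inference."
            )
            self.enable_flashinfer_allreduce_fusion = False


# Usage: with deterministic inference requested, the fusion ends up disabled.
args = ServerArgs(
    enable_deterministic_inference=True,
    enable_flashinfer_allreduce_fusion=True,
)
assert args.enable_flashinfer_allreduce_fusion is False
```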

Accuracy Tests

Local repro on 8× H100 80GB (driver 580.126.20), --tp 4, Qwen/Qwen3-Next-80B-A3B-Instruct, command identical to CI:

CUDA_VISIBLE_DEVICES=0,1,2,3 SGLANG_IS_IN_CI=true \
python test/registered/core/test_qwen3_next_deterministic.py \
  TestFlashInferDeterministic.test_prefix_with_logprobs -v
| Commit | enable_flashinfer_allreduce_fusion | Result |
| --- | --- | --- |
| 4839cecbb0 (parent of #22664) | False | PASS — Ran 1 test in 118.4s / OK / ✓✓✓ Logprobs are identical across all batch sizes! ✓✓✓ |
| c6a45fab64 (#22664, introducing) | True | FAIL — Ran 1 test in 102.6s / FAILED (errors=1) / ✗✗✗ Some logprobs differ across batch sizes! ✗✗✗ (244 per-sample mismatches) |
| HEAD (5b589ed2e7) | True | FAIL (same signature) |
| HEAD + this PR | False (forced via enforce flag) | PASS — Ran 1 test in 94.6s / OK / ✓✓✓ Logprobs are identical across all batch sizes! ✓✓✓ |

Failing-vs-passing diffs reproduce the exact CI numerical fingerprint, e.g. Logprob mismatch at position 0: -2.3552... vs -2.3723... (diff 0.017068...), the same value seen in the failing nightly logs.
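
For reference, the invariant the test enforces looks roughly like the sketch below; check_batch_invariance and run_prefill_logprobs are hypothetical names standing in for the real test harness, not its actual API:

```python
# Hypothetical sketch of the batch-invariance check; run_prefill_logprobs is
# an assumed interface (prompts -> per-prompt token logprob lists), not the
# real TestFlashInferDeterministic harness.
def check_batch_invariance(run_prefill_logprobs, prompt, batch_sizes=(1, 2, 4, 8)):
    baseline = run_prefill_logprobs([prompt])[0]
    for bs in batch_sizes[1:]:
        for i, sample in enumerate(run_prefill_logprobs([prompt] * bs)):
            for pos, (a, b) in enumerate(zip(baseline, sample)):
                # Bit-exact equality, not approximate closeness.
                assert a == b, (
                    f"Logprob mismatch at position {pos}: {a} vs {b} "
                    f"(diff {abs(a - b)}), batch size {bs}, sample {i}"
                )
```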

Speed Tests and Profiling

Not applicable — change only fires when the user opts into --enable-deterministic-inference, where deterministic correctness already supersedes peak performance. No effect on the default (non-deterministic) path.

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

The model-specific auto-enable in _handle_model_specific_adjustments
turns on enable_flashinfer_allreduce_fusion on SM90/SM100 with TP>1 for
several MoE arches (DeepseekV3, GptOss, Glm4Moe, Qwen3MoE, Qwen3Next,
KimiK2.5, Qwen3.5 MoE/non-MoE). The fused kernel is non-deterministic
across batch shapes, which violates the determinism contract enforced
by --enable-deterministic-inference and breaks the
nightly-test-general-4-gpu-h100 :: TestFlashInferDeterministic
.test_prefix_with_logprobs CI lane (regressed by PR sgl-project#22664 since
2026-04-19, still failing as of 2026-05-07).

Mirror the existing aiter handling and additionally set
enforce_disable_flashinfer_allreduce_fusion=True so any downstream
re-check honors the deterministic mode.
@gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request updates the _handle_deterministic_inference method in server_args.py to automatically disable flashinfer_allreduce_fusion when deterministic inference is enabled. The review feedback suggests improving the log message consistency by adding the argument prefix and moving the enforcement flag assignment to an earlier stage in the initialization process to ensure it correctly influences the logic flow and avoids redundancy.

Two review comment threads on python/sglang/srt/server_args.py, both marked Outdated after the follow-up commit:
- Hoist enforce_disable_flashinfer_allreduce_fusion=True to before
  _handle_model_specific_adjustments so the auto-enable's enforce
  check (line 2378) actually drives the override, instead of being
  set redundantly afterwards.
- Add the leading '--' prefix to the warning so it matches the
  existing aiter-fusion warning style.

Re-verified: TestFlashInferDeterministic.test_prefix_with_logprobs
still passes locally (94.6s, OK, logprobs identical across batch sizes)
on Qwen/Qwen3-Next-80B-A3B-Instruct, --tp 4, H100.
@Jiminator marked this pull request as ready for review on May 8, 2026 05:33
@gemini-code-assist Bot commented:

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@Jiminator changed the title from "Disable FlashInfer allreduce fusion under deterministic inference" to "[Fix] Disable FlashInfer allreduce fusion under deterministic inference" on May 8, 2026
@Jiminator (Collaborator, Author) commented:

/tag-and-rerun-ci

@github-actions Bot added the run-ci label May 8, 2026
@Jiminator merged commit e9a15b9 into sgl-project:main on May 11, 2026
374 of 424 checks passed
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 11, 2026
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU]Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#	python/sglang/srt/utils/common.py