Qwen3next flashinfer allreduce auto enable #22664

Merged
BBuf merged 2 commits into main from bbuf/qwen3next-flashinfer-allreduce-auto-enable on Apr 18, 2026
BBuf (Collaborator) commented Apr 13, 2026

Made with @codex

Summary

Enable FlashInfer allreduce fusion by default for Qwen3NextForCausalLM on supported single-node SM90/SM100 TP runs.

Why

Qwen/Qwen3-Coder-Next was running with enable_flashinfer_allreduce_fusion=false on H100, and profiler traces showed unfused cross-device reduce kernels as a major prefill hotspot (cross_device_reduce_2stage at 21.89% of the TP-0 EXTEND profile).

Change

  • add Qwen3NextForCausalLM to the existing FlashInfer allreduce auto-enable whitelist
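
A minimal sketch of what a whitelist-based auto-enable check could look like, based only on the conditions named in the Summary (architecture on the whitelist, SM90/SM100, single node, TP). All names here (`AUTO_ENABLE_ARCHS`, `RunConfig`, `should_auto_enable_fusion`) are hypothetical and not taken from the sglang codebase, and the `tp_size > 1` condition is an assumption about what "TP runs" means.

```python
from dataclasses import dataclass

# Hypothetical whitelist. Only Qwen3NextForCausalLM is named in this PR;
# the entries that already exist upstream are intentionally not reproduced.
AUTO_ENABLE_ARCHS = {"Qwen3NextForCausalLM"}

# SM90 and SM100 are the compute capabilities named in the Summary.
SUPPORTED_SM_MAJORS = {9, 10}


@dataclass
class RunConfig:
    architecture: str  # model architecture string
    sm_major: int      # CUDA compute capability major version
    tp_size: int       # tensor-parallel degree
    nnodes: int        # number of nodes in the deployment


def should_auto_enable_fusion(cfg: RunConfig) -> bool:
    """Return True when every condition from the Summary holds."""
    return (
        cfg.architecture in AUTO_ENABLE_ARCHS
        and cfg.sm_major in SUPPORTED_SM_MAJORS
        and cfg.tp_size > 1   # assumption: "TP run" means more than one rank
        and cfg.nnodes == 1   # single-node only
    )


if __name__ == "__main__":
    # The H100 (SM90), --tp 4, single-node run from this PR.
    print(should_auto_enable_fusion(RunConfig("Qwen3NextForCausalLM", 9, 4, 1)))
```

Keeping the check as a pure function of the run configuration makes it easy to unit test without GPUs.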

H100 Evidence

Model: Qwen/Qwen3-Coder-Next

Command:

python -m sglang.launch_server --model-path Qwen/Qwen3-Coder-Next --tp 4 --port 31080

Server args:

  • baseline: enable_flashinfer_allreduce_fusion=false
  • patch: enable_flashinfer_allreduce_fusion=true

Benchmark (sglang.bench_serving, random dataset with 2048 input / 256 output tokens, 128 prompts, max_concurrency=32):

| Metric | Baseline | Patch | Delta |
| --- | --- | --- | --- |
| Request throughput (req/s) | 5.49 | 9.41 | +71.4% |
| Mean TTFT (ms) | 456.24 | 167.54 | -63.3% |
| Mean TPOT (ms) | 50.41 | 25.49 | -49.4% |
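
For reference, the Delta column can be reproduced from the Baseline and Patch columns as a relative change. The sketch below is illustrative only; `pct_delta` is a hypothetical helper, not part of sglang.bench_serving.

```python
def pct_delta(baseline: float, patch: float) -> float:
    """Relative change of patch versus baseline, in percent."""
    return (patch - baseline) / baseline * 100.0


# Values copied from the benchmark table in this PR.
rows = {
    "Request throughput (req/s)": (5.49, 9.41),
    "Mean TTFT (ms)": (456.24, 167.54),
    "Mean TPOT (ms)": (50.41, 25.49),
}

if __name__ == "__main__":
    for name, (baseline, patch) in rows.items():
        print(f"{name}: {pct_delta(baseline, patch):+.1f}%")
```

Rounded to one decimal place, the results match the +71.4%, -63.3%, and -49.4% reported in the table.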

Full accuracy:

| Eval | Baseline | Patch | Delta |
| --- | --- | --- | --- |
| MMLU (14042) | 0.8745905 | 0.8714571 | -0.31 pp |
| GSM8K (1314) | 0.9627093 | 0.9687976 | +0.61 pp |
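
One rough way to read these deltas is to compare them against binomial sampling noise. The sketch below is an approximation under stated assumptions: it treats each question as an independent Bernoulli trial and treats the two runs as independent, even though both runs answer the same questions (a paired test would be more precise). The helper names are hypothetical.

```python
import math


def binomial_se(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


def delta_in_se_units(baseline: float, patch: float, n: int) -> float:
    """Accuracy delta divided by the combined standard error of both runs."""
    combined = math.sqrt(binomial_se(baseline, n) ** 2 + binomial_se(patch, n) ** 2)
    return (patch - baseline) / combined


# Values copied from the accuracy table in this PR.
evals = {
    "MMLU": (0.8745905, 0.8714571, 14042),
    "GSM8K": (0.9627093, 0.9687976, 1314),
}

if __name__ == "__main__":
    for name, (baseline, patch, n) in evals.items():
        pp = (patch - baseline) * 100.0
        z = delta_in_se_units(baseline, patch, n)
        print(f"{name}: {pp:+.2f} pp, {z:+.2f} combined standard errors")
```

Under these assumptions both deltas come out below one combined standard error, which is consistent with treating the accuracy changes as noise rather than a regression or an improvement.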

Profiler (TP-0 EXTEND):

  • baseline: cross_device_reduce_2stage 136.171 ms (21.89%)
  • patch: allreduce_fusion_kernel_oneshot_lamport 57.661 ms (10.41%)

This change activates the fused allreduce path and removes the previous dominant unfused reduce hotspot.
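
The profiler figures can be cross-checked with a little arithmetic. The sketch below assumes each percentage is the kernel's share of the total profiled TP-0 EXTEND time (the PR does not state the denominator), so the implied totals are approximate and subject to rounding in the reported percentages.

```python
def implied_total_ms(kernel_ms: float, share_pct: float) -> float:
    """Total profiled time implied by a kernel's time and its percentage share."""
    return kernel_ms / (share_pct / 100.0)


def pct_delta(baseline: float, patch: float) -> float:
    """Relative change of patch versus baseline, in percent."""
    return (patch - baseline) / baseline * 100.0


# Values copied from the profiler bullets in this PR.
BASELINE_KERNEL_MS, BASELINE_SHARE = 136.171, 21.89  # cross_device_reduce_2stage
PATCH_KERNEL_MS, PATCH_SHARE = 57.661, 10.41         # allreduce_fusion_kernel_oneshot_lamport

if __name__ == "__main__":
    base_total = implied_total_ms(BASELINE_KERNEL_MS, BASELINE_SHARE)
    patch_total = implied_total_ms(PATCH_KERNEL_MS, PATCH_SHARE)
    print(f"allreduce kernel time: {pct_delta(BASELINE_KERNEL_MS, PATCH_KERNEL_MS):+.1f}%")
    print(f"implied total, baseline: {base_total:.1f} ms")
    print(f"implied total, patch: {patch_total:.1f} ms")
```

By this estimate the allreduce kernel time drops by roughly 58%, and the implied total profiled time falls from about 622 ms to about 554 ms.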

Validation

  • H100 before/after throughput benchmark
  • H100 before/after sglang.profiler
  • H100 before/after full MMLU and full GSM8K server eval

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

BBuf (Collaborator, Author) commented Apr 13, 2026

/tag-and-rerun-ci

BBuf (Collaborator, Author) commented Apr 13, 2026

/rerun-failed-ci

ispobock (Collaborator) commented

/rerun-test test_qwen3_next_models.py test_qwen3_next_models_mtp.py

github-actions (Contributor) commented

4-gpu-h100 (2 tests): View workflow run

cd test/ && python3 registered/4-gpu-models/test_qwen3_next_models.py
cd test/ && python3 registered/4-gpu-models/test_qwen3_next_models_mtp.py

ispobock (Collaborator) commented

cc: @yizhang2077

@BBuf BBuf merged commit c6a45fa into main Apr 18, 2026
220 of 263 checks passed
@BBuf BBuf deleted the bbuf/qwen3next-flashinfer-allreduce-auto-enable branch April 18, 2026 14:32
jmamou pushed a commit to jmamou/sglang that referenced this pull request Apr 20, 2026
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 7, 2026
Bisects nightly-test-general-4-gpu-h100 :: TestFlashInferDeterministic.test_prefix_with_logprobs
to commit c6a45fa / PR sgl-project#22664 'Qwen3next flashinfer allreduce auto enable'.

Last pass: 9c47bba (2026-04-18). First fail: 2a327f0 (2026-04-19).
Still failing on main as of 2026-05-07.
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 7, 2026
Run TestFlashInferDeterministic.test_prefix_with_logprobs at parent
commit 4839cec: PASS (118.4s, '+++ identical across all batch sizes').
Run same test at c6a45fa (PR sgl-project#22664): FAIL (102.6s, 244 per-sample
mismatches with the same -2.355271339416504 vs -2.3723394870758057
fingerprint seen in CI run 24971499389).
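
The kind of comparison described in the commit message above (per-sample logprobs that must be identical across batch sizes) can be sketched as an exact-equality count. This is an illustration only, not the actual TestFlashInferDeterministic implementation; `count_mismatches` is a hypothetical helper, and the only data used is the pair of values quoted in the commit message.

```python
from typing import Sequence


def count_mismatches(reference: Sequence[float], candidate: Sequence[float]) -> int:
    """Count positions where two logprob sequences are not exactly equal."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    # Exact equality on purpose: a determinism check tolerates no drift.
    return sum(1 for ref, cand in zip(reference, candidate) if ref != cand)


if __name__ == "__main__":
    # The two values quoted as the mismatch fingerprint in the commit message.
    reference = [-2.355271339416504]
    candidate = [-2.3723394870758057]
    print(count_mismatches(reference, candidate))
```

A tolerance-based comparison would hide exactly the kind of small numeric differences a determinism test is meant to surface, which is why the sketch compares values exactly.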