Enable `flashinfer::trtllm_allreduce_fusion` with PDL by b8zhong · Pull Request #23765 · sgl-project/sglang

b8zhong · 2026-04-26T13:47:31Z

Motivation

For BS = 1:

With PDL:

Thus, the latency of the all-reduce (AR) can be (nearly) fully hidden here.

Modifications

A read-after-write hazard in the dependent kernel can only occur when:

PDL is issued without griddepcontrol.wait, which is never the case in fp4_quantize (in all FlashInfer backends) or in fused_a_gemm. If there is no PDL signal (as in per-token quant 8-bit v2), execution fully serializes as before (for example, see cvt_fp16_to_fp4 in sgl-kernel, which issues neither PDL nor a GDC wait). However, once we switch to #23745 (comment) for quantize, it will work properly.
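The ordering rule above can be illustrated with a small host-side analogy. This is only a toy model built on Python threads, not CUDA code: `run_dependent`, the events, and the buffer are hypothetical names, with `completed.wait()` standing in for griddepcontrol.wait and `launched.set()` standing in for the early PDL launch signal. The `read_done` event exists only to make the unsafe interleaving deterministic for the demonstration.

```python
import threading


def run_dependent(waits_for_primary: bool) -> int:
    """Return the value the dependent 'kernel' reads from the shared buffer."""
    buf = {"value": 0}             # stands in for the all-reduce output buffer
    launched = threading.Event()   # early-launch signal (the PDL trigger)
    completed = threading.Event()  # primary's writes are finished and visible
    read_done = threading.Event()  # demo-only: forces the bad interleaving
    seen = []

    def primary():
        launched.set()             # the dependent may start from here on
        if not waits_for_primary:
            read_done.wait()       # let the dependent read first (hazard case)
        buf["value"] = 42          # the write the dependent relies on
        completed.set()

    def dependent():
        launched.wait()
        if waits_for_primary:
            completed.wait()       # plays the role of griddepcontrol.wait
        seen.append(buf["value"])
        read_done.set()

    threads = [threading.Thread(target=primary),
               threading.Thread(target=dependent)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen[0]


safe_value = run_dependent(waits_for_primary=True)    # reads the written value
stale_value = run_dependent(waits_for_primary=False)  # reads the stale value
print(safe_value, stale_value)
```

In this model, the dependent that waits always observes the primary's write, while the one that starts early and skips the wait observes the stale buffer, which is the hazard the paragraph above rules out for the listed kernels.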

Accuracy Tests

python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-V3.2-NVFP4 \
  --tp 4 \
  --ep 4 \
  --quantization modelopt_fp4 \
  --tool-call-parser deepseekv32 \
  --reasoning-parser deepseek-v3 \
  --port 30020 \
  --cuda-graph-bs 1 2 4 8 16 32 64 128 160 192 224 256 288 320 352 384 416 448 480 512 \
  --max-running-requests 512

python3 -m sglang.test.run_eval --port 30020 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --top-p 0.95 --temperature 1.0 --thinking-mode deepseek-v3

Repeat: 8, mean: 0.834
Scores: ['0.813', '0.859', '0.869', '0.813', '0.869', '0.798', '0.843', '0.808']
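As a quick sanity check, the reported mean can be recomputed from the eight per-repeat scores listed above. The snippet uses only the numbers from the log; the run-to-run spread is an extra statistic that the original output does not report.

```python
from statistics import mean

# Per-repeat GPQA scores copied from the log above.
scores = [0.813, 0.859, 0.869, 0.813, 0.869, 0.798, 0.843, 0.808]

mean_score = round(mean(scores), 3)          # average over the 8 repeats
spread = round(max(scores) - min(scores), 3)  # best minus worst repeat

print(f"mean={mean_score}, spread={spread}")
```

The recomputed mean agrees with the logged 0.834; the gap between the best and worst repeat is about 0.07, which gives a rough sense of the noise level at temperature 1.0.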

Speed Tests and Profiling

109 -> 111.20 tps
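For scale, the relative change implied by these two figures can be computed directly. This assumes "tps" means tokens per second and that both numbers come from the same benchmark configuration, which the description does not spell out.

```python
baseline_tps = 109.0   # before this change, from the line above
with_pdl_tps = 111.20  # after this change, from the line above

# Relative throughput gain in percent.
gain_pct = round((with_pdl_tps - baseline_tps) / baseline_tps * 100, 2)

print(f"gain: {gain_pct}%")
```

That works out to roughly a 2% throughput improvement.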

gemini-code-assist

Code Review

This pull request updates the flashinfer_allreduce_residual_rmsnorm function in flashinfer_comm_fusion.py by adding the trigger_completion_at_end argument to an internal call. The review feedback correctly points out that this argument is hardcoded to False, which ignores the value provided via the function's parameter, and suggests passing the parameter variable instead.
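The issue the bot describes can be sketched in isolation. The following is a minimal illustration, not the actual sglang code: `_fused_allreduce_stub` is a hypothetical stand-in for the internal call, the two wrapper names are invented for the example, and the parameter's default value is an assumption. Only the argument name `trigger_completion_at_end` comes from the review.

```python
calls = []


def _fused_allreduce_stub(**kwargs):
    """Hypothetical stand-in for the internal fused all-reduce call."""
    calls.append(kwargs)


def rmsnorm_hardcoded(trigger_completion_at_end: bool = True):
    # Flagged pattern: the caller's value is silently ignored.
    _fused_allreduce_stub(trigger_completion_at_end=False)


def rmsnorm_forwarded(trigger_completion_at_end: bool = True):
    # Suggested pattern: forward the parameter to the internal call.
    _fused_allreduce_stub(trigger_completion_at_end=trigger_completion_at_end)


rmsnorm_hardcoded(trigger_completion_at_end=True)
rmsnorm_forwarded(trigger_completion_at_end=True)
print(calls)
```

With the hardcoded version the internal call receives `False` even though the caller asked for `True`; the forwarded version passes the caller's choice through unchanged.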

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

nvpohanh

Thanks! LGTM
cc @nvjullin @wenscarl

b8zhong requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners April 26, 2026 13:47

gemini-code-assist Bot reviewed Apr 26, 2026

Comment thread python/sglang/srt/layers/flashinfer_comm_fusion.py Outdated

b8zhong added the run-ci label Apr 26, 2026

b8zhong and others added 2 commits April 26, 2026 10:06

upd

2c28a62

Update python/sglang/srt/layers/flashinfer_comm_fusion.py

01ec4ab

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

b8zhong force-pushed the brayden/trtllm-ar-pdl branch from 0c630c2 to 01ec4ab Compare April 26, 2026 14:06

nvpohanh approved these changes Apr 28, 2026

BBuf mentioned this pull request Apr 29, 2026

SGLang AI Agent Performance Optimization PRs (2026-01-29 to 2026-04-29) BBuf/AI-Infra-Auto-Driven-SKILLS#46

Open

BBuf approved these changes May 8, 2026

BBuf merged commit 5fa3bb2 into main May 8, 2026
837 of 901 checks passed

BBuf deleted the brayden/trtllm-ar-pdl branch May 8, 2026 02:41

