Enable flashinfer::trtllm_allreduce_fusion with PDL #23765

Merged: BBuf merged 2 commits into main from brayden/trtllm-ar-pdl on May 8, 2026

@b8zhong b8zhong commented Apr 26, 2026

Motivation

For BS = 1:
Screenshot 2026-04-26 at 9 15 32 AM

With PDL:
Screenshot 2026-04-26 at 9 16 46 AM

Thus, with PDL the all-reduce (AR) latency can be (nearly) fully hidden here.

Modifications

Regarding read-after-write hazards in the dependent kernel, they can only occur when:

PDL is used without griddepcontrol.wait, which is never the case for fp4_quantize in any FlashInfer backend or for fused_a_gemm. If there is no PDL signal (as in per token quant 8bit v2), the kernels fully serialize as before (for example, cvt_fp16_to_fp4 in the sgl-kernel implementation issues neither a PDL signal nor a GDC wait). However, once we switch to #23745 (comment) for quantize, it will work properly.
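
The rule above can be read as a small decision table. The sketch below is a toy Python model only, not code from SGLang or FlashInfer: the helper `classify_dependent_launch` and its two flags are hypothetical names standing for "the dependent kernel is launched with PDL" and "the dependent kernel executes griddepcontrol.wait before reading the producer's output".

```python
from enum import Enum


class Outcome(Enum):
    SERIALIZED = "fully serialized, no overlap"
    SAFE_OVERLAP = "overlapped, reads ordered by griddepcontrol.wait"
    RAW_HAZARD = "overlapped, read-after-write hazard possible"


def classify_dependent_launch(uses_pdl: bool, calls_gdc_wait: bool) -> Outcome:
    """Toy model of the hazard rule described above (hypothetical helper)."""
    if not uses_pdl:
        # No PDL signal: the dependent kernel runs after the producer, as before.
        return Outcome.SERIALIZED
    if calls_gdc_wait:
        # PDL plus griddepcontrol.wait: early launch, but reads are ordered.
        return Outcome.SAFE_OVERLAP
    # PDL without griddepcontrol.wait: the only case with a possible hazard.
    return Outcome.RAW_HAZARD


# Print the full decision table.
for uses_pdl in (False, True):
    for calls_wait in (False, True):
        outcome = classify_dependent_launch(uses_pdl, calls_wait)
        print(f"pdl={uses_pdl!s:5} wait={calls_wait!s:5} -> {outcome.name}")
```

In this model, the kernels named above (fp4_quantize, fused_a_gemm) would fall in the `SAFE_OVERLAP` row, and a quantize kernel with no PDL signal would fall in the `SERIALIZED` rows.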

Accuracy Tests

python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-V3.2-NVFP4 \
  --tp 4 \
  --ep 4 \
  --quantization modelopt_fp4 \
  --tool-call-parser deepseekv32 \
  --reasoning-parser deepseek-v3 \
  --port 30020 \
  --cuda-graph-bs 1 2 4 8 16 32 64 128 160 192 224 256 288 320 352 384 416 448 480 512 \
  --max-running-requests 512
python3 -m sglang.test.run_eval --port 30020 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --top-p 0.95 --temperature 1.0 --thinking-mode deepseek-v3
Repeat: 8, mean: 0.834
Scores: ['0.813', '0.859', '0.869', '0.813', '0.869', '0.798', '0.843', '0.808']
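
As a quick sanity check on the reported numbers, the mean can be recomputed from the eight per-repeat scores. This is a minimal sketch using only the Python standard library; the scores are copied from the output above, and the spread statistic is an extra that the eval output itself does not report.

```python
from statistics import mean, stdev

# Per-repeat GPQA scores copied from the eval output above.
scores = [0.813, 0.859, 0.869, 0.813, 0.869, 0.798, 0.843, 0.808]

avg = mean(scores)
spread = stdev(scores)  # sample standard deviation across the 8 repeats

print(f"repeats={len(scores)} mean={avg:.3f} stdev={spread:.3f}")
print(f"min={min(scores):.3f} max={max(scores):.3f}")
```

The recomputed mean rounds to 0.834, matching the reported value.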

Speed Tests and Profiling

109 -> 111.20 tps (roughly +2%)


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the flashinfer_allreduce_residual_rmsnorm function in flashinfer_comm_fusion.py by adding the trigger_completion_at_end argument to an internal call. The review feedback correctly points out that this argument is hardcoded to False, which ignores the value provided via the function's parameter, and suggests passing the parameter variable instead.
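
The pattern flagged in this review can be illustrated with a minimal sketch. Everything here is hypothetical: `fused_allreduce_stub` is a stand-in that only records the argument it receives, and the two wrappers are simplified illustrations, not the actual body of flashinfer_allreduce_residual_rmsnorm.

```python
def fused_allreduce_stub(*, trigger_completion_at_end: bool) -> dict:
    """Hypothetical stand-in for the internal call; records what it was given."""
    return {"trigger_completion_at_end": trigger_completion_at_end}


def wrapper_hardcoded(trigger_completion_at_end: bool) -> dict:
    # Pattern flagged in review: the caller's value is silently ignored.
    return fused_allreduce_stub(trigger_completion_at_end=False)


def wrapper_forwarding(trigger_completion_at_end: bool) -> dict:
    # Suggested pattern: pass the parameter variable through unchanged.
    return fused_allreduce_stub(
        trigger_completion_at_end=trigger_completion_at_end
    )


print(wrapper_hardcoded(True))
print(wrapper_forwarding(True))
```

With the hardcoded variant, a caller asking for `True` still ends up with `False` at the internal call, which is the mismatch the review describes.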

Comment thread python/sglang/srt/layers/flashinfer_comm_fusion.py Outdated
@b8zhong b8zhong added the run-ci label Apr 26, 2026
b8zhong and others added 2 commits April 26, 2026 10:06
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@b8zhong b8zhong force-pushed the brayden/trtllm-ar-pdl branch from 0c630c2 to 01ec4ab Compare April 26, 2026 14:06

@nvpohanh nvpohanh left a comment

Thanks! LGTM
cc @nvjullin @wenscarl

@BBuf BBuf merged commit 5fa3bb2 into main May 8, 2026
837 of 901 checks passed
@BBuf BBuf deleted the brayden/trtllm-ar-pdl branch May 8, 2026 02:41
Dogacel pushed a commit to Dogacel/sglang-fork that referenced this pull request May 8, 2026
)

Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
)

Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026
)

Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>