[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api #12787

Merged: Fridge003 merged 30 commits into sgl-project:main from wenscarl:trtllm_mnnvl_ar_integration on Mar 17, 2026
Conversation


@wenscarl wenscarl commented Nov 6, 2025

Motivation

Upstreams the new trtllm_mnnvl_fused_allreduce_add_rmsnorm backend behind the unified FlashInfer fusion API. Depends on flashinfer-ai/flashinfer#2118 and flashinfer-ai/flashinfer#2130.

This reduces multi-GPU decode latency on NVLink systems.

Modifications

Accuracy Tests

Benchmarking and Profiling

```shell
python3 -m sglang.bench_one_batch --model-path meta-llama/Llama-4-Scout-17B-16E \
  --batch 16 --input-len 256 --output-len 32 \
  --tp 4 \
  --kv-cache-dtype fp8_e4m3 \
  --load-format dummy \
  --trust-remote-code \
  --attention-backend triton \
  --moe-runner-backend cutlass \
  --enable-flashinfer-allreduce-fusion
```

(or `--flashinfer-allreduce-fusion-backend auto/trtllm/mnnvl`)
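For context, the backend's name describes the operation it fuses: a tensor-parallel all-reduce, a residual add, and an RMSNorm. A minimal NumPy sketch of the semantics (the helper name and shapes are illustrative, not the kernel's API — the real kernel fuses communication and compute into a single pass):

```python
# Reference semantics of a fused "allreduce + residual-add + RMSNorm" step,
# written in NumPy for clarity. This only illustrates the math; it is not
# the trtllm_mnnvl kernel implementation.
import numpy as np

def fused_allreduce_add_rmsnorm(shards, residual, weight, eps=1e-6):
    """shards: list of per-rank partial outputs (same shape).
    Returns (normed_output, new_residual)."""
    reduced = np.sum(shards, axis=0)      # all-reduce: sum across TP ranks
    new_residual = reduced + residual     # fused residual add
    rms = np.sqrt(np.mean(new_residual ** 2, axis=-1, keepdims=True) + eps)
    return (new_residual / rms) * weight, new_residual  # RMSNorm
```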

Single-Node Benchmark (BS=16)

Prefill Performance

| Engine | Prefill Latency (s) | Prefill Throughput (tok/s) |
| --- | --- | --- |
| auto (mnnvl on GB200) | 0.12081 | 33,903.74 |
| trtllm | 0.14941 | 27,414.07 |
| symm-mem | 0.12075 | 33,921.57 |

Decode Performance (Median)

| Engine | Median Decode Latency (s) | Median Decode Throughput (tok/s) |
| --- | --- | --- |
| auto (mnnvl on GB200) | 0.01204 | 1,328.61 |
| trtllm | 0.01240 | 1,290.71 |
| symm-mem | 0.01345 | 1,189.55 |

End-to-End Performance

| Engine | Total Latency (s) | Total Throughput (tok/s) |
| --- | --- | --- |
| auto (mnnvl on GB200) | 0.494 | 9,323.37 |
| trtllm | 0.534 | 8,628.06 |
| symm-mem | 0.539 | 8,551.91 |

Multinode Disagg (TP=8, GB200 with 4 GPUs per node, DSR1-nvfp4)

symm-mem:

| Concurrency | Output Token TPT (tok/s) | Median TPOT (ms) |
| --- | --- | --- |
| 4 | 282.70 | 8.87 |
| 8 | 595.17 | 9.12 |
| 32 | 1474.07 | 12.08 |
| 64 | 1994.36 | 13.32 |

mnnvl allreduce fusion:

| Concurrency | Output Token TPT (tok/s) | Median TPOT (ms) |
| --- | --- | --- |
| 4 | 287.90 | 6.42 (-27%) |
| 8 | 656.61 (+10%) | 7.38 (-19%) |
| 32 | 1512.82 (+2.6%) | 9.52 (-21%) |
| 64 | 1644.84 (-17%) | 10.23 (-23%) |
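The TPOT percentage deltas quoted above can be reproduced from the symm-mem baseline (truncating toward zero, as the table appears to do):

```python
# Reproduce the Median TPOT deltas from the mnnvl table, relative to the
# symm-mem baseline at the same concurrency. Values copied from the tables.
symm_mem_tpot = {4: 8.87, 8: 9.12, 32: 12.08, 64: 13.32}  # ms
mnnvl_tpot = {4: 6.42, 8: 7.38, 32: 9.52, 64: 10.23}      # ms

deltas = {
    c: int(100 * (mnnvl_tpot[c] - symm_mem_tpot[c]) / symm_mem_tpot[c])
    for c in symm_mem_tpot
}
# deltas == {4: -27, 8: -19, 32: -21, 64: -23}, matching the table
```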

Checklist

@github-actions github-actions bot added documentation Improvements or additions to documentation quant LLM Quantization labels Nov 6, 2025
@Fridge003 Fridge003 added high priority and removed documentation Improvements or additions to documentation labels Nov 6, 2025
@Fridge003
Collaborator

@wenscarl Is this PR ready for review?

@wenscarl (Collaborator, Author)

> @wenscarl Is this PR ready for review?

There are still some issues with the kernel in FlashInfer.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 19, 2025
@anurlybayev anurlybayev added Grace Blackwell and removed blackwell SM100/SM120 labels Dec 11, 2025
@Fridge003
Collaborator

Hi @wenscarl. FlashInfer has been updated to 0.6.1, so this PR can be unblocked.

@wenscarl wenscarl marked this pull request as ready for review February 11, 2026 20:28
@wenscarl wenscarl requested a review from hlu1 March 5, 2026 19:48
@wenscarl wenscarl requested a review from nvpohanh March 6, 2026 19:24
@wenscarl (Collaborator, Author) commented Mar 6, 2026

> Hi @wenscarl, didn't see this PR before opening #19586 (which also adds standalone TP AR). I see the mnnvl SM100 restriction here, but I tested it on SM90 and it works for both standalone and fusion. Is this restriction still necessary today?

The restriction should still apply. We don't officially support Hopper, even though in some cases symmetric memory is available and the code can still run. If you grab an H100 node with NVLink, mnnvl will likely work, but the kernel may not be optimal.

@mmangkad (Contributor) commented Mar 6, 2026

> > Hi @wenscarl, didn't see this PR before opening #19586 (which also adds standalone TP AR). I see the mnnvl SM100 restriction here, but I tested it on SM90 and it works for both standalone and fusion. Is this restriction still necessary today?
>
> The restriction should still apply. We don't officially support Hopper, even though in some cases symmetric memory is available and the code can still run. If you grab an H100 node with NVLink, mnnvl will likely work, but the kernel may not be optimal.

Yeah, I figured it only works on SM90 in single-node setups. In that case, mnnvl can be on par with or even slightly better than trtllm. I also tried SM90 multi-node and confirmed it doesn't work.

@Fridge003 (Collaborator)

@wenscarl The benchmark data looks great. But since we enable the auto backend by default, can you add a line for the auto backend in the tables? I'm not sure whether the auto backend correctly dispatches to the right backend (trtllm/mnnvl).

Is the single-node benchmark run on GB200 or B200? It looks like mnnvl is always fastest; can we enable mnnvl by default?

@wenscarl wenscarl requested a review from Fridge003 March 9, 2026 19:22
@wenscarl (Collaborator, Author) commented Mar 9, 2026

> @wenscarl The benchmark data looks great. But since we enable the auto backend by default, can you add a line for the auto backend in the tables? I'm not sure whether the auto backend correctly dispatches to the right backend (trtllm/mnnvl).

In my testing on GB200, auto resolves to mnnvl. Replaced mnnvl with auto in the tables.

> Is the single-node benchmark run on GB200 or B200? It looks like mnnvl is always fastest; can we enable mnnvl by default?

GB200. The auto choice is determined by FlashInfer, which auto-detects the architecture.
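The arch-based dispatch described here can be sketched as a simple capability check. The mapping below mirrors the behavior discussed in this thread (mnnvl on SM100/GB200, trtllm elsewhere), but the function name and the exact selection table are illustrative, not FlashInfer's actual code:

```python
# Hypothetical sketch of arch-based "auto" backend selection. SM100
# (Blackwell/GB200) has compute capability major version 10; older
# architectures fall back to the trtllm path.
def pick_allreduce_fusion_backend(compute_capability: tuple[int, int]) -> str:
    major, _minor = compute_capability
    if major >= 10:       # SM100: use the mnnvl kernels
        return "mnnvl"
    return "trtllm"       # pre-Blackwell: trtllm fusion path
```

For example, `pick_allreduce_fusion_backend((10, 0))` returns `"mnnvl"`, while `(9, 0)` (Hopper) returns `"trtllm"`.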

@Fridge003
Collaborator

/tag-and-rerun-ci

@samuellees
Contributor

/tag-and-rerun-ci

@Fridge003 Fridge003 merged commit d35fea1 into sgl-project:main Mar 17, 2026
273 of 292 checks passed
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026

Labels: documentation (Improvements or additions to documentation), Grace Blackwell, high priority, nvidia, quant (LLM Quantization), run-ci

8 participants