[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api #12787

Merged: Fridge003 merged 30 commits into sgl-project:main from wenscarl:trtllm_mnnvl_ar_integration on Mar 17, 2026
Conversation


@wenscarl wenscarl commented Nov 6, 2025

Motivation

Upstreams the new trtllm_mnnvl_fused_allreduce_add_rmsnorm backend behind the unified FlashInfer fusion API. Depends on flashinfer-ai/flashinfer#2118 and flashinfer-ai/flashinfer#2130.

This reduces multi-GPU decode latency on NVLink systems.

Modifications

Accuracy Tests

Benchmarking and Profiling

```shell
python3 -m sglang.bench_one_batch --model-path meta-llama/Llama-4-Scout-17B-16E \
  --batch 16 --input-len 256 --output-len 32 \
  --tp 4 \
  --kv-cache-dtype fp8_e4m3 \
  --load-format dummy \
  --trust-remote-code \
  --attention-backend triton \
  --moe-runner-backend cutlass \
  --enable-flashinfer-allreduce-fusion
```

(or `--flashinfer-allreduce-fusion-backend auto/trtllm/mnnvl`)
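For context, the backend's name describes the operation it fuses: a tensor-parallel all-reduce, a residual add, and an RMSNorm. A minimal NumPy sketch of the semantics (the helper name and shapes are illustrative, not the kernel's API — the real kernel fuses communication and compute into a single pass):

```python
# Reference semantics of a fused "allreduce + residual-add + RMSNorm" step,
# written in NumPy for clarity. This only illustrates the math; it is not
# the trtllm_mnnvl kernel implementation.
import numpy as np

def fused_allreduce_add_rmsnorm(shards, residual, weight, eps=1e-6):
    """shards: list of per-rank partial outputs (same shape).
    Returns (normed_output, new_residual)."""
    reduced = np.sum(shards, axis=0)      # all-reduce: sum across TP ranks
    new_residual = reduced + residual     # fused residual add
    rms = np.sqrt(np.mean(new_residual ** 2, axis=-1, keepdims=True) + eps)
    return (new_residual / rms) * weight, new_residual  # RMSNorm
```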

Single-Node Benchmark (BS=16)

Prefill Performance

| Engine | Prefill Latency (s) | Prefill Throughput (tok/s) |
| --- | --- | --- |
| auto (mnnvl on GB200) | 0.12081 | 33,903.74 |
| trtllm | 0.14941 | 27,414.07 |
| symm-mem | 0.12075 | 33,921.57 |

Decode Performance (Median)

| Engine | Median Decode Latency (s) | Median Decode Throughput (tok/s) |
| --- | --- | --- |
| auto (mnnvl on GB200) | 0.01204 | 1,328.61 |
| trtllm | 0.01240 | 1,290.71 |
| symm-mem | 0.01345 | 1,189.55 |

End-to-End Performance

| Engine | Total Latency (s) | Total Throughput (tok/s) |
| --- | --- | --- |
| auto (mnnvl on GB200) | 0.494 | 9,323.37 |
| trtllm | 0.534 | 8,628.06 |
| symm-mem | 0.539 | 8,551.91 |

Multinode Disagg (TP=8, GB200 with 4 GPUs per node, DSR1-nvfp4)

symm-mem:

| Concurrency | Output Token TPT (tok/s) | Median TPOT (ms) |
| --- | --- | --- |
| 4 | 282.70 | 8.87 |
| 8 | 595.17 | 9.12 |
| 32 | 1474.07 | 12.08 |
| 64 | 1994.36 | 13.32 |

mnnvl allreduce fusion:

| Concurrency | Output Token TPT (tok/s) | Median TPOT (ms) |
| --- | --- | --- |
| 4 | 287.90 | 6.42 (-27%) |
| 8 | 656.61 (+10%) | 7.38 (-19%) |
| 32 | 1512.82 (+2.6%) | 9.52 (-21%) |
| 64 | 1644.84 (-17%) | 10.23 (-23%) |
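The TPOT percentage deltas quoted above can be reproduced from the symm-mem baseline (truncating toward zero, as the table appears to do):

```python
# Reproduce the Median TPOT deltas from the mnnvl table, relative to the
# symm-mem baseline at the same concurrency. Values copied from the tables.
symm_mem_tpot = {4: 8.87, 8: 9.12, 32: 12.08, 64: 13.32}  # ms
mnnvl_tpot = {4: 6.42, 8: 7.38, 32: 9.52, 64: 10.23}      # ms

deltas = {
    c: int(100 * (mnnvl_tpot[c] - symm_mem_tpot[c]) / symm_mem_tpot[c])
    for c in symm_mem_tpot
}
# deltas == {4: -27, 8: -19, 32: -21, 64: -23}, matching the table
```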

Checklist

@github-actions github-actions bot added documentation Improvements or additions to documentation quant LLM Quantization labels Nov 6, 2025
@Fridge003 Fridge003 added high priority and removed documentation Improvements or additions to documentation labels Nov 6, 2025
@Fridge003
Collaborator

@wenscarl Is this PR ready for review?

@wenscarl (Collaborator, Author)

> @wenscarl Is this PR ready for review?

There are still some issues with the kernel in FlashInfer.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 19, 2025
@anurlybayev anurlybayev added Grace Blackwell and removed blackwell SM100/SM120 labels Dec 11, 2025
@Fridge003
Collaborator

Hi @wenscarl. FlashInfer has been updated to 0.6.1, so this PR can be unblocked.

@wenscarl wenscarl marked this pull request as ready for review February 11, 2026 20:28
@wenscarl wenscarl requested a review from hlu1 March 5, 2026 19:48
@wenscarl wenscarl requested a review from nvpohanh March 6, 2026 19:24
@wenscarl (Collaborator, Author) commented Mar 6, 2026

> Hi @wenscarl, didn't see this PR before opening #19586 (which also adds standalone TP AR). I see the mnnvl SM100 restriction here, but I tested it on SM90 and it works for both standalone and fusion. Is this restriction still necessary today?

The restriction should still apply. We don't officially support Hopper, even though in some cases symmetric memory is available and the code can still run. If you grab an H100 node with NVLink, mnnvl will likely work, but the kernel may not be optimal.

@mmangkad (Contributor) commented Mar 6, 2026

> > Hi @wenscarl, didn't see this PR before opening #19586 (which also adds standalone TP AR). I see the mnnvl SM100 restriction here, but I tested it on SM90 and it works for both standalone and fusion. Is this restriction still necessary today?
>
> The restriction should still apply. We don't officially support Hopper, even though in some cases symmetric memory is available and the code can still run. If you grab an H100 node with NVLink, mnnvl will likely work, but the kernel may not be optimal.

Yeah, I figured it only works on SM90 in single-node setups. In that case, mnnvl can be on par with or even slightly better than trtllm. I also tried SM90 multi-node and confirmed it doesn't work.

@Fridge003 (Collaborator)

@wenscarl The benchmark data looks great. But since we enable the auto backend by default, can you add a line for the auto backend in the tables? I'm not sure whether the auto backend correctly dispatches to the right backend (trtllm/mnnvl).

Is the single-node benchmark run on GB200 or B200? It looks like mnnvl is always fastest; can we enable mnnvl by default?

@wenscarl wenscarl requested a review from Fridge003 March 9, 2026 19:22
@wenscarl (Collaborator, Author) commented Mar 9, 2026

> @wenscarl The benchmark data looks great. But since we enable the auto backend by default, can you add a line for the auto backend in the tables? I'm not sure whether the auto backend correctly dispatches to the right backend (trtllm/mnnvl).

In my testing on GB200, auto resolves to mnnvl. Replaced mnnvl with auto in the tables.

> Is the single-node benchmark run on GB200 or B200? It looks like mnnvl is always fastest; can we enable mnnvl by default?

GB200. The auto choice is determined by FlashInfer, which auto-detects the architecture.
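The arch-based dispatch described here can be sketched as a simple capability check. The mapping below mirrors the behavior discussed in this thread (mnnvl on SM100/GB200, trtllm elsewhere), but the function name and the exact selection table are illustrative, not FlashInfer's actual code:

```python
# Hypothetical sketch of arch-based "auto" backend selection. SM100
# (Blackwell/GB200) has compute capability major version 10; older
# architectures fall back to the trtllm path.
def pick_allreduce_fusion_backend(compute_capability: tuple[int, int]) -> str:
    major, _minor = compute_capability
    if major >= 10:       # SM100: use the mnnvl kernels
        return "mnnvl"
    return "trtllm"       # pre-Blackwell: trtllm fusion path
```

For example, `pick_allreduce_fusion_backend((10, 0))` returns `"mnnvl"`, while `(9, 0)` (Hopper) returns `"trtllm"`.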

@Fridge003
Collaborator

/tag-and-rerun-ci

@samuellees
Contributor

/tag-and-rerun-ci

@Fridge003 Fridge003 merged commit d35fea1 into sgl-project:main Mar 17, 2026
273 of 292 checks passed
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026

Labels: documentation (Improvements or additions to documentation), Grace Blackwell, high priority, nvidia, quant (LLM Quantization), run-ci

8 participants