[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api#12787
Conversation
@wenscarl Is this PR ready for review?
There is still an issue with the kernel in flashinfer.
Hi @wenscarl. Flashinfer has been updated to 0.6.1, so this PR can be unblocked.
Yeah, I figured it only works on SM90 in single-node setups. In that case, mnnvl can be on par with or even slightly better than trtllm. I also tried SM90 multi-node and confirmed it doesn't work.
@wenscarl The benchmark data looks great. But since we enable the auto backend by default, can you add a line for the auto backend in the table? I'm not sure whether the auto backend correctly dispatches to the right backend (trtllm/mnnvl). Is the single-node benchmark tested on GB200 or B200? It looks like mnnvl is always fastest; can we enable mnnvl by default?
In my testing on GB200, auto dispatches to mnnvl. Replaced mnnvl with auto in the tables.
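To illustrate the dispatch question discussed above, here is a minimal sketch of how an "auto" allreduce backend choice could resolve to a concrete backend. The function name, parameters, and fallback order are hypothetical, not the actual sglang/flashinfer implementation:

```python
def pick_allreduce_backend(requested: str, mnnvl_supported: bool, trtllm_supported: bool) -> str:
    """Hypothetical dispatch sketch: "auto" prefers mnnvl when the
    platform supports it (e.g. GB200), then trtllm, else a generic
    fallback. An explicit request is honored as-is."""
    if requested != "auto":
        return requested
    if mnnvl_supported:
        return "mnnvl"
    if trtllm_supported:
        return "trtllm"
    return "nccl"

# On a GB200-like setup, "auto" would resolve to mnnvl.
print(pick_allreduce_backend("auto", mnnvl_supported=True, trtllm_supported=True))
```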
/tag-and-rerun-ci
/tag-and-rerun-ci |
Motivation
Upstream the new trtllm_mnnvl_fused_allreduce_add_rmsnorm backend with the unified fusion API. Depends on flashinfer-ai/flashinfer#2118 and flashinfer-ai/flashinfer#2130. This will reduce latency for multi-GPU decode on NVL systems.
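For reference, the semantics the fused kernel computes (all-reduce across tensor-parallel ranks, residual add, then RMSNorm) can be sketched in NumPy. This is a single-process reference that sums the per-rank shards locally instead of exchanging them over NVLink; the function and variable names are illustrative, not the flashinfer API:

```python
import numpy as np

def fused_allreduce_add_rmsnorm(shards, residual, weight, eps=1e-6):
    """Reference semantics of an allreduce + residual-add + RMSNorm fusion.

    shards: per-rank partial hidden states, shape (world_size, batch, hidden).
    Returns (normed_output, new_residual), matching the common pattern of
    returning the pre-norm sum as the next residual.
    """
    # All-reduce: sum the partial results from every rank.
    reduced = np.sum(shards, axis=0)
    # Residual add: fold in the skip connection.
    hidden = reduced + residual
    # RMSNorm: divide each row by its root-mean-square, then scale.
    rms = np.sqrt(np.mean(hidden ** 2, axis=-1, keepdims=True) + eps)
    return hidden / rms * weight, hidden

# Simulated 4-rank setup with batch 2, hidden size 8.
rng = np.random.default_rng(0)
shards = rng.standard_normal((4, 2, 8)).astype(np.float32)
residual = rng.standard_normal((2, 8)).astype(np.float32)
weight = np.ones(8, dtype=np.float32)
out, new_residual = fused_allreduce_add_rmsnorm(shards, residual, weight)
print(out.shape, new_residual.shape)
```

Fusing these three steps into one kernel avoids two extra round trips through HBM for the hidden states, which is where the decode-latency win comes from.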
Modifications
Accuracy Tests
Benchmarking and Profiling
Single-Node Benchmark (BS=16)
Prefill Performance
Decode Performance (Median)
End-to-End Performance
Multinode Disagg (TP=8, GB200 with 4 GPUs per node, DSR1-nvfp4)
symm-mem:
mnnvl all-reduce-fusion:
Checklist