[NVIDIA] Add flashinfer all-to-all MOE dispatcher #14668
Fridge003 merged 15 commits into sgl-project:main
Conversation
Hi, I am wondering whether this would be helpful for flashinfer_trtllm moe, and whether support for it is planned? Thanks!
@fzyzcjy It looks like it should work based on NVIDIA/TensorRT-LLM@e4bf29b |
ch-wan left a comment:
LGTM in general. Can we add a unit test for this new feature? Also, we need to update this document: https://github.com/sgl-project/sglang/blob/94e1251131ca27260cb0e8938aeb7b4a4e630b19/docs/advanced_features/expert_parallelism.md#backends-for-all-to-all-communication
/tag-and-rerun-ci
Hi, are there any updates on this? If I understand correctly, trtllm moe should be used for decode, so this feature is most useful when combined with it.
Review comment on the new test file (at `import unittest`):
Can we move this test to test/srt/ep and register it in the nightly test? A follow-up PR can be opened for this.
@fzyzcjy I tried this out. The problem is that flashinfer currently doesn't support running trtllm moe without fused routing. For all-to-all, we need to do the routing separately first, then do the communication, then run moe. Here is my WIP branch with the sglang changes that would otherwise allow it: https://github.com/trevor-m/sglang/tree/trtllm-a2a-wip Edit: I found that flashinfer has a separate API for this: `trtllm_fp4_block_scale_routed_moe`. I will try it out. I also looked at our WideEP configs, and flashinfer_cutedsl moe is actually what we use for decode (flashinfer_cutlass is used for prefill). Let me see if that can be enabled with this all-to-all.
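The ordering constraint described above (route first, then communicate, then run the experts) can be sketched in plain numpy. Everything here is illustrative: the function names, shapes, and rank layout are assumptions for the sketch, not the flashinfer or sglang API.

```python
# Sketch: with an all-to-all dispatcher, top-k routing must run *before*
# communication so each rank knows where to send its tokens. A MoE kernel
# with fused routing can't be used, because routing would happen inside
# the kernel, after tokens already need to be on the right rank.
import numpy as np

def topk_route(logits: np.ndarray, k: int):
    """Step 1 (routing): pick top-k experts per token and softmax weights."""
    expert_ids = np.argsort(-logits, axis=-1)[:, :k]          # (T, k)
    picked = np.take_along_axis(logits, expert_ids, axis=-1)  # (T, k)
    e = np.exp(picked - picked.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return expert_ids, weights

def dispatch(expert_ids: np.ndarray, num_ranks: int, experts_per_rank: int):
    """Step 2 (communication): bucket token copies by destination rank."""
    send = [[] for _ in range(num_ranks)]
    for t in range(expert_ids.shape[0]):
        for e in expert_ids[t]:
            send[int(e) // experts_per_rank].append(t)
    return send  # per-rank lists of source-token indices

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 4))      # 8 tokens, 4 experts on 2 ranks
ids, w = topk_route(logits, k=2)          # routing happens first...
buckets = dispatch(ids, num_ranks=2, experts_per_rank=2)
# ...then each rank would run its local experts, and combine() would
# reverse the dispatch to bring weighted outputs back (step 3).
```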
~~Draft PR since flashinfer-ai/flashinfer#2102 is not yet merged into flashinfer.~~ flashinfer-ai/flashinfer#2102 is now merged.
Motivation
This PR integrates the latest TRT-LLM moe all-to-all kernels into sglang (also known as the NVLink one-sided alltoall, or MnnvlThroughput alltoall):
Currently I have tested it with the flashinfer_cutlass moe runner backend, using fp4 quantization before communication. It also allows flashinfer_cutlass moe to write directly to the workspace buffer.
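As a rough illustration of why quantizing before the dispatch helps: the all-to-all payload shrinks by roughly 4x versus bf16. The hidden size, top-k, and one-fp8-scale-per-16-elements block layout below are assumptions typical of fp4 block-scale schemes, not values taken from this PR.

```python
# Back-of-envelope payload comparison for the dispatch (hypothetical sizes).
hidden = 7168   # assumed hidden size
tokens = 512    # matches the bs=512 benchmark below
topk = 8        # assumed experts per token

bf16_bytes = tokens * topk * hidden * 2                    # 2 bytes/element
fp4_bytes = tokens * topk * (hidden // 2 + hidden // 16)   # 4-bit values + fp8 block scales
ratio = bf16_bytes / fp4_bytes
print(round(ratio, 2))  # → 3.56
```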
Remaining issues:
Modifications
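The new backend is selected through the server flag shown below. As a sketch, it might be enabled at launch like this; the model path, parallelism flags, and sizes are placeholders, not taken from this PR.

```shell
# Sketch only: enable the flashinfer all-to-all MoE dispatcher.
# Model path and parallelism values are illustrative placeholders.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 4 --ep-size 4 \
  --moe-a2a-backend flashinfer
```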
--moe-a2a-backend=flashinfer
Accuracy Tests
Benchmarking and Profiling
Single-node results are below. Working on multi-node benchmarking.
Single node 4xGB200 results and profiles
Dispatch (bs=512) with --moe-a2a-backend=flashinfer
Dispatch (bs=512) with --moe-a2a-backend=none (FP4 allgather)
Combine (bs=512) with --moe-a2a-backend=flashinfer
Combine (bs=512) with --moe-a2a-backend=none (reduce-scatter)
Checklist