
[NVIDIA] Add flashinfer all-to-all MOE dispatcher#14668

Merged
Fridge003 merged 15 commits into sgl-project:main from trevor-m:a2a on Jan 24, 2026

Conversation

trevor-m (Collaborator) commented Dec 8, 2025

This PR was originally a draft because flashinfer-ai/flashinfer#2102 had not yet been merged into flashinfer; that PR has since been merged.

Motivation

This PR integrates the latest TRT-LLM MoE all-to-all kernels into sglang (also known as the NVLink one-sided all-to-all, or MNNVL throughput all-to-all):

NVLINK one-sided comm AllToAll strategy for throughput scenarios.

This implementation uses symmetric memory to enable peer-to-peer access between GPUs over NVLink.
The kernels take the role of only one side of the communication: the dispatch kernel puts data
into peer ranks' symmetric memory from the local buffer, while the combine kernel gets data from peer
ranks' symmetric memory and reduces it into the local buffer. It is currently the most efficient
implementation, but it requires symmetric memory proportional to max_num_tokens * n_ranks, which may
not scale well for very large-scale parallelization.
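The sizing constraint above can be made concrete with a small back-of-the-envelope helper. This is a hypothetical sketch; the function name and layout are illustrative, not the actual flashinfer/TRT-LLM API:

```python
# Hypothetical sketch of the symmetric-memory sizing described above.
# Each rank must reserve room for the worst case where every peer rank
# dispatches all of its max_num_tokens to this rank.
def symmetric_workspace_bytes(max_num_tokens: int, n_ranks: int,
                              hidden_size: int, bytes_per_elem: float) -> int:
    return int(max_num_tokens * n_ranks * hidden_size * bytes_per_elem)

# e.g. 512 tokens per rank, 4 ranks, hidden=7168, FP4 payload (0.5 byte/elem):
print(symmetric_workspace_bytes(512, 4, 7168, 0.5))  # 7340032 bytes (~7 MiB)
```

The linear factor of `n_ranks` is why the quoted description warns about very large-scale parallelization: doubling the rank count doubles each rank's symmetric buffer.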

Currently I have tested it with the flashinfer_cutlass MoE runner backend, using FP4 quantization before communication. It also allows flashinfer_cutlass MoE to write directly into the workspace buffer.

Remaining issues:

  1. Improve max_num_tokens handling: the per-rank value should be multiplied by ep_size, or calculated automatically from the server args.
  2. flashinfer a2a does not yet support ranks with 0 tokens, so we have to allocate a dummy token in those cases. That tensor allocation prevents us from using streams to overlap with the shared experts.
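The zero-token workaround can be illustrated with a toy padding helper. This is a plain-Python sketch under stated assumptions; the real code operates on CUDA tensors, and the helper name is hypothetical:

```python
# Hypothetical sketch of the zero-token workaround described above.
def pad_empty_rank(tokens: list, hidden_size: int):
    """If this rank has no tokens to dispatch, insert one dummy token so
    the flashinfer a2a kernel always sees a non-empty input. The caller
    drops the dummy row after combine using the returned flag."""
    if tokens:
        return tokens, False
    return [[0.0] * hidden_size], True
```

Because the dummy row is allocated on the fly, this allocation sits on the critical path, which is what blocks the stream-based overlap with shared experts mentioned above.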

Modifications

  • Add --moe-a2a-backend=flashinfer

Accuracy Tests

SGLANG_MOE_NVFP4_DISPATCH=1 python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4-v2 --trust-remote-code --quantization modelopt_fp4 --tp 4 --moe-runner-backend flashinfer_cutlass --ep-size 4 --dp 4 --enable-dp-attention --mem-fraction-static 0.85 --max-running-requests 2048 --stream-interval 5 --enable-dp-lm-head --attention-backend triton --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 --disable-radix-cache --moe-a2a-backend flashinfer
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 2048 --random-input 1024 --random-output 1024 --random-range-ratio 1 --max-concurrency 2048
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port=30000
Accuracy: 0.955
Invalid: 0.000
Latency: 166.179 s
Output throughput: 874.407 token/s

Benchmarking and Profiling

Single-node results are below; multinode benchmarking is in progress.

Single node 4xGB200 results and profiles

SGLANG_MOE_NVFP4_DISPATCH=1 python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4-v2 --trust-remote-code --quantization modelopt_fp4 --tp 4 --moe-runner-backend flashinfer_cutlass --ep-size 4 --dp 4 --enable-dp-attention --mem-fraction-static 0.85 --max-running-requests 2048 --stream-interval 5 --enable-dp-lm-head --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 --disable-radix-cache --moe-a2a-backend flashinfer --disable-shared-experts-fusion
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 2048 --random-input 1024 --random-output 1024 --random-range-ratio 1 --max-concurrency 2048
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 2048      
Successful requests:                     2048      
Benchmark duration (s):                  105.78    
Total input tokens:                      2097152   
Total input text tokens:                 2097152   
Total input vision tokens:               0         
Total generated tokens:                  2097152   
Total generated tokens (retokenized):    2093741   
Request throughput (req/s):              19.36     
Input token throughput (tok/s):          19826.37  
Output token throughput (tok/s):         19826.37  
Peak output token throughput (tok/s):    41273.00  
Peak concurrent requests:                2048      
Total token throughput (tok/s):          39652.75  
Concurrency:                             2033.67   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   105036.00 
Median E2E Latency (ms):                 105036.65 
---------------Time to First Token----------------
Mean TTFT (ms):                          19192.10  
Median TTFT (ms):                        18945.35  
P99 TTFT (ms):                           35753.33  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.91     
Median TPOT (ms):                        84.31     
P99 TPOT (ms):                           96.69     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           83.91     
Median ITL (ms):                         64.86     
P95 ITL (ms):                            134.52    
P99 ITL (ms):                            187.70    
Max ITL (ms):                            6041.95   
==================================================

With SGLANG_MOE_NVFP4_DISPATCH=0 (BF16 dispatch)
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 2048      
Successful requests:                     2048      
Benchmark duration (s):                  113.61    
Total input tokens:                      2097152   
Total input text tokens:                 2097152   
Total input vision tokens:               0         
Total generated tokens:                  2097152   
Total generated tokens (retokenized):    2094097   
Request throughput (req/s):              18.03     
Input token throughput (tok/s):          18459.37  
Output token throughput (tok/s):         18459.37  
Peak output token throughput (tok/s):    39440.00  
Peak concurrent requests:                2048      
Total token throughput (tok/s):          36918.74  
Concurrency:                             2032.17   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   112730.76 
Median E2E Latency (ms):                 112745.83 
---------------Time to First Token----------------
Mean TTFT (ms):                          22471.12  
Median TTFT (ms):                        22192.53  
P99 TTFT (ms):                           42519.44  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.23     
Median TPOT (ms):                        88.62     
P99 TPOT (ms):                           104.26    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           88.23     
Median ITL (ms):                         66.35     
P95 ITL (ms):                            139.80    
P99 ITL (ms):                            192.80    
Max ITL (ms):                            7388.84   
==================================================

With --moe-a2a-backend=none
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 2048      
Successful requests:                     2048      
Benchmark duration (s):                  107.98    
Total input tokens:                      2097152   
Total input text tokens:                 2097152   
Total input vision tokens:               0         
Total generated tokens:                  2097152   
Total generated tokens (retokenized):    2093493   
Request throughput (req/s):              18.97     
Input token throughput (tok/s):          19420.79  
Output token throughput (tok/s):         19420.79  
Peak output token throughput (tok/s):    43370.00  
Peak concurrent requests:                2048      
Total token throughput (tok/s):          38841.58  
Concurrency:                             2036.25   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   107365.61 
Median E2E Latency (ms):                 107376.61 
---------------Time to First Token----------------
Mean TTFT (ms):                          19164.99  
Median TTFT (ms):                        18753.23  
P99 TTFT (ms):                           35803.22  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          86.22     
Median TPOT (ms):                        86.69     
P99 TPOT (ms):                           98.54     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           86.22     
Median ITL (ms):                         66.88     
P95 ITL (ms):                            149.64    
P99 ITL (ms):                            224.95    
Max ITL (ms):                            5954.79   
==================================================

Dispatch (bs=512) with --moe-a2a-backend=flashinfer

(profile screenshot)

Dispatch (bs=512) with --moe-a2a-backend=none (FP4 allgather)

(profile screenshot)

Combine (bs=512) with --moe-a2a-backend=flashinfer

(profile screenshot)

Combine (bs=512) with --moe-a2a-backend=none (reduce-scatter)

(profile screenshot)



github-actions bot added the quant (LLM Quantization) and deepseek labels on Dec 8, 2025
ch-wan (Collaborator) left a comment

@trevor-m trevor-m changed the title Draft: Add flashinfer all-to-all MOE dispatcher [NVIDIA] Add flashinfer all-to-all MOE dispatcher Dec 15, 2025
trevor-m force-pushed the a2a branch 2 times, most recently from eafb94e to 61a262e on December 17, 2025
fzyzcjy (Collaborator) commented Dec 18, 2025

Hi, I am wondering whether this will be helpful for flashinfer_trtllm moe, and whether there will be support for it? Thanks!

trevor-m (Collaborator, Author) replied, quoting the above:
Hi, I am wondering whether this will be helpful for flashinfer_trtllm moe, and whether there will be support for it? Thanks!

@fzyzcjy It looks like it should work based on NVIDIA/TensorRT-LLM@e4bf29b
I can give it a try

ch-wan (Collaborator) left a comment

@Fridge003 Fridge003 mentioned this pull request Dec 21, 2025
6 tasks
Fridge003 (Collaborator) commented:

/tag-and-rerun-ci

@trevor-m trevor-m enabled auto-merge (squash) January 6, 2026 23:25
@trevor-m trevor-m disabled auto-merge January 7, 2026 00:46
fzyzcjy (Collaborator) commented Jan 12, 2026

@trevor-m: I can give it a try

Hi, are there any updates on that? If I understand correctly, trtllm moe should be used for decode, so this feature is most useful when combined with it.

@@ -0,0 +1,322 @@
import unittest
A collaborator commented on the diff:
Can we move this test to test/srt/ep and register it in the nightly tests? This can be done in a follow-up PR.

trevor-m (Collaborator, Author) commented Jan 22, 2026

@trevor-m: I can give it a try

Hi, is there any updates about that? if I understand correctly trtllm moe should be used for decode, and thus this feature is most useful when combined with that

@fzyzcjy I tried this out. The problem is that flashinfer currently doesn't support running trtllm moe without fused routing. For all-to-all, we need to do the routing separately first, then do the communication, then run the MoE. Here is my WIP branch with the sglang changes that would otherwise allow it: https://github.com/trevor-m/sglang/tree/trtllm-a2a-wip

Edit: I found that flashinfer has a separate API for this, trtllm_fp4_block_scale_routed_moe. I will try it out.

I looked at our WideEP configs, and it turns out flashinfer_cutedsl moe is what we actually use for decode (flashinfer_cutlass is used for prefill). Let me see if that can be enabled with this all-to-all.
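The route-then-communicate-then-compute ordering described above can be sketched as a toy single-process simulation. All names and the one-expert-per-rank layout here are illustrative, not the flashinfer API:

```python
# Toy simulation of routed all-to-all: route tokens to expert ranks,
# dispatch into per-rank buffers, run the "expert", then combine results
# back into the original token order.
def route(tokens, n_ranks):
    # Illustrative routing rule: token value modulo rank count.
    return [t % n_ranks for t in tokens]

def dispatch(tokens, dests, n_ranks):
    # One-sided "put": each token lands in its destination rank's buffer,
    # tagged with its original position for the combine step.
    bufs = [[] for _ in range(n_ranks)]
    for i, (t, d) in enumerate(zip(tokens, dests)):
        bufs[d].append((i, t))
    return bufs

def combine(expert_bufs, n_tokens):
    # Scatter expert outputs back to their original token positions.
    out = [0] * n_tokens
    for buf in expert_bufs:
        for i, v in buf:
            out[i] = v
    return out

tokens = [7, 8, 9, 10]
dests = route(tokens, 2)
bufs = dispatch(tokens, dests, 2)
expert_bufs = [[(i, t * 2) for i, t in buf] for buf in bufs]  # each "expert" doubles
print(combine(expert_bufs, len(tokens)))  # [14, 16, 18, 20]
```

The point of the sketch is that routing decisions must exist before dispatch can bucket tokens by destination rank, which is why a fused-routing MoE kernel cannot be dropped in unchanged.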

@Fridge003 Fridge003 merged commit 2c2c4e4 into sgl-project:main Jan 24, 2026
279 of 292 checks passed
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

Labels

deepseek, documentation (Improvements or additions to documentation), Grace Blackwell, nvidia, quant (LLM Quantization), run-ci

6 participants