
Add SwapAB Optimization for triton fused_moe_kernel on SM90 #15712

Merged
Fridge003 merged 4 commits into sgl-project:main from Insideyyy:fused_moe_swapAB
Jan 7, 2026

Conversation

@Insideyyy
Contributor

@Insideyyy Insideyyy commented Dec 24, 2025

Motivation

When the M dimension is small and fp8_w8a8 is used on SM90, SwapAB brings a significant benefit: transposing inputs A and B lets the kernel make better use of WGMMA instructions.

Modifications

SwapAB is enabled when all of the following conditions hold:

  • use_fp8_w8a8 is set
  • the GPU is SM90 (Hopper)
  • config["BLOCK_SIZE_M"] < 64 and config["BLOCK_SIZE_N"] >= 64
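
The enablement check above can be sketched as a small predicate. This is a hypothetical helper for illustration; the actual function name and how SM90 is detected in the kernel launcher are assumptions, not the PR's code.

```python
def should_swap_ab(use_fp8_w8a8: bool, is_sm90: bool, config: dict) -> bool:
    """SwapAB only pays off for small M tiles paired with wide N tiles."""
    return (
        use_fp8_w8a8
        and is_sm90
        and config["BLOCK_SIZE_M"] < 64
        and config["BLOCK_SIZE_N"] >= 64
    )
```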

If SwapAB is enabled, `a`, `b`, `a_scale`, `b_scale`, and the accumulator are transposed before `tl.dot()`, and the accumulator is transposed back after the K iterations.
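
The correctness of the swap rests on the transpose identity (A·B)ᵀ = Bᵀ·Aᵀ: computing the product with swapped, transposed operands and transposing the result back reproduces A·B. A minimal NumPy check of the identity (not the Triton kernel itself), with tile shapes chosen to mimic a small-M case:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 256))   # small M tile: M=16, K=256
b = rng.standard_normal((256, 128))  # K=256, N=128

ref = a @ b                # normal orientation
swapped = (b.T @ a.T).T    # SwapAB: swap + transpose operands, transpose result back

assert np.allclose(ref, swapped)
```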

Accuracy Tests

Before this PR:


$ python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 8 --num-shots 8 --port 8188
100%|██████████| 200/200 [00:52<00:00,  3.82it/s]
Accuracy: 0.965
Invalid: 0.000
Latency: 52.392 s
Output throughput: 476.100 token/s

$ python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 32 --num-shots 8 --port 8188
100%|██████████| 200/200 [00:26<00:00,  7.46it/s]
Accuracy: 0.945
Invalid: 0.000
Latency: 26.948 s
Output throughput: 902.068 token/s

$ python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8 --port 8188
100%|██████████| 200/200 [00:13<00:00, 14.32it/s]
Accuracy: 0.965
Invalid: 0.000
Latency: 14.052 s
Output throughput: 1760.288 token/s

After this PR:

$ python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 8 --num-shots 8 --port 8188
100%|██████████| 200/200 [00:48<00:00,  4.17it/s]
Accuracy: 0.970
Invalid: 0.000
Latency: 48.143 s
Output throughput: 517.128 token/s

$ python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 32 --num-shots 8 --port 8188
100%|██████████| 200/200 [00:24<00:00,  8.21it/s]
Accuracy: 0.940
Invalid: 0.000
Latency: 24.458 s
Output throughput: 1012.738 token/s

$ python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8 --port 8188
100%|██████████| 200/200 [00:12<00:00, 15.60it/s]
Accuracy: 0.950
Invalid: 0.000
Latency: 12.973 s
Output throughput: 1891.012 token/s

Benchmarking and Profiling

We tested GLM-4.6V-FP8 on H20-3e. MoE configs were tuned with the benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py script.

Fused MoE module

Fused MoE kernel latency (ms):

| batch_size | Before this PR | After this PR | Speedup |
|-----------:|---------------:|--------------:|--------:|
| 1          | 0.058496       | 0.051936      | 1.126   |
| 2          | 0.069696       | 0.061696      | 1.130   |
| 4          | 0.104000       | 0.080864      | 1.286   |
| 8          | 0.141856       | 0.123984      | 1.144   |
| 32         | 0.247232       | 0.190592      | 1.297   |
| 128        | 0.270624       | 0.211360      | 1.280   |
| 256        | 0.343616       | 0.250176      | 1.373   |
| 512        | 0.358432       | 0.326848      | 1.097   |
| 1024       | 0.509120       | 0.494720      | 1.029   |
| 2048       | 0.840432       | 0.841968      | 0.998   |
| 4096       | 1.542496       | 1.544416      | 0.999   |
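
For reference, the speedup column is just the ratio of the two latencies. Recomputing the batch_size = 1 row:

```python
# speedup = latency before this PR / latency after this PR
before_ms, after_ms = 0.058496, 0.051936  # batch_size = 1 row from the table
speedup = before_ms / after_ms
print(f"{speedup:.3f}")  # ~1.126, matching the table
```

Note that speedup drops below 1.0 only for the largest batches (2048, 4096), where M is no longer small enough for SwapAB's tile shape to help.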

End to end

Setup

server:

MODEL=/root/GLM-4.6V-FP8
python3 -m sglang.launch_server \
        --model-path $MODEL \
        --host 127.0.0.1 \
        --port 8188 \
        --trust-remote-code \
        --mem-fraction-static 0.8 \
        --attention-backend flashinfer \
        --tp-size 4

client:

python3 -m sglang.bench_serving \
        --backend sglang \
        --host 127.0.0.1 \
        --port 8188 \
        --max-concurrency 8 \
        --dataset-name sharegpt \
        --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
        --num-prompt 100

Performance

Before this PR:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 8         
Successful requests:                     100       
Benchmark duration (s):                  44.23     
Total input tokens:                      33279     
Total input text tokens:                 33279     
Total input vision tokens:               0         
Total generated tokens:                  21392     
Total generated tokens (retokenized):    21366     
Request throughput (req/s):              2.26      
Input token throughput (tok/s):          752.34    
Output token throughput (tok/s):         483.61    
Peak output token throughput (tok/s):    648.00    
Peak concurrent requests:                14        
Total token throughput (tok/s):          1235.95   
Concurrency:                             7.55      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3341.50   
Median E2E Latency (ms):                 2437.68   
---------------Time to First Token----------------
Mean TTFT (ms):                          124.87    
Median TTFT (ms):                        90.04     
P99 TTFT (ms):                           630.39    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.80     
Median TPOT (ms):                        14.51     
P99 TPOT (ms):                           20.78     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           15.11     
Median ITL (ms):                         12.51     
P95 ITL (ms):                            14.55     
P99 ITL (ms):                            76.87     
Max ITL (ms):                            734.65    
==================================================

After this PR:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 8         
Successful requests:                     100       
Benchmark duration (s):                  38.13     
Total input tokens:                      33279     
Total input text tokens:                 33279     
Total input vision tokens:               0         
Total generated tokens:                  21392     
Total generated tokens (retokenized):    21385     
Request throughput (req/s):              2.62      
Input token throughput (tok/s):          872.76    
Output token throughput (tok/s):         561.01    
Peak output token throughput (tok/s):    720.00    
Peak concurrent requests:                15        
Total token throughput (tok/s):          1433.77   
Concurrency:                             7.53      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2870.34   
Median E2E Latency (ms):                 2057.39   
---------------Time to First Token----------------
Mean TTFT (ms):                          95.47     
Median TTFT (ms):                        86.83     
P99 TTFT (ms):                           149.95    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.94     
Median TPOT (ms):                        12.81     
P99 TPOT (ms):                           18.61     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.03     
Median ITL (ms):                         11.16     
P95 ITL (ms):                            13.22     
P99 ITL (ms):                            74.15     
Max ITL (ms):                            80.46     
==================================================



@ClawSeven
Collaborator

/tag-and-rerun-ci

@Fridge003
Collaborator

/rerun-failed-ci

@Fridge003 Fridge003 merged commit ee4d228 into sgl-project:main Jan 7, 2026
317 of 378 checks passed
michaelzhang-ai added a commit to michaelzhang-ai/sglang that referenced this pull request Jan 7, 2026
@sunxxuns
Collaborator

sunxxuns commented Jan 7, 2026

This PR didn't pass AMD CI and caused a failure. Please avoid such cases for community sharing.

@Fridge003
Collaborator

Hi @Insideyyy, it seems this PR breaks some AMD CIs:
https://github.com/sgl-project/sglang/actions/runs/20688902361/job/59402606297
https://github.com/sgl-project/sglang/actions/runs/20791025573/job/59713014224?pr=11349

So we reverted it temporarily. Can you make a fix?

@Insideyyy
Contributor Author

@Fridge003 Sorry for causing trouble. I'll make a fix.
