Add SwapAB Optimization for triton fused_moe_kernel on SM90. #15712
Fridge003 merged 4 commits into sgl-project:main
Conversation
/tag-and-rerun-ci
python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_kernels.py (outdated review comments, resolved)
/rerun-failed-ci
…gl-project#15712)" This reverts commit ee4d228.
This PR didn't pass the AMD CI and caused a failure. Please avoid such cases for community sharing.
Hi @Insideyyy, it seems this PR breaks some AMD CIs, so we reverted it temporarily. Can you make a fix?
@Fridge003 Sorry for causing trouble. I'll make a fix.
Motivation
When the M dimension is small and fp8_w8a8 is used on SM90, SwapAB brings a significant benefit by transposing inputs A and B to make better use of WGMMA: Hopper's WGMMA instruction tiles fix M at 64 but allow N to be as small as 8, so mapping the small dimension to N wastes far less compute.

Modifications
SwapAB is enabled under all the following conditions:
If SwapAB is enabled,
a, b, a_scale, b_scale and accumulator will be transposed before tl.dot(), and accumulator will be transposed back after the K iterations.

Accuracy Tests
Before this PR:
After this PR:
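For intuition, the swap is numerically exact: it relies on the identity A·B = (Bᵀ·Aᵀ)ᵀ. Below is a minimal NumPy sketch of the swapped K-loop accumulation; this is illustrative only (the function and block-size names are hypothetical and not taken from the Triton kernel, which operates on fp8 tiles with scales).

```python
import numpy as np

def matmul_swap_ab(a: np.ndarray, b: np.ndarray, block_k: int = 4) -> np.ndarray:
    """Compute a @ b via the SwapAB identity (a @ b) == (b.T @ a.T).T.

    Sketch of the idea in this PR: operands are transposed before each
    dot, the accumulator is kept in swapped (N, M) layout across the K
    iterations, and it is transposed back once at the end.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    # Accumulator lives in swapped (N, M) layout during the K loop.
    acc_t = np.zeros((n, m), dtype=np.float64)
    for k0 in range(0, k, block_k):
        a_blk = a[:, k0:k0 + block_k]   # (M, BK) tile of A
        b_blk = b[k0:k0 + block_k, :]   # (BK, N) tile of B
        acc_t += b_blk.T @ a_blk.T      # swapped dot: (N, BK) @ (BK, M)
    return acc_t.T                      # transpose back to (M, N)

rng = np.random.default_rng(0)
a = rng.standard_normal((2, 8))    # small M, as in the PR's motivating case
b = rng.standard_normal((8, 16))
assert np.allclose(matmul_swap_ab(a, b), a @ b)
```

Because the transposes only change layout and not values, the kernel's numerical results should match the non-swapped path up to floating-point accumulation order, which is what the accuracy tests above check.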
Benchmarking and Profiling
We tested GLM-4.6V-FP8 on H20-3e. MoE configs are tuned using the benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py script.

Fused MoE module
fused-moe-performance (ms):
End to end
Setup
server:
client:
Performance
Before this PR:
After this PR:
Checklist