[perf] enable SwapAB for bf16 moe triton kernel #20861
ZelinMa557 wants to merge 2 commits into sgl-project:main
Conversation
Signed-off-by: ZelinMa557 <3388706467@qq.com>
Summary of Changes (Gemini Code Assist): This pull request introduces a performance optimization for the Mixture-of-Experts (MoE) Triton kernel by enabling the "SwapAB" trick for bf16 inputs.
Code Review
The pull request successfully introduces the logic to enable the SwapAB trick for bf16 tensors, which is a valuable performance optimization. The invoke_fused_moe_kernel function correctly identifies when swap_ab should be enabled for bf16 inputs. However, a potential issue exists in the fused_moe_kernel where the actual matrix swap (tl.trans) is only applied in a very specific execution path, which might prevent the SwapAB optimization from being fully utilized or correctly applied across all relevant bf16 scenarios.
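For context, the SwapAB trick rests on the transpose identity A·B = (Bᵀ·Aᵀ)ᵀ: instead of padding a tiny M dimension up to the kernel's minimum row-block size, the kernel can compute the swapped product (which is what the `tl.trans` call implements) so that the large dimension occupies the row tiles. A minimal pure-Python sketch of the identity only, not of the actual Triton kernel:

```python
def matmul(X, Y):
    """Naive dense matmul for small illustrative matrices."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def transpose(X):
    return [list(row) for row in zip(*X)]

A = [[1, 2, 3], [4, 5, 6]]       # M x K, with M small
B = [[7, 8], [9, 10], [11, 12]]  # K x N

# Direct product A @ B ...
direct = matmul(A, B)
# ... equals the transposed swapped product (B^T @ A^T)^T,
# which is the form a swap_ab kernel evaluates.
swapped = transpose(matmul(transpose(B), transpose(A)))
assert direct == swapped
```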
python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_kernels.py
```python
use_bf16 = A.dtype == torch.bfloat16
if use_fp8_w8a8 or use_bf16:
```
The introduction of use_bf16 and its inclusion in the if condition correctly extends the swap_ab enablement logic to bfloat16 tensors, aligning with the pull request's goal. However, please refer to the comment on lines 579-580 in fused_moe_kernel regarding the limited application scope of swap_ab. While this change correctly sets the swap_ab flag, the kernel might not always apply the actual matrix swap due to its current placement.
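The dtype gate from the diff can be summarized as a small predicate. This is a simplified sketch (dtypes reduced to strings, and the function name is illustrative, not an identifier from the PR):

```python
def swap_ab_enabled(a_dtype: str, use_fp8_w8a8: bool) -> bool:
    # Mirrors the diff: swap_ab was previously considered only for
    # fp8 w8a8 paths and is now also considered for bfloat16 inputs.
    use_bf16 = a_dtype == "bfloat16"
    return use_fp8_w8a8 or use_bf16

print(swap_ab_enabled("bfloat16", False))  # True: newly covered by this PR
print(swap_ab_enabled("float16", False))   # False: fp16 path unchanged
```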
Signed-off-by: ZelinMa557 <3388706467@qq.com>
@BBuf Hi, sorry for the ping; this PR enables SwapAB for the bf16 MoE Triton kernel with only a few lines of change. Would you mind taking a quick look when you have a moment? Thanks!
Motivation
Inspired by #15712, sglang has enabled the SwapAB trick for the fp8 MoE Triton kernel; we can also use this trick for bf16 on Hopper.
Modifications
Modified python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_kernels.py to enable SwapAB when the input type is bf16.
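Why this helps: blocked GEMM kernels tile the M dimension, so when the per-expert token count after routing is small, most of each M-tile is padding; swapping A and B makes the (typically large) N dimension occupy the row tiles instead. A rough utilization sketch, where the block size is illustrative rather than the value tuned in this PR:

```python
import math

def m_tile_utilization(m: int, block_m: int) -> float:
    """Fraction of useful rows in the M-tiles of a blocked GEMM."""
    padded = math.ceil(m / block_m) * block_m
    return m / padded

# A small per-expert batch against an illustrative BLOCK_M of 64:
print(m_tile_utilization(8, 64))     # 0.125 -> 87.5% of the rows are padding
# After SwapAB the row dimension becomes N (e.g. 4096):
print(m_tile_utilization(4096, 64))  # 1.0 -> tiles fully utilized
```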
Accuracy Tests
start command:
result:
(screenshot: accuracy test results)
Benchmarking and Profiling
I use qwen3.5-35b-a3b to run the benchmarks.
(screenshot: benchmark results)
kernel benchmark:
It shows a significant improvement when 256 <= M <= 2048.
End to end benchmark command:
result before this pr:
(screenshots: throughput results before this PR)
Result after this PR: about 2.5% speedup in input/output token throughput.
Appendix
I use this command to tune the moe kernel:
config file for swap ab kernel:
config file for normal kernel:
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci