[DeepSeek V3.1/V3.2] Optimize fused moe configs for H20 & H20-3E based on swapab#17133
Fridge003 merged 2 commits into sgl-project:main
Conversation
Summary of Changes

Hello @xu-yfei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request improves the performance of Mixture-of-Experts (MoE) models, specifically DeepSeek V3.1/V3.2, on H20 and H20-3E hardware. It achieves this by fine-tuning MoE configurations in light of recent architectural changes and by upgrading the underlying tuning methodology. The updated tuning script now leverages CUDA Graphs for more precise measurements and employs a more comprehensive sampling approach, leading to more reliable and efficient MoE operations. The benchmarks provided show notable improvements in both TPOT for small token counts and TTFT on H20-3E.
@BBuf @Fridge003 Could you please review this PR?
Code Review
This pull request significantly enhances the MoE kernel tuning script by incorporating CUDA graphs for more precise benchmarking and adopting a more resilient strategy for identifying optimal configurations. It also introduces new, optimized configurations for DeepSeek V3.1/V3.2 on H20 and H20-3E hardware, which, as the benchmarks indicate, should yield performance improvements. The changes are generally well-executed, but I have pointed out a critical issue in the tuning script that requires attention to guarantee the accuracy of the benchmark outcomes. Additionally, I've offered a suggestion to enhance code readability.
benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton_sep.py
python/sglang/srt/layers/moe/fused_moe_triton/fused_moe_triton_kernels.py
/tag-and-rerun-ci

/rerun-failed-ci
@xu-yfei Hello, I noticed that in the down config file, USE_TMA is set to true for all batch sizes, which differs from my previous understanding that TMA would only be enabled for larger batch sizes. Could you please explain why this is the case? Thank you.
@dongyibo We wrap the MoE operator calls in a CUDA Graph for more accurate performance evaluation. Under that measurement, the tuning result is that USE_TMA is true for all down-proj layers.
Motivation
Performance tuning based on the code after the fused MoE SwapAB rework ([Rework] Add SwapAB Optimization for triton fused_moe_kernel on SM90, #16723). The optimal fused MoE configuration changes once SwapAB is taken into account.
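The SwapAB optimization referenced above rests on the transpose identity C = A·B = (Bᵀ·Aᵀ)ᵀ, which lets a GEMM kernel swap which operand maps to the tile dimension with the rigid size constraint (useful when the token count is small). A minimal, CPU-only sketch of the identity, with illustrative matrices standing in for activations and expert weights:

```python
def matmul(A, B):
    # Plain triple-loop matmul, for illustration only.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(M):
    return [list(row) for row in zip(*M)]

A = [[1, 2, 3], [4, 5, 6]]          # 2x3: a small batch of "tokens"
B = [[7, 8], [9, 10], [11, 12]]     # 3x2: "expert weights"

# SwapAB identity: C = A @ B can equivalently be computed as
# transpose(B^T @ A^T), which swaps the roles of the two operands.
C_direct = matmul(A, B)
C_swapped = transpose(matmul(transpose(B), transpose(A)))
assert C_direct == C_swapped
```

This snippet only demonstrates the mathematical equivalence; the actual kernel-level motivation and tiling details are in PR #16723.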
Optimize the tuning script
tuning_fused_moe_triton_sep.py:
- Wrap the kernel in a CUDA Graph to avoid inaccurate performance evaluation in small-token scenarios.
- Sample 100 data points in total, split into 10 iterations with 10 data points executed per iteration, replacing the previous approach of 10 samples total with 1 data point executed per iteration, repeated 10 times.

Tuned the DeepSeek V3.1/V3.2 TP8 scenarios on H20 and H20-3E devices.
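The revised measurement loop can be sketched as follows. This is a simplified, CPU-only illustration: the `benchmark` helper and the stand-in kernel are hypothetical, and the CUDA Graph capture used by the real tuning script is only indicated in the docstring.

```python
import time

def benchmark(kernel, num_iters=10, samples_per_iter=10):
    """Sketch of the revised sampling scheme: 100 samples total, split
    into 10 iterations of 10 samples each (vs. the old 10 samples of
    1 data point, repeated 10 times). `kernel` stands in for the fused
    MoE call; the real script additionally captures the launches of one
    iteration into a torch.cuda.CUDAGraph and replays the graph, so that
    per-launch overhead does not dominate small-token measurements."""
    per_sample_ms = []
    for _ in range(num_iters):
        start = time.perf_counter()
        for _ in range(samples_per_iter):
            kernel()
        elapsed = time.perf_counter() - start
        per_sample_ms.append(elapsed * 1e3 / samples_per_iter)
    # Report the best iteration, which filters out scheduling noise.
    return min(per_sample_ms)

best_ms = benchmark(lambda: sum(range(10_000)))
```

Batching 10 samples per timed region amortizes timer and launch overhead, which matters most for the small-token configurations this PR targets.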
Modifications
Accuracy Tests
Benchmarking and Profiling
For token counts ranging from 1 to 256, compare the performance before and after optimization using TPOT.
For token counts ranging from 512 to 8192, compare the performance before and after optimization using TTFT.
It can be observed that TPOT for token counts ranging from 1 to 256 has decreased, and TTFT on H20-3E has also been reduced.
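For reference, the two metrics used above can be computed from per-request timestamps as follows. This is a minimal sketch; the function name and timestamp inputs are illustrative, not taken from the PR's benchmark harness.

```python
def ttft_tpot(request_start, first_token_time, last_token_time,
              num_output_tokens):
    """TTFT (time to first token) is dominated by prefill, so it is the
    natural metric for large prompts (512-8192 tokens here). TPOT (time
    per output token) averages the decode steps after the first token,
    which is why it is used for the small-token range (1-256)."""
    ttft = first_token_time - request_start
    decode_time = last_token_time - first_token_time
    tpot = decode_time / max(num_output_tokens - 1, 1)
    return ttft, tpot

# Example: first token after 0.5 s, 100 tokens finishing at t = 10.4 s.
ttft, tpot = ttft_tpot(0.0, 0.5, 10.4, 100)
```

With these inputs, TTFT is 0.5 s and TPOT is (10.4 − 0.5) / 99 ≈ 0.1 s per token.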
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci