[Perf] Add A100 fused MoE tuned config for E=16, N=3072 #37814
xiaohajiayou wants to merge 4 commits into vllm-project:main
Conversation
@mgoin @pavanimajety Hi, could you please take a look at this PR when you have a chance?
Purpose
This PR adds a tuned fused MoE config for the A100 local EP shape used by HunyuanImage3:
- Device: NVIDIA_A100-SXM4-80GB
- Shape: E=16, N=3072
- Dtype: bf16

Previously, no tuned config existed for this shape, so the fused MoE kernel fell back to the default Triton configuration.
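For context, vLLM's tuned fused-MoE configs are JSON files (under the fused MoE `configs/` directory) keyed by batch size, each entry holding Triton launch parameters. A sketch of what one entry for this shape could look like (the parameter values below are illustrative only, not the actual tuned numbers from this PR):

```json
{
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 64,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 3
  }
}
```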
Motivation
During investigation of the HunyuanImage3 EP regression (vllm-omni#2015), we found that the local MoE shape E=16, N=3072 is exercised under TP=4 + EP on A100, and currently uses the default Triton config.
Profiling indicates that the regression is concentrated in the expert compute path (first Triton expert kernel), suggesting that the default config is suboptimal for this shape.
Tuned config
The configuration was generated using the official benchmark script on A100 with TP=4:
Kernel Benchmark Comparison
Kernel benchmark numbers were obtained with `benchmark_moe.py` by comparing the default fallback path (config file temporarily removed) against the tuned path (config file present) on A100.
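Whether the tuned or the fallback path is taken comes down to whether a config file matching the shape and device exists. A minimal sketch of the filename convention, assuming the `E=...,N=...,device_name=....json` naming used by the config files in the vLLM repository (the helper name here is hypothetical, not a vLLM API):

```python
def moe_config_file_name(E: int, N: int, device_name: str, dtype: str = "") -> str:
    """Build the expected fused-MoE config filename for a given shape/device.

    Mirrors the naming convention of the JSON files shipped under vLLM's
    fused MoE configs directory; `dtype` is appended only when non-empty
    (e.g. quantized variants).
    """
    dtype_part = f",dtype={dtype}" if dtype else ""
    return f"E={E},N={N},device_name={device_name}{dtype_part}.json"


# The shape added by this PR:
print(moe_config_file_name(16, 3072, "NVIDIA_A100-SXM4-80GB"))
# E=16,N=3072,device_name=NVIDIA_A100-SXM4-80GB.json
```

Removing or restoring a file with this name is what toggles between the default Triton fallback and the tuned configuration in the benchmark comparison above.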