
[Perf] Add A100 fused MoE tuned config for E=16,N=3072#37814

Open
xiaohajiayou wants to merge 4 commits into vllm-project:main from xiaohajiayou:perf/hunyuan-a100-moe-config

Conversation

@xiaohajiayou
Contributor

@xiaohajiayou xiaohajiayou commented Mar 22, 2026

Purpose

This PR adds a tuned fused MoE config for the A100 local EP shape used by HunyuanImage3:

  • device: NVIDIA_A100-SXM4-80GB
  • local shape: E=16, N=3072
  • dtype: bf16

Previously, no tuned config existed for this shape, so the fused MoE kernel fell back to the default Triton configuration.
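The kernel selects a tuned config by looking for a JSON file named after the expert count, intermediate size, and device. The sketch below mirrors vLLM's naming convention for these files (the exact dtype-suffix handling is an assumption here, not taken from this PR):

```python
# Sketch of the config-file naming scheme used by the fused MoE kernel
# to look up tuned configs. The dtype suffix format is assumed.
def config_file_name(E: int, N: int, device_name: str, dtype: str = "") -> str:
    device = device_name.replace(" ", "_")
    dtype_selector = f",dtype={dtype}" if dtype else ""
    return f"E={E},N={N},device_name={device}{dtype_selector}.json"

print(config_file_name(16, 3072, "NVIDIA A100-SXM4-80GB"))
# E=16,N=3072,device_name=NVIDIA_A100-SXM4-80GB.json
```

When no file with a matching name exists, the kernel falls back to a heuristic default Triton config, which is the behavior this PR replaces for the E=16, N=3072 shape.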

Motivation

During investigation of the HunyuanImage3 EP regression (vllm-omni#2015), we found that the local MoE shape:

  • E=16
  • N=3072

is exercised under TP=4 + EP on A100, and currently uses the default Triton config.

Profiling indicates that the regression is concentrated in the expert compute path (first Triton expert kernel), suggesting that the default config is suboptimal for this shape.

Tuned config

The configuration was generated using the official benchmark script on A100 with TP=4:

```bash
python benchmarks/kernels/benchmark_moe.py \
  --model /path/to/HunyuanImage-3.0 \
  --tp-size 4 \
  --tune \
  --trust-remote-code \
  --save-dir ./configs
```
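The tuning script writes a JSON file keyed by batch size (M), mapping each size to a Triton tile configuration. The fragment below is purely illustrative of the schema — the field names follow the fused MoE config format, but the values shown are not the entries merged in this PR:

```json
{
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 32,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 4
  },
  "32": {
    "BLOCK_SIZE_M": 32,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 64,
    "GROUP_SIZE_M": 8,
    "num_warps": 8,
    "num_stages": 3
  }
}
```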

Kernel Benchmark Comparison

| Batch size | Baseline (µs) | Tuned (µs) | Speedup |
|-----------:|--------------:|-----------:|--------:|
| 1 | 842.91 | 843.05 | 1.00x |
| 2 | 1477.72 | 1480.26 | 1.00x |
| 4 | 1968.82 | 2001.55 | 0.98x |
| 8 | 2124.56 | 2126.99 | 1.00x |
| 16 | 2144.33 | 2135.83 | 1.00x |
| 24 | 2163.29 | 2146.95 | 1.01x |
| 32 | 2722.83 | 2155.35 | 1.26x |
| 48 | 2186.90 | 2172.21 | 1.01x |
| 64 | 2769.27 | 2198.24 | 1.26x |
| 96 | 3092.62 | 2211.22 | 1.40x |
| 128 | 2843.98 | 2737.90 | 1.04x |
| 256 | 4189.04 | 3470.30 | 1.21x |
| 512 | 7008.46 | 6230.43 | 1.12x |
| 1024 | 12324.64 | 11708.29 | 1.05x |
| 1536 | 17670.50 | 15051.41 | 1.17x |
| 2048 | 23084.18 | 19780.58 | 1.17x |
| 3072 | 33757.21 | 29335.96 | 1.15x |
| 4096 | 44689.56 | 38574.08 | 1.16x |
| 8192 | 88428.19 | 78207.76 | 1.13x |

Kernel benchmark numbers were obtained with benchmark_moe.py by comparing the default fallback path (config file temporarily removed) against the tuned path (config file present) on A100.
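For clarity, the Speedup column is simply baseline latency divided by tuned latency. A minimal sketch checking a few rows from the table above:

```python
# Recompute the Speedup column for selected rows of the benchmark table:
# speedup = baseline latency / tuned latency, shown to two decimals.
rows = [
    (32, 2722.83, 2155.35),
    (96, 3092.62, 2211.22),
    (8192, 88428.19, 78207.76),
]
for batch, baseline_us, tuned_us in rows:
    speedup = baseline_us / tuned_us
    print(f"batch={batch}: {speedup:.2f}x")
# batch=32: 1.26x
# batch=96: 1.40x
# batch=8192: 1.13x
```

The largest gains land at batch sizes 32–96 (up to 1.40x), while very small batches are within noise of the default config.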

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the `ready` label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


xiaohajiayou force-pushed the perf/hunyuan-a100-moe-config branch 2 times, most recently from e6189f1 to 7024b0f on March 24, 2026 at 16:32
Signed-off-by: xiaohajiayou <923390377@qq.com>
xiaohajiayou force-pushed the perf/hunyuan-a100-moe-config branch from 7024b0f to 395e724 on March 25, 2026 at 13:03
@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 19, 2026

@mgoin @pavanimajety Hi, could you please take a look at this PR when you have a chance?
It only adds a tuned parameter config, and I have already run the relevant tests.
Please let me know if it is ready to merge, or if there is anything else I should address first.
