
[Perf] Add A100 fused MoE tuned config for E=16,N=3072#37814

Open
xiaohajiayou wants to merge 4 commits into vllm-project:main from xiaohajiayou:perf/hunyuan-a100-moe-config

Conversation

@xiaohajiayou
Contributor

@xiaohajiayou xiaohajiayou commented Mar 22, 2026

Purpose

This PR adds a tuned fused MoE config for the A100 local EP shape used by HunyuanImage3:

  • device: NVIDIA_A100-SXM4-80GB
  • local shape: E=16, N=3072
  • dtype: bf16

Previously, no tuned config existed for this shape, so the fused MoE kernel fell back to the default Triton configuration.
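The kernel selects a tuned config by looking for a JSON file named after the expert count, intermediate size, and device. The sketch below mirrors vLLM's naming convention for these files (the exact dtype-suffix handling is an assumption here, not taken from this PR):

```python
# Sketch of the config-file naming scheme used by the fused MoE kernel
# to look up tuned configs. The dtype suffix format is assumed.
def config_file_name(E: int, N: int, device_name: str, dtype: str = "") -> str:
    device = device_name.replace(" ", "_")
    dtype_selector = f",dtype={dtype}" if dtype else ""
    return f"E={E},N={N},device_name={device}{dtype_selector}.json"

print(config_file_name(16, 3072, "NVIDIA A100-SXM4-80GB"))
# E=16,N=3072,device_name=NVIDIA_A100-SXM4-80GB.json
```

When no file with a matching name exists, the kernel falls back to a heuristic default Triton config, which is the behavior this PR replaces for the E=16, N=3072 shape.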

Motivation

During investigation of the HunyuanImage3 EP regression (vllm-omni#2015), we found that the local MoE shape:

  • E=16
  • N=3072

is exercised under TP=4 + EP on A100, and currently uses the default Triton config.

Profiling indicates that the regression is concentrated in the expert compute path (first Triton expert kernel), suggesting that the default config is suboptimal for this shape.

Tuned config

The configuration was generated using the official benchmark script on A100 with TP=4:

```bash
python benchmarks/kernels/benchmark_moe.py \
  --model /path/to/HunyuanImage-3.0 \
  --tp-size 4 \
  --tune \
  --trust-remote-code \
  --save-dir ./configs
```
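The tuning script writes a JSON file keyed by batch size (M), mapping each size to a Triton tile configuration. The fragment below is purely illustrative of the schema — the field names follow the fused MoE config format, but the values shown are not the entries merged in this PR:

```json
{
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 32,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 4
  },
  "32": {
    "BLOCK_SIZE_M": 32,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 64,
    "GROUP_SIZE_M": 8,
    "num_warps": 8,
    "num_stages": 3
  }
}
```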

Kernel Benchmark Comparison

| Batch size | Baseline (µs) | Tuned (µs) | Speedup |
|-----------:|--------------:|-----------:|--------:|
| 1 | 842.91 | 843.05 | 1.00x |
| 2 | 1477.72 | 1480.26 | 1.00x |
| 4 | 1968.82 | 2001.55 | 0.98x |
| 8 | 2124.56 | 2126.99 | 1.00x |
| 16 | 2144.33 | 2135.83 | 1.00x |
| 24 | 2163.29 | 2146.95 | 1.01x |
| 32 | 2722.83 | 2155.35 | 1.26x |
| 48 | 2186.90 | 2172.21 | 1.01x |
| 64 | 2769.27 | 2198.24 | 1.26x |
| 96 | 3092.62 | 2211.22 | 1.40x |
| 128 | 2843.98 | 2737.90 | 1.04x |
| 256 | 4189.04 | 3470.30 | 1.21x |
| 512 | 7008.46 | 6230.43 | 1.12x |
| 1024 | 12324.64 | 11708.29 | 1.05x |
| 1536 | 17670.50 | 15051.41 | 1.17x |
| 2048 | 23084.18 | 19780.58 | 1.17x |
| 3072 | 33757.21 | 29335.96 | 1.15x |
| 4096 | 44689.56 | 38574.08 | 1.16x |
| 8192 | 88428.19 | 78207.76 | 1.13x |

Kernel benchmark numbers were obtained with benchmark_moe.py by comparing the default fallback path (config file temporarily removed) against the tuned path (config file present) on A100.
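For clarity, the Speedup column is simply baseline latency divided by tuned latency. A minimal sketch checking a few rows from the table above:

```python
# Recompute the Speedup column for selected rows of the benchmark table:
# speedup = baseline latency / tuned latency, shown to two decimals.
rows = [
    (32, 2722.83, 2155.35),
    (96, 3092.62, 2211.22),
    (8192, 88428.19, 78207.76),
]
for batch, baseline_us, tuned_us in rows:
    speedup = baseline_us / tuned_us
    print(f"batch={batch}: {speedup:.2f}x")
# batch=32: 1.26x
# batch=96: 1.40x
# batch=8192: 1.13x
```

The largest gains land at batch sizes 32–96 (up to 1.40x), while very small batches are within noise of the default config.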

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the `ready` label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


xiaohajiayou force-pushed the perf/hunyuan-a100-moe-config branch 2 times, most recently from e6189f1 to 7024b0f on March 24, 2026 at 16:32
Signed-off-by: xiaohajiayou <923390377@qq.com>
xiaohajiayou force-pushed the perf/hunyuan-a100-moe-config branch from 7024b0f to 395e724 on March 25, 2026 at 13:03
@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 19, 2026

@mgoin @pavanimajety Hi, could you please take a look at this PR when you have a chance?
It only adds a tuned parameter config, and I have already run the relevant tests.
Please let me know if it is ready to merge, or if there is anything else I should address first.
