QWEN3 Thinking Fused MoE kernel optimization configs #24330
houseroad merged 5 commits into vllm-project:main
Conversation
Code Review
This pull request introduces optimized configurations for the Fused MoE kernels, specifically for Qwen3 models on NVIDIA H100, B200, and GB200 GPUs. The changes add new and update existing JSON configuration files with tuned parameters for the fp8_w8a8 dtype. The provided benchmark results demonstrate significant improvements in output token throughput and latency, which is a great contribution. The new configurations appear valid and consistent with the expected parameters for the Triton kernels. Overall, this is a solid performance optimization.
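For context on what these files contain: each fused MoE config is a JSON file whose name encodes the expert count (E), intermediate size (N), device name, and dtype (for example, a B200 file would end in device_name=NVIDIA_B200,dtype=fp8_w8a8.json), and whose body maps a token batch size to Triton tile parameters. The sketch below shows the schema only; the values are illustrative, not the tuned values from this PR.

```json
{
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 3
  },
  "64": {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 8,
    "num_warps": 8,
    "num_stages": 4
  }
}
```

At runtime the kernel picks the entry whose batch-size key is closest to the actual number of tokens, which is why tuned files cover a range of batch sizes rather than a single point.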
houseroad left a comment:
Let's only check in the B200 files?
houseroad left a comment:
Thanks for the tuned fused moe config!
Signed-off-by: Saman Keon <samanamp@outlook.com>
Purpose
Add optimized Fused MoE kernel configurations for QWEN3 Thinking models on NVIDIA H100, B200, and GB200 GPUs.
We see a 13.7% improvement in output token throughput (tok/s), a 12% improvement in median TPOT (ms), and a 17% improvement in P99 TPOT (ms).
Test Plan
Server side:
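The launch command itself isn't captured above; a minimal sketch, assuming the FP8 Qwen3 Thinking checkpoint and 8-way tensor parallelism (both are assumptions, not taken from this PR):

```bash
# Sketch only: the model name and --tensor-parallel-size are assumptions.
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 \
  --tensor-parallel-size 8
```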
Bench:
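Likewise, the benchmark command isn't captured; a representative run against the server above might look like this (dataset shape and request count are assumptions):

```bash
# Sketch only: dataset, sequence lengths, and prompt count are assumptions.
vllm bench serve \
  --model Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 200
```

A run like this reports the output token throughput and TPOT percentiles quoted in the results below.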
Test Result
BEFORE:
AFTER:
We see a 13.7% improvement in output token throughput (tok/s), a 12% improvement in median TPOT (ms), and a 17% improvement in P99 TPOT (ms).
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.