[Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300X #25703
Conversation
Signed-off-by: xaguilar <Xavier.AguilarFruto@amd.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add ready label to the PR, or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request adds new Triton MoE tuning configurations for the AMD MI300X GPU, specifically for models with GLM-4.5 dimensions. While adding optimized hardware-specific configurations is valuable for performance, the pull request description lacks a test plan and, more importantly, test results. For performance-sensitive changes like this, it's crucial to provide benchmark data to validate that these new configurations indeed offer an improvement and do not introduce regressions. Please update the pull request with details on how these configurations were generated and the performance impact observed.
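As background on what "how these configurations were generated" usually means here: tuned MoE configs are produced by sweeping kernel parameters per batch size and keeping the fastest candidate. The sketch below is a hypothetical, self-contained stand-in for that sweep — a real tuner times the actual Triton MoE kernel, whereas `cost_model` here is an illustrative placeholder, and none of these names come from vLLM's actual tuning script.

```python
import itertools
import json

# Hypothetical search space; real tuning sweeps more parameters
# (e.g. BLOCK_SIZE_K, num_stages) and times the actual kernel.
SEARCH_SPACE = {
    "BLOCK_SIZE_M": [16, 32, 64],
    "BLOCK_SIZE_N": [32, 64, 128],
    "GROUP_SIZE_M": [1, 4, 8],
    "num_warps": [1, 2, 4, 8],
}

def cost_model(batch_size: int, cfg: dict) -> float:
    # Stand-in for a real kernel benchmark: penalize tile sizes that
    # leave the batch dimension underutilized, and reward more warps.
    waste = (-batch_size) % cfg["BLOCK_SIZE_M"]
    return waste / cfg["BLOCK_SIZE_M"] + 1.0 / cfg["num_warps"]

def tune(batch_sizes):
    # For each batch size, exhaustively score every candidate config
    # and keep the cheapest one, keyed by batch size as in the JSON files.
    best = {}
    keys = list(SEARCH_SPACE)
    for m in batch_sizes:
        candidates = (dict(zip(keys, vals))
                      for vals in itertools.product(*SEARCH_SPACE.values()))
        best[str(m)] = min(candidates, key=lambda c: cost_model(m, c))
    return best

configs = tune([1, 32, 48])
print(json.dumps(configs, indent=2))
```

The output has the same shape as the per-batch-size JSON files this PR adds, which is why reviewers ask for the benchmark data behind each entry.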
| "GROUP_SIZE_M": 4, | ||
| "num_warps": 1, |
The configuration for a batch size of 32 appears unusual, particularly when compared to the one for batch size 48, which uses identical block dimensions.
- num_warps: 1 — Using a single warp is often suboptimal, as it limits parallelism and the ability to hide memory latency. For batch size 48, num_warps is set to 2.
- GROUP_SIZE_M: 4 — This enables grouping for better L2 cache reuse, which is generally more effective for larger batch sizes. However, for batch size 48, grouping is disabled (GROUP_SIZE_M: 1).
This combination for batch size 32 is counter-intuitive and might be a typo, potentially leading to performance degradation. Could you please verify these parameters and provide benchmark data? I'd suggest aligning it with the configuration for batch size 48 if this was an oversight.
| "GROUP_SIZE_M": 4, | |
| "num_warps": 1, | |
| "GROUP_SIZE_M": 1, | |
| "num_warps": 2, |
vllm-project#25703) Signed-off-by: xaguilar <Xavier.AguilarFruto@amd.com>
#25703) Signed-off-by: xaguilar <Xavier.AguilarFruto@amd.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
This PR adds Triton MoE tuning configs for the AMD MI300X for the GLM-4.5 model, and for other models that share the same dimensions in their MoE kernels.
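For readers unfamiliar with how "other models that share the same dimensions" works: fused-MoE tuning configs are stored in JSON files whose names encode the expert count, the sharded intermediate size, and the device, so any model that hashes to the same filename picks up the same configs. The helper below is a hedged sketch of that naming convention (the exact fields can vary by vLLM version, and the numeric values in the usage line are illustrative, not GLM-4.5's actual dimensions):

```python
def config_file_name(num_experts: int, shard_n: int, device: str, dtype=None) -> str:
    # Sketch of the convention E=<experts>,N=<shard size>,device_name=<gpu>.json;
    # an optional dtype selector is appended for quantized variants.
    suffix = f",dtype={dtype}" if dtype else ""
    return f"E={num_experts},N={shard_n},device_name={device}{suffix}.json"

# Illustrative values only:
print(config_file_name(160, 768, "AMD_Instinct_MI300X"))
```

Because the lookup is purely dimensional, one tuned file benefits every model whose MoE layers match those dimensions on that GPU.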
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.