[ROCm][CI] Fix GPT-OSS Quark MXFP4+FP8 MoE startup #41330
AndreasKaratzas wants to merge 1 commit into vllm-project:main
Conversation
…dding required by the Triton CDNA4 MXFP4 scale layout Signed-off-by: Andreas Karatzas <akaratza@amd.com>
cc @Rohan138
Code Review
This pull request introduces the maybe_roundup_sizes method to the Quark MoE implementation to ensure proper padding for MXFP4 weights when using Triton kernels. The feedback suggests using a more idiomatic Python approach with super() to call the grandparent class method instead of referencing the base class directly, which improves code style and MRO handling.
FusedMoEMethodBase.maybe_roundup_sizes(
    self,
    hidden_size=hidden_size,
    intermediate_size_per_partition=intermediate_size_per_partition,
    act_dtype=act_dtype,
    moe_parallel_config=moe_parallel_config,
)
The use of FusedMoEMethodBase.maybe_roundup_sizes(self, ...) is a bit unconventional for calling a grandparent's method. While it works, using super(QuarkOCP_MX_MoEMethod, self).maybe_roundup_sizes(...) is the standard and more idiomatic way in Python to explicitly skip the immediate parent's implementation and call the next one in the MRO.
super(QuarkOCP_MX_MoEMethod, self).maybe_roundup_sizes(
hidden_size=hidden_size,
intermediate_size_per_partition=intermediate_size_per_partition,
act_dtype=act_dtype,
moe_parallel_config=moe_parallel_config,
)
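As a quick, self-contained illustration of that pattern (hypothetical class names, not the actual vLLM hierarchy), super(Middle, self) starts the attribute lookup after Middle in the instance's MRO, which is how the immediate parent's override gets skipped:

class Base:
    def maybe_roundup_sizes(self, hidden_size):
        return hidden_size  # grandparent behavior we want to reach

class Middle(Base):
    def maybe_roundup_sizes(self, hidden_size):
        raise NotImplementedError("intentionally skipped")

class Child(Middle):
    def maybe_roundup_sizes(self, hidden_size):
        # Equivalent in effect to Base.maybe_roundup_sizes(self, ...),
        # but resolved through the MRO: the lookup starts after Middle.
        return super(Middle, self).maybe_roundup_sizes(hidden_size)

assert Child().maybe_roundup_sizes(2880) == 2880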
So this part is actually incorrect, coming from #39801 ... Similar to #41175, we need to add back
Closing in favor of #39136
Fix GPT-OSS Quark MXFP4+FP8 MoE startup on ROCm/gfx950 by applying the padding required by the Triton/CDNA4 MXFP4 scale layout, and align the GPT-OSS Quark monolithic MoE method with the current MoERunner call signature.

The GPQA ROCm config for amd/gpt-oss-20b-MoE-Quant-W-MXFP4-A-FP8-KV-FP8 failed during model startup while processing MoE weights. The failing path is QuarkOCP_MX_MoEMethod_OSS, used for GPT-OSS Quark weights with MXFP4 weights and static FP8 activations. This subclass still swizzles MXFP4 weights and scales through Triton kernels in process_weights_after_loading(). However, the base QuarkOCP_MX_MoEMethod treats w_mxfp4_a_fp8 as an emulation path for sizing purposes, so it does not apply the MXFP4 backend padding. On ROCm/CDNA4, Triton scale swizzling requires aligned dimensions. For the 20B TP=2 case, the unpadded dimensions produce scale shapes based on hidden_size=2880 and intermediate_size_per_partition=1440; hidden_size / 32 = 90, which is not compatible with the CDNA4 scale swizzle layout.
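As a rough sketch of the kind of round-up padding involved (the actual alignment required by the CDNA4 scale swizzle is not stated here, so the ALIGN value below is an assumption for illustration only):

def round_up(x: int, align: int) -> int:
    # Smallest multiple of `align` that is >= x.
    return ((x + align - 1) // align) * align

ALIGN = 256  # hypothetical alignment; not necessarily the value the real kernels use

hidden_size = 2880
intermediate_size_per_partition = 1440

padded_hidden = round_up(hidden_size, ALIGN)                            # 3072
padded_intermediate = round_up(intermediate_size_per_partition, ALIGN)  # 1536

# MXFP4 scales are stored per group of 32 elements, so padding changes the
# scale dimension from 2880 / 32 = 90 to 3072 / 32 = 96 under this assumption.
print(hidden_size // 32, padded_hidden // 32)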
Testing

Ran the focused padded MoE test file:
cc @kenroche