[Refactor] Move FusedMoE hidden_size roundup to quant_method #34285
BowenBao wants to merge 9 commits into vllm-project:main
Conversation
/gemini review
Code Review
This pull request refactors the logic for rounding up the hidden_size in FusedMoE layers, moving the responsibility from the generic FusedMoE layer to the specific quantization methods. This is a good architectural improvement. My main feedback is about code duplication and a potential bug in QuarkOCP_MX_MoEMethod where the roundup logic is applied unconditionally for gpt_oss models, even for non-MXFP4 quantization types.
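The potential bug flagged above could be avoided with a scheme check before padding. The sketch below is illustrative only: the function name, the `"mxfp4"` scheme string, and the 256-element alignment are assumptions, not Quark's or vLLM's actual API.

```python
# Hypothetical guard: apply the gpt_oss hidden_size roundup only for
# MXFP4 schemes, rather than unconditionally for every quant type.
def maybe_roundup_hidden_size(hidden_size: int, quant_scheme: str) -> int:
    if quant_scheme != "mxfp4":
        # Non-MXFP4 quantization types keep the original hidden size.
        return hidden_size
    align = 256  # assumed MXFP4 kernel alignment requirement
    return (hidden_size + align - 1) // align * align
```

For example, a gpt_oss-style hidden size of 2880 would pad to 3072 under MXFP4, but pass through unchanged for other schemes.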
Code Review
This pull request refactors the logic for rounding up the hidden size in FusedMoE layers by moving it from the generic layer.py to the specific quant_method implementations. This is a good architectural improvement, as it places quantization-specific logic where it belongs. The changes in fused_moe_method_base.py and layer.py are correct. However, this refactoring has introduced code duplication in mxfp4.py and quark_moe.py for handling gpt_oss models. I've added comments with suggestions to address this.
/gemini review
Code Review
This pull request refactors the hidden_size roundup logic for FusedMoE layers by moving it into the quant_method. This is a good architectural improvement as it localizes quantization-specific logic. The changes are well-structured. I've found one issue where a function is called with incorrect arguments, which I've detailed in a specific comment.
This pull request has merge conflicts that must be resolved before it can be merged.
FYI, #32307 might be relevant; I'm not sure what the pad size for gpt-oss on MI300 should be, i.e. 128 or 256. Needs further investigation; I haven't had time to run proper perf benchmarks, unfortunately.
(The branch was force-pushed from 8b8fcbd to 6e4c34c.)
/gemini review
Code Review
This pull request refactors the hidden_size and intermediate_size rounding logic in the FusedMoE layer by moving it into the quant_method. This is a significant improvement in maintainability as it centralizes quantization-specific alignment requirements (especially for MXFP4 backends) within the quantization methods themselves, rather than having brittle model-type checks in the core layer logic. The changes ensure that both the moe_config and the actual weight tensors are created with consistent, correctly padded dimensions across different hardware platforms and quantization schemes.
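The overall shape of the refactor described above can be sketched as follows. All names here (`FusedMoEMethodBase`'s hook name, `Mxfp4MoEMethod`, the 256 alignment) are illustrative assumptions, not vLLM's actual API; the point is only that the generic layer defers padding decisions to the quant method.

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple."""
    return (x + multiple - 1) // multiple * multiple

class FusedMoEMethodBase:
    def maybe_roundup_hidden_size(self, hidden_size: int) -> int:
        # Default: no quantization-specific padding.
        return hidden_size

class Mxfp4MoEMethod(FusedMoEMethodBase):
    def maybe_roundup_hidden_size(self, hidden_size: int) -> int:
        # MXFP4 kernels need the hidden dim padded to a block multiple.
        return round_up(hidden_size, 256)

# The layer then asks its quant method, instead of hard-coding
# model-type checks:
#   hidden_size = self.quant_method.maybe_roundup_hidden_size(hidden_size)
```

This keeps moe_config and the weight tensors consistent, since both are created from the already-padded size.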
This is a nice simplification. I wonder if we can go even further by making the layer entirely unaware of the hidden size / intermediate size? WDYT?
I think it should be doable, see #34285 (comment), unless there are other use cases of …
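One way to read this follow-up idea, matching the commit "make hidden and inter size property of fusedmoe, src from moe_config" later in this PR: the layer stops caching the sizes itself and reads them from moe_config, so there is a single source of truth. Class and field names below are illustrative, not vLLM's exact definitions.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    hidden_size: int
    intermediate_size: int

class FusedMoE:
    def __init__(self, moe_config: MoEConfig):
        # The config (already padded by the quant method) is the only
        # place the sizes are stored.
        self.moe_config = moe_config

    @property
    def hidden_size(self) -> int:
        return self.moe_config.hidden_size

    @property
    def intermediate_size(self) -> int:
        return self.moe_config.intermediate_size
```

With this shape, any later adjustment to the sizes happens in one place (moe_config) and the layer can never drift out of sync with it.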
Hi @BowenBao, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
ChuanLi1101
left a comment
Left some comments FYI.
Thanks @BowenBao. BTW, I also filed an aiter issue for the MiniMax M2.1 MXFP4 TP4 case.
ChuanLi1101
left a comment
LGTM, thanks for addressing the comments.
fxmarty-amd
left a comment
LGTM. OCP MX emulation should be refactored as an Mxfp4Backend in a follow-up PR.
@tjtanaa could you help land this PR?
@robertgshaw2-redhat could you take another look to confirm the comments are all addressed and this can be landed?
(The branch was force-pushed from 9634bac to 0d76347.)
Hi @BowenBao, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
@BowenBao can you also provide the lm-eval score for this model?
This pull request has merge conflicts that must be resolved before it can be merged.
Commits (each signed off by Bowen Bao <bowenbao@amd.com>):
- address comments
- fix
- further refactor
- address comments
- refine backend check
- small fix
- make hidden and inter size property of fusedmoe, src from moe_config
- typ
- fix quark emulation
- inter_size now a property
(The branch was force-pushed from 4055e9e to ae0274d.)