[Refactor] Move FusedMoE hidden_size roundup to quant_method#34285

Open
BowenBao wants to merge 9 commits into vllm-project:main from BowenBao:bowenbao/move_mxfp4_moe_roundup
Conversation

@BowenBao
Contributor

@BowenBao BowenBao commented Feb 10, 2026

  • Refactor the hidden_size and intermediate_size roundup logic so it is handled by the QuantMethod.
  • Store both the padded and unpadded sizes in MoeConfig.
  • Update the ROCm padding logic to improve performance on MI300X. Thanks to @Rohan138 for the suggestion and evaluation.
  • Enable Quark MXFP4 MoE with the aiter backend running with a padded intermediate_size.
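The roundup described above can be sketched as follows. This is an illustrative sketch, not vLLM's actual API: the helper names (`round_up`, `maybe_roundup_layer_sizes`) and the 256-element alignment are assumptions for the example.

```python
# Hedged sketch of hidden_size / intermediate_size roundup, as a quant
# method might perform it before creating weights. Names and the
# alignment value are illustrative, not vLLM's real implementation.

def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple of `multiple`."""
    return ((x + multiple - 1) // multiple) * multiple

def maybe_roundup_layer_sizes(
    hidden_size: int, intermediate_size: int, alignment: int = 256
) -> tuple[int, int]:
    """Return (padded_hidden_size, padded_intermediate_size)."""
    return (round_up(hidden_size, alignment),
            round_up(intermediate_size, alignment))

# Example: a 2880-wide hidden dim padded to a multiple of 256 gives 3072.
padded_h, padded_i = maybe_roundup_layer_sizes(2880, 2880)
```

The unpadded sizes would still be kept alongside the padded ones (the second bullet above) so that activations can be sliced back to the model's true dimensions after the padded kernels run.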

@BowenBao BowenBao changed the title Refactor FusedMoE hidden_size roundup [Refactor] Move FusedMoE hidden_size roundup to quant_method Feb 10, 2026
@BowenBao
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the logic for rounding up the hidden_size in FusedMoE layers, moving the responsibility from the generic FusedMoE layer to the specific quantization methods. This is a good architectural improvement. My main feedback is about code duplication and a potential bug in QuarkOCP_MX_MoEMethod where the roundup logic is applied unconditionally for gpt_oss models, even for non-MXFP4 quantization types.
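The potential bug the review points at can be illustrated with a small sketch: the roundup should be guarded by the quantization type rather than applied for every gpt_oss model. The class and attribute names below are illustrative stand-ins, not vLLM's real `QuarkOCP_MX_MoEMethod` code.

```python
# Hedged sketch of the guard the review asks for: only apply the
# hidden_size roundup when the quant dtype actually requires it (MXFP4),
# instead of unconditionally for gpt_oss models. Names are illustrative.

def round_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple

class QuarkMoEMethodSketch:
    def __init__(self, quant_dtype: str, is_gpt_oss: bool):
        self.quant_dtype = quant_dtype
        self.is_gpt_oss = is_gpt_oss

    def padded_hidden_size(self, hidden_size: int, alignment: int = 256) -> int:
        # Guard on the quant dtype: non-MXFP4 schemes keep the original size.
        if self.quant_dtype == "mxfp4" and self.is_gpt_oss:
            return round_up(hidden_size, alignment)
        return hidden_size
```

Without the `quant_dtype` check, a gpt_oss model quantized with a non-MXFP4 scheme would get its weights padded even though its kernels impose no such alignment requirement.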

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the logic for rounding up the hidden size in FusedMoE layers by moving it from the generic layer.py to the specific quant_method implementations. This is a good architectural improvement, as it places quantization-specific logic where it belongs. The changes in fused_moe_method_base.py and layer.py are correct. However, this refactoring has introduced code duplication in mxfp4.py and quark_moe.py for handling gpt_oss models. I've added comments with suggestions to address this.

@BowenBao
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the hidden_size roundup logic for FusedMoE layers by moving it into the quant_method. This is a good architectural improvement as it localizes quantization-specific logic. The changes are well-structured. I've found one issue where a function is called with incorrect arguments, which I've detailed in a specific comment.

@mergify

mergify bot commented Feb 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @BowenBao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 11, 2026
@Rohan138
Contributor

FYI, #32307 might be relevant; I'm not sure whether the pad size for gpt-oss on MI300 should be 128 or 256. It needs further investigation; unfortunately I haven't had time to run proper perf measurements.

@BowenBao
Contributor Author

/gemini review

@mergify mergify bot removed the needs-rebase label Feb 16, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the hidden_size and intermediate_size rounding logic in the FusedMoE layer by moving it into the quant_method. This is a significant improvement in maintainability as it centralizes quantization-specific alignment requirements (especially for MXFP4 backends) within the quantization methods themselves, rather than having brittle model-type checks in the core layer logic. The changes ensure that both the moe_config and the actual weight tensors are created with consistent, correctly padded dimensions across different hardware platforms and quantization schemes.

@robertgshaw2-redhat
Collaborator

This is a nice simplification. I wonder if we can go even further and make the layer entirely unaware of the hidden size / intermediate size? WDYT?

@BowenBao
Contributor Author

I think it should be doable; see #34285 (comment), unless there are other use cases of layer.hidden_size that I'm unaware of.

@mergify

mergify bot commented Feb 24, 2026

Hi @BowenBao, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@Rohan138 Rohan138 mentioned this pull request Feb 24, 2026
5 tasks
Contributor

@ChuanLi1101 ChuanLi1101 left a comment

Left some comments FYI.

@hongxiayang
Collaborator

Thanks @BowenBao.

BTW, I also filed an aiter issue for the MiniMax M2.1 MXFP4 TP4 case.

Contributor

@ChuanLi1101 ChuanLi1101 left a comment

LGTM, thanks for addressing the comments.

Contributor

@fxmarty-amd fxmarty-amd left a comment

LGTM. OCP MX emulation should be refactored as an Mxfp4Backend in a follow-up PR.

@BowenBao
Contributor Author

BowenBao commented Mar 9, 2026

@tjtanaa could you help land this PR?

@BowenBao
Contributor Author

@robertgshaw2-redhat could you take another look if comments are all addressed and this can be landed?

@tjtanaa tjtanaa added the rocm Related to AMD ROCm label Mar 13, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 13, 2026
@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 13, 2026
@BowenBao BowenBao force-pushed the bowenbao/move_mxfp4_moe_roundup branch from 9634bac to 0d76347 Compare March 13, 2026 18:53
@mergify

mergify bot commented Mar 13, 2026

Hi @BowenBao, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@tjtanaa
Collaborator

tjtanaa commented Mar 17, 2026

@BowenBao can you also provide the lm-eval score for this model?

Please review. The latest part of the PR enables Quark MXFP4 MoE with the aiter backend running with a padded intermediate_size.

Tested with MiniMax M2.1 MXFP4 TP4.

Before:

Output token throughput (tok/s): 2182.16
Peak output token throughput (tok/s): 6973.00
Peak concurrent requests: 1000.00
Total token throughput (tok/s): 19639.48

After:

Output token throughput (tok/s): 3617.48
Peak output token throughput (tok/s): 14404.00
Peak concurrent requests: 1000.00
Total token throughput (tok/s): 32557.31

@mergify

mergify bot commented Mar 18, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @BowenBao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 18, 2026
Signed-off-by: Bowen Bao <bowenbao@amd.com>

address comments

Signed-off-by: Bowen Bao <bowenbao@amd.com>

fix

Signed-off-by: Bowen Bao <bowenbao@amd.com>

further refactor

Signed-off-by: Bowen Bao <bowenbao@amd.com>

address comments

Signed-off-by: Bowen Bao <bowenbao@amd.com>

refine backend check

Signed-off-by: Bowen Bao <bowenbao@amd.com>

small fix

Signed-off-by: Bowen Bao <bowenbao@amd.com>

make hidden and inter size property of fusedmoe, src from moe_config

Signed-off-by: Bowen Bao <bowenbao@amd.com>

typ

Signed-off-by: Bowen Bao <bowenbao@amd.com>

fix quark emulation

Signed-off-by: Bowen Bao <bowenbao@amd.com>

inter_size now a property

Signed-off-by: Bowen Bao <bowenbao@amd.com>
@BowenBao BowenBao force-pushed the bowenbao/move_mxfp4_moe_roundup branch from 4055e9e to ae0274d Compare March 18, 2026 19:59
@mergify mergify bot removed the needs-rebase label Mar 18, 2026

Labels

nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm (Related to AMD ROCm)

Projects

Status: Todo
9 participants