[Refactor] Move FusedMoE hidden_size roundup to quant_method#34285

Open
BowenBao wants to merge 9 commits into vllm-project:main from BowenBao:bowenbao/move_mxfp4_moe_roundup
Conversation

@BowenBao
Contributor

@BowenBao BowenBao commented Feb 10, 2026

  • Refactor the hidden_size and intermediate_size roundup logic so it is handled by the QuantMethod.
  • Store both the padded and unpadded sizes in MoeConfig.
  • Update the ROCm padding logic to improve performance on MI300X. Thanks to @Rohan138 for the suggestion and evaluation.
  • Enable Quark MXFP4 MoE with the aiter backend running with a padded intermediate_size.
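The roundup described above can be sketched as follows. This is an illustrative sketch, not vLLM's actual API: the helper names (`round_up`, `maybe_roundup_layer_sizes`) and the 256-element alignment are assumptions for the example.

```python
# Hedged sketch of hidden_size / intermediate_size roundup, as a quant
# method might perform it before creating weights. Names and the
# alignment value are illustrative, not vLLM's real implementation.

def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple of `multiple`."""
    return ((x + multiple - 1) // multiple) * multiple

def maybe_roundup_layer_sizes(
    hidden_size: int, intermediate_size: int, alignment: int = 256
) -> tuple[int, int]:
    """Return (padded_hidden_size, padded_intermediate_size)."""
    return (round_up(hidden_size, alignment),
            round_up(intermediate_size, alignment))

# Example: a 2880-wide hidden dim padded to a multiple of 256 gives 3072.
padded_h, padded_i = maybe_roundup_layer_sizes(2880, 2880)
```

The unpadded sizes would still be kept alongside the padded ones (the second bullet above) so that activations can be sliced back to the model's true dimensions after the padded kernels run.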

@BowenBao BowenBao changed the title Refactor FusedMoE hidden_size roundup [Refactor] Move FusedMoE hidden_size roundup to quant_method Feb 10, 2026
@BowenBao
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the logic for rounding up the hidden_size in FusedMoE layers, moving the responsibility from the generic FusedMoE layer to the specific quantization methods. This is a good architectural improvement. My main feedback is about code duplication and a potential bug in QuarkOCP_MX_MoEMethod where the roundup logic is applied unconditionally for gpt_oss models, even for non-MXFP4 quantization types.
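The potential bug the review points at can be illustrated with a small sketch: the roundup should be guarded by the quantization type rather than applied for every gpt_oss model. The class and attribute names below are illustrative stand-ins, not vLLM's real `QuarkOCP_MX_MoEMethod` code.

```python
# Hedged sketch of the guard the review asks for: only apply the
# hidden_size roundup when the quant dtype actually requires it (MXFP4),
# instead of unconditionally for gpt_oss models. Names are illustrative.

def round_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple

class QuarkMoEMethodSketch:
    def __init__(self, quant_dtype: str, is_gpt_oss: bool):
        self.quant_dtype = quant_dtype
        self.is_gpt_oss = is_gpt_oss

    def padded_hidden_size(self, hidden_size: int, alignment: int = 256) -> int:
        # Guard on the quant dtype: non-MXFP4 schemes keep the original size.
        if self.quant_dtype == "mxfp4" and self.is_gpt_oss:
            return round_up(hidden_size, alignment)
        return hidden_size
```

Without the `quant_dtype` check, a gpt_oss model quantized with a non-MXFP4 scheme would get its weights padded even though its kernels impose no such alignment requirement.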

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the logic for rounding up the hidden size in FusedMoE layers by moving it from the generic layer.py to the specific quant_method implementations. This is a good architectural improvement, as it places quantization-specific logic where it belongs. The changes in fused_moe_method_base.py and layer.py are correct. However, this refactoring has introduced code duplication in mxfp4.py and quark_moe.py for handling gpt_oss models. I've added comments with suggestions to address this.

@BowenBao
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the hidden_size roundup logic for FusedMoE layers by moving it into the quant_method. This is a good architectural improvement as it localizes quantization-specific logic. The changes are well-structured. I've found one issue where a function is called with incorrect arguments, which I've detailed in a specific comment.

@mergify

mergify bot commented Feb 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @BowenBao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 11, 2026
@Rohan138
Contributor

FYI, #32307 might be relevant; I'm not sure whether the pad size for gpt-oss on MI300 should be 128 or 256. It needs further investigation; unfortunately I haven't had time to run proper perf measurements.

@BowenBao
Contributor Author

/gemini review

@mergify mergify bot removed the needs-rebase label Feb 16, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the hidden_size and intermediate_size rounding logic in the FusedMoE layer by moving it into the quant_method. This is a significant improvement in maintainability as it centralizes quantization-specific alignment requirements (especially for MXFP4 backends) within the quantization methods themselves, rather than having brittle model-type checks in the core layer logic. The changes ensure that both the moe_config and the actual weight tensors are created with consistent, correctly padded dimensions across different hardware platforms and quantization schemes.

@robertgshaw2-redhat
Collaborator

This is a nice simplification. I wonder if we can go even further and make the layer entirely unaware of the hidden size / intermediate size? WDYT?

@BowenBao
Contributor Author

I think it should be doable; see #34285 (comment), unless there are other use cases of layer.hidden_size that I'm unaware of.

@mergify

mergify bot commented Feb 24, 2026

Hi @BowenBao, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@Rohan138 Rohan138 mentioned this pull request Feb 24, 2026
5 tasks
Contributor

@ChuanLi1101 ChuanLi1101 left a comment

Left some comments FYI.

@hongxiayang
Collaborator

Thanks @BowenBao.

BTW, I also filed an aiter issue for the MiniMax M2.1 MXFP4 TP4 case.

Contributor

@ChuanLi1101 ChuanLi1101 left a comment

LGTM, thanks for addressing the comments.

Contributor

@fxmarty-amd fxmarty-amd left a comment

LGTM. OCP MX emulation should be refactored as an Mxfp4Backend in a follow-up PR.

@BowenBao
Contributor Author

BowenBao commented Mar 9, 2026

@tjtanaa could you help land this PR?

@BowenBao
Contributor Author

@robertgshaw2-redhat could you take another look if comments are all addressed and this can be landed?

@tjtanaa tjtanaa added the rocm Related to AMD ROCm label Mar 13, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 13, 2026
@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 13, 2026
@BowenBao BowenBao force-pushed the bowenbao/move_mxfp4_moe_roundup branch from 9634bac to 0d76347 Compare March 13, 2026 18:53
@mergify

mergify bot commented Mar 13, 2026

Hi @BowenBao, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@tjtanaa
Collaborator

tjtanaa commented Mar 17, 2026

@BowenBao can you also provide the lm-eval score for this model?

Please review. The latest part of the PR enables Quark MXFP4 MoE with the aiter backend running with a padded intermediate_size.

Tested with MiniMax M2.1 MXFP4 TP4.

Before:

Output token throughput (tok/s): 2182.16
Peak output token throughput (tok/s): 6973.00
Peak concurrent requests: 1000.00
Total token throughput (tok/s): 19639.48

After:

Output token throughput (tok/s): 3617.48
Peak output token throughput (tok/s): 14404.00
Peak concurrent requests: 1000.00
Total token throughput (tok/s): 32557.31

@mergify

mergify bot commented Mar 18, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @BowenBao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 18, 2026
Signed-off-by: Bowen Bao <bowenbao@amd.com>

address comments

Signed-off-by: Bowen Bao <bowenbao@amd.com>

fix

Signed-off-by: Bowen Bao <bowenbao@amd.com>

further refactor

Signed-off-by: Bowen Bao <bowenbao@amd.com>

address comments

Signed-off-by: Bowen Bao <bowenbao@amd.com>

refine backend check

Signed-off-by: Bowen Bao <bowenbao@amd.com>

small fix

Signed-off-by: Bowen Bao <bowenbao@amd.com>

make hidden and inter size property of fusedmoe, src from moe_config

Signed-off-by: Bowen Bao <bowenbao@amd.com>

typ

Signed-off-by: Bowen Bao <bowenbao@amd.com>

fix quark emulation

Signed-off-by: Bowen Bao <bowenbao@amd.com>

inter_size now a property

Signed-off-by: Bowen Bao <bowenbao@amd.com>
@BowenBao BowenBao force-pushed the bowenbao/move_mxfp4_moe_roundup branch from 4055e9e to ae0274d Compare March 18, 2026 19:59
@mergify mergify bot removed the needs-rebase label Mar 18, 2026

Labels

nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm (Related to AMD ROCm)

Projects

Status: Todo
9 participants