[Refactor] Move MXFP4/MXFP6 logic from fused_experts to Quark #32120
adityakamat24 wants to merge 2 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request is a well-executed refactoring that moves the MXFP4/MXFP6 quantization logic from the generic fused_experts MoE layer into the Quark-specific QuarkOCP_MX_MoEMethod. The changes correctly relocate the emulation-specific dequantization of weights and quantize-dequantize of activations into the apply method of the Quark MoE method. This improves the separation of concerns and makes the generic MoE kernel cleaner, as it no longer needs to be aware of Quark-specific emulation details. The implementation is clean, consistent across all modified files, and correctly preserves the existing functionality.
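The quantize-dequantize (QDQ) emulation the review refers to can be sketched in miniature: snap each value to the nearest representable low-precision level, then keep computing in full precision. The toy `qdq_emulate` helper below (round-to-nearest over the FP4 E2M1 value grid) illustrates the concept only; it is not vLLM's actual implementation:

```python
import torch

def qdq_emulate(x: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize emulation: map each value to its nearest
    representable level and return a full-precision tensor that carries
    the quantization error, so matmuls can run on ordinary hardware."""
    idx = (x.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx]

# Representable FP4 E2M1 magnitudes; the full grid is symmetric in sign.
fp4_pos = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
fp4 = torch.cat([-fp4_pos.flip(0), fp4_pos])

x = torch.tensor([0.26, 2.4, -4.9])
print(qdq_emulate(x, fp4))  # each value snapped onto the E2M1 grid
```

In the real path, the same QDQ idea is applied per 32-element block with a shared power-of-two scale rather than against a fixed global grid.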
```python
    apply_router_weight_on_input=layer.apply_router_weight_on_input,
    expert_map=layer.expert_map,
    quant_config=None,
)
```
Missing intermediate activation quantization in MXFP emulation
High Severity
The MXFP emulation path no longer quantizes intermediate activations between the two matmul operations. Previously, _get_config_quant_dtype() returned "mxfp4"/"mxfp6_*" for MXFP schemes, and moe_kernel_quantize_input() applied QDQ to both input activations and intermediate activations (after the activation function, before the second matmul). Now, _quantize_activations() only handles input activations in apply(), and quant_config=None is passed to fused_experts(), causing moe_kernel_quantize_input() to skip intermediate activation quantization since quant_dtype is None. This changes numerical behavior and breaks emulation accuracy.
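To make the report concrete, here is a toy sketch of the two-matmul expert path showing where the intermediate QDQ step used to sit. The `qdq` helper is a hypothetical stand-in for the MXFP path of `moe_kernel_quantize_input`, not the real function:

```python
import torch

def qdq(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for MXFP QDQ: round to a coarse grid so the
    # numerical effect of skipping the step is visible.
    return (x * 4).round() / 4

def fused_experts_emulated(x, w1, w2, quantize_intermediate: bool):
    """Two-matmul expert path. The reported bug: with quant_config=None,
    the QDQ between the matmuls is skipped, changing numerics."""
    h = qdq(x) @ w1                     # first matmul on QDQ'd inputs
    h = torch.nn.functional.silu(h)     # activation function
    if quantize_intermediate:           # applied when quant_dtype was mxfp*
        h = qdq(h)                      # skipped when quant_dtype is None
    return h @ w2                       # second matmul

torch.manual_seed(0)
x, w1, w2 = torch.randn(2, 4), torch.randn(4, 8), torch.randn(8, 4)
full = fused_experts_emulated(x, w1, w2, True)
skip = fused_experts_emulated(x, w1, w2, False)
print((full - skip).abs().max())  # nonzero: the two paths diverge
```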
Hi @adityakamat24, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
fxmarty-amd
left a comment
LGTM, thank you, it is indeed much cleaner!
Just the changes/reordering in fused_moe/utils.py might be unnecessary?
@fxmarty-amd Do you want me to change that?
Let's see RH/vllm folks comment, but apart from removing the outdated comments, the changes there might be unnecessary I think?
I'll wait for their input, and can adjust the utils changes accordingly if needed.
@tjtanaa, @mgoin, @pavanimajety: can you please look into this? Thank you.
```python
    assert (dim * 3) % 4 == 0
    return (dim * 3) // 4


def _dequantize_weights(
```
I totally agree with the motivation, but I don't think moving only the dequantization part is an elegant way. Dequantization is actually part of inference (kernel emulation). Putting dequantization in quant_method would break that purity: quant methods should only be responsible for quantized-weight loading and quantization (e.g. online quantization). Dequantization should reside in the inference part.

I suggest following the design in this PR: we wrap inference-related code such as dequantization and kernels in a Kernel class, and return the desired Kernel class from quant_method via a factory pattern.
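For illustration, the suggested factory design might look roughly like this. All class and method names here are made up for the sketch and are not the actual API of the referenced PR:

```python
from abc import ABC, abstractmethod
import torch

class MoEKernel(ABC):
    """Inference-side object: owns dequantization and the compute path."""
    @abstractmethod
    def apply(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor: ...

class EmulationKernel(MoEKernel):
    """Kernel for devices without native MXFP support: dequantize, then matmul."""
    def __init__(self, dequantize):
        self.dequantize = dequantize  # e.g. MXFP4/MXFP6 -> float emulation

    def apply(self, x, weights):
        return x @ self.dequantize(weights)  # dequant lives in the kernel

class QuantMethod:
    """Responsible only for weight loading/quantization; picks the kernel
    via a factory method instead of dequantizing inside the quant method."""
    def select_kernel(self, has_native_mxfp: bool) -> MoEKernel:
        if has_native_mxfp:
            raise NotImplementedError("native kernel elided in this sketch")
        return EmulationKernel(dequantize=lambda w: w.float())

kernel = QuantMethod().select_kernel(has_native_mxfp=False)
out = kernel.apply(torch.ones(2, 3), torch.ones(3, 4))
print(out.shape)  # torch.Size([2, 4])
```

The point of the design is that the quant method stays declarative (load, quantize, choose), while everything that touches the forward pass, including dequantization, is encapsulated in the returned kernel.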
I was not aware of this. Interestingly, https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py seems not to make use of this, but https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/fp8.py does.

Alternatively, there simply needs to be a TODO.
Yeah, I think the dequant part should move into the emulation kernel after the kernel refactor.
@fxmarty-amd @hangy-amd should the modular kernel refactor happen in this PR, or would you prefer splitting it: this PR keeps the current refactor, and a follow-up PR converts to the FusedMoEModularKernel pattern (following the fp8.py example you linked)? Happy to rework it now if you want it in this PR; otherwise I can add a TODO and get this merged first.
This pull request has merge conflicts that must be resolved before it can be merged.
@fxmarty-amd @adityakamat24 Hi guys, there's an ongoing PR that uses mk.FusedMoEModularKernel; please also refer to it. I'd like to refactor the kernels within this PR, because the kernel refactor in Quark will be happening in the coming weeks. The commits in this PR would only exist for a very short time, so why don't we follow the ultimate design directly?
BowenBao
left a comment
There was a problem hiding this comment.
cc @robertgshaw2-redhat for review.
@adityakamat24 have you validated the PR with any mxfp4 quark models?
I'm okay with merging as is. Kernel refactoring anyways needs to be done next.
```python
        return _mxfp6_e3m2_quantize(A, A_scale, per_act_token_quant, block_shape)
    elif quant_dtype == "mxfp6_e2m3":
        return _mxfp6_e2m3_quantize(A, A_scale, per_act_token_quant, block_shape)
    elif quant_dtype == "mxfp8":
```
nit: don't change unnecessary lines
```python
def _mxfp8_e4m3_quantize(
def _mxfp6_e3m2_quantize(
```
why are these lines getting changed?
This PR looks good, but changes unnecessary LOC.
Hi, I will make the changes and update the PR shortly.
- Move weight dequantization to QuarkOCP_MX_MoEMethod
- Restore MXFP quantization functions for intermediate activations
- Restore MXFP branches in _get_config_quant_dtype and moe_kernel_quantize_input
- Fix size assertions to handle dequantized weights in emulation mode
- Remove double-quantization bug by letting fused_experts handle all activation quantization

Signed-off-by: adityakamat24 <adityakmat007@gmail.com>
Signed-off-by: adityakamat24 <adityakmat007@gmail.com>
@robertgshaw2-redhat @fxmarty-amd I made the changes some time back; please do review. Thanks!
Fixes #30621

Purpose

Refactors MXFP4/MXFP6 quantization logic from the generic MoE layer into the Quark-specific quantization layer, as mentioned in #30621.

Motivation

The generic `fused_experts` kernel should not contain Quark-specific quantization logic. This PR moves all MXFP-related code into `quark_moe.py`, improving separation of concerns and making both the MoE kernel and quantization layer easier to maintain independently.

Changes

Added to `quark_moe.py`:
- `_dequantize_weights()`: handles MXFP4/MXFP6 weight dequantization for all 5 OCP MX schemes
- `_quantize_activations()`: handles activation quantization using QDQ emulation
- Updated `apply()` to pre-process weights and activations before calling `fused_experts`
- Updated `get_fused_moe_quant_config()` to return `None` for emulation mode

Removed from `fused_moe.py`:
- MXFP4/MXFP6 imports and in-kernel dequantization branches

Removed from `utils.py`:
- `_mxfp4_quantize()`, `_mxfp6_e3m2_quantize()`, `_mxfp6_e2m3_quantize()` functions
- MXFP branches in `moe_kernel_quantize_input()`

To test locally with AMD hardware:

```shell
pytest tests/quantization/test_quark.py -v -k "mxfp" --tb=short
```

Implements @fxmarty-amd's suggestion from #30621 to handle quantization in `apply()` rather than `process_weights_after_loading()`, preserving on-the-fly dequantization behavior for devices without native MXFP instruction support.

This is an internal refactoring. All 5 OCP MX schemes continue to function as before.
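As background for `_dequantize_weights()`, the OCP MX formats pair blocks of 32 low-precision elements with one shared E8M0 (power-of-two) scale. A minimal illustrative mxfp4 dequantizer is sketched below; the unpacked-code layout and function name are assumptions for clarity, not the PR's actual code:

```python
import torch

# FP4 E2M1 magnitude table for the low 3 bits of a code; bit 3 is the sign.
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequantize_mxfp4(codes: torch.Tensor, scales_e8m0: torch.Tensor,
                     block: int = 32) -> torch.Tensor:
    """Dequantize MXFP4: each block of 32 elements shares one E8M0 scale
    (value 2**(exponent - 127)); `codes` holds unpacked uint8 4-bit codes."""
    sign = torch.where((codes & 0x8) != 0, torch.tensor(-1.0), torch.tensor(1.0))
    magnitude = E2M1[(codes & 0x7).long()]
    vals = (sign * magnitude).view(-1, block)
    scale = torch.pow(2.0, scales_e8m0.float() - 127.0).unsqueeze(-1)
    return (vals * scale).view(-1)

codes = torch.full((32,), 0x2, dtype=torch.uint8)  # code 0x2 -> +1.0
scales = torch.tensor([128], dtype=torch.uint8)    # 2**(128 - 127) = 2.0
print(dequantize_mxfp4(codes, scales))             # 32 values, all 2.0
```

In vLLM the real weights are stored with two FP4 codes packed per byte, so the actual helper also has to unpack nibbles before this step.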
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Note
Separates OCP MX logic from the generic MoE kernel and consolidates it in Quark.
- `quark_moe.py`: adds `_dequantize_weights()` and `_quantize_activations()`; updates `QuarkOCP_MX_MoEMethod.apply()` to pre-dequantize weights and QDQ activations for emulation, and returns `None` from `get_fused_moe_quant_config()` in emulation; retains native AITER path handling; keeps support for all MX schemes.
- `fused_moe.py`: removes MXFP4/MXFP6 imports and dequantization branches from `fused_experts_impl`; `_get_config_quant_dtype()` no longer returns MXFP strings; simplifies quant handling to FP8/INT8 only.
- `fused_moe/utils.py`: deletes MXFP4/MXFP6 quantization helpers and their branches from `moe_kernel_quantize_input()`; keeps NVFP4/MXFP8 paths.

Overall: MXFP responsibilities live in Quark, reducing coupling and making `fused_experts` backend-agnostic.

Written by Cursor Bugbot for commit 7637104e05cf5e4a3211ebc60d43cb68906ff738. This will update automatically on new commits.
Note

Separates OCP MX (MXFP4/MXFP6) logic from the generic MoE path and consolidates it in Quark.
- `quark_moe.py`: adds `_dequantize_weights()` for all OCP MX schemes; in emulation mode, `apply()` dequantizes MX weights and calls `fused_experts`; `get_fused_moe_quant_config()` sets weight scales to `None` in emulation; retains the native ROCm AITER path when available
- `fused_moe.py`: removes MXFP imports and in-kernel dequantization; tightens size checks to only treat weights as packed when `ocp_mx_scheme` and scales are provided; keeps activation quantization via `moe_kernel_quantize_input`
- `fused_moe/utils.py`: streamlines MX quant helpers and routing in `moe_kernel_quantize_input` (explicit mxfp4/mxfp6 variants; mxfp8 routed after others)

Written by Cursor Bugbot for commit 53c35cc. This will update automatically on new commits.