
[Refactor] Move MXFP4/MXFP6 logic from fused_experts to Quark#32120

Open
adityakamat24 wants to merge 2 commits into vllm-project:main from adityakamat24:main

Conversation


@adityakamat24 adityakamat24 commented Jan 11, 2026

Fixes #30621

Purpose

Refactors MXFP4/MXFP6 quantization logic from the generic MoE layer into the Quark-specific quantization layer as mentioned in #30621

Motivation

The generic fused_experts kernel should not contain Quark-specific quantization logic. This PR moves all MXFP-related code into quark_moe.py, improving separation of concerns and making both the MoE kernel and quantization layer easier to maintain independently.

Changes

Added to quark_moe.py:

  • _dequantize_weights() - handles MXFP4/MXFP6 weight dequantization for all 5 OCP MX schemes
  • _quantize_activations() - handles activation quantization using QDQ emulation
  • Updated apply() to pre-process weights and activations before calling fused_experts
  • Modified get_fused_moe_quant_config() to return None for emulation mode

Removed from fused_moe.py:

  • ~34 lines of MXFP4/MXFP6 weight dequantization code
  • MXFP4/MXFP6 imports and corresponding type checks

Removed from utils.py:

  • _mxfp4_quantize(), _mxfp6_e3m2_quantize(), _mxfp6_e2m3_quantize() functions
  • MXFP branches from moe_kernel_quantize_input()
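
The QDQ (quantize-dequantize) emulation mentioned above can be illustrated with a small toy in plain Python. This is only a sketch of the block-scaled quantize-dequantize idea behind the OCP MX formats, not the actual Quark kernels; `qdq_emulate` and its parameters are illustrative names.

```python
import math

def qdq_emulate(block: list[float], mantissa_bits: int = 2) -> list[float]:
    """Quantize-dequantize a block of values against a shared power-of-two
    scale, emulating a block-scaled format (e.g. MXFP4) in higher precision.
    Toy sketch only -- not the real Quark implementation."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # One shared power-of-two scale for the whole block (OCP MX uses a
    # single E8M0 scale per 32 elements).
    scale = 2.0 ** math.floor(math.log2(amax))
    steps = 2 ** mantissa_bits
    out = []
    for v in block:
        # Snap to a coarse grid, then immediately dequantize (QDQ).
        q = round(v / scale * steps) / steps
        out.append(q * scale)
    return out
```

Values representable on the coarse grid round-trip exactly; others pick up a small quantization error, which is what the emulation path reproduces numerically.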

To test locally with AMD hardware:

pytest tests/quantization/test_quark.py -v -k "mxfp" --tb=short

Implements @fxmarty-amd's suggestion from #30621 to handle quantization in apply() rather than process_weights_after_loading(), preserving on-the-fly dequantization behavior for devices without native MXFP instruction support.

This is an internal refactoring. All 5 OCP MX schemes continue to function as before.









Note

Separates OCP MX (MXFP4/MXFP6) logic from the generic MoE path and consolidates it in Quark.

  • Quark (quark_moe.py): Adds _dequantize_weights() for all OCP MX schemes; in emulation mode, apply() dequantizes MX weights and calls fused_experts; get_fused_moe_quant_config() sets weight scales to None in emulation; retains native ROCm AITER path when available
  • Generic MoE (fused_moe.py): Removes MXFP imports and in-kernel dequantization; tightens size checks to only treat weights as packed when ocp_mx_scheme and scales are provided; keeps activation quantization via moe_kernel_quantize_input
  • Utils (fused_moe/utils.py): Streamlines MX quant helpers and routing in moe_kernel_quantize_input (explicit mxfp4/mxfp6 variants; mxfp8 routed after others)

Written by Cursor Bugbot for commit 53c35cc.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify

mergify bot commented Jan 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @adityakamat24.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a well-executed refactoring that moves the MXFP4/MXFP6 quantization logic from the generic fused_experts MoE layer into the Quark-specific QuarkOCP_MX_MoEMethod. The changes correctly relocate the emulation-specific dequantization of weights and quantize-dequantize of activations into the apply method of the Quark MoE method. This improves the separation of concerns and makes the generic MoE kernel cleaner, as it no longer needs to be aware of Quark-specific emulation details. The implementation is clean, consistent across all modified files, and correctly preserves the existing functionality.

apply_router_weight_on_input=layer.apply_router_weight_on_input,
expert_map=layer.expert_map,
quant_config=None,
)

Missing intermediate activation quantization in MXFP emulation

High Severity

The MXFP emulation path no longer quantizes intermediate activations between the two matmul operations. Previously, _get_config_quant_dtype() returned "mxfp4"/"mxfp6_*" for MXFP schemes, and moe_kernel_quantize_input() applied QDQ to both input activations and intermediate activations (after the activation function, before the second matmul). Now, _quantize_activations() only handles input activations in apply(), and quant_config=None is passed to fused_experts(), causing moe_kernel_quantize_input() to skip intermediate activation quantization since quant_dtype is None. This changes numerical behavior and breaks emulation accuracy.
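The two-matmul flow described above can be illustrated with a scalar toy that marks both QDQ points. The names here are hypothetical; the real path runs through fused_experts and moe_kernel_quantize_input.

```python
def moe_emulated(x, w1, w2, qdq):
    """Scalar toy of the fused-experts data flow, marking both points
    where activation quantize-dequantize (QDQ) must run in emulation
    mode. Illustrative only -- not vLLM code."""
    x = qdq(x)            # input-activation QDQ (moved into apply())
    h = x * w1            # stand-in for the first matmul (w13)
    h = max(h, 0.0)       # activation function
    h = qdq(h)            # intermediate QDQ -- skipped when quant_config=None
    return h * w2         # stand-in for the second matmul (w2)
```

Dropping the intermediate `qdq(h)` call leaves the second matmul operating on unquantized activations, which is exactly the numerical divergence the review points out.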


@mergify

mergify bot commented Jan 11, 2026

Hi @adityakamat24, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint


@fxmarty-amd fxmarty-amd left a comment


LGTM, thank you, it is indeed much cleaner!

Just the changes/reordering in fused_moe/utils.py might be unnecessary?

@adityakamat24

LGTM, thank you, it is indeed much cleaner!

Just the changes/reordering in fused_moe/utils.py might be unnecessary?

@fxmarty-amd Do you want me to change that?

@fxmarty-amd

Let's see RH/vllm folks comment, but apart from removing the outdated comments, the changes there might be unnecessary I think?

@adityakamat24

I’ll wait for their input, and can adjust the utils changes accordingly if needed.

@adityakamat24

@tjtanaa, @mgoin, @pavanimajety, can you please look into this?

Thank you.

assert (dim * 3) % 4 == 0
return (dim * 3) // 4

def _dequantize_weights(

@hangy-amd hangy-amd Jan 16, 2026


I totally agree with the motivation, but I don't think only moving the dequantization part is an elegant way. Dequantization is actually part of inference (kernel emulation). Putting dequantization in quant_method would break this purity. Quant methods should only be responsible for quantized-weight loading and quantization (e.g. online quantization). Dequantization should reside in the inference part.

I suggest following the design in this PR. We wrap inference-related code such as dequantization and kernels in a Kernel class and return the desired Kernel class from quant_method via a factory pattern.

cc @robertgshaw2-redhat
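A minimal sketch of that factory shape (the class and method names here are hypothetical, not the actual design from the linked PR):

```python
class EmulationKernel:
    """Inference-side kernel: owns dequantization and the compute path."""
    def __init__(self, dequant):
        self.dequant = dequant

    def apply(self, x, packed_w):
        w = self.dequant(packed_w)  # dequant stays inside the kernel
        return x * w                # stand-in for the fused-experts matmul


class QuarkOCPMXMethod:
    """quant_method: responsible only for weight loading/quantization;
    inference is delegated to a kernel chosen by a factory."""
    def select_kernel(self, has_native_mx: bool):
        if has_native_mx:
            raise NotImplementedError("native AITER path not sketched here")
        # Toy dequant: pretend packed weights are stored at 2x scale.
        return EmulationKernel(dequant=lambda w: w / 2.0)
```

With this split, the quant method never touches dequantized tensors; swapping the emulation path for a native one is a one-line change in the factory.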


@fxmarty-amd fxmarty-amd Jan 16, 2026


There is

class FusedMoEModularKernel(torch.nn.Module):

that I was not aware of.

Interestingly, https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py seems not to make use of this, but https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/fp8.py does.

Alternatively, there simply needs to be a TODO like

# TODO(rob): convert this to MK.

as there are many places that do not use this abstraction, it seems.


Yeah, I think the dequant part should move into the emulation kernel after the kernel refactor.

@adityakamat24

@fxmarty-amd @hangy-amd should the modular kernel refactor happen in this PR, or would you prefer splitting it into a follow-up PR that converts this to the FusedMoEModularKernel pattern (following the fp8.py example you linked)?

Happy to rework it now if you want it in this PR, otherwise I can add a TODO and get this merged first.

@mergify

mergify bot commented Jan 19, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @adityakamat24.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 19, 2026
@hangy-amd

@fxmarty-amd @adityakamat24 Hi guys, there's an ongoing PR that uses mk.FusedMoEModularKernel. Please also refer to this.

I'd like to refactor the kernels within this PR, because the kernel refactor in Quark will be happening in the coming weeks. The commits in this PR would only exist for a very short time, so why don't we follow the ultimate design directly?


@BowenBao BowenBao left a comment


cc @robertgshaw2-redhat for review.

@adityakamat24 have you validated the PR with any mxfp4 quark models?

I'm okay with merging as is. Kernel refactoring anyways needs to be done next.

return _mxfp6_e3m2_quantize(A, A_scale, per_act_token_quant, block_shape)
elif quant_dtype == "mxfp6_e2m3":
return _mxfp6_e2m3_quantize(A, A_scale, per_act_token_quant, block_shape)
elif quant_dtype == "mxfp8":


nit: don't change unnecessary lines



def _mxfp8_e4m3_quantize(
def _mxfp6_e3m2_quantize(


why are these lines getting changed?

@robertgshaw2-redhat

This PR looks good, but it changes unnecessary LOC.

@adityakamat24

Hi, I will make the changes and update the PR shortly.

adityakamat24 added 2 commits March 1, 2026 21:07
  - Move weight dequantization to QuarkOCP_MX_MoEMethod
  - Restore MXFP quantization functions for intermediate activations
  - Restore MXFP branches in _get_config_quant_dtype and moe_kernel_quantize_input
  - Fix size assertions to handle dequantized weights in emulation mode
  - Remove double-quantization bug by letting fused_experts handle all activation quantization

Signed-off-by: adityakamat24 <adityakmat007@gmail.com>
Signed-off-by: adityakamat24 <adityakmat007@gmail.com>
@adityakamat24

adityakamat24 commented Mar 10, 2026

@robertgshaw2-redhat @fxmarty-amd

I made the changes some time back, please do review it. Thanks!



Development

Successfully merging this pull request may close these issues.

[Feature]: Remove MXFP4 Logic From fused_experts
