[Refactor] Move MXFP4/MXFP6 logic from fused_experts to Quark #32120
adityakamat24 wants to merge 2 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request is a well-executed refactoring that moves the MXFP4/MXFP6 quantization logic from the generic fused_experts MoE layer into the Quark-specific QuarkOCP_MX_MoEMethod. The changes correctly relocate the emulation-specific dequantization of weights and quantize-dequantize of activations into the apply method of the Quark MoE method. This improves the separation of concerns and makes the generic MoE kernel cleaner, as it no longer needs to be aware of Quark-specific emulation details. The implementation is clean, consistent across all modified files, and correctly preserves the existing functionality.
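The quantize-dequantize (QDQ) emulation the review refers to can be sketched in miniature: snap each value to the nearest representable low-precision level, then keep computing in full precision. The toy `qdq_emulate` helper below (round-to-nearest over the FP4 E2M1 value grid) illustrates the concept only; it is not vLLM's actual implementation:

```python
import torch

def qdq_emulate(x: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize emulation: map each value to its nearest
    representable level and return a full-precision tensor that carries
    the quantization error, so matmuls can run on ordinary hardware."""
    idx = (x.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx]

# Representable FP4 E2M1 magnitudes; the full grid is symmetric in sign.
fp4_pos = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
fp4 = torch.cat([-fp4_pos.flip(0), fp4_pos])

x = torch.tensor([0.26, 2.4, -4.9])
print(qdq_emulate(x, fp4))  # each value snapped onto the E2M1 grid
```

In the real path, the same QDQ idea is applied per 32-element block with a shared power-of-two scale rather than against a fixed global grid.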
```python
    apply_router_weight_on_input=layer.apply_router_weight_on_input,
    expert_map=layer.expert_map,
    quant_config=None,
)
```
Missing intermediate activation quantization in MXFP emulation
High Severity
The MXFP emulation path no longer quantizes intermediate activations between the two matmul operations. Previously, _get_config_quant_dtype() returned "mxfp4"/"mxfp6_*" for MXFP schemes, and moe_kernel_quantize_input() applied QDQ to both input activations and intermediate activations (after the activation function, before the second matmul). Now, _quantize_activations() only handles input activations in apply(), and quant_config=None is passed to fused_experts(), causing moe_kernel_quantize_input() to skip intermediate activation quantization since quant_dtype is None. This changes numerical behavior and breaks emulation accuracy.
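To make the report concrete, here is a toy sketch of the two-matmul expert path showing where the intermediate QDQ step used to sit. The `qdq` helper is a hypothetical stand-in for the MXFP path of `moe_kernel_quantize_input`, not the real function:

```python
import torch

def qdq(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for MXFP QDQ: round to a coarse grid so the
    # numerical effect of skipping the step is visible.
    return (x * 4).round() / 4

def fused_experts_emulated(x, w1, w2, quantize_intermediate: bool):
    """Two-matmul expert path. The reported bug: with quant_config=None,
    the QDQ between the matmuls is skipped, changing numerics."""
    h = qdq(x) @ w1                     # first matmul on QDQ'd inputs
    h = torch.nn.functional.silu(h)     # activation function
    if quantize_intermediate:           # applied when quant_dtype was mxfp*
        h = qdq(h)                      # skipped when quant_dtype is None
    return h @ w2                       # second matmul

torch.manual_seed(0)
x, w1, w2 = torch.randn(2, 4), torch.randn(4, 8), torch.randn(8, 4)
full = fused_experts_emulated(x, w1, w2, True)
skip = fused_experts_emulated(x, w1, w2, False)
print((full - skip).abs().max())  # nonzero: the two paths diverge
```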
Hi @adityakamat24, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
fxmarty-amd
left a comment
LGTM, thank you, it is indeed much cleaner!
Just the changes/reordering in fused_moe/utils.py might be unnecessary?
@fxmarty-amd Do you want me to change that?
Let's see RH/vllm folks comment, but apart from removing the outdated comments, the changes there might be unnecessary I think?
I'll wait for their input, and can adjust the utils changes accordingly if needed.
@tjtanaa, @mgoin, @pavanimajety: can you please look into this? Thank you.
```python
    assert (dim * 3) % 4 == 0
    return (dim * 3) // 4


def _dequantize_weights(
```
I totally agree with the motivation, but I don't think moving only the dequantization part is an elegant way. Dequantization is actually part of inference (kernel emulation). Putting dequantization in quant_method would break that purity: quant methods should only be responsible for quantized-weight loading and quantization (e.g. online quantization). Dequantization should reside in the inference part.

I suggest following the design in this PR: we wrap inference-related code such as dequantization and kernels in a Kernel class, and return the desired Kernel class from quant_method via a factory pattern.
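For illustration, the suggested factory design might look roughly like this. All class and method names here are made up for the sketch and are not the actual API of the referenced PR:

```python
from abc import ABC, abstractmethod
import torch

class MoEKernel(ABC):
    """Inference-side object: owns dequantization and the compute path."""
    @abstractmethod
    def apply(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor: ...

class EmulationKernel(MoEKernel):
    """Kernel for devices without native MXFP support: dequantize, then matmul."""
    def __init__(self, dequantize):
        self.dequantize = dequantize  # e.g. MXFP4/MXFP6 -> float emulation

    def apply(self, x, weights):
        return x @ self.dequantize(weights)  # dequant lives in the kernel

class QuantMethod:
    """Responsible only for weight loading/quantization; picks the kernel
    via a factory method instead of dequantizing inside the quant method."""
    def select_kernel(self, has_native_mxfp: bool) -> MoEKernel:
        if has_native_mxfp:
            raise NotImplementedError("native kernel elided in this sketch")
        return EmulationKernel(dequantize=lambda w: w.float())

kernel = QuantMethod().select_kernel(has_native_mxfp=False)
out = kernel.apply(torch.ones(2, 3), torch.ones(3, 4))
print(out.shape)  # torch.Size([2, 4])
```

The point of the design is that the quant method stays declarative (load, quantize, choose), while everything that touches the forward pass, including dequantization, is encapsulated in the returned kernel.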
I was not aware of this. Interestingly, https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py seems not to make use of this, but https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/fp8.py does.

Alternatively, there simply needs to be a TODO.
Yeah, I think the dequant part should move into the emulation kernel after the kernel refactor.
@fxmarty-amd @hangy-amd should the modular kernel refactor happen in this PR, or would you prefer splitting it: this PR keeps the current refactor, and a follow-up PR converts to the FusedMoEModularKernel pattern (following the fp8.py example you linked)? Happy to rework it now if you want it in this PR; otherwise I can add a TODO and get this merged first.
This pull request has merge conflicts that must be resolved before it can be merged.
@fxmarty-amd @adityakamat24 Hi guys, there's an ongoing PR that uses mk.FusedMoEModularKernel; please also refer to it. I'd like to refactor the kernels within this PR, because the kernel refactor in Quark will be happening in the coming weeks. The commits in this PR would only exist for a very short time, so why don't we follow the ultimate design directly?
BowenBao
left a comment
There was a problem hiding this comment.
cc @robertgshaw2-redhat for review.
@adityakamat24 have you validated the PR with any mxfp4 quark models?
I'm okay with merging as is. Kernel refactoring anyways needs to be done next.
```python
        return _mxfp6_e3m2_quantize(A, A_scale, per_act_token_quant, block_shape)
    elif quant_dtype == "mxfp6_e2m3":
        return _mxfp6_e2m3_quantize(A, A_scale, per_act_token_quant, block_shape)
    elif quant_dtype == "mxfp8":
```
nit: don't change unnecessary lines
```python
def _mxfp8_e4m3_quantize(
def _mxfp6_e3m2_quantize(
```
why are these lines getting changed?
This PR looks good, but changes unnecessary LOC.
Hi, I will make the changes and update the PR shortly.
- Move weight dequantization to QuarkOCP_MX_MoEMethod
- Restore MXFP quantization functions for intermediate activations
- Restore MXFP branches in _get_config_quant_dtype and moe_kernel_quantize_input
- Fix size assertions to handle dequantized weights in emulation mode
- Remove double-quantization bug by letting fused_experts handle all activation quantization

Signed-off-by: adityakamat24 <adityakmat007@gmail.com>
Signed-off-by: adityakamat24 <adityakmat007@gmail.com>
@robertgshaw2-redhat @fxmarty-amd I made the changes some time back; please do review. Thanks!
Fixes #30621

Purpose

Refactors MXFP4/MXFP6 quantization logic from the generic MoE layer into the Quark-specific quantization layer, as mentioned in #30621.

Motivation

The generic `fused_experts` kernel should not contain Quark-specific quantization logic. This PR moves all MXFP-related code into `quark_moe.py`, improving separation of concerns and making both the MoE kernel and quantization layer easier to maintain independently.

Changes

Added to `quark_moe.py`:
- `_dequantize_weights()`: handles MXFP4/MXFP6 weight dequantization for all 5 OCP MX schemes
- `_quantize_activations()`: handles activation quantization using QDQ emulation
- Updated `apply()` to pre-process weights and activations before calling `fused_experts`
- Updated `get_fused_moe_quant_config()` to return `None` for emulation mode

Removed from `fused_moe.py`:
- MXFP4/MXFP6 imports and in-kernel dequantization branches

Removed from `utils.py`:
- `_mxfp4_quantize()`, `_mxfp6_e3m2_quantize()`, `_mxfp6_e2m3_quantize()` functions
- MXFP branches in `moe_kernel_quantize_input()`

To test locally with AMD hardware:

```shell
pytest tests/quantization/test_quark.py -v -k "mxfp" --tb=short
```

Implements @fxmarty-amd's suggestion from #30621 to handle quantization in `apply()` rather than `process_weights_after_loading()`, preserving on-the-fly dequantization behavior for devices without native MXFP instruction support.

This is an internal refactoring. All 5 OCP MX schemes continue to function as before.
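As background for `_dequantize_weights()`, the OCP MX formats pair blocks of 32 low-precision elements with one shared E8M0 (power-of-two) scale. A minimal illustrative mxfp4 dequantizer is sketched below; the unpacked-code layout and function name are assumptions for clarity, not the PR's actual code:

```python
import torch

# FP4 E2M1 magnitude table for the low 3 bits of a code; bit 3 is the sign.
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequantize_mxfp4(codes: torch.Tensor, scales_e8m0: torch.Tensor,
                     block: int = 32) -> torch.Tensor:
    """Dequantize MXFP4: each block of 32 elements shares one E8M0 scale
    (value 2**(exponent - 127)); `codes` holds unpacked uint8 4-bit codes."""
    sign = torch.where((codes & 0x8) != 0, torch.tensor(-1.0), torch.tensor(1.0))
    magnitude = E2M1[(codes & 0x7).long()]
    vals = (sign * magnitude).view(-1, block)
    scale = torch.pow(2.0, scales_e8m0.float() - 127.0).unsqueeze(-1)
    return (vals * scale).view(-1)

codes = torch.full((32,), 0x2, dtype=torch.uint8)  # code 0x2 -> +1.0
scales = torch.tensor([128], dtype=torch.uint8)    # 2**(128 - 127) = 2.0
print(dequantize_mxfp4(codes, scales))             # 32 values, all 2.0
```

In vLLM the real weights are stored with two FP4 codes packed per byte, so the actual helper also has to unpack nibbles before this step.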
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Note
Separates OCP MX logic from the generic MoE kernel and consolidates it in Quark.
- `quark_moe.py`: adds `_dequantize_weights()` and `_quantize_activations()`; updates `QuarkOCP_MX_MoEMethod.apply()` to pre-dequantize weights and QDQ activations for emulation, and returns `None` from `get_fused_moe_quant_config()` in emulation; retains native AITER path handling; keeps support for all MX schemes.
- `fused_moe.py`: removes MXFP4/MXFP6 imports and dequantization branches from `fused_experts_impl`; `_get_config_quant_dtype()` no longer returns MXFP strings; simplifies quant handling to FP8/INT8 only.
- `fused_moe/utils.py`: deletes MXFP4/MXFP6 quantization helpers and their branches from `moe_kernel_quantize_input()`; keeps NVFP4/MXFP8 paths.

Overall: MXFP responsibilities live in Quark, reducing coupling and making `fused_experts` backend-agnostic.

Written by Cursor Bugbot for commit 7637104e05cf5e4a3211ebc60d43cb68906ff738. This will update automatically on new commits.
Note

Separates OCP MX (MXFP4/MXFP6) logic from the generic MoE path and consolidates it in Quark.
- `quark_moe.py`: adds `_dequantize_weights()` for all OCP MX schemes; in emulation mode, `apply()` dequantizes MX weights and calls `fused_experts`; `get_fused_moe_quant_config()` sets weight scales to `None` in emulation; retains the native ROCm AITER path when available
- `fused_moe.py`: removes MXFP imports and in-kernel dequantization; tightens size checks to only treat weights as packed when `ocp_mx_scheme` and scales are provided; keeps activation quantization via `moe_kernel_quantize_input`
- `fused_moe/utils.py`: streamlines MX quant helpers and routing in `moe_kernel_quantize_input` (explicit mxfp4/mxfp6 variants; mxfp8 routed after others)

Written by Cursor Bugbot for commit 53c35cc. This will update automatically on new commits.