[AMD] Qwen3.5 MXFP4 breaks after shared expert fusion is enabled#22948
Conversation
Code Review
This pull request updates the can_fuse_shared_expert function in qwen2_moe.py to prevent fusion of shared experts when they are explicitly excluded from quantization. This ensures that unquantized (BF16) shared experts are not incorrectly fused into quantized MoE weight tensors, which would require unsupported online quantization. Feedback was provided to correct the type hint of the quant_config parameter from None to Optional[QuantizationConfig], to accurately reflect its usage and stay consistent with the rest of the codebase.
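A minimal sketch of the suggested annotation change (the function bodies here are placeholders, and the stub class stands in for the real QuantizationConfig, whose import path is repo-specific):

```python
from typing import Optional


class QuantizationConfig:
    """Stand-in for the real class; the actual import path is repo-specific."""


# Before: the hint claims the argument is always None.
def can_fuse_shared_expert_old(quant_config: None) -> bool:
    return True  # body elided


# After: the hint reflects that a QuantizationConfig may actually be passed.
def can_fuse_shared_expert(quant_config: Optional[QuantizationConfig]) -> bool:
    return True  # body elided
```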
@mqhc2020 Thanks for the fix.
@hubertlu-tw Yeah, I think we can add it to the AMD nightly test. If we get more capacity, we can move it to pr-test if needed.
```python
exclude_layers = getattr(quant_config, "exclude_layers", [])
if any(
    "shared_expert" in layer
    and "shared_expert_gate" not in layer
    and not layer.startswith("mtp.")
    for layer in exclude_layers
):
    return False
```
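As a quick illustration of the guard's behavior, here is a standalone sketch (the config object and layer name are hypothetical, standing in for a real Quark config):

```python
from types import SimpleNamespace

# Hypothetical quant config whose exclude list names the shared expert;
# the layer name is illustrative, not taken from a real checkpoint.
quant_config = SimpleNamespace(
    exclude_layers=["model.layers.0.mlp.shared_expert.up_proj"]
)

exclude_layers = getattr(quant_config, "exclude_layers", [])
can_fuse = not any(
    "shared_expert" in layer
    and "shared_expert_gate" not in layer
    and not layer.startswith("mtp.")
    for layer in exclude_layers
)
print(can_fuse)  # False: fusion is skipped for this config
```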
Can we add a method can_fuse_shared_expert to QuantConfig and implement this check inside QuarkConfig? That way we keep Quark-specific logic within Quark.
Also, a more precise check would probably be whether shared_expert shares the same quantization spec as the MoE layers.
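A rough sketch of that suggestion (the class shapes and base-class default are assumptions; only the method name and the check itself come from this thread):

```python
class QuantizationConfig:
    # Assumed base-class default: no quant-method-specific restriction.
    def can_fuse_shared_expert(self) -> bool:
        return True


class QuarkConfig(QuantizationConfig):
    def __init__(self, exclude_layers: list[str]):
        self.exclude_layers = exclude_layers

    # Quark-specific rule: refuse fusion when the shared expert is excluded
    # from quantization, keeping qwen2_moe.py quant-method-agnostic.
    def can_fuse_shared_expert(self) -> bool:
        return not any(
            "shared_expert" in layer
            and "shared_expert_gate" not in layer
            and not layer.startswith("mtp.")
            for layer in self.exclude_layers
        )
```

This would let qwen2_moe.py simply call quant_config.can_fuse_shared_expert() instead of inspecting exclude_layers directly.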
Motivation
After shared expert fusion was enabled for Qwen3.5 models (as in #20736), MXFP4 models hit an issue: the shared expert in the checkpoint is stored in BF16, but the current weight loading can only treat it as MXFP4, the dtype of the routed experts. Until either online quantization is ready or the shared expert in MXFP4 checkpoints has been pre-quantized to MXFP4, we have to skip the shared expert fusion feature for MXFP4 models.
Modifications
In qwen2_moe.py, disable fusion when excluded shared experts are detected, via the exclude_layers check shown in the review diff above.
Note that shared_expert_gate layers and layers under the mtp. prefix are irrelevant to this check and are skipped.
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci