[NVFP4] Support NVFP4 MOE models on AMD Instinct, Nvidia Ampere, Hopper through NVFP4 MOE emulation #35737
Conversation
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
…vfp4-simulation-support-moe
Code Review
This pull request introduces support for NVFP4 MOE models on a wider range of hardware, including AMD Instinct, Nvidia Ampere, and Hopper, through an emulation backend. The changes are extensive, touching quantization layers, model execution, and tests to accommodate this new emulation path. The implementation appears solid and well-integrated. I've found one critical issue that needs to be addressed.
```python
if torch.unique(a13_scale).numel() != 1 or torch.unique(a2_scale).numel() != 1:
    logger.warning_once(
        "In NVFP4 linear, the activation global scale for inputs are different"
        " for MOE w13 (gate_up_proj) layer or MOE w2 (down_proj). Using"
        " a13_scale = a13_scale.max() and a2_scale = a2_scale.max()."
    )
```
I believe we do have some kernels that support different global scales per expert, for instance see #21408
@mgoin flashinfer default backends use a single shared global scale across all experts for both gate_up_proj and down_proj, see:
vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py, lines 240 to 246 at d9408ff
This logic is here to use similarly a single global scale for gate_up_proj input and down_proj input in the emulation code path using TritonExperts.
We display a warning because there is no logic in vLLM at the moment to recompute the fp8_e4m3 scales when taking this .max(). Fortunately, Model-Optimizer and compressed-tensors produce models that share the same global_scale across gate_proj/up_proj and across experts, so this is not an issue in practice. But in case the serialized global scales differ, simply taking the .max() as done currently is not enough.
This may be fixed in another PR.
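To make the overflow concern concrete: the serialized value is an inverse scale (roughly 448 * 6 / amax for NVFP4), so taking the max of the inverses picks the smallest effective divisor across experts. A pure-Python sketch; `merge_global_scales` and the amax values are illustrative, not vLLM code:

```python
# Illustrative numbers only -- merge_global_scales and the amax values
# are hypothetical, not vLLM code.

FP8_E4M3_MAX = 448.0  # max magnitude representable in float8_e4m3
FP4_E2M1_MAX = 6.0    # max magnitude representable in fp4 E2M1

def merge_global_scales(inverse_scales):
    # The emulation path keeps the max of the serialized inverse scales,
    # i.e. the smallest effective divisor across experts.
    return 1.0 / max(inverse_scales)

# Two experts serialized with inverse_scale = 448 * 6 / amax:
per_expert_amax = [2688.0, 5376.0]
inverse_scales = [FP8_E4M3_MAX * FP4_E2M1_MAX / a for a in per_expert_amax]

merged = merge_global_scales(inverse_scales)  # 1.0, from the amax=2688 expert

# A block of the amax=5376 expert now needs a block scale of
# amax_block / 6 / merged = 896, beyond the fp8_e4m3 max of 448:
block_scale = per_expert_amax[1] / FP4_E2M1_MAX / merged
overflows = block_scale > FP8_E4M3_MAX  # True -> the block scale gets clamped
```

With identical serialized global scales (the Model-Optimizer/compressed-tensors case) the merge is a no-op and no clamping occurs.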
```python
    block_shape: list[int] | None = None,
    is_fp4_scale_swizzled: bool = True,
    ocp_mx_scheme: str | None = None,
    emulation: bool = False,
```
It is bad practice to add emulation as an argument to this function and only use it for a single quant_dtype case. Why don't you just call ref_nvfp4_quant_dequant(A, A_scale, block_size=16) inline in apply?
@mgoin Which apply are you talking about? Nvfp4QuantizationEmulationTritonExperts inherits TritonExperts.apply, and I do NOT want to modify TritonExperts.apply itself; QDQ needs to be applied to BOTH a13 and a2.
For example, moe_kernel_quantize_input already handles MXFP4/MXFP6_E3M2/MXFP6_E4M3 fake QDQ through _mxfp4_quantize, _mxfp6_e3m2_quantize, _mxfp6_e2m3_quantize.
I agree this should be clarified. Do you propose keeping moe_kernel_quantize_input for REAL quantization cases, and having another function handle all QDQ cases?
and have in TritonExperts.apply:

```python
if not emulation:
    qintermediate_cache2, a2q_scale = moe_kernel_quantize_input(
        intermediate_cache2,
        a2_scale,
        self.quant_dtype,
        self.per_act_token_quant,
        self.block_shape,
    )
else:
    qintermediate_cache2, a2q_scale = moe_kernel_input_fake_quantization(
        intermediate_cache2,
        a2_scale,
        self.quant_dtype,
        self.per_act_token_quant,
        self.block_shape,
    )
```

Let me know!
I think Michael may be suggesting that the other argument combinations (fp8 + emulation) are not handled and instead silently fall back to real quantization.
Got it, let me address it properly.
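To illustrate the split discussed above, here is a hypothetical sketch of a dedicated emulation entry point that dispatches per quant dtype and refuses to silently fall back to real quantization for unhandled combinations; `moe_kernel_input_fake_quantization`, `_nvfp4_fake_qdq`, and the dispatch table are illustrative names, not existing vLLM APIs:

```python
# Hypothetical sketch -- names are illustrative, not existing vLLM APIs.

def _nvfp4_fake_qdq(x, scale, block_shape):
    # Placeholder for a reference like ref_nvfp4_quant_dequant(x, scale,
    # block_size=16): returns the fake quantize-dequantized tensor and
    # no packed scale (the data stays in the compute dtype).
    return x, None

_FAKE_QDQ_DISPATCH = {
    "nvfp4": _nvfp4_fake_qdq,
    # "mxfp4": ..., "mxfp6_e3m2": ..., etc.
}

def moe_kernel_input_fake_quantization(x, scale, quant_dtype, block_shape=None):
    """Emulation-only counterpart to moe_kernel_quantize_input: dispatch
    to a fake quantize-dequantize reference instead of a real kernel."""
    try:
        handler = _FAKE_QDQ_DISPATCH[quant_dtype]
    except KeyError:
        # Fail loudly instead of silently running real quantization.
        raise NotImplementedError(f"No fake-QDQ emulation for {quant_dtype!r}")
    return handler(x, scale, block_shape)
```

This keeps the real-quantization contract of moe_kernel_quantize_input unchanged, while unsupported dtype/emulation combinations raise instead of being ignored.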
```python
"""
Quantization Emulation Experts for MoE.

This module provides emulation support for MOE quantization schemes that
don't have native hardware support. It dequantizes weights on the fly
and falls back to calling fused_experts with activation quantization.

Similar to QuarkOCP_MX_MoEMethod's emulation path but abstracted into
a reusable NvFp4MoeBackend.
"""
```
Is this meant to be a general emulation moe or specific to nvfp4? I'm confused about the name vs the description
This is meant for NVFP4 only, if that is okay. Let me update the name/description accordingly.
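As a rough illustration of the flow the docstring describes (dequantize weights on the fly, fake-quantize activations, then run the unquantized fused path), here is a minimal, hypothetical sketch; the class and the injected callables are illustrative, not the PR's actual implementation:

```python
# Hypothetical sketch -- class and callable names are illustrative,
# not the PR's actual implementation.

class NvFp4MoEEmulationExperts:
    """Emulate an NVFP4 MoE on hardware without native fp4 support."""

    def __init__(self, dequantize_weight, fake_qdq_input, fused_experts):
        self._dequantize_weight = dequantize_weight  # fp4 -> compute dtype
        self._fake_qdq_input = fake_qdq_input        # activation QDQ
        self._fused_experts = fused_experts          # unquantized fused path

    def apply(self, hidden_states, w13, w2, w13_scale, w2_scale):
        # Dequantize packed weights on the fly: no native fp4 matmul here.
        w13_hp = self._dequantize_weight(w13, w13_scale)
        w2_hp = self._dequantize_weight(w2, w2_scale)
        # Emulate the fp4 activation error with quantize->dequantize.
        x = self._fake_qdq_input(hidden_states)
        return self._fused_experts(x, w13_hp, w2_hp)
```

The numerics then match fp4 (weights and activations carry fp4 rounding error) while all matmuls run in the high-precision compute dtype.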
```python
# moe_kernel_quantize_input -> ref_nvfp4_quant_dequant use the inverse scale.
# Similar to model_executor/layers/quantization/utils/flashinfer_fp4_moe.py.
# NOTE: at this point `a13_scale` and `a2_scale` are the inverses such that:
# `x_fp8_range = x * 1 / global_scale`, and `global_scale` is small.
# We take the max following e.g. flashinfer_fp4_moe.py, which results in likely
# overflow of the fp8 range, and scale clamping!
# It may be better to use min here.
a13_scale = a13_scale.max().to(torch.float32)
a2_scale = a2_scale.max().to(torch.float32)

a13_scale = 1.0 / a13_scale
a2_scale = 1.0 / a2_scale
```
I think this comment needs to be reworked. Also, you can just do `a13_scale = 1.0 / a13_scale.max().to(torch.float32)`, etc.
I updated the comment
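For intuition, here is a pure-Python sketch of the block-wise fake quantize-dequantize that a reference like ref_nvfp4_quant_dequant performs: scale each block of 16 values so its amax maps onto the E2M1 maximum (6.0), round onto the representable fp4 values, and immediately dequantize. This omits the fp8_e4m3 block-scale storage and the global scale that the real code handles, so it is a simplification, not the PR's implementation:

```python
# Simplified sketch of NVFP4 fake quantize-dequantize (no torch, no
# fp8_e4m3 block-scale storage, no global scale).

# Non-negative values representable in fp4 E2M1 (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def _round_to_e2m1(v):
    # Round |v| to the nearest representable magnitude, then restore sign.
    magnitude = min(E2M1_VALUES, key=lambda r: abs(r - abs(v)))
    return magnitude if v >= 0 else -magnitude

def nvfp4_fake_qdq(x, block_size=16):
    out = []
    for i in range(0, len(x), block_size):
        block = x[i : i + block_size]
        amax = max(abs(v) for v in block)
        # Map the block's amax onto the E2M1 maximum of 6.0.
        scale = amax / 6.0 if amax > 0 else 1.0
        out.extend(_round_to_e2m1(v / scale) * scale for v in block)
    return out
```

Values exactly representable after scaling round-trip losslessly; everything else picks up fp4 rounding error, which is precisely what the emulation wants to reproduce.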
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @fxmarty-amd, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
…emes/compressed_tensors_w4a4_nvfp4.py Co-authored-by: Kyle Sayers <kylesayrs@gmail.com> Signed-off-by: fxmarty-amd <felmarty@amd.com>
…ub.com/fxmarty-amd/vllm into upstream-nvfp4-simulation-support-rocm
kylesayrs left a comment
Reposting what I commented on the other PR: #35859 (review)
I think that, as it stands, passing emulation_dequantize_weights creates a lot of branching and modifications in existing quantization schemes. I would strongly consider breaking this out into a separate scheme, similar to Fp8OnlineLinearMethod; otherwise a lot of function contracts/behaviors get changed.
I agree that emulation_dequantize_weights=False should be a linear backend, no problem there.
@kylesayrs Thanks a lot for reviewing! #35859 was based off #35855, which has been deemed not acceptable, so I will remove the logic about
This PR depends on #35733 for dense models. Please see the correct diff at: fxmarty-amd/vllm@upstream-nvfp4-simulation-support-rocm...upstream-nvfp4-simulation-support-moe
Purpose
This PR enables running NVFP4 MOE models on AMD Instinct, Nvidia Ampere, Hopper.
This is useful for researchers, anybody trying out microscaling formats, and people who would like to run e.g. https://huggingface.co/nvidia/Qwen3-30B-A3B-NVFP4 or https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4 on non-Blackwell devices.
Test Plan
See `test_llama4_nvfp4_moe_emulation`:
https://github.com/fxmarty-amd/vllm/blob/457f9dfa581abc12de32b10ae0674cc8e086edfc/tests/quantization/test_blackwell_moe.py#L119

run with:

```shell
export PRETRAINED_PATH="/shareddata/nvidia/Qwen3-30B-A3B-NVFP4"
```

and:

```shell
export PRETRAINED_PATH="/shareddata/RedHatAI/Qwen3-30B-A3B-NVFP4"
```