[Quantization][ModelOpt] Add W4A16 NVFP4 support to fused MoE #42440
juhi10071998 wants to merge 4 commits into
Extends ModelOptNvFp4FusedMoE to honor W4A16_NVFP4 checkpoints. When the on-disk quant_algo is W4A16_NVFP4, the MoE class passes activation_key=None to select_nvfp4_moe_backend. W4A4 backends reject the scheme (their _supports_quant_scheme requires (kNvfp4Static, kNvfp4Dynamic) exactly); Marlin survives (it only checks weight_key). Marlin's MoE prep already nulls activation scales in convert_to_nvfp4_moe_kernel_format and routes through nvfp4_w4a16_moe_quant_config, so no other change is needed.

Follow-up to PR vllm-project#41769 (dense Linear W4A16).

Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Co-authored-by: Claude
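The filtering described in this commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyBackend`, `select_backend`, the string stand-ins for the quant keys, and the priority order of the backend list are all hypothetical and are not vLLM's actual classes or signatures. It shows why passing `activation_key=None` leaves only a weight-only backend standing.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical string stand-ins for the quant keys named in the PR.
# The real key objects in vLLM are richer than plain strings.
kNvfp4Static = "kNvfp4Static"
kNvfp4Dynamic = "kNvfp4Dynamic"


@dataclass(frozen=True)
class ToyBackend:
    """Illustrative backend record: a name plus a support predicate."""
    name: str
    supports: Callable[[str, Optional[str]], bool]


def w4a4_supports(weight_key: str, activation_key: Optional[str]) -> bool:
    # W4A4-style check: the exact (weight, activation) pair is required.
    return (weight_key, activation_key) == (kNvfp4Static, kNvfp4Dynamic)


def weight_only_supports(weight_key: str, activation_key: Optional[str]) -> bool:
    # Marlin-style check: only the weight key is consulted.
    return weight_key == kNvfp4Static


# Assumed priority order, chosen only so the toy mirrors the PR's
# reported FLASHINFER_CUTLASS -> MARLIN flip.
BACKENDS = [
    ToyBackend("FLASHINFER_CUTLASS", w4a4_supports),
    ToyBackend("MARLIN", weight_only_supports),
]


def select_backend(weight_key: str, activation_key: Optional[str]) -> str:
    """Return the first backend whose predicate accepts the scheme."""
    for backend in BACKENDS:
        if backend.supports(weight_key, activation_key):
            return backend.name
    raise ValueError("no backend supports this quant scheme")
```

In this toy, `select_backend(kNvfp4Static, kNvfp4Dynamic)` picks the W4A4 entry, while `select_backend(kNvfp4Static, None)` falls through to the weight-only entry, because the exact-pair check fails as soon as the activation key is `None`.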
Parametrized test in tests/quantization/test_modelopt.py covering the two on-disk routing cases for ModelOptNvFp4FusedMoE:

- quant_method="NVFP4" (W4A4 default) → use_a16=False, oracle receives activation_key=kNvfp4Dynamic (W4A4 backends accept).
- quant_method="W4A16_NVFP4" → use_a16=True, oracle receives activation_key=None (every W4A4 backend's _supports_quant_scheme rejects (kNvfp4Static, None); Marlin survives).

Mocks select_nvfp4_moe_backend to capture call args. CPU-only, ~1s.

Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Co-authored-by: Claude
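The mock-and-capture pattern this commit message describes can be sketched without vLLM installed. Everything here is an assumption made for illustration: `ToyFusedMoE`, its constructor arguments, and `check_routing` are hypothetical, and the real test patches `select_nvfp4_moe_backend` inside vLLM rather than injecting a selector. Only `unittest.mock.Mock` and its `call_args` attribute are real standard-library APIs.

```python
from unittest import mock


class ToyFusedMoE:
    """Hypothetical stand-in for the MoE method class under test."""

    def __init__(self, quant_method, select_backend):
        # The checkpoint's quant method decides whether activations stay 16-bit.
        self.use_a16 = quant_method == "W4A16_NVFP4"
        activation_key = None if self.use_a16 else "kNvfp4Dynamic"
        # Delegate backend choice to the (mocked) selector.
        self.backend = select_backend(
            weight_key="kNvfp4Static", activation_key=activation_key
        )


def check_routing(quant_method, expected_use_a16, expected_activation_key):
    # The mock records the call so its arguments can be inspected afterwards.
    selector = mock.Mock(return_value="FAKE_BACKEND")
    layer = ToyFusedMoE(quant_method, selector)
    assert layer.use_a16 is expected_use_a16
    assert selector.call_args.kwargs["activation_key"] == expected_activation_key


# The two routing cases listed in the commit message.
CASES = [
    ("NVFP4", False, "kNvfp4Dynamic"),
    ("W4A16_NVFP4", True, None),
]

for case in CASES:
    check_routing(*case)
```

Because the selector is mocked, no kernel is built and no GPU or checkpoint is needed, which is consistent with the CPU-only runtime the commit message reports.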
Code Review
This pull request updates the ModelOptNvFp4FusedMoE class to correctly handle W4A16_NVFP4 checkpoints by dispatching to the Marlin backend. It achieves this by setting the activation_key to None when the W4A16_NVFP4 quantization method is used, effectively filtering out W4A4 backends. A corresponding unit test has been added to verify this dispatch logic. I have no feedback to provide.
Superseded by #42566, which has the same Pass-1 commits plus the mixed-precision dispatch addition (which was being held on a separate fork branch). Closing in favor of #42566 since the two changes are tightly coupled (mixed-precision dispatch depends on the use_a16 logic added in Pass-1). cc @mgoin @pavanimajety @meenchen for visibility. Note that #42566 has scope overlap with #42549 (different approach: two pre-built NVFP4 sub-configs vs parameterized single config); happy to coordinate.
Summary
Adds W4A16 NVFP4 support to `ModelOptNvFp4FusedMoE`. When the on-disk `quant_algo` is `W4A16_NVFP4`, the MoE class passes `activation_key=None` to `select_nvfp4_moe_backend`. Every W4A4 backend's `_supports_quant_scheme` requires `(kNvfp4Static, kNvfp4Dynamic)` exactly and rejects the scheme; Marlin's check only consults `weight_key` and accepts it. Marlin's MoE prep already nulls activation scales in `convert_to_nvfp4_moe_kernel_format` (`oracle/nvfp4.py:362-363`) and routes through `nvfp4_w4a16_moe_quant_config` (`oracle/nvfp4.py:433`), so no other change is needed.

Follow-up to #41769 (merged), which added `ModelOptNvFp4W4A16LinearMethod` for dense Linear. This PR extends the same scheme to fused MoE. Here, `select_nvfp4_moe_backend` can decide which kernel to choose depending on the value of `activation_key`; for dense Linear, `init_nvfp4_linear_kernel` does not take any keys, so a new class was needed there.

Tested: 1 parametrized unit test (2 cases, CPU-only, ~1s) plus an end-to-end smoke test verified on `nvidia/Qwen3.6-35B-A3B-NVFP4` (Qwen3.5-MoE-VL hybrid).

Test plan
Unit test: 2/2 pass on CPU, no GPU / no ckpt needed

    pytest tests/quantization/test_modelopt.py::test_modelopt_nvfp4_moe_dispatches_to_marlin_when_w4a16 -v # 2 passed in 1.54s

The test mocks `select_nvfp4_moe_backend` and asserts:

| `quant_method` | `use_a16` | `activation_key` arg |
| --- | --- | --- |
| `"NVFP4"` (W4A4 default) | `False` | `kNvfp4Dynamic` |
| `"W4A16_NVFP4"` | `True` | `None` |

End-to-end smoke: `nvidia/Qwen3.6-35B-A3B-NVFP4`

Qwen3.5-MoE-VL hybrid (gated-delta linear attention + conv1d + experts, ~35B). One GPU, eager mode.
W4A4 ckpt (default `quant_algo=NVFP4`), regression check, must not change from upstream main:

W4A16 ckpt (`hf_quant_config.json` patched to `quant_algo=W4A16_NVFP4`), exercises this PR's new path:

Backend flips from `FLASHINFER_CUTLASS` (W4A4) to `MARLIN` (W4A16) purely from the `activation_key=None` filter, as designed.

Relation to other work
`--override-activation-dtype` CLI flag for runtime-driven W4A16 routing. That portion is on hold pending coordination with the `QuantSpec(weight, activation)` rework in #41566 ([Quantization] Rework quantization_config to use QuantKey and allow for activation override), since reviewers noted potential surface overlap. This PR contains only the ckpt-driven W4A16 routing and has zero overlap with #41566 (different quant method, different mechanism, no file overlap).