[Quantization][ModelOpt] Add W4A16 NVFP4 support to fused MoE by juhi10071998 · Pull Request #42440 · vllm-project/vllm

juhi10071998 · 2026-05-12T17:13:00Z

Summary

Adds W4A16 NVFP4 support to ModelOptNvFp4FusedMoE. When the on-disk quant_algo is W4A16_NVFP4, the MoE class passes activation_key=None to select_nvfp4_moe_backend. Every W4A4 backend's _supports_quant_scheme requires the key pair to be exactly (kNvfp4Static, kNvfp4Dynamic), so it rejects (kNvfp4Static, None); Marlin's check only consults weight_key, so it accepts. Marlin's MoE prep already nulls activation scales in convert_to_nvfp4_moe_kernel_format (oracle/nvfp4.py:362-363) and routes through nvfp4_w4a16_moe_quant_config (oracle/nvfp4.py:433), so no other change is needed.
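The filtering behavior described above can be illustrated with a small toy model. This is only a sketch, not the vLLM implementation: the `Backend` class, the string quant keys, and the three-entry priority list are hypothetical stand-ins, and the real oracle also applies hardware and availability checks that are omitted here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical string stand-ins for vLLM's quant-key objects.
K_NVFP4_STATIC = "kNvfp4Static"
K_NVFP4_DYNAMIC = "kNvfp4Dynamic"


@dataclass(frozen=True)
class Backend:
    name: str
    checks_activation: bool  # True for W4A4 kernels, False for Marlin

    def supports_quant_scheme(self, weight_key: str,
                              activation_key: Optional[str]) -> bool:
        if self.checks_activation:
            # W4A4 kernels require the exact (weight, activation) pair.
            return (weight_key, activation_key) == (K_NVFP4_STATIC,
                                                    K_NVFP4_DYNAMIC)
        # Marlin-style check: only the weight key is consulted.
        return weight_key == K_NVFP4_STATIC


# Assumed priority order, a subset of the backends named in the logs.
BACKENDS = [
    Backend("FLASHINFER_CUTLASS", checks_activation=True),
    Backend("VLLM_CUTLASS", checks_activation=True),
    Backend("MARLIN", checks_activation=False),
]


def select_backend(weight_key: str, activation_key: Optional[str]) -> str:
    """Return the first backend whose scheme check accepts the key pair."""
    for backend in BACKENDS:
        if backend.supports_quant_scheme(weight_key, activation_key):
            return backend.name
    raise ValueError("no backend supports this quant scheme")


w4a4_choice = select_backend(K_NVFP4_STATIC, K_NVFP4_DYNAMIC)
w4a16_choice = select_backend(K_NVFP4_STATIC, None)
```

In this toy model, passing `activation_key=None` is the only thing that changes between the two calls, which mirrors the claim that no selection logic needs to be added.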

Follow-up to #41769 (merged), which added ModelOptNvFp4W4A16LinearMethod for dense Linear. This PR extends the same scheme to fused MoE. The two paths differ in structure: select_nvfp4_moe_backend can choose a kernel based on the activation_key value, so the existing MoE class suffices, whereas init_nvfp4_linear_kernel for dense Linear takes no keys, which is why that path needed a new class.

Tested: 1 parametrized unit test (2 cases, CPU-only, ~1s) + end-to-end smoke verified on nvidia/Qwen3.6-35B-A3B-NVFP4 (Qwen3.5-MoE-VL hybrid).

Test plan

Unit test — 2/2 pass on CPU, no GPU / no ckpt needed

pytest tests/quantization/test_modelopt.py::test_modelopt_nvfp4_moe_dispatches_to_marlin_when_w4a16 -v
# 2 passed in 1.54s

The test mocks select_nvfp4_moe_backend and asserts:

| `quant_method` | `use_a16` | `activation_key` arg |
| --- | --- | --- |
| `"NVFP4"` (W4A4 default) | `False` | `kNvfp4Dynamic` |
| `"W4A16_NVFP4"` | `True` | `None` |
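The shape of such a test can be sketched without vLLM installed. Everything below is a hypothetical stand-in: `ToyNvFp4FusedMoE` is not the real class, the quant keys are plain strings, and the selector is injected as a constructor argument rather than patched at module level as the real test presumably does.

```python
from unittest import mock

# Hypothetical string stand-ins for vLLM's quant-key objects.
K_NVFP4_STATIC = "kNvfp4Static"
K_NVFP4_DYNAMIC = "kNvfp4Dynamic"


class ToyNvFp4FusedMoE:
    """Toy model of the routing decision only; not the vLLM class."""

    def __init__(self, quant_method, select_backend):
        self.use_a16 = quant_method == "W4A16_NVFP4"
        # W4A16 keeps activations in fp16/bf16, so no activation key is passed.
        activation_key = None if self.use_a16 else K_NVFP4_DYNAMIC
        self.backend = select_backend(
            weight_key=K_NVFP4_STATIC, activation_key=activation_key
        )


# (quant_method, expected use_a16, expected activation_key argument)
CASES = [
    ("NVFP4", False, K_NVFP4_DYNAMIC),
    ("W4A16_NVFP4", True, None),
]


def check_dispatch(quant_method, expected_use_a16, expected_activation_key):
    # The mock records the call arguments; no GPU kernel is ever selected.
    selector = mock.MagicMock(return_value="MOCK_BACKEND")
    layer = ToyNvFp4FusedMoE(quant_method, select_backend=selector)
    assert layer.use_a16 is expected_use_a16
    selector.assert_called_once()
    assert selector.call_args.kwargs["activation_key"] == expected_activation_key
    return True


results = [check_dispatch(*case) for case in CASES]
```

Because the selector is mocked, the check exercises only the argument routing, which is why a test of this kind can run on CPU without a checkpoint.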

End-to-end smoke — `nvidia/Qwen3.6-35B-A3B-NVFP4`

Qwen3.5-MoE-VL hybrid (gated-delta linear attention + conv1d + experts, ~35B). One GPU, eager mode.

W4A4 ckpt (default quant_algo=NVFP4) — regression check, must not change from upstream main:

WARNING modelopt.py  Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4)
INFO    nvfp4.py:282 Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend out of potential backends: [...]
=== completion: ' Paris.\n\n<think>\n\n</think>\n\nThat is correct. Paris is the capital and'

W4A16 ckpt (hf_quant_config.json patched to quant_algo=W4A16_NVFP4) — exercises this PR's new path:

WARNING modelopt.py  Detected ModelOpt NVFP4 checkpoint (quant_algo=W4A16_NVFP4)
INFO    nvfp4.py:282 Using 'MARLIN' NvFp4 MoE backend out of potential backends:
                     ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTEDSL_BATCHED',
                      'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN', 'EMULATION'].
=== completion: ' Paris.\nThe capital of France is London.\nThe capital of France is'

Backend flips from FLASHINFER_CUTLASS (W4A4) to MARLIN (W4A16) purely from the activation_key=None filter, as designed.
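For anyone reproducing the W4A16 smoke run, the checkpoint patch might look like the sketch below. It assumes that hf_quant_config.json stores the algorithm under a top-level "quantization" object; that layout is an assumption and should be checked against the actual checkpoint. The helper name `patch_quant_algo` is hypothetical, and the demo runs against a throwaway temporary directory rather than a real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def patch_quant_algo(ckpt_dir, new_algo="W4A16_NVFP4"):
    """Rewrite quant_algo in hf_quant_config.json and return the old value.

    Assumes the key lives at cfg["quantization"]["quant_algo"].
    """
    path = Path(ckpt_dir) / "hf_quant_config.json"
    cfg = json.loads(path.read_text())
    previous = cfg["quantization"]["quant_algo"]
    cfg["quantization"]["quant_algo"] = new_algo
    path.write_text(json.dumps(cfg, indent=2))
    return previous


# Demo on a temporary directory with a minimal, made-up config file.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "hf_quant_config.json"
    config_path.write_text(json.dumps({"quantization": {"quant_algo": "NVFP4"}}))
    previous_algo = patch_quant_algo(tmp)
    patched_cfg = json.loads(config_path.read_text())
```

Patching a copy of the checkpoint directory keeps the original W4A4 config available for the regression run.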

Relation to other work

Builds on [Quantization] Add ModelOpt NVFP4 W4A16 (4-bit weights, fp16/bf16 activations) support #41769 (merged, dense Linear W4A16).
Slim split from [Quantization][ModelOpt] W4A16 NVFP4 fused MoE + --override-activation-dtype flag #42428 (open draft) — the broader PR also adds a --override-activation-dtype CLI flag for runtime-driven W4A16 routing. That portion is on hold pending coordination with [Quantization] Rework quantization_config to use QuantKey and allow for activation override #41566's QuantSpec(weight, activation) rework, since reviewers noted potential surface overlap. This PR contains only the ckpt-driven W4A16 routing — zero overlap with [Quantization] Rework quantization_config to use QuantKey and allow for activation override #41566 (different quant method, different mechanism, no file overlap).

Extends ModelOptNvFp4FusedMoE to honor W4A16_NVFP4 checkpoints.

When the on-disk quant_algo is W4A16_NVFP4, the MoE class passes activation_key=None to select_nvfp4_moe_backend. W4A4 backends reject the scheme (their _supports_quant_scheme requires (kNvfp4Static, kNvfp4Dynamic) exactly); Marlin survives (it only checks weight_key). Marlin's MoE prep already nulls activation scales in convert_to_nvfp4_moe_kernel_format and routes through nvfp4_w4a16_moe_quant_config, so no other change is needed.

Follow-up to PR vllm-project#41769 (dense Linear W4A16).

Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Co-authored-by: Claude

Parametrized test in tests/quantization/test_modelopt.py covering the two on-disk routing cases for ModelOptNvFp4FusedMoE:

- quant_method="NVFP4" (W4A4 default) → use_a16=False, oracle receives activation_key=kNvfp4Dynamic (W4A4 backends accept).
- quant_method="W4A16_NVFP4" → use_a16=True, oracle receives activation_key=None (every W4A4 backend's _supports_quant_scheme rejects (kNvfp4Static, None); Marlin survives).

Mocks select_nvfp4_moe_backend to capture call args. CPU-only, ~1s.

Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Co-authored-by: Claude

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request updates the ModelOptNvFp4FusedMoE class to correctly handle W4A16_NVFP4 checkpoints by dispatching to the Marlin backend. It achieves this by setting the activation_key to None when the W4A16_NVFP4 quantization method is used, effectively filtering out W4A4 backends. A corresponding unit test has been added to verify this dispatch logic. I have no feedback to provide.

meenchen

LGTM

juhi10071998 · 2026-05-13T19:32:33Z

Superseded by #42566 — same Pass-1 commits plus the mixed-precision dispatch addition (which was being held on a separate fork branch). Closing in favor of #42566 since the two changes are tightly coupled (mixed-precision dispatch depends on the use_a16 logic added in Pass-1). cc @mgoin @pavanimajety @meenchen for visibility — note #42566 has scope overlap with #42549 (different approach: two pre-built NVFP4 sub-configs vs parameterized single config); happy to coordinate.

juhi10071998 added 2 commits May 12, 2026 16:55

juhi10071998 marked this pull request as ready for review May 12, 2026 17:17

juhi10071998 requested review from mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and zyongye as code owners May 12, 2026 17:17

claude Bot reviewed May 12, 2026


gemini-code-assist Bot reviewed May 12, 2026


juhi10071998 mentioned this pull request May 12, 2026

[Quantization][ModelOpt] W4A16 NVFP4 fused MoE + --override-activation-dtype flag #42428

Draft

meenchen approved these changes May 12, 2026


juhi10071998 added 2 commits May 12, 2026 14:31

Merge branch 'main' into modelopt_nvfp4_moe_w4a16

63feb61

Merge branch 'main' into modelopt_nvfp4_moe_w4a16

c31af7e

juhi10071998 mentioned this pull request May 13, 2026

[Quantization][ModelOpt] W4A16 NVFP4 fused MoE + mixed-precision dispatch #42566

Merged

juhi10071998 closed this May 13, 2026

juhi10071998 wants to merge 4 commits into vllm-project:main from juhi10071998:modelopt_nvfp4_moe_w4a16


2 participants
