[Quantization][ModelOpt] Add W4A16 NVFP4 support to fused MoE#42440

Closed
juhi10071998 wants to merge 4 commits into vllm-project:main from juhi10071998:modelopt_nvfp4_moe_w4a16

@juhi10071998 juhi10071998 commented May 12, 2026

Contributor

Summary

Adds W4A16 NVFP4 support to ModelOptNvFp4FusedMoE. When the on-disk quant_algo is W4A16_NVFP4, the MoE class passes activation_key=None to select_nvfp4_moe_backend. Every W4A4 backend's _supports_quant_scheme requires exactly (kNvfp4Static, kNvfp4Dynamic), so each one rejects the scheme; Marlin's check consults only weight_key, so it accepts. Marlin's MoE prep already nulls activation scales in convert_to_nvfp4_moe_kernel_format (oracle/nvfp4.py:362-363) and routes through nvfp4_w4a16_moe_quant_config (oracle/nvfp4.py:433), so no other change is needed.
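
The filtering described above can be illustrated with a minimal, self-contained sketch. It does not import vLLM: `Backend`, `w4a4_supports`, `marlin_supports`, `select_backend`, the string constants, and the three-entry priority list are all hypothetical stand-ins for this illustration, and the real oracle also considers hardware and availability checks that are omitted here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Stand-in string constants (assumption); the real quant keys live in vLLM.
K_NVFP4_STATIC = "kNvfp4Static"
K_NVFP4_DYNAMIC = "kNvfp4Dynamic"


@dataclass(frozen=True)
class Backend:
    """Hypothetical backend record: a name plus a scheme-support predicate."""
    name: str
    supports: Callable[[str, Optional[str]], bool]


def w4a4_supports(weight_key: str, activation_key: Optional[str]) -> bool:
    # W4A4 kernels require exactly this (weight, activation) pair.
    return (weight_key, activation_key) == (K_NVFP4_STATIC, K_NVFP4_DYNAMIC)


def marlin_supports(weight_key: str, activation_key: Optional[str]) -> bool:
    # Weight-only kernel: the activation key is never consulted.
    return weight_key == K_NVFP4_STATIC


# Illustrative priority order only (assumption, not the real list).
BACKENDS = [
    Backend("FLASHINFER_CUTLASS", w4a4_supports),
    Backend("VLLM_CUTLASS", w4a4_supports),
    Backend("MARLIN", marlin_supports),
]


def select_backend(weight_key: str, activation_key: Optional[str]) -> str:
    """Return the first backend, in priority order, that accepts the scheme."""
    for backend in BACKENDS:
        if backend.supports(weight_key, activation_key):
            return backend.name
    raise ValueError("no backend supports this quantization scheme")


w4a4_choice = select_backend(K_NVFP4_STATIC, K_NVFP4_DYNAMIC)
w4a16_choice = select_backend(K_NVFP4_STATIC, None)
```

In this sketch, passing `None` as the activation key makes every W4A4 predicate fail, so only the weight-only backend remains. No separate W4A16 code path is needed in the selector itself.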

Follow-up to #41769 (merged), which added ModelOptNvFp4W4A16LinearMethod for dense Linear. This PR extends the same scheme to fused MoE. No new class is needed here because select_nvfp4_moe_backend can choose the kernel from the activation_key value; for dense Linear, init_nvfp4_linear_kernel takes no keys, which is why #41769 had to add a separate class.

Tested: 1 parametrized unit test (2 cases, CPU-only, ~1s) + end-to-end smoke verified on nvidia/Qwen3.6-35B-A3B-NVFP4 (Qwen3.5-MoE-VL hybrid).

Test plan

Unit test — 2/2 pass on CPU, no GPU / no ckpt needed

pytest tests/quantization/test_modelopt.py::test_modelopt_nvfp4_moe_dispatches_to_marlin_when_w4a16 -v
# 2 passed in 1.54s

The test mocks select_nvfp4_moe_backend and asserts:

| quant_method | use_a16 | activation_key arg |
|---|---|---|
| "NVFP4" (W4A4 default) | False | kNvfp4Dynamic |
| "W4A16_NVFP4" | True | None |

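The mock-and-assert pattern behind the two cases above might look roughly like the following sketch. `FakeFusedMoE` is a hypothetical stand-in for the routing logic in ModelOptNvFp4FusedMoE (the real class is not imported), and the selector is injected as a constructor argument purely to keep the example self-contained; the actual test patches select_nvfp4_moe_backend instead.

```python
from unittest import mock


class FakeFusedMoE:
    """Hypothetical stand-in for the MoE class's quant_method routing."""

    def __init__(self, quant_method, select_backend):
        # W4A16 means weight-only quantization: no activation key is passed.
        self.use_a16 = quant_method == "W4A16_NVFP4"
        activation_key = None if self.use_a16 else "kNvfp4Dynamic"
        self.backend = select_backend(
            weight_key="kNvfp4Static", activation_key=activation_key
        )


# (quant_method, expected use_a16, expected activation_key argument)
CASES = [
    ("NVFP4", False, "kNvfp4Dynamic"),
    ("W4A16_NVFP4", True, None),
]


def check_case(quant_method, use_a16, activation_key):
    """Build the fake layer with a mocked selector and inspect the call."""
    selector = mock.Mock(return_value="SOME_BACKEND")
    moe = FakeFusedMoE(quant_method, selector)
    assert moe.use_a16 is use_a16
    selector.assert_called_once()
    assert selector.call_args.kwargs["activation_key"] == activation_key
    return True


results = [check_case(*case) for case in CASES]
```

Under pytest, the loop over `CASES` would normally become a `pytest.mark.parametrize` decorator on a single test function, which matches the "2 cases" reported in the test plan.
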
End-to-end smoke — nvidia/Qwen3.6-35B-A3B-NVFP4

Qwen3.5-MoE-VL hybrid (gated-delta linear attention + conv1d + experts, ~35B). One GPU, eager mode.

W4A4 ckpt (default quant_algo=NVFP4) — regression check, must not change from upstream main:

WARNING modelopt.py  Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4)
INFO    nvfp4.py:282 Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend out of potential backends: [...]
=== completion: ' Paris.\n\n<think>\n\n</think>\n\nThat is correct. Paris is the capital and'

W4A16 ckpt (hf_quant_config.json patched to quant_algo=W4A16_NVFP4) — exercises this PR's new path:

WARNING modelopt.py  Detected ModelOpt NVFP4 checkpoint (quant_algo=W4A16_NVFP4)
INFO    nvfp4.py:282 Using 'MARLIN' NvFp4 MoE backend out of potential backends:
                     ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTEDSL_BATCHED',
                      'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN', 'EMULATION'].
=== completion: ' Paris.\nThe capital of France is London.\nThe capital of France is'

Backend flips from FLASHINFER_CUTLASS (W4A4) to MARLIN (W4A16) purely from the activation_key=None filter, as designed.
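
For reproducing the W4A16 smoke run, the checkpoint patch mentioned above could be scripted along these lines. This is a sketch under assumptions: `patch_quant_algo` is a hypothetical helper, and it assumes quant_algo sits under a top-level "quantization" object in hf_quant_config.json, which should be checked against the actual checkpoint before use.

```python
import json
from pathlib import Path


def patch_quant_algo(ckpt_dir, new_algo="W4A16_NVFP4"):
    """Rewrite quant_algo in hf_quant_config.json and return the old value.

    Assumes the value lives at config["quantization"]["quant_algo"];
    adjust the key path if the checkpoint is laid out differently.
    """
    path = Path(ckpt_dir) / "hf_quant_config.json"
    config = json.loads(path.read_text())
    old_algo = config["quantization"]["quant_algo"]
    config["quantization"]["quant_algo"] = new_algo
    path.write_text(json.dumps(config, indent=2) + "\n")
    return old_algo
```

Running this on a copy of the checkpoint directory keeps the original W4A4 configuration available for the regression check.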

Relation to other work

Extends ModelOptNvFp4FusedMoE to honor W4A16_NVFP4 checkpoints. When the
on-disk quant_algo is W4A16_NVFP4, the MoE class passes activation_key=None
to select_nvfp4_moe_backend. W4A4 backends reject the scheme (their
_supports_quant_scheme requires (kNvfp4Static, kNvfp4Dynamic) exactly);
Marlin survives (it only checks weight_key). Marlin's MoE prep already
nulls activation scales in convert_to_nvfp4_moe_kernel_format and routes
through nvfp4_w4a16_moe_quant_config — no other change needed.

Follow-up to PR vllm-project#41769 (dense Linear W4A16).

Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Co-authored-by: Claude

Parametrized test in tests/quantization/test_modelopt.py covering the
two on-disk routing cases for ModelOptNvFp4FusedMoE:

- quant_method="NVFP4" (W4A4 default) → use_a16=False, oracle receives
  activation_key=kNvfp4Dynamic (W4A4 backends accept).
- quant_method="W4A16_NVFP4" → use_a16=True, oracle receives
  activation_key=None (every W4A4 backend's _supports_quant_scheme
  rejects (kNvfp4Static, None); Marlin survives).

Mocks select_nvfp4_moe_backend to capture call args. CPU-only, ~1s.

Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Co-authored-by: Claude

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the ModelOptNvFp4FusedMoE class to correctly handle W4A16_NVFP4 checkpoints by dispatching to the Marlin backend. It achieves this by setting the activation_key to None when the W4A16_NVFP4 quantization method is used, effectively filtering out W4A4 backends. A corresponding unit test has been added to verify this dispatch logic. I have no feedback to provide.

@meenchen meenchen left a comment

Contributor

LGTM

@juhi10071998

Contributor Author

Superseded by #42566, which contains the same Pass-1 commits plus the mixed-precision dispatch addition (previously held on a separate fork branch). Closing this PR in favor of #42566 because the two changes are tightly coupled: mixed-precision dispatch depends on the use_a16 logic added in Pass-1. cc @mgoin @pavanimajety @meenchen for visibility. Note that #42566 overlaps in scope with #42549, which takes a different approach (two pre-built NVFP4 sub-configs vs. a single parameterized config); happy to coordinate.
