[NVFP4] Support NVFP4 MOE models on AMD Instinct, Nvidia Ampere, Hopper through NVFP4 MOE emulation #35737
Conversation
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
…vfp4-simulation-support-moe
Code Review
This pull request introduces support for NVFP4 MOE models on a wider range of hardware, including AMD Instinct, Nvidia Ampere, and Hopper, through an emulation backend. The changes are extensive, touching quantization layers, model execution, and tests to accommodate this new emulation path. The implementation appears solid and well-integrated. I've found one critical issue that needs to be addressed.
```python
if torch.unique(a13_scale).numel() != 1 or torch.unique(a2_scale).numel() != 1:
    logger.warning_once(
        "In NVFP4 linear, the activation global scale for inputs are different"
        " for MOE w13 (gate_up_proj) layer or MOE w2 (down_proj). Using"
        " a13_scale = a13_scale.max() and a2_scale = a2_scale.max()."
    )
```
I believe we do have some kernels that support different global scales per expert, for instance see #21408
@mgoin flashinfer default backends use a single shared global scale across all experts for both gate_up_proj and down_proj, see:
vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py, lines 240 to 246 at d9408ff
This logic is here to use similarly a single global scale for gate_up_proj input and down_proj input in the emulation code path using TritonExperts.
We display a warning because there is no logic in vLLM at the moment to recompute the fp8_e4m3 scales when taking this .max(). Fortunately, Model-Optimizer and compressed-tensors produce models that share the same global_scale across gate_proj/up_proj and across experts, so this is not an issue in practice. But in case the serialized global scales differ, simply taking the .max() as done currently is not enough.
This may be fixed in another PR.
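To make the overflow concern concrete: the serialized value is an inverse scale (roughly 448 * 6 / amax for NVFP4), so taking the max of the inverses picks the smallest effective divisor across experts. A pure-Python sketch; `merge_global_scales` and the amax values are illustrative, not vLLM code:

```python
# Illustrative numbers only -- merge_global_scales and the amax values
# are hypothetical, not vLLM code.

FP8_E4M3_MAX = 448.0  # max magnitude representable in float8_e4m3
FP4_E2M1_MAX = 6.0    # max magnitude representable in fp4 E2M1

def merge_global_scales(inverse_scales):
    # The emulation path keeps the max of the serialized inverse scales,
    # i.e. the smallest effective divisor across experts.
    return 1.0 / max(inverse_scales)

# Two experts serialized with inverse_scale = 448 * 6 / amax:
per_expert_amax = [2688.0, 5376.0]
inverse_scales = [FP8_E4M3_MAX * FP4_E2M1_MAX / a for a in per_expert_amax]

merged = merge_global_scales(inverse_scales)  # 1.0, from the amax=2688 expert

# A block of the amax=5376 expert now needs a block scale of
# amax_block / 6 / merged = 896, beyond the fp8_e4m3 max of 448:
block_scale = per_expert_amax[1] / FP4_E2M1_MAX / merged
overflows = block_scale > FP8_E4M3_MAX  # True -> the block scale gets clamped
```

With identical serialized global scales (the Model-Optimizer/compressed-tensors case) the merge is a no-op and no clamping occurs.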
```python
    block_shape: list[int] | None = None,
    is_fp4_scale_swizzled: bool = True,
    ocp_mx_scheme: str | None = None,
    emulation: bool = False,
```
It is bad practice to add emulation as an argument to this function and only use it for a single quant_dtype case. Why don't you just call ref_nvfp4_quant_dequant(A, A_scale, block_size=16) inline in apply?
@mgoin Which apply are you talking about? Nvfp4QuantizationEmulationTritonExperts inherits TritonExperts.apply, and I do NOT want to modify TritonExperts.apply itself; QDQ needs to be applied to BOTH a13 and a2.
For example, moe_kernel_quantize_input already handles MXFP4/MXFP6_E3M2/MXFP6_E4M3 fake QDQ through _mxfp4_quantize, _mxfp6_e3m2_quantize, _mxfp6_e2m3_quantize.
I agree this should be clarified. Do you propose keeping moe_kernel_quantize_input for REAL quantization cases, and having another function handle all QDQ cases?
and have in TritonExperts.apply:

```python
if not emulation:
    qintermediate_cache2, a2q_scale = moe_kernel_quantize_input(
        intermediate_cache2,
        a2_scale,
        self.quant_dtype,
        self.per_act_token_quant,
        self.block_shape,
    )
else:
    qintermediate_cache2, a2q_scale = moe_kernel_input_fake_quantization(
        intermediate_cache2,
        a2_scale,
        self.quant_dtype,
        self.per_act_token_quant,
        self.block_shape,
    )
```

Let me know!
I think Michael may be suggesting that the other argument combinations (fp8 + emulation) are not handled and instead silently fall back to real quantization.
Got it, let me address it properly.
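To illustrate the split discussed above, here is a hypothetical sketch of a dedicated emulation entry point that dispatches per quant dtype and refuses to silently fall back to real quantization for unhandled combinations; `moe_kernel_input_fake_quantization`, `_nvfp4_fake_qdq`, and the dispatch table are illustrative names, not existing vLLM APIs:

```python
# Hypothetical sketch -- names are illustrative, not existing vLLM APIs.

def _nvfp4_fake_qdq(x, scale, block_shape):
    # Placeholder for a reference like ref_nvfp4_quant_dequant(x, scale,
    # block_size=16): returns the fake quantize-dequantized tensor and
    # no packed scale (the data stays in the compute dtype).
    return x, None

_FAKE_QDQ_DISPATCH = {
    "nvfp4": _nvfp4_fake_qdq,
    # "mxfp4": ..., "mxfp6_e3m2": ..., etc.
}

def moe_kernel_input_fake_quantization(x, scale, quant_dtype, block_shape=None):
    """Emulation-only counterpart to moe_kernel_quantize_input: dispatch
    to a fake quantize-dequantize reference instead of a real kernel."""
    try:
        handler = _FAKE_QDQ_DISPATCH[quant_dtype]
    except KeyError:
        # Fail loudly instead of silently running real quantization.
        raise NotImplementedError(f"No fake-QDQ emulation for {quant_dtype!r}")
    return handler(x, scale, block_shape)
```

This keeps the real-quantization contract of moe_kernel_quantize_input unchanged, while unsupported dtype/emulation combinations raise instead of being ignored.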
```python
"""
Quantization Emulation Experts for MoE.

This module provides emulation support for MOE quantization schemes that
don't have native hardware support. It dequantizes weights on the fly
and falls back to calling fused_experts with activation quantization.

Similar to QuarkOCP_MX_MoEMethod's emulation path but abstracted into
a reusable NvFp4MoeBackend.
"""
```
Is this meant to be a general emulation moe or specific to nvfp4? I'm confused about the name vs the description
This is meant for NVFP4 only, if that is okay. Let me update the name/description accordingly.
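As a rough illustration of the flow the docstring describes (dequantize weights on the fly, fake-quantize activations, then run the unquantized fused path), here is a minimal, hypothetical sketch; the class and the injected callables are illustrative, not the PR's actual implementation:

```python
# Hypothetical sketch -- class and callable names are illustrative,
# not the PR's actual implementation.

class NvFp4MoEEmulationExperts:
    """Emulate an NVFP4 MoE on hardware without native fp4 support."""

    def __init__(self, dequantize_weight, fake_qdq_input, fused_experts):
        self._dequantize_weight = dequantize_weight  # fp4 -> compute dtype
        self._fake_qdq_input = fake_qdq_input        # activation QDQ
        self._fused_experts = fused_experts          # unquantized fused path

    def apply(self, hidden_states, w13, w2, w13_scale, w2_scale):
        # Dequantize packed weights on the fly: no native fp4 matmul here.
        w13_hp = self._dequantize_weight(w13, w13_scale)
        w2_hp = self._dequantize_weight(w2, w2_scale)
        # Emulate the fp4 activation error with quantize->dequantize.
        x = self._fake_qdq_input(hidden_states)
        return self._fused_experts(x, w13_hp, w2_hp)
```

The numerics then match fp4 (weights and activations carry fp4 rounding error) while all matmuls run in the high-precision compute dtype.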
```python
# moe_kernel_quantize_input -> ref_nvfp4_quant_dequant use the inverse scale.
# Similar to model_executor/layers/quantization/utils/flashinfer_fp4_moe.py.
# NOTE: at this point `a13_scale` and `a2_scale` are the inverses such that:
# `x_fp8_range = x * 1 / global_scale`, and `global_scale` is small.
# We take the max following e.g. flashinfer_fp4_moe.py, which results in likely
# overflow of the fp8 range, and scale clamping!
# It may be better to use min here.
a13_scale = a13_scale.max().to(torch.float32)
a2_scale = a2_scale.max().to(torch.float32)

a13_scale = 1.0 / a13_scale
a2_scale = 1.0 / a2_scale
```
I think this comment needs to be reworked. Also, you can just do `a13_scale = 1.0 / a13_scale.max().to(torch.float32)`, etc.
I updated the comment
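For intuition, here is a pure-Python sketch of the block-wise fake quantize-dequantize that a reference like ref_nvfp4_quant_dequant performs: scale each block of 16 values so its amax maps onto the E2M1 maximum (6.0), round onto the representable fp4 values, and immediately dequantize. This omits the fp8_e4m3 block-scale storage and the global scale that the real code handles, so it is a simplification, not the PR's implementation:

```python
# Simplified sketch of NVFP4 fake quantize-dequantize (no torch, no
# fp8_e4m3 block-scale storage, no global scale).

# Non-negative values representable in fp4 E2M1 (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def _round_to_e2m1(v):
    # Round |v| to the nearest representable magnitude, then restore sign.
    magnitude = min(E2M1_VALUES, key=lambda r: abs(r - abs(v)))
    return magnitude if v >= 0 else -magnitude

def nvfp4_fake_qdq(x, block_size=16):
    out = []
    for i in range(0, len(x), block_size):
        block = x[i : i + block_size]
        amax = max(abs(v) for v in block)
        # Map the block's amax onto the E2M1 maximum of 6.0.
        scale = amax / 6.0 if amax > 0 else 1.0
        out.extend(_round_to_e2m1(v / scale) * scale for v in block)
    return out
```

Values exactly representable after scaling round-trip losslessly; everything else picks up fp4 rounding error, which is precisely what the emulation wants to reproduce.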
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @fxmarty-amd, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
…emes/compressed_tensors_w4a4_nvfp4.py Co-authored-by: Kyle Sayers <kylesayrs@gmail.com> Signed-off-by: fxmarty-amd <felmarty@amd.com>
…ub.com/fxmarty-amd/vllm into upstream-nvfp4-simulation-support-rocm
kylesayrs left a comment
Reposting what I commented on the other PR: #35859 (review)
I think that, as it stands, passing emulation_dequantize_weights creates a lot of branching and modifications in existing quantization schemes. I would strongly consider breaking this out into a separate scheme, similar to Fp8OnlineLinearMethod; otherwise a lot of function contracts/behaviors get changed.
I agree that emulation_dequantize_weights=False should be a linear backend, no problem there.
@kylesayrs Thanks a lot for reviewing! #35859 was based off #35855, which has been deemed not acceptable, so I will remove the logic about
This PR depends on #35733 for dense models. Please see the correct diff at: fxmarty-amd/vllm@upstream-nvfp4-simulation-support-rocm...upstream-nvfp4-simulation-support-moe
Purpose
This PR enables running NVFP4 MOE models on AMD Instinct, Nvidia Ampere, Hopper.
This is useful for researchers, anybody trying out microscaling formats, and people who would like to run e.g. https://huggingface.co/nvidia/Qwen3-30B-A3B-NVFP4 or https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4 on non-Blackwell devices.
Test Plan
See `test_llama4_nvfp4_moe_emulation`:
https://github.com/fxmarty-amd/vllm/blob/457f9dfa581abc12de32b10ae0674cc8e086edfc/tests/quantization/test_blackwell_moe.py#L119

run with:

```shell
export PRETRAINED_PATH="/shareddata/nvidia/Qwen3-30B-A3B-NVFP4"
```

and:

```shell
export PRETRAINED_PATH="/shareddata/RedHatAI/Qwen3-30B-A3B-NVFP4"
```