[Quark] Support loading Quark NVFP4 checkpoints in vLLM#35859
fxmarty-amd wants to merge 80 commits into vllm-project:main from
Conversation
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Code Review
This pull request adds support for loading Quark NVFP4 checkpoints in vLLM, including an emulation path for hardware that doesn't natively support NVFP4. The changes are extensive, touching configuration, quantization layers, and tests. A significant part of the work involves refactoring to accommodate the new emulation backend for both dense and MoE layers. While the overall approach is sound, I've identified a critical issue in the handling of quantization scales for the new QuarkNVFP4 scheme which could lead to incorrect model outputs.
vllm/model_executor/layers/quantization/quark/schemes/quark_nvfp4.py
This pull request has merge conflicts that must be resolved before it can be merged.
```python
x_fp4 = x_fp4.reshape(x_m, x_k // block_size, block_size)
x_blockscale = x_blockscale.unsqueeze(-1) / global_scale
x_dq = (x_fp4 * x_blockscale).reshape(x_m, x_k).to(output_dtype)
del x_fp4, x_blockscale
```
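For context, the blockwise dequantization this excerpt performs can be sketched as a standalone function. The tensor names, shapes, and function name below are assumptions inferred from the diff, not the PR's actual helper:

```python
import torch

def dequantize_nvfp4_blockwise(
    x_fp4: torch.Tensor,         # (M, K), FP4 codes already expanded to float
    x_blockscale: torch.Tensor,  # (M, K // block_size), per-block scales
    global_scale: torch.Tensor,  # scalar tensor
    block_size: int,
    output_dtype: torch.dtype,
) -> torch.Tensor:
    x_m, x_k = x_fp4.shape
    # Group the K dimension into blocks so each block shares one scale.
    x_fp4 = x_fp4.reshape(x_m, x_k // block_size, block_size)
    # Fold the global scale into the per-block scales, then broadcast
    # over the last (block) dimension.
    scale = x_blockscale.unsqueeze(-1) / global_scale
    return (x_fp4 * scale).reshape(x_m, x_k).to(output_dtype)
```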
These variables are already deleted (and garbage collected) upon function exit.
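The point being made here is that CPython releases a function's locals when the frame exits, so an explicit `del` at the end of a function adds nothing. A minimal stdlib-only illustration (the `Buf` class stands in for an intermediate tensor):

```python
import weakref

class Buf:
    """Stands in for an intermediate tensor held in a local variable."""

freed = []

def forward():
    x_dq = Buf()
    # Register a callback that fires when x_dq is garbage collected.
    weakref.finalize(x_dq, freed.append, "x_dq freed")
    # No explicit `del x_dq`: the local dies when the frame exits.
    return None

forward()
# In CPython the finalizer has already run by this point, because the
# frame's locals were released (refcount hit zero) on function exit.
```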
```python
# Only delete w_dq if we created it (not a reference to weight)
if weight.dtype != x.dtype:
    del w_dq
del x_dq
```
These variables are already deleted (and garbage collected) upon function exit.
```python
def convert_to_nvfp4_linear_kernel_format(
    backend: NvFp4LinearBackend,
    layer: torch.nn.Module,
    emulation_dequantize_weights: bool | None = None,
```
Suggested change:
```diff
- emulation_dequantize_weights: bool | None = None,
+ emulation_dequantize_weights: bool = False,
```
```python
# (operation not permitted when stream is capturing)
kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.weight.device)

if emulation_dequantize_weights:
```
It almost feels like this should be a separate scheme, like OnlineDequantizeNvFp4LinearMethod. That would avoid the extra branching in the NVFP4 emulation logic.
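A rough sketch of what that split could look like. The class and method names here are hypothetical, loosely modeled on vLLM's quantization-scheme pattern, and `_dequantize_weight` is a placeholder for the real FP4 path:

```python
import torch
import torch.nn.functional as F

class EmulatedNvFp4LinearMethod:
    """Base scheme: keep packed weights, dequantize on every forward."""

    def _dequantize_weight(self, layer: torch.nn.Module) -> torch.Tensor:
        # Placeholder for the real FP4 -> float dequantization path.
        return layer.weight.to(torch.float32)

    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self._dequantize_weight(layer))

class OnlineDequantizeNvFp4LinearMethod(EmulatedNvFp4LinearMethod):
    """Variant: dequantize once after load; apply() needs no branching."""

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        layer.weight = torch.nn.Parameter(
            self._dequantize_weight(layer), requires_grad=False
        )

    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, layer.weight)
```

With the choice made at scheme-selection time, neither `apply` has to inspect flags or dtypes at runtime.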
```python
    group_size,
)
# Check if weight is already dequantized (same dtype as x)
if weight.dtype == x.dtype:
```
This feels potentially brittle, but it will work in this case. Again, it would be nicer to avoid branching here and instead branch at the scheme level.
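One way to make that state explicit rather than inferred from dtypes (a sketch with assumed names; the PR itself uses the dtype check above):

```python
import torch
import torch.nn.functional as F

def _dequantize(weight: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Placeholder for the NVFP4 emulation dequantization path.
    return weight.to(dtype)

def nvfp4_emulated_linear(
    weight: torch.Tensor,
    x: torch.Tensor,
    weight_is_dequantized: bool,
) -> torch.Tensor:
    # An explicit flag records intent directly. Inferring it from
    # weight.dtype == x.dtype can misfire if, say, activations are
    # bfloat16 while ahead-of-time dequantized weights are float16.
    w = weight if weight_is_dequantized else _dequantize(weight, x.dtype)
    return F.linear(x, w)
```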
kylesayrs left a comment:
I think that, as it stands, passing emulation_dequantize_weights creates a lot of branching and modifications in existing quantization schemes. I would strongly consider breaking this out into a separate scheme, similar to Fp8OnlineLinearMethod; otherwise a lot of function contracts/behavior gets changed.
I agree that emulation_dequantize_weights=False should be a linear backend, no problem there.
```python
    "Only the backend NvFp4LinearBackend.EMULATION is tested with"
    f" QuarkNVFP4, got backend={self.backend}. Use at your own risk."
)
self.swizzle: bool | None = None
```
Is there a purpose behind defaulting swizzle to None?
```python
if self.backend != NvFp4LinearBackend.EMULATION:
    logger.warning_once(
        "Only the backend NvFp4LinearBackend.EMULATION is tested with"
        f" QuarkNVFP4, got backend={self.backend}. Use at your own risk."
```
Consider being more descriptive about what might happen.
```python
    layer,
    emulation_dequantize_weights=self.emulation_dequantize_weights,
)
del layer.weight_scale_2
```
If quark is going to support nvfp4 in the future, you'll need this parameter, right?
This PR depends on modelopt and compressed-tensors on AMD Instinct MI300, MI355X and Hopper through emulation #35733. Please see the correct and much more minimalist diff at: fxmarty-amd/vllm@upstream-nvfp4-simulation-support-moe...upstream-nvfp4-simulated-quark
Purpose
https://github.com/amd/Quark/ has experimental NVFP4 support that will be extended in future releases.
Todo:
Test Plan
See test_nvfp4_wikitext_correctness (to be fixed) at https://github.com/fxmarty-amd/vllm/blob/0cc42070bcb087f7dad4e5bc9124bafbae29c7bd/tests/quantization/test_quark.py#L268