
[Quark] Support loading Quark NVFP4 checkpoints in vLLM #35859

Open
fxmarty-amd wants to merge 80 commits into vllm-project:main from fxmarty-amd:upstream-nvfp4-simulated-quark

Conversation

Contributor

@fxmarty-amd fxmarty-amd commented Mar 3, 2026

This PR depends on

  1. [NVFP4] Support NVFP4 dense models from modelopt and compressed-tensors on AMD Instinct MI300, MI355X and Hopper through emulation #35733
  2. [NVFP4] Support NVFP4 MOE models on AMD Instinct, Nvidia Ampere, Hopper through NVFP4 MOE emulation #35737.

Please see the correct and much smaller diff at: fxmarty-amd/vllm@upstream-nvfp4-simulation-support-moe...upstream-nvfp4-simulated-quark

Purpose

https://github.com/amd/Quark/ has experimental nvfp4 support that will be extended in future releases.

Todo:

  • Port the parallel layer scale recomputation logic [won't do: instead, an error is raised if the weight global scales of the q/k/v or gate/up projections are not equal].
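Since the PR raises an error rather than recomputing scales for fused layers, the check can be sketched roughly as follows (the helper name, signature, and error message are illustrative, not the PR's actual code):

```python
import torch


def merge_global_scales(global_scales: list[torch.Tensor]) -> torch.Tensor:
    """Hypothetical helper: fused layers (q/k/v or gate/up projections) end up
    sharing a single weight global scale, so instead of recomputing scales in
    parallel, require all sub-projection scales to be identical."""
    reference = global_scales[0]
    for scale in global_scales[1:]:
        if not torch.equal(scale, reference):
            raise ValueError(
                "Expected identical weight global scales for fused "
                f"projections, got {global_scales}."
            )
    return reference
```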

Test Plan

See test_nvfp4_wikitext_correctness (to be fixed) at https://github.com/fxmarty-amd/vllm/blob/0cc42070bcb087f7dad4e5bc9124bafbae29c7bd/tests/quantization/test_quark.py#L268

Signed-off-by: Felix Marty <Felix.Marty@amd.com>
@mergify mergify bot added the rocm Related to AMD ROCm label Mar 3, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 3, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for loading Quark NVFP4 checkpoints in vLLM, including an emulation path for hardware that doesn't natively support NVFP4. The changes are extensive, touching configuration, quantization layers, and tests. A significant part of the work involves refactoring to accommodate the new emulation backend for both dense and MoE layers. While the overall approach is sound, I've identified a critical issue in the handling of quantization scales for the new QuarkNVFP4 scheme which could lead to incorrect model outputs.

@fxmarty-amd fxmarty-amd force-pushed the upstream-nvfp4-simulated-quark branch from 6e11ec3 to affdda7 Compare March 3, 2026 12:03
@fxmarty-amd fxmarty-amd marked this pull request as ready for review March 3, 2026 15:42
@mergify mergify bot removed the needs-rebase label Mar 6, 2026
@mergify

mergify bot commented Mar 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 12, 2026
x_fp4 = x_fp4.reshape(x_m, x_k // block_size, block_size)
x_blockscale = x_blockscale.unsqueeze(-1) / global_scale
x_dq = (x_fp4 * x_blockscale).reshape(x_m, x_k).to(output_dtype)
del x_fp4, x_blockscale
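For reference, the quoted dequantization path can be sketched as a self-contained function. Each group of `block_size` elements shares one block scale, which is itself divided by a second-level global scale (names and shapes here are illustrative, not vLLM's exact API):

```python
import torch


def dequantize_blockwise(
    x_fp4: torch.Tensor,         # values already decoded from E2M1 to float, shape (m, k)
    x_blockscale: torch.Tensor,  # per-block scales, shape (m, k // block_size)
    global_scale: torch.Tensor,  # scalar second-level scale
    block_size: int,
    output_dtype: torch.dtype,
) -> torch.Tensor:
    """Sketch of emulated NVFP4 block-scale dequantization."""
    x_m, x_k = x_fp4.shape
    # Group elements so each block lines up with its scale.
    x_fp4 = x_fp4.reshape(x_m, x_k // block_size, block_size)
    scale = x_blockscale.unsqueeze(-1) / global_scale
    return (x_fp4 * scale).reshape(x_m, x_k).to(output_dtype)
```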
Contributor


These variables are already deleted (and garbage collected) upon function exit

Comment on lines +204 to +207
# Only delete w_dq if we created it (not a reference to weight)
if weight.dtype != x.dtype:
del w_dq
del x_dq
Contributor


These variables are already deleted (and garbage collected) upon function exit
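The reviewer's point that explicit `del` is redundant here can be demonstrated with a small CPython example (illustrative, unrelated to vLLM's code): locals are decref'd when the frame exits, so the object is collected without any `del`.

```python
import weakref


class Buffer:
    pass


def f() -> "weakref.ref[Buffer]":
    buf = Buffer()
    ref = weakref.ref(buf)
    return ref  # `buf` goes out of scope here; no explicit `del` needed


ref = f()
# In CPython, the local was freed by refcounting when the function returned,
# so the weak reference is already dead.
assert ref() is None
```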

def convert_to_nvfp4_linear_kernel_format(
backend: NvFp4LinearBackend,
layer: torch.nn.Module,
emulation_dequantize_weights: bool | None = None,
Contributor


Suggested change
emulation_dequantize_weights: bool | None = None,
emulation_dequantize_weights: bool = False,

# (operation not permitted when stream is capturing)
kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.weight.device)

if emulation_dequantize_weights:
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It almost feels like this should be a separate scheme, like OnlineDequantizeNvFp4LinearMethod. This avoids extra branching in the nvfp4 emulation logic.

group_size,
)
# Check if weight is already dequantized (same dtype as x)
if weight.dtype == x.dtype:
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels potentially brittle, but will work in this case. Again, it would be nice to avoid branching here and instead branch at the scheme level

Contributor

@kylesayrs kylesayrs left a comment


I think that, as it stands, passing emulation_dequantize_weights creates a lot of branching and modifications in existing quantization schemes. I would strongly consider breaking this out into a separate scheme, similar to Fp8OnlineLinearMethod; otherwise a lot of function contracts/behaviors get changed.

I agree that emulation_dequantize_weights=False should be a linear backend, no problem there.
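The separate-scheme suggestion can be sketched roughly like this: a dedicated method dequantizes once at load time, so `apply` never branches on weight dtype (class and method names are illustrative, not vLLM's actual classes):

```python
import torch


class NvFp4LinearMethod:
    """Baseline sketch: keeps weights quantized and dequantizes in apply()."""

    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        w = self.dequantize(layer)  # emulated NVFP4 dequantization
        return torch.nn.functional.linear(x, w)

    def dequantize(self, layer: torch.nn.Module) -> torch.Tensor:
        raise NotImplementedError


class OnlineDequantizeNvFp4LinearMethod(NvFp4LinearMethod):
    """Separate scheme as suggested: dequantize once after weight loading,
    so the forward path is a plain linear with no dtype branching."""

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        layer.weight = torch.nn.Parameter(
            self.dequantize(layer), requires_grad=False
        )

    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(x, layer.weight)
```

The design choice is that the branch lives in scheme selection (which class is instantiated) rather than inside every forward call, keeping each method's contract uniform.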

"Only the backend NvFp4LinearBackend.EMULATION is tested with"
f" QuarkNVFP4, got backend={self.backend}. Use at your own risk."
)
self.swizzle: bool | None = None
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a purpose behind defaulting swizzle to None?

if self.backend != NvFp4LinearBackend.EMULATION:
logger.warning_once(
"Only the backend NvFp4LinearBackend.EMULATION is tested with"
f" QuarkNVFP4, got backend={self.backend}. Use at your own risk."
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider being more descriptive about what might happen.

layer,
emulation_dequantize_weights=self.emulation_dequantize_weights,
)
del layer.weight_scale_2
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If quark is going to support nvfp4 in the future, you'll need this parameter, right?

fxmarty-amd and others added 8 commits April 1, 2026 10:30
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
…emes/compressed_tensors_w4a4_nvfp4.py

Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: fxmarty-amd <felmarty@amd.com>