[Quark] Support loading Quark NVFP4 checkpoints in vLLM#35859
fxmarty-amd wants to merge 80 commits into vllm-project:main from
Conversation
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Code Review
This pull request adds support for loading Quark NVFP4 checkpoints in vLLM, including an emulation path for hardware that doesn't natively support NVFP4. The changes are extensive, touching configuration, quantization layers, and tests. A significant part of the work involves refactoring to accommodate the new emulation backend for both dense and MoE layers. While the overall approach is sound, I've identified a critical issue in the handling of quantization scales for the new QuarkNVFP4 scheme which could lead to incorrect model outputs.
vllm/model_executor/layers/quantization/quark/schemes/quark_nvfp4.py
This pull request has merge conflicts that must be resolved before it can be merged.
```python
x_fp4 = x_fp4.reshape(x_m, x_k // block_size, block_size)
x_blockscale = x_blockscale.unsqueeze(-1) / global_scale
x_dq = (x_fp4 * x_blockscale).reshape(x_m, x_k).to(output_dtype)
del x_fp4, x_blockscale
```
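For context, the blockwise dequantization this excerpt performs can be sketched as a standalone function. The tensor names, shapes, and function name below are assumptions inferred from the diff, not the PR's actual helper:

```python
import torch

def dequantize_nvfp4_blockwise(
    x_fp4: torch.Tensor,         # (M, K), FP4 codes already expanded to float
    x_blockscale: torch.Tensor,  # (M, K // block_size), per-block scales
    global_scale: torch.Tensor,  # scalar tensor
    block_size: int,
    output_dtype: torch.dtype,
) -> torch.Tensor:
    x_m, x_k = x_fp4.shape
    # Group the K dimension into blocks so each block shares one scale.
    x_fp4 = x_fp4.reshape(x_m, x_k // block_size, block_size)
    # Fold the global scale into the per-block scales, then broadcast
    # over the last (block) dimension.
    scale = x_blockscale.unsqueeze(-1) / global_scale
    return (x_fp4 * scale).reshape(x_m, x_k).to(output_dtype)
```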
These variables are already deleted (and garbage collected) upon function exit.
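The point being made here is that CPython releases a function's locals when the frame exits, so an explicit `del` at the end of a function adds nothing. A minimal stdlib-only illustration (the `Buf` class stands in for an intermediate tensor):

```python
import weakref

class Buf:
    """Stands in for an intermediate tensor held in a local variable."""

freed = []

def forward():
    x_dq = Buf()
    # Register a callback that fires when x_dq is garbage collected.
    weakref.finalize(x_dq, freed.append, "x_dq freed")
    # No explicit `del x_dq`: the local dies when the frame exits.
    return None

forward()
# In CPython the finalizer has already run by this point, because the
# frame's locals were released (refcount hit zero) on function exit.
```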
```python
# Only delete w_dq if we created it (not a reference to weight)
if weight.dtype != x.dtype:
    del w_dq
del x_dq
```
These variables are already deleted (and garbage collected) upon function exit.
```python
def convert_to_nvfp4_linear_kernel_format(
    backend: NvFp4LinearBackend,
    layer: torch.nn.Module,
    emulation_dequantize_weights: bool | None = None,
```
Suggested change:
```diff
- emulation_dequantize_weights: bool | None = None,
+ emulation_dequantize_weights: bool = False,
```
```python
# (operation not permitted when stream is capturing)
kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.weight.device)

if emulation_dequantize_weights:
```
It almost feels like this should be a separate scheme, like OnlineDequantizeNvFp4LinearMethod. That would avoid the extra branching in the NVFP4 emulation logic.
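A rough sketch of what that split could look like. The class and method names here are hypothetical, loosely modeled on vLLM's quantization-scheme pattern, and `_dequantize_weight` is a placeholder for the real FP4 path:

```python
import torch
import torch.nn.functional as F

class EmulatedNvFp4LinearMethod:
    """Base scheme: keep packed weights, dequantize on every forward."""

    def _dequantize_weight(self, layer: torch.nn.Module) -> torch.Tensor:
        # Placeholder for the real FP4 -> float dequantization path.
        return layer.weight.to(torch.float32)

    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self._dequantize_weight(layer))

class OnlineDequantizeNvFp4LinearMethod(EmulatedNvFp4LinearMethod):
    """Variant: dequantize once after load; apply() needs no branching."""

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        layer.weight = torch.nn.Parameter(
            self._dequantize_weight(layer), requires_grad=False
        )

    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, layer.weight)
```

With the choice made at scheme-selection time, neither `apply` has to inspect flags or dtypes at runtime.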
```python
    group_size,
)
# Check if weight is already dequantized (same dtype as x)
if weight.dtype == x.dtype:
```
This feels potentially brittle, but it will work in this case. Again, it would be nicer to avoid branching here and instead branch at the scheme level.
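One way to make that state explicit rather than inferred from dtypes (a sketch with assumed names; the PR itself uses the dtype check above):

```python
import torch
import torch.nn.functional as F

def _dequantize(weight: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Placeholder for the NVFP4 emulation dequantization path.
    return weight.to(dtype)

def nvfp4_emulated_linear(
    weight: torch.Tensor,
    x: torch.Tensor,
    weight_is_dequantized: bool,
) -> torch.Tensor:
    # An explicit flag records intent directly. Inferring it from
    # weight.dtype == x.dtype can misfire if, say, activations are
    # bfloat16 while ahead-of-time dequantized weights are float16.
    w = weight if weight_is_dequantized else _dequantize(weight, x.dtype)
    return F.linear(x, w)
```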
kylesayrs left a comment:
I think that, as it stands, passing emulation_dequantize_weights creates a lot of branching and modifications in existing quantization schemes. I would strongly consider breaking this out into a separate scheme, similar to Fp8OnlineLinearMethod; otherwise a lot of function contracts/behavior gets changed.
I agree that emulation_dequantize_weights=False should be a linear backend, no problem there.
```python
    "Only the backend NvFp4LinearBackend.EMULATION is tested with"
    f" QuarkNVFP4, got backend={self.backend}. Use at your own risk."
)
self.swizzle: bool | None = None
```
Is there a purpose behind defaulting swizzle to None?
```python
if self.backend != NvFp4LinearBackend.EMULATION:
    logger.warning_once(
        "Only the backend NvFp4LinearBackend.EMULATION is tested with"
        f" QuarkNVFP4, got backend={self.backend}. Use at your own risk."
```
Consider being more descriptive about what might happen.
```python
    layer,
    emulation_dequantize_weights=self.emulation_dequantize_weights,
)
del layer.weight_scale_2
```
If quark is going to support nvfp4 in the future, you'll need this parameter, right?
This PR depends on modelopt and compressed-tensors on AMD Instinct MI300, MI355X and Hopper through emulation #35733. Please see the correct and much more minimalist diff at: fxmarty-amd/vllm@upstream-nvfp4-simulation-support-moe...upstream-nvfp4-simulated-quark
Purpose
https://github.com/amd/Quark/ has experimental NVFP4 support that will be extended in future releases.
Todo:
Test Plan
See test_nvfp4_wikitext_correctness (to be fixed) at https://github.com/fxmarty-amd/vllm/blob/0cc42070bcb087f7dad4e5bc9124bafbae29c7bd/tests/quantization/test_quark.py#L268