[Quantization] Refactor compressed-tensors quantization implementation to reuse the upstream implementation, and add W4A16 support #6644
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
Summary of Changes

Hello @menogrey, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request focuses on enhancing the quantization capabilities within the vllm-ascend project. It refactors the existing compressed-tensors quantization implementation to align with the upstream vllm project, promoting code reuse and simplifying maintenance. Additionally, it introduces support for W4A16 quantization, a technique that can lead to improved performance and reduced memory usage on Ascend NPUs. The changes include modifications to fused MoE operations and layernorm, as well as the addition of new files for the W4A16 scheme implementation and kernel definitions.
Activity
Code Review
Suggested PR Title:
[Quantization][Feature] Refactor compressed-tensors to reuse upstream and add W4A16 support

Suggested PR Summary:
### What this PR does / why we need it?
This pull request refactors the `compressed-tensors` quantization implementation to better align with and reuse the upstream `vllm` implementation. By monkey-patching the upstream quantization methods, we can leverage the generic logic while providing Ascend-specific kernels and weight processing.
Key changes include:
- **Reusing Upstream Implementation**: Instead of replacing the `compressed-tensors` quantization method, this PR patches the upstream `CompressedTensorsWNA16MarlinMoEMethod` and mixed-precision kernel selection to inject Ascend-specific logic. This improves maintainability and compatibility with future upstream changes.
- **W4A16 Support**: Adds support for W4A16 quantization for both standard linear layers and FusedMoE layers on Ascend NPUs.
- `AscendwNa16LinearKernel` is introduced for linear layers, which handles weight transformation (unpack, transpose, repack for NPU) and uses `npu_weight_quant_batchmatmul` for computation.
- `AscendW4A16FusedMoEMethod` is added for MoE layers, which performs the necessary weight layout conversion from the Marlin format to the NPU-optimized format.
- **Robustness Improvements**: Adds a `hasattr` check in `RMSNorm` to prevent crashes when `quant_config` does not have a `quant_description` attribute.
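The `hasattr` guard mentioned above can be sketched in isolation; this is a hypothetical illustration of the pattern only — the class and attribute names are taken from the review text, not from the actual diff:

```python
class DummyQuantConfig:
    """Stand-in for a quant_config that lacks `quant_description`."""
    pass

def get_quant_description(quant_config):
    # Guard against configs (e.g. upstream compressed-tensors configs)
    # that do not carry an Ascend-specific `quant_description` attribute,
    # so RMSNorm-style callers degrade gracefully instead of crashing.
    if quant_config is not None and hasattr(quant_config, "quant_description"):
        return quant_config.quant_description
    return None

print(get_quant_description(DummyQuantConfig()))  # -> None, not AttributeError
```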
### Does this PR introduce _any_ user-facing change?
Yes, this PR introduces support for `w4a16` quantization via the `compressed-tensors` method on Ascend hardware. Users can now leverage this quantization scheme for models that support it.
### How was this patch tested?
CI should pass. The changes were tested with models using W4A16 quantization to ensure correctness and performance.

This PR refactors the compressed-tensors quantization to reuse upstream implementations and adds W4A16 support. The changes are mostly well-structured, but I've found a critical bug in the weight transformation logic that needs to be addressed. I've also pointed out some areas for improvement regarding code duplication and commented-out code.
unpacked_w13_weight = (
    unpack_from_int32(
        layer.w13_weight_packed.data.flatten(0, 1),
        torch.Size([w13_shape[0] * w13_shape[1], w13_shape[2] * pack_factor]),
        num_bits,
    )
    .view(w13_shape[0], w13_shape[1], -1)
    .transpose(1, 2)
    .contiguous()
    .int()
)
unpacked_w2_weight = (
    unpack_from_int32(
        layer.w2_weight_packed.data.flatten(0, 1),
        torch.Size([w2_shape[0] * w2_shape[1], w2_shape[2] * pack_factor]),
        num_bits,
    )
    .view(w2_shape[0], w2_shape[1], -1)
    .transpose(1, 2)
    .contiguous()
    .int()
)
The .transpose(1, 2) calls for both unpacked_w13_weight (line 298) and unpacked_w2_weight (line 309) are incorrect. The weight transformation logic should produce tensors with shapes [e, 2*n, k] for w13 and [e, k, n] for w2 before packing for the NPU. The current implementation incorrectly transposes them to [e, k, 2*n] and [e, n, k] respectively, which will lead to incorrect model outputs. Please remove both .transpose(1, 2) calls.
unpacked_w13_weight = (
unpack_from_int32(
layer.w13_weight_packed.data.flatten(0, 1),
torch.Size([w13_shape[0] * w13_shape[1], w13_shape[2] * pack_factor]),
num_bits,
)
.view(w13_shape[0], w13_shape[1], -1)
.contiguous()
.int()
)
unpacked_w2_weight = (
unpack_from_int32(
layer.w2_weight_packed.data.flatten(0, 1),
torch.Size([w2_shape[0] * w2_shape[1], w2_shape[2] * pack_factor]),
num_bits,
)
.view(w2_shape[0], w2_shape[1], -1)
.contiguous()
.int()
)
def unpack_from_int32(
    weight: torch.Tensor,
    shape: torch.Size,
    num_bits: int,
    packed_dim: int = 1,
) -> torch.Tensor:
    """Unpacks quantized weights from int32 format back to original bits.

    :param weight: The packed int32 tensor containing quantized weights
    :param shape: Original shape to restore, defaults to None
    :param num_bits: The number of bits used for quantization (<= 8)
    :param packed_dim: Dimension along which weights are packed (0 or 1), defaults to 1
    :return: Unpacked tensor with int8 dtype after applying offset correction
    """
    assert weight.dtype == torch.int32, f"Expecting `weight.dtype` is torch.int32 but got {weight.dtype}."
    assert num_bits <= 8, f"Expecting `num_bits` should not be larger than 8 but got {num_bits}."

    pack_factor = 32 // num_bits
    mask = (1 << num_bits) - 1

    if packed_dim == 1:
        unpacked_weight = torch.zeros(
            (weight.shape[0], weight.shape[1] * pack_factor),
            device=weight.device,
            dtype=torch.int32,
        )
        for i in range(pack_factor):
            unpacked_weight[:, i::pack_factor] = (weight >> (num_bits * i)) & mask
        original_row_size = int(shape[1])
        unpacked_weight = unpacked_weight[:, :original_row_size]
    else:
        unpacked_weight = torch.zeros(
            (weight.shape[0] * pack_factor, weight.shape[1]),
            device=weight.device,
            dtype=torch.int32,
        )
        for i in range(pack_factor):
            unpacked_weight[i::pack_factor, :] = (weight >> (num_bits * i)) & mask
        original_row_size = int(shape[0])
        unpacked_weight = unpacked_weight[:original_row_size, :]

    offset = pow(2, num_bits) // 2
    unpacked_weight = (unpacked_weight - offset).to(torch.int8)

    return unpacked_weight
The function unpack_from_int32 is a duplicate of the one defined in vllm_ascend/quantization/compressed_tensors/schemes/wNa16.py. To avoid code duplication and improve maintainability, this function should be moved to a shared utility module (e.g., in a new vllm_ascend/quantization/utils.py file) and imported in both places where it's used.
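The bit layout that `unpack_from_int32` decodes can be illustrated in plain Python. This is a sketch of the packing convention only (group size, bit offsets, zero-point offset for the `packed_dim == 1` path), not vllm-ascend code:

```python
NUM_BITS = 4
PACK_FACTOR = 32 // NUM_BITS   # 8 int4 values per int32 word
OFFSET = 2 ** NUM_BITS // 2    # zero-point offset, 8 for int4
MASK = (1 << NUM_BITS) - 1

def pack_group(values):
    """Pack PACK_FACTOR signed int4 values into one int32-sized word.

    Value i of the group sits at bit offset NUM_BITS * i, stored with a
    +OFFSET shift so signed values fit into unsigned nibbles.
    """
    assert len(values) == PACK_FACTOR
    word = 0
    for i, v in enumerate(values):
        word |= ((v + OFFSET) & MASK) << (NUM_BITS * i)
    return word

def unpack_group(word):
    """Inverse of pack_group, mirroring the shift-and-mask loop
    in unpack_from_int32 (including the offset correction)."""
    return [((word >> (NUM_BITS * i)) & MASK) - OFFSET
            for i in range(PACK_FACTOR)]

vals = [-8, -3, 0, 1, 2, 5, 7, -1]
assert unpack_group(pack_group(vals)) == vals
```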
b0aa633 to 25c19ad
This pull request has conflicts, please resolve those before we can evaluate the pull request.
53a6b69 to 794b9ab
/gemini review
Code Review
This pull request refactors the compressed-tensors quantization implementation to better align with and reuse upstream vLLM logic. This is a significant architectural improvement, removing the custom AscendCompressedTensorsConfig in favor of patching upstream methods and kernels for Ascend-specific backends. The changes also introduce support for W4A16 quantization. The refactoring is extensive but appears consistent and well-executed. I have identified one high-severity issue concerning device hardcoding which could impact portability and testing environments.
scale = scale.transpose(1, 2).to(torch.float32).contiguous()
scale_np = scale.cpu().numpy()
scale_np.dtype = np.uint32
scale_uint64_tensor = torch.from_numpy(scale_np.astype(np.int64)).npu()
The use of .npu() hardcodes the device to NPU. It's better practice to use .to(scale.device) to respect the original device of the tensor. This improves code portability and robustness, especially for environments where testing might occur on different device types like CPU.
scale_uint64_tensor = torch.from_numpy(scale_np.astype(np.int64)).to(scale.device)
…reuse upstream implement. And add w4a16 support. Signed-off-by: menogrey <1299267905@qq.com>
17b1113 to 75e9cd1
w4a8.py should align with vLLM's file naming convention, rename it to compressed_tensors_w4a8_int.py
w8a8.py should align with vLLM's file naming convention, rename it to compressed_tensors_w8a8_int8.py
wNa16.py should align with vLLM's file naming convention, rename it to compressed_tensors_wNa16.py
# Transpose scales and offsets: [n, num_groups] -> [num_groups, n]
scale.data = scale.data.transpose(0, 1).contiguous()

def apply_weights(
Since apply_weights is moved to the common kernels module, modelslim quantized linear should also reuse it. The apply implementation under modelslim should be removed.
return True, None

def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
process_weights_after_loading and related functions are tied to on-disk weight formats and may not be reusable across different quantization tools. Keep the kernels module focused only on generic quantization forward implementations (activation quantization + quant matmul).
The methods module only serves modelslim. Rename the folder to modelslim and move different quantization schemes under modelslim/schemes.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
please rebase and fix the merge conflict if this PR is still needed.
What this PR does / why we need it?
This PR refactors vllm-ascend compressed-tensors quantization to reuse upstream vLLM implementations instead of maintaining a large custom config path. It removes the legacy AscendCompressedTensorsConfig flow, adds worker-time quantization patching, and registers Ascend NPU kernels for mixed-precision and scaled-mm paths. It also introduces Ascend-specific compressed-tensors scheme implementations for MoE W8A8, W4A8, and W4A16 (Marlin) and hooks them into upstream compressed-tensors methods via monkey patches.
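The monkey-patching approach described here can be sketched generically. All class and kernel names below are placeholders, not the actual vLLM or vllm-ascend symbols — the real targets are the upstream compressed-tensors method classes listed in the diff:

```python
class UpstreamWNA16MoEMethod:
    """Stand-in for an upstream quantization method class."""

    def select_kernel(self) -> str:
        return "marlin"  # upstream default, GPU-oriented

    def apply(self) -> str:
        # Generic upstream logic we want to keep reusing unchanged.
        return f"apply with {self.select_kernel()}"

def ascend_select_kernel(self) -> str:
    # Ascend-specific kernel selection (kernel name is hypothetical).
    return "npu_wna16"

# Patch at plugin-import (worker) time: only the kernel selection is
# replaced; the upstream apply() logic is reused as-is, so future
# upstream changes to apply() are picked up automatically.
UpstreamWNA16MoEMethod.select_kernel = ascend_select_kernel

print(UpstreamWNA16MoEMethod().apply())  # -> "apply with npu_wna16"
```

The design trade-off is that the patch only overrides the narrow hardware-specific hook, so the bulk of the upstream code path stays shared.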
Highlights
The implementation refers to the corresponding hardware plugin implementation: https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_compressed_tensors.py
Refer to #6953.
Does this PR introduce any user-facing change?
How was this patch tested?