[Quantization] Refactor compressed-tensors quantization implementation to reuse the upstream implementation, and add W4A16 support #6644
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
Summary of Changes

Hello @menogrey, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request focuses on enhancing the quantization capabilities within the vllm-ascend project. It refactors the existing compressed-tensors quantization implementation to align with the upstream vllm project, promoting code reuse and simplifying maintenance. Additionally, it introduces support for W4A16 quantization, a technique that can lead to improved performance and reduced memory usage on Ascend NPUs. The changes include modifications to fused MoE operations and layernorm, as well as the addition of new files for the W4A16 scheme implementation and kernel definitions.
Activity
Code Review
Suggested PR Title:
[Quantization][Feature] Refactor compressed-tensors to reuse upstream and add W4A16 support

Suggested PR Summary:
### What this PR does / why we need it?
This pull request refactors the `compressed-tensors` quantization implementation to better align with and reuse the upstream `vllm` implementation. By monkey-patching the upstream quantization methods, we can leverage the generic logic while providing Ascend-specific kernels and weight processing.
Key changes include:
- **Reusing Upstream Implementation**: Instead of replacing the `compressed-tensors` quantization method, this PR patches the upstream `CompressedTensorsWNA16MarlinMoEMethod` and mixed-precision kernel selection to inject Ascend-specific logic. This improves maintainability and compatibility with future upstream changes.
- **W4A16 Support**: Adds support for W4A16 quantization for both standard linear layers and FusedMoE layers on Ascend NPUs.
- `AscendwNa16LinearKernel` is introduced for linear layers, which handles weight transformation (unpack, transpose, repack for NPU) and uses `npu_weight_quant_batchmatmul` for computation.
- `AscendW4A16FusedMoEMethod` is added for MoE layers, which performs the necessary weight layout conversion from the Marlin format to the NPU-optimized format.
- **Robustness Improvements**: Adds a `hasattr` check in `RMSNorm` to prevent crashes when `quant_config` does not have a `quant_description` attribute.
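The `hasattr` guard mentioned above can be sketched in isolation; this is a hypothetical illustration of the pattern only — the class and attribute names are taken from the review text, not from the actual diff:

```python
class DummyQuantConfig:
    """Stand-in for a quant_config that lacks `quant_description`."""
    pass

def get_quant_description(quant_config):
    # Guard against configs (e.g. upstream compressed-tensors configs)
    # that do not carry an Ascend-specific `quant_description` attribute,
    # so RMSNorm-style callers degrade gracefully instead of crashing.
    if quant_config is not None and hasattr(quant_config, "quant_description"):
        return quant_config.quant_description
    return None

print(get_quant_description(DummyQuantConfig()))  # -> None, not AttributeError
```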
### Does this PR introduce _any_ user-facing change?
Yes, this PR introduces support for `w4a16` quantization via the `compressed-tensors` method on Ascend hardware. Users can now leverage this quantization scheme for models that support it.
### How was this patch tested?
CI should pass. The changes were tested with models using W4A16 quantization to ensure correctness and performance.

This PR refactors the compressed-tensors quantization to reuse upstream implementations and adds W4A16 support. The changes are mostly well-structured, but I've found a critical bug in the weight transformation logic that needs to be addressed. I've also pointed out some areas for improvement regarding code duplication and commented-out code.
unpacked_w13_weight = (
    unpack_from_int32(
        layer.w13_weight_packed.data.flatten(0, 1),
        torch.Size([w13_shape[0] * w13_shape[1], w13_shape[2] * pack_factor]),
        num_bits,
    )
    .view(w13_shape[0], w13_shape[1], -1)
    .transpose(1, 2)
    .contiguous()
    .int()
)
unpacked_w2_weight = (
    unpack_from_int32(
        layer.w2_weight_packed.data.flatten(0, 1),
        torch.Size([w2_shape[0] * w2_shape[1], w2_shape[2] * pack_factor]),
        num_bits,
    )
    .view(w2_shape[0], w2_shape[1], -1)
    .transpose(1, 2)
    .contiguous()
    .int()
)
The .transpose(1, 2) calls for both unpacked_w13_weight (line 298) and unpacked_w2_weight (line 309) are incorrect. The weight transformation logic should produce tensors with shapes [e, 2*n, k] for w13 and [e, k, n] for w2 before packing for the NPU. The current implementation incorrectly transposes them to [e, k, 2*n] and [e, n, k] respectively, which will lead to incorrect model outputs. Please remove both .transpose(1, 2) calls.
unpacked_w13_weight = (
unpack_from_int32(
layer.w13_weight_packed.data.flatten(0, 1),
torch.Size([w13_shape[0] * w13_shape[1], w13_shape[2] * pack_factor]),
num_bits,
)
.view(w13_shape[0], w13_shape[1], -1)
.contiguous()
.int()
)
unpacked_w2_weight = (
unpack_from_int32(
layer.w2_weight_packed.data.flatten(0, 1),
torch.Size([w2_shape[0] * w2_shape[1], w2_shape[2] * pack_factor]),
num_bits,
)
.view(w2_shape[0], w2_shape[1], -1)
.contiguous()
.int()
)
def unpack_from_int32(
    weight: torch.Tensor,
    shape: torch.Size,
    num_bits: int,
    packed_dim: int = 1,
) -> torch.Tensor:
    """Unpacks quantized weights from int32 format back to original bits.

    :param weight: The packed int32 tensor containing quantized weights
    :param shape: Original shape to restore, defaults to None
    :param num_bits: The number of bits used for quantization (<= 8)
    :param packed_dim: Dimension along which weights are packed (0 or 1), defaults to 1
    :return: Unpacked tensor with int8 dtype after applying offset correction
    """
    assert weight.dtype == torch.int32, f"Expecting `weight.dtype` is torch.int32 but got {weight.dtype}."
    assert num_bits <= 8, f"Expecting `num_bits` should not be larger than 8 but got {num_bits}."

    pack_factor = 32 // num_bits
    mask = (1 << num_bits) - 1

    if packed_dim == 1:
        unpacked_weight = torch.zeros(
            (weight.shape[0], weight.shape[1] * pack_factor),
            device=weight.device,
            dtype=torch.int32,
        )
        for i in range(pack_factor):
            unpacked_weight[:, i::pack_factor] = (weight >> (num_bits * i)) & mask
        original_row_size = int(shape[1])
        unpacked_weight = unpacked_weight[:, :original_row_size]
    else:
        unpacked_weight = torch.zeros(
            (weight.shape[0] * pack_factor, weight.shape[1]),
            device=weight.device,
            dtype=torch.int32,
        )
        for i in range(pack_factor):
            unpacked_weight[i::pack_factor, :] = (weight >> (num_bits * i)) & mask
        original_row_size = int(shape[0])
        unpacked_weight = unpacked_weight[:original_row_size, :]

    offset = pow(2, num_bits) // 2
    unpacked_weight = (unpacked_weight - offset).to(torch.int8)

    return unpacked_weight
The function unpack_from_int32 is a duplicate of the one defined in vllm_ascend/quantization/compressed_tensors/schemes/wNa16.py. To avoid code duplication and improve maintainability, this function should be moved to a shared utility module (e.g., in a new vllm_ascend/quantization/utils.py file) and imported in both places where it's used.
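The bit layout that `unpack_from_int32` decodes can be illustrated in plain Python. This is a sketch of the packing convention only (group size, bit offsets, zero-point offset for the `packed_dim == 1` path), not vllm-ascend code:

```python
NUM_BITS = 4
PACK_FACTOR = 32 // NUM_BITS   # 8 int4 values per int32 word
OFFSET = 2 ** NUM_BITS // 2    # zero-point offset, 8 for int4
MASK = (1 << NUM_BITS) - 1

def pack_group(values):
    """Pack PACK_FACTOR signed int4 values into one int32-sized word.

    Value i of the group sits at bit offset NUM_BITS * i, stored with a
    +OFFSET shift so signed values fit into unsigned nibbles.
    """
    assert len(values) == PACK_FACTOR
    word = 0
    for i, v in enumerate(values):
        word |= ((v + OFFSET) & MASK) << (NUM_BITS * i)
    return word

def unpack_group(word):
    """Inverse of pack_group, mirroring the shift-and-mask loop
    in unpack_from_int32 (including the offset correction)."""
    return [((word >> (NUM_BITS * i)) & MASK) - OFFSET
            for i in range(PACK_FACTOR)]

vals = [-8, -3, 0, 1, 2, 5, 7, -1]
assert unpack_group(pack_group(vals)) == vals
```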
b0aa633 to 25c19ad
This pull request has conflicts, please resolve those before we can evaluate the pull request.
53a6b69 to 794b9ab
/gemini review
Code Review
This pull request refactors the compressed-tensors quantization implementation to better align with and reuse upstream vLLM logic. This is a significant architectural improvement, removing the custom AscendCompressedTensorsConfig in favor of patching upstream methods and kernels for Ascend-specific backends. The changes also introduce support for W4A16 quantization. The refactoring is extensive but appears consistent and well-executed. I have identified one high-severity issue concerning device hardcoding which could impact portability and testing environments.
scale = scale.transpose(1, 2).to(torch.float32).contiguous()
scale_np = scale.cpu().numpy()
scale_np.dtype = np.uint32
scale_uint64_tensor = torch.from_numpy(scale_np.astype(np.int64)).npu()
The use of .npu() hardcodes the device to NPU. It's better practice to use .to(scale.device) to respect the original device of the tensor. This improves code portability and robustness, especially for environments where testing might occur on different device types like CPU.
scale_uint64_tensor = torch.from_numpy(scale_np.astype(np.int64)).to(scale.device)
…reuse upstream implement. And add w4a16 support. Signed-off-by: menogrey <1299267905@qq.com>
17b1113 to 75e9cd1
w4a8.py should align with vLLM's file naming convention, rename it to compressed_tensors_w4a8_int.py
w8a8.py should align with vLLM's file naming convention, rename it to compressed_tensors_w8a8_int8.py
wNa16.py should align with vLLM's file naming convention, rename it to compressed_tensors_wNa16.py
# Transpose scales and offsets: [n, num_groups] -> [num_groups, n]
scale.data = scale.data.transpose(0, 1).contiguous()

def apply_weights(
Since apply_weights is moved to the common kernels module, modelslim quantized linear should also reuse it. The apply implementation under modelslim should be removed.
return True, None

def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
process_weights_after_loading and related functions are tied to on-disk weight formats and may not be reusable across different quantization tools. Keep the kernels module focused only on generic quantization forward implementations (activation quantization + quant matmul).
The methods module only serves modelslim. Rename the folder to modelslim and move different quantization schemes under modelslim/schemes.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
please rebase and fix the merge conflict if this PR is still needed.
What this PR does / why we need it?
This PR refactors vllm-ascend compressed-tensors quantization to reuse upstream vLLM implementations instead of maintaining a large custom config path. It removes the legacy AscendCompressedTensorsConfig flow, adds worker-time quantization patching, and registers Ascend NPU kernels for mixed-precision and scaled-mm paths. It also introduces Ascend-specific compressed-tensors scheme implementations for MoE W8A8, W4A8, and W4A16 (Marlin) and hooks them into upstream compressed-tensors methods via monkey patches.
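The monkey-patching approach described here can be sketched generically. All class and kernel names below are placeholders, not the actual vLLM or vllm-ascend symbols — the real targets are the upstream compressed-tensors method classes listed in the diff:

```python
class UpstreamWNA16MoEMethod:
    """Stand-in for an upstream quantization method class."""

    def select_kernel(self) -> str:
        return "marlin"  # upstream default, GPU-oriented

    def apply(self) -> str:
        # Generic upstream logic we want to keep reusing unchanged.
        return f"apply with {self.select_kernel()}"

def ascend_select_kernel(self) -> str:
    # Ascend-specific kernel selection (kernel name is hypothetical).
    return "npu_wna16"

# Patch at plugin-import (worker) time: only the kernel selection is
# replaced; the upstream apply() logic is reused as-is, so future
# upstream changes to apply() are picked up automatically.
UpstreamWNA16MoEMethod.select_kernel = ascend_select_kernel

print(UpstreamWNA16MoEMethod().apply())  # -> "apply with npu_wna16"
```

The design trade-off is that the patch only overrides the narrow hardware-specific hook, so the bulk of the upstream code path stays shared.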
Highlights
The implementation refers to the corresponding hardware plugin implementation: https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_compressed_tensors.py
Refer to #6953.
Does this PR introduce any user-facing change?
How was this patch tested?