
[Quantization] Refactor compressed-tensors quantization implementation to reuse the upstream implementation, and add W4A16 support #6644

Open
menogrey wants to merge 8 commits into vllm-project:main from menogrey:refactor_compressed_tensor

Conversation

@menogrey
Collaborator

@menogrey menogrey commented Feb 9, 2026

What this PR does / why we need it?

This PR refactors vllm-ascend compressed-tensors quantization to reuse upstream vLLM implementations instead of maintaining a large custom config path. It removes the legacy AscendCompressedTensorsConfig flow, adds worker-time quantization patching, and registers Ascend NPU kernels for mixed-precision and scaled-mm paths. It also introduces Ascend-specific compressed-tensors scheme implementations for MoE W8A8, W4A8, and W4A16 (Marlin) and hooks them into upstream compressed-tensors methods via monkey patches.
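The worker-time patching described above follows the usual import-and-replace pattern: a patch module, imported once when the worker starts, rebinds upstream methods to Ascend-specific ones. This is a generic sketch of the mechanism only; all names are illustrative, not the actual vLLM or vllm-ascend symbols:

```python
# Generic monkey-patch sketch: replace an upstream method with a
# hardware-specific one at worker import time. The class and function
# names here are illustrative, not the real vLLM classes.
class UpstreamScheme:
    def apply(self, x):
        return "generic path"

def ascend_apply(self, x):
    # Hypothetical Ascend-specific compute path.
    return "ascend path"

# The patch module, imported once at worker startup, rebinds the method;
# every existing and future instance then dispatches to the NPU path.
UpstreamScheme.apply = ascend_apply

print(UpstreamScheme().apply(None))  # ascend path
```

Because the rebinding happens on the class (not on instances), it applies uniformly without touching upstream call sites, which is what keeps the plugin compatible with future upstream changes.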

Highlights

  • Quantization Refactor: Refactors the compressed-tensors quantization implementation to reuse the upstream implementation, aiming for better maintainability and consistency.
  • W4A16 Linear Support: Adds support for W4A16 quantization, enabling the use of 4-bit weights and 16-bit activations, potentially improving performance and reducing memory footprint on Ascend.
  • Ascend Linear Kernel: Extracts a linear quantization kernel modeled on the upstream implementation, so it can be reused further.

The implementation follows the corresponding hardware-plugin implementation: https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_compressed_tensors.py

See also #6953.

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions
Contributor

github-actions Bot commented Feb 9, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello @menogrey, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the quantization capabilities within the vllm-ascend project. It refactors the existing compressed-tensors quantization implementation to align with the upstream vllm project, promoting code reuse and simplifying maintenance. Additionally, it introduces support for W4A16 quantization, a technique that can lead to improved performance and reduced memory usage on Ascend NPU. The changes include modifications to fused MoE operations and layernorm, as well as the addition of new files for W4A16 scheme implementation and kernel definitions.

Highlights

  • Quantization Refactor: Refactors the compressed-tensors quantization implementation to reuse the upstream implementation, aiming for better maintainability and consistency.
  • W4A16 Support: Adds support for W4A16 quantization, enabling the use of 4-bit weights and 16-bit activations, potentially improving performance and reducing memory footprint on Ascend.
  • Weight Processing: Introduces weight unpacking and packing functions optimized for Ascend NPU, ensuring compatibility with the hardware's expected data layout.


Changelog
  • vllm_ascend/ops/fused_moe/fused_moe.py
    • Adds CompressedTensorsWNA16MarlinMoEMethod to the quant_method classname check.
  • vllm_ascend/ops/layernorm.py
    • Adds a check to ensure that quant_description exists before accessing it.
  • vllm_ascend/patch/worker/__init__.py
    • Imports the patch_quantization module.
  • vllm_ascend/patch/worker/patch_quantization.py
    • Introduces a new module to patch quantization-related functionalities, including W4A16 support.
  • vllm_ascend/quantization/compressed_tensors/__init__.py
    • Adds an empty init file.
  • vllm_ascend/quantization/compressed_tensors/schemes/wNa16.py
    • Implements the W4A16 quantization scheme, including weight packing and unpacking functions.
  • vllm_ascend/quantization/compressed_tensors_config.py
    • Removes original compressed tensors method and registration.
  • vllm_ascend/quantization/kernels/__init__.py
    • Adds an empty init file.
  • vllm_ascend/quantization/kernels/mixed_precision/npu.py
    • Adds a new module for NPU-specific mixed-precision kernels, including weight processing and application.
Activity
  • The PR introduces several new files related to quantization, particularly for W4A16 support.
  • Existing files like fused_moe.py and layernorm.py are modified to accommodate the new quantization methods.
  • The patch system is extended to include quantization-related patches.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

Suggested PR Title:

[Quantization][Feature] Refactor compressed-tensors to reuse upstream and add W4A16 support

Suggested PR Summary:

### What this PR does / why we need it?
This pull request refactors the `compressed-tensors` quantization implementation to better align with and reuse the upstream `vllm` implementation. By monkey-patching the upstream quantization methods, we can leverage the generic logic while providing Ascend-specific kernels and weight processing.

Key changes include:
- **Reusing Upstream Implementation**: Instead of replacing the `compressed-tensors` quantization method, this PR patches the upstream `CompressedTensorsWNA16MarlinMoEMethod` and mixed-precision kernel selection to inject Ascend-specific logic. This improves maintainability and compatibility with future upstream changes.
- **W4A16 Support**: Adds support for W4A16 quantization for both standard linear layers and FusedMoE layers on Ascend NPUs.
  - `AscendwNa16LinearKernel` is introduced for linear layers, which handles weight transformation (unpack, transpose, repack for NPU) and uses `npu_weight_quant_batchmatmul` for computation.
  - `AscendW4A16FusedMoEMethod` is added for MoE layers, which performs the necessary weight layout conversion from the Marlin format to the NPU-optimized format.
- **Robustness Improvements**: Adds a `hasattr` check in `RMSNorm` to prevent crashes when `quant_config` does not have a `quant_description` attribute.

### Does this PR introduce _any_ user-facing change?
Yes, this PR introduces support for `w4a16` quantization via the `compressed-tensors` method on Ascend hardware. Users can now leverage this quantization scheme for models that support it.

### How was this patch tested?
CI should pass. The changes were tested with models using W4A16 quantization to ensure correctness and performance.

This PR refactors the compressed-tensors quantization to reuse upstream implementations and adds W4A16 support. The changes are mostly well-structured, but I've found a critical bug in the weight transformation logic that needs to be addressed. I've also pointed out some areas for improvement regarding code duplication and commented-out code.

Comment on lines +291 to +312
```python
unpacked_w13_weight = (
    unpack_from_int32(
        layer.w13_weight_packed.data.flatten(0, 1),
        torch.Size([w13_shape[0] * w13_shape[1], w13_shape[2] * pack_factor]),
        num_bits,
    )
    .view(w13_shape[0], w13_shape[1], -1)
    .transpose(1, 2)
    .contiguous()
    .int()
)
unpacked_w2_weight = (
    unpack_from_int32(
        layer.w2_weight_packed.data.flatten(0, 1),
        torch.Size([w2_shape[0] * w2_shape[1], w2_shape[2] * pack_factor]),
        num_bits,
    )
    .view(w2_shape[0], w2_shape[1], -1)
    .transpose(1, 2)
    .contiguous()
    .int()
)
```
Contributor


critical

The .transpose(1, 2) calls for both unpacked_w13_weight (line 298) and unpacked_w2_weight (line 309) are incorrect. The weight transformation logic should produce tensors with shapes [e, 2*n, k] for w13 and [e, k, n] for w2 before packing for the NPU. The current implementation incorrectly transposes them to [e, k, 2*n] and [e, n, k] respectively, which will lead to incorrect model outputs. Please remove both .transpose(1, 2) calls.

```python
unpacked_w13_weight = (
    unpack_from_int32(
        layer.w13_weight_packed.data.flatten(0, 1),
        torch.Size([w13_shape[0] * w13_shape[1], w13_shape[2] * pack_factor]),
        num_bits,
    )
    .view(w13_shape[0], w13_shape[1], -1)
    .contiguous()
    .int()
)
unpacked_w2_weight = (
    unpack_from_int32(
        layer.w2_weight_packed.data.flatten(0, 1),
        torch.Size([w2_shape[0] * w2_shape[1], w2_shape[2] * pack_factor]),
        num_bits,
    )
    .view(w2_shape[0], w2_shape[1], -1)
    .contiguous()
    .int()
)
```
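The shape argument of this review comment can be sanity-checked with a small numpy sketch (the dimensions e=2, n=3, k=4 are made up for illustration): after the view, w13 already has the [e, 2n, k] layout, and an extra transpose would flip the last two axes to [e, k, 2n]:

```python
import numpy as np

# Illustrative dimensions only: e experts, n output features, k input features.
e, n, k = 2, 3, 4
w13 = np.zeros((e, 2 * n, k))

print(w13.shape)                     # (2, 6, 4): the [e, 2n, k] layout expected before NPU packing
print(w13.transpose(0, 2, 1).shape)  # (2, 4, 6): the [e, k, 2n] layout an extra transpose would produce
```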

```python
            mc2_mask=kwargs.get("mc2_mask"),
        )

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
```
Contributor


high

The function process_weights_after_loading contains large blocks of commented-out code (e.g., lines 177-281 and 329-332). This reduces readability and maintainability. Please remove the unused, commented-out code blocks.

Comment on lines +10 to +54
```python
def unpack_from_int32(
    weight: torch.Tensor,
    shape: torch.Size,
    num_bits: int,
    packed_dim: int = 1,
) -> torch.Tensor:
    """Unpacks quantized weights from int32 format back to original bits.

    :param weight: The packed int32 tensor containing quantized weights
    :param shape: Original shape to restore
    :param num_bits: The number of bits used for quantization (<= 8)
    :param packed_dim: Dimension along which weights are packed (0 or 1), defaults to 1
    :return: Unpacked tensor with int8 dtype after applying offset correction
    """
    assert weight.dtype == torch.int32, f"Expecting `weight.dtype` is torch.int32 but got {weight.dtype}."
    assert num_bits <= 8, f"Expecting `num_bits` should not be larger than 8 but got {num_bits}."

    pack_factor = 32 // num_bits
    mask = (1 << num_bits) - 1

    if packed_dim == 1:
        unpacked_weight = torch.zeros(
            (weight.shape[0], weight.shape[1] * pack_factor),
            device=weight.device,
            dtype=torch.int32,
        )
        for i in range(pack_factor):
            unpacked_weight[:, i::pack_factor] = (weight >> (num_bits * i)) & mask
        original_row_size = int(shape[1])
        unpacked_weight = unpacked_weight[:, :original_row_size]
    else:
        unpacked_weight = torch.zeros(
            (weight.shape[0] * pack_factor, weight.shape[1]),
            device=weight.device,
            dtype=torch.int32,
        )
        for i in range(pack_factor):
            unpacked_weight[i::pack_factor, :] = (weight >> (num_bits * i)) & mask
        original_row_size = int(shape[0])
        unpacked_weight = unpacked_weight[:original_row_size, :]

    offset = pow(2, num_bits) // 2
    unpacked_weight = (unpacked_weight - offset).to(torch.int8)

    return unpacked_weight
```
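The shift-and-mask scheme used by this function can be illustrated with a self-contained round trip in plain Python integers (no torch). This sketch packs eight unsigned 4-bit codes into one 32-bit word, then recovers the signed values with the same mask-and-offset logic:

```python
# Round-trip sketch of the shift-and-mask packing that unpack_from_int32 reverses.
# With num_bits = 4, eight values fit in one 32-bit word (pack_factor = 8).
num_bits = 4
pack_factor = 32 // num_bits          # 8 values per int32 word
mask = (1 << num_bits) - 1            # 0b1111
offset = (1 << num_bits) // 2         # 8: maps unsigned [0, 15] to signed [-8, 7]

values = [0, 1, 2, 7, 8, 13, 14, 15]  # unsigned 4-bit codes

# Pack: code i occupies bit positions [4*i, 4*i + 4) of the word.
word = 0
for i, v in enumerate(values):
    word |= (v & mask) << (num_bits * i)

# Unpack: shift, mask, then subtract the offset, as in unpack_from_int32.
unpacked = [((word >> (num_bits * i)) & mask) - offset for i in range(pack_factor)]
print(unpacked)  # [-8, -7, -6, -1, 0, 5, 6, 7]
```

The offset subtraction is why the function returns int8: after recentering, every 4-bit code lands in the signed range [-8, 7].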
Contributor


high

The function unpack_from_int32 is a duplicate of the one defined in vllm_ascend/quantization/compressed_tensors/schemes/wNa16.py. To avoid code duplication and improve maintainability, this function should be moved to a shared utility module (e.g., in a new vllm_ascend/quantization/utils.py file) and imported in both places where it's used.

@menogrey menogrey force-pushed the refactor_compressed_tensor branch 2 times, most recently from b0aa633 to 25c19ad Compare February 10, 2026 02:44
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@menogrey menogrey force-pushed the refactor_compressed_tensor branch from 53a6b69 to 794b9ab Compare February 27, 2026 09:56
@menogrey menogrey added the `ready` (read for review) and `ready-for-test` (start test by label for PR) labels Feb 27, 2026
@menogrey
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the compressed-tensors quantization implementation to better align with and reuse upstream vLLM logic. This is a significant architectural improvement, removing the custom AscendCompressedTensorsConfig in favor of patching upstream methods and kernels for Ascend-specific backends. The changes also introduce support for W4A16 quantization. The refactoring is extensive but appears consistent and well-executed. I have identified one high-severity issue concerning device hardcoding which could impact portability and testing environments.

```python
scale = scale.transpose(1, 2).to(torch.float32).contiguous()
scale_np = scale.cpu().numpy()
scale_np.dtype = np.uint32
scale_uint64_tensor = torch.from_numpy(scale_np.astype(np.int64)).npu()
```
Contributor


high

The use of .npu() hardcodes the device to NPU. It's better practice to use .to(scale.device) to respect the original device of the tensor. This improves code portability and robustness, especially for environments where testing might occur on different device types like CPU.

Suggested change

```diff
-scale_uint64_tensor = torch.from_numpy(scale_np.astype(np.int64)).npu()
+scale_uint64_tensor = torch.from_numpy(scale_np.astype(np.int64)).to(scale.device)
```
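The conversion under review reinterprets the raw float32 bits of the scales as uint32 and widens them to int64; the bit pattern is preserved, only the type tag changes. A minimal numpy sketch of that reinterpretation (using `.view(...)`, which is equivalent to the in-place `scale_np.dtype = np.uint32` assignment in the snippet above):

```python
import numpy as np

scale_np = np.array([1.0, -2.0], dtype=np.float32)

# Reinterpret the raw IEEE-754 bits as uint32 without copying, then widen to int64.
bits = scale_np.view(np.uint32)
widened = bits.astype(np.int64)

print(widened.tolist())  # [1065353216, 3221225472]: 0x3F800000 for 1.0f, 0xC0000000 for -2.0f
```

Whatever the reinterpretation path, the final `.to(scale.device)` keeps the result on the tensor's original device instead of hardcoding NPU placement.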

menogrey added 8 commits March 3, 2026 03:16
…reuse upstream implement. And add w4a16 support.

Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
@menogrey menogrey force-pushed the refactor_compressed_tensor branch from 17b1113 to 75e9cd1 Compare March 3, 2026 03:16
Contributor


w4a8.py should align with vLLM's file naming convention, rename it to compressed_tensors_w4a8_int.py

Contributor


w8a8.py should align with vLLM's file naming convention, rename it to compressed_tensors_w8a8_int8.py

Contributor


wNa16.py should align with vLLM's file naming convention, rename it to compressed_tensors_wNa16.py

```python
# Transpose scales and offsets: [n, num_groups] -> [num_groups, n]
scale.data = scale.data.transpose(0, 1).contiguous()

def apply_weights(
```
Contributor


Since apply_weights is moved to the common kernels module, modelslim quantized linear should also reuse it. The apply implementation under modelslim should be removed.


```python
    return True, None

def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
```
Contributor


process_weights_after_loading and related functions are tied to on-disk weight formats and may not be reusable across different quantization tools. Keep the kernels module focused only on generic quantization forward implementations (activation quantization + quant matmul).

Contributor


The methods module only serves modelslim. Rename the folder to modelslim and move different quantization schemes under modelslim/schemes.

@github-actions
Contributor

github-actions Bot commented Mar 4, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@wangxiyuan
Collaborator

please rebase and fix the merge conflict if this PR is still needed.

