Add AWQ quantization support for NPU. #10158

Merged
merrymercy merged 40 commits into sgl-project:main from kvcache-ai:awq
Oct 23, 2025
Conversation

@ErvinXie (Contributor) commented on Sep 8, 2025

Motivation

This PR follows #9104 and Roadmap of NPU support #8004.

Modifications

We mainly modified python/sglang/srt/layers/quantization/awq.py, adding AWQLinearAscendMethod and AWQMoEAscendMethod to support AWQ quantization on Ascend NPU.

Accuracy and Benchmark Tests

python3 -m sglang.launch_server --model-path /data/models/DeepSeek-AWQ --tp 8 --device npu --attention-backend ascend --port 8001 --disable-radix-cache --quantization awq
python ./python/sglang/test/few_shot_gsm8k.py 
Accuracy: 0.975
Invalid: 0.000
Latency: 389.043 s
Output throughput: 50.036 token/s

Checklist

@gemini-code-assist (Bot) left a comment


Summary of Changes

Hello @ErvinXie, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request extends the SGLang framework to support AWQ quantization on NPU devices, specifically targeting Ascend hardware. It introduces new NPU-specific quantization methods for linear and MoE layers, along with necessary adjustments to the weight loading and dequantization processes to leverage NPU capabilities. The changes aim to improve inference efficiency and performance on NPU-enabled systems.

Highlights

  • NPU AWQ Quantization Support: Introduced dedicated AWQ (Activation-aware Weight Quantization) support for NPU (Neural Processing Unit) devices, enabling efficient quantized model inference on NPU hardware.
  • New Quantization Methods: Added AWQLinearAscendMethod and AWQMoEAscendMethod classes to handle linear and Mixture-of-Experts (MoE) layers specifically for Ascend NPU, including NPU-optimized weight processing and application logic.
  • NPU-Specific Dequantization: Implemented awq_dequantize_decomposition for NPU, providing a specialized method for dequantizing weights on NPU devices.
  • Memory Management for NPU: Integrated torch.npu.empty_cache() calls in the model loading and utility functions to optimize memory usage on NPU devices.
  • Benchmark Results: Provided benchmark results demonstrating the accuracy and throughput of AWQ quantization on NPU for a DeepSeek-AWQ model.
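Independent of the PR's actual awq_dequantize_decomposition implementation, AWQ dequantization follows a simple group-wise formula: each 4-bit weight value is offset by its group's zero point and scaled by its group's scale. A minimal pure-Python sketch of that formula (illustrative only — the function name, shapes, and the already-unpacked-nibble representation are assumptions, not the PR's code):

```python
def awq_dequantize(qweight, qzeros, scales, group_size):
    # qweight: [K][N] unpacked 4-bit values (0..15), as plain ints.
    # qzeros/scales: [K // group_size][N] per-group zero points and scales.
    K, N = len(qweight), len(qweight[0])
    out = [[0.0] * N for _ in range(K)]
    for k in range(K):
        g = k // group_size  # group index for this row
        for n in range(N):
            # Dequantized weight = (quantized - zero_point) * scale
            out[k][n] = (qweight[k][n] - qzeros[g][n]) * scales[g][n]
    return out
```

In practice this is done with vectorized tensor ops (and, on Ascend, with NPU kernels), but the arithmetic is the same.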

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces support for AWQ quantization on Ascend NPUs. The changes are mostly concentrated in python/sglang/srt/layers/quantization/awq.py, where new classes AWQLinearAscendMethod and AWQMoEAscendMethod are added to handle NPU-specific logic. While the overall approach is sound, I've identified a critical issue with an incorrect return type that would cause a runtime error, as well as some opportunities to improve code clarity and correctness in class initialization. My review includes suggestions to address these points.

Comment on lines +999 to +1025
    def apply(
        self,
        layer: torch.nn.Module,
        dispatch_output: StandardDispatchOutput,
    ) -> torch.Tensor:
        assert (
            self.moe_runner_config.activation == "silu"
        ), "Only SiLU activation is supported."

        x = dispatch_output.hidden_states
        topk_output = dispatch_output.topk_output

        topk_weights, topk_ids, _ = topk_output
        topk_ids = topk_ids.to(torch.int32)
        topk_weights = topk_weights.to(x.dtype)
        return npu_fused_experts(
            hidden_states=x,
            w13=layer.w13_qweight,
            w13_scale=layer.w13_scales,
            w13_offset=layer.w13_qzeros,
            w2=layer.w2_qweight,
            w2_scale=layer.w2_scales,
            w2_offset=layer.w2_qzeros,
            topk_weights=topk_weights,
            topk_ids=topk_ids,
            top_k=topk_ids.shape[1],
        )


critical

The apply method in AWQMoEAscendMethod is declared to return a torch.Tensor, but its parent class AWQMoEMethod and the base class FusedMoEMethodBase specify a return type of CombineInput. The caller, FusedMoE.forward, expects an object with a hidden_states attribute, not a raw tensor. Returning a tensor directly will lead to a runtime AttributeError.

The return value should be wrapped in a StandardCombineInput object to conform to the expected interface, and the return type hint should be corrected.

    def apply(
        self,
        layer: torch.nn.Module,
        dispatch_output: StandardDispatchOutput,
    ) -> "CombineInput":
        from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput

        assert (
            self.moe_runner_config.activation == "silu"
        ), "Only SiLU activation is supported."

        x = dispatch_output.hidden_states
        topk_output = dispatch_output.topk_output

        topk_weights, topk_ids, _ = topk_output
        topk_ids = topk_ids.to(torch.int32)
        topk_weights = topk_weights.to(x.dtype)
        output = npu_fused_experts(
            hidden_states=x,
            w13=layer.w13_qweight,
            w13_scale=layer.w13_scales,
            w13_offset=layer.w13_qzeros,
            w2=layer.w2_qweight,
            w2_scale=layer.w2_scales,
            w2_offset=layer.w2_qzeros,
            topk_weights=topk_weights,
            topk_ids=topk_ids,
            top_k=topk_ids.shape[1],
        )
        return StandardCombineInput(hidden_states=output)

Comment on lines +941 to +942
    def __init__(self, quant_config: AWQConfig):
        self.quant_config = quant_config


high

The __init__ method of AWQMoEAscendMethod does not initialize its parent class AWQMoEMethod. This can lead to an improperly initialized object, as attributes set in the parent's __init__ (like self.quant_type) will be missing. While AWQMoEAscendMethod is specific to Ascend and AWQMoEMethod is for Marlin, inheriting methods like create_weights implies a need for proper parent initialization.

Given that AWQMoEAscendMethod is instantiated with an AWQConfig and not an AWQMarlinConfig, a direct super().__init__() call would cause a type error. A better approach would be to replicate the necessary initialization logic from the parent.

Suggested change
    def __init__(self, quant_config: AWQConfig):
        self.quant_config = quant_config

    def __init__(self, quant_config: AWQConfig):
        self.quant_config = quant_config
        if self.quant_config.weight_bits != 4:
            raise ValueError(f"{type(self).__name__} only supports 4bit now.")
        self.quant_type = scalar_types.uint4

Comment on lines +610 to +612
            qweight_tmp.bitwise_or_(
                ((layer.qweight.data >> shift_num) * (2 ** (4 * i))) & (0xF << (4 * i))
            )

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The bitwise operation used for repacking weights is functionally correct but unnecessarily complex and hard to read. Using (2 ** (4 * i)) for left-shifting and then masking can be simplified. A more direct and readable approach is to first mask the desired nibble with & 0xF and then shift it to its new position. This improves code clarity and maintainability.

Suggested change
            qweight_tmp.bitwise_or_(
                ((layer.qweight.data >> shift_num) * (2 ** (4 * i))) & (0xF << (4 * i))
            )

            qweight_tmp.bitwise_or_(
                (((layer.qweight.data >> shift_num) & 0xF) << (4 * i))
            )
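Both forms extract nibble i of the shifted word; a quick property check (a standalone sketch with plain Python ints, not the PR's tensor code) confirms the suggested rewrite is bit-for-bit equivalent:

```python
def pack_old(word, shift_num, i):
    # Original form: multiply-as-left-shift, then mask in place.
    return ((word >> shift_num) * (2 ** (4 * i))) & (0xF << (4 * i))

def pack_new(word, shift_num, i):
    # Suggested form: mask the low nibble first, then shift it into place.
    return ((word >> shift_num) & 0xF) << (4 * i)

# Spot-check over sample 32-bit words, shift amounts, and nibble positions.
for word in (0x00000000, 0xFFFFFFFF, 0x76543210, 0xDEADBEEF):
    for shift_num in range(0, 32, 4):
        for i in range(8):
            assert pack_old(word, shift_num, i) == pack_new(word, shift_num, i)
```

The equivalence holds because multiplying by 2**(4*i) is a left shift by 4*i bits, so masking before or after the shift selects the same nibble.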

Comment on lines +958 to +965
            w13_qweight_tmp.bitwise_or_(
                ((layer.w13_qweight.data >> shift_num) * (2 ** (4 * i)))
                & (0xF << (4 * i))
            )
            w2_qweight_tmp.bitwise_or_(
                ((layer.w2_qweight.data >> shift_num) * (2 ** (4 * i)))
                & (0xF << (4 * i))
            )


medium

Similar to AWQLinearAscendMethod, the bitwise operation here for repacking weights is unnecessarily complex. Using * (2 ** (4 * i)) for left-shifting is less clear than using the left-shift operator << after masking the desired nibble. Simplifying this expression will improve code readability and maintainability.

            w13_qweight_tmp.bitwise_or_(
                (((layer.w13_qweight.data >> shift_num) & 0xF) << (4 * i))
            )
            w2_qweight_tmp.bitwise_or_(
                (((layer.w2_qweight.data >> shift_num) & 0xF) << (4 * i))
            )
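These loops implement a nibble reorder over packed int32 words. In isolation, the pattern can be sketched as a generic repack helper (illustrative only — the PR's actual shift_num schedule and nibble ordering are not shown here, so the order argument is a placeholder):

```python
def repack_nibbles(word, order):
    # Reorder the eight 4-bit fields of one packed 32-bit word.
    # order[i] = index (in the source word) of the nibble that should
    # land in position i of the repacked word.
    out = 0
    for i in range(8):
        shift_num = 4 * order[i]
        out |= ((word >> shift_num) & 0xF) << (4 * i)
    return out

# The identity order leaves the word unchanged.
assert repack_nibbles(0x76543210, list(range(8))) == 0x76543210
```

The real kernel does this element-wise over whole weight tensors with bitwise_or_, but each 32-bit word is transformed exactly as above.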

@OrangeRedeng (Contributor)

Why was radix-cache disabled during the test?

@Alisehen (Contributor)

> Why was radix-cache disabled during the test?

We used the same command as #9355. Using radix-cache is also fine.

Outdated comment thread on python/sglang/srt/layers/quantization/awq.py
@ErvinXie ErvinXie requested a review from FlamingoPg as a code owner October 20, 2025 03:38


8 participants