Add AWQ quantization support for NPU. #10158
Conversation
Summary of Changes
Hello @ErvinXie, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request extends the SGLang framework to support AWQ quantization on NPU devices, specifically targeting Ascend hardware. It introduces new NPU-specific quantization methods for linear and MoE layers, along with necessary adjustments to the weight loading and dequantization processes to leverage NPU capabilities. The changes aim to improve inference efficiency and performance on NPU-enabled systems.
Highlights
- NPU AWQ Quantization Support: Introduced dedicated AWQ (Activation-aware Weight Quantization) support for NPU (Neural Processing Unit) devices, enabling efficient quantized model inference on NPU hardware.
- New Quantization Methods: Added AWQLinearAscendMethod and AWQMoEAscendMethod classes to handle linear and Mixture-of-Experts (MoE) layers specifically for Ascend NPU, including NPU-optimized weight processing and application logic.
- NPU-Specific Dequantization: Implemented awq_dequantize_decomposition for NPU, providing a specialized method for dequantizing weights on NPU devices.
- Memory Management for NPU: Integrated torch.npu.empty_cache() calls in the model loading and utility functions to optimize memory usage on NPU devices.
- Benchmark Results: Provided benchmark results demonstrating the accuracy and throughput of AWQ quantization on NPU for a DeepSeek-AWQ model.
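For context, AWQ stores 4-bit weights together with per-group scales and zero points, and dequantization roughly computes w = (q - z) * s. A minimal pure-Python sketch of that formula (ignoring AWQ's interleaved int32 packing, which awq_dequantize_decomposition must also handle on NPU) might look like:

```python
def dequantize_awq_group(qvals, zero, scale):
    """Dequantize one group of unsigned 4-bit weights.

    qvals: ints in [0, 15]; zero: int zero point in [0, 15];
    scale: per-group float scale. Simplified sketch of (q - z) * s;
    the real kernel also unpacks the interleaved int32 AWQ layout.
    """
    return [(q - zero) * scale for q in qvals]

# Example: zero point 8, scale 0.5
print(dequantize_awq_group([8, 10, 6], 8, 0.5))  # [0.0, 1.0, -1.0]
```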
Code Review
This pull request introduces support for AWQ quantization on Ascend NPUs. The changes are mostly concentrated in python/sglang/srt/layers/quantization/awq.py, where new classes AWQLinearAscendMethod and AWQMoEAscendMethod are added to handle NPU-specific logic. While the overall approach is sound, I've identified a critical issue with an incorrect return type that would cause a runtime error, as well as some opportunities to improve code clarity and correctness in class initialization. My review includes suggestions to address these points.
```python
def apply(
    self,
    layer: torch.nn.Module,
    dispatch_output: StandardDispatchOutput,
) -> torch.Tensor:
    assert (
        self.moe_runner_config.activation == "silu"
    ), "Only SiLU activation is supported."

    x = dispatch_output.hidden_states
    topk_output = dispatch_output.topk_output

    topk_weights, topk_ids, _ = topk_output
    topk_ids = topk_ids.to(torch.int32)
    topk_weights = topk_weights.to(x.dtype)
    return npu_fused_experts(
        hidden_states=x,
        w13=layer.w13_qweight,
        w13_scale=layer.w13_scales,
        w13_offset=layer.w13_qzeros,
        w2=layer.w2_qweight,
        w2_scale=layer.w2_scales,
        w2_offset=layer.w2_qzeros,
        topk_weights=topk_weights,
        topk_ids=topk_ids,
        top_k=topk_ids.shape[1],
    )
```
The apply method in AWQMoEAscendMethod is declared to return a torch.Tensor, but its parent class AWQMoEMethod and the base class FusedMoEMethodBase specify a return type of CombineInput. The caller, FusedMoE.forward, expects an object with a hidden_states attribute, not a raw tensor. Returning a tensor directly will lead to a runtime AttributeError.
The return value should be wrapped in a StandardCombineInput object to conform to the expected interface, and the return type hint should be corrected.
```python
def apply(
    self,
    layer: torch.nn.Module,
    dispatch_output: StandardDispatchOutput,
) -> "CombineInput":
    from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput

    assert (
        self.moe_runner_config.activation == "silu"
    ), "Only SiLU activation is supported."
    x = dispatch_output.hidden_states
    topk_output = dispatch_output.topk_output
    topk_weights, topk_ids, _ = topk_output
    topk_ids = topk_ids.to(torch.int32)
    topk_weights = topk_weights.to(x.dtype)
    output = npu_fused_experts(
        hidden_states=x,
        w13=layer.w13_qweight,
        w13_scale=layer.w13_scales,
        w13_offset=layer.w13_qzeros,
        w2=layer.w2_qweight,
        w2_scale=layer.w2_scales,
        w2_offset=layer.w2_qzeros,
        topk_weights=topk_weights,
        topk_ids=topk_ids,
        top_k=topk_ids.shape[1],
    )
    return StandardCombineInput(hidden_states=output)
```

```python
def __init__(self, quant_config: AWQConfig):
    self.quant_config = quant_config
```
The __init__ method of AWQMoEAscendMethod does not initialize its parent class AWQMoEMethod. This can lead to an improperly initialized object, as attributes set in the parent's __init__ (like self.quant_type) will be missing. While AWQMoEAscendMethod is specific to Ascend and AWQMoEMethod is for Marlin, inheriting methods like create_weights implies a need for proper parent initialization.
Given that AWQMoEAscendMethod is instantiated with an AWQConfig and not an AWQMarlinConfig, a direct super().__init__() call would cause a type error. A better approach would be to replicate the necessary initialization logic from the parent.
```python
def __init__(self, quant_config: AWQConfig):
    self.quant_config = quant_config
    if self.quant_config.weight_bits != 4:
        raise ValueError(f"{type(self).__name__} only supports 4bit now.")
    self.quant_type = scalar_types.uint4
```
```python
qweight_tmp.bitwise_or_(
    ((layer.qweight.data >> shift_num) * (2 ** (4 * i))) & (0xF << (4 * i))
)
```
The bitwise operation used for repacking weights is functionally correct but unnecessarily complex and hard to read. Using (2 ** (4 * i)) for left-shifting and then masking can be simplified. A more direct and readable approach is to first mask the desired nibble with & 0xF and then shift it to its new position. This improves code clarity and maintainability.
```python
qweight_tmp.bitwise_or_(
    (((layer.qweight.data >> shift_num) & 0xF) << (4 * i))
)
```
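A quick sanity check, with arbitrary nonnegative Python integers standing in for the tensor elements, confirms the multiply-then-mask form and the suggested mask-then-shift form produce identical bits:

```python
def repack_old(q, shift_num, i):
    # Original: multiply emulates a left shift, then mask the nibble in place.
    return ((q >> shift_num) * (2 ** (4 * i))) & (0xF << (4 * i))

def repack_new(q, shift_num, i):
    # Suggested: mask the nibble first, then shift it to position i.
    return ((q >> shift_num) & 0xF) << (4 * i)

# Compare across a few packed words, nibble positions, and shifts.
for q in (0x12345678, 0x5EADBEEF, 0):
    for i in range(8):
        for shift_num in range(0, 32, 4):
            assert repack_old(q, shift_num, i) == repack_new(q, shift_num, i)
print("equivalent")
```

Note the equivalence holds here because Python integers do not overflow; with fixed-width int32 tensors the multiply form can overflow where the shift form does not, which is another reason to prefer the suggested version.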
```python
w13_qweight_tmp.bitwise_or_(
    ((layer.w13_qweight.data >> shift_num) * (2 ** (4 * i)))
    & (0xF << (4 * i))
)
w2_qweight_tmp.bitwise_or_(
    ((layer.w2_qweight.data >> shift_num) * (2 ** (4 * i)))
    & (0xF << (4 * i))
)
```
Similar to AWQLinearAscendMethod, the bitwise operation here for repacking weights is unnecessarily complex. Using * (2 ** (4 * i)) for left-shifting is less clear than using the left-shift operator << after masking the desired nibble. Simplifying this expression will improve code readability and maintainability.
```python
w13_qweight_tmp.bitwise_or_(
    (((layer.w13_qweight.data >> shift_num) & 0xF) << (4 * i))
)
w2_qweight_tmp.bitwise_or_(
    (((layer.w2_qweight.data >> shift_num) & 0xF) << (4 * i))
)
```
Why was radix-cache disabled during the test?
We use the same command as #9355.
Motivation
This PR follows #9104 and Roadmap of NPU support #8004.
Modifications
We mainly modified python/sglang/srt/layers/quantization/awq.py, adding AWQLinearAscendMethod and AWQMoEAscendMethod to support AWQ.

Accuracy and Benchmark Tests
Checklist