Add AWQ quantization support for NPU. #10158
Conversation
Summary of Changes
Hello @ErvinXie, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request extends the SGLang framework to support AWQ quantization on NPU devices, specifically targeting Ascend hardware. It introduces new NPU-specific quantization methods for linear and MoE layers, along with necessary adjustments to the weight loading and dequantization processes to leverage NPU capabilities. The changes aim to improve inference efficiency and performance on NPU-enabled systems.
Highlights
- NPU AWQ Quantization Support: Introduced dedicated AWQ (Activation-aware Weight Quantization) support for NPU (Neural Processing Unit) devices, enabling efficient quantized model inference on NPU hardware.
- New Quantization Methods: Added AWQLinearAscendMethod and AWQMoEAscendMethod classes to handle linear and Mixture-of-Experts (MoE) layers specifically for Ascend NPU, including NPU-optimized weight processing and application logic.
- NPU-Specific Dequantization: Implemented awq_dequantize_decomposition for NPU, providing a specialized method for dequantizing weights on NPU devices.
- Memory Management for NPU: Integrated torch.npu.empty_cache() calls in the model loading and utility functions to optimize memory usage on NPU devices.
- Benchmark Results: Provided benchmark results demonstrating the accuracy and throughput of AWQ quantization on NPU for a DeepSeek-AWQ model.
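For context, AWQ stores 4-bit weights together with per-group scales and zero points, and dequantization roughly computes w = (q - z) * s. A minimal pure-Python sketch of that formula (ignoring AWQ's interleaved int32 packing, which awq_dequantize_decomposition must also handle on NPU) might look like:

```python
def dequantize_awq_group(qvals, zero, scale):
    """Dequantize one group of unsigned 4-bit weights.

    qvals: ints in [0, 15]; zero: int zero point in [0, 15];
    scale: per-group float scale. Simplified sketch of (q - z) * s;
    the real kernel also unpacks the interleaved int32 AWQ layout.
    """
    return [(q - zero) * scale for q in qvals]

# Example: zero point 8, scale 0.5
print(dequantize_awq_group([8, 10, 6], 8, 0.5))  # [0.0, 1.0, -1.0]
```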
Code Review
This pull request introduces support for AWQ quantization on Ascend NPUs. The changes are mostly concentrated in python/sglang/srt/layers/quantization/awq.py, where new classes AWQLinearAscendMethod and AWQMoEAscendMethod are added to handle NPU-specific logic. While the overall approach is sound, I've identified a critical issue with an incorrect return type that would cause a runtime error, as well as some opportunities to improve code clarity and correctness in class initialization. My review includes suggestions to address these points.
```python
def apply(
    self,
    layer: torch.nn.Module,
    dispatch_output: StandardDispatchOutput,
) -> torch.Tensor:
    assert (
        self.moe_runner_config.activation == "silu"
    ), "Only SiLU activation is supported."

    x = dispatch_output.hidden_states
    topk_output = dispatch_output.topk_output

    topk_weights, topk_ids, _ = topk_output
    topk_ids = topk_ids.to(torch.int32)
    topk_weights = topk_weights.to(x.dtype)
    return npu_fused_experts(
        hidden_states=x,
        w13=layer.w13_qweight,
        w13_scale=layer.w13_scales,
        w13_offset=layer.w13_qzeros,
        w2=layer.w2_qweight,
        w2_scale=layer.w2_scales,
        w2_offset=layer.w2_qzeros,
        topk_weights=topk_weights,
        topk_ids=topk_ids,
        top_k=topk_ids.shape[1],
    )
```
The apply method in AWQMoEAscendMethod is declared to return a torch.Tensor, but its parent class AWQMoEMethod and the base class FusedMoEMethodBase specify a return type of CombineInput. The caller, FusedMoE.forward, expects an object with a hidden_states attribute, not a raw tensor. Returning a tensor directly will lead to a runtime AttributeError.
The return value should be wrapped in a StandardCombineInput object to conform to the expected interface, and the return type hint should be corrected.
```python
def apply(
    self,
    layer: torch.nn.Module,
    dispatch_output: StandardDispatchOutput,
) -> "CombineInput":
    from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput

    assert (
        self.moe_runner_config.activation == "silu"
    ), "Only SiLU activation is supported."
    x = dispatch_output.hidden_states
    topk_output = dispatch_output.topk_output
    topk_weights, topk_ids, _ = topk_output
    topk_ids = topk_ids.to(torch.int32)
    topk_weights = topk_weights.to(x.dtype)
    output = npu_fused_experts(
        hidden_states=x,
        w13=layer.w13_qweight,
        w13_scale=layer.w13_scales,
        w13_offset=layer.w13_qzeros,
        w2=layer.w2_qweight,
        w2_scale=layer.w2_scales,
        w2_offset=layer.w2_qzeros,
        topk_weights=topk_weights,
        topk_ids=topk_ids,
        top_k=topk_ids.shape[1],
    )
    return StandardCombineInput(hidden_states=output)
```

```python
def __init__(self, quant_config: AWQConfig):
    self.quant_config = quant_config
```
The __init__ method of AWQMoEAscendMethod does not initialize its parent class AWQMoEMethod. This can lead to an improperly initialized object, as attributes set in the parent's __init__ (like self.quant_type) will be missing. While AWQMoEAscendMethod is specific to Ascend and AWQMoEMethod is for Marlin, inheriting methods like create_weights implies a need for proper parent initialization.
Given that AWQMoEAscendMethod is instantiated with an AWQConfig and not an AWQMarlinConfig, a direct super().__init__() call would cause a type error. A better approach would be to replicate the necessary initialization logic from the parent.
```python
def __init__(self, quant_config: AWQConfig):
    self.quant_config = quant_config
    if self.quant_config.weight_bits != 4:
        raise ValueError(f"{type(self).__name__} only supports 4bit now.")
    self.quant_type = scalar_types.uint4
```
```python
qweight_tmp.bitwise_or_(
    ((layer.qweight.data >> shift_num) * (2 ** (4 * i))) & (0xF << (4 * i))
)
```
The bitwise operation used for repacking weights is functionally correct but unnecessarily complex and hard to read. Using (2 ** (4 * i)) for left-shifting and then masking can be simplified. A more direct and readable approach is to first mask the desired nibble with & 0xF and then shift it to its new position. This improves code clarity and maintainability.
```python
qweight_tmp.bitwise_or_(
    (((layer.qweight.data >> shift_num) & 0xF) << (4 * i))
)
```
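A quick sanity check, with arbitrary nonnegative Python integers standing in for the tensor elements, confirms the multiply-then-mask form and the suggested mask-then-shift form produce identical bits:

```python
def repack_old(q, shift_num, i):
    # Original: multiply emulates a left shift, then mask the nibble in place.
    return ((q >> shift_num) * (2 ** (4 * i))) & (0xF << (4 * i))

def repack_new(q, shift_num, i):
    # Suggested: mask the nibble first, then shift it to position i.
    return ((q >> shift_num) & 0xF) << (4 * i)

# Compare across a few packed words, nibble positions, and shifts.
for q in (0x12345678, 0x5EADBEEF, 0):
    for i in range(8):
        for shift_num in range(0, 32, 4):
            assert repack_old(q, shift_num, i) == repack_new(q, shift_num, i)
print("equivalent")
```

Note the equivalence holds here because Python integers do not overflow; with fixed-width int32 tensors the multiply form can overflow where the shift form does not, which is another reason to prefer the suggested version.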
```python
w13_qweight_tmp.bitwise_or_(
    ((layer.w13_qweight.data >> shift_num) * (2 ** (4 * i)))
    & (0xF << (4 * i))
)
w2_qweight_tmp.bitwise_or_(
    ((layer.w2_qweight.data >> shift_num) * (2 ** (4 * i)))
    & (0xF << (4 * i))
)
```
Similar to AWQLinearAscendMethod, the bitwise operation here for repacking weights is unnecessarily complex. Using * (2 ** (4 * i)) for left-shifting is less clear than using the left-shift operator << after masking the desired nibble. Simplifying this expression will improve code readability and maintainability.
```python
w13_qweight_tmp.bitwise_or_(
    (((layer.w13_qweight.data >> shift_num) & 0xF) << (4 * i))
)
w2_qweight_tmp.bitwise_or_(
    (((layer.w2_qweight.data >> shift_num) & 0xF) << (4 * i))
)
```
Why was radix-cache disabled during the test?
We use the same command as #9355.
Motivation
This PR follows #9104 and Roadmap of NPU support #8004.
Modifications
We mainly modified python/sglang/srt/layers/quantization/awq.py, adding AWQLinearAscendMethod and AWQMoEAscendMethod to support AWQ.

Accuracy and Benchmark Tests
Checklist