[NPU] Support w4a8 with activation clip #14736
iforgetmyname merged 3 commits into sgl-project:main
Conversation
Summary of Changes

Hello @jiaming1130, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the NPU backend's support for W4A8 quantization by integrating an activation clipping feature. It introduces a configurable flag that dictates how quantization scales and biases are defined and processed, providing a more adaptive approach to optimizing Mixture-of-Experts (MoE) models on NPU devices. The changes let the system adjust its quantization methodology based on the presence of activation clipping, enabling different performance/accuracy trade-offs.

Highlights
Code Review
This pull request adds support for w4a8 quantization with activation clipping for FusedMoE on NPU. The changes introduce a new parameter activation_use_clip to control this feature, which alters how weights and scales are created and processed. The implementation correctly passes this configuration from ModelSlimConfig down to the NPUW4A8Int4DynamicMoEMethod. Additionally, a good correctness fix was made in apply_without_routing_weights to use layer parameters instead of self.
My review has identified a critical bug in process_weights_after_loading where the order of weight transposition was changed, which could lead to incorrect behavior for the non-clipping path. I've also pointed out a maintainability issue due to code duplication in create_weights. Please see the detailed comments for suggestions.
```python
if self.activation_use_clip:
    w13_bias = torch.nn.Parameter(
        torch.ones(
            num_experts, 2 * intermediate_size_per_partition, dtype=torch.float
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_bias", w13_bias)
    set_weight_attrs(w13_bias, extra_weight_attrs)

    w2_scale_bias = torch.nn.Parameter(
        torch.empty(
            num_experts, hidden_size, 16 // self.tp_size, dtype=torch.float32
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_scale_bias", w2_scale_bias)
    set_weight_attrs(w2_scale_bias, extra_weight_attrs)
    w2_bias = torch.nn.Parameter(
        torch.ones(num_experts, hidden_size, dtype=torch.float),
        requires_grad=False,
    )
    layer.register_parameter("w2_bias", w2_bias)
    set_weight_attrs(w2_bias, extra_weight_attrs)
    w2_alpha = torch.nn.Parameter(
        torch.ones(num_experts, dtype=torch.float), requires_grad=False
    )
    layer.register_parameter("w2_alpha", w2_alpha)
    set_weight_attrs(w2_alpha, extra_weight_attrs)
else:
    w13_weight_scale_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            2 * intermediate_size_per_partition,
            hidden_size // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_weight_scale_second", w13_weight_scale_second)
    set_weight_attrs(w13_weight_scale_second, extra_weight_attrs)
    w13_weight_offset_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            2 * intermediate_size_per_partition,
            hidden_size // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_weight_offset_second", w13_weight_offset_second)
    set_weight_attrs(w13_weight_offset_second, extra_weight_attrs)

    w2_weight_scale_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            hidden_size,
            intermediate_size_per_partition // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_weight_scale_second", w2_weight_scale_second)
    set_weight_attrs(w2_weight_scale_second, extra_weight_attrs)

    w2_weight_offset_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            hidden_size,
            intermediate_size_per_partition // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_weight_offset_second", w2_weight_offset_second)
    set_weight_attrs(w2_weight_offset_second, extra_weight_attrs)

    w13_scale_bias = torch.nn.Parameter(
        torch.empty(
            num_experts, 2 * intermediate_size_per_partition, 1, dtype=torch.float32
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_scale_bias", w13_scale_bias)
    set_weight_attrs(w13_scale_bias, extra_weight_attrs)

    w2_scale_bias = torch.nn.Parameter(
        torch.empty(
            num_experts, hidden_size, 16 // self.tp_size, dtype=torch.float32
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_scale_bias", w2_scale_bias)
    set_weight_attrs(w2_scale_bias, extra_weight_attrs)
```
This if/else block for creating special parameters for w4a8 introduces significant code duplication. The else branch contains a large block of code that is nearly identical to the original implementation before this change. This can make the code harder to maintain and read.
Consider refactoring this section to reduce duplication. For example, you could define separate helper methods for creating parameters for each case (activation_use_clip true or false). This is a suggestion for future improvement to enhance code maintainability.
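One way to realize this refactor, sketched below with hypothetical names (`register_expert_param` and the minimal `set_weight_attrs` stand-in are illustrative, not from the PR), is to factor the repeated create/register/set-attrs pattern into a single helper:

```python
import torch


def set_weight_attrs(param, attrs):
    # Minimal stand-in for the real set_weight_attrs; illustrative only.
    for key, value in (attrs or {}).items():
        setattr(param, key, value)


def register_expert_param(layer, name, shape, extra_weight_attrs,
                          dtype=torch.float32, fill_ones=False):
    # Create a frozen parameter, register it on the layer, and attach
    # the extra weight-loading attributes in one place.
    data = torch.ones(*shape, dtype=dtype) if fill_ones else torch.empty(*shape, dtype=dtype)
    param = torch.nn.Parameter(data, requires_grad=False)
    layer.register_parameter(name, param)
    set_weight_attrs(param, extra_weight_attrs)
    return param
```

Each branch of the if/else would then collapse to a handful of calls, e.g. `register_expert_param(layer, "w2_bias", (num_experts, hidden_size), extra_weight_attrs, fill_ones=True)`.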
```diff
  # Case weight scales and zero_points
- if "scale" in weight_name or "zero" in weight_name or "offset" in weight_name:
+ if "scale" in weight_name or "zero" in weight_name or "offset" in weight_name or "bias" in weight_name:
```
Please modify this `"bias" in weight_name` condition.
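The concern can be seen with a small sketch of the broadened match (the function name is illustrative, not from the PR): a bare substring test on `"bias"` also matches any unrelated parameter whose name merely contains that token.

```python
def is_scale_or_bias(weight_name: str) -> bool:
    # Substring match as written in the diff: scale/zero/offset plus "bias".
    tokens = ("scale", "zero", "offset", "bias")
    return any(tok in weight_name for tok in tokens)


# The new "bias" token catches the intended parameters...
print(is_scale_or_bias("w13_bias"))        # True
print(is_scale_or_bias("w2_scale_bias"))   # True
# ...but it would equally match any other name containing "bias",
# which is likely why the reviewer flagged the condition.
print(is_scale_or_bias("w13_weight"))      # False
```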
/tag-and-rerun-ci
Co-authored-by: ZhengdQin <zhengdqin@gmail.com>
lgtm
Do you have instructions for obtaining the models? As I understand it, this link https://gitcode.com/cann/cann-recipes-infer/pull/13 shows W4A8C8 with a quantized KV cache, but that functionality is currently not supported in sglang.
You can use the weight converter script to convert the raw FP8 weights to dynamic W4A8 weights (compressed tensor). Refer to https://gitcode.com/cann/cann-recipes-infer/blob/master/models/deepseek-v3.2-exp/utils/weight_convert.sh; the command is: `bash utils/weight_convert.sh --input_fp8_hf_path /data/models/DeepSeek-V3.2-Exp-Fp8 --output_hf_path /data/models/DeepSeek-V3.2-Exp-W4A8C8 --quant_mode w4a8c8`. The quantization of the KV cache has not been included in this PR yet.
Co-authored-by: @ZhengdQin
Motivation
This PR introduces an optimized W4A8 quantization implementation for MoE.
Weight Quantization (W4): Static Per-Channel Int4 quantization is applied to expert weights.
Activation Quantization (A8): Dynamic Per-Token Int8 quantization is applied to activations.
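The two quantizers can be sketched as follows. This is a simplified illustration, not the PR's NPU kernels; symmetric scales and the int4 range [-8, 7] are assumptions.

```python
import torch


def int4_per_channel_quantize(w: torch.Tensor):
    # W4: static per-channel symmetric quantization of an (out, in) weight.
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0    # one scale per output channel
    q = torch.round(w / scale).clamp(-8, 7).to(torch.int8)
    return q, scale


def int8_per_token_quantize(x: torch.Tensor):
    # A8: dynamic per-token symmetric quantization of a (tokens, hidden) activation,
    # with the scale computed at runtime from each token's row maximum.
    scale = x.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.round(x / scale).clamp(-128, 127).to(torch.int8)
    return q, scale
```

Dequantization then approximates `w ≈ q * scale`; in a real kernel the matmul runs in low precision and the per-channel and per-token scales are folded back into the output.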
Compared to W8A8, this implementation reduces the memory footprint of expert weights by approximately 2×, while maintaining comparable model precision.
Compared to the existing W4A8 implementation in the repository, this version introduces a key enhancement: a learned clamp mechanism for determining quantization bounds. This method is specifically optimized for the DeepSeek-V3.2-Exp model and more effectively addresses the challenge of quantizing outlier values, resulting in improved quantization stability and accuracy.
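The clipping idea can be illustrated as below. This is a hypothetical sketch: a fixed `clip_ratio` stands in for the learned bound, which in the actual method is calibrated rather than hard-coded.

```python
import torch


def clipped_int4_per_channel_quantize(w: torch.Tensor, clip_ratio: float = 0.9):
    # Clamp each output channel to a bound below the raw |max| before
    # quantizing: a few outliers saturate, but the int4 grid becomes
    # finer for the bulk of the weight distribution.
    bound = clip_ratio * w.abs().amax(dim=1, keepdim=True)
    w_clipped = w.clamp(-bound, bound)
    scale = bound / 7.0
    q = torch.round(w_clipped / scale).clamp(-8, 7).to(torch.int8)
    return q, scale
```

With `clip_ratio=1.0` this degenerates to plain min-max quantization; choosing (or learning) a tighter bound trades saturation of rare outliers for lower rounding error everywhere else.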
For comprehensive technical details and weight conversion scripts, see this external resource.
Modifications
Added an `activation_use_clip` variable, read from the model configuration, to differentiate between the original w4a8 quantization method and our newly adapted quantization scheme.
Adapted the weight loading logic. Compared to the original quantization method, we introduced new model weight parameters `w3_bias`, `w12_bias`, and `w2_alpha`, and streamlined the scale computation within the `process_weights_after_loading` function.

Accuracy Tests
Benchmarking and Profiling
Checklist