
[NPU] Support w4a8 with activation clip #14736

Merged
iforgetmyname merged 3 commits into sgl-project:main from zhuyijie88:w4a8_support_activation_use_clip
Dec 27, 2025

Conversation

@jiaming1130 (Contributor) commented Dec 9, 2025

Co-authored-by: @ZhengdQin

Motivation

This PR introduces an optimized W4A8 quantization implementation for MoE.

  • Weight Quantization (W4): Static Per-Channel Int4 quantization is applied to expert weights.

  • Activation Quantization (A8): Dynamic Per-Token Int8 quantization is applied to activations.
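
The two schemes can be sketched in a few lines (a minimal NumPy illustration of per-channel int4 and per-token int8 quantization, not the NPU kernel code; the function names are ours):

```python
import numpy as np

def quantize_w4_per_channel(w):
    """Static per-channel int4: one scale per output channel (row), fixed offline."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 range used: [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # int4 stored in int8
    return q, scale

def quantize_a8_per_token(x):
    """Dynamic per-token int8: one scale per token (row), computed at runtime."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale
```

Dequantization multiplies back by the per-row scales, so each row's largest value maps to the top of its integer range.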

Compared to W8A8, this implementation roughly halves the memory footprint of the expert weights while maintaining comparable model precision.
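
A quick back-of-the-envelope check of the ~2× claim (the dimensions below are illustrative, not the actual DeepSeek-V3.2-Exp shapes):

```python
# Illustrative MoE expert shapes; int4 packs two weights per byte.
num_experts, hidden, intermediate = 256, 7168, 2048

int8_bytes = num_experts * hidden * intermediate       # 1 byte per weight (W8)
int4_bytes = num_experts * hidden * intermediate // 2  # 0.5 byte per weight (W4)

print(f"int8 expert weights: {int8_bytes / 2**30:.2f} GiB")
print(f"int4 expert weights: {int4_bytes / 2**30:.2f} GiB")
```

The real saving is slightly below 2× once per-channel scales and biases are counted, but those are tiny next to the weights themselves.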

Compared to the existing W4A8 implementation in the repository, this version introduces a key enhancement: a learned clamp mechanism for determining quantization bounds. This method is specifically optimized for the DeepSeek-V3.2-Exp model and more effectively addresses the challenge of quantizing outlier values, resulting in improved quantization stability and accuracy.
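
A learned clamp caps the dynamic range before quantizing, so a few outliers do not inflate the scale and wash out the resolution of typical values. A minimal sketch (the fixed `clip` value stands in for the learned bound; this is not the PR's kernel code):

```python
import numpy as np

def quantize_int8_with_clip(x, clip):
    """Clamp to the learned bound, then quantize with a scale set by that bound."""
    scale = clip / 127.0
    q = np.clip(np.round(np.clip(x, -clip, clip) / scale), -128, 127)
    return q.astype(np.int8), scale

# One large outlier among small values:
x = np.array([0.01, -0.02, 0.03, 8.0], dtype=np.float32)

# Without clipping, the outlier sets the scale and the small values collapse to 0.
q_plain = np.round(x / (np.abs(x).max() / 127.0)).astype(np.int8)

# With a clip of 0.1, the small values keep many int8 levels; the outlier saturates.
q_clip, _ = quantize_int8_with_clip(x, clip=0.1)
```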

For comprehensive technical details and weight conversion scripts, see this external resource.

Modifications

  1. Added an activation_use_clip variable read from the model configuration to differentiate between the original w4a8 quantization method and our newly adapted quantization scheme.

  2. Adapted the weight-loading logic. Compared to the original quantization method, we introduced new model weight parameters (w13_bias, w2_bias, and w2_alpha) and streamlined the scale computation within the process_weights_after_loading function.
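
The flag-driven branching described in point 1 can be sketched as follows (the class and method bodies are illustrative, not the actual sglang code; only activation_use_clip and the parameter names come from this PR):

```python
class NPUW4A8MoEMethodSketch:
    """Toy stand-in showing how the config flag selects the parameter set."""

    def __init__(self, quant_config):
        # Fall back to the original w4a8 path when the field is absent.
        self.activation_use_clip = quant_config.get("activation_use_clip", False)

    def create_weights(self, layer):
        if self.activation_use_clip:
            # Clip-aware path: bias/alpha parameters instead of second-level scales.
            layer["params"] = ["w13_bias", "w2_scale_bias", "w2_bias", "w2_alpha"]
        else:
            # Original path: group-wise second-level scales and offsets.
            layer["params"] = [
                "w13_weight_scale_second", "w13_weight_offset_second",
                "w2_weight_scale_second", "w2_weight_offset_second",
                "w13_scale_bias", "w2_scale_bias",
            ]
```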

Accuracy Tests

[screenshot: accuracy test results attached in the original PR]

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @jiaming1130, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the NPU backend's support for W4A8 quantization by integrating an activation clipping feature. It introduces a configurable flag that dictates how quantization scales and biases are defined and processed, providing a more adaptive approach to optimizing Mixture-of-Experts (MoE) models on NPU devices. The changes ensure that the system can dynamically adjust its quantization methodology based on the presence of activation clipping, leading to potentially improved performance or accuracy trade-offs.

Highlights

  • Activation Clipping Support: Introduced an activation_use_clip flag to enable or disable activation clipping for NPU W4A8 quantization, allowing for more flexible quantization strategies.
  • Conditional Quantization Parameters: Modified the NPUW4A8Int4DynamicMoEMethod to conditionally create and process different sets of quantization parameters (scales, biases) based on whether activation clipping is enabled.
  • Weight Loader Enhancement: Updated the weight loading mechanism to correctly handle and load bias parameters alongside scales and offsets, ensuring proper initialization for the new clipping-aware quantization.

@gemini-code-assist (bot) left a review comment:
Code Review

This pull request adds support for w4a8 quantization with activation clipping for FusedMoE on NPU. The changes introduce a new parameter activation_use_clip to control this feature, which alters how weights and scales are created and processed. The implementation correctly passes this configuration from ModelSlimConfig down to the NPUW4A8Int4DynamicMoEMethod. Additionally, a good correctness fix was made in apply_without_routing_weights to use layer parameters instead of self.

My review has identified a critical bug in process_weights_after_loading where the order of weight transposition was changed, which could lead to incorrect behavior for the non-clipping path. I've also pointed out a maintainability issue due to code duplication in create_weights. Please see the detailed comments for suggestions.

Comment on lines +405 to +490
if self.activation_use_clip:
    w13_bias = torch.nn.Parameter(
        torch.ones(
            num_experts, 2 * intermediate_size_per_partition, dtype=torch.float
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_bias", w13_bias)
    set_weight_attrs(w13_bias, extra_weight_attrs)

    w2_scale_bias = torch.nn.Parameter(
        torch.empty(
            num_experts, hidden_size, 16 // self.tp_size, dtype=torch.float32
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_scale_bias", w2_scale_bias)
    set_weight_attrs(w2_scale_bias, extra_weight_attrs)
    w2_bias = torch.nn.Parameter(
        torch.ones(num_experts, hidden_size, dtype=torch.float),
        requires_grad=False,
    )
    layer.register_parameter("w2_bias", w2_bias)
    set_weight_attrs(w2_bias, extra_weight_attrs)
    w2_alpha = torch.nn.Parameter(
        torch.ones(num_experts, dtype=torch.float), requires_grad=False
    )
    layer.register_parameter("w2_alpha", w2_alpha)
    set_weight_attrs(w2_alpha, extra_weight_attrs)
else:
    w13_weight_scale_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            2 * intermediate_size_per_partition,
            hidden_size // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_weight_scale_second", w13_weight_scale_second)
    set_weight_attrs(w13_weight_scale_second, extra_weight_attrs)
    w13_weight_offset_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            2 * intermediate_size_per_partition,
            hidden_size // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_weight_offset_second", w13_weight_offset_second)
    set_weight_attrs(w13_weight_offset_second, extra_weight_attrs)

    w2_weight_scale_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            hidden_size,
            intermediate_size_per_partition // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_weight_scale_second", w2_weight_scale_second)
    set_weight_attrs(w2_weight_scale_second, extra_weight_attrs)

    w2_weight_offset_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            hidden_size,
            intermediate_size_per_partition // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_weight_offset_second", w2_weight_offset_second)
    set_weight_attrs(w2_weight_offset_second, extra_weight_attrs)

    w13_scale_bias = torch.nn.Parameter(
        torch.empty(
            num_experts, 2 * intermediate_size_per_partition, 1, dtype=torch.float32
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_scale_bias", w13_scale_bias)
    set_weight_attrs(w13_scale_bias, extra_weight_attrs)

    w2_scale_bias = torch.nn.Parameter(
        torch.empty(
            num_experts, hidden_size, 16 // self.tp_size, dtype=torch.float32
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_scale_bias", w2_scale_bias)
    set_weight_attrs(w2_scale_bias, extra_weight_attrs)
@gemini-code-assist (bot) commented (severity: medium):

This if/else block for creating special parameters for w4a8 introduces significant code duplication. The else branch contains a large block of code that is nearly identical to the original implementation before this change. This can make the code harder to maintain and read.

Consider refactoring this section to reduce duplication. For example, you could define separate helper methods for creating parameters for each case (activation_use_clip true or false). This is a suggestion for future improvement to enhance code maintainability.
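
Acting on that suggestion, the repeated create/register/set-attrs triple could collapse into one helper (a sketch; register_float_param and the inline set_weight_attrs stand-in are ours, not sglang APIs):

```python
import torch

def set_weight_attrs(param, attrs):
    # Minimal stand-in for sglang's set_weight_attrs, which attaches loader metadata.
    for key, value in attrs.items():
        setattr(param, key, value)

def register_float_param(layer, name, shape, extra_weight_attrs, fill="empty"):
    """Create a frozen float32 parameter, register it on the layer,
    and attach the shared weight-loader attributes."""
    init = torch.ones if fill == "ones" else torch.empty
    param = torch.nn.Parameter(init(*shape, dtype=torch.float32), requires_grad=False)
    layer.register_parameter(name, param)
    set_weight_attrs(param, extra_weight_attrs)
    return param
```

Each branch of create_weights then shrinks to a few one-line calls such as register_float_param(layer, "w2_bias", (num_experts, hidden_size), extra_weight_attrs, fill="ones").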

@ping1jing2 ping1jing2 self-assigned this Dec 9, 2025
@jiaming1130 force-pushed the w4a8_support_activation_use_clip branch from d19a844 to fd221af on December 12, 2025

  # Case weight scales and zero_points
- if "scale" in weight_name or "zero" in weight_name or "offset" in weight_name:
+ if "scale" in weight_name or "zero" in weight_name or "offset" in weight_name or "bias" in weight_name:
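
The growing or-chain above can also be written as a single membership test (a sketch of one possible cleanup, not the merged code):

```python
# Substrings that mark scale-like tensors in checkpoint weight names.
_SCALE_LIKE_KEYS = ("scale", "zero", "offset", "bias")

def is_scale_like(weight_name):
    # Note: a bare "bias" substring also matches ordinary bias tensors,
    # which is the over-broadness the reviewer flags.
    return any(key in weight_name for key in _SCALE_LIKE_KEYS)
```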
A collaborator commented:
modify this '"bias" in weight_name' condition

@jiaming1130 force-pushed the w4a8_support_activation_use_clip branch 8 times, most recently from 4ccbb23 to 7edbe27 on December 22, 2025
@iforgetmyname (Collaborator) commented:

/tag-and-rerun-ci

@jiaming1130 force-pushed the w4a8_support_activation_use_clip branch from edaafd4 to d26b4b6 on December 25, 2025
@jiaming1130 force-pushed the w4a8_support_activation_use_clip branch from d26b4b6 to 6f3f277 on December 25, 2025
@ZhengdQin removed the documentation, quant, amd, dependencies, lora, multi-modal, hicache, sgl-kernel, blackwell, piecewise-cuda-graph, and diffusion labels on Dec 26, 2025
Co-authored-by: ZhengdQin <zhengdqin@gmail.com>
@TamirBaydasov (Contributor) commented:

lgtm

@OrangeRedeng (Contributor) commented:

Do you have instructions for obtaining models? As I understand it, this link https://gitcode.com/cann/cann-recipes-infer/pull/13 shows W4A8C8 with a quantized KV cache, but that functionality is currently not supported in sglang.

@jiaming1130 (Contributor, Author) replied:

> Do you have instructions for obtaining models? As I understand it, this link https://gitcode.com/cann/cann-recipes-infer/pull/13 shows W4A8C8 with a quantized KV cache, but that functionality is currently not supported in sglang.

You can use the weight-converter script to convert the raw FP8 weights to dynamic W4A8 weights (compressed-tensor format). See https://gitcode.com/cann/cann-recipes-infer/blob/master/models/deepseek-v3.2-exp/utils/weight_convert.sh; the invocation is:

bash utils/weight_convert.sh --input_fp8_hf_path /data/models/DeepSeek-V3.2-Exp-Fp8 --output_hf_path /data/models/DeepSeek-V3.2-Exp-W4A8C8 --quant_mode w4a8c8

The quantization of KV cache has not been included in this PR yet.
