[NPU] Support w4a8 with activation clip #14736
iforgetmyname merged 3 commits into sgl-project:main
Conversation
Summary of Changes

Hello @jiaming1130, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the NPU backend's support for W4A8 quantization by integrating an activation clipping feature. It introduces a configurable flag that dictates how quantization scales and biases are defined and processed, providing a more adaptive approach to optimizing Mixture-of-Experts (MoE) models on NPU devices. The changes let the system adjust its quantization methodology based on the presence of activation clipping, enabling different performance/accuracy trade-offs.

Highlights
Code Review
This pull request adds support for w4a8 quantization with activation clipping for FusedMoE on NPU. The changes introduce a new parameter activation_use_clip to control this feature, which alters how weights and scales are created and processed. The implementation correctly passes this configuration from ModelSlimConfig down to the NPUW4A8Int4DynamicMoEMethod. Additionally, a good correctness fix was made in apply_without_routing_weights to use layer parameters instead of self.
My review has identified a critical bug in process_weights_after_loading where the order of weight transposition was changed, which could lead to incorrect behavior for the non-clipping path. I've also pointed out a maintainability issue due to code duplication in create_weights. Please see the detailed comments for suggestions.
```python
if self.activation_use_clip:
    w13_bias = torch.nn.Parameter(
        torch.ones(
            num_experts, 2 * intermediate_size_per_partition, dtype=torch.float
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_bias", w13_bias)
    set_weight_attrs(w13_bias, extra_weight_attrs)

    w2_scale_bias = torch.nn.Parameter(
        torch.empty(
            num_experts, hidden_size, 16 // self.tp_size, dtype=torch.float32
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_scale_bias", w2_scale_bias)
    set_weight_attrs(w2_scale_bias, extra_weight_attrs)
    w2_bias = torch.nn.Parameter(
        torch.ones(num_experts, hidden_size, dtype=torch.float),
        requires_grad=False,
    )
    layer.register_parameter("w2_bias", w2_bias)
    set_weight_attrs(w2_bias, extra_weight_attrs)
    w2_alpha = torch.nn.Parameter(
        torch.ones(num_experts, dtype=torch.float), requires_grad=False
    )
    layer.register_parameter("w2_alpha", w2_alpha)
    set_weight_attrs(w2_alpha, extra_weight_attrs)
else:
    w13_weight_scale_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            2 * intermediate_size_per_partition,
            hidden_size // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_weight_scale_second", w13_weight_scale_second)
    set_weight_attrs(w13_weight_scale_second, extra_weight_attrs)
    w13_weight_offset_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            2 * intermediate_size_per_partition,
            hidden_size // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_weight_offset_second", w13_weight_offset_second)
    set_weight_attrs(w13_weight_offset_second, extra_weight_attrs)

    w2_weight_scale_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            hidden_size,
            intermediate_size_per_partition // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_weight_scale_second", w2_weight_scale_second)
    set_weight_attrs(w2_weight_scale_second, extra_weight_attrs)

    w2_weight_offset_second = torch.nn.Parameter(
        torch.empty(
            num_experts,
            hidden_size,
            intermediate_size_per_partition // self.group_size,
            dtype=torch.float32,
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_weight_offset_second", w2_weight_offset_second)
    set_weight_attrs(w2_weight_offset_second, extra_weight_attrs)

    w13_scale_bias = torch.nn.Parameter(
        torch.empty(
            num_experts, 2 * intermediate_size_per_partition, 1, dtype=torch.float32
        ),
        requires_grad=False,
    )
    layer.register_parameter("w13_scale_bias", w13_scale_bias)
    set_weight_attrs(w13_scale_bias, extra_weight_attrs)

    w2_scale_bias = torch.nn.Parameter(
        torch.empty(
            num_experts, hidden_size, 16 // self.tp_size, dtype=torch.float32
        ),
        requires_grad=False,
    )
    layer.register_parameter("w2_scale_bias", w2_scale_bias)
    set_weight_attrs(w2_scale_bias, extra_weight_attrs)
```
This if/else block for creating special parameters for w4a8 introduces significant code duplication. The else branch contains a large block of code that is nearly identical to the original implementation before this change. This can make the code harder to maintain and read.
Consider refactoring this section to reduce duplication. For example, you could define separate helper methods for creating parameters for each case (activation_use_clip true or false). This is a suggestion for future improvement to enhance code maintainability.
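One way to realize this refactor, sketched below with hypothetical names (`register_expert_param` and the minimal `set_weight_attrs` stand-in are illustrative, not from the PR), is to factor the repeated create/register/set-attrs pattern into a single helper:

```python
import torch


def set_weight_attrs(param, attrs):
    # Minimal stand-in for the real set_weight_attrs; illustrative only.
    for key, value in (attrs or {}).items():
        setattr(param, key, value)


def register_expert_param(layer, name, shape, extra_weight_attrs,
                          dtype=torch.float32, fill_ones=False):
    # Create a frozen parameter, register it on the layer, and attach
    # the extra weight-loading attributes in one place.
    data = torch.ones(*shape, dtype=dtype) if fill_ones else torch.empty(*shape, dtype=dtype)
    param = torch.nn.Parameter(data, requires_grad=False)
    layer.register_parameter(name, param)
    set_weight_attrs(param, extra_weight_attrs)
    return param
```

Each branch of the if/else would then collapse to a handful of calls, e.g. `register_expert_param(layer, "w2_bias", (num_experts, hidden_size), extra_weight_attrs, fill_ones=True)`.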
```diff
  # Case weight scales and zero_points
- if "scale" in weight_name or "zero" in weight_name or "offset" in weight_name:
+ if "scale" in weight_name or "zero" in weight_name or "offset" in weight_name or "bias" in weight_name:
```
Please modify this `"bias" in weight_name` condition.
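The concern can be seen with a small sketch of the broadened match (the function name is illustrative, not from the PR): a bare substring test on `"bias"` also matches any unrelated parameter whose name merely contains that token.

```python
def is_scale_or_bias(weight_name: str) -> bool:
    # Substring match as written in the diff: scale/zero/offset plus "bias".
    tokens = ("scale", "zero", "offset", "bias")
    return any(tok in weight_name for tok in tokens)


# The new "bias" token catches the intended parameters...
print(is_scale_or_bias("w13_bias"))        # True
print(is_scale_or_bias("w2_scale_bias"))   # True
# ...but it would equally match any other name containing "bias",
# which is likely why the reviewer flagged the condition.
print(is_scale_or_bias("w13_weight"))      # False
```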
/tag-and-rerun-ci
Co-authored-by: ZhengdQin <zhengdqin@gmail.com>
lgtm
Do you have instructions for obtaining the models? As I understand it, this link https://gitcode.com/cann/cann-recipes-infer/pull/13 shows W4A8C8 with a quantized KV cache, but that functionality is currently not supported in sglang.
You can use the weight converter script to convert the raw FP8 weights to dynamic W4A8 weights (compressed tensor). Refer to https://gitcode.com/cann/cann-recipes-infer/blob/master/models/deepseek-v3.2-exp/utils/weight_convert.sh; the command is: `bash utils/weight_convert.sh --input_fp8_hf_path /data/models/DeepSeek-V3.2-Exp-Fp8 --output_hf_path /data/models/DeepSeek-V3.2-Exp-W4A8C8 --quant_mode w4a8c8`. The quantization of the KV cache has not been included in this PR yet.
Co-authored-by: @ZhengdQin
Motivation
This PR introduces an optimized W4A8 quantization implementation for MoE.
Weight Quantization (W4): Static Per-Channel Int4 quantization is applied to expert weights.
Activation Quantization (A8): Dynamic Per-Token Int8 quantization is applied to activations.
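The two quantizers can be sketched as follows. This is a simplified illustration, not the PR's NPU kernels; symmetric scales and the int4 range [-8, 7] are assumptions.

```python
import torch


def int4_per_channel_quantize(w: torch.Tensor):
    # W4: static per-channel symmetric quantization of an (out, in) weight.
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0    # one scale per output channel
    q = torch.round(w / scale).clamp(-8, 7).to(torch.int8)
    return q, scale


def int8_per_token_quantize(x: torch.Tensor):
    # A8: dynamic per-token symmetric quantization of a (tokens, hidden) activation,
    # with the scale computed at runtime from each token's row maximum.
    scale = x.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.round(x / scale).clamp(-128, 127).to(torch.int8)
    return q, scale
```

Dequantization then approximates `w ≈ q * scale`; in a real kernel the matmul runs in low precision and the per-channel and per-token scales are folded back into the output.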
Compared to W8A8, this implementation reduces the memory footprint of expert weights by approximately 2×, while maintaining comparable model precision.
Compared to the existing W4A8 implementation in the repository, this version introduces a key enhancement: a learned clamp mechanism for determining quantization bounds. This method is specifically optimized for the DeepSeek-V3.2-Exp model and more effectively addresses the challenge of quantizing outlier values, resulting in improved quantization stability and accuracy.
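The clipping idea can be illustrated as below. This is a hypothetical sketch: a fixed `clip_ratio` stands in for the learned bound, which in the actual method is calibrated rather than hard-coded.

```python
import torch


def clipped_int4_per_channel_quantize(w: torch.Tensor, clip_ratio: float = 0.9):
    # Clamp each output channel to a bound below the raw |max| before
    # quantizing: a few outliers saturate, but the int4 grid becomes
    # finer for the bulk of the weight distribution.
    bound = clip_ratio * w.abs().amax(dim=1, keepdim=True)
    w_clipped = w.clamp(-bound, bound)
    scale = bound / 7.0
    q = torch.round(w_clipped / scale).clamp(-8, 7).to(torch.int8)
    return q, scale
```

With `clip_ratio=1.0` this degenerates to plain min-max quantization; choosing (or learning) a tighter bound trades saturation of rare outliers for lower rounding error everywhere else.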
For comprehensive technical details and weight conversion scripts, see this external resource.
Modifications
Added an `activation_use_clip` variable, read from the model configuration, to differentiate between the original w4a8 quantization method and our newly adapted quantization scheme.
Adapted the weight loading logic. Compared to the original quantization method, we introduced new model weight parameters `w3_bias`, `w12_bias`, and `w2_alpha`, and streamlined the scale computation within the `process_weights_after_loading` function.

Accuracy Tests
Benchmarking and Profiling
Checklist