[Quantization][Feature] Add AWQ quantization in vllm-ascend #4316

menogrey wants to merge 15 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for AWQ quantization in vllm-ascend. The changes are well-structured, adding the necessary configurations and Ascend-specific implementations for AWQ, including linear and MoE layers. A key improvement is the added robustness in AscendRMSNorm to handle different quantization configurations without crashing. My review has identified a couple of redundant function calls within the new npu_fused_experts function that should be removed to improve performance.
```diff
  # gmm1: gate_up_proj
- hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
+ if not use_wna16:
+     hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
  hidden_states = torch_npu.npu_swiglu(hidden_states)
- hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
+ if not use_wna16:
+     hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
```
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from d1be882 to 9fe852c
@paulyu12 this PR implements AWQ quantization and is now under testing; just pinging you to take a look.

Validation for this change is tracked in issue #4378.
Force-pushed from 4e14b12 to 4faf918
Force-pushed from 44fdc96 to 5953663
Please rebase to main now.
Force-pushed from 7251f58 to ce953d2
```python
def __init__(self, quant_config: AWQQuantConfig):
    self.quant_config = quant_config

def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
```
This is a solid solution for unifying weight formats.
By preprocessing the weights during the loading phase, you effectively resolve the packing layout discrepancy between GPU and NPU, which is critical for compatibility.
Optimization Suggestions:
However, for larger models, the current for loop might introduce noticeable latency due to repeated kernel launches. Additionally, bitwise shift operations (>>, <<) can be less efficient than arithmetic operations on NPU hardware.
I suggest refactoring this implementation to vectorize the logic (eliminating the loop) and replacing the bitwise shifts with equivalent multiplication and integer division. This should significantly improve loading performance.
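To make the suggestion concrete, here is a pure-Python sketch assuming eight unsigned 4-bit values packed into one 32-bit word (the real AWQ GPU packing additionally interleaves the nibble order, which is omitted here for clarity):

```python
# Pure-Python sketch of int4 weight unpacking. Assumes eight unsigned
# 4-bit values per 32-bit word; real AWQ packing also interleaves the
# nibble order, omitted here.

def unpack_bitwise(word, i):
    """Extract nibble i using shifts, as in the current loop."""
    return (word >> (4 * i)) & 0xF

def unpack_arith(word, i):
    """Arithmetic equivalent: for non-negative x, x >> n == x // 2**n
    and x & 0xF == x % 16, as suggested in the review."""
    return (word // 16 ** i) % 16

nibbles = [3, 7, 1, 15, 0, 9, 4, 2]
packed = 0
for i, v in enumerate(nibbles):  # pack eight nibbles into one word
    packed |= v << (4 * i)

assert [unpack_bitwise(packed, i) for i in range(8)] == nibbles
assert all(unpack_bitwise(packed, i) == unpack_arith(packed, i) for i in range(8))
```

The two extraction forms are exactly equivalent for non-negative inputs, so the choice between them is purely a question of which the NPU executes faster.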
Thanks for your suggestions. I tried to remove the for loop, but an OOM occurred. The fact is that executing the unpack operation requires allocating a large tensor (8x the memory of the packed tensor), plus more temporary variables if you avoid in-place operations (like the bitwise ones). It is a trade-off between memory usage and execution efficiency, but for quantization, memory usage matters more, so I won't change it unless you have a better suggestion. @SlightwindSec
Fair point. I agree that avoiding OOM is the top priority here.
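One possible middle ground for this trade-off is chunked unpacking: process the packed tensor in fixed-size blocks so the temporary buffer stays bounded instead of materializing the full 8x-larger unpacked tensor at once. The sketch below is pure Python with illustrative names, not the PR's actual code.

```python
# Pure-Python sketch of chunked unpacking: bound peak memory by
# unpacking fixed-size blocks of packed 32-bit words instead of the
# whole tensor at once. All names are illustrative.

def unpack_words_chunked(words, chunk=1024):
    """Yield unpacked 4-bit values block by block, capping peak memory."""
    for start in range(0, len(words), chunk):
        out = []
        for w in words[start:start + chunk]:  # small temporary per block
            out.extend((w >> (4 * i)) & 0xF for i in range(8))
        yield out

chunks = list(unpack_words_chunked([0x12345678] * 10, chunk=4))
# 10 words × 8 nibbles each → 80 values across 3 chunks
```

Peak extra memory is proportional to `chunk`, not to the full tensor, at the cost of some kernel-launch overhead per block.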
Force-pushed from fa8b76b to f8e7ea5
Signed-off-by: menogrey <1299267905@qq.com>
DeepSeek-V3.1-AWQ. Signed-off-by: menogrey <1299267905@qq.com>
Force-pushed from ccd956d to 17067ab
Any progress on this?

@ZhongsJie I will resume work on this PR following the completion of the quantization module refactoring.
@menogrey I'm reworking this part in #7672; if you have time, could you take a look?
That's great! In addition, there are two reasons this work is pending:
What this PR does / why we need it?
Add AWQ quantization in vllm-ascend. Most of the code refers to the existing implementation, and the new quantization adaptation follows the compressed-tensors PR: #4036.
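For readers unfamiliar with AWQ: it stores 4-bit weights together with per-group scales and zero points, and dequantization is roughly `w[i] = (q[i] - zero[g]) * scale[g]` with `g = i // group_size`. A minimal sketch follows; the function name, field layout, and group size are assumptions for illustration, not vllm-ascend's actual code.

```python
# Illustrative sketch of AWQ-style group-wise dequantization; names and
# group size are assumptions, not vllm-ascend's actual code.
#   w[i] = (q[i] - zero[g]) * scale[g], where g = i // group_size

def dequant_group(q, scales, zeros, group_size):
    return [(qi - zeros[i // group_size]) * scales[i // group_size]
            for i, qi in enumerate(q)]

w = dequant_group(q=[8, 10, 7, 9], scales=[0.5, 0.25], zeros=[8, 8],
                  group_size=2)
# → [0.0, 1.0, -0.25, 0.25]
```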
Known issue:
Does this PR introduce any user-facing change?
How was this patch tested?