[perf]: add NZ transformation for QuantMatmul and use dequant_swiglu_quant during decoding stage #907
linfeng-yuan wants to merge 1 commit into vllm-project:main
Conversation
…quant during decoding stage
Signed-off-by: linfeng-yuan <1102311262@qq.com>
w2_scale: weights2 scale with shape (num_experts, hidden_size)
group_list: number of tokens for each expert, following cumsum mode,
    with shape (num_experts,).
transpose_weight:
What's transpose_weight? Also, it looks like dynamic_scale is missing.
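The docstring above says `group_list` follows "cumsum mode". A minimal plain-Python sketch of what that means (the helper name is hypothetical, not part of the actual kernel API): entry i holds the running total of tokens assigned to experts 0..i, so the last entry equals the overall token count.

```python
def to_cumsum_group_list(tokens_per_expert):
    """Convert per-expert token counts into a cumsum-mode group_list.

    Hypothetical illustration of the "cumsum mode" convention: entry i
    is the total number of tokens handled by experts 0..i.
    """
    group_list = []
    running = 0
    for count in tokens_per_expert:
        running += count
        group_list.append(running)
    return group_list

# Example: 4 experts receiving 3, 0, 5, 2 tokens respectively.
print(to_cumsum_group_list([3, 0, 5, 2]))  # -> [3, 3, 8, 10]
```

Note that experts with zero tokens still get an entry (repeating the previous running total), which keeps the list length at num_experts.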
hidden_states, swiglu_out_scale = torch_npu.npu_dequant_swiglu_quant(
    x=hidden_states,
Using this operator with graph mode causes the process to freeze. The cause is currently unknown.
Could you please summarize the settings in which the process freezes? We tested the code with RC1 CANN & PTA and it works.
In a PD separation scenario (Decode node, TP2 DP16), the process sometimes gets stuck during compilation or execution when graph mode is enabled.
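For context on what the fused operator being discussed computes, here is a plain-Python reference sketch of a dequant → SwiGLU → quant step. This is an illustrative assumption about the math, not the torch_npu kernel's implementation or signature; the function name, per-tensor scales, and int8 output range are all hypothetical simplifications.

```python
import math

def dequant_swiglu_quant_ref(x_int32, in_scale, out_scale):
    """Hypothetical reference for a fused dequant -> SwiGLU -> quant step.

    x_int32 holds quantized gate/up projections concatenated along the
    last axis: first half is the gate, second half the up projection.
    """
    half = len(x_int32) // 2
    gate = [v * in_scale for v in x_int32[:half]]  # dequantize gate
    up = [v * in_scale for v in x_int32[half:]]    # dequantize up
    # SwiGLU: silu(gate) * up, where silu(a) = a * sigmoid(a)
    swiglu = [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
    # Re-quantize the activation to the int8 range
    return [max(-128, min(127, round(v / out_scale))) for v in swiglu]
```

Fusing these three steps avoids materializing the dequantized and activated intermediates in full precision, which is why the fused path is attractive on the latency-sensitive decode path.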
This pull request has conflicts; please resolve them before we can evaluate the pull request.
In #819, we downgraded dequant_swiglu_quant because it consumed excessive memory during the prefill stage. The root cause is that this operation requires hidden states with dtype=torch.int32. After evaluating the latency of DeepSeek models with NPU graph mode, we decided to use this operation during the decode stage, gaining higher inference speed at an acceptable memory increase.
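The decision above, using the fused operator only during decode, amounts to a stage-conditional dispatch. A minimal sketch follows; the function, the `is_prefill` flag, and the callables are hypothetical placeholders, not the actual vllm-ascend code paths.

```python
def apply_swiglu(hidden_states, scale, is_prefill, fused_op, eager_op):
    """Choose between a fused dequant_swiglu_quant path and an eager path.

    Hypothetical sketch: per the PR description, the fused kernel needs
    int32 hidden states, which costs too much memory in prefill, so it
    is only taken on the decode path.
    """
    if not is_prefill and fused_op is not None:
        # Decode: fused kernel trades a small memory increase for speed.
        return fused_op(hidden_states, scale)
    # Prefill (or no fused kernel available): eager fallback.
    return eager_op(hidden_states, scale)
```

Keeping the eager path as an unconditional fallback also leaves an escape hatch for environments where the fused kernel misbehaves, such as the graph-mode freeze reported in this thread.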