[perf]: add NZ transformation for QuantMatmul and use dequant_swiglu_quant during decoding stage #907
linfeng-yuan wants to merge 1 commit into vllm-project:main
Conversation
…quant during decoding stage
Signed-off-by: linfeng-yuan <1102311262@qq.com>
w2_scale: weights2 scale with shape (num_experts, hidden_size)
group_list: number of tokens for each expert, following cumsum mode,
    with shape (num_experts,).
transpose_weight:
What's transpose_weight? Also, it looks like dynamic_scale is missing.
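The docstring above says `group_list` follows "cumsum mode". A minimal plain-Python sketch of what that means (the helper name is hypothetical, not part of the actual kernel API): entry i holds the running total of tokens assigned to experts 0..i, so the last entry equals the overall token count.

```python
def to_cumsum_group_list(tokens_per_expert):
    """Convert per-expert token counts into a cumsum-mode group_list.

    Hypothetical illustration of the "cumsum mode" convention: entry i
    is the total number of tokens handled by experts 0..i.
    """
    group_list = []
    running = 0
    for count in tokens_per_expert:
        running += count
        group_list.append(running)
    return group_list

# Example: 4 experts receiving 3, 0, 5, 2 tokens respectively.
print(to_cumsum_group_list([3, 0, 5, 2]))  # -> [3, 3, 8, 10]
```

Note that experts with zero tokens still get an entry (repeating the previous running total), which keeps the list length at num_experts.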
hidden_states, swiglu_out_scale = torch_npu.npu_dequant_swiglu_quant(
    x=hidden_states,
Using this operator with graph mode causes the process to freeze. The cause is currently unknown.
Could you please summarize the settings in which the process freezes? We tested the code with RC1 CANN & PTA and it works.
In a PD separation scenario (Decode node, TP2 DP16), the process sometimes gets stuck during compilation or execution when graph mode is enabled.
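For context on what the fused operator being discussed computes, here is a plain-Python reference sketch of a dequant → SwiGLU → quant step. This is an illustrative assumption about the math, not the torch_npu kernel's implementation or signature; the function name, per-tensor scales, and int8 output range are all hypothetical simplifications.

```python
import math

def dequant_swiglu_quant_ref(x_int32, in_scale, out_scale):
    """Hypothetical reference for a fused dequant -> SwiGLU -> quant step.

    x_int32 holds quantized gate/up projections concatenated along the
    last axis: first half is the gate, second half the up projection.
    """
    half = len(x_int32) // 2
    gate = [v * in_scale for v in x_int32[:half]]  # dequantize gate
    up = [v * in_scale for v in x_int32[half:]]    # dequantize up
    # SwiGLU: silu(gate) * up, where silu(a) = a * sigmoid(a)
    swiglu = [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
    # Re-quantize the activation to the int8 range
    return [max(-128, min(127, round(v / out_scale))) for v in swiglu]
```

Fusing these three steps avoids materializing the dequantized and activated intermediates in full precision, which is why the fused path is attractive on the latency-sensitive decode path.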
This pull request has conflicts; please resolve them before we can evaluate the pull request.
In #819, we downgraded dequant_swiglu_quant because it consumed excessive memory during the prefill stage. The root cause is that this operation requires hidden states with dtype=torch.int32. After evaluating the latency of DeepSeek models with NPU graph mode, we decided to use this operation during the decode stage, gaining higher inference speed at an acceptable memory increase.
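The decision above, using the fused operator only during decode, amounts to a stage-conditional dispatch. A minimal sketch follows; the function, the `is_prefill` flag, and the callables are hypothetical placeholders, not the actual vllm-ascend code paths.

```python
def apply_swiglu(hidden_states, scale, is_prefill, fused_op, eager_op):
    """Choose between a fused dequant_swiglu_quant path and an eager path.

    Hypothetical sketch: per the PR description, the fused kernel needs
    int32 hidden states, which costs too much memory in prefill, so it
    is only taken on the decode path.
    """
    if not is_prefill and fused_op is not None:
        # Decode: fused kernel trades a small memory increase for speed.
        return fused_op(hidden_states, scale)
    # Prefill (or no fused kernel available): eager fallback.
    return eager_op(hidden_states, scale)
```

Keeping the eager path as an unconditional fallback also leaves an escape hatch for environments where the fused kernel misbehaves, such as the graph-mode freeze reported in this thread.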