
[perf]: add NZ transformation for QuantMatmul and use dequant_swiglu_…#907

Closed
linfeng-yuan wants to merge 1 commit into vllm-project:main from linfeng-yuan:perf_graph

Conversation

@linfeng-yuan
Collaborator

@linfeng-yuan linfeng-yuan commented May 20, 2025

In #819, we downgraded dequant_swiglu_quant because it was too memory-consuming during the prefill stage. The root cause is that this operation requires hidden states with dtype=torch.int32. After evaluating the latency of DeepSeek models with npu graph mode, we decided to use this operation during the decode stage, gaining higher inference speed at an acceptable memory increase.
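The stage-gated dispatch described above can be sketched roughly as follows. This is an illustrative sketch only, not the PR's actual code: the function name `swiglu_quant` and the `fused_op`/`fallback_op` parameters are hypothetical stand-ins for the fused `torch_npu.npu_dequant_swiglu_quant` kernel and the separate dequant/swiglu/quant path.

```python
def swiglu_quant(hidden_states, is_prefill, fused_op, fallback_op):
    """Dispatch between the fused decode-path op and the prefill fallback.

    Illustrative only: `fused_op` stands in for
    torch_npu.npu_dequant_swiglu_quant, `fallback_op` for the
    separate dequant -> swiglu -> quant sequence.
    """
    if is_prefill:
        # Prefill: skip the fused op; its int32 hidden states
        # would inflate memory usage for long prompt batches.
        return fallback_op(hidden_states)
    # Decode: the fused op trades a small memory increase for
    # higher inference speed, so it is enabled here.
    return fused_op(hidden_states)
```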

…quant during decoding stage

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Collaborator

@wangxiyuan wangxiyuan left a comment


LGTM

w2_scale: weights2 scale with shape (num_experts, hidden_size)
group_list: number of tokens for each expert, follow cumsum mode, and
with shape (num_experts).
transpose_weight:
Collaborator


What is transpose_weight? Also, it looks like dynamic_scale is missing.

@wangxiyuan wangxiyuan added the ready (read for review) label May 22, 2025
@wangxiyuan wangxiyuan mentioned this pull request May 22, 2025
@wangxiyuan wangxiyuan added ready (read for review) and removed ready (read for review) labels May 23, 2025
Comment on lines +82 to +83
hidden_states, swiglu_out_scale = torch_npu.npu_dequant_swiglu_quant(
x=hidden_states,
Collaborator


Using this operator with graph mode causes the process to freeze. The cause is currently unknown.

Collaborator Author


> Using this operator with graph mode causes the process to freeze. The cause is currently unknown.

Could you please summarize the settings in which the process gets frozen? We tested the code with RC1 CANN & PTA and it works.

Collaborator


In a PD-separation scenario (Decode node, TP2 DP16), the process sometimes gets stuck during compilation or execution when graph mode is enabled.

@github-actions github-actions bot added merge-conflicts and removed ready (read for review) labels Jun 3, 2025
@github-actions
Contributor

github-actions bot commented Jun 3, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.
