[Bugfix] [NPU] bugfixes for running deepseek w4a8 quantization #14542
iforgetmyname wants to merge 2 commits into sgl-project:main
Conversation
Summary of Changes

Hello @iforgetmyname, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request delivers bugfixes aimed at improving the stability and correctness of DeepSeek W4A8 quantization on NPU hardware. The changes refine the attention mechanism's core operations and logic, and correct how weights are accessed within the fused Mixture of Experts (MoE) method, which together ensures more accurate and efficient model execution.

Highlights
Code Review
This pull request introduces bugfixes for running DeepSeek models with W4A8 quantization on NPU hardware. The changes primarily involve replacing torch.bmm with a more efficient NPU-specific operator and correcting weight access in the MoE method. My review confirms that most changes are correct and beneficial. However, I've identified a critical issue in forward_mla_core_npu: a necessary reshape operation is missing after the batch_matmul_transpose call, which would lead to a shape mismatch for the subsequent projection layer. I've provided a code suggestion to address this.
| ) | ||
| torch.ops.npu.batch_matmul_transpose(attn_output, m.w_vc, attn_bmm_output) | ||
|
|
||
| output, _ = m.o_proj(attn_bmm_output) |
The attn_bmm_output tensor has a 3D shape of (num_tokens, num_local_heads, v_head_dim), but m.o_proj (a RowParallelLinear layer) expects a 2D input where the last dimension is m.num_local_heads * m.v_head_dim. You should reshape attn_bmm_output before passing it to m.o_proj. This is consistent with how it's handled in forward_dsa_core_npu and forward_mha_core_npu.
Suggested change:

-    output, _ = m.o_proj(attn_bmm_output)
+    attn_bmm_output = attn_bmm_output.reshape(-1, m.num_local_heads * m.v_head_dim)
+    output, _ = m.o_proj(attn_bmm_output)
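The shape mismatch above can be reproduced with plain PyTorch. This is a minimal sketch, not the actual sglang code: the dimension values are made up, torch.nn.Linear stands in for the RowParallelLinear o_proj layer, and the NPU-specific batch_matmul_transpose output is simulated with a randomly filled 3D tensor of the shape the review describes.

```python
import torch

# Hypothetical dimensions standing in for the DeepSeek MLA attention path;
# real values come from the model config.
num_tokens, num_local_heads, v_head_dim = 4, 8, 16
hidden_size = num_local_heads * v_head_dim

# attn_bmm_output as produced by the batched matmul: a 3D tensor of shape
# (num_tokens, num_local_heads, v_head_dim).
attn_bmm_output = torch.randn(num_tokens, num_local_heads, v_head_dim)

# o_proj (RowParallelLinear in the real code) expects a 2D input whose last
# dimension is num_local_heads * v_head_dim; a plain nn.Linear models that.
o_proj = torch.nn.Linear(hidden_size, hidden_size)

# Without the reshape, o_proj would see a last dimension of v_head_dim (16)
# instead of 128 and raise a shape error. The suggested fix flattens the
# head dimensions first:
attn_bmm_output = attn_bmm_output.reshape(-1, num_local_heads * v_head_dim)
output = o_proj(attn_bmm_output)
print(tuple(output.shape))  # (4, 128)
```

Feeding the 3D tensor directly would fail because Linear matmuls against the last dimension only; flattening heads into the hidden dimension matches what forward_dsa_core_npu and forward_mha_core_npu already do.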
Motivation
Fixed in #14806; this PR is closed.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist