use group gemm nz #910

Closed
ttanzhiqiang wants to merge 2 commits into vllm-project:main from ttanzhiqiang:group_gemm_nz

@ttanzhiqiang
Contributor

What this PR does / why we need it?

Update the weight format to improve TPOT (time per output token) by about 3 ms.
Insert code to convert the weight format of specific layers. This requires torch_npu's group GEMM to be updated for TASK_QUEUE_ENABLE=1 and TASK_QUEUE_ENABLE=2.
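
A minimal sketch of the conversion step, for illustration only. The format id 29, the attribute names `w13_weight` and `w2_weight`, and the call `torch_npu.npu_format_cast_` come from this PR's diff. The helper name `cast_moe_weights_to_nz`, the constant name `ACL_FORMAT_FRACTAL_NZ` (chosen because the PR title says "nz"), and the injectable `format_cast` parameter are assumptions made here so the logic can be exercised without an NPU.

```python
# Hypothetical helper: only npu_format_cast_, w13_weight, w2_weight and the
# value 29 are taken from the PR; every other name is illustrative.
ACL_FORMAT_FRACTAL_NZ = 29  # format id passed to npu_format_cast_ in the diff


def cast_moe_weights_to_nz(layer, format_cast=None):
    """Cast the expert weights of `layer` to the NZ format in place."""
    if format_cast is None:
        # Imported lazily so this module still loads where torch_npu is absent.
        import torch_npu
        format_cast = torch_npu.npu_format_cast_
    for name in ("w13_weight", "w2_weight"):
        format_cast(getattr(layer, name), ACL_FORMAT_FRACTAL_NZ)
    return layer
```

Naming the constant keeps the magic number out of call sites, and passing `format_cast` explicitly lets a unit test substitute a recording stub for the NPU call.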

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: ttanzhiqiang <389825161@qq.com>
Collaborator

@wangxiyuan wangxiyuan left a comment

Not sure if this change conflicts with #907, but LGTM at least.

@wangxiyuan wangxiyuan added ready read for review and removed ready read for review labels May 22, 2025
layer.w2_weight.data = layer.w2_weight.data.transpose(
    1, 2).contiguous()
torch_npu.npu_format_cast_(layer.w13_weight, 29)
torch_npu.npu_format_cast_(layer.w2_weight, 29)
Collaborator

Thanks for the contribution. The format conversion for grouped_matmul may cause accuracy problems until a newer torch_npu is released. I think it is best to put this PR on hold for now.

@wangxiyuan wangxiyuan mentioned this pull request Jun 4, 2025
76 tasks
@ttanzhiqiang ttanzhiqiang mentioned this pull request Jun 6, 2025
@realliujiaxu
Collaborator

This cannot be merged until a new version of torch_npu is released for the community.

wangxiyuan pushed a commit that referenced this pull request Jun 11, 2025
### What this PR does / why we need it?
Best-performance configuration for DeepSeek-R1 on a single machine with 16 cards: attention (tp8/dp2) / MoE (etp).

Relies on:
vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
+ #910 + [Reduce _npu_flash_attention mask to 128x128 for memory savings]
#1100 [Reduce memory usage by splitting tokens in fused_experts]


---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
Signed-off-by: ttanzhiqiang <389825161@qq.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@codecov

codecov bot commented Jun 23, 2025

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.24%. Comparing base (c30ddb8) to head (5c8b2c0).
⚠️ Report is 2001 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vllm_ascend/quantization/w8a8_dynamic.py | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #910      +/-   ##
==========================================
- Coverage   27.39%   27.24%   -0.16%     
==========================================
  Files          56       56              
  Lines        6191     6222      +31     
==========================================
- Hits         1696     1695       -1     
- Misses       4495     4527      +32     

| Flag | Coverage Δ |
|---|---|
| unittests | 27.24% <0.00%> (-0.16%) ⬇️ |

shiyuan680 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request Jul 7, 2025
Signed-off-by: ttanzhiqiang <389825161@qq.com>
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
Signed-off-by: ttanzhiqiang <389825161@qq.com>
@wangxiyuan
Collaborator

I think this one can be closed now. Feel free to create a new one if it's still needed.

@wangxiyuan wangxiyuan closed this Nov 18, 2025
4 participants